From f9f08c4b698830b6abcb140d42da98ca7084f082 Mon Sep 17 00:00:00 2001 From: Gengliang Wang Date: Sat, 20 May 2023 22:06:24 -0700 Subject: [PATCH] Add documentation for Databricks integration (#5013) # Add documentation for Databricks integration This is a follow-up to https://github.com/hwchase17/langchain/pull/4702 It documents how to integrate Databricks using LangChain and provides examples in a notebook. ## Who can review? @dev2049 @hwchase17 since you are aware of the context. We will promote the integration after this doc is ready. Thanks in advance! --- docs/integrations/databricks.ipynb | 280 ++++++++++++++++++++++ docs/modules/chains/examples/sqlite.ipynb | 2 +- 2 files changed, 281 insertions(+), 1 deletion(-) create mode 100644 docs/integrations/databricks.ipynb diff --git a/docs/integrations/databricks.ipynb b/docs/integrations/databricks.ipynb new file mode 100644 index 0000000000..4cce07f747 --- /dev/null +++ b/docs/integrations/databricks.ipynb @@ -0,0 +1,280 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Databricks\n", + "\n", + "This notebook covers how to connect to [Databricks runtimes](https://docs.databricks.com/runtime/index.html) and [Databricks SQL](https://www.databricks.com/product/databricks-sql) using LangChain's SQLDatabase wrapper.\n", + "It is broken into three parts: installation and setup, connecting to Databricks, and examples." + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "## Installation and Setup" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": 1, + "outputs": [], + "source": [ + "!pip install databricks-sql-connector" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "## Connecting to Databricks\n", + "\n", + "You can connect to [Databricks runtimes](https://docs.databricks.com/runtime/index.html) and [Databricks SQL](https://www.databricks.com/product/databricks-sql) using the `SQLDatabase.from_databricks()` method.\n", + "\n", + "### Syntax\n", + "```python\n", + "SQLDatabase.from_databricks(\n", + "    catalog: str,\n", + "    schema: str,\n", + "    host: Optional[str] = None,\n", + "    api_token: Optional[str] = None,\n", + "    warehouse_id: Optional[str] = None,\n", + "    cluster_id: Optional[str] = None,\n", + "    engine_args: Optional[dict] = None,\n", + "    **kwargs: Any)\n", + "```\n", + "### Required Parameters\n", + "* `catalog`: The catalog name in the Databricks database.\n", + "* `schema`: The schema name in the catalog.\n", + "\n", + "### Optional Parameters\n", + "The following parameters are optional. When executing the method in a Databricks notebook, you don't need to provide them in most cases.\n", + "* `host`: The Databricks workspace hostname, excluding the 'https://' part. Defaults to the 'DATABRICKS_HOST' environment variable, or to the current workspace if running in a Databricks notebook.\n", + "* `api_token`: The Databricks personal access token for accessing the Databricks SQL warehouse or the cluster. Defaults to the 'DATABRICKS_API_TOKEN' environment variable, or a temporary token is generated if running in a Databricks notebook.\n", + "* `warehouse_id`: The warehouse ID in Databricks SQL.\n", + "* `cluster_id`: The cluster ID in the Databricks Runtime.
If running in a Databricks notebook and both 'warehouse_id' and 'cluster_id' are None, it uses the ID of the cluster the notebook is attached to.\n", + "* `engine_args`: The arguments to be used when connecting to Databricks.\n", + "* `**kwargs`: Additional keyword arguments for the `SQLDatabase.from_uri` method." + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "## Examples" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": 2, + "outputs": [], + "source": [ + "# Connecting to Databricks with the SQLDatabase wrapper\n", + "from langchain import SQLDatabase\n", + "\n", + "db = SQLDatabase.from_databricks(catalog='samples', schema='nyctaxi')" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": 3, + "outputs": [], + "source": [ + "# Creating an OpenAI Chat LLM wrapper\n", + "from langchain.chat_models import ChatOpenAI\n", + "\n", + "llm = ChatOpenAI(temperature=0, model_name=\"gpt-4\")" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "source": [ + "### SQL Chain example\n", + "\n", + "This example demonstrates the use of the [SQL Chain](https://python.langchain.com/en/latest/modules/chains/examples/sqlite.html) for answering a question over a Databricks database." + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "36f2270b", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain import SQLDatabaseChain\n", + "\n", + "db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "4e2b5f25", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\n", + "\u001B[1m> Entering new SQLDatabaseChain chain...\u001B[0m\n", + "What is the average duration of taxi rides that start between midnight and 6am?\n", + "SQLQuery:\u001B[32;1m\u001B[1;3mSELECT AVG(UNIX_TIMESTAMP(tpep_dropoff_datetime) - UNIX_TIMESTAMP(tpep_pickup_datetime)) as avg_duration\n", + "FROM trips\n", + "WHERE HOUR(tpep_pickup_datetime) >= 0 AND HOUR(tpep_pickup_datetime) < 6\u001B[0m\n", + "SQLResult: \u001B[33;1m\u001B[1;3m[(987.8122786304605,)]\u001B[0m\n", + "Answer:\u001B[32;1m\u001B[1;3mThe average duration of taxi rides that start between midnight and 6am is 987.81 seconds.\u001B[0m\n", + "\u001B[1m> Finished chain.\u001B[0m\n" + ] + }, + { + "data": { + "text/plain": "'The average duration of taxi rides that start between midnight and 6am is 987.81 seconds.'" + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "db_chain.run(\"What is the average duration of taxi rides that start between midnight and 6am?\")" + ] + }, + { + "cell_type": "markdown", + "source": [ + "### SQL Database Agent example\n", + "\n", + "This example demonstrates the use of the [SQL Database Agent](https://python.langchain.com/en/latest/modules/agents/toolkits/examples/sql_database.html) for answering questions over a Databricks database."
+ ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "9918e86a", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.agents import create_sql_agent\n", + "from langchain.agents.agent_toolkits import SQLDatabaseToolkit\n", + "\n", + "toolkit = SQLDatabaseToolkit(db=db, llm=llm)\n", + "agent = create_sql_agent(\n", + "    llm=llm,\n", + "    toolkit=toolkit,\n", + "    verbose=True\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "c484a76e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\n", + "\u001B[1m> Entering new AgentExecutor chain...\u001B[0m\n", + "\u001B[32;1m\u001B[1;3mAction: list_tables_sql_db\n", + "Action Input: \u001B[0m\n", + "Observation: \u001B[38;5;200m\u001B[1;3mtrips\u001B[0m\n", + "Thought:\u001B[32;1m\u001B[1;3mI should check the schema of the trips table to see if it has the necessary columns for trip distance and duration.\n", + "Action: schema_sql_db\n", + "Action Input: trips\u001B[0m\n", + "Observation: \u001B[33;1m\u001B[1;3m\n", + "CREATE TABLE trips (\n", + "\ttpep_pickup_datetime TIMESTAMP, \n", + "\ttpep_dropoff_datetime TIMESTAMP, \n", + "\ttrip_distance FLOAT, \n", + "\tfare_amount FLOAT, \n", + "\tpickup_zip INT, \n", + "\tdropoff_zip INT\n", + ") USING DELTA\n", + "\n", + "/*\n", + "3 rows from trips table:\n", + "tpep_pickup_datetime\ttpep_dropoff_datetime\ttrip_distance\tfare_amount\tpickup_zip\tdropoff_zip\n", + "2016-02-14 16:52:13+00:00\t2016-02-14 17:16:04+00:00\t4.94\t19.0\t10282\t10171\n", + "2016-02-04 18:44:19+00:00\t2016-02-04 18:46:00+00:00\t0.28\t3.5\t10110\t10110\n", + "2016-02-17 17:13:57+00:00\t2016-02-17 17:17:55+00:00\t0.7\t5.0\t10103\t10023\n", + "*/\u001B[0m\n", + "Thought:\u001B[32;1m\u001B[1;3mThe trips table has the necessary columns for trip distance and duration. I will write a query to find the longest trip distance and its duration.\n", + "Action: query_checker_sql_db\n", + "Action Input: SELECT trip_distance, tpep_dropoff_datetime - tpep_pickup_datetime as duration FROM trips ORDER BY trip_distance DESC LIMIT 1\u001B[0m\n", + "Observation: \u001B[31;1m\u001B[1;3mSELECT trip_distance, tpep_dropoff_datetime - tpep_pickup_datetime as duration FROM trips ORDER BY trip_distance DESC LIMIT 1\u001B[0m\n", + "Thought:\u001B[32;1m\u001B[1;3mThe query is correct.
I will now execute it to find the longest trip distance and its duration.\n", + "Action: query_sql_db\n", + "Action Input: SELECT trip_distance, tpep_dropoff_datetime - tpep_pickup_datetime as duration FROM trips ORDER BY trip_distance DESC LIMIT 1\u001B[0m\n", + "Observation: \u001B[36;1m\u001B[1;3m[(30.6, '0 00:43:31.000000000')]\u001B[0m\n", + "Thought:\u001B[32;1m\u001B[1;3mI now know the final answer.\n", + "Final Answer: The longest trip distance is 30.6 miles and it took 43 minutes and 31 seconds.\u001B[0m\n", + "\n", + "\u001B[1m> Finished chain.\u001B[0m\n" + ] + }, + { + "data": { + "text/plain": "'The longest trip distance is 30.6 miles and it took 43 minutes and 31 seconds.'" + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "agent.run(\"What is the longest trip distance and how long did it take?\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/modules/chains/examples/sqlite.ipynb b/docs/modules/chains/examples/sqlite.ipynb index 56cdf48138..75942025d9 100644 --- a/docs/modules/chains/examples/sqlite.ipynb +++ b/docs/modules/chains/examples/sqlite.ipynb @@ -23,7 +23,7 @@ } }, "source": [ - "Under the hood, LangChain uses SQLAlchemy to connect to SQL databases. The `SQLDatabaseChain` can therefore be used with any SQL dialect supported by SQLAlchemy, such as MS SQL, MySQL, MariaDB, PostgreSQL, Oracle SQL, Databricks and SQLite. Please refer to the SQLAlchemy documentation for more information about requirements for connecting to your database. For example, a connection to MySQL requires an appropriate connector such as PyMySQL. A URI for a MySQL connection might look like: `mysql+pymysql://user:pass@some_mysql_db_address/db_name`. To connect to Databricks, it is recommended to use the handy method `SQLDatabase.from_databricks(catalog, schema, host, api_token, (warehouse_id|cluster_id))`.\n", + "Under the hood, LangChain uses SQLAlchemy to connect to SQL databases. The `SQLDatabaseChain` can therefore be used with any SQL dialect supported by SQLAlchemy, such as MS SQL, MySQL, MariaDB, PostgreSQL, Oracle SQL, [Databricks](../../../integrations/databricks.ipynb) and SQLite. Please refer to the SQLAlchemy documentation for more information about requirements for connecting to your database. For example, a connection to MySQL requires an appropriate connector such as PyMySQL. A URI for a MySQL connection might look like: `mysql+pymysql://user:pass@some_mysql_db_address/db_name`.\n", "\n", "This demonstration uses SQLite and the example Chinook database.\n", "To set it up, follow the instructions on https://database.guide/2-sample-databases-sqlite/, placing the `.db` file in a notebooks folder at the root of this repository."
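The notebook only exercises the in-notebook defaults of `SQLDatabase.from_databricks()`. Below is a minimal sketch of connecting from outside a Databricks notebook, based on the parameters documented above; the hostname, access token, and warehouse ID are placeholders, and the `DATABRICKS_HOST` / `DATABRICKS_API_TOKEN` environment variables can be used instead of passing values explicitly.

```python
# Minimal sketch (placeholder values): connect to a Databricks SQL warehouse
# from outside a Databricks notebook, reusing the APIs shown in the notebook.
from langchain import SQLDatabase, SQLDatabaseChain
from langchain.chat_models import ChatOpenAI

db = SQLDatabase.from_databricks(
    catalog="samples",
    schema="nyctaxi",
    host="my-workspace.cloud.databricks.com",  # workspace hostname, without 'https://'
    api_token="dapi-...",                      # personal access token (placeholder)
    warehouse_id="1234567890abcdef",           # or pass cluster_id for a Databricks runtime
)

llm = ChatOpenAI(temperature=0, model_name="gpt-4")
db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)
db_chain.run("How many trips are in the trips table?")
```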