docs: add csv use case (#16756)

pull/16797/head
Bagatur 5 months ago committed by GitHub
parent 4acd2654a3
commit b0347f3e2b
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194

@ -0,0 +1,781 @@
{
"cells": [
{
"cell_type": "raw",
"id": "d00a802f-a27e-43a5-af1e-500d4bb70859",
"metadata": {},
"source": [
"---\n",
"sidebar_position: 0.3\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "674a0d41-e3e3-4423-a995-25d40128c518",
"metadata": {},
"source": [
"# CSV\n",
"\n",
"LLMs are great for building question-answering systems over various types of data sources. In this section we'll go over how to build Q&A systems over data stored in CSV files. As with SQL databases, the key to working with CSV files is to give an LLM access to tools for querying and interacting with the data. The two main ways to do this are:\n",
"\n",
"* **RECOMMENDED**: Load the CSV(s) into a SQL database, and use the approaches outlined in the [SQL use case docs](/docs/use_cases/sql/).\n",
"* Give the LLM access to a Python environment where it can use libraries like Pandas to interact with the data.\n",
"\n",
"## ⚠️ Security note ⚠️\n",
"\n",
"Both approaches mentioned above carry significant risks. Using SQL requires executing model-generated SQL queries. Using a library like Pandas requires letting the model execute Python code. Since it is easier to tightly scope SQL connection permissions and sanitize SQL queries than it is to sandbox Python environments, **we HIGHLY recommend interacting with CSV data via SQL.** For more on general security best practices, [see here](/docs/security)."
]
},
{
"cell_type": "markdown",
"id": "d20c20d7-71e1-4808-9012-48278f3a9b94",
"metadata": {},
"source": [
"## Setup\n",
"Dependencies for this guide:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c3fcf245-b0aa-4aee-8f0a-9c9cf94b065e",
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain langchain-openai langchain-community langchain-experimental pandas"
]
},
{
"cell_type": "markdown",
"id": "7f2e34a3-0978-4856-8844-d8dfc6d5ac51",
"metadata": {},
"source": [
"Set required environment variables:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "53913d79-4a11-4bc6-bb49-dea2cc8c453b",
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass()\n",
"\n",
"# Using LangSmith is recommended but not required. Uncomment below lines to use.\n",
"# os.environ[\"LANGCHAIN_TRACING_V2\"] = \"true\"\n",
"# os.environ[\"LANGCHAIN_API_KEY\"] = getpass.getpass()"
]
},
{
"cell_type": "markdown",
"id": "c23b4232-2f6a-4eb5-b0cb-1d48a9e02fcc",
"metadata": {},
"source": [
"Download the [Titanic dataset](https://www.kaggle.com/datasets/yasserh/titanic-dataset) if you don't already have it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2c5e524-781c-4b8e-83ec-d302023f8767",
"metadata": {},
"outputs": [],
"source": [
"!wget https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv -O titanic.csv"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "8431551e-e0d7-4702-90e3-12c53161a479",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(887, 8)\n",
"['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_csv(\"titanic.csv\")\n",
"print(df.shape)\n",
"print(df.columns.tolist())"
]
},
{
"cell_type": "markdown",
"id": "1779ab07-b715-49e5-ab2a-2e6be7d02927",
"metadata": {},
"source": [
"## SQL\n",
"\n",
"Using SQL to interact with CSV data is the recommended approach because it is easier to limit permissions and sanitize queries than with arbitrary Python.\n",
"\n",
"Most SQL databases make it easy to load a CSV file in as a table ([DuckDB](https://duckdb.org/docs/data/csv/overview.html), [SQLite](https://www.sqlite.org/csv.html), etc.). Once you've done this you can use all of the chain and agent-creating techniques outlined in the [SQL use case guide](/docs/use_cases/sql/). Here's a quick example of how we might do this with SQLite:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "f61e9886-4713-4c88-87d4-dab439687f43",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"887"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_community.utilities import SQLDatabase\n",
"from sqlalchemy import create_engine\n",
"\n",
"engine = create_engine(\"sqlite:///titanic.db\")\n",
"df.to_sql(\"titanic\", engine, index=False)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "fa314f1f-d764-41a2-8f27-163cd071c562",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"sqlite\n",
"['titanic']\n"
]
},
{
"data": {
"text/plain": [
"\"[(1, 2, 'Master. Alden Gates Caldwell', 'male', 0.83, 0, 2, 29.0), (0, 3, 'Master. Eino Viljami Panula', 'male', 1.0, 4, 1, 39.6875), (1, 3, 'Miss. Eleanor Ileen Johnson', 'female', 1.0, 1, 1, 11.1333), (1, 2, 'Master. Richard F Becker', 'male', 1.0, 2, 1, 39.0), (1, 1, 'Master. Hudson Trevor Allison', 'male', 0.92, 1, 2, 151.55), (1, 3, 'Miss. Maria Nakid', 'female', 1.0, 0, 2, 15.7417), (0, 3, 'Master. Sidney Leonard Goodwin', 'male', 1.0, 5, 2, 46.9), (1, 3, 'Miss. Helene Barbara Baclini', 'female', 0.75, 2, 1, 19.2583), (1, 3, 'Miss. Eugenie Baclini', 'female', 0.75, 2, 1, 19.2583), (1, 2, 'Master. Viljo Hamalainen', 'male', 0.67, 1, 1, 14.5), (1, 3, 'Master. Bertram Vere Dean', 'male', 1.0, 1, 2, 20.575), (1, 3, 'Master. Assad Alexander Thomas', 'male', 0.42, 0, 1, 8.5167), (1, 2, 'Master. Andre Mallet', 'male', 1.0, 0, 2, 37.0042), (1, 2, 'Master. George Sibley Richards', 'male', 0.83, 1, 1, 18.75)]\""
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"db = SQLDatabase(engine=engine)\n",
"print(db.dialect)\n",
"print(db.get_usable_table_names())\n",
"db.run(\"SELECT * FROM titanic WHERE Age < 2;\")"
]
},
{
"cell_type": "markdown",
"id": "42f5a3c3-707c-4331-9f5f-0cb4919763dd",
"metadata": {},
"source": [
"And create a [SQL agent](/docs/use_cases/sql/agents) to interact with it:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "edd92649-b178-47bd-b2b7-d5d4e14b3512",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.agent_toolkits import create_sql_agent\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-3.5-turbo\", temperature=0)\n",
"agent_executor = create_sql_agent(llm, db=db, agent_type=\"openai-tools\", verbose=True)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "9680e2c0-7957-4dba-9183-9782865176a3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
"\u001b[32;1m\u001b[1;3m\n",
"Invoking: `sql_db_list_tables` with `{}`\n",
"\n",
"\n",
"\u001b[0m\u001b[38;5;200m\u001b[1;3mtitanic\u001b[0m\u001b[32;1m\u001b[1;3m\n",
"Invoking: `sql_db_schema` with `{'table_names': 'titanic'}`\n",
"\n",
"\n",
"\u001b[0m\u001b[33;1m\u001b[1;3m\n",
"CREATE TABLE titanic (\n",
"\t\"Survived\" BIGINT, \n",
"\t\"Pclass\" BIGINT, \n",
"\t\"Name\" TEXT, \n",
"\t\"Sex\" TEXT, \n",
"\t\"Age\" FLOAT, \n",
"\t\"Siblings/Spouses Aboard\" BIGINT, \n",
"\t\"Parents/Children Aboard\" BIGINT, \n",
"\t\"Fare\" FLOAT\n",
")\n",
"\n",
"/*\n",
"3 rows from titanic table:\n",
"Survived\tPclass\tName\tSex\tAge\tSiblings/Spouses Aboard\tParents/Children Aboard\tFare\n",
"0\t3\tMr. Owen Harris Braund\tmale\t22.0\t1\t0\t7.25\n",
"1\t1\tMrs. John Bradley (Florence Briggs Thayer) Cumings\tfemale\t38.0\t1\t0\t71.2833\n",
"1\t3\tMiss. Laina Heikkinen\tfemale\t26.0\t0\t0\t7.925\n",
"*/\u001b[0m\u001b[32;1m\u001b[1;3m\n",
"Invoking: `sql_db_query` with `{'query': 'SELECT AVG(Age) AS AverageAge FROM titanic WHERE Survived = 1'}`\n",
"responded: To find the average age of survivors, I will query the \"titanic\" table and calculate the average of the \"Age\" column for the rows where \"Survived\" is equal to 1.\n",
"\n",
"Here is the SQL query:\n",
"\n",
"```sql\n",
"SELECT AVG(Age) AS AverageAge\n",
"FROM titanic\n",
"WHERE Survived = 1\n",
"```\n",
"\n",
"Executing this query will give us the average age of the survivors.\n",
"\n",
"\u001b[0m\u001b[36;1m\u001b[1;3m[(28.408391812865496,)]\u001b[0m\u001b[32;1m\u001b[1;3mThe average age of the survivors is approximately 28.41 years.\u001b[0m\n",
"\n",
"\u001b[1m> Finished chain.\u001b[0m\n"
]
},
{
"data": {
"text/plain": [
"{'input': \"what's the average age of survivors\",\n",
" 'output': 'The average age of the survivors is approximately 28.41 years.'}"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"agent_executor.invoke({\"input\": \"what's the average age of survivors\"})"
]
},
{
"cell_type": "markdown",
"id": "4d1eb128-842b-4018-87ab-bb269147f6ec",
"metadata": {},
"source": [
"This approach easily generalizes to multiple CSVs, since we can just load each of them into our database as its own table. Head to the [SQL guide](/docs/use_cases/sql/) for more."
]
},
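{
"cell_type": "markdown",
"id": "multi-csv-to-sql-sketch",
"metadata": {},
"source": [
"As a minimal sketch of that idea (using small in-memory dataframes as stand-ins for real CSV files), each dataframe is simply written to its own table in the same database:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "multi-csv-to-sql-code",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sqlalchemy import create_engine, inspect\n",
"\n",
"# Stand-ins for pd.read_csv(path) calls on real CSV files.\n",
"frames = {\n",
"    \"people\": pd.DataFrame({\"name\": [\"a\", \"b\"], \"age\": [30, 40]}),\n",
"    \"fares\": pd.DataFrame({\"fare\": [7.25, 71.28]}),\n",
"}\n",
"\n",
"engine = create_engine(\"sqlite:///:memory:\")\n",
"for table_name, frame in frames.items():\n",
"    frame.to_sql(table_name, engine, index=False)\n",
"\n",
"sorted(inspect(engine).get_table_names())"
]
},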
{
"cell_type": "markdown",
"id": "fe7f2d91-2377-49dd-97a3-19d48a750715",
"metadata": {},
"source": [
"## Pandas\n",
"\n",
"Instead of SQL we can also use data analysis libraries like pandas and the code generating abilities of LLMs to interact with CSV data. Again, **this approach is not fit for production use cases unless you have extensive safeguards in place**. For this reason, our code-execution utilities and constructors live in the `langchain-experimental` package.\n",
"\n",
"### Chain\n",
"\n",
"Most LLMs have been trained on enough pandas Python code that they can generate it just by being asked to:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "cd02e72d-31bf-4ed3-b4fd-643011dab236",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"```python\n",
"correlation = df['Age'].corr(df['Fare'])\n",
"correlation\n",
"```\n"
]
}
],
"source": [
"ai_msg = llm.invoke(\n",
" \"I have a pandas DataFrame 'df' with columns 'Age' and 'Fare'. Write code to compute the correlation between the two columns. Return Markdown for a Python code snippet and nothing else.\"\n",
")\n",
"print(ai_msg.content)"
]
},
{
"cell_type": "markdown",
"id": "f5e84003-5c39-496b-afa7-eaa50a01b7bb",
"metadata": {},
"source": [
"We can combine this ability with a Python-executing tool to create a simple data analysis chain. We'll first want to load our CSV table as a dataframe, and give the tool access to this dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "d8132f75-12d4-4294-b446-2d114e603f4f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"32.30542018038331"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_experimental.tools import PythonAstREPLTool\n",
"\n",
"df = pd.read_csv(\"titanic.csv\")\n",
"tool = PythonAstREPLTool(locals={\"df\": df})\n",
"tool.invoke(\"df['Fare'].mean()\")"
]
},
{
"cell_type": "markdown",
"id": "ab1b2e7c-6ea8-4674-98eb-a43c69f5c19d",
"metadata": {},
"source": [
"To help enforce proper use of our Python tool, we'll use [function calling](/docs/modules/model_io/chat/function_calling):"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "2d30dbca-2d19-4574-bc78-43753f648eb7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_6TZsNaCqOcbP7lqWudosQTd6', 'function': {'arguments': '{\\n \"query\": \"df[[\\'Age\\', \\'Fare\\']].corr()\"\\n}', 'name': 'python_repl_ast'}, 'type': 'function'}]})"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llm_with_tools = llm.bind_tools([tool], tool_choice=tool.name)\n",
"llm_with_tools.invoke(\n",
" \"I have a dataframe 'df' and want to know the correlation between the 'Age' and 'Fare' columns\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "bdec46fb-7296-443c-9e97-cfa9045ff21d",
"metadata": {},
"source": [
"We'll add an [OpenAI tools output parser](/docs/modules/model_io/output_parsers/types/openai_tools) to extract the function call as a dict:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "f0b658cb-722b-43e8-84ad-62ba8929169a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'query': \"df[['Age', 'Fare']].corr()\"}"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.output_parsers.openai_tools import JsonOutputKeyToolsParser\n",
"\n",
"parser = JsonOutputKeyToolsParser(tool.name, return_single=True)\n",
"(llm_with_tools | parser).invoke(\n",
" \"I have a dataframe 'df' and want to know the correlation between the 'Age' and 'Fare' columns\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "59362ea0-cc5a-4841-b87c-51d6a87d5810",
"metadata": {},
"source": [
"And combine this with a prompt so that we can just ask a question, without needing to spell out the dataframe info on every invocation:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "0bd2ecba-90c6-4301-8cc1-bd021a7f74fc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'query': \"df[['Age', 'Fare']].corr()\"}"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"system = f\"\"\"You have access to a pandas dataframe `df`. \\\n",
"Here is the output of `df.head().to_markdown()`:\n",
"\n",
"```\n",
"{df.head().to_markdown()}\n",
"```\n",
"\n",
"Given a user question, write the Python code to answer it. \\\n",
"Return ONLY the valid Python code and nothing else. \\\n",
"Don't assume you have access to any libraries other than built-in Python ones and pandas.\"\"\"\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [(\"system\", system), (\"human\", \"{question}\")]\n",
")\n",
"code_chain = prompt | llm_with_tools | parser\n",
"code_chain.invoke({\"question\": \"What's the correlation between age and fare\"})"
]
},
{
"cell_type": "markdown",
"id": "63989e47-c0af-409e-9766-83c3fe6d69bb",
"metadata": {},
"source": [
"And lastly we'll add our Python tool so that the generated code is actually executed:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "745b5b2c-2eda-441e-8459-275dc1d4d9aa",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.11232863699941621"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain = prompt | llm_with_tools | parser | tool # noqa\n",
"chain.invoke({\"question\": \"What's the correlation between age and fare\"})"
]
},
{
"cell_type": "markdown",
"id": "fbb12764-4a90-4e84-88b4-a25949084ea2",
"metadata": {},
"source": [
"And just like that we have a simple data analysis chain. We can take a peek at the intermediate steps by looking at the LangSmith trace: https://smith.langchain.com/public/b1309290-7212-49b7-bde2-75b39a32b49a/r\n",
"\n",
"We could add an additional LLM call at the end to generate a conversational response, so that we're not just responding with the tool output. For this we'll want to add a chat history `MessagesPlaceholder` to our prompt:"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "3fe3818d-0657-4729-ac46-ab5d4860d8f6",
"metadata": {},
"outputs": [],
"source": [
"from operator import itemgetter\n",
"\n",
"from langchain_core.messages import ToolMessage\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.prompts import MessagesPlaceholder\n",
"from langchain_core.runnables import RunnablePassthrough\n",
"\n",
"system = f\"\"\"You have access to a pandas dataframe `df`. \\\n",
"Here is the output of `df.head().to_markdown()`:\n",
"\n",
"```\n",
"{df.head().to_markdown()}\n",
"```\n",
"\n",
"Given a user question, write the Python code to answer it. \\\n",
"Don't assume you have access to any libraries other than built-in Python ones and pandas.\n",
"Respond directly to the question once you have enough information to answer it.\"\"\"\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\n",
" \"system\",\n",
" system,\n",
" ),\n",
" (\"human\", \"{question}\"),\n",
" # This MessagesPlaceholder allows us to optionally append an arbitrary number of messages\n",
" # at the end of the prompt using the 'chat_history' arg.\n",
" MessagesPlaceholder(\"chat_history\", optional=True),\n",
" ]\n",
")\n",
"\n",
"\n",
"def _get_chat_history(x: dict) -> list:\n",
" \"\"\"Parse the chain output up to this point into a list of chat history messages to insert in the prompt.\"\"\"\n",
" ai_msg = x[\"ai_msg\"]\n",
" tool_call_id = x[\"ai_msg\"].additional_kwargs[\"tool_calls\"][0][\"id\"]\n",
" tool_msg = ToolMessage(tool_call_id=tool_call_id, content=str(x[\"tool_output\"]))\n",
" return [ai_msg, tool_msg]\n",
"\n",
"\n",
"chain = (\n",
" RunnablePassthrough.assign(ai_msg=prompt | llm_with_tools)\n",
" .assign(tool_output=itemgetter(\"ai_msg\") | parser | tool)\n",
" .assign(chat_history=_get_chat_history)\n",
" .assign(response=prompt | llm | StrOutputParser())\n",
" .pick([\"tool_output\", \"response\"])\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "03e14712-9959-4f2d-94d5-4ac2bd9f3f08",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'tool_output': 0.11232863699941621,\n",
" 'response': 'The correlation between age and fare is approximately 0.112.'}"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain.invoke({\"question\": \"What's the correlation between age and fare\"})"
]
},
{
"cell_type": "markdown",
"id": "245a5a91-c6d2-4a40-9b9f-eb38f78c9d22",
"metadata": {},
"source": [
"Here's the LangSmith trace for this run: https://smith.langchain.com/public/ca689f8a-5655-4224-8bcf-982080744462/r"
]
},
{
"cell_type": "markdown",
"id": "6c24b4f4-abbf-4891-b200-814eb9c35bec",
"metadata": {},
"source": [
"### Agent\n",
"\n",
"For complex questions it can be helpful for an LLM to be able to iteratively execute code while maintaining the inputs and outputs of its previous executions. This is where Agents come into play. They allow an LLM to decide how many times a tool needs to be invoked and keep track of the executions it's made so far. The [create_pandas_dataframe_agent](https://api.python.langchain.com/en/latest/agents/langchain_experimental.agents.agent_toolkits.pandas.base.create_pandas_dataframe_agent.html) is a built-in agent that makes it easy to work with dataframes:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "b8b3a781-189f-48ff-b541-f5ed2f65e3e7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
"\u001b[32;1m\u001b[1;3m\n",
"Invoking: `python_repl_ast` with `{'query': \"df[['Age', 'Fare']].corr()\"}`\n",
"\n",
"\n",
"\u001b[0m\u001b[36;1m\u001b[1;3m Age Fare\n",
"Age 1.000000 0.112329\n",
"Fare 0.112329 1.000000\u001b[0m\u001b[32;1m\u001b[1;3m\n",
"Invoking: `python_repl_ast` with `{'query': \"df[['Fare', 'Survived']].corr()\"}`\n",
"\n",
"\n",
"\u001b[0m\u001b[36;1m\u001b[1;3m Fare Survived\n",
"Fare 1.000000 0.256179\n",
"Survived 0.256179 1.000000\u001b[0m\u001b[32;1m\u001b[1;3mThe correlation between age and fare is 0.112329, while the correlation between fare and survival is 0.256179. Therefore, the correlation between fare and survival is greater than the correlation between age and fare.\u001b[0m\n",
"\n",
"\u001b[1m> Finished chain.\u001b[0m\n"
]
},
{
"data": {
"text/plain": [
"{'input': \"What's the correlation between age and fare? is that greater than the correlation between fare and survival?\",\n",
" 'output': 'The correlation between age and fare is 0.112329, while the correlation between fare and survival is 0.256179. Therefore, the correlation between fare and survival is greater than the correlation between age and fare.'}"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_experimental.agents import create_pandas_dataframe_agent\n",
"\n",
"agent = create_pandas_dataframe_agent(llm, df, agent_type=\"openai-tools\", verbose=True)\n",
"agent.invoke(\n",
" {\n",
" \"input\": \"What's the correlation between age and fare? is that greater than the correlation between fare and survival?\"\n",
" }\n",
")"
]
},
{
"cell_type": "markdown",
"id": "a65322f3-b13c-4949-82b2-4517b9a0859d",
"metadata": {},
"source": [
"Here's the LangSmith trace for this run: https://smith.langchain.com/public/8e6c23cc-782c-4203-bac6-2a28c770c9f0/r"
]
},
{
"cell_type": "markdown",
"id": "68492261-faef-47e7-8009-e20ef1420d5a",
"metadata": {},
"source": [
"### Multiple CSVs\n",
"\n",
"To handle multiple CSVs (or dataframes) we just need to pass multiple dataframes to our Python tool. Our `create_pandas_dataframe_agent` constructor supports this out of the box: we can pass in a list of dataframes instead of just one. If we're constructing a chain ourselves, we can do something like:"
]
},
{
"cell_type": "code",
"execution_count": 63,
"id": "bb528ab0-4aed-43fd-8a15-a1fe02a33d9e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-0.14384991262954416"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_1 = df[[\"Age\", \"Fare\"]]\n",
"df_2 = df[[\"Fare\", \"Survived\"]]\n",
"\n",
"tool = PythonAstREPLTool(locals={\"df_1\": df_1, \"df_2\": df_2})\n",
"llm_with_tool = llm.bind_tools(tools=[tool], tool_choice=tool.name)\n",
"df_template = \"\"\"```python\n",
"{df_name}.head().to_markdown()\n",
">>> {df_head}\n",
"```\"\"\"\n",
"df_context = \"\\n\\n\".join(\n",
" df_template.format(df_head=_df.head().to_markdown(), df_name=df_name)\n",
" for _df, df_name in [(df_1, \"df_1\"), (df_2, \"df_2\")]\n",
")\n",
"\n",
"system = f\"\"\"You have access to a number of pandas dataframes. \\\n",
"Here is a sample of rows from each dataframe and the python code that was used to generate the sample:\n",
"\n",
"{df_context}\n",
"\n",
"Given a user question about the dataframes, write the Python code to answer it. \\\n",
"Don't assume you have access to any libraries other than built-in Python ones and pandas. \\\n",
"Make sure to refer only to the variables mentioned above.\"\"\"\n",
"prompt = ChatPromptTemplate.from_messages([(\"system\", system), (\"human\", \"{question}\")])\n",
"\n",
"chain = prompt | llm_with_tool | parser | tool\n",
"chain.invoke(\n",
" {\n",
" \"question\": \"return the difference in the correlation between age and fare and the correlation between fare and survival\"\n",
" }\n",
")"
]
},
{
"cell_type": "markdown",
"id": "7043363f-4ab1-41de-9318-c556e4ae66bc",
"metadata": {},
"source": [
"Here's the LangSmith trace for this run: https://smith.langchain.com/public/653e499f-179c-4757-8041-f5e2a5f11fcc/r"
]
},
{
"cell_type": "markdown",
"id": "a2256d09-23c2-4e52-bfc6-c84eba538586",
"metadata": {},
"source": [
"### Sandboxed code execution\n",
"\n",
"There are a number of tools like [E2B](/docs/integrations/tools/e2b_data_analysis) and [Bearly](/docs/integrations/tools/bearly) that provide sandboxed environments for Python code execution, to allow for safer code-executing chains and agents."
]
},
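{
"cell_type": "markdown",
"id": "sandbox-illustration-note",
"metadata": {},
"source": [
"As a rough illustration of the direction these tools take (this is NOT a real sandbox), model-generated code can at minimum be run in a separate process with a timeout, so a runaway script can't block the main application. Services like E2B and Bearly go much further and execute the code in isolated remote environments:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sandbox-illustration-code",
"metadata": {},
"outputs": [],
"source": [
"import subprocess\n",
"import sys\n",
"\n",
"# A separate process with a timeout limits runtime, but NOT filesystem or\n",
"# network access -- real sandboxes isolate those as well.\n",
"generated_code = \"print(1 + 1)\"\n",
"result = subprocess.run(\n",
"    [sys.executable, \"-c\", generated_code],\n",
"    capture_output=True,\n",
"    text=True,\n",
"    timeout=5,\n",
")\n",
"print(result.stdout.strip())"
]
},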
{
"cell_type": "markdown",
"id": "1728e791-f114-41e6-aa12-0436fdeeedae",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"For more advanced data analysis applications we recommend checking out:\n",
"\n",
"* [SQL use case](/docs/use_cases/sql/): Many of the challenges of working with SQL databases and CSVs are generic to any structured data type, so it's useful to read the SQL techniques even if you're using Pandas for CSV data analysis.\n",
"* [Tool use](/docs/use_cases/tool_use/): Guides on general best practices when working with chains and agents that invoke tools.\n",
"* [Agents](/docs/use_cases/agents/): Understand the fundamentals of building LLM agents.\n",
"* Integrations: Sandboxed envs like [E2B](/docs/integrations/tools/e2b_data_analysis) and [Bearly](/docs/integrations/tools/bearly), utilities like [SQLDatabase](https://api.python.langchain.com/en/latest/utilities/langchain_community.utilities.sql_database.SQLDatabase.html#langchain_community.utilities.sql_database.SQLDatabase), related agents like [Spark DataFrame agent](/docs/integrations/toolkits/spark)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "poetry-venv",
"language": "python",
"name": "poetry-venv"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -6,7 +6,6 @@
"metadata": {},
"source": [
"---\n",
"sidebar-position: 1\n",
"title: Synthetic data generation\n",
"---"
]

@ -6,7 +6,6 @@
"metadata": {},
"source": [
"---\n",
"sidebar_position: 1\n",
"title: Extraction\n",
"---"
]
@ -611,7 +610,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
"version": "3.9.1"
}
},
"nbformat": 4,

@ -5,7 +5,7 @@
"metadata": {},
"source": [
"---\n",
"sidebar_position: 0.5\n",
"sidebar_position: 0.1\n",
"---"
]
},

@ -6,7 +6,6 @@
"metadata": {},
"source": [
"---\n",
"sidebar_position: 1\n",
"title: Summarization\n",
"---"
]

@ -6,7 +6,6 @@
"metadata": {},
"source": [
"---\n",
"sidebar_position: 1\n",
"title: Tagging\n",
"---"
]
@ -415,7 +414,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
"version": "3.9.1"
}
},
"nbformat": 4,

@ -6,7 +6,7 @@
"metadata": {},
"source": [
"---\n",
"sidebar_position: 0.9\n",
"sidebar_position: 0.2\n",
"---"
]
},

@ -6,7 +6,6 @@
"metadata": {},
"source": [
"---\n",
"sidebar_position: 1\n",
"title: Web scraping\n",
"---"
]
@ -670,7 +669,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
"version": "3.9.1"
}
},
"nbformat": 4,

@ -2,7 +2,7 @@
from __future__ import annotations
import warnings
from typing import TYPE_CHECKING, Any, Dict, List, Literal, Optional, Sequence, Union
from typing import TYPE_CHECKING, Any, Dict, Literal, Optional, Sequence, Union
from langchain_core.messages import AIMessage, SystemMessage
from langchain_core.prompts import BasePromptTemplate, PromptTemplate
@ -40,7 +40,6 @@ def create_sql_agent(
prefix: Optional[str] = None,
suffix: Optional[str] = None,
format_instructions: Optional[str] = None,
input_variables: Optional[List[str]] = None,
top_k: int = 10,
max_iterations: Optional[int] = 15,
max_execution_time: Optional[float] = None,
@ -70,9 +69,6 @@ def create_sql_agent(
format_instructions: Formatting instructions to pass to
ZeroShotAgent.create_prompt() when 'agent_type' is
"zero-shot-react-description". Otherwise ignored.
input_variables: DEPRECATED. Input variables to explicitly specify as part of
ZeroShotAgent.create_prompt() when 'agent_type' is
"zero-shot-react-description". Otherwise ignored.
top_k: Number of rows to query for by default.
max_iterations: Passed to AgentExecutor init.
max_execution_time: Passed to AgentExecutor init.
@ -203,7 +199,10 @@ def create_sql_agent(
)
else:
raise ValueError(f"Agent type {agent_type} not supported at the moment.")
raise ValueError(
f"Agent type {agent_type} not supported at the moment. Must be one of "
"'openai-tools', 'openai-functions', or 'zero-shot-react-description'."
)
return AgentExecutor(
name="SQL Agent Executor",

@ -1,8 +1,8 @@
# flake8: noqa
"""Tools for interacting with a SQL database."""
from typing import Any, Dict, Optional
from typing import Any, Dict, Optional, Type
from langchain_core.pydantic_v1 import BaseModel, Extra, Field, root_validator
from langchain_core.pydantic_v1 import BaseModel, Field, root_validator
from langchain_core.language_models import BaseLanguageModel
from langchain_core.callbacks import (
@ -24,12 +24,16 @@ class BaseSQLDatabaseTool(BaseModel):
pass
class _QuerySQLDataBaseToolInput(BaseModel):
query: str = Field(..., description="A detailed and correct SQL query.")
class QuerySQLDataBaseTool(BaseSQLDatabaseTool, BaseTool):
"""Tool for querying a SQL database."""
name: str = "sql_db_query"
description: str = """
Input to this tool is a detailed and correct SQL query, output is a result from the database.
Execute a SQL query against the database and get back the result..
If the query is not correct, an error message will be returned.
If an error is returned, rewrite the query, check the query, and try again.
"""
@ -43,15 +47,22 @@ class QuerySQLDataBaseTool(BaseSQLDatabaseTool, BaseTool):
return self.db.run_no_throw(query)
class _InfoSQLDatabaseToolInput(BaseModel):
table_names: str = Field(
...,
description=(
"A comma-separated list of the table names for which to return the schema. "
"Example input: 'table1, table2, table3'"
),
)
class InfoSQLDatabaseTool(BaseSQLDatabaseTool, BaseTool):
"""Tool for getting metadata about a SQL database."""
name: str = "sql_db_schema"
description: str = """
Input to this tool is a comma-separated list of tables, output is the schema and sample rows for those tables.
Example Input: "table1, table2, table3"
"""
description: str = "Get the schema and sample rows for the specified SQL tables."
args_schema: Type[BaseModel] = _InfoSQLDatabaseToolInput
def _run(
self,

@ -466,6 +466,14 @@ class _StringImageMessagePromptTemplate(BaseMessagePromptTemplate):
content=content, additional_kwargs=self.additional_kwargs
)
def pretty_repr(self, html: bool = False) -> str:
# TODO: Handle partials
title = self.__class__.__name__.replace("MessagePromptTemplate", " Message")
title = get_msg_title_repr(title, bold=html)
prompts = self.prompt if isinstance(self.prompt, list) else [self.prompt]
prompt_reprs = "\n\n".join(prompt.pretty_repr(html=html) for prompt in prompts)
return f"{title}\n\n{prompt_reprs}"
class HumanMessagePromptTemplate(_StringImageMessagePromptTemplate):
"""Human message prompt template. This is a message sent from the user."""

@ -74,3 +74,6 @@ class ImagePromptTemplate(BasePromptTemplate[ImageURL]):
# Don't check literal values here: let the API check them
output["detail"] = detail # type: ignore[typeddict-item]
return output
def pretty_repr(self, html: bool = False) -> str:
raise NotImplementedError()

@@ -1,26 +1,55 @@
-from io import IOBase
-from typing import Any, List, Optional, Union
+from __future__ import annotations

-from langchain.agents.agent import AgentExecutor
-from langchain_core.language_models import BaseLanguageModel
+from io import IOBase
+from typing import TYPE_CHECKING, Any, List, Optional, Union

 from langchain_experimental.agents.agent_toolkits.pandas.base import (
     create_pandas_dataframe_agent,
 )

+if TYPE_CHECKING:
+    from langchain.agents.agent import AgentExecutor
+    from langchain_core.language_models import LanguageModelLike


 def create_csv_agent(
-    llm: BaseLanguageModel,
+    llm: LanguageModelLike,
     path: Union[str, IOBase, List[Union[str, IOBase]]],
     pandas_kwargs: Optional[dict] = None,
     **kwargs: Any,
 ) -> AgentExecutor:
-    """Create csv agent by loading to a dataframe and using pandas agent."""
+    """Create pandas dataframe agent by loading csv to a dataframe.
+
+    Args:
+        llm: Language model to use for the agent.
+        path: A string path, file-like object or a list of string paths/file-like
+            objects that can be read in as pandas DataFrames with pd.read_csv().
+        pandas_kwargs: Named arguments to pass to pd.read_csv().
+        **kwargs: Additional kwargs to pass to
+            langchain_experimental.agents.agent_toolkits.pandas.base.create_pandas_dataframe_agent().
+
+    Returns:
+        An AgentExecutor with the specified agent_type agent and access to
+        a PythonAstREPLTool with the loaded DataFrame(s) and any user-provided extra_tools.
+
+    Example:
+        .. code-block:: python
+
+            from langchain_openai import ChatOpenAI
+            from langchain_experimental.agents import create_csv_agent
+
+            llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
+            agent_executor = create_csv_agent(
+                llm,
+                "titanic.csv",
+                agent_type="openai-tools",
+                verbose=True
+            )
+    """  # noqa: E501
     try:
         import pandas as pd
     except ImportError:
         raise ImportError(
-            "pandas package not found, please install with `pip install pandas`"
+            "pandas package not found, please install with `pip install pandas`."
         )
     _kwargs = pandas_kwargs or {}

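The loading contract described in the docstring above (one path gives one DataFrame, a list of paths gives a list) can be sketched in isolation. `load_csv_frames` is a hypothetical helper written for illustration, not the library's API:

```python
from io import IOBase, StringIO
from typing import Any, List, Union

import pandas as pd


def load_csv_frames(
    path: "Union[str, IOBase, List[Union[str, IOBase]]]", **pandas_kwargs: Any
):
    # A single path or file-like object yields one DataFrame; a list of
    # them yields a list of DataFrames, mirroring create_csv_agent's input.
    if isinstance(path, (str, IOBase)):
        return pd.read_csv(path, **pandas_kwargs)
    return [pd.read_csv(p, **pandas_kwargs) for p in path]


single = load_csv_frames(StringIO("a,b\n1,2\n3,4"))
multi = load_csv_frames([StringIO("a\n1"), StringIO("a\n2")])
```

The resulting frame (or frames) is what gets forwarded to `create_pandas_dataframe_agent` along with any remaining kwargs.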
@@ -1,16 +1,26 @@
 """Agent for working with pandas objects."""
-from typing import Any, Dict, List, Optional, Sequence, Tuple

-from langchain.agents.agent import AgentExecutor, BaseSingleActionAgent
+import warnings
+from typing import Any, Dict, List, Literal, Optional, Sequence, Union
+
+from langchain.agents import AgentType, create_openai_tools_agent, create_react_agent
+from langchain.agents.agent import (
+    AgentExecutor,
+    BaseMultiActionAgent,
+    BaseSingleActionAgent,
+    RunnableAgent,
+    RunnableMultiActionAgent,
+)
 from langchain.agents.mrkl.base import ZeroShotAgent
-from langchain.agents.openai_functions_agent.base import OpenAIFunctionsAgent
-from langchain.agents.types import AgentType
-from langchain.callbacks.base import BaseCallbackManager
-from langchain.chains.llm import LLMChain
-from langchain.schema import BasePromptTemplate
-from langchain.tools import BaseTool
-from langchain_core.language_models import BaseLanguageModel
+from langchain.agents.openai_functions_agent.base import (
+    OpenAIFunctionsAgent,
+    create_openai_functions_agent,
+)
+from langchain_core.callbacks import BaseCallbackManager
+from langchain_core.language_models import LanguageModelLike
+from langchain_core.messages import SystemMessage
+from langchain_core.prompts import BasePromptTemplate, ChatPromptTemplate
+from langchain_core.tools import BaseTool
+from langchain_core.utils.interactive_env import is_interactive_env

 from langchain_experimental.agents.agent_toolkits.pandas.prompt import (
     FUNCTIONS_WITH_DF,
@@ -28,257 +38,121 @@ from langchain_experimental.tools.python.tool import PythonAstREPLTool
 def _get_multi_prompt(
     dfs: List[Any],
+    *,
     prefix: Optional[str] = None,
     suffix: Optional[str] = None,
-    input_variables: Optional[List[str]] = None,
     include_df_in_prompt: Optional[bool] = True,
     number_of_head_rows: int = 5,
-    extra_tools: Sequence[BaseTool] = (),
-) -> Tuple[BasePromptTemplate, List[BaseTool]]:
-    num_dfs = len(dfs)
+    tools: Sequence[BaseTool] = (),
+) -> BasePromptTemplate:
     if suffix is not None:
         suffix_to_use = suffix
-        include_dfs_head = True
     elif include_df_in_prompt:
         suffix_to_use = SUFFIX_WITH_MULTI_DF
-        include_dfs_head = True
     else:
         suffix_to_use = SUFFIX_NO_DF
-        include_dfs_head = False
-    if input_variables is None:
-        input_variables = ["input", "agent_scratchpad", "num_dfs"]
-        if include_dfs_head:
-            input_variables += ["dfs_head"]
-    if prefix is None:
-        prefix = MULTI_DF_PREFIX
+    prefix = prefix if prefix is not None else MULTI_DF_PREFIX

-    df_locals = {}
-    for i, dataframe in enumerate(dfs):
-        df_locals[f"df{i + 1}"] = dataframe
-    tools = [PythonAstREPLTool(locals=df_locals)] + list(extra_tools)
     prompt = ZeroShotAgent.create_prompt(
         tools,
         prefix=prefix,
         suffix=suffix_to_use,
-        input_variables=input_variables,
     )
     partial_prompt = prompt.partial()
-    if "dfs_head" in input_variables:
+    if "dfs_head" in partial_prompt.input_variables:
         dfs_head = "\n\n".join([d.head(number_of_head_rows).to_markdown() for d in dfs])
-        partial_prompt = partial_prompt.partial(num_dfs=str(num_dfs), dfs_head=dfs_head)
-    if "num_dfs" in input_variables:
-        partial_prompt = partial_prompt.partial(num_dfs=str(num_dfs))
-    return partial_prompt, tools
+        partial_prompt = partial_prompt.partial(dfs_head=dfs_head)
+    if "num_dfs" in partial_prompt.input_variables:
+        partial_prompt = partial_prompt.partial(num_dfs=str(len(dfs)))
+    return partial_prompt
 def _get_single_prompt(
     df: Any,
+    *,
     prefix: Optional[str] = None,
     suffix: Optional[str] = None,
-    input_variables: Optional[List[str]] = None,
     include_df_in_prompt: Optional[bool] = True,
     number_of_head_rows: int = 5,
-    extra_tools: Sequence[BaseTool] = (),
-) -> Tuple[BasePromptTemplate, List[BaseTool]]:
+    tools: Sequence[BaseTool] = (),
+) -> BasePromptTemplate:
     if suffix is not None:
         suffix_to_use = suffix
-        include_df_head = True
     elif include_df_in_prompt:
         suffix_to_use = SUFFIX_WITH_DF
-        include_df_head = True
     else:
         suffix_to_use = SUFFIX_NO_DF
-        include_df_head = False
-    if input_variables is None:
-        input_variables = ["input", "agent_scratchpad"]
-        if include_df_head:
-            input_variables += ["df_head"]
-    if prefix is None:
-        prefix = PREFIX
-    tools = [PythonAstREPLTool(locals={"df": df})] + list(extra_tools)
+    prefix = prefix if prefix is not None else PREFIX

     prompt = ZeroShotAgent.create_prompt(
         tools,
         prefix=prefix,
         suffix=suffix_to_use,
-        input_variables=input_variables,
     )
     partial_prompt = prompt.partial()
-    if "df_head" in input_variables:
-        partial_prompt = partial_prompt.partial(
-            df_head=str(df.head(number_of_head_rows).to_markdown())
-        )
-    return partial_prompt, tools
+    if "df_head" in partial_prompt.input_variables:
+        df_head = str(df.head(number_of_head_rows).to_markdown())
+        partial_prompt = partial_prompt.partial(df_head=df_head)
+    return partial_prompt
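The `df_head` substitution above injects the first `number_of_head_rows` rows of the frame into the prompt. A dependency-light sketch of that rendering step (the library calls `to_markdown`, which needs the optional `tabulate` package; `to_string` is substituted here to keep the example self-contained):

```python
import pandas as pd


def render_df_head(df: pd.DataFrame, number_of_head_rows: int = 5) -> str:
    # Plain-text preview of the first rows, standing in for the markdown
    # table the agent prompt embeds as {df_head}.
    return str(df.head(number_of_head_rows).to_string())


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
preview = render_df_head(df, number_of_head_rows=2)
```

Only the preview string, never the full frame, reaches the model; the frame itself stays in the Python tool's namespace.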
-def _get_prompt_and_tools(
-    df: Any,
-    prefix: Optional[str] = None,
-    suffix: Optional[str] = None,
-    input_variables: Optional[List[str]] = None,
-    include_df_in_prompt: Optional[bool] = True,
-    number_of_head_rows: int = 5,
-    extra_tools: Sequence[BaseTool] = (),
-) -> Tuple[BasePromptTemplate, List[BaseTool]]:
-    try:
-        import pandas as pd
-
-        pd.set_option("display.max_columns", None)
-    except ImportError:
-        raise ImportError(
-            "pandas package not found, please install with `pip install pandas`"
-        )
-
-    if include_df_in_prompt is not None and suffix is not None:
-        raise ValueError("If suffix is specified, include_df_in_prompt should not be.")
-
-    if isinstance(df, list):
-        for item in df:
-            if not isinstance(item, pd.DataFrame):
-                raise ValueError(f"Expected pandas object, got {type(df)}")
-        return _get_multi_prompt(
-            df,
-            prefix=prefix,
-            suffix=suffix,
-            input_variables=input_variables,
-            include_df_in_prompt=include_df_in_prompt,
-            number_of_head_rows=number_of_head_rows,
-            extra_tools=extra_tools,
-        )
-    else:
-        if not isinstance(df, pd.DataFrame):
-            raise ValueError(f"Expected pandas object, got {type(df)}")
-        return _get_single_prompt(
-            df,
-            prefix=prefix,
-            suffix=suffix,
-            input_variables=input_variables,
-            include_df_in_prompt=include_df_in_prompt,
-            number_of_head_rows=number_of_head_rows,
-            extra_tools=extra_tools,
-        )
+def _get_prompt(df: Any, **kwargs: Any) -> BasePromptTemplate:
+    return (
+        _get_multi_prompt(df, **kwargs)
+        if isinstance(df, list)
+        else _get_single_prompt(df, **kwargs)
+    )
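The single/multi split in `_get_prompt` pairs with a naming convention for the Python tool's namespace: one frame is exposed as `df`, several as `df1`, `df2`, and so on. A standalone sketch of that convention (`build_repl_locals` is a hypothetical name; plain dicts stand in for DataFrames so the example runs without pandas):

```python
from typing import Any, Dict, List, Union


def build_repl_locals(df: "Union[Any, List[Any]]") -> Dict[str, Any]:
    # Same isinstance-on-list dispatch as _get_prompt; the returned mapping
    # is what gets handed to PythonAstREPLTool as its locals.
    if isinstance(df, list):
        return {f"df{i + 1}": d for i, d in enumerate(df)}
    return {"df": df}


one = build_repl_locals({"rows": 3})
many = build_repl_locals([{"rows": 3}, {"rows": 5}])
```

This is why prompts generated for multiple frames refer to `df1` through `df{num_dfs}` while the single-frame prompt refers simply to `df`.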
 def _get_functions_single_prompt(
     df: Any,
+    *,
     prefix: Optional[str] = None,
-    suffix: Optional[str] = None,
+    suffix: str = "",
     include_df_in_prompt: Optional[bool] = True,
     number_of_head_rows: int = 5,
-) -> Tuple[BasePromptTemplate, List[PythonAstREPLTool]]:
-    if suffix is not None:
-        suffix_to_use = suffix
-        if include_df_in_prompt:
-            suffix_to_use = suffix_to_use.format(
-                df_head=str(df.head(number_of_head_rows).to_markdown())
-            )
-    elif include_df_in_prompt:
-        suffix_to_use = FUNCTIONS_WITH_DF.format(
-            df_head=str(df.head(number_of_head_rows).to_markdown())
-        )
-    else:
-        suffix_to_use = ""
-    if prefix is None:
-        prefix = PREFIX_FUNCTIONS
-    tools = [PythonAstREPLTool(locals={"df": df})]
-    system_message = SystemMessage(content=prefix + suffix_to_use)
+) -> ChatPromptTemplate:
+    if include_df_in_prompt:
+        df_head = str(df.head(number_of_head_rows).to_markdown())
+        suffix = (suffix or FUNCTIONS_WITH_DF).format(df_head=df_head)
+    prefix = prefix if prefix is not None else PREFIX_FUNCTIONS
+    system_message = SystemMessage(content=prefix + suffix)
     prompt = OpenAIFunctionsAgent.create_prompt(system_message=system_message)
-    return prompt, tools
+    return prompt
 def _get_functions_multi_prompt(
     dfs: Any,
-    prefix: Optional[str] = None,
-    suffix: Optional[str] = None,
+    *,
+    prefix: str = "",
+    suffix: str = "",
     include_df_in_prompt: Optional[bool] = True,
     number_of_head_rows: int = 5,
-) -> Tuple[BasePromptTemplate, List[PythonAstREPLTool]]:
-    if suffix is not None:
-        suffix_to_use = suffix
-        if include_df_in_prompt:
-            dfs_head = "\n\n".join(
-                [d.head(number_of_head_rows).to_markdown() for d in dfs]
-            )
-            suffix_to_use = suffix_to_use.format(
-                dfs_head=dfs_head,
-            )
-    elif include_df_in_prompt:
-        dfs_head = "\n\n".join([d.head(number_of_head_rows).to_markdown() for d in dfs])
-        suffix_to_use = FUNCTIONS_WITH_MULTI_DF.format(
-            dfs_head=dfs_head,
-        )
-    else:
-        suffix_to_use = ""
-    if prefix is None:
-        prefix = MULTI_DF_PREFIX_FUNCTIONS
-    prefix = prefix.format(num_dfs=str(len(dfs)))
-    df_locals = {}
-    for i, dataframe in enumerate(dfs):
-        df_locals[f"df{i + 1}"] = dataframe
-    tools = [PythonAstREPLTool(locals=df_locals)]
-    system_message = SystemMessage(content=prefix + suffix_to_use)
+) -> ChatPromptTemplate:
+    if include_df_in_prompt:
+        dfs_head = "\n\n".join([d.head(number_of_head_rows).to_markdown() for d in dfs])
+        suffix = (suffix or FUNCTIONS_WITH_MULTI_DF).format(dfs_head=dfs_head)
+    prefix = (prefix or MULTI_DF_PREFIX_FUNCTIONS).format(num_dfs=str(len(dfs)))
+    system_message = SystemMessage(content=prefix + suffix)
     prompt = OpenAIFunctionsAgent.create_prompt(system_message=system_message)
-    return prompt, tools
+    return prompt
-def _get_functions_prompt_and_tools(
-    df: Any,
-    prefix: Optional[str] = None,
-    suffix: Optional[str] = None,
-    input_variables: Optional[List[str]] = None,
-    include_df_in_prompt: Optional[bool] = True,
-    number_of_head_rows: int = 5,
-) -> Tuple[BasePromptTemplate, List[PythonAstREPLTool]]:
-    try:
-        import pandas as pd
-
-        pd.set_option("display.max_columns", None)
-    except ImportError:
-        raise ImportError(
-            "pandas package not found, please install with `pip install pandas`"
-        )
-    if input_variables is not None:
-        raise ValueError("`input_variables` is not supported at the moment.")
-    if include_df_in_prompt is not None and suffix is not None:
-        raise ValueError("If suffix is specified, include_df_in_prompt should not be.")
-    if isinstance(df, list):
-        for item in df:
-            if not isinstance(item, pd.DataFrame):
-                raise ValueError(f"Expected pandas object, got {type(df)}")
-        return _get_functions_multi_prompt(
-            df,
-            prefix=prefix,
-            suffix=suffix,
-            include_df_in_prompt=include_df_in_prompt,
-            number_of_head_rows=number_of_head_rows,
-        )
-    else:
-        if not isinstance(df, pd.DataFrame):
-            raise ValueError(f"Expected pandas object, got {type(df)}")
-        return _get_functions_single_prompt(
-            df,
-            prefix=prefix,
-            suffix=suffix,
-            include_df_in_prompt=include_df_in_prompt,
-            number_of_head_rows=number_of_head_rows,
-        )
+def _get_functions_prompt(df: Any, **kwargs: Any) -> ChatPromptTemplate:
+    return (
+        _get_functions_multi_prompt(df, **kwargs)
+        if isinstance(df, list)
+        else _get_functions_single_prompt(df, **kwargs)
+    )
 def create_pandas_dataframe_agent(
-    llm: BaseLanguageModel,
+    llm: LanguageModelLike,
     df: Any,
-    agent_type: AgentType = AgentType.ZERO_SHOT_REACT_DESCRIPTION,
+    agent_type: Union[
+        AgentType, Literal["openai-tools"]
+    ] = AgentType.ZERO_SHOT_REACT_DESCRIPTION,
     callback_manager: Optional[BaseCallbackManager] = None,
     prefix: Optional[str] = None,
     suffix: Optional[str] = None,
@@ -292,54 +166,131 @@ def create_pandas_dataframe_agent(
     include_df_in_prompt: Optional[bool] = True,
     number_of_head_rows: int = 5,
     extra_tools: Sequence[BaseTool] = (),
-    **kwargs: Dict[str, Any],
+    **kwargs: Any,
 ) -> AgentExecutor:
-    """Construct a pandas agent from an LLM and dataframe."""
-    agent: BaseSingleActionAgent
-    base_tools: Sequence[BaseTool]
+    """Construct a Pandas agent from an LLM and dataframe(s).
+
+    Args:
+        llm: Language model to use for the agent.
+        df: Pandas dataframe or list of Pandas dataframes.
+        agent_type: One of "openai-tools", "openai-functions", or
+            "zero-shot-react-description". Defaults to "zero-shot-react-description".
+            "openai-tools" is recommended over "openai-functions".
+        callback_manager: DEPRECATED. Pass "callbacks" key into 'agent_executor_kwargs'
+            instead to pass constructor callbacks to AgentExecutor.
+        prefix: Prompt prefix string.
+        suffix: Prompt suffix string.
+        input_variables: DEPRECATED. Input variables automatically inferred from
+            constructed prompt.
+        verbose: AgentExecutor verbosity.
+        return_intermediate_steps: Passed to AgentExecutor init.
+        max_iterations: Passed to AgentExecutor init.
+        max_execution_time: Passed to AgentExecutor init.
+        early_stopping_method: Passed to AgentExecutor init.
+        agent_executor_kwargs: Arbitrary additional AgentExecutor args.
+        include_df_in_prompt: Whether to include the first number_of_head_rows in the
+            prompt. Must be None if suffix is not None.
+        number_of_head_rows: Number of initial rows to include in prompt if
+            include_df_in_prompt is True.
+        extra_tools: Additional tools to give to agent on top of a PythonAstREPLTool.
+        **kwargs: DEPRECATED. Not used, kept for backwards compatibility.
+
+    Returns:
+        An AgentExecutor with the specified agent_type agent and access to
+        a PythonAstREPLTool with the DataFrame(s) and any user-provided extra_tools.
+
+    Example:
+        .. code-block:: python
+
+            from langchain_openai import ChatOpenAI
+            from langchain_experimental.agents import create_pandas_dataframe_agent
+            import pandas as pd
+
+            df = pd.read_csv("titanic.csv")
+            llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
+            agent_executor = create_pandas_dataframe_agent(
+                llm,
+                df,
+                agent_type="openai-tools",
+                verbose=True
+            )
+    """  # noqa: E501
+    try:
+        import pandas as pd
+    except ImportError as e:
+        raise ImportError(
+            "pandas package not found, please install with `pip install pandas`"
+        ) from e
+
+    if is_interactive_env():
+        pd.set_option("display.max_columns", None)
+
+    for _df in df if isinstance(df, list) else [df]:
+        if not isinstance(_df, pd.DataFrame):
+            raise ValueError(f"Expected pandas DataFrame, got {type(_df)}")
+
+    if input_variables:
+        kwargs = kwargs or {}
+        kwargs["input_variables"] = input_variables
+    if kwargs:
+        warnings.warn(
+            f"Received additional kwargs {kwargs} which are no longer supported."
+        )
+
+    df_locals = {}
+    if isinstance(df, list):
+        for i, dataframe in enumerate(df):
+            df_locals[f"df{i + 1}"] = dataframe
+    else:
+        df_locals["df"] = df
+    tools = [PythonAstREPLTool(locals=df_locals)] + list(extra_tools)
     if agent_type == AgentType.ZERO_SHOT_REACT_DESCRIPTION:
-        prompt, base_tools = _get_prompt_and_tools(
+        if include_df_in_prompt is not None and suffix is not None:
+            raise ValueError(
+                "If suffix is specified, include_df_in_prompt should not be."
+            )
+        prompt = _get_prompt(
             df,
             prefix=prefix,
             suffix=suffix,
-            input_variables=input_variables,
             include_df_in_prompt=include_df_in_prompt,
             number_of_head_rows=number_of_head_rows,
-            extra_tools=extra_tools,
+            tools=tools,
         )
-        tools = base_tools
-        llm_chain = LLMChain(
-            llm=llm,
-            prompt=prompt,
-            callback_manager=callback_manager,
-        )
-        tool_names = [tool.name for tool in tools]
-        agent = ZeroShotAgent(
-            llm_chain=llm_chain,
-            allowed_tools=tool_names,
-            callback_manager=callback_manager,
-            **kwargs,
-        )
+        agent: Union[BaseSingleActionAgent, BaseMultiActionAgent] = RunnableAgent(
+            runnable=create_react_agent(llm, tools, prompt),  # type: ignore
+            input_keys_arg=["input"],
+            return_keys_arg=["output"],
+        )
-    elif agent_type == AgentType.OPENAI_FUNCTIONS:
-        _prompt, base_tools = _get_functions_prompt_and_tools(
+    elif agent_type in (AgentType.OPENAI_FUNCTIONS, "openai-tools"):
+        prompt = _get_functions_prompt(
             df,
             prefix=prefix,
             suffix=suffix,
-            input_variables=input_variables,
             include_df_in_prompt=include_df_in_prompt,
             number_of_head_rows=number_of_head_rows,
         )
-        tools = list(base_tools) + list(extra_tools)
-        agent = OpenAIFunctionsAgent(
-            llm=llm,
-            prompt=_prompt,
-            tools=tools,
-            callback_manager=callback_manager,
-            **kwargs,
-        )
+        if agent_type == AgentType.OPENAI_FUNCTIONS:
+            agent = RunnableAgent(
+                runnable=create_openai_functions_agent(llm, tools, prompt),  # type: ignore
+                input_keys_arg=["input"],
+                return_keys_arg=["output"],
+            )
+        else:
+            agent = RunnableMultiActionAgent(
+                runnable=create_openai_tools_agent(llm, tools, prompt),  # type: ignore
+                input_keys_arg=["input"],
+                return_keys_arg=["output"],
+            )
     else:
-        raise ValueError(f"Agent type {agent_type} not supported at the moment.")
-    return AgentExecutor.from_agent_and_tools(
+        raise ValueError(
+            f"Agent type {agent_type} not supported at the moment. Must be one of "
+            "'openai-tools', 'openai-functions', or 'zero-shot-react-description'."
+        )
+    return AgentExecutor(
         agent=agent,
         tools=tools,
         callback_manager=callback_manager,

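The final else-branch above rejects unknown agent types before any agent is built. That guard can be exercised on its own; `validate_agent_type` is a hypothetical helper mirroring the error message in the diff:

```python
SUPPORTED_AGENT_TYPES = (
    "zero-shot-react-description",
    "openai-functions",
    "openai-tools",
)


def validate_agent_type(agent_type: str) -> str:
    # Fail fast on unsupported types, echoing the ValueError raised by
    # create_pandas_dataframe_agent's else-branch.
    if agent_type not in SUPPORTED_AGENT_TYPES:
        raise ValueError(
            f"Agent type {agent_type} not supported at the moment. Must be one of "
            "'openai-tools', 'openai-functions', or 'zero-shot-react-description'."
        )
    return agent_type
```

Failing early here keeps the error close to the caller's mistake instead of surfacing later as a missing prompt or tool.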
@@ -175,7 +175,7 @@ class OpenAIFunctionsAgent(BaseSingleActionAgent):
             content="You are a helpful AI assistant."
         ),
         extra_prompt_messages: Optional[List[BaseMessagePromptTemplate]] = None,
-    ) -> BasePromptTemplate:
+    ) -> ChatPromptTemplate:
         """Create prompt for this agent.

         Args: