Harrison/agent eval (#1620)

Co-authored-by: jerwelborn <jeremy.welborn@gmail.com>
tool-patch
Harrison Chase 1 year ago committed by GitHub
parent 8965a2f0af
commit 2d098e8869
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -679,7 +679,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
"version": "3.9.1"
}
},
"nbformat": 4,

@ -1,9 +1,85 @@
Evaluation
==============
Generative models are notoriously hard to evaluate with traditional metrics. One new way of evaluating them is using language models themselves to do the evaluation. LangChain provides some prompts/chains for assisting in this.
This section of documentation covers how we approach and think about evaluation in LangChain.
Both evaluation of internal chains/agents, but also how we would recommend people building on top of LangChain approach evaluation.
The examples here all highlight how to use language models to assist in evaluation of themselves.
The Problem
-----------
It can be really hard to evaluate LangChain chains and agents.
There are two main reasons for this:
**# 1: Lack of data**
You generally don't have a ton of data to evaluate your chains/agents over before starting a project.
This is usually because Large Language Models (the core of most chains/agents) are terrific few-shot and zero shot learners,
meaning you are almost always able to get started on a particular task (text-to-SQL, question answering, etc) without
a large dataset of examples.
This is in stark contrast to traditional machine learning where you had to first collect a bunch of datapoints
before even getting started using a model.
**# 2: Lack of metrics**
Most chains/agents are performing tasks for which there are not very good metrics to evaluate performance.
For example, one of the most common use cases is generating text of some form.
Evaluating generated text is much more complicated than evaluating a classification prediction, or a numeric prediction.
The Solution
------------
LangChain attempts to tackle both of those issues.
What we have so far are initial passes at solutions - we do not think we have a perfect solution.
So we very much welcome feedback, contributions, integrations, and thoughts on this.
Here is what we have for each problem so far:
**# 1: Lack of data**
We have started `LangChainDatasets <https://huggingface.co/LangChainDatasets>`_ a Community space on Hugging Face.
We intend this to be a collection of open source datasets for evaluating common chains and agents.
We have contributed five datasets of our own to start, but we highly intend this to be a community effort.
In order to contribute a dataset, you simply need to join the community and then you will be able to upload datasets.
We're also aiming to make it as easy as possible for people to create their own datasets.
As a first pass at this, we've added a QAGenerationChain, which given a document comes up
with question-answer pairs that can be used to evaluate question-answering tasks over that document down the line.
See `this notebook <./evaluation/qa_generation.html>`_ for an example of how to use this chain.
**# 2: Lack of metrics**
We have two solutions to the lack of metrics.
The first solution is to use no metrics, and rather just rely on looking at results by eye to get a sense for how the chain/agent is performing.
To assist in this, we have developed (and will continue to develop) `tracing <../tracing.html>`_, a UI-based visualizer of your chain and agent runs.
The second solution we recommend is to use Language Models themselves to evaluate outputs.
For this we have a few different chains and prompts aimed at tackling this issue.
The Examples
------------
We have created a bunch of examples combining the above two solutions to show how we internally evaluate chains and agents when we are developing.
In addition to the examples we've curated, we also highly welcome contributions here.
To facilitate that, we've included a `template notebook <./evaluation/benchmarking_template.html>`_ for community members to use to build their own examples.
The existing examples we have are:
`Question Answering (State of Union) <./evaluation/qa_benchmarking_sota.html>`_: An notebook showing evaluation of a question-answering task over a State-of-the-Union address.
`Question Answering (Paul Graham Essay) <./evaluation/qa_benchmarking_pg.html>`_: An notebook showing evaluation of a question-answering task over a Paul Graham essay.
`SQL Question Answering (Chinook) <./evaluation/sql_qa_benchmarking_chinook.html>`_: An notebook showing evaluation of a question-answering task over a SQL database (the Chinook database).
`Agent Vectorstore <./evaluation/agent_vectordb_sota_pg.html>`_: An notebook showing evaluation of an agent doing question answering while routing between two different vector databases.
`Agent Search + Calculator <./evaluation/agent_benchmarking.html>`_: An notebook showing evaluation of an agent doing question answering using a Search engine and a Calculator as tools.
Other Examples
------------
In addition, we also have some more generic resources for evaluation.
`Question Answering <./evaluation/question_answering.html>`_: An overview of LLMs aimed at evaluating question answering systems in general.

@ -0,0 +1,343 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "984169ca",
"metadata": {},
"source": [
"# Agent Benchmarking: Search + Calculator\n",
"\n",
"Here we go over how to benchmark performance of an agent on tasks where it has access to a calculator and a search tool.\n",
"\n",
"It is highly reccomended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "46bf9205",
"metadata": {},
"outputs": [],
"source": [
"# Comment this out if you are NOT using tracing\n",
"import os\n",
"os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
]
},
{
"cell_type": "markdown",
"id": "8a16b75d",
"metadata": {},
"source": [
"## Loading the data\n",
"First, let's load the data."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "5b2d5e98",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Found cached dataset json (/Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--agent-search-calculator-8a025c0ce5fb99d2/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "3a275586643f4ccfba1a8d54be28c351",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from langchain.evaluation.loading import load_dataset\n",
"dataset = load_dataset(\"agent-search-calculator\")"
]
},
{
"cell_type": "markdown",
"id": "4ab6a716",
"metadata": {},
"source": [
"## Setting up a chain\n",
"Now we need to load an agent capable of answering these questions."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "c18680b5",
"metadata": {},
"outputs": [],
"source": [
"from langchain.llms import OpenAI\n",
"from langchain.chains import LLMMathChain\n",
"from langchain.agents import initialize_agent, Tool, load_tools\n",
"\n",
"tools = load_tools(['serpapi', 'llm-math'], llm=OpenAI(temperature=0))\n",
"agent = initialize_agent(tools, OpenAI(temperature=0), agent=\"zero-shot-react-description\")\n"
]
},
{
"cell_type": "markdown",
"id": "68504a8f",
"metadata": {},
"source": [
"## Make a prediction\n",
"\n",
"First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows use to explore the outputs in detail, and also is a lot cheaper than running over multiple datapoints"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "cbcafc92",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'38,630,316 people live in Canada as of 2023.'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"agent.run(dataset[0]['question'])"
]
},
{
"cell_type": "markdown",
"id": "d0c16cd7",
"metadata": {},
"source": [
"## Make many predictions\n",
"Now we can make predictions"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "24b4c66e",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised APIConnectionError: Error communicating with OpenAI: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer')).\n"
]
}
],
"source": [
"predictions = []\n",
"predicted_dataset = []\n",
"error_dataset = []\n",
"for data in dataset:\n",
" new_data = {\"input\": data[\"question\"], \"answer\": data[\"answer\"]}\n",
" try:\n",
" predictions.append(agent(new_data))\n",
" predicted_dataset.append(new_data)\n",
" except Exception:\n",
" error_dataset.append(new_data)"
]
},
{
"cell_type": "markdown",
"id": "49d969fb",
"metadata": {},
"source": [
"## Evaluate performance\n",
"Now we can evaluate the predictions. The first thing we can do is look at them by eye."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "1d583f03",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'input': 'How many people live in canada as of 2023?',\n",
" 'answer': 'approximately 38,625,801',\n",
" 'output': '38,630,316 people live in Canada as of 2023.',\n",
" 'intermediate_steps': [(AgentAction(tool='Search', tool_input='Population of Canada 2023', log=' I need to find population data\\nAction: Search\\nAction Input: Population of Canada 2023'),\n",
" '38,630,316')]}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predictions[0]"
]
},
{
"cell_type": "markdown",
"id": "4783344b",
"metadata": {},
"source": [
"Next, we can use a language model to score them programatically"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "d0a9341d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.evaluation.qa import QAEvalChain"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "1612dec1",
"metadata": {},
"outputs": [],
"source": [
"llm = OpenAI(temperature=0)\n",
"eval_chain = QAEvalChain.from_llm(llm)\n",
"graded_outputs = eval_chain.evaluate(dataset, predictions, question_key=\"question\", prediction_key=\"output\")"
]
},
{
"cell_type": "markdown",
"id": "79587806",
"metadata": {},
"source": [
"We can add in the graded output to the `predictions` dict and then get a count of the grades."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "2a689df5",
"metadata": {},
"outputs": [],
"source": [
"for i, prediction in enumerate(predictions):\n",
" prediction['grade'] = graded_outputs[i]['text']"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "27b61215",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Counter({' CORRECT': 4, ' INCORRECT': 6})"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from collections import Counter\n",
"Counter([pred['grade'] for pred in predictions])"
]
},
{
"cell_type": "markdown",
"id": "12fe30f4",
"metadata": {},
"source": [
"We can also filter the datapoints to the incorrect examples and look at them."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "47c692a1",
"metadata": {},
"outputs": [],
"source": [
"incorrect = [pred for pred in predictions if pred['grade'] == \" INCORRECT\"]"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "0ef976c1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'input': \"who is dua lipa's boyfriend? what is his age raised to the .43 power?\",\n",
" 'answer': 'her boyfriend is Romain Gravas. his age raised to the .43 power is approximately 4.9373857399466665',\n",
" 'output': \"Isaac Carew, Dua Lipa's boyfriend, is 36 years old and his age raised to the .43 power is 4.6688516567750975.\",\n",
" 'intermediate_steps': [(AgentAction(tool='Search', tool_input=\"Dua Lipa's boyfriend\", log=' I need to find out who Dua Lipa\\'s boyfriend is and then calculate his age raised to the .43 power\\nAction: Search\\nAction Input: \"Dua Lipa\\'s boyfriend\"'),\n",
" 'Dua and Isaac, a model and a chef, dated on and off from 2013 to 2019. The two first split in early 2017, which is when Dua went on to date LANY ...'),\n",
" (AgentAction(tool='Search', tool_input='Isaac Carew age', log=' I need to find out Isaac\\'s age\\nAction: Search\\nAction Input: \"Isaac Carew age\"'),\n",
" '36 years'),\n",
" (AgentAction(tool='Calculator', tool_input='36^.43', log=' I need to calculate 36 raised to the .43 power\\nAction: Calculator\\nAction Input: 36^.43'),\n",
" 'Answer: 4.6688516567750975\\n')],\n",
" 'grade': ' INCORRECT'}"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"incorrect[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7710401a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,516 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "984169ca",
"metadata": {},
"source": [
"# Agent VectorDB Question Answering Benchmarking\n",
"\n",
"Here we go over how to benchmark performance on a question answering task using an agent to route between multiple vectordatabases.\n",
"\n",
"It is highly reccomended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up."
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "7b57a50f",
"metadata": {},
"outputs": [],
"source": [
"# Comment this out if you are NOT using tracing\n",
"import os\n",
"os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
]
},
{
"cell_type": "markdown",
"id": "8a16b75d",
"metadata": {},
"source": [
"## Loading the data\n",
"First, let's load the data."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5b2d5e98",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Found cached dataset json (/Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--agent-vectordb-qa-sota-pg-d3ae24016b514f92/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a7abbc20615d4c58b75a055a790d7212",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from langchain.evaluation.loading import load_dataset\n",
"dataset = load_dataset(\"agent-vectordb-qa-sota-pg\")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "61375342",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'question': 'What is the purpose of the NATO Alliance?',\n",
" 'answer': 'The purpose of the NATO Alliance is to secure peace and stability in Europe after World War 2.',\n",
" 'steps': [{'tool': 'State of Union QA System', 'tool_input': None},\n",
" {'tool': None, 'tool_input': 'What is the purpose of the NATO Alliance?'}]}"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset[0]"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "02500304",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'question': 'What is the purpose of YC?',\n",
" 'answer': 'The purpose of YC is to cause startups to be founded that would not otherwise have existed.',\n",
" 'steps': [{'tool': 'Paul Graham QA System', 'tool_input': None},\n",
" {'tool': None, 'tool_input': 'What is the purpose of YC?'}]}"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset[-1]"
]
},
{
"cell_type": "markdown",
"id": "4ab6a716",
"metadata": {},
"source": [
"## Setting up a chain\n",
"Now we need to create some pipelines for doing question answering. Step one in that is creating indexes over the data in question."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c18680b5",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import TextLoader\n",
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "7f0de2b3",
"metadata": {},
"outputs": [],
"source": [
"from langchain.indexes import VectorstoreIndexCreator"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ef84ff99",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running Chroma using direct local API.\n",
"Using DuckDB in-memory for database. Data will be transient.\n"
]
}
],
"source": [
"vectorstore_sota = VectorstoreIndexCreator(vectorstore_kwargs={\"collection_name\":\"sota\"}).from_loaders([loader]).vectorstore"
]
},
{
"cell_type": "markdown",
"id": "f0b5d8f6",
"metadata": {},
"source": [
"Now we can create a question answering chain."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "8843cb0c",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains import VectorDBQA\n",
"from langchain.llms import OpenAI"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "573719a0",
"metadata": {},
"outputs": [],
"source": [
"chain_sota = VectorDBQA.from_chain_type(llm=OpenAI(temperature=0), chain_type=\"stuff\", vectorstore=vectorstore_sota, input_key=\"question\")"
]
},
{
"cell_type": "markdown",
"id": "e48b03d8",
"metadata": {},
"source": [
"Now we do the same for the Paul Graham data."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "c2dbb014",
"metadata": {},
"outputs": [],
"source": [
"loader = TextLoader(\"../../modules/paul_graham_essay.txt\")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "98d16f08",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running Chroma using direct local API.\n",
"Using DuckDB in-memory for database. Data will be transient.\n"
]
}
],
"source": [
"vectorstore_pg = VectorstoreIndexCreator(vectorstore_kwargs={\"collection_name\":\"paul_graham\"}).from_loaders([loader]).vectorstore"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "ec0aab02",
"metadata": {},
"outputs": [],
"source": [
"chain_pg = VectorDBQA.from_chain_type(llm=OpenAI(temperature=0), chain_type=\"stuff\", vectorstore=vectorstore_pg, input_key=\"question\")\n"
]
},
{
"cell_type": "markdown",
"id": "76b5f8fb",
"metadata": {},
"source": [
"We can now set up an agent to route between them."
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "ade1aafa",
"metadata": {},
"outputs": [],
"source": [
"from langchain.agents import initialize_agent, Tool\n",
"tools = [\n",
" Tool(\n",
" name = \"State of Union QA System\",\n",
" func=chain_sota.run,\n",
" description=\"useful for when you need to answer questions about the most recent state of the union address. Input should be a fully formed question.\"\n",
" ),\n",
" Tool(\n",
" name = \"Paul Graham System\",\n",
" func=chain_pg.run,\n",
" description=\"useful for when you need to answer questions about Paul Graham. Input should be a fully formed question.\"\n",
" ),\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "104853f8",
"metadata": {},
"outputs": [],
"source": [
"agent = initialize_agent(tools, OpenAI(temperature=0), agent=\"zero-shot-react-description\", max_iterations=3)"
]
},
{
"cell_type": "markdown",
"id": "7f036641",
"metadata": {},
"source": [
"## Make a prediction\n",
"\n",
"First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows use to explore the outputs in detail, and also is a lot cheaper than running over multiple datapoints"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "4664e79f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'The purpose of the NATO Alliance is to promote peace and security in the North Atlantic region by providing a collective defense against potential threats.'"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"agent.run(dataset[0]['question'])"
]
},
{
"cell_type": "markdown",
"id": "d0c16cd7",
"metadata": {},
"source": [
"## Make many predictions\n",
"Now we can make predictions"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "24b4c66e",
"metadata": {},
"outputs": [],
"source": [
"predictions = []\n",
"predicted_dataset = []\n",
"error_dataset = []\n",
"for data in dataset:\n",
" new_data = {\"input\": data[\"question\"], \"answer\": data[\"answer\"]}\n",
" try:\n",
" predictions.append(agent(new_data))\n",
" predicted_dataset.append(new_data)\n",
" except Exception:\n",
" error_dataset.append(new_data)"
]
},
{
"cell_type": "markdown",
"id": "49d969fb",
"metadata": {},
"source": [
"## Evaluate performance\n",
"Now we can evaluate the predictions. The first thing we can do is look at them by eye."
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "1d583f03",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'input': 'What is the purpose of the NATO Alliance?',\n",
" 'answer': 'The purpose of the NATO Alliance is to secure peace and stability in Europe after World War 2.',\n",
" 'output': 'The purpose of the NATO Alliance is to promote peace and security in the North Atlantic region by providing a collective defense against potential threats.'}"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predictions[0]"
]
},
{
"cell_type": "markdown",
"id": "4783344b",
"metadata": {},
"source": [
"Next, we can use a language model to score them programatically"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "d0a9341d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.evaluation.qa import QAEvalChain"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "1612dec1",
"metadata": {},
"outputs": [],
"source": [
"llm = OpenAI(temperature=0)\n",
"eval_chain = QAEvalChain.from_llm(llm)\n",
"graded_outputs = eval_chain.evaluate(predicted_dataset, predictions, question_key=\"input\", prediction_key=\"output\")"
]
},
{
"cell_type": "markdown",
"id": "79587806",
"metadata": {},
"source": [
"We can add in the graded output to the `predictions` dict and then get a count of the grades."
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "2a689df5",
"metadata": {},
"outputs": [],
"source": [
"for i, prediction in enumerate(predictions):\n",
" prediction['grade'] = graded_outputs[i]['text']"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "27b61215",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Counter({' CORRECT': 19, ' INCORRECT': 14})"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from collections import Counter\n",
"Counter([pred['grade'] for pred in predictions])"
]
},
{
"cell_type": "markdown",
"id": "12fe30f4",
"metadata": {},
"source": [
"We can also filter the datapoints to the incorrect examples and look at them."
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "47c692a1",
"metadata": {},
"outputs": [],
"source": [
"incorrect = [pred for pred in predictions if pred['grade'] == \" INCORRECT\"]"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "0ef976c1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'input': 'What is the purpose of the Bipartisan Innovation Act mentioned in the text?',\n",
" 'answer': 'The Bipartisan Innovation Act will make record investments in emerging technologies and American manufacturing to level the playing field with China and other competitors.',\n",
" 'output': 'The purpose of the Bipartisan Innovation Act is to promote innovation and entrepreneurship in the United States by providing tax incentives and other support for startups and small businesses.',\n",
" 'grade': ' INCORRECT'}"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"incorrect[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7710401a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,160 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a175c650",
"metadata": {},
"source": [
"# Benchmarking Template\n",
"\n",
"This is an example notebook that can be used to create a benchmarking notebook for a task of your choice. Evaluation is really hard, and so we greatly welcome any contributions that can make it easier for people to experiment"
]
},
{
"cell_type": "markdown",
"id": "984169ca",
"metadata": {},
"source": [
"It is highly reccomended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up."
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "9fe4d1b4",
"metadata": {},
"outputs": [],
"source": [
"# Comment this out if you are NOT using tracing\n",
"import os\n",
"os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
]
},
{
"cell_type": "markdown",
"id": "0f66405e",
"metadata": {},
"source": [
"## Loading the data\n",
"\n",
"First, let's load the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "79402a8f",
"metadata": {},
"outputs": [],
"source": [
"# This notebook should so how to load the dataset from LangChainDatasets on Hugging Face\n",
"\n",
"# Please upload your dataset to https://huggingface.co/LangChainDatasets\n",
"\n",
"# The value passed into `load_dataset` should NOT have the `LangChainDatasets/` prefix\n",
"from langchain.evaluation.loading import load_dataset\n",
"dataset = load_dataset(\"TODO\")"
]
},
{
"cell_type": "markdown",
"id": "8a16b75d",
"metadata": {},
"source": [
"## Setting up a chain\n",
"\n",
"This next section should have an example of setting up a chain that can be run on this dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a2661ce0",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "6c0062e7",
"metadata": {},
"source": [
"## Make a prediction\n",
"\n",
"First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows use to explore the outputs in detail, and also is a lot cheaper than running over multiple datapoints"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "d28c5e7d",
"metadata": {},
"outputs": [],
"source": [
"# Example of running the chain on a single datapoint (`dataset[0]`) goes here"
]
},
{
"cell_type": "markdown",
"id": "d0c16cd7",
"metadata": {},
"source": [
"## Make many predictions\n",
"Now we can make predictions."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "24b4c66e",
"metadata": {},
"outputs": [],
"source": [
"# Example of running the chain on many predictions goes here\n",
"\n",
"# Sometimes its as simple as `chain.apply(dataset)`\n",
"\n",
"# Othertimes you may want to write a for loop to catch errors"
]
},
{
"cell_type": "markdown",
"id": "4783344b",
"metadata": {},
"source": [
"## Evaluate performance\n",
"\n",
"Any guide to evaluating performance in a more systematic manner goes here."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7710401a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,374 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "984169ca",
"metadata": {},
"source": [
"# Question Answering Benchmarking: Paul Graham Essay\n",
"\n",
"Here we go over how to benchmark performance on a question answering task over a Paul Graham essay.\n",
"\n",
"It is highly reccomended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "3bd13ab7",
"metadata": {},
"outputs": [],
"source": [
"# Comment this out if you are NOT using tracing\n",
"import os\n",
"os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
]
},
{
"cell_type": "markdown",
"id": "8a16b75d",
"metadata": {},
"source": [
"## Loading the data\n",
"First, let's load the data."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5b2d5e98",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Found cached dataset json (/Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--question-answering-paul-graham-76e8f711e038d742/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "63f434a42cba4739919333c75324acc9",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from langchain.evaluation.loading import load_dataset\n",
"dataset = load_dataset(\"question-answering-paul-graham\")"
]
},
{
"cell_type": "markdown",
"id": "4ab6a716",
"metadata": {},
"source": [
"## Setting up a chain\n",
"Now we need to create some pipelines for doing question answering. Step one in that is creating an index over the data in question."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "c18680b5",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import TextLoader\n",
"loader = TextLoader(\"../../modules/paul_graham_essay.txt\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "7f0de2b3",
"metadata": {},
"outputs": [],
"source": [
"from langchain.indexes import VectorstoreIndexCreator"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "ef84ff99",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running Chroma using direct local API.\n",
"Using DuckDB in-memory for database. Data will be transient.\n"
]
}
],
"source": [
"vectorstore = VectorstoreIndexCreator().from_loaders([loader]).vectorstore"
]
},
{
"cell_type": "markdown",
"id": "f0b5d8f6",
"metadata": {},
"source": [
"Now we can create a question answering chain."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "8843cb0c",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains import VectorDBQA\n",
"from langchain.llms import OpenAI"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "573719a0",
"metadata": {},
"outputs": [],
"source": [
"chain = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type=\"stuff\", vectorstore=vectorstore, input_key=\"question\")"
]
},
{
"cell_type": "markdown",
"id": "53b5aa23",
"metadata": {},
"source": [
"## Make a prediction\n",
"\n",
"First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows use to explore the outputs in detail, and also is a lot cheaper than running over multiple datapoints"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "3f81d951",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'question': 'What were the two main things the author worked on before college?',\n",
" 'answer': 'The two main things the author worked on before college were writing and programming.',\n",
" 'result': ' Writing and programming.'}"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain(dataset[0])"
]
},
{
"cell_type": "markdown",
"id": "d0c16cd7",
"metadata": {},
"source": [
"## Make many predictions\n",
"Now we can make predictions"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "24b4c66e",
"metadata": {},
"outputs": [],
"source": [
"predictions = chain.apply(dataset)"
]
},
{
"cell_type": "markdown",
"id": "49d969fb",
"metadata": {},
"source": [
"## Evaluate performance\n",
"Now we can evaluate the predictions. The first thing we can do is look at them by eye."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "1d583f03",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'question': 'What were the two main things the author worked on before college?',\n",
" 'answer': 'The two main things the author worked on before college were writing and programming.',\n",
" 'result': ' Writing and programming.'}"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predictions[0]"
]
},
{
"cell_type": "markdown",
"id": "4783344b",
"metadata": {},
"source": [
"Next, we can use a language model to score them programatically"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "d0a9341d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.evaluation.qa import QAEvalChain"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "1612dec1",
"metadata": {},
"outputs": [],
"source": [
"llm = OpenAI(temperature=0)\n",
"eval_chain = QAEvalChain.from_llm(llm)\n",
"graded_outputs = eval_chain.evaluate(dataset, predictions, question_key=\"question\", prediction_key=\"result\")"
]
},
{
"cell_type": "markdown",
"id": "79587806",
"metadata": {},
"source": [
"We can add in the graded output to the `predictions` dict and then get a count of the grades."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "2a689df5",
"metadata": {},
"outputs": [],
"source": [
"for i, prediction in enumerate(predictions):\n",
" prediction['grade'] = graded_outputs[i]['text']"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "27b61215",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Counter({' CORRECT': 12, ' INCORRECT': 10})"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from collections import Counter\n",
"Counter([pred['grade'] for pred in predictions])"
]
},
{
"cell_type": "markdown",
"id": "12fe30f4",
"metadata": {},
"source": [
"We can also filter the datapoints to the incorrect examples and look at them."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "47c692a1",
"metadata": {},
"outputs": [],
"source": [
"incorrect = [pred for pred in predictions if pred['grade'] == \" INCORRECT\"]"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "0ef976c1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'question': 'What did the author write their dissertation on?',\n",
" 'answer': 'The author wrote their dissertation on applications of continuations.',\n",
" 'result': ' The author does not mention what their dissertation was on, so it is not known.',\n",
" 'grade': ' INCORRECT'}"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"incorrect[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7710401a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,451 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "984169ca",
"metadata": {},
"source": [
"# Question Answering Benchmarking: State of the Union Address\n",
"\n",
"Here we go over how to benchmark performance on a question answering task over a state of the union address.\n",
"\n",
"It is highly reccomended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "f127fb04",
"metadata": {},
"outputs": [],
"source": [
"# Comment this out if you are NOT using tracing\n",
"import os\n",
"os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
]
},
{
"cell_type": "markdown",
"id": "8a16b75d",
"metadata": {},
"source": [
"## Loading the data\n",
"First, let's load the data."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5b2d5e98",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "5d66c27b9b4744989843142f08f5c1b4",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading readme: 0%| | 0.00/21.0 [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading and preparing dataset json/LangChainDatasets--question-answering-state-of-the-union to /Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--question-answering-state-of-the-union-a7e5a3b2db4f440d/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "9e21e2ab96a0491ea5e252720d7dfa26",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading data files: 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "c883830e068c42d39da8406ab38574c4",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading data: 0%| | 0.00/2.90k [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "3b085715e52e49948d2a59d27e004eba",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Extracting data files: 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Generating train split: 0 examples [00:00, ? examples/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset json downloaded and prepared to /Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--question-answering-state-of-the-union-a7e5a3b2db4f440d/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "ee900d35e27d4843b42b31811b43212b",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from langchain.evaluation.loading import load_dataset\n",
"dataset = load_dataset(\"question-answering-state-of-the-union\")"
]
},
{
"cell_type": "markdown",
"id": "4ab6a716",
"metadata": {},
"source": [
"## Setting up a chain\n",
"Now we need to create some pipelines for doing question answering. Step one in that is creating an index over the data in question."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c18680b5",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import TextLoader\n",
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "7f0de2b3",
"metadata": {},
"outputs": [],
"source": [
"from langchain.indexes import VectorstoreIndexCreator"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ef84ff99",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running Chroma using direct local API.\n",
"Using DuckDB in-memory for database. Data will be transient.\n"
]
}
],
"source": [
"vectorstore = VectorstoreIndexCreator().from_loaders([loader]).vectorstore"
]
},
{
"cell_type": "markdown",
"id": "f0b5d8f6",
"metadata": {},
"source": [
"Now we can create a question answering chain."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "8843cb0c",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains import VectorDBQA\n",
"from langchain.llms import OpenAI"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "573719a0",
"metadata": {},
"outputs": [],
"source": [
"chain = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type=\"stuff\", vectorstore=vectorstore, input_key=\"question\")"
]
},
{
"cell_type": "markdown",
"id": "37d669e9",
"metadata": {},
"source": [
"## Make a prediction\n",
"\n",
"First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows use to explore the outputs in detail, and also is a lot cheaper than running over multiple datapoints"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "3089e409",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'question': 'What is the purpose of the NATO Alliance?',\n",
" 'answer': 'The purpose of the NATO Alliance is to secure peace and stability in Europe after World War 2.',\n",
" 'result': ' The NATO Alliance was created to secure peace and stability in Europe after World War 2.'}"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain(dataset[0])"
]
},
{
"cell_type": "markdown",
"id": "d0c16cd7",
"metadata": {},
"source": [
"## Make many predictions\n",
"Now we can make predictions"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "24b4c66e",
"metadata": {},
"outputs": [],
"source": [
"predictions = chain.apply(dataset)"
]
},
{
"cell_type": "markdown",
"id": "49d969fb",
"metadata": {},
"source": [
"## Evaluate performance\n",
"Now we can evaluate the predictions. The first thing we can do is look at them by eye."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "1d583f03",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'question': 'What is the purpose of the NATO Alliance?',\n",
" 'answer': 'The purpose of the NATO Alliance is to secure peace and stability in Europe after World War 2.',\n",
" 'result': ' The purpose of the NATO Alliance is to secure peace and stability in Europe after World War 2.'}"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predictions[0]"
]
},
{
"cell_type": "markdown",
"id": "4783344b",
"metadata": {},
"source": [
"Next, we can use a language model to score them programatically"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "d0a9341d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.evaluation.qa import QAEvalChain"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "1612dec1",
"metadata": {},
"outputs": [],
"source": [
"llm = OpenAI(temperature=0)\n",
"eval_chain = QAEvalChain.from_llm(llm)\n",
"graded_outputs = eval_chain.evaluate(dataset, predictions, question_key=\"question\", prediction_key=\"result\")"
]
},
{
"cell_type": "markdown",
"id": "79587806",
"metadata": {},
"source": [
"We can add in the graded output to the `predictions` dict and then get a count of the grades."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "2a689df5",
"metadata": {},
"outputs": [],
"source": [
"for i, prediction in enumerate(predictions):\n",
" prediction['grade'] = graded_outputs[i]['text']"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "27b61215",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Counter({' CORRECT': 7, ' INCORRECT': 4})"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from collections import Counter\n",
"Counter([pred['grade'] for pred in predictions])"
]
},
{
"cell_type": "markdown",
"id": "12fe30f4",
"metadata": {},
"source": [
"We can also filter the datapoints to the incorrect examples and look at them."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "47c692a1",
"metadata": {},
"outputs": [],
"source": [
"incorrect = [pred for pred in predictions if pred['grade'] == \" INCORRECT\"]"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "0ef976c1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'question': 'What is the U.S. Department of Justice doing to combat the crimes of Russian oligarchs?',\n",
" 'answer': 'The U.S. Department of Justice is assembling a dedicated task force to go after the crimes of Russian oligarchs.',\n",
" 'result': ' The U.S. Department of Justice is assembling a dedicated task force to go after the crimes of Russian oligarchs and is naming a chief prosecutor for pandemic fraud.',\n",
" 'grade': ' INCORRECT'}"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"incorrect[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7710401a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,117 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "ee2a3a21",
"metadata": {},
"source": [
"# QA Generation\n",
"This notebook shows how to use the `QAGenerationChain` to come up with question-answer pairs over a specific document.\n",
"This is important because often times you may not have data to evaluate your question-answer system over, so this is a cheap and lightweight way to generate it!"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "33d3f0b4",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import TextLoader"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "2029a29c",
"metadata": {},
"outputs": [],
"source": [
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "87edb84c",
"metadata": {},
"outputs": [],
"source": [
"doc = loader.load()[0]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "04125b6d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.chains import QAGenerationChain\n",
"chain = QAGenerationChain.from_llm(ChatOpenAI(temperature = 0))"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "4f1593e4",
"metadata": {},
"outputs": [],
"source": [
"qa = chain.run(doc.page_content)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "ee831f92",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'question': 'What is the U.S. Department of Justice doing to combat the crimes of Russian oligarchs?',\n",
" 'answer': 'The U.S. Department of Justice is assembling a dedicated task force to go after the crimes of Russian oligarchs.'}"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"qa[1]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7028754e",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -191,7 +191,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "782ae8c8",
"metadata": {},
@ -316,7 +315,7 @@
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@ -330,7 +329,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7 (default, Sep 16 2021, 08:50:36) \n[Clang 10.0.0 ]"
"version": "3.9.1"
},
"vscode": {
"interpreter": {

@ -0,0 +1,423 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "984169ca",
"metadata": {},
"source": [
"# SQL Question Answering Benchmarking: Chinook\n",
"\n",
"Here we go over how to benchmark performance on a question answering task over a SQL database.\n",
"\n",
"It is highly reccomended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up."
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "44874486",
"metadata": {},
"outputs": [],
"source": [
"# Comment this out if you are NOT using tracing\n",
"import os\n",
"os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
]
},
{
"cell_type": "markdown",
"id": "0f66405e",
"metadata": {},
"source": [
"## Loading the data\n",
"\n",
"First, let's load the data."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "0df1393f",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "b220d07ee5d14909bc842b4545cdc0de",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading readme: 0%| | 0.00/21.0 [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading and preparing dataset json/LangChainDatasets--sql-qa-chinook to /Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--sql-qa-chinook-7528565d2d992b47/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "e89e3c8ef76f49889c4b39c624828c71",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading data files: 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a8421df6c26045e8978c7086cb418222",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading data: 0%| | 0.00/1.44k [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "d1fb6becc3324a85bf039a53caf30924",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Extracting data files: 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Generating train split: 0 examples [00:00, ? examples/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset json downloaded and prepared to /Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--sql-qa-chinook-7528565d2d992b47/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "9d68ad1b3e4a4bd79f92597aac4d3cc9",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from langchain.evaluation.loading import load_dataset\n",
"dataset = load_dataset(\"sql-qa-chinook\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "ab44d504",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'question': 'How many employees are there?', 'answer': '8'}"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset[0]"
]
},
{
"cell_type": "markdown",
"id": "8a16b75d",
"metadata": {},
"source": [
"## Setting up a chain\n",
"This uses the example Chinook database.\n",
"To set it up follow the instructions on https://database.guide/2-sample-databases-sqlite/, placing the `.db` file in a notebooks folder at the root of this repository.\n",
"\n",
"Note that here we load a simple chain. If you want to experiment with more complex chains, or an agent, just create the `chain` object in a different way."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5b2d5e98",
"metadata": {},
"outputs": [],
"source": [
"from langchain import OpenAI, SQLDatabase, SQLDatabaseChain"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "33cdcbfc",
"metadata": {},
"outputs": [],
"source": [
"db = SQLDatabase.from_uri(\"sqlite:///../../../notebooks/Chinook.db\")\n",
"llm = OpenAI(temperature=0)"
]
},
{
"cell_type": "markdown",
"id": "f0b5d8f6",
"metadata": {},
"source": [
"Now we can create a SQL database chain."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "8843cb0c",
"metadata": {},
"outputs": [],
"source": [
"chain = SQLDatabaseChain(llm=llm, database=db, input_key=\"question\")"
]
},
{
"cell_type": "markdown",
"id": "6c0062e7",
"metadata": {},
"source": [
"## Make a prediction\n",
"\n",
"First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows use to explore the outputs in detail, and also is a lot cheaper than running over multiple datapoints"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "d28c5e7d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'question': 'How many employees are there?',\n",
" 'answer': '8',\n",
" 'result': ' There are 8 employees.'}"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain(dataset[0])"
]
},
{
"cell_type": "markdown",
"id": "d0c16cd7",
"metadata": {},
"source": [
"## Make many predictions\n",
"Now we can make predictions. Note that we add a try-except because this chain can sometimes error (if SQL is written incorrectly, etc)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "24b4c66e",
"metadata": {},
"outputs": [],
"source": [
"predictions = []\n",
"predicted_dataset = []\n",
"error_dataset = []\n",
"for data in dataset:\n",
" try:\n",
" predictions.append(chain(data))\n",
" predicted_dataset.append(data)\n",
" except:\n",
" error_dataset.append(data)"
]
},
{
"cell_type": "markdown",
"id": "4783344b",
"metadata": {},
"source": [
"## Evaluate performance\n",
"Now we can evaluate the predictions. We can use a language model to score them programatically"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "d0a9341d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.evaluation.qa import QAEvalChain"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "1612dec1",
"metadata": {},
"outputs": [],
"source": [
"llm = OpenAI(temperature=0)\n",
"eval_chain = QAEvalChain.from_llm(llm)\n",
"graded_outputs = eval_chain.evaluate(predicted_dataset, predictions, question_key=\"question\", prediction_key=\"result\")"
]
},
{
"cell_type": "markdown",
"id": "79587806",
"metadata": {},
"source": [
"We can add in the graded output to the `predictions` dict and then get a count of the grades."
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "2a689df5",
"metadata": {},
"outputs": [],
"source": [
"for i, prediction in enumerate(predictions):\n",
" prediction['grade'] = graded_outputs[i]['text']"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "27b61215",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Counter({' CORRECT': 3, ' INCORRECT': 4})"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from collections import Counter\n",
"Counter([pred['grade'] for pred in predictions])"
]
},
{
"cell_type": "markdown",
"id": "12fe30f4",
"metadata": {},
"source": [
"We can also filter the datapoints to the incorrect examples and look at them."
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "47c692a1",
"metadata": {},
"outputs": [],
"source": [
"incorrect = [pred for pred in predictions if pred['grade'] == \" INCORRECT\"]"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "0ef976c1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'question': 'How many employees are also customers?',\n",
" 'answer': 'None',\n",
" 'result': ' 59 employees are also customers.',\n",
" 'grade': ' INCORRECT'}"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"incorrect[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7710401a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -16,6 +16,7 @@ from langchain.chains.loading import load_chain
from langchain.chains.mapreduce import MapReduceChain
from langchain.chains.moderation import OpenAIModerationChain
from langchain.chains.pal.base import PALChain
from langchain.chains.qa_generation.base import QAGenerationChain
from langchain.chains.qa_with_sources.base import QAWithSourcesChain
from langchain.chains.qa_with_sources.vector_db import VectorDBQAWithSourcesChain
from langchain.chains.sequential import SequentialChain, SimpleSequentialChain
@ -52,4 +53,5 @@ __all__ = [
"ChatVectorDBChain",
"GraphQAChain",
"ConstitutionalChain",
"QAGenerationChain",
]

@ -0,0 +1,55 @@
from __future__ import annotations
import json
from typing import Any, Dict, List, Optional
from pydantic import Field
from langchain.chains.base import Chain
from langchain.chains.llm import LLMChain
from langchain.chains.qa_generation.prompt import PROMPT_SELECTOR
from langchain.prompts.base import BasePromptTemplate
from langchain.schema import BaseLanguageModel
from langchain.text_splitter import RecursiveCharacterTextSplitter, TextSplitter
class QAGenerationChain(Chain):
llm_chain: LLMChain
text_splitter: TextSplitter = Field(
default=RecursiveCharacterTextSplitter(chunk_overlap=500)
)
input_key: str = "text"
output_key: str = "questions"
k: Optional[int] = None
@classmethod
def from_llm(
cls,
llm: BaseLanguageModel,
prompt: Optional[BasePromptTemplate] = None,
**kwargs: Any,
) -> QAGenerationChain:
_prompt = prompt or PROMPT_SELECTOR.get_prompt(llm)
chain = LLMChain(llm=llm, prompt=_prompt)
return cls(llm_chain=chain, **kwargs)
@property
def _chain_type(self) -> str:
raise NotImplementedError
@property
def input_keys(self) -> List[str]:
return [self.input_key]
@property
def output_keys(self) -> List[str]:
return [self.output_key]
def _call(self, inputs: Dict[str, str]) -> Dict[str, Any]:
docs = self.text_splitter.create_documents([inputs[self.input_key]])
results = self.llm_chain.generate([{"text": d.page_content} for d in docs])
qa = [json.loads(res[0].text) for res in results.generations]
return {self.output_key: qa}
async def _acall(self, inputs: Dict[str, str]) -> Dict[str, str]:
raise NotImplementedError

@ -0,0 +1,50 @@
# flake8: noqa
from langchain.chains.prompt_selector import ConditionalPromptSelector, is_chat_model
from langchain.prompts.chat import (
ChatPromptTemplate,
HumanMessagePromptTemplate,
SystemMessagePromptTemplate,
)
from langchain.prompts.prompt import PromptTemplate
templ1 = """You are a smart assistant designed to help high school teachers come up with reading comprehension questions.
Given a piece of text, you must come up with a question and answer pair that can be used to test a student's reading comprehension abilities.
When coming up with this question/answer pair, you must respond in the following format:
```
{{
"question": "$YOUR_QUESTION_HERE",
"answer": "$THE_ANSWER_HERE"
}}
```
Everything between the ``` must be valid json.
"""
templ2 = """Please come up with a question/answer pair, in the specified JSON format, for the following text:
----------------
{text}"""
CHAT_PROMPT = ChatPromptTemplate.from_messages(
[
SystemMessagePromptTemplate.from_template(templ1),
HumanMessagePromptTemplate.from_template(templ2),
]
)
templ = """You are a smart assistant designed to help high school teachers come up with reading comprehension questions.
Given a piece of text, you must come up with a question and answer pair that can be used to test a student's reading comprehension abilities.
When coming up with this question/answer pair, you must respond in the following format:
```
{{
"question": "$YOUR_QUESTION_HERE",
"answer": "$THE_ANSWER_HERE"
}}
```
Everything between the ``` must be valid json.
Please come up with a question/answer pair, in the specified JSON format, for the following text:
----------------
{text}"""
PROMPT = PromptTemplate.from_template(templ)
PROMPT_SELECTOR = ConditionalPromptSelector(
default_prompt=PROMPT, conditionals=[(is_chat_model, CHAT_PROMPT)]
)

@ -0,0 +1,8 @@
from typing import Dict, List
def load_dataset(uri: str) -> List[Dict]:
from datasets import load_dataset
dataset = load_dataset(f"LangChainDatasets/{uri}")
return [d for d in dataset["train"]]

@ -50,8 +50,8 @@ class VectorstoreIndexCreator(BaseModel):
"""Logic for creating indexes."""
vectorstore_cls: Type[VectorStore] = Chroma
text_splitter: TextSplitter = Field(default_factory=_get_default_text_splitter)
embedding: Embeddings = Field(default_factory=OpenAIEmbeddings)
text_splitter: TextSplitter = Field(default_factory=_get_default_text_splitter)
vectorstore_kwargs: dict = Field(default_factory=dict)
class Config:

Loading…
Cancel
Save