forked from Archives/langchain
Harrison/agent eval (#1620)
Co-authored-by: jerwelborn <jeremy.welborn@gmail.com>
This commit is contained in:
parent
8965a2f0af
commit
2d098e8869
@ -679,7 +679,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.9"
|
||||
"version": "3.9.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
@ -1,9 +1,85 @@
|
||||
Evaluation
|
||||
==============
|
||||
|
||||
Generative models are notoriously hard to evaluate with traditional metrics. One new way of evaluating them is using language models themselves to do the evaluation. LangChain provides some prompts/chains for assisting in this.
|
||||
This section of documentation covers how we approach and think about evaluation in LangChain.
|
||||
Both evaluation of internal chains/agents, but also how we would recommend people building on top of LangChain approach evaluation.
|
||||
|
||||
The examples here all highlight how to use language models to assist in evaluation of themselves.
|
||||
The Problem
|
||||
-----------
|
||||
|
||||
It can be really hard to evaluate LangChain chains and agents.
|
||||
There are two main reasons for this:
|
||||
|
||||
**# 1: Lack of data**
|
||||
|
||||
You generally don't have a ton of data to evaluate your chains/agents over before starting a project.
|
||||
This is usually because Large Language Models (the core of most chains/agents) are terrific few-shot and zero shot learners,
|
||||
meaning you are almost always able to get started on a particular task (text-to-SQL, question answering, etc) without
|
||||
a large dataset of examples.
|
||||
This is in stark contrast to traditional machine learning where you had to first collect a bunch of datapoints
|
||||
before even getting started using a model.
|
||||
|
||||
**# 2: Lack of metrics**
|
||||
|
||||
Most chains/agents are performing tasks for which there are not very good metrics to evaluate performance.
|
||||
For example, one of the most common use cases is generating text of some form.
|
||||
Evaluating generated text is much more complicated than evaluating a classification prediction, or a numeric prediction.
|
||||
|
||||
The Solution
|
||||
------------
|
||||
|
||||
LangChain attempts to tackle both of those issues.
|
||||
What we have so far are initial passes at solutions - we do not think we have a perfect solution.
|
||||
So we very much welcome feedback, contributions, integrations, and thoughts on this.
|
||||
|
||||
Here is what we have for each problem so far:
|
||||
|
||||
**# 1: Lack of data**
|
||||
|
||||
We have started `LangChainDatasets <https://huggingface.co/LangChainDatasets>`_ a Community space on Hugging Face.
|
||||
We intend this to be a collection of open source datasets for evaluating common chains and agents.
|
||||
We have contributed five datasets of our own to start, but we highly intend this to be a community effort.
|
||||
In order to contribute a dataset, you simply need to join the community and then you will be able to upload datasets.
|
||||
|
||||
We're also aiming to make it as easy as possible for people to create their own datasets.
|
||||
As a first pass at this, we've added a QAGenerationChain, which given a document comes up
|
||||
with question-answer pairs that can be used to evaluate question-answering tasks over that document down the line.
|
||||
See `this notebook <./evaluation/qa_generation.html>`_ for an example of how to use this chain.
|
||||
|
||||
**# 2: Lack of metrics**
|
||||
|
||||
We have two solutions to the lack of metrics.
|
||||
|
||||
The first solution is to use no metrics, and rather just rely on looking at results by eye to get a sense for how the chain/agent is performing.
|
||||
To assist in this, we have developed (and will continue to develop) `tracing <../tracing.html>`_, a UI-based visualizer of your chain and agent runs.
|
||||
|
||||
The second solution we recommend is to use Language Models themselves to evaluate outputs.
|
||||
For this we have a few different chains and prompts aimed at tackling this issue.
|
||||
|
||||
The Examples
|
||||
------------
|
||||
|
||||
We have created a bunch of examples combining the above two solutions to show how we internally evaluate chains and agents when we are developing.
|
||||
In addition to the examples we've curated, we also highly welcome contributions here.
|
||||
To facilitate that, we've included a `template notebook <./evaluation/benchmarking_template.html>`_ for community members to use to build their own examples.
|
||||
|
||||
The existing examples we have are:
|
||||
|
||||
`Question Answering (State of Union) <./evaluation/qa_benchmarking_sota.html>`_: An notebook showing evaluation of a question-answering task over a State-of-the-Union address.
|
||||
|
||||
`Question Answering (Paul Graham Essay) <./evaluation/qa_benchmarking_pg.html>`_: An notebook showing evaluation of a question-answering task over a Paul Graham essay.
|
||||
|
||||
`SQL Question Answering (Chinook) <./evaluation/sql_qa_benchmarking_chinook.html>`_: An notebook showing evaluation of a question-answering task over a SQL database (the Chinook database).
|
||||
|
||||
`Agent Vectorstore <./evaluation/agent_vectordb_sota_pg.html>`_: An notebook showing evaluation of an agent doing question answering while routing between two different vector databases.
|
||||
|
||||
`Agent Search + Calculator <./evaluation/agent_benchmarking.html>`_: An notebook showing evaluation of an agent doing question answering using a Search engine and a Calculator as tools.
|
||||
|
||||
|
||||
Other Examples
|
||||
------------
|
||||
|
||||
In addition, we also have some more generic resources for evaluation.
|
||||
|
||||
`Question Answering <./evaluation/question_answering.html>`_: An overview of LLMs aimed at evaluating question answering systems in general.
|
||||
|
||||
|
343
docs/use_cases/evaluation/agent_benchmarking.ipynb
Normal file
343
docs/use_cases/evaluation/agent_benchmarking.ipynb
Normal file
@ -0,0 +1,343 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "984169ca",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Agent Benchmarking: Search + Calculator\n",
|
||||
"\n",
|
||||
"Here we go over how to benchmark performance of an agent on tasks where it has access to a calculator and a search tool.\n",
|
||||
"\n",
|
||||
"It is highly reccomended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "46bf9205",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Comment this out if you are NOT using tracing\n",
|
||||
"import os\n",
|
||||
"os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8a16b75d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Loading the data\n",
|
||||
"First, let's load the data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "5b2d5e98",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Found cached dataset json (/Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--agent-search-calculator-8a025c0ce5fb99d2/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "3a275586643f4ccfba1a8d54be28c351",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
" 0%| | 0/1 [00:00<?, ?it/s]"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from langchain.evaluation.loading import load_dataset\n",
|
||||
"dataset = load_dataset(\"agent-search-calculator\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4ab6a716",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Setting up a chain\n",
|
||||
"Now we need to load an agent capable of answering these questions."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "c18680b5",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.llms import OpenAI\n",
|
||||
"from langchain.chains import LLMMathChain\n",
|
||||
"from langchain.agents import initialize_agent, Tool, load_tools\n",
|
||||
"\n",
|
||||
"tools = load_tools(['serpapi', 'llm-math'], llm=OpenAI(temperature=0))\n",
|
||||
"agent = initialize_agent(tools, OpenAI(temperature=0), agent=\"zero-shot-react-description\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "68504a8f",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Make a prediction\n",
|
||||
"\n",
|
||||
"First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows use to explore the outputs in detail, and also is a lot cheaper than running over multiple datapoints"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "cbcafc92",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'38,630,316 people live in Canada as of 2023.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"agent.run(dataset[0]['question'])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d0c16cd7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Make many predictions\n",
|
||||
"Now we can make predictions"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "24b4c66e",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised APIConnectionError: Error communicating with OpenAI: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer')).\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"predictions = []\n",
|
||||
"predicted_dataset = []\n",
|
||||
"error_dataset = []\n",
|
||||
"for data in dataset:\n",
|
||||
" new_data = {\"input\": data[\"question\"], \"answer\": data[\"answer\"]}\n",
|
||||
" try:\n",
|
||||
" predictions.append(agent(new_data))\n",
|
||||
" predicted_dataset.append(new_data)\n",
|
||||
" except Exception:\n",
|
||||
" error_dataset.append(new_data)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "49d969fb",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Evaluate performance\n",
|
||||
"Now we can evaluate the predictions. The first thing we can do is look at them by eye."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "1d583f03",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'input': 'How many people live in canada as of 2023?',\n",
|
||||
" 'answer': 'approximately 38,625,801',\n",
|
||||
" 'output': '38,630,316 people live in Canada as of 2023.',\n",
|
||||
" 'intermediate_steps': [(AgentAction(tool='Search', tool_input='Population of Canada 2023', log=' I need to find population data\\nAction: Search\\nAction Input: Population of Canada 2023'),\n",
|
||||
" '38,630,316')]}"
|
||||
]
|
||||
},
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"predictions[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4783344b",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Next, we can use a language model to score them programatically"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "d0a9341d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.evaluation.qa import QAEvalChain"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"id": "1612dec1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"llm = OpenAI(temperature=0)\n",
|
||||
"eval_chain = QAEvalChain.from_llm(llm)\n",
|
||||
"graded_outputs = eval_chain.evaluate(dataset, predictions, question_key=\"question\", prediction_key=\"output\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "79587806",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can add in the graded output to the `predictions` dict and then get a count of the grades."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"id": "2a689df5",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"for i, prediction in enumerate(predictions):\n",
|
||||
" prediction['grade'] = graded_outputs[i]['text']"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"id": "27b61215",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"Counter({' CORRECT': 4, ' INCORRECT': 6})"
|
||||
]
|
||||
},
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from collections import Counter\n",
|
||||
"Counter([pred['grade'] for pred in predictions])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "12fe30f4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can also filter the datapoints to the incorrect examples and look at them."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"id": "47c692a1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"incorrect = [pred for pred in predictions if pred['grade'] == \" INCORRECT\"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"id": "0ef976c1",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'input': \"who is dua lipa's boyfriend? what is his age raised to the .43 power?\",\n",
|
||||
" 'answer': 'her boyfriend is Romain Gravas. his age raised to the .43 power is approximately 4.9373857399466665',\n",
|
||||
" 'output': \"Isaac Carew, Dua Lipa's boyfriend, is 36 years old and his age raised to the .43 power is 4.6688516567750975.\",\n",
|
||||
" 'intermediate_steps': [(AgentAction(tool='Search', tool_input=\"Dua Lipa's boyfriend\", log=' I need to find out who Dua Lipa\\'s boyfriend is and then calculate his age raised to the .43 power\\nAction: Search\\nAction Input: \"Dua Lipa\\'s boyfriend\"'),\n",
|
||||
" 'Dua and Isaac, a model and a chef, dated on and off from 2013 to 2019. The two first split in early 2017, which is when Dua went on to date LANY ...'),\n",
|
||||
" (AgentAction(tool='Search', tool_input='Isaac Carew age', log=' I need to find out Isaac\\'s age\\nAction: Search\\nAction Input: \"Isaac Carew age\"'),\n",
|
||||
" '36 years'),\n",
|
||||
" (AgentAction(tool='Calculator', tool_input='36^.43', log=' I need to calculate 36 raised to the .43 power\\nAction: Calculator\\nAction Input: 36^.43'),\n",
|
||||
" 'Answer: 4.6688516567750975\\n')],\n",
|
||||
" 'grade': ' INCORRECT'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 18,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"incorrect[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "7710401a",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
516
docs/use_cases/evaluation/agent_vectordb_sota_pg.ipynb
Normal file
516
docs/use_cases/evaluation/agent_vectordb_sota_pg.ipynb
Normal file
@ -0,0 +1,516 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "984169ca",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Agent VectorDB Question Answering Benchmarking\n",
|
||||
"\n",
|
||||
"Here we go over how to benchmark performance on a question answering task using an agent to route between multiple vectordatabases.\n",
|
||||
"\n",
|
||||
"It is highly reccomended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 47,
|
||||
"id": "7b57a50f",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Comment this out if you are NOT using tracing\n",
|
||||
"import os\n",
|
||||
"os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8a16b75d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Loading the data\n",
|
||||
"First, let's load the data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "5b2d5e98",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Found cached dataset json (/Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--agent-vectordb-qa-sota-pg-d3ae24016b514f92/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "a7abbc20615d4c58b75a055a790d7212",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
" 0%| | 0/1 [00:00<?, ?it/s]"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from langchain.evaluation.loading import load_dataset\n",
|
||||
"dataset = load_dataset(\"agent-vectordb-qa-sota-pg\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"id": "61375342",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'question': 'What is the purpose of the NATO Alliance?',\n",
|
||||
" 'answer': 'The purpose of the NATO Alliance is to secure peace and stability in Europe after World War 2.',\n",
|
||||
" 'steps': [{'tool': 'State of Union QA System', 'tool_input': None},\n",
|
||||
" {'tool': None, 'tool_input': 'What is the purpose of the NATO Alliance?'}]}"
|
||||
]
|
||||
},
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"dataset[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"id": "02500304",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'question': 'What is the purpose of YC?',\n",
|
||||
" 'answer': 'The purpose of YC is to cause startups to be founded that would not otherwise have existed.',\n",
|
||||
" 'steps': [{'tool': 'Paul Graham QA System', 'tool_input': None},\n",
|
||||
" {'tool': None, 'tool_input': 'What is the purpose of YC?'}]}"
|
||||
]
|
||||
},
|
||||
"execution_count": 22,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"dataset[-1]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4ab6a716",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Setting up a chain\n",
|
||||
"Now we need to create some pipelines for doing question answering. Step one in that is creating indexes over the data in question."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "c18680b5",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.document_loaders import TextLoader\n",
|
||||
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "7f0de2b3",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.indexes import VectorstoreIndexCreator"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "ef84ff99",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Running Chroma using direct local API.\n",
|
||||
"Using DuckDB in-memory for database. Data will be transient.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"vectorstore_sota = VectorstoreIndexCreator(vectorstore_kwargs={\"collection_name\":\"sota\"}).from_loaders([loader]).vectorstore"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "f0b5d8f6",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we can create a question answering chain."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "8843cb0c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.chains import VectorDBQA\n",
|
||||
"from langchain.llms import OpenAI"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"id": "573719a0",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"chain_sota = VectorDBQA.from_chain_type(llm=OpenAI(temperature=0), chain_type=\"stuff\", vectorstore=vectorstore_sota, input_key=\"question\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e48b03d8",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we do the same for the Paul Graham data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "c2dbb014",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"loader = TextLoader(\"../../modules/paul_graham_essay.txt\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "98d16f08",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Running Chroma using direct local API.\n",
|
||||
"Using DuckDB in-memory for database. Data will be transient.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"vectorstore_pg = VectorstoreIndexCreator(vectorstore_kwargs={\"collection_name\":\"paul_graham\"}).from_loaders([loader]).vectorstore"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"id": "ec0aab02",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"chain_pg = VectorDBQA.from_chain_type(llm=OpenAI(temperature=0), chain_type=\"stuff\", vectorstore=vectorstore_pg, input_key=\"question\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "76b5f8fb",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can now set up an agent to route between them."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 23,
|
||||
"id": "ade1aafa",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.agents import initialize_agent, Tool\n",
|
||||
"tools = [\n",
|
||||
" Tool(\n",
|
||||
" name = \"State of Union QA System\",\n",
|
||||
" func=chain_sota.run,\n",
|
||||
" description=\"useful for when you need to answer questions about the most recent state of the union address. Input should be a fully formed question.\"\n",
|
||||
" ),\n",
|
||||
" Tool(\n",
|
||||
" name = \"Paul Graham System\",\n",
|
||||
" func=chain_pg.run,\n",
|
||||
" description=\"useful for when you need to answer questions about Paul Graham. Input should be a fully formed question.\"\n",
|
||||
" ),\n",
|
||||
"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 34,
|
||||
"id": "104853f8",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"agent = initialize_agent(tools, OpenAI(temperature=0), agent=\"zero-shot-react-description\", max_iterations=3)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "7f036641",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Make a prediction\n",
|
||||
"\n",
|
||||
"First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows use to explore the outputs in detail, and also is a lot cheaper than running over multiple datapoints"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 48,
|
||||
"id": "4664e79f",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'The purpose of the NATO Alliance is to promote peace and security in the North Atlantic region by providing a collective defense against potential threats.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 48,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"agent.run(dataset[0]['question'])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d0c16cd7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Make many predictions\n",
|
||||
"Now we can make predictions"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 35,
|
||||
"id": "24b4c66e",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"predictions = []\n",
|
||||
"predicted_dataset = []\n",
|
||||
"error_dataset = []\n",
|
||||
"for data in dataset:\n",
|
||||
" new_data = {\"input\": data[\"question\"], \"answer\": data[\"answer\"]}\n",
|
||||
" try:\n",
|
||||
" predictions.append(agent(new_data))\n",
|
||||
" predicted_dataset.append(new_data)\n",
|
||||
" except Exception:\n",
|
||||
" error_dataset.append(new_data)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "49d969fb",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Evaluate performance\n",
|
||||
"Now we can evaluate the predictions. The first thing we can do is look at them by eye."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 36,
|
||||
"id": "1d583f03",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'input': 'What is the purpose of the NATO Alliance?',\n",
|
||||
" 'answer': 'The purpose of the NATO Alliance is to secure peace and stability in Europe after World War 2.',\n",
|
||||
" 'output': 'The purpose of the NATO Alliance is to promote peace and security in the North Atlantic region by providing a collective defense against potential threats.'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 36,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"predictions[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4783344b",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Next, we can use a language model to score them programatically"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 37,
|
||||
"id": "d0a9341d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.evaluation.qa import QAEvalChain"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 40,
|
||||
"id": "1612dec1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"llm = OpenAI(temperature=0)\n",
|
||||
"eval_chain = QAEvalChain.from_llm(llm)\n",
|
||||
"graded_outputs = eval_chain.evaluate(predicted_dataset, predictions, question_key=\"input\", prediction_key=\"output\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "79587806",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can add in the graded output to the `predictions` dict and then get a count of the grades."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 41,
|
||||
"id": "2a689df5",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"for i, prediction in enumerate(predictions):\n",
|
||||
" prediction['grade'] = graded_outputs[i]['text']"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 42,
|
||||
"id": "27b61215",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"Counter({' CORRECT': 19, ' INCORRECT': 14})"
|
||||
]
|
||||
},
|
||||
"execution_count": 42,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from collections import Counter\n",
|
||||
"Counter([pred['grade'] for pred in predictions])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "12fe30f4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can also filter the datapoints to the incorrect examples and look at them."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 43,
|
||||
"id": "47c692a1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"incorrect = [pred for pred in predictions if pred['grade'] == \" INCORRECT\"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 46,
|
||||
"id": "0ef976c1",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'input': 'What is the purpose of the Bipartisan Innovation Act mentioned in the text?',\n",
|
||||
" 'answer': 'The Bipartisan Innovation Act will make record investments in emerging technologies and American manufacturing to level the playing field with China and other competitors.',\n",
|
||||
" 'output': 'The purpose of the Bipartisan Innovation Act is to promote innovation and entrepreneurship in the United States by providing tax incentives and other support for startups and small businesses.',\n",
|
||||
" 'grade': ' INCORRECT'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 46,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"incorrect[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "7710401a",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
160
docs/use_cases/evaluation/benchmarking_template.ipynb
Normal file
160
docs/use_cases/evaluation/benchmarking_template.ipynb
Normal file
@ -0,0 +1,160 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a175c650",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Benchmarking Template\n",
|
||||
"\n",
|
||||
"This is an example notebook that can be used to create a benchmarking notebook for a task of your choice. Evaluation is really hard, and so we greatly welcome any contributions that can make it easier for people to experiment"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "984169ca",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"It is highly reccomended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 28,
|
||||
"id": "9fe4d1b4",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Comment this out if you are NOT using tracing\n",
|
||||
"import os\n",
|
||||
"os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0f66405e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Loading the data\n",
|
||||
"\n",
|
||||
"First, let's load the data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "79402a8f",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# This notebook should so how to load the dataset from LangChainDatasets on Hugging Face\n",
|
||||
"\n",
|
||||
"# Please upload your dataset to https://huggingface.co/LangChainDatasets\n",
|
||||
"\n",
|
||||
"# The value passed into `load_dataset` should NOT have the `LangChainDatasets/` prefix\n",
|
||||
"from langchain.evaluation.loading import load_dataset\n",
|
||||
"dataset = load_dataset(\"TODO\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8a16b75d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Setting up a chain\n",
|
||||
"\n",
|
||||
"This next section should have an example of setting up a chain that can be run on this dataset."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "a2661ce0",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "6c0062e7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Make a prediction\n",
|
||||
"\n",
|
||||
"First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows use to explore the outputs in detail, and also is a lot cheaper than running over multiple datapoints"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "d28c5e7d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Example of running the chain on a single datapoint (`dataset[0]`) goes here"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d0c16cd7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Make many predictions\n",
|
||||
"Now we can make predictions."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "24b4c66e",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Example of running the chain on many predictions goes here\n",
|
||||
"\n",
|
||||
"# Sometimes its as simple as `chain.apply(dataset)`\n",
|
||||
"\n",
|
||||
"# Othertimes you may want to write a for loop to catch errors"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4783344b",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Evaluate performance\n",
|
||||
"\n",
|
||||
"Any guide to evaluating performance in a more systematic manner goes here."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "7710401a",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
374
docs/use_cases/evaluation/qa_benchmarking_pg.ipynb
Normal file
374
docs/use_cases/evaluation/qa_benchmarking_pg.ipynb
Normal file
@ -0,0 +1,374 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "984169ca",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Question Answering Benchmarking: Paul Graham Essay\n",
|
||||
"\n",
|
||||
"Here we go over how to benchmark performance on a question answering task over a Paul Graham essay.\n",
|
||||
"\n",
|
||||
"It is highly reccomended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"id": "3bd13ab7",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Comment this out if you are NOT using tracing\n",
|
||||
"import os\n",
|
||||
"os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8a16b75d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Loading the data\n",
|
||||
"First, let's load the data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "5b2d5e98",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Found cached dataset json (/Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--question-answering-paul-graham-76e8f711e038d742/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "63f434a42cba4739919333c75324acc9",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
" 0%| | 0/1 [00:00<?, ?it/s]"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from langchain.evaluation.loading import load_dataset\n",
|
||||
"dataset = load_dataset(\"question-answering-paul-graham\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4ab6a716",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Setting up a chain\n",
|
||||
"Now we need to create some pipelines for doing question answering. Step one in that is creating an index over the data in question."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "c18680b5",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.document_loaders import TextLoader\n",
|
||||
"loader = TextLoader(\"../../modules/paul_graham_essay.txt\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "7f0de2b3",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.indexes import VectorstoreIndexCreator"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "ef84ff99",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Running Chroma using direct local API.\n",
|
||||
"Using DuckDB in-memory for database. Data will be transient.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"vectorstore = VectorstoreIndexCreator().from_loaders([loader]).vectorstore"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "f0b5d8f6",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we can create a question answering chain."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "8843cb0c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.chains import VectorDBQA\n",
|
||||
"from langchain.llms import OpenAI"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "573719a0",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"chain = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type=\"stuff\", vectorstore=vectorstore, input_key=\"question\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "53b5aa23",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Make a prediction\n",
|
||||
"\n",
|
||||
"First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows use to explore the outputs in detail, and also is a lot cheaper than running over multiple datapoints"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"id": "3f81d951",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'question': 'What were the two main things the author worked on before college?',\n",
|
||||
" 'answer': 'The two main things the author worked on before college were writing and programming.',\n",
|
||||
" 'result': ' Writing and programming.'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 18,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"chain(dataset[0])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d0c16cd7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Make many predictions\n",
|
||||
"Now we can make predictions"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "24b4c66e",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"predictions = chain.apply(dataset)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "49d969fb",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Evaluate performance\n",
|
||||
"Now we can evaluate the predictions. The first thing we can do is look at them by eye."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "1d583f03",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'question': 'What were the two main things the author worked on before college?',\n",
|
||||
" 'answer': 'The two main things the author worked on before college were writing and programming.',\n",
|
||||
" 'result': ' Writing and programming.'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"predictions[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4783344b",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Next, we can use a language model to score them programatically"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"id": "d0a9341d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.evaluation.qa import QAEvalChain"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"id": "1612dec1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"llm = OpenAI(temperature=0)\n",
|
||||
"eval_chain = QAEvalChain.from_llm(llm)\n",
|
||||
"graded_outputs = eval_chain.evaluate(dataset, predictions, question_key=\"question\", prediction_key=\"result\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "79587806",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can add in the graded output to the `predictions` dict and then get a count of the grades."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"id": "2a689df5",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"for i, prediction in enumerate(predictions):\n",
|
||||
" prediction['grade'] = graded_outputs[i]['text']"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"id": "27b61215",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"Counter({' CORRECT': 12, ' INCORRECT': 10})"
|
||||
]
|
||||
},
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from collections import Counter\n",
|
||||
"Counter([pred['grade'] for pred in predictions])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "12fe30f4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can also filter the datapoints to the incorrect examples and look at them."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"id": "47c692a1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"incorrect = [pred for pred in predictions if pred['grade'] == \" INCORRECT\"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"id": "0ef976c1",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'question': 'What did the author write their dissertation on?',\n",
|
||||
" 'answer': 'The author wrote their dissertation on applications of continuations.',\n",
|
||||
" 'result': ' The author does not mention what their dissertation was on, so it is not known.',\n",
|
||||
" 'grade': ' INCORRECT'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"incorrect[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "7710401a",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
451
docs/use_cases/evaluation/qa_benchmarking_sota.ipynb
Normal file
451
docs/use_cases/evaluation/qa_benchmarking_sota.ipynb
Normal file
@ -0,0 +1,451 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "984169ca",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Question Answering Benchmarking: State of the Union Address\n",
|
||||
"\n",
|
||||
"Here we go over how to benchmark performance on a question answering task over a state of the union address.\n",
|
||||
"\n",
|
||||
"It is highly reccomended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"id": "f127fb04",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Comment this out if you are NOT using tracing\n",
|
||||
"import os\n",
|
||||
"os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8a16b75d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Loading the data\n",
|
||||
"First, let's load the data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "5b2d5e98",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "5d66c27b9b4744989843142f08f5c1b4",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"Downloading readme: 0%| | 0.00/21.0 [00:00<?, ?B/s]"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Downloading and preparing dataset json/LangChainDatasets--question-answering-state-of-the-union to /Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--question-answering-state-of-the-union-a7e5a3b2db4f440d/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "9e21e2ab96a0491ea5e252720d7dfa26",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"Downloading data files: 0%| | 0/1 [00:00<?, ?it/s]"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "c883830e068c42d39da8406ab38574c4",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"Downloading data: 0%| | 0.00/2.90k [00:00<?, ?B/s]"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "3b085715e52e49948d2a59d27e004eba",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"Extracting data files: 0%| | 0/1 [00:00<?, ?it/s]"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"Generating train split: 0 examples [00:00, ? examples/s]"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Dataset json downloaded and prepared to /Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--question-answering-state-of-the-union-a7e5a3b2db4f440d/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "ee900d35e27d4843b42b31811b43212b",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
" 0%| | 0/1 [00:00<?, ?it/s]"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from langchain.evaluation.loading import load_dataset\n",
|
||||
"dataset = load_dataset(\"question-answering-state-of-the-union\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4ab6a716",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Setting up a chain\n",
|
||||
"Now we need to create some pipelines for doing question answering. Step one in that is creating an index over the data in question."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "c18680b5",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.document_loaders import TextLoader\n",
|
||||
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "7f0de2b3",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.indexes import VectorstoreIndexCreator"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "ef84ff99",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Running Chroma using direct local API.\n",
|
||||
"Using DuckDB in-memory for database. Data will be transient.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"vectorstore = VectorstoreIndexCreator().from_loaders([loader]).vectorstore"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "f0b5d8f6",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we can create a question answering chain."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "8843cb0c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.chains import VectorDBQA\n",
|
||||
"from langchain.llms import OpenAI"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "573719a0",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"chain = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type=\"stuff\", vectorstore=vectorstore, input_key=\"question\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "37d669e9",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Make a prediction\n",
|
||||
"\n",
|
||||
"First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows use to explore the outputs in detail, and also is a lot cheaper than running over multiple datapoints"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"id": "3089e409",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'question': 'What is the purpose of the NATO Alliance?',\n",
|
||||
" 'answer': 'The purpose of the NATO Alliance is to secure peace and stability in Europe after World War 2.',\n",
|
||||
" 'result': ' The NATO Alliance was created to secure peace and stability in Europe after World War 2.'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 17,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"chain(dataset[0])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d0c16cd7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Make many predictions\n",
|
||||
"Now we can make predictions"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "24b4c66e",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"predictions = chain.apply(dataset)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "49d969fb",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Evaluate performance\n",
|
||||
"Now we can evaluate the predictions. The first thing we can do is look at them by eye."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "1d583f03",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'question': 'What is the purpose of the NATO Alliance?',\n",
|
||||
" 'answer': 'The purpose of the NATO Alliance is to secure peace and stability in Europe after World War 2.',\n",
|
||||
" 'result': ' The purpose of the NATO Alliance is to secure peace and stability in Europe after World War 2.'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"predictions[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4783344b",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Next, we can use a language model to score them programatically"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "d0a9341d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.evaluation.qa import QAEvalChain"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "1612dec1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"llm = OpenAI(temperature=0)\n",
|
||||
"eval_chain = QAEvalChain.from_llm(llm)\n",
|
||||
"graded_outputs = eval_chain.evaluate(dataset, predictions, question_key=\"question\", prediction_key=\"result\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "79587806",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can add in the graded output to the `predictions` dict and then get a count of the grades."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"id": "2a689df5",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"for i, prediction in enumerate(predictions):\n",
|
||||
" prediction['grade'] = graded_outputs[i]['text']"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"id": "27b61215",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"Counter({' CORRECT': 7, ' INCORRECT': 4})"
|
||||
]
|
||||
},
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from collections import Counter\n",
|
||||
"Counter([pred['grade'] for pred in predictions])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "12fe30f4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can also filter the datapoints to the incorrect examples and look at them."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"id": "47c692a1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"incorrect = [pred for pred in predictions if pred['grade'] == \" INCORRECT\"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"id": "0ef976c1",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'question': 'What is the U.S. Department of Justice doing to combat the crimes of Russian oligarchs?',\n",
|
||||
" 'answer': 'The U.S. Department of Justice is assembling a dedicated task force to go after the crimes of Russian oligarchs.',\n",
|
||||
" 'result': ' The U.S. Department of Justice is assembling a dedicated task force to go after the crimes of Russian oligarchs and is naming a chief prosecutor for pandemic fraud.',\n",
|
||||
" 'grade': ' INCORRECT'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"incorrect[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "7710401a",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
117
docs/use_cases/evaluation/qa_generation.ipynb
Normal file
117
docs/use_cases/evaluation/qa_generation.ipynb
Normal file
@ -0,0 +1,117 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ee2a3a21",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# QA Generation\n",
|
||||
"This notebook shows how to use the `QAGenerationChain` to come up with question-answer pairs over a specific document.\n",
|
||||
"This is important because often times you may not have data to evaluate your question-answer system over, so this is a cheap and lightweight way to generate it!"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "33d3f0b4",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.document_loaders import TextLoader"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "2029a29c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "87edb84c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"doc = loader.load()[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "04125b6d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.chat_models import ChatOpenAI\n",
|
||||
"from langchain.chains import QAGenerationChain\n",
|
||||
"chain = QAGenerationChain.from_llm(ChatOpenAI(temperature = 0))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "4f1593e4",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"qa = chain.run(doc.page_content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"id": "ee831f92",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'question': 'What is the U.S. Department of Justice doing to combat the crimes of Russian oligarchs?',\n",
|
||||
" 'answer': 'The U.S. Department of Justice is assembling a dedicated task force to go after the crimes of Russian oligarchs.'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"qa[1]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "7028754e",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
@ -191,7 +191,6 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "782ae8c8",
|
||||
"metadata": {},
|
||||
@ -316,7 +315,7 @@
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": ".venv",
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
@ -330,7 +329,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.7 (default, Sep 16 2021, 08:50:36) \n[Clang 10.0.0 ]"
|
||||
"version": "3.9.1"
|
||||
},
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
|
423
docs/use_cases/evaluation/sql_qa_benchmarking_chinook.ipynb
Normal file
423
docs/use_cases/evaluation/sql_qa_benchmarking_chinook.ipynb
Normal file
@ -0,0 +1,423 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "984169ca",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# SQL Question Answering Benchmarking: Chinook\n",
|
||||
"\n",
|
||||
"Here we go over how to benchmark performance on a question answering task over a SQL database.\n",
|
||||
"\n",
|
||||
"It is highly reccomended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 28,
|
||||
"id": "44874486",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Comment this out if you are NOT using tracing\n",
|
||||
"import os\n",
|
||||
"os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0f66405e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Loading the data\n",
|
||||
"\n",
|
||||
"First, let's load the data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "0df1393f",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "b220d07ee5d14909bc842b4545cdc0de",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"Downloading readme: 0%| | 0.00/21.0 [00:00<?, ?B/s]"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Downloading and preparing dataset json/LangChainDatasets--sql-qa-chinook to /Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--sql-qa-chinook-7528565d2d992b47/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "e89e3c8ef76f49889c4b39c624828c71",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"Downloading data files: 0%| | 0/1 [00:00<?, ?it/s]"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "a8421df6c26045e8978c7086cb418222",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"Downloading data: 0%| | 0.00/1.44k [00:00<?, ?B/s]"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "d1fb6becc3324a85bf039a53caf30924",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"Extracting data files: 0%| | 0/1 [00:00<?, ?it/s]"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"Generating train split: 0 examples [00:00, ? examples/s]"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Dataset json downloaded and prepared to /Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--sql-qa-chinook-7528565d2d992b47/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "9d68ad1b3e4a4bd79f92597aac4d3cc9",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
" 0%| | 0/1 [00:00<?, ?it/s]"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from langchain.evaluation.loading import load_dataset\n",
|
||||
"dataset = load_dataset(\"sql-qa-chinook\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"id": "ab44d504",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'question': 'How many employees are there?', 'answer': '8'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"dataset[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8a16b75d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Setting up a chain\n",
|
||||
"This uses the example Chinook database.\n",
|
||||
"To set it up follow the instructions on https://database.guide/2-sample-databases-sqlite/, placing the `.db` file in a notebooks folder at the root of this repository.\n",
|
||||
"\n",
|
||||
"Note that here we load a simple chain. If you want to experiment with more complex chains, or an agent, just create the `chain` object in a different way."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "5b2d5e98",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain import OpenAI, SQLDatabase, SQLDatabaseChain"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "33cdcbfc",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"db = SQLDatabase.from_uri(\"sqlite:///../../../notebooks/Chinook.db\")\n",
|
||||
"llm = OpenAI(temperature=0)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "f0b5d8f6",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we can create a SQL database chain."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"id": "8843cb0c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"chain = SQLDatabaseChain(llm=llm, database=db, input_key=\"question\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "6c0062e7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Make a prediction\n",
|
||||
"\n",
|
||||
"First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows use to explore the outputs in detail, and also is a lot cheaper than running over multiple datapoints"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 27,
|
||||
"id": "d28c5e7d",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'question': 'How many employees are there?',\n",
|
||||
" 'answer': '8',\n",
|
||||
" 'result': ' There are 8 employees.'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 27,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"chain(dataset[0])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d0c16cd7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Make many predictions\n",
|
||||
"Now we can make predictions. Note that we add a try-except because this chain can sometimes error (if SQL is written incorrectly, etc)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 19,
|
||||
"id": "24b4c66e",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"predictions = []\n",
|
||||
"predicted_dataset = []\n",
|
||||
"error_dataset = []\n",
|
||||
"for data in dataset:\n",
|
||||
" try:\n",
|
||||
" predictions.append(chain(data))\n",
|
||||
" predicted_dataset.append(data)\n",
|
||||
" except:\n",
|
||||
" error_dataset.append(data)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4783344b",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Evaluate performance\n",
|
||||
"Now we can evaluate the predictions. We can use a language model to score them programatically"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"id": "d0a9341d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.evaluation.qa import QAEvalChain"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"id": "1612dec1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"llm = OpenAI(temperature=0)\n",
|
||||
"eval_chain = QAEvalChain.from_llm(llm)\n",
|
||||
"graded_outputs = eval_chain.evaluate(predicted_dataset, predictions, question_key=\"question\", prediction_key=\"result\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "79587806",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can add in the graded output to the `predictions` dict and then get a count of the grades."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 23,
|
||||
"id": "2a689df5",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"for i, prediction in enumerate(predictions):\n",
|
||||
" prediction['grade'] = graded_outputs[i]['text']"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 24,
|
||||
"id": "27b61215",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"Counter({' CORRECT': 3, ' INCORRECT': 4})"
|
||||
]
|
||||
},
|
||||
"execution_count": 24,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from collections import Counter\n",
|
||||
"Counter([pred['grade'] for pred in predictions])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "12fe30f4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can also filter the datapoints to the incorrect examples and look at them."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 25,
|
||||
"id": "47c692a1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"incorrect = [pred for pred in predictions if pred['grade'] == \" INCORRECT\"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 26,
|
||||
"id": "0ef976c1",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'question': 'How many employees are also customers?',\n",
|
||||
" 'answer': 'None',\n",
|
||||
" 'result': ' 59 employees are also customers.',\n",
|
||||
" 'grade': ' INCORRECT'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 26,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"incorrect[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "7710401a",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
@ -16,6 +16,7 @@ from langchain.chains.loading import load_chain
|
||||
from langchain.chains.mapreduce import MapReduceChain
|
||||
from langchain.chains.moderation import OpenAIModerationChain
|
||||
from langchain.chains.pal.base import PALChain
|
||||
from langchain.chains.qa_generation.base import QAGenerationChain
|
||||
from langchain.chains.qa_with_sources.base import QAWithSourcesChain
|
||||
from langchain.chains.qa_with_sources.vector_db import VectorDBQAWithSourcesChain
|
||||
from langchain.chains.sequential import SequentialChain, SimpleSequentialChain
|
||||
@ -52,4 +53,5 @@ __all__ = [
|
||||
"ChatVectorDBChain",
|
||||
"GraphQAChain",
|
||||
"ConstitutionalChain",
|
||||
"QAGenerationChain",
|
||||
]
|
||||
|
0
langchain/chains/qa_generation/__init__.py
Normal file
0
langchain/chains/qa_generation/__init__.py
Normal file
55
langchain/chains/qa_generation/base.py
Normal file
55
langchain/chains/qa_generation/base.py
Normal file
@ -0,0 +1,55 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from typing import Any, Dict, List, Optional
|
||||
|
||||
from pydantic import Field
|
||||
|
||||
from langchain.chains.base import Chain
|
||||
from langchain.chains.llm import LLMChain
|
||||
from langchain.chains.qa_generation.prompt import PROMPT_SELECTOR
|
||||
from langchain.prompts.base import BasePromptTemplate
|
||||
from langchain.schema import BaseLanguageModel
|
||||
from langchain.text_splitter import RecursiveCharacterTextSplitter, TextSplitter
|
||||
|
||||
|
||||
class QAGenerationChain(Chain):
|
||||
llm_chain: LLMChain
|
||||
text_splitter: TextSplitter = Field(
|
||||
default=RecursiveCharacterTextSplitter(chunk_overlap=500)
|
||||
)
|
||||
input_key: str = "text"
|
||||
output_key: str = "questions"
|
||||
k: Optional[int] = None
|
||||
|
||||
@classmethod
|
||||
def from_llm(
|
||||
cls,
|
||||
llm: BaseLanguageModel,
|
||||
prompt: Optional[BasePromptTemplate] = None,
|
||||
**kwargs: Any,
|
||||
) -> QAGenerationChain:
|
||||
_prompt = prompt or PROMPT_SELECTOR.get_prompt(llm)
|
||||
chain = LLMChain(llm=llm, prompt=_prompt)
|
||||
return cls(llm_chain=chain, **kwargs)
|
||||
|
||||
@property
|
||||
def _chain_type(self) -> str:
|
||||
raise NotImplementedError
|
||||
|
||||
@property
|
||||
def input_keys(self) -> List[str]:
|
||||
return [self.input_key]
|
||||
|
||||
@property
|
||||
def output_keys(self) -> List[str]:
|
||||
return [self.output_key]
|
||||
|
||||
def _call(self, inputs: Dict[str, str]) -> Dict[str, Any]:
|
||||
docs = self.text_splitter.create_documents([inputs[self.input_key]])
|
||||
results = self.llm_chain.generate([{"text": d.page_content} for d in docs])
|
||||
qa = [json.loads(res[0].text) for res in results.generations]
|
||||
return {self.output_key: qa}
|
||||
|
||||
async def _acall(self, inputs: Dict[str, str]) -> Dict[str, str]:
|
||||
raise NotImplementedError
|
50
langchain/chains/qa_generation/prompt.py
Normal file
50
langchain/chains/qa_generation/prompt.py
Normal file
@ -0,0 +1,50 @@
|
||||
# flake8: noqa
|
||||
from langchain.chains.prompt_selector import ConditionalPromptSelector, is_chat_model
|
||||
from langchain.prompts.chat import (
|
||||
ChatPromptTemplate,
|
||||
HumanMessagePromptTemplate,
|
||||
SystemMessagePromptTemplate,
|
||||
)
|
||||
from langchain.prompts.prompt import PromptTemplate
|
||||
|
||||
templ1 = """You are a smart assistant designed to help high school teachers come up with reading comprehension questions.
|
||||
Given a piece of text, you must come up with a question and answer pair that can be used to test a student's reading comprehension abilities.
|
||||
When coming up with this question/answer pair, you must respond in the following format:
|
||||
```
|
||||
{{
|
||||
"question": "$YOUR_QUESTION_HERE",
|
||||
"answer": "$THE_ANSWER_HERE"
|
||||
}}
|
||||
```
|
||||
|
||||
Everything between the ``` must be valid json.
|
||||
"""
|
||||
templ2 = """Please come up with a question/answer pair, in the specified JSON format, for the following text:
|
||||
----------------
|
||||
{text}"""
|
||||
CHAT_PROMPT = ChatPromptTemplate.from_messages(
|
||||
[
|
||||
SystemMessagePromptTemplate.from_template(templ1),
|
||||
HumanMessagePromptTemplate.from_template(templ2),
|
||||
]
|
||||
)
|
||||
templ = """You are a smart assistant designed to help high school teachers come up with reading comprehension questions.
|
||||
Given a piece of text, you must come up with a question and answer pair that can be used to test a student's reading comprehension abilities.
|
||||
When coming up with this question/answer pair, you must respond in the following format:
|
||||
```
|
||||
{{
|
||||
"question": "$YOUR_QUESTION_HERE",
|
||||
"answer": "$THE_ANSWER_HERE"
|
||||
}}
|
||||
```
|
||||
|
||||
Everything between the ``` must be valid json.
|
||||
|
||||
Please come up with a question/answer pair, in the specified JSON format, for the following text:
|
||||
----------------
|
||||
{text}"""
|
||||
PROMPT = PromptTemplate.from_template(templ)
|
||||
|
||||
PROMPT_SELECTOR = ConditionalPromptSelector(
|
||||
default_prompt=PROMPT, conditionals=[(is_chat_model, CHAT_PROMPT)]
|
||||
)
|
8
langchain/evaluation/loading.py
Normal file
8
langchain/evaluation/loading.py
Normal file
@ -0,0 +1,8 @@
|
||||
from typing import Dict, List
|
||||
|
||||
|
||||
def load_dataset(uri: str) -> List[Dict]:
|
||||
from datasets import load_dataset
|
||||
|
||||
dataset = load_dataset(f"LangChainDatasets/{uri}")
|
||||
return [d for d in dataset["train"]]
|
@ -50,8 +50,8 @@ class VectorstoreIndexCreator(BaseModel):
|
||||
"""Logic for creating indexes."""
|
||||
|
||||
vectorstore_cls: Type[VectorStore] = Chroma
|
||||
text_splitter: TextSplitter = Field(default_factory=_get_default_text_splitter)
|
||||
embedding: Embeddings = Field(default_factory=OpenAIEmbeddings)
|
||||
text_splitter: TextSplitter = Field(default_factory=_get_default_text_splitter)
|
||||
vectorstore_kwargs: dict = Field(default_factory=dict)
|
||||
|
||||
class Config:
|
||||
|
Loading…
Reference in New Issue
Block a user