Harrison/agent eval (#1620)

Co-authored-by: jerwelborn <jeremy.welborn@gmail.com>
2023-03-14 12:37:48 -07:00 · 2023-03-14 12:37:48 -07:00 · 2d098e8869
commit 2d098e8869
parent 8965a2f0af
16 changed files with 2581 additions and 7 deletions
--- a/docs/modules/chains/examples/sqlite.ipynb
+++ b/docs/modules/chains/examples/sqlite.ipynb
@ -679,7 +679,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.10.9"
+   "version": "3.9.1"
  }
 },
 "nbformat": 4,
--- a/docs/use_cases/evaluation.rst
+++ b/docs/use_cases/evaluation.rst
@ -1,9 +1,85 @@
 Evaluation
 ==============

-Generative models are notoriously hard to evaluate with traditional metrics. One new way of evaluating them is using language models themselves to do the evaluation. LangChain provides some prompts/chains for assisting in this.
+This section of documentation covers how we approach and think about evaluation in LangChain.
+This includes both how we evaluate internal chains/agents and how we recommend that people building on top of LangChain approach evaluation.

-The examples here all highlight how to use language models to assist in evaluation of themselves.
+The Problem
+-----------
+
+It can be really hard to evaluate LangChain chains and agents.
+There are two main reasons for this:
+
+**# 1: Lack of data**
+
+You generally don't have a ton of data to evaluate your chains/agents over before starting a project.
+This is usually because Large Language Models (the core of most chains/agents) are terrific few-shot and zero-shot learners,
+meaning you are almost always able to get started on a particular task (text-to-SQL, question answering, etc.) without
+a large dataset of examples.
+This is in stark contrast to traditional machine learning where you had to first collect a bunch of datapoints
+before even getting started using a model.
+
+**# 2: Lack of metrics**
+
+Most chains/agents are performing tasks for which there are not very good metrics to evaluate performance.
+For example, one of the most common use cases is generating text of some form.
+Evaluating generated text is much more complicated than evaluating a classification prediction, or a numeric prediction.
+
+The Solution
+------------
+
+LangChain attempts to tackle both of those issues.
+What we have so far are initial passes at solutions - we do not think we have a perfect solution.
+So we very much welcome feedback, contributions, integrations, and thoughts on this.
+
+Here is what we have for each problem so far:
+
+**# 1: Lack of data**
+
+We have started `LangChainDatasets <https://huggingface.co/LangChainDatasets>`_, a Community space on Hugging Face.
+We intend this to be a collection of open source datasets for evaluating common chains and agents.
+We have contributed five datasets of our own to start, but we fully intend this to be a community effort.
+In order to contribute a dataset, you simply need to join the community and then you will be able to upload datasets.
+
+We're also aiming to make it as easy as possible for people to create their own datasets.
+As a first pass at this, we've added a QAGenerationChain, which, given a document, comes up
+with question-answer pairs that can be used to evaluate question-answering tasks over that document down the line.
+See `this notebook <./evaluation/qa_generation.html>`_ for an example of how to use this chain.
+
+**# 2: Lack of metrics**
+
+We have two solutions to the lack of metrics.
+
+The first solution is to use no metrics, and rather just rely on looking at results by eye to get a sense for how the chain/agent is performing.
+To assist in this, we have developed (and will continue to develop) `tracing <../tracing.html>`_, a UI-based visualizer of your chain and agent runs.
+
+The second solution we recommend is to use Language Models themselves to evaluate outputs.
+For this we have a few different chains and prompts aimed at tackling this issue.
+
+The Examples
+------------
+
+We have created a bunch of examples combining the above two solutions to show how we internally evaluate chains and agents when we are developing.
+In addition to the examples we've curated, we also highly welcome contributions here.
+To facilitate that, we've included a `template notebook <./evaluation/benchmarking_template.html>`_ for community members to use to build their own examples.
+
+The existing examples we have are:
+
+`Question Answering (State of Union) <./evaluation/qa_benchmarking_sota.html>`_: A notebook showing evaluation of a question-answering task over a State-of-the-Union address.
+
+`Question Answering (Paul Graham Essay) <./evaluation/qa_benchmarking_pg.html>`_: A notebook showing evaluation of a question-answering task over a Paul Graham essay.
+
+`SQL Question Answering (Chinook) <./evaluation/sql_qa_benchmarking_chinook.html>`_: A notebook showing evaluation of a question-answering task over a SQL database (the Chinook database).
+
+`Agent Vectorstore <./evaluation/agent_vectordb_sota_pg.html>`_: A notebook showing evaluation of an agent doing question answering while routing between two different vector databases.
+
+`Agent Search + Calculator <./evaluation/agent_benchmarking.html>`_: A notebook showing evaluation of an agent doing question answering using a Search engine and a Calculator as tools.
+
+
+Other Examples
+--------------
+
+In addition, we also have some more generic resources for evaluation.

 `Question Answering <./evaluation/question_answering.html>`_: An overview of LLMs aimed at evaluating question answering systems in general.
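
Note on the "use Language Models themselves to evaluate outputs" approach described in evaluation.rst: the idea can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the implementation behind LangChain's evaluation chains. The prompt text, `grade_examples`, and `fake_llm` are hypothetical names introduced here, and `fake_llm` is a stand-in that compares strings so the sketch runs without a model or an API key.

```python
from typing import Callable, Dict, List

# Hypothetical grading prompt for illustration; not the prompt LangChain ships.
GRADE_TEMPLATE = (
    "You are grading a student's answer.\n"
    "QUESTION: {question}\n"
    "TRUE ANSWER: {answer}\n"
    "STUDENT ANSWER: {prediction}\n"
    "Reply with CORRECT or INCORRECT.\n"
    "GRADE:"
)


def grade_examples(
    examples: List[Dict[str, str]],
    predictions: List[str],
    llm: Callable[[str], str],
) -> List[str]:
    """Ask `llm` to grade each prediction against its reference answer."""
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must be the same length")
    grades = []
    for example, prediction in zip(examples, predictions):
        prompt = GRADE_TEMPLATE.format(
            question=example["question"],
            answer=example["answer"],
            prediction=prediction,
        )
        # Completions often start with whitespace, so normalize the verdict.
        grades.append(llm(prompt).strip().upper())
    return grades


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: exact match between the two answers."""
    fields = dict(
        line.split(": ", 1) for line in prompt.splitlines() if ": " in line
    )
    same = fields["TRUE ANSWER"] == fields["STUDENT ANSWER"]
    return " CORRECT" if same else " INCORRECT"


examples = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
]
grades = grade_examples(examples, ["4", "Lyon"], fake_llm)
print(grades)  # ['CORRECT', 'INCORRECT']
```

With a real model, `llm` would be any callable that takes a prompt string and returns the completion text. A real grader can judge semantic equivalence, which the exact-match stand-in here cannot.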

--- a/docs/use_cases/evaluation/agent_benchmarking.ipynb
+++ b/docs/use_cases/evaluation/agent_benchmarking.ipynb
@ -0,0 +1,343 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "984169ca",
+   "metadata": {},
+   "source": [
+    "# Agent Benchmarking: Search + Calculator\n",
+    "\n",
+    "Here we go over how to benchmark performance of an agent on tasks where it has access to a calculator and a search tool.\n",
+    "\n",
+    "It is highly recommended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "46bf9205",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Comment this out if you are NOT using tracing\n",
+    "import os\n",
+    "os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8a16b75d",
+   "metadata": {},
+   "source": [
+    "## Loading the data\n",
+    "First, let's load the data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "5b2d5e98",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Found cached dataset json (/Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--agent-search-calculator-8a025c0ce5fb99d2/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "3a275586643f4ccfba1a8d54be28c351",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "  0%|          | 0/1 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "from langchain.evaluation.loading import load_dataset\n",
+    "dataset = load_dataset(\"agent-search-calculator\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4ab6a716",
+   "metadata": {},
+   "source": [
+    "## Setting up a chain\n",
+    "Now we need to load an agent capable of answering these questions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "c18680b5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.llms import OpenAI\n",
+    "from langchain.chains import LLMMathChain\n",
+    "from langchain.agents import initialize_agent, Tool, load_tools\n",
+    "\n",
+    "tools = load_tools(['serpapi', 'llm-math'], llm=OpenAI(temperature=0))\n",
+    "agent = initialize_agent(tools, OpenAI(temperature=0), agent=\"zero-shot-react-description\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "68504a8f",
+   "metadata": {},
+   "source": [
+    "## Make a prediction\n",
+    "\n",
+    "First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows us to explore the outputs in detail, and is also a lot cheaper than running over multiple datapoints."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "cbcafc92",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'38,630,316 people live in Canada as of 2023.'"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "agent.run(dataset[0]['question'])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d0c16cd7",
+   "metadata": {},
+   "source": [
+    "## Make many predictions\n",
+    "Now we can make predictions over the full dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "24b4c66e",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised APIConnectionError: Error communicating with OpenAI: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer')).\n"
+     ]
+    }
+   ],
+   "source": [
+    "predictions = []\n",
+    "predicted_dataset = []\n",
+    "error_dataset = []\n",
+    "for data in dataset:\n",
+    "    new_data = {\"input\": data[\"question\"], \"answer\": data[\"answer\"]}\n",
+    "    try:\n",
+    "        predictions.append(agent(new_data))\n",
+    "        predicted_dataset.append(new_data)\n",
+    "    except Exception:\n",
+    "        error_dataset.append(new_data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "49d969fb",
+   "metadata": {},
+   "source": [
+    "## Evaluate performance\n",
+    "Now we can evaluate the predictions. The first thing we can do is look at them by eye."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "1d583f03",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'input': 'How many people live in canada as of 2023?',\n",
+       " 'answer': 'approximately 38,625,801',\n",
+       " 'output': '38,630,316 people live in Canada as of 2023.',\n",
+       " 'intermediate_steps': [(AgentAction(tool='Search', tool_input='Population of Canada 2023', log=' I need to find population data\\nAction: Search\\nAction Input: Population of Canada 2023'),\n",
+       "   '38,630,316')]}"
+      ]
+     },
+     "execution_count": 9,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "predictions[0]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4783344b",
+   "metadata": {},
+   "source": [
+    "Next, we can use a language model to score them programmatically."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "d0a9341d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.evaluation.qa import QAEvalChain"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "1612dec1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "llm = OpenAI(temperature=0)\n",
+    "eval_chain = QAEvalChain.from_llm(llm)\n",
+    "graded_outputs = eval_chain.evaluate(predicted_dataset, predictions, question_key=\"input\", prediction_key=\"output\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "79587806",
+   "metadata": {},
+   "source": [
+    "We can add in the graded output to the `predictions` dict and then get a count of the grades."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "id": "2a689df5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for i, prediction in enumerate(predictions):\n",
+    "    prediction['grade'] = graded_outputs[i]['text']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "id": "27b61215",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Counter({' CORRECT': 4, ' INCORRECT': 6})"
+      ]
+     },
+     "execution_count": 16,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from collections import Counter\n",
+    "Counter([pred['grade'] for pred in predictions])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "12fe30f4",
+   "metadata": {},
+   "source": [
+    "We can also filter the datapoints to the incorrect examples and look at them."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "id": "47c692a1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "incorrect = [pred for pred in predictions if pred['grade'] == \" INCORRECT\"]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "id": "0ef976c1",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'input': \"who is dua lipa's boyfriend? what is his age raised to the .43 power?\",\n",
+       " 'answer': 'her boyfriend is Romain Gravas. his age raised to the .43 power is approximately 4.9373857399466665',\n",
+       " 'output': \"Isaac Carew, Dua Lipa's boyfriend, is 36 years old and his age raised to the .43 power is 4.6688516567750975.\",\n",
+       " 'intermediate_steps': [(AgentAction(tool='Search', tool_input=\"Dua Lipa's boyfriend\", log=' I need to find out who Dua Lipa\\'s boyfriend is and then calculate his age raised to the .43 power\\nAction: Search\\nAction Input: \"Dua Lipa\\'s boyfriend\"'),\n",
+       "   'Dua and Isaac, a model and a chef, dated on and off from 2013 to 2019. The two first split in early 2017, which is when Dua went on to date LANY ...'),\n",
+       "  (AgentAction(tool='Search', tool_input='Isaac Carew age', log=' I need to find out Isaac\\'s age\\nAction: Search\\nAction Input: \"Isaac Carew age\"'),\n",
+       "   '36 years'),\n",
+       "  (AgentAction(tool='Calculator', tool_input='36^.43', log=' I need to calculate 36 raised to the .43 power\\nAction: Calculator\\nAction Input: 36^.43'),\n",
+       "   'Answer: 4.6688516567750975\\n')],\n",
+       " 'grade': ' INCORRECT'}"
+      ]
+     },
+     "execution_count": 18,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "incorrect[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7710401a",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
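
Note on the "Make many predictions" loop in the notebook above: it skips datapoints that raise, so the list of predictions can be shorter than the dataset. The sketch below shows one way to keep each prediction aligned with the input that produced it, so a grader receives matching pairs. It is an illustration only. `run_with_error_tracking` and `flaky_chain` are hypothetical names, and `flaky_chain` stands in for a real agent that occasionally fails.

```python
from typing import Any, Callable, Dict, List, Tuple

Example = Dict[str, Any]


def run_with_error_tracking(
    chain: Callable[[Example], Example],
    dataset: List[Example],
) -> Tuple[List[Example], List[Example], List[Example]]:
    """Run `chain` on every example; return (predictions, predicted, errored)."""
    predictions: List[Example] = []
    predicted: List[Example] = []
    errored: List[Example] = []
    for data in dataset:
        inputs = {"input": data["question"], "answer": data["answer"]}
        try:
            predictions.append(chain(inputs))
        except Exception:
            # Keep the failing input so it can be inspected or retried later.
            errored.append(inputs)
        else:
            # Recorded only on success, so predicted[i] pairs with predictions[i].
            predicted.append(inputs)
    return predictions, predicted, errored


def flaky_chain(inputs: Example) -> Example:
    """Stand-in for an agent call that sometimes raises."""
    if "fail" in inputs["input"]:
        raise RuntimeError("simulated tool failure")
    return {**inputs, "output": inputs["input"].upper()}


dataset = [
    {"question": "first", "answer": "a"},
    {"question": "please fail", "answer": "b"},
    {"question": "third", "answer": "c"},
]
predictions, predicted, errored = run_with_error_tracking(flaky_chain, dataset)
print(len(predictions), len(predicted), len(errored))  # 2 2 1
```

Grading against `predicted` rather than the original dataset avoids comparing a prediction with the reference answer of a different question when some datapoints errored.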
--- a/docs/use_cases/evaluation/agent_vectordb_sota_pg.ipynb
+++ b/docs/use_cases/evaluation/agent_vectordb_sota_pg.ipynb
@ -0,0 +1,516 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "984169ca",
+   "metadata": {},
+   "source": [
+    "# Agent VectorDB Question Answering Benchmarking\n",
+    "\n",
+    "Here we go over how to benchmark performance on a question answering task using an agent to route between multiple vector databases.\n",
+    "\n",
+    "It is highly recommended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 47,
+   "id": "7b57a50f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Comment this out if you are NOT using tracing\n",
+    "import os\n",
+    "os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8a16b75d",
+   "metadata": {},
+   "source": [
+    "## Loading the data\n",
+    "First, let's load the data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "5b2d5e98",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Found cached dataset json (/Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--agent-vectordb-qa-sota-pg-d3ae24016b514f92/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "a7abbc20615d4c58b75a055a790d7212",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "  0%|          | 0/1 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "from langchain.evaluation.loading import load_dataset\n",
+    "dataset = load_dataset(\"agent-vectordb-qa-sota-pg\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "id": "61375342",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'question': 'What is the purpose of the NATO Alliance?',\n",
+       " 'answer': 'The purpose of the NATO Alliance is to secure peace and stability in Europe after World War 2.',\n",
+       " 'steps': [{'tool': 'State of Union QA System', 'tool_input': None},\n",
+       "  {'tool': None, 'tool_input': 'What is the purpose of the NATO Alliance?'}]}"
+      ]
+     },
+     "execution_count": 16,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "dataset[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "id": "02500304",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'question': 'What is the purpose of YC?',\n",
+       " 'answer': 'The purpose of YC is to cause startups to be founded that would not otherwise have existed.',\n",
+       " 'steps': [{'tool': 'Paul Graham QA System', 'tool_input': None},\n",
+       "  {'tool': None, 'tool_input': 'What is the purpose of YC?'}]}"
+      ]
+     },
+     "execution_count": 22,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "dataset[-1]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4ab6a716",
+   "metadata": {},
+   "source": [
+    "## Setting up a chain\n",
+    "Now we need to create some pipelines for doing question answering. Step one in that is creating indexes over the data in question."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "c18680b5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.document_loaders import TextLoader\n",
+    "loader = TextLoader(\"../../modules/state_of_the_union.txt\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "7f0de2b3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.indexes import VectorstoreIndexCreator"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "ef84ff99",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Running Chroma using direct local API.\n",
+      "Using DuckDB in-memory for database. Data will be transient.\n"
+     ]
+    }
+   ],
+   "source": [
+    "vectorstore_sota = VectorstoreIndexCreator(vectorstore_kwargs={\"collection_name\":\"sota\"}).from_loaders([loader]).vectorstore"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f0b5d8f6",
+   "metadata": {},
+   "source": [
+    "Now we can create a question answering chain."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "8843cb0c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.chains import VectorDBQA\n",
+    "from langchain.llms import OpenAI"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "573719a0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "chain_sota = VectorDBQA.from_chain_type(llm=OpenAI(temperature=0), chain_type=\"stuff\", vectorstore=vectorstore_sota, input_key=\"question\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e48b03d8",
+   "metadata": {},
+   "source": [
+    "Now we do the same for the Paul Graham data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "c2dbb014",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = TextLoader(\"../../modules/paul_graham_essay.txt\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "98d16f08",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Running Chroma using direct local API.\n",
+      "Using DuckDB in-memory for database. Data will be transient.\n"
+     ]
+    }
+   ],
+   "source": [
+    "vectorstore_pg = VectorstoreIndexCreator(vectorstore_kwargs={\"collection_name\":\"paul_graham\"}).from_loaders([loader]).vectorstore"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "ec0aab02",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "chain_pg = VectorDBQA.from_chain_type(llm=OpenAI(temperature=0), chain_type=\"stuff\", vectorstore=vectorstore_pg, input_key=\"question\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "76b5f8fb",
+   "metadata": {},
+   "source": [
+    "We can now set up an agent to route between them."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "id": "ade1aafa",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.agents import initialize_agent, Tool\n",
+    "tools = [\n",
+    "    Tool(\n",
+    "        name = \"State of Union QA System\",\n",
+    "        func=chain_sota.run,\n",
+    "        description=\"useful for when you need to answer questions about the most recent state of the union address. Input should be a fully formed question.\"\n",
+    "    ),\n",
+    "    Tool(\n",
+    "        name = \"Paul Graham QA System\",\n",
+    "        func=chain_pg.run,\n",
+    "        description=\"useful for when you need to answer questions about Paul Graham. Input should be a fully formed question.\"\n",
+    "    ),\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 34,
+   "id": "104853f8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "agent = initialize_agent(tools, OpenAI(temperature=0), agent=\"zero-shot-react-description\", max_iterations=3)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7f036641",
+   "metadata": {},
+   "source": [
+    "## Make a prediction\n",
+    "\n",
+    "First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows us to explore the outputs in detail, and is also a lot cheaper than running over multiple datapoints."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 48,
+   "id": "4664e79f",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'The purpose of the NATO Alliance is to promote peace and security in the North Atlantic region by providing a collective defense against potential threats.'"
+      ]
+     },
+     "execution_count": 48,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "agent.run(dataset[0]['question'])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d0c16cd7",
+   "metadata": {},
+   "source": [
+    "## Make many predictions\n",
+    "Now we can make predictions over the full dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "id": "24b4c66e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "predictions = []\n",
+    "predicted_dataset = []\n",
+    "error_dataset = []\n",
+    "for data in dataset:\n",
+    "    new_data = {\"input\": data[\"question\"], \"answer\": data[\"answer\"]}\n",
+    "    try:\n",
+    "        predictions.append(agent(new_data))\n",
+    "        predicted_dataset.append(new_data)\n",
+    "    except Exception:\n",
+    "        error_dataset.append(new_data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "49d969fb",
+   "metadata": {},
+   "source": [
+    "## Evaluate performance\n",
+    "Now we can evaluate the predictions. The first thing we can do is look at them by eye."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 36,
+   "id": "1d583f03",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'input': 'What is the purpose of the NATO Alliance?',\n",
+       " 'answer': 'The purpose of the NATO Alliance is to secure peace and stability in Europe after World War 2.',\n",
+       " 'output': 'The purpose of the NATO Alliance is to promote peace and security in the North Atlantic region by providing a collective defense against potential threats.'}"
+      ]
+     },
+     "execution_count": 36,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "predictions[0]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4783344b",
+   "metadata": {},
+   "source": [
+    "Next, we can use a language model to score them programmatically."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 37,
+   "id": "d0a9341d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.evaluation.qa import QAEvalChain"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 40,
+   "id": "1612dec1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "llm = OpenAI(temperature=0)\n",
+    "eval_chain = QAEvalChain.from_llm(llm)\n",
+    "graded_outputs = eval_chain.evaluate(predicted_dataset, predictions, question_key=\"input\", prediction_key=\"output\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "79587806",
+   "metadata": {},
+   "source": [
+    "We can add in the graded output to the `predictions` dict and then get a count of the grades."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 41,
+   "id": "2a689df5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for i, prediction in enumerate(predictions):\n",
+    "    prediction['grade'] = graded_outputs[i]['text']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 42,
+   "id": "27b61215",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Counter({' CORRECT': 19, ' INCORRECT': 14})"
+      ]
+     },
+     "execution_count": 42,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from collections import Counter\n",
+    "Counter([pred['grade'] for pred in predictions])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "12fe30f4",
+   "metadata": {},
+   "source": [
+    "We can also filter the datapoints to the incorrect examples and look at them."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 43,
+   "id": "47c692a1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "incorrect = [pred for pred in predictions if pred['grade'] == \" INCORRECT\"]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 46,
+   "id": "0ef976c1",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'input': 'What is the purpose of the Bipartisan Innovation Act mentioned in the text?',\n",
+       " 'answer': 'The Bipartisan Innovation Act will make record investments in emerging technologies and American manufacturing to level the playing field with China and other competitors.',\n",
+       " 'output': 'The purpose of the Bipartisan Innovation Act is to promote innovation and entrepreneurship in the United States by providing tax incentives and other support for startups and small businesses.',\n",
+       " 'grade': ' INCORRECT'}"
+      ]
+     },
+     "execution_count": 46,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "incorrect[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7710401a",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
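
Note on the grading steps in the notebook above: the recorded grades carry a leading space (`' CORRECT'`, `' INCORRECT'`), so the filter has to match that exact string. The sketch below does the same attach, count, and filter steps but strips whitespace first. It is a hedged illustration. `summarize_grades` is a hypothetical helper, and the sample data is made up to mirror the shape of the grader output (a list of dicts with a `text` key).

```python
from collections import Counter
from typing import Any, Dict, List, Tuple

Example = Dict[str, Any]


def summarize_grades(
    predictions: List[Example],
    graded_outputs: List[Dict[str, str]],
) -> Tuple[Counter, List[Example]]:
    """Attach a normalized grade to each prediction, then count and filter."""
    if len(predictions) != len(graded_outputs):
        raise ValueError("predictions and graded_outputs must be aligned")
    for prediction, graded in zip(predictions, graded_outputs):
        # Strip so ' CORRECT' and 'CORRECT' are treated as the same grade.
        prediction["grade"] = graded["text"].strip()
    counts = Counter(p["grade"] for p in predictions)
    incorrect = [p for p in predictions if p["grade"] == "INCORRECT"]
    return counts, incorrect


# Made-up sample data in the same shape as the notebook's objects.
sample_predictions = [
    {"input": "q1", "answer": "a1", "output": "a1"},
    {"input": "q2", "answer": "a2", "output": "something else"},
    {"input": "q3", "answer": "a3", "output": "a3"},
]
sample_grades = [{"text": " CORRECT"}, {"text": " INCORRECT"}, {"text": "CORRECT"}]

counts, incorrect = summarize_grades(sample_predictions, sample_grades)
accuracy = counts["CORRECT"] / len(sample_predictions)
print(counts["CORRECT"], counts["INCORRECT"], len(incorrect))  # 2 1 1
```

Normalizing once, at the point where grades are attached, keeps later filters and counts independent of how the model happened to format its reply.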
--- a/docs/use_cases/evaluation/benchmarking_template.ipynb
+++ b/docs/use_cases/evaluation/benchmarking_template.ipynb
@ -0,0 +1,160 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "a175c650",
+   "metadata": {},
+   "source": [
+    "# Benchmarking Template\n",
+    "\n",
+    "This is an example notebook that can be used to create a benchmarking notebook for a task of your choice. Evaluation is really hard, and so we greatly welcome any contributions that can make it easier for people to experiment."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "984169ca",
+   "metadata": {},
+   "source": [
+    "It is highly recommended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
+   "id": "9fe4d1b4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Comment this out if you are NOT using tracing\n",
+    "import os\n",
+    "os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0f66405e",
+   "metadata": {},
+   "source": [
+    "## Loading the data\n",
+    "\n",
+    "First, let's load the data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "79402a8f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# This notebook should show how to load the dataset from LangChainDatasets on Hugging Face\n",
+    "\n",
+    "# Please upload your dataset to https://huggingface.co/LangChainDatasets\n",
+    "\n",
+    "# The value passed into `load_dataset` should NOT have the `LangChainDatasets/` prefix\n",
+    "from langchain.evaluation.loading import load_dataset\n",
+    "dataset = load_dataset(\"TODO\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8a16b75d",
+   "metadata": {},
+   "source": [
+    "## Setting up a chain\n",
+    "\n",
+    "This next section should have an example of setting up a chain that can be run on this dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a2661ce0",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6c0062e7",
+   "metadata": {},
+   "source": [
+    "## Make a prediction\n",
+    "\n",
+    "First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows us to explore the outputs in detail, and is also a lot cheaper than running over multiple datapoints."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "d28c5e7d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Example of running the chain on a single datapoint (`dataset[0]`) goes here"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d0c16cd7",
+   "metadata": {},
+   "source": [
+    "## Make many predictions\n",
+    "Now we can make predictions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "24b4c66e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Example of running the chain on many predictions goes here\n",
+    "\n",
+    "# Sometimes it's as simple as `chain.apply(dataset)`\n",
+    "\n",
+    "# Other times you may want to write a for loop to catch errors"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4783344b",
+   "metadata": {},
+   "source": [
+    "## Evaluate performance\n",
+    "\n",
+    "Any guide to evaluating performance in a more systematic manner goes here."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7710401a",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
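The template notebook above mentions that running a chain over many datapoints sometimes calls for a loop that catches errors. A minimal sketch of that pattern follows. Both `run_with_error_capture` and `flaky_chain` are hypothetical names (not part of LangChain), used so the example runs without an API key; in a real notebook the callable would be the `chain` object.

```python
from typing import Callable, Dict, List, Tuple

Example = Dict[str, str]


def run_with_error_capture(
    chain: Callable[[Example], Example],
    dataset: List[Example],
) -> Tuple[List[Example], List[Tuple[int, str]]]:
    """Run `chain` on every example, recording failures instead of stopping."""
    predictions: List[Example] = []
    errors: List[Tuple[int, str]] = []
    for i, example in enumerate(dataset):
        try:
            predictions.append(chain(example))
        except Exception as exc:  # broad on purpose: any failure is recorded
            errors.append((i, repr(exc)))
    return predictions, errors


def flaky_chain(example: Example) -> Example:
    """Hypothetical stand-in for a chain; fails on empty questions."""
    if not example["question"]:
        raise ValueError("empty question")
    return {**example, "result": example["question"].upper()}


toy_dataset = [
    {"question": "How many employees are there?", "answer": "8"},
    {"question": "", "answer": "n/a"},
]
predictions, errors = run_with_error_capture(flaky_chain, toy_dataset)
```

Keeping the index of each failed datapoint makes it possible to revisit those examples later, rather than silently shrinking the evaluation set.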
--- a/docs/use_cases/evaluation/qa_benchmarking_pg.ipynb
+++ b/docs/use_cases/evaluation/qa_benchmarking_pg.ipynb
@ -0,0 +1,374 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "984169ca",
+   "metadata": {},
+   "source": [
+    "# Question Answering Benchmarking: Paul Graham Essay\n",
+    "\n",
+    "Here we go over how to benchmark performance on a question answering task over a Paul Graham essay.\n",
+    "\n",
+    "It is highly recommended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "id": "3bd13ab7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Comment this out if you are NOT using tracing\n",
+    "import os\n",
+    "os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8a16b75d",
+   "metadata": {},
+   "source": [
+    "## Loading the data\n",
+    "First, let's load the data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "5b2d5e98",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Found cached dataset json (/Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--question-answering-paul-graham-76e8f711e038d742/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "63f434a42cba4739919333c75324acc9",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "  0%|          | 0/1 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "from langchain.evaluation.loading import load_dataset\n",
+    "dataset = load_dataset(\"question-answering-paul-graham\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4ab6a716",
+   "metadata": {},
+   "source": [
+    "## Setting up a chain\n",
+    "Now we need to create a pipeline for doing question answering. The first step is to create an index over the data in question."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "c18680b5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.document_loaders import TextLoader\n",
+    "loader = TextLoader(\"../../modules/paul_graham_essay.txt\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "7f0de2b3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.indexes import VectorstoreIndexCreator"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "ef84ff99",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Running Chroma using direct local API.\n",
+      "Using DuckDB in-memory for database. Data will be transient.\n"
+     ]
+    }
+   ],
+   "source": [
+    "vectorstore = VectorstoreIndexCreator().from_loaders([loader]).vectorstore"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f0b5d8f6",
+   "metadata": {},
+   "source": [
+    "Now we can create a question answering chain."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "8843cb0c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.chains import VectorDBQA\n",
+    "from langchain.llms import OpenAI"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "573719a0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "chain = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type=\"stuff\", vectorstore=vectorstore, input_key=\"question\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "53b5aa23",
+   "metadata": {},
+   "source": [
+    "## Make a prediction\n",
+    "\n",
+    "First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows us to explore the outputs in detail, and is also a lot cheaper than running over multiple datapoints."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "id": "3f81d951",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'question': 'What were the two main things the author worked on before college?',\n",
+       " 'answer': 'The two main things the author worked on before college were writing and programming.',\n",
+       " 'result': ' Writing and programming.'}"
+      ]
+     },
+     "execution_count": 18,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "chain(dataset[0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d0c16cd7",
+   "metadata": {},
+   "source": [
+    "## Make many predictions\n",
+    "Now we can make predictions over the whole dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "24b4c66e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "predictions = chain.apply(dataset)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "49d969fb",
+   "metadata": {},
+   "source": [
+    "## Evaluate performance\n",
+    "Now we can evaluate the predictions. The first thing we can do is look at them by eye."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "1d583f03",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'question': 'What were the two main things the author worked on before college?',\n",
+       " 'answer': 'The two main things the author worked on before college were writing and programming.',\n",
+       " 'result': ' Writing and programming.'}"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "predictions[0]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4783344b",
+   "metadata": {},
+   "source": [
+    "Next, we can use a language model to score them programmatically."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "d0a9341d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.evaluation.qa import QAEvalChain"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "1612dec1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "llm = OpenAI(temperature=0)\n",
+    "eval_chain = QAEvalChain.from_llm(llm)\n",
+    "graded_outputs = eval_chain.evaluate(dataset, predictions, question_key=\"question\", prediction_key=\"result\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "79587806",
+   "metadata": {},
+   "source": [
+    "We can add the graded output to each dict in `predictions` and then get a count of the grades."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "2a689df5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for i, prediction in enumerate(predictions):\n",
+    "    prediction['grade'] = graded_outputs[i]['text']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "27b61215",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Counter({' CORRECT': 12, ' INCORRECT': 10})"
+      ]
+     },
+     "execution_count": 14,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from collections import Counter\n",
+    "Counter([pred['grade'] for pred in predictions])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "12fe30f4",
+   "metadata": {},
+   "source": [
+    "We can also filter the datapoints to the incorrect examples and look at them."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "id": "47c692a1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "incorrect = [pred for pred in predictions if pred['grade'] == \" INCORRECT\"]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "id": "0ef976c1",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'question': 'What did the author write their dissertation on?',\n",
+       " 'answer': 'The author wrote their dissertation on applications of continuations.',\n",
+       " 'result': ' The author does not mention what their dissertation was on, so it is not known.',\n",
+       " 'grade': ' INCORRECT'}"
+      ]
+     },
+     "execution_count": 16,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "incorrect[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7710401a",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
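In the notebook above, the grades come back with a leading space (`' CORRECT'`, `' INCORRECT'`), so the filter has to match that exact string. The following is a minimal sketch of a more forgiving tally, assuming grades are plain strings that differ only by whitespace or case. `summarize_grades` is a hypothetical helper, not a LangChain function, and the sample records are made up for illustration.

```python
from collections import Counter
from typing import Dict, List


def summarize_grades(predictions: List[Dict[str, str]]) -> Dict[str, object]:
    """Normalize grade labels, then count them and compute accuracy."""
    # Strip whitespace and upper-case so ' CORRECT' and 'correct' compare equal.
    grades = [pred["grade"].strip().upper() for pred in predictions]
    counts = Counter(grades)
    total = len(grades)
    accuracy = counts["CORRECT"] / total if total else 0.0
    incorrect = [p for p, g in zip(predictions, grades) if g == "INCORRECT"]
    return {"counts": counts, "accuracy": accuracy, "incorrect": incorrect}


# Made-up records in the same shape as the graded `predictions` dicts.
sample = [
    {"question": "q1", "result": "a1", "grade": " CORRECT"},
    {"question": "q2", "result": "a2", "grade": " INCORRECT"},
    {"question": "q3", "result": "a3", "grade": "CORRECT\n"},
    {"question": "q4", "result": "a4", "grade": " correct"},
]
summary = summarize_grades(sample)
```

Normalizing before comparing means the filter does not silently return an empty list if the grader's output formatting changes slightly.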
--- a/docs/use_cases/evaluation/qa_benchmarking_sota.ipynb
+++ b/docs/use_cases/evaluation/qa_benchmarking_sota.ipynb
@ -0,0 +1,451 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "984169ca",
+   "metadata": {},
+   "source": [
+    "# Question Answering Benchmarking: State of the Union Address\n",
+    "\n",
+    "Here we go over how to benchmark performance on a question answering task over a state of the union address.\n",
+    "\n",
+    "It is highly recommended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "id": "f127fb04",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Comment this out if you are NOT using tracing\n",
+    "import os\n",
+    "os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8a16b75d",
+   "metadata": {},
+   "source": [
+    "## Loading the data\n",
+    "First, let's load the data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "5b2d5e98",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "5d66c27b9b4744989843142f08f5c1b4",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Downloading readme:   0%|          | 0.00/21.0 [00:00<?, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Downloading and preparing dataset json/LangChainDatasets--question-answering-state-of-the-union to /Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--question-answering-state-of-the-union-a7e5a3b2db4f440d/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "9e21e2ab96a0491ea5e252720d7dfa26",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "c883830e068c42d39da8406ab38574c4",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Downloading data:   0%|          | 0.00/2.90k [00:00<?, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "3b085715e52e49948d2a59d27e004eba",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Generating train split: 0 examples [00:00, ? examples/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Dataset json downloaded and prepared to /Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--question-answering-state-of-the-union-a7e5a3b2db4f440d/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "ee900d35e27d4843b42b31811b43212b",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "  0%|          | 0/1 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "from langchain.evaluation.loading import load_dataset\n",
+    "dataset = load_dataset(\"question-answering-state-of-the-union\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4ab6a716",
+   "metadata": {},
+   "source": [
+    "## Setting up a chain\n",
+    "Now we need to create a pipeline for doing question answering. The first step is to create an index over the data in question."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "c18680b5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.document_loaders import TextLoader\n",
+    "loader = TextLoader(\"../../modules/state_of_the_union.txt\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "7f0de2b3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.indexes import VectorstoreIndexCreator"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "ef84ff99",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Running Chroma using direct local API.\n",
+      "Using DuckDB in-memory for database. Data will be transient.\n"
+     ]
+    }
+   ],
+   "source": [
+    "vectorstore = VectorstoreIndexCreator().from_loaders([loader]).vectorstore"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f0b5d8f6",
+   "metadata": {},
+   "source": [
+    "Now we can create a question answering chain."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "8843cb0c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.chains import VectorDBQA\n",
+    "from langchain.llms import OpenAI"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "573719a0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "chain = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type=\"stuff\", vectorstore=vectorstore, input_key=\"question\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "37d669e9",
+   "metadata": {},
+   "source": [
+    "## Make a prediction\n",
+    "\n",
+    "First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows us to explore the outputs in detail, and is also a lot cheaper than running over multiple datapoints."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "id": "3089e409",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'question': 'What is the purpose of the NATO Alliance?',\n",
+       " 'answer': 'The purpose of the NATO Alliance is to secure peace and stability in Europe after World War 2.',\n",
+       " 'result': ' The NATO Alliance was created to secure peace and stability in Europe after World War 2.'}"
+      ]
+     },
+     "execution_count": 17,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "chain(dataset[0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d0c16cd7",
+   "metadata": {},
+   "source": [
+    "## Make many predictions\n",
+    "Now we can make predictions over the whole dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "24b4c66e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "predictions = chain.apply(dataset)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "49d969fb",
+   "metadata": {},
+   "source": [
+    "## Evaluate performance\n",
+    "Now we can evaluate the predictions. The first thing we can do is look at them by eye."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "1d583f03",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'question': 'What is the purpose of the NATO Alliance?',\n",
+       " 'answer': 'The purpose of the NATO Alliance is to secure peace and stability in Europe after World War 2.',\n",
+       " 'result': ' The purpose of the NATO Alliance is to secure peace and stability in Europe after World War 2.'}"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "predictions[0]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4783344b",
+   "metadata": {},
+   "source": [
+    "Next, we can use a language model to score them programmatically."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "d0a9341d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.evaluation.qa import QAEvalChain"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "1612dec1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "llm = OpenAI(temperature=0)\n",
+    "eval_chain = QAEvalChain.from_llm(llm)\n",
+    "graded_outputs = eval_chain.evaluate(dataset, predictions, question_key=\"question\", prediction_key=\"result\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "79587806",
+   "metadata": {},
+   "source": [
+    "We can add the graded output to each dict in `predictions` and then get a count of the grades."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "2a689df5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for i, prediction in enumerate(predictions):\n",
+    "    prediction['grade'] = graded_outputs[i]['text']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "27b61215",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Counter({' CORRECT': 7, ' INCORRECT': 4})"
+      ]
+     },
+     "execution_count": 12,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from collections import Counter\n",
+    "Counter([pred['grade'] for pred in predictions])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "12fe30f4",
+   "metadata": {},
+   "source": [
+    "We can also filter the datapoints to the incorrect examples and look at them."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "47c692a1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "incorrect = [pred for pred in predictions if pred['grade'] == \" INCORRECT\"]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "0ef976c1",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'question': 'What is the U.S. Department of Justice doing to combat the crimes of Russian oligarchs?',\n",
+       " 'answer': 'The U.S. Department of Justice is assembling a dedicated task force to go after the crimes of Russian oligarchs.',\n",
+       " 'result': ' The U.S. Department of Justice is assembling a dedicated task force to go after the crimes of Russian oligarchs and is naming a chief prosecutor for pandemic fraud.',\n",
+       " 'grade': ' INCORRECT'}"
+      ]
+     },
+     "execution_count": 14,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "incorrect[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7710401a",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/docs/use_cases/evaluation/qa_generation.ipynb
+++ b/docs/use_cases/evaluation/qa_generation.ipynb
@ -0,0 +1,117 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ee2a3a21",
+   "metadata": {},
+   "source": [
+    "# QA Generation\n",
+    "This notebook shows how to use the `QAGenerationChain` to come up with question-answer pairs over a specific document.\n",
+    "This matters because you often may not have data to evaluate your question-answering system on, so this is a cheap and lightweight way to generate it!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "33d3f0b4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.document_loaders import TextLoader"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "2029a29c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = TextLoader(\"../../modules/state_of_the_union.txt\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "87edb84c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "doc = loader.load()[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "04125b6d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.chat_models import ChatOpenAI\n",
+    "from langchain.chains import QAGenerationChain\n",
+    "chain = QAGenerationChain.from_llm(ChatOpenAI(temperature=0))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "4f1593e4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "qa = chain.run(doc.page_content)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "ee831f92",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'question': 'What is the U.S. Department of Justice doing to combat the crimes of Russian oligarchs?',\n",
+       " 'answer': 'The U.S. Department of Justice is assembling a dedicated task force to go after the crimes of Russian oligarchs.'}"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "qa[1]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7028754e",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/docs/use_cases/evaluation/question_answering.ipynb
+++ b/docs/use_cases/evaluation/question_answering.ipynb
@ -191,7 +191,6 @@
   ]
  },
  {
-   "attachments": {},
   "cell_type": "markdown",
   "id": "782ae8c8",
   "metadata": {},
@ -316,7 +315,7 @@
 ],
 "metadata": {
  "kernelspec": {
-   "display_name": ".venv",
+   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
@ -330,7 +329,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.9.7 (default, Sep 16 2021, 08:50:36) \n[Clang 10.0.0 ]"
+   "version": "3.9.1"
  },
  "vscode": {
   "interpreter": {
--- a/docs/use_cases/evaluation/sql_qa_benchmarking_chinook.ipynb
+++ b/docs/use_cases/evaluation/sql_qa_benchmarking_chinook.ipynb
@ -0,0 +1,423 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "984169ca",
+   "metadata": {},
+   "source": [
+    "# SQL Question Answering Benchmarking: Chinook\n",
+    "\n",
+    "Here we go over how to benchmark performance on a question answering task over a SQL database.\n",
+    "\n",
+    "It is highly recommended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
+   "id": "44874486",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Comment this out if you are NOT using tracing\n",
+    "import os\n",
+    "os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0f66405e",
+   "metadata": {},
+   "source": [
+    "## Loading the data\n",
+    "\n",
+    "First, let's load the data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "0df1393f",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "b220d07ee5d14909bc842b4545cdc0de",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Downloading readme:   0%|          | 0.00/21.0 [00:00<?, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Downloading and preparing dataset json/LangChainDatasets--sql-qa-chinook to /Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--sql-qa-chinook-7528565d2d992b47/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "e89e3c8ef76f49889c4b39c624828c71",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "a8421df6c26045e8978c7086cb418222",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Downloading data:   0%|          | 0.00/1.44k [00:00<?, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "d1fb6becc3324a85bf039a53caf30924",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Generating train split: 0 examples [00:00, ? examples/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Dataset json downloaded and prepared to /Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--sql-qa-chinook-7528565d2d992b47/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "9d68ad1b3e4a4bd79f92597aac4d3cc9",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "  0%|          | 0/1 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "from langchain.evaluation.loading import load_dataset\n",
+    "dataset = load_dataset(\"sql-qa-chinook\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "ab44d504",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'question': 'How many employees are there?', 'answer': '8'}"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "dataset[0]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8a16b75d",
+   "metadata": {},
+   "source": [
+    "## Setting up a chain\n",
+    "This uses the example Chinook database.\n",
+    "To set it up, follow the instructions at https://database.guide/2-sample-databases-sqlite/, placing the `.db` file in a `notebooks` folder at the root of this repository.\n",
+    "\n",
+    "Note that here we load a simple chain. If you want to experiment with more complex chains, or an agent, just create the `chain` object in a different way."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "5b2d5e98",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain import OpenAI, SQLDatabase, SQLDatabaseChain"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "33cdcbfc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "db = SQLDatabase.from_uri(\"sqlite:///../../../notebooks/Chinook.db\")\n",
+    "llm = OpenAI(temperature=0)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f0b5d8f6",
+   "metadata": {},
+   "source": [
+    "Now we can create a SQL database chain."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "8843cb0c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "chain = SQLDatabaseChain(llm=llm, database=db, input_key=\"question\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6c0062e7",
+   "metadata": {},
+   "source": [
+    "## Make a prediction\n",
+    "\n",
+    "First, we can make predictions one datapoint at a time. Working at this level of granularity allows us to explore the outputs in detail, and it is also a lot cheaper than running over multiple datapoints."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 27,
+   "id": "d28c5e7d",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'question': 'How many employees are there?',\n",
+       " 'answer': '8',\n",
+       " 'result': ' There are 8 employees.'}"
+      ]
+     },
+     "execution_count": 27,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "chain(dataset[0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d0c16cd7",
+   "metadata": {},
+   "source": [
+    "## Make many predictions\n",
+    "Now we can make predictions over the whole dataset. We wrap each call in a try-except because this chain can sometimes raise an error (for example, if the generated SQL is invalid)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "id": "24b4c66e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "predictions = []\n",
+    "predicted_dataset = []\n",
+    "error_dataset = []\n",
+    "for data in dataset:\n",
+    "    try:\n",
+    "        predictions.append(chain(data))\n",
+    "        predicted_dataset.append(data)\n",
+    "    except Exception:\n",
+    "        error_dataset.append(data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4783344b",
+   "metadata": {},
+   "source": [
+    "## Evaluate performance\n",
+    "Now we can evaluate the predictions. We can use a language model to score them programmatically."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "id": "d0a9341d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.evaluation.qa import QAEvalChain"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "id": "1612dec1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "llm = OpenAI(temperature=0)\n",
+    "eval_chain = QAEvalChain.from_llm(llm)\n",
+    "graded_outputs = eval_chain.evaluate(predicted_dataset, predictions, question_key=\"question\", prediction_key=\"result\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "79587806",
+   "metadata": {},
+   "source": [
+    "We can add the graded output to each dict in `predictions` and then get a count of the grades."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "id": "2a689df5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for i, prediction in enumerate(predictions):\n",
+    "    prediction['grade'] = graded_outputs[i]['text']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "id": "27b61215",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Counter({' CORRECT': 3, ' INCORRECT': 4})"
+      ]
+     },
+     "execution_count": 24,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from collections import Counter\n",
+    "Counter([pred['grade'] for pred in predictions])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "12fe30f4",
+   "metadata": {},
+   "source": [
+    "We can also filter the datapoints to the incorrect examples and look at them."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "id": "47c692a1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "incorrect = [pred for pred in predictions if pred['grade'] == \" INCORRECT\"]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "id": "0ef976c1",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'question': 'How many employees are also customers?',\n",
+       " 'answer': 'None',\n",
+       " 'result': ' 59 employees are also customers.',\n",
+       " 'grade': ' INCORRECT'}"
+      ]
+     },
+     "execution_count": 26,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "incorrect[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7710401a",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
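Note on the grade tally in the notebook above: the grader output carries a leading space (`' CORRECT'`, `' INCORRECT'`), and the filter cell matches on the exact string `" INCORRECT"`. A minimal sketch of a whitespace-tolerant tally follows. It assumes only the dict shape shown in the notebook outputs; `normalize_grade` is a hypothetical helper, not part of LangChain, and the third record is made up for illustration.

```python
from collections import Counter

# Illustrative records shaped like the notebook's `predictions` list after
# the 'grade' key has been added. The third entry is invented for the demo.
predictions = [
    {"question": "How many employees are there?", "answer": "8",
     "result": " There are 8 employees.", "grade": " CORRECT"},
    {"question": "How many employees are also customers?", "answer": "None",
     "result": " 59 employees are also customers.", "grade": " INCORRECT"},
    {"question": "placeholder question", "answer": "3",
     "result": " 3", "grade": "CORRECT"},
]


def normalize_grade(grade: str) -> str:
    """Hypothetical helper: strip whitespace and upper-case the label."""
    return grade.strip().upper()


# Count grades after normalization so ' CORRECT' and 'CORRECT' are merged.
counts = Counter(normalize_grade(p["grade"]) for p in predictions)

# Filter the incorrect examples without depending on the leading space.
incorrect = [p for p in predictions if normalize_grade(p["grade"]) == "INCORRECT"]

# Fraction of successfully predicted examples graded as correct.
accuracy = counts["CORRECT"] / len(predictions)
```

Examples that raised an error during prediction live in `error_dataset` in the notebook and are not part of this tally, so the accuracy here covers only the examples that produced a prediction.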
--- a/langchain/chains/__init__.py
+++ b/langchain/chains/__init__.py
@ -16,6 +16,7 @@ from langchain.chains.loading import load_chain
 from langchain.chains.mapreduce import MapReduceChain
 from langchain.chains.moderation import OpenAIModerationChain
 from langchain.chains.pal.base import PALChain
+from langchain.chains.qa_generation.base import QAGenerationChain
 from langchain.chains.qa_with_sources.base import QAWithSourcesChain
 from langchain.chains.qa_with_sources.vector_db import VectorDBQAWithSourcesChain
 from langchain.chains.sequential import SequentialChain, SimpleSequentialChain
@ -52,4 +53,5 @@ __all__ = [
    "ChatVectorDBChain",
    "GraphQAChain",
    "ConstitutionalChain",
+    "QAGenerationChain",
 ]
--- a/langchain/chains/qa_generation/__init__.py
+++ b/langchain/chains/qa_generation/__init__.py
--- a/langchain/chains/qa_generation/base.py
+++ b/langchain/chains/qa_generation/base.py
@ -0,0 +1,55 @@
+from __future__ import annotations
+
+import json
+from typing import Any, Dict, List, Optional
+
+from pydantic import Field
+
+from langchain.chains.base import Chain
+from langchain.chains.llm import LLMChain
+from langchain.chains.qa_generation.prompt import PROMPT_SELECTOR
+from langchain.prompts.base import BasePromptTemplate
+from langchain.schema import BaseLanguageModel
+from langchain.text_splitter import RecursiveCharacterTextSplitter, TextSplitter
+
+
+class QAGenerationChain(Chain):
+    llm_chain: LLMChain
+    text_splitter: TextSplitter = Field(
+        default=RecursiveCharacterTextSplitter(chunk_overlap=500)
+    )
+    input_key: str = "text"
+    output_key: str = "questions"
+    k: Optional[int] = None
+
+    @classmethod
+    def from_llm(
+        cls,
+        llm: BaseLanguageModel,
+        prompt: Optional[BasePromptTemplate] = None,
+        **kwargs: Any,
+    ) -> QAGenerationChain:
+        _prompt = prompt or PROMPT_SELECTOR.get_prompt(llm)
+        chain = LLMChain(llm=llm, prompt=_prompt)
+        return cls(llm_chain=chain, **kwargs)
+
+    @property
+    def _chain_type(self) -> str:
+        raise NotImplementedError
+
+    @property
+    def input_keys(self) -> List[str]:
+        return [self.input_key]
+
+    @property
+    def output_keys(self) -> List[str]:
+        return [self.output_key]
+
+    def _call(self, inputs: Dict[str, str]) -> Dict[str, Any]:
+        docs = self.text_splitter.create_documents([inputs[self.input_key]])
+        results = self.llm_chain.generate([{"text": d.page_content} for d in docs])
+        qa = [json.loads(res[0].text) for res in results.generations]
+        return {self.output_key: qa}
+
+    async def _acall(self, inputs: Dict[str, str]) -> Dict[str, str]:
+        raise NotImplementedError
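Note on `_call` above: each generation's text is passed straight to `json.loads`, while the prompt asks the model to wrap its answer in a triple-backtick fence. If a model reproduces that fence in its output, `json.loads` would raise a `JSONDecodeError`. The sketch below shows one way to tolerate an optional fence before parsing. `parse_qa_json` is a hypothetical helper and is not part of this commit; whether a given model actually emits the fence is an assumption.

```python
import json
from typing import Dict

# The triple-backtick delimiter that the prompt asks the model to use.
FENCE = "`" * 3


def parse_qa_json(text: str) -> Dict[str, str]:
    """Hypothetical helper: parse a Q/A JSON object, with or without a fence."""
    cleaned = text.strip()
    fenced = (
        len(cleaned) >= 2 * len(FENCE)
        and cleaned.startswith(FENCE)
        and cleaned.endswith(FENCE)
    )
    if fenced:
        # Drop the opening and closing fence.
        cleaned = cleaned[len(FENCE):-len(FENCE)]
        # Drop an optional language tag such as "json" on the first line.
        first_line, newline, rest = cleaned.partition("\n")
        if newline and first_line.strip().isalpha():
            cleaned = rest
    return json.loads(cleaned)


expected = {"question": "Q?", "answer": "A"}
plain = '{"question": "Q?", "answer": "A"}'
fenced_output = FENCE + "\n" + plain + "\n" + FENCE
tagged_output = FENCE + "json\n" + plain + "\n" + FENCE

parsed = [parse_qa_json(s) for s in (plain, fenced_output, tagged_output)]
```

All three inputs parse to the same dict, so a helper like this could stand in for the bare `json.loads` call if fenced outputs turn out to be common.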
--- a/langchain/chains/qa_generation/prompt.py
+++ b/langchain/chains/qa_generation/prompt.py
@ -0,0 +1,50 @@
+# flake8: noqa
+from langchain.chains.prompt_selector import ConditionalPromptSelector, is_chat_model
+from langchain.prompts.chat import (
+    ChatPromptTemplate,
+    HumanMessagePromptTemplate,
+    SystemMessagePromptTemplate,
+)
+from langchain.prompts.prompt import PromptTemplate
+
+templ1 = """You are a smart assistant designed to help high school teachers come up with reading comprehension questions.
+Given a piece of text, you must come up with a question and answer pair that can be used to test a student's reading comprehension abilities.
+When coming up with this question/answer pair, you must respond in the following format:
+```
+{{
+    "question": "$YOUR_QUESTION_HERE",
+    "answer": "$THE_ANSWER_HERE"
+}}
+```
+
+Everything between the ``` must be valid json.
+"""
+templ2 = """Please come up with a question/answer pair, in the specified JSON format, for the following text:
+----------------
+{text}"""
+CHAT_PROMPT = ChatPromptTemplate.from_messages(
+    [
+        SystemMessagePromptTemplate.from_template(templ1),
+        HumanMessagePromptTemplate.from_template(templ2),
+    ]
+)
+templ = """You are a smart assistant designed to help high school teachers come up with reading comprehension questions.
+Given a piece of text, you must come up with a question and answer pair that can be used to test a student's reading comprehension abilities.
+When coming up with this question/answer pair, you must respond in the following format:
+```
+{{
+    "question": "$YOUR_QUESTION_HERE",
+    "answer": "$THE_ANSWER_HERE"
+}}
+```
+
+Everything between the ``` must be valid json.
+
+Please come up with a question/answer pair, in the specified JSON format, for the following text:
+----------------
+{text}"""
+PROMPT = PromptTemplate.from_template(templ)
+
+PROMPT_SELECTOR = ConditionalPromptSelector(
+    default_prompt=PROMPT, conditionals=[(is_chat_model, CHAT_PROMPT)]
+)
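Note on the doubled braces in the templates above: `{{` and `}}` are escapes, so the rendered prompt contains literal `{` and `}` around the JSON example, while `{text}` is the only substituted variable. The sketch below demonstrates this with plain `str.format`, on the assumption that the prompt templates use the same f-string-style substitution by default. The template is a shortened copy and the input sentence is made up.

```python
# Shortened copy of the template above. Doubled braces are escapes that
# render as single literal braces; {text} is the only placeholder.
template = """Respond in the following format:
{{
    "question": "$YOUR_QUESTION_HERE",
    "answer": "$THE_ANSWER_HERE"
}}
----------------
{text}"""

# Illustrative input text, not taken from the commit.
rendered = template.format(text="The Chinook database has 8 employees.")
print(rendered)
```

Writing the JSON example with single braces would instead make the formatter treat its contents as a placeholder and fail at render time.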
--- a/langchain/evaluation/loading.py
+++ b/langchain/evaluation/loading.py
@ -0,0 +1,8 @@
+from typing import Dict, List
+
+
+def load_dataset(uri: str) -> List[Dict]:
+    from datasets import load_dataset
+
+    dataset = load_dataset(f"LangChainDatasets/{uri}")
+    return [d for d in dataset["train"]]
--- a/langchain/indexes/vectorstore.py
+++ b/langchain/indexes/vectorstore.py
@ -50,8 +50,8 @@ class VectorstoreIndexCreator(BaseModel):
    """Logic for creating indexes."""

    vectorstore_cls: Type[VectorStore] = Chroma
-    text_splitter: TextSplitter = Field(default_factory=_get_default_text_splitter)
    embedding: Embeddings = Field(default_factory=OpenAIEmbeddings)
+    text_splitter: TextSplitter = Field(default_factory=_get_default_text_splitter)
    vectorstore_kwargs: dict = Field(default_factory=dict)

    class Config: