langchain/docs/use_cases/evaluation/agent_benchmarking.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "984169ca",
   "metadata": {},
   "source": [
    "# Agent Benchmarking: Search + Calculator\n",
    "\n",
    "This notebook shows how to benchmark the performance of an agent on tasks where it has access to a calculator and a search tool.\n",
    "\n",
    "It is highly recommended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "46bf9205",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Comment this out if you are NOT using tracing\n",
    "import os\n",
    "os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a16b75d",
   "metadata": {},
   "source": [
    "## Loading the data\n",
    "First, let's load the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "5b2d5e98",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Found cached dataset json (/Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--agent-search-calculator-8a025c0ce5fb99d2/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "3a275586643f4ccfba1a8d54be28c351",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/1 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from langchain.evaluation.loading import load_dataset\n",
    "dataset = load_dataset(\"agent-search-calculator\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ab6a716",
   "metadata": {},
   "source": [
    "## Setting up a chain\n",
    "Now we need to load an agent capable of answering these questions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "c18680b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.llms import OpenAI\n",
    "from langchain.agents import initialize_agent, load_tools\n",
    "from langchain.agents.agent_types import AgentType\n",
    "\n",
    "# Use the same temperature-0 LLM for the math tool and for the agent itself\n",
    "llm = OpenAI(temperature=0)\n",
    "# 'serpapi' provides the search tool, 'llm-math' the calculator\n",
    "tools = load_tools(['serpapi', 'llm-math'], llm=llm)\n",
    "agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68504a8f",
   "metadata": {},
   "source": [
    "## Make a prediction\n",
    "\n",
    "First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows us to explore the outputs in detail, and is also a lot cheaper than running over multiple datapoints."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "cbcafc92",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'38,630,316 people live in Canada as of 2023.'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "agent.run(dataset[0]['question'])"
   ]
  },
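  {
   "cell_type": "markdown",
   "id": "d0c16cd7",
   "metadata": {},
   "source": [
    "## Make many predictions\n",
    "Now we can make predictions over the full dataset. Datapoints on which the agent raises an error are collected separately in `error_dataset`."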
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "24b4c66e",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised APIConnectionError: Error communicating with OpenAI: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer')).\n"
     ]
    }
   ],
   "source": [
    "predictions = []        # agent outputs for datapoints that ran successfully\n",
    "predicted_dataset = []  # the inputs matching `predictions`, in the same order\n",
    "error_dataset = []      # inputs on which the agent raised an exception\n",
    "for data in dataset:\n",
    "    new_data = {\"input\": data[\"question\"], \"answer\": data[\"answer\"]}\n",
    "    try:\n",
    "        predictions.append(agent(new_data))\n",
    "        predicted_dataset.append(new_data)\n",
    "    except Exception:\n",
    "        error_dataset.append(new_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49d969fb",
   "metadata": {},
   "source": [
    "## Evaluate performance\n",
    "Now we can evaluate the predictions. The first thing we can do is look at them by eye."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "1d583f03",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'input': 'How many people live in canada as of 2023?',\n",
       " 'answer': 'approximately 38,625,801',\n",
       " 'output': '38,630,316 people live in Canada as of 2023.',\n",
       " 'intermediate_steps': [(AgentAction(tool='Search', tool_input='Population of Canada 2023', log=' I need to find population data\\nAction: Search\\nAction Input: Population of Canada 2023'),\n",
       "   '38,630,316')]}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predictions[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4783344b",
   "metadata": {},
   "source": [
    "Next, we can use a language model to score them programmatically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "d0a9341d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.evaluation.qa import QAEvalChain"
   ]
  },
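  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "1612dec1",
   "metadata": {},
   "outputs": [],
   "source": [
    "llm = OpenAI(temperature=0)\n",
    "eval_chain = QAEvalChain.from_llm(llm)\n",
    "# Grade against `predicted_dataset` (not `dataset`) so that examples stay aligned\n",
    "# with `predictions` even if some datapoints errored out and were skipped above\n",
    "graded_outputs = eval_chain.evaluate(predicted_dataset, predictions, question_key=\"input\", prediction_key=\"output\")"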
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79587806",
   "metadata": {},
   "source": [
    "We can add the graded output to each dict in `predictions` and then get a count of the grades."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "2a689df5",
   "metadata": {},
   "outputs": [],
   "source": [
    "for i, prediction in enumerate(predictions):\n",
    "    prediction['grade'] = graded_outputs[i]['text']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "27b61215",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({' CORRECT': 4, ' INCORRECT': 6})"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from collections import Counter\n",
    "Counter([pred['grade'] for pred in predictions])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12fe30f4",
   "metadata": {},
   "source": [
    "We can also filter the datapoints to the incorrect examples and look at them."
   ]
  },
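  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "47c692a1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# The grader's text carries leading whitespace (e.g. ' INCORRECT'), so strip before comparing\n",
    "incorrect = [pred for pred in predictions if pred['grade'].strip() == \"INCORRECT\"]"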
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "0ef976c1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'input': \"who is dua lipa's boyfriend? what is his age raised to the .43 power?\",\n",
       " 'answer': 'her boyfriend is Romain Gravas. his age raised to the .43 power is approximately 4.9373857399466665',\n",
       " 'output': \"Isaac Carew, Dua Lipa's boyfriend, is 36 years old and his age raised to the .43 power is 4.6688516567750975.\",\n",
       " 'intermediate_steps': [(AgentAction(tool='Search', tool_input=\"Dua Lipa's boyfriend\", log=' I need to find out who Dua Lipa\\'s boyfriend is and then calculate his age raised to the .43 power\\nAction: Search\\nAction Input: \"Dua Lipa\\'s boyfriend\"'),\n",
       "   'Dua and Isaac, a model and a chef, dated on and off from 2013 to 2019. The two first split in early 2017, which is when Dua went on to date LANY ...'),\n",
       "  (AgentAction(tool='Search', tool_input='Isaac Carew age', log=' I need to find out Isaac\\'s age\\nAction: Search\\nAction Input: \"Isaac Carew age\"'),\n",
       "   '36 years'),\n",
       "  (AgentAction(tool='Calculator', tool_input='36^.43', log=' I need to calculate 36 raised to the .43 power\\nAction: Calculator\\nAction Input: 36^.43'),\n",
       "   'Answer: 4.6688516567750975\\n')],\n",
       " 'grade': ' INCORRECT'}"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "incorrect[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7710401a",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
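The grade strings in the notebook output above carry a leading space (`' CORRECT'`, `' INCORRECT'`), so exact string comparisons are easy to get wrong. The following is a minimal sketch of turning the grades into a single accuracy number, assuming each prediction is a dict with a `grade` key as built in the notebook. The helper name `summarize_grades` and the sample records are hypothetical and are not part of LangChain.

```python
from collections import Counter


def summarize_grades(predictions):
    """Count normalized grades and return (counts, accuracy)."""
    # Strip whitespace and upper-case so ' CORRECT' and 'correct' count together.
    grades = [pred["grade"].strip().upper() for pred in predictions]
    counts = Counter(grades)
    total = len(grades)
    # Guard against an empty list (e.g. every datapoint errored out).
    accuracy = counts["CORRECT"] / total if total else 0.0
    return counts, accuracy


# Hypothetical records mirroring the grade counts shown in the notebook output.
sample_predictions = [{"grade": " CORRECT"}] * 4 + [{"grade": " INCORRECT"}] * 6
counts, accuracy = summarize_grades(sample_predictions)
print(counts, accuracy)
```

Normalizing once, at the point where grades are read, keeps later filters (such as selecting the incorrect examples) independent of the exact whitespace the grading model happens to emit.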
Harrison/agent eval (#1620) Co-authored-by: jerwelborn <jeremy.welborn@gmail.com> 2023-03-14 19:37:48 +00:00
Add Enum for agent types (#2321) This pull request adds an enum class for the various types of agents used in the project, located in the `agent_types.py` file. Currently, the project is using hardcoded strings for the initialization of these agents, which can lead to errors and make the code harder to maintain. With the introduction of the new enums, the code will be more readable and less error-prone. The new enum members include: - ZERO_SHOT_REACT_DESCRIPTION - REACT_DOCSTORE - SELF_ASK_WITH_SEARCH - CONVERSATIONAL_REACT_DESCRIPTION - CHAT_ZERO_SHOT_REACT_DESCRIPTION - CHAT_CONVERSATIONAL_REACT_DESCRIPTION In this PR, I have also replaced the hardcoded strings with the appropriate enum members throughout the codebase, ensuring a smooth transition to the new approach. 2023-04-04 04:56:20 +00:00