{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluating Agent Trajectories\n",
"\n",
"Good evaluation is key for quickly iterating on your agent's prompts and tools. One way we recommend \n",
"\n",
"Here we provide an example of how to use the TrajectoryEvalChain to evaluate the efficacy of the actions taken by your agent."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"Let's start by defining our agent."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain import Wikipedia\n",
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.agents import initialize_agent, Tool\n",
"from langchain.agents import AgentType\n",
"from langchain.agents.react.base import DocstoreExplorer\n",
"from langchain.memory import ConversationBufferMemory\n",
"from langchain import LLMMathChain\n",
"from langchain.llms import OpenAI\n",
"\n",
"from langchain import SerpAPIWrapper\n",
"\n",
"docstore = DocstoreExplorer(Wikipedia())\n",
"\n",
"math_llm = OpenAI(temperature=0)\n",
"\n",
"llm_math_chain = LLMMathChain.from_llm(llm=math_llm, verbose=True)\n",
"\n",
"search = SerpAPIWrapper()\n",
"\n",
"tools = [\n",
" Tool(\n",
" name=\"Search\",\n",
" func=docstore.search,\n",
" description=\"useful for when you need to ask with search. Must call before lookup.\",\n",
" ),\n",
" Tool(\n",
" name=\"Lookup\",\n",
" func=docstore.lookup,\n",
" description=\"useful for when you need to ask with lookup. Only call after a successfull 'Search'.\",\n",
" ),\n",
" Tool(\n",
" name=\"Calculator\",\n",
" func=llm_math_chain.run,\n",
" description=\"useful for arithmetic. Expects strict numeric input, no words.\",\n",
" ),\n",
" Tool(\n",
" name=\"Search-the-Web-SerpAPI\",\n",
" func=search.run,\n",
" description=\"useful for when you need to answer questions about current events\",\n",
" ),\n",
"]\n",
"\n",
"memory = ConversationBufferMemory(\n",
" memory_key=\"chat_history\", return_messages=True, output_key=\"output\"\n",
")\n",
"\n",
"llm = ChatOpenAI(temperature=0, model_name=\"gpt-3.5-turbo-0613\")\n",
"\n",
"agent = initialize_agent(\n",
" tools,\n",
" llm,\n",
" agent=AgentType.OPENAI_FUNCTIONS,\n",
" verbose=True,\n",
" memory=memory,\n",
" return_intermediate_steps=True, # This is needed for the evaluation later\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test the Agent\n",
"\n",
"Now let's try our agent out on some example queries."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\u001b[1m> Entering new chain...\u001b[0m\n",
"\u001b[32;1m\u001b[1;3m\n",
"Invoking: `Calculator` with `1040000 / (4/100)^3 / 1000000`\n",
"responded: {content}\n",
"\n",
"\u001b[0m\n",
"\n",
"\u001b[1m> Entering new chain...\u001b[0m\n",
"1040000 / (4/100)^3 / 1000000\u001b[32;1m\u001b[1;3m```text\n",
"1040000 / (4/100)**3 / 1000000\n",
"```\n",
"...numexpr.evaluate(\"1040000 / (4/100)**3 / 1000000\")...\n",
"\u001b[0m\n",
"Answer: \u001b[33;1m\u001b[1;3m16249.999999999998\u001b[0m\n",
"\u001b[1m> Finished chain.\u001b[0m\n",
"\u001b[38;5;200m\u001b[1;3mAnswer: 16249.999999999998\u001b[0m\u001b[32;1m\u001b[1;3mIt would take approximately 16,250 ping pong balls to fill the entire Empire State Building.\u001b[0m\n",
"\n",
"\u001b[1m> Finished chain.\u001b[0m\n"
]
}
],
"source": [
"query_one = (\n",
" \"How many ping pong balls would it take to fill the entire Empire State Building?\"\n",
")\n",
"\n",
"test_outputs_one = agent({\"input\": query_one}, return_only_outputs=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This looks alright.. Let's try it out on another query."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\u001b[1m> Entering new chain...\u001b[0m\n",
"\u001b[32;1m\u001b[1;3m\n",
"Invoking: `Search` with `length of the US from coast to coast`\n",
"\n",
"\n",
"\u001b[0m\u001b[36;1m\u001b[1;3m\n",
"== Watercraft ==\u001b[0m\u001b[32;1m\u001b[1;3m\n",
"Invoking: `Search` with `distance from coast to coast of the US`\n",
"\n",
"\n",
"\u001b[0m\u001b[36;1m\u001b[1;3mThe Oregon Coast is a coastal region of the U.S. state of Oregon. It is bordered by the Pacific Ocean to its west and the Oregon Coast Range to the east, and stretches approximately 362 miles (583 km) from the California state border in the south to the Columbia River in the north. The region is not a specific geological, environmental, or political entity, and includes the Columbia River Estuary.\n",
"The Oregon Beach Bill of 1967 allows free beach access to everyone. In return for a pedestrian easement and relief from construction, the bill eliminates property taxes on private beach land and allows its owners to retain certain beach land rights.Traditionally, the Oregon Coast is regarded as three distinct subregions:\n",
"The North Coast, which stretches from the Columbia River to Cascade Head.\n",
"The Central Coast, which stretches from Cascade Head to Reedsport.\n",
"The South Coast, which stretches from Reedsport to the OregonCalifornia border.The largest city is Coos Bay, population 16,700 in Coos County on the South Coast. U.S. Route 101 is the primary highway from Brookings to Astoria and is known for its scenic overlooks of the Pacific Ocean. Over 80 state parks and recreation areas dot the Oregon Coast. However, only a few highways cross the Coast Range to the interior: US 30, US 26, OR 6, US 20, OR 18, OR 34, OR 126, OR 38, and OR 42. OR 18 and US 20 are considered among the dangerous roads in the state.The Oregon Coast includes Clatsop County, Tillamook County, Lincoln County, western Lane County, western Douglas County, Coos County, and Curry County.\u001b[0m\u001b[32;1m\u001b[1;3m\n",
"Invoking: `Calculator` with `362 miles * 5280 feet`\n",
"\n",
"\n",
"\u001b[0m\n",
"\n",
"\u001b[1m> Entering new chain...\u001b[0m\n",
"362 miles * 5280 feet\u001b[32;1m\u001b[1;3m```text\n",
"362 * 5280\n",
"```\n",
"...numexpr.evaluate(\"362 * 5280\")...\n",
"\u001b[0m\n",
"Answer: \u001b[33;1m\u001b[1;3m1911360\u001b[0m\n",
"\u001b[1m> Finished chain.\u001b[0m\n",
"\u001b[38;5;200m\u001b[1;3mAnswer: 1911360\u001b[0m\u001b[32;1m\u001b[1;3m\n",
"Invoking: `Calculator` with `1911360 feet / 1063 feet`\n",
"\n",
"\n",
"\u001b[0m\n",
"\n",
"\u001b[1m> Entering new chain...\u001b[0m\n",
"1911360 feet / 1063 feet\u001b[32;1m\u001b[1;3m```text\n",
"1911360 / 1063\n",
"```\n",
"...numexpr.evaluate(\"1911360 / 1063\")...\n",
"\u001b[0m\n",
"Answer: \u001b[33;1m\u001b[1;3m1798.0809031044214\u001b[0m\n",
"\u001b[1m> Finished chain.\u001b[0m\n",
"\u001b[38;5;200m\u001b[1;3mAnswer: 1798.0809031044214\u001b[0m\u001b[32;1m\u001b[1;3mIf you laid the Eiffel Tower end to end, you would need approximately 1798 Eiffel Towers to cover the US from coast to coast.\u001b[0m\n",
"\n",
"\u001b[1m> Finished chain.\u001b[0m\n"
]
}
],
"source": [
"query_two = \"If you laid the Eiffel Tower end to end, how many would you need cover the US from coast to coast?\"\n",
"\n",
"test_outputs_two = agent({\"input\": query_two}, return_only_outputs=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This doesn't look so good. Let's try running some evaluation.\n",
"\n",
"## Evaluating the Agent\n",
"\n",
"Let's start by defining the TrajectoryEvalChain."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.evaluation.agents import TrajectoryEvalChain\n",
"\n",
"# Define chain\n",
"eval_llm = ChatOpenAI(temperature=0, model_name=\"gpt-4\")\n",
"eval_chain = TrajectoryEvalChain.from_llm(\n",
" llm=eval_llm, # Note: This must be a chat model\n",
" agent_tools=agent.tools,\n",
" return_reasoning=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try evaluating the first query."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Score from 1 to 5: 1\n",
"Reasoning: i. Is the final answer helpful?\n",
"The final answer is not helpful because it is incorrect. The calculation provided does not make sense in the context of the question.\n",
"\n",
"ii. Does the AI language use a logical sequence of tools to answer the question?\n",
"The AI language model does not use a logical sequence of tools. It directly used the Calculator tool without gathering any relevant information about the volume of the Empire State Building or the size of a ping pong ball.\n",
"\n",
"iii. Does the AI language model use the tools in a helpful way?\n",
"The AI language model does not use the tools in a helpful way. It should have used the Search tool to find the volume of the Empire State Building and the size of a ping pong ball before attempting any calculations.\n",
"\n",
"iv. Does the AI language model use too many steps to answer the question?\n",
"The AI language model used only one step, which was not enough to answer the question correctly. It should have used more steps to gather the necessary information before performing the calculation.\n",
"\n",
"v. Are the appropriate tools used to answer the question?\n",
"The appropriate tools were not used to answer the question. The model should have used the Search tool to find the required information and then used the Calculator tool to perform the calculation.\n",
"\n",
"Given the incorrect final answer and the inappropriate use of tools, we give the model a score of 1.\n"
]
}
],
"source": [
"question, steps, answer = (\n",
" test_outputs_one[\"input\"],\n",
" test_outputs_one[\"intermediate_steps\"],\n",
" test_outputs_one[\"output\"],\n",
")\n",
"\n",
"evaluation = eval_chain.evaluate_agent_trajectory(\n",
" input=test_outputs_one[\"input\"],\n",
" output=test_outputs_one[\"output\"],\n",
" agent_trajectory=test_outputs_one[\"intermediate_steps\"],\n",
")\n",
"\n",
"print(\"Score from 1 to 5: \", evaluation[\"score\"])\n",
"print(\"Reasoning: \", evaluation[\"reasoning\"])"
]
},
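{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to see exactly what the evaluator is scoring, you can inspect the trajectory yourself: with `return_intermediate_steps=True`, each step is an `(AgentAction, observation)` pair. The cell below is an illustrative sketch (not executed in this notebook) that prints each tool call and the start of its observation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: inspect the trajectory passed to the evaluator.\n",
"# Each intermediate step is an (AgentAction, observation) pair.\n",
"for action, observation in test_outputs_one[\"intermediate_steps\"]:\n",
"    print(f\"Tool: {action.tool}\")\n",
"    print(f\"Tool input: {action.tool_input}\")\n",
"    print(f\"Observation: {str(observation)[:200]}\")\n",
"    print(\"---\")"
]
},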
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**That seems about right. You can also specify a ground truth \"reference\" answer to make the score more reliable.**"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Score from 1 to 5: 1\n",
"Reasoning: i. Is the final answer helpful?\n",
"The final answer is not helpful, as it is incorrect. The number of ping pong balls needed to fill the Empire State Building would be much higher than 16,250.\n",
"\n",
"ii. Does the AI language use a logical sequence of tools to answer the question?\n",
"The AI language model does not use a logical sequence of tools. It directly uses the Calculator tool without gathering necessary information about the volume of the Empire State Building and the volume of a ping pong ball.\n",
"\n",
"iii. Does the AI language model use the tools in a helpful way?\n",
"The AI language model does not use the tools in a helpful way. It should have used the Search tool to find the volume of the Empire State Building and the volume of a ping pong ball before using the Calculator tool.\n",
"\n",
"iv. Does the AI language model use too many steps to answer the question?\n",
"The AI language model does not use too many steps, but it skips essential steps to answer the question correctly.\n",
"\n",
"v. Are the appropriate tools used to answer the question?\n",
"The appropriate tools are not used to answer the question. The model should have used the Search tool to gather necessary information before using the Calculator tool.\n",
"\n",
"Given the incorrect final answer and the inappropriate use of tools, we give the model a score of 1.\n"
]
}
],
"source": [
"evaluation = eval_chain.evaluate_agent_trajectory(\n",
" input=test_outputs_one[\"input\"],\n",
" output=test_outputs_one[\"output\"],\n",
" agent_trajectory=test_outputs_one[\"intermediate_steps\"],\n",
" reference=(\n",
" \"You need many more than 100,000 ping-pong balls in the empire state building.\"\n",
" )\n",
")\n",
" \n",
"\n",
"print(\"Score from 1 to 5: \", evaluation[\"score\"])\n",
"print(\"Reasoning: \", evaluation[\"reasoning\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's try the second query. This time, use the async API. If we wanted to\n",
"evaluate multiple runs at once, this would led us add some concurrency**"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Score from 1 to 5: 2\n",
"Reasoning: i. Is the final answer helpful?\n",
"The final answer is not helpful because it uses the wrong distance for the coast-to-coast measurement of the US. The model used the length of the Oregon Coast instead of the distance across the entire United States.\n",
"\n",
"ii. Does the AI language use a logical sequence of tools to answer the question?\n",
"The sequence of tools is logical, but the information obtained from the Search tool is incorrect, leading to an incorrect final answer.\n",
"\n",
"iii. Does the AI language model use the tools in a helpful way?\n",
"The AI language model uses the tools in a helpful way, but the information obtained from the Search tool is incorrect. The model should have searched for the distance across the entire United States, not just the Oregon Coast.\n",
"\n",
"iv. Does the AI language model use too many steps to answer the question?\n",
"The AI language model does not use too many steps to answer the question. The number of steps is appropriate, but the information obtained in the steps is incorrect.\n",
"\n",
"v. Are the appropriate tools used to answer the question?\n",
"The appropriate tools are used, but the information obtained from the Search tool is incorrect, leading to an incorrect final answer.\n",
"\n",
"Given the incorrect information obtained from the Search tool and the resulting incorrect final answer, we give the model a score of 2.\n"
]
}
],
"source": [
"evaluation = await eval_chain.aevaluate_agent_trajectory(\n",
" input=test_outputs_two[\"input\"],\n",
" output=test_outputs_two[\"output\"],\n",
" agent_trajectory=test_outputs_two[\"intermediate_steps\"],\n",
")\n",
"\n",
"print(\"Score from 1 to 5: \", evaluation[\"score\"])\n",
"print(\"Reasoning: \", evaluation[\"reasoning\"])"
]
},
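{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you had many runs to score, the async API lets you fan the evaluations out concurrently. The cell below is an illustrative sketch (not executed here) that gathers the two runs from above with `asyncio.gather`; the same pattern extends to any number of runs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"\n",
"# Illustrative sketch: score several agent runs concurrently.\n",
"# Assumes the run dictionaries produced earlier in this notebook.\n",
"runs = [test_outputs_one, test_outputs_two]\n",
"\n",
"coros = [\n",
"    eval_chain.aevaluate_agent_trajectory(\n",
"        input=run[\"input\"],\n",
"        output=run[\"output\"],\n",
"        agent_trajectory=run[\"intermediate_steps\"],\n",
"    )\n",
"    for run in runs\n",
"]\n",
"evaluations = await asyncio.gather(*coros)\n",
"\n",
"for run, evaluation in zip(runs, evaluations):\n",
"    print(run[\"input\"])\n",
"    print(\"Score from 1 to 5: \", evaluation[\"score\"])"
]
},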
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"In this example, you evaluated an agent based its entire \"trajectory\" using the `TrajectoryEvalChain`. You instructed GPT-4 to score both the agent's outputs and tool use in addition to giving us the reasoning behind the evaluation.\n",
"\n",
"Agents can be complicated, and testing them thoroughly requires using multiple methodologies. Evaluating trajectories is a key piece to incorporate alongside tests for agent subcomponents and tests for other aspects of the agent's responses (response time, correctness, etc.) "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
},
"vscode": {
"interpreter": {
"hash": "06ba49dd587e86cdcfee66b9ffe769e1e94f0e368e54c2d6c866e38e33c0d9b1"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}