langchain/docs/extras/modules/evaluation/trajectory/trajectory_eval.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "6e5ea1a1-7e74-459b-bf14-688f87d09124",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Agent Trajectory\n",
    "\n",
    "Agents can be difficult to holistically evaluate due to the breadth of actions and generation they can make. We recommend using multiple evaluation techniques appropriate to your use case. One way to evaluate an agent is to look at the whole trajectory of actions taken along with their responses.\n",
    "\n",
    "Evaluators that do this can implement the `AgentTrajectoryEvaluator` interface. This walkthrough will show how to use the `trajectory` evaluator to grade  an OpenAI functions agent.\n",
    "\n",
    "For more information, check out the reference docs for the [TrajectoryEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.agents.trajectory_eval_chain.TrajectoryEvalChain.html#langchain.evaluation.agents.trajectory_eval_chain.TrajectoryEvalChain) for more info."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "149402da-5212-43e2-b7c0-a701727f5293",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.evaluation import load_evaluator\n",
    "\n",
    "evaluator = load_evaluator(\"trajectory\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e733562c-4c17-4942-9647-acfc5ebfaca2",
   "metadata": {},
   "source": [
    "## Capturing Trajectory\n",
    "\n",
    "The easiest way to return an agent's trajectory (without using tracing callbacks like those in LangSmith) for evaluation is to initialize the agent with `return_intermediate_steps=True`.\n",
    "\n",
    "Below, create an example agent we will call to evaluate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "451cb0cb-6f42-4abd-aa6d-fb871fce034d",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import os\n",
    "from langchain.chat_models import ChatOpenAI\n",
    "from langchain.tools import tool\n",
    "from langchain.agents import AgentType, initialize_agent\n",
    "from pydantic import HttpUrl\n",
    "import subprocess\n",
    "from urllib.parse import urlparse\n",
    "\n",
    "\n",
    "@tool\n",
    "def ping(url: HttpUrl, return_error: bool) -> str:\n",
    "    \"\"\"Ping the fully specified url. Must include https:// in the url.\"\"\"\n",
    "    hostname = urlparse(str(url)).netloc\n",
    "    completed_process = subprocess.run(\n",
    "        [\"ping\", \"-c\", \"1\", hostname], capture_output=True, text=True\n",
    "    )\n",
    "    output = completed_process.stdout\n",
    "    if return_error and completed_process.returncode != 0:\n",
    "        return completed_process.stderr\n",
    "    return output\n",
    "\n",
    "\n",
    "@tool\n",
    "def trace_route(url: HttpUrl, return_error: bool) -> str:\n",
    "    \"\"\"Trace the route to the specified url. Must include https:// in the url.\"\"\"\n",
    "    hostname = urlparse(str(url)).netloc\n",
    "    completed_process = subprocess.run(\n",
    "        [\"traceroute\", hostname], capture_output=True, text=True\n",
    "    )\n",
    "    output = completed_process.stdout\n",
    "    if return_error and completed_process.returncode != 0:\n",
    "        return completed_process.stderr\n",
    "    return output\n",
    "\n",
    "\n",
    "llm = ChatOpenAI(model=\"gpt-3.5-turbo-0613\", temperature=0)\n",
    "agent = initialize_agent(\n",
    "    llm=llm,\n",
    "    tools=[ping, trace_route],\n",
    "    agent=AgentType.OPENAI_MULTI_FUNCTIONS,\n",
    "    return_intermediate_steps=True,  # IMPORTANT!\n",
    ")\n",
    "\n",
    "result = agent(\"What's the latency like for https://langchain.com?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2df34eed-45a5-4f91-88d3-9aa55f28391a",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Evaluate Trajectory\n",
    "\n",
    "Pass the input, trajectory, and pass to the [evaluate_agent_trajectory](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.schema.AgentTrajectoryEvaluator.html#langchain.evaluation.schema.AgentTrajectoryEvaluator.evaluate_agent_trajectory) method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "8d2c8703-98ed-4068-8a8b-393f0f1f64ea",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Type <class 'langchain.agents.openai_functions_multi_agent.base._FunctionsAgentAction'> not serializable\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluation_result = evaluator.evaluate_agent_trajectory(\n",
    "    prediction=result[\"output\"],\n",
    "    input=result[\"input\"],\n",
    "    agent_trajectory=result[\"intermediate_steps\"],\n",
    ")\n",
    "evaluation_result[\"score\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc5467c1-ea92-405f-949a-3011388fa9ee",
   "metadata": {},
   "source": [
    "## Configuring the Evaluation LLM\n",
    "\n",
    "If you don't select an LLM to use for evaluation, the [load_evaluator](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.loading.load_evaluator.html#langchain.evaluation.loading.load_evaluator) function will use `gpt-4` to power the evaluation chain. You can select any chat model for the agent trajectory evaluator as below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "1f6318f3-642a-4766-bc7a-f91239795ee7",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# %pip install anthropic\n",
    "# ANTHROPIC_API_KEY=<YOUR ANTHROPIC API KEY>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "b2852289-5df9-402e-95b5-7efebf0fc943",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.chat_models import ChatAnthropic\n",
    "\n",
    "eval_llm = ChatAnthropic(temperature=0)\n",
    "evaluator = load_evaluator(\"trajectory\", llm=eval_llm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "ff72d21a-93b9-4c2f-8613-733d9c9330d7",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluation_result = evaluator.evaluate_agent_trajectory(\n",
    "    prediction=result[\"output\"],\n",
    "    input=result[\"input\"],\n",
    "    agent_trajectory=result[\"intermediate_steps\"],\n",
    ")\n",
    "evaluation_result[\"score\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95ce4240-f5a0-4810-8d09-b2f4c9e18b7f",
   "metadata": {},
   "source": [
    "## Providing List of Valid Tools\n",
    "\n",
    "By default, the evaluator doesn't take into account the tools the agent is permitted to call. You can provide these to the evaluator via the `agent_tools` argument.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "24c10566-2ef5-45c5-9213-a8fb28e2ca1f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.evaluation import load_evaluator\n",
    "\n",
    "evaluator = load_evaluator(\"trajectory\", agent_tools=[ping, trace_route])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "7b995786-5b78-4d9e-8e8a-1f2a203113e2",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluation_result = evaluator.evaluate_agent_trajectory(\n",
    "    prediction=result[\"output\"],\n",
    "    input=result[\"input\"],\n",
    "    agent_trajectory=result[\"intermediate_steps\"],\n",
    ")\n",
    "evaluation_result[\"score\"]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}