langchain/docs/extras/guides/evaluation/trajectory/trajectory_eval.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "6e5ea1a1-7e74-459b-bf14-688f87d09124",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Agent Trajectory\n",
    "\n",
    "Agents can be difficult to evaluate holistically because of the breadth of actions and generations they can produce. We recommend using multiple evaluation techniques appropriate to your use case. One way to evaluate an agent is to look at the whole trajectory of actions taken along with their responses.\n",
    "\n",
    "Evaluators that do this can implement the `AgentTrajectoryEvaluator` interface. This walkthrough shows how to use the `trajectory` evaluator to grade an OpenAI functions agent.\n",
    "\n",
    "For more information, check out the reference docs for the [TrajectoryEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.agents.trajectory_eval_chain.TrajectoryEvalChain.html#langchain.evaluation.agents.trajectory_eval_chain.TrajectoryEvalChain)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "149402da-5212-43e2-b7c0-a701727f5293",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.evaluation import load_evaluator\n",
    "\n",
    "evaluator = load_evaluator(\"trajectory\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1c64c1a",
   "metadata": {},
   "source": [
    "## Methods\n",
    "\n",
    "\n",
    "The Agent Trajectory Evaluators are used with the [evaluate_agent_trajectory](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.agents.trajectory_eval_chain.TrajectoryEvalChain.html#langchain.evaluation.agents.trajectory_eval_chain.TrajectoryEvalChain.evaluate_agent_trajectory) (and async [aevaluate_agent_trajectory](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.agents.trajectory_eval_chain.TrajectoryEvalChain.html#langchain.evaluation.agents.trajectory_eval_chain.TrajectoryEvalChain.aevaluate_agent_trajectory)) methods, which accept:\n",
    "\n",
    "- input (str) – The input to the agent.\n",
    "- prediction (str) – The final predicted response.\n",
    "- agent_trajectory (List[Tuple[AgentAction, str]]) – The intermediate steps forming the agent trajectory, each pairing an action with the observation it produced.\n",
    "\n",
    "They return a dictionary with the following values:\n",
    "- score: Float from 0 to 1, where 1 would mean \"most effective\" and 0 would mean \"least effective\"\n",
    "- reasoning: String \"chain of thought reasoning\" from the LLM generated prior to creating the score"
   ]
  },
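  {
   "cell_type": "markdown",
   "id": "example-manual-trajectory-md",
   "metadata": {},
   "source": [
    "To make the expected shapes concrete, here is a minimal sketch that builds a one-step trajectory by hand and passes it to the evaluator loaded above. The question, prediction, and observation strings are made-up placeholders for illustration. The snippet assumes that `AgentAction` can be imported from `langchain.schema` in your installed version and that credentials for the default evaluation LLM are configured."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "example-manual-trajectory-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.schema import AgentAction\n",
    "\n",
    "# Hand-written trajectory for illustration only: each step pairs the\n",
    "# AgentAction the agent chose with the observation the tool returned.\n",
    "example_trajectory = [\n",
    "    (\n",
    "        AgentAction(\n",
    "            tool=\"ping\",\n",
    "            tool_input={\"url\": \"https://langchain.com\", \"return_error\": False},\n",
    "            log=\"\",\n",
    "        ),\n",
    "        \"1 packets transmitted, 1 received, 0% packet loss\",  # placeholder observation\n",
    "    )\n",
    "]\n",
    "\n",
    "example_result = evaluator.evaluate_agent_trajectory(\n",
    "    input=\"Is https://langchain.com reachable?\",  # placeholder question\n",
    "    prediction=\"Yes, the ping succeeded with no packet loss.\",  # placeholder answer\n",
    "    agent_trajectory=example_trajectory,\n",
    ")\n",
    "\n",
    "# The result is a dictionary with a numeric score and the LLM's reasoning.\n",
    "example_result[\"score\"], example_result[\"reasoning\"]"
   ]
  },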
  {
   "cell_type": "markdown",
   "id": "e733562c-4c17-4942-9647-acfc5ebfaca2",
   "metadata": {},
   "source": [
    "## Capturing Trajectory\n",
    "\n",
    "The easiest way to return an agent's trajectory (without using tracing callbacks like those in LangSmith) for evaluation is to initialize the agent with `return_intermediate_steps=True`.\n",
    "\n",
    "Below, we create an example agent and call it to produce a trajectory to evaluate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "451cb0cb-6f42-4abd-aa6d-fb871fce034d",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import subprocess\n",
    "\n",
    "from langchain.chat_models import ChatOpenAI\n",
    "from langchain.tools import tool\n",
    "from langchain.agents import AgentType, initialize_agent\n",
    "\n",
    "from pydantic import HttpUrl\n",
    "from urllib.parse import urlparse\n",
    "\n",
    "\n",
    "@tool\n",
    "def ping(url: HttpUrl, return_error: bool) -> str:\n",
    "    \"\"\"Ping the fully specified url. Must include https:// in the url.\"\"\"\n",
    "    hostname = urlparse(str(url)).netloc\n",
    "    completed_process = subprocess.run(\n",
    "        [\"ping\", \"-c\", \"1\", hostname], capture_output=True, text=True\n",
    "    )\n",
    "    output = completed_process.stdout\n",
    "    if return_error and completed_process.returncode != 0:\n",
    "        return completed_process.stderr\n",
    "    return output\n",
    "\n",
    "\n",
    "@tool\n",
    "def trace_route(url: HttpUrl, return_error: bool) -> str:\n",
    "    \"\"\"Trace the route to the specified url. Must include https:// in the url.\"\"\"\n",
    "    hostname = urlparse(str(url)).netloc\n",
    "    completed_process = subprocess.run(\n",
    "        [\"traceroute\", hostname], capture_output=True, text=True\n",
    "    )\n",
    "    output = completed_process.stdout\n",
    "    if return_error and completed_process.returncode != 0:\n",
    "        return completed_process.stderr\n",
    "    return output\n",
    "\n",
    "\n",
    "llm = ChatOpenAI(model=\"gpt-3.5-turbo-0613\", temperature=0)\n",
    "agent = initialize_agent(\n",
    "    llm=llm,\n",
    "    tools=[ping, trace_route],\n",
    "    agent=AgentType.OPENAI_MULTI_FUNCTIONS,\n",
    "    return_intermediate_steps=True,  # IMPORTANT!\n",
    ")\n",
    "\n",
    "result = agent(\"What's the latency like for https://langchain.com?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2df34eed-45a5-4f91-88d3-9aa55f28391a",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Evaluate Trajectory\n",
    "\n",
    "Pass the input, trajectory, and prediction to the [evaluate_agent_trajectory](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.schema.AgentTrajectoryEvaluator.html#langchain.evaluation.schema.AgentTrajectoryEvaluator.evaluate_agent_trajectory) method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "8d2c8703-98ed-4068-8a8b-393f0f1f64ea",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'score': 1.0,\n",
       " 'reasoning': \"i. The final answer is helpful. It directly answers the user's question about the latency for the website https://langchain.com.\\n\\nii. The AI language model uses a logical sequence of tools to answer the question. It uses the 'ping' tool to measure the latency of the website, which is the correct tool for this task.\\n\\niii. The AI language model uses the tool in a helpful way. It inputs the URL into the 'ping' tool and correctly interprets the output to provide the latency in milliseconds.\\n\\niv. The AI language model does not use too many steps to answer the question. It only uses one step, which is appropriate for this type of question.\\n\\nv. The appropriate tool is used to answer the question. The 'ping' tool is the correct tool to measure website latency.\\n\\nGiven these considerations, the AI language model's performance is excellent. It uses the correct tool, interprets the output correctly, and provides a helpful and direct answer to the user's question.\"}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluation_result = evaluator.evaluate_agent_trajectory(\n",
    "    prediction=result[\"output\"],\n",
    "    input=result[\"input\"],\n",
    "    agent_trajectory=result[\"intermediate_steps\"],\n",
    ")\n",
    "evaluation_result"
   ]
  },
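  {
   "cell_type": "markdown",
   "id": "example-async-trajectory-md",
   "metadata": {},
   "source": [
    "The async variant, `aevaluate_agent_trajectory`, takes the same arguments. A minimal sketch, assuming the notebook kernel supports top-level `await` (as IPython does) and reusing `result` from the agent call above; the variable name `async_result` is only illustrative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "example-async-trajectory-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Same inputs as the synchronous call, awaited instead of called directly.\n",
    "async_result = await evaluator.aevaluate_agent_trajectory(\n",
    "    prediction=result[\"output\"],\n",
    "    input=result[\"input\"],\n",
    "    agent_trajectory=result[\"intermediate_steps\"],\n",
    ")\n",
    "async_result[\"score\"]"
   ]
  },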
  {
   "cell_type": "markdown",
   "id": "fc5467c1-ea92-405f-949a-3011388fa9ee",
   "metadata": {},
   "source": [
    "## Configuring the Evaluation LLM\n",
    "\n",
    "If you don't select an LLM to use for evaluation, the [load_evaluator](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.loading.load_evaluator.html#langchain.evaluation.loading.load_evaluator) function will use `gpt-4` to power the evaluation chain. You can select any chat model for the agent trajectory evaluator as below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "1f6318f3-642a-4766-bc7a-f91239795ee7",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# %pip install anthropic\n",
    "# ANTHROPIC_API_KEY=<YOUR ANTHROPIC API KEY>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "b2852289-5df9-402e-95b5-7efebf0fc943",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.chat_models import ChatAnthropic\n",
    "\n",
    "eval_llm = ChatAnthropic(temperature=0)\n",
    "evaluator = load_evaluator(\"trajectory\", llm=eval_llm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "ff72d21a-93b9-4c2f-8613-733d9c9330d7",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'score': 1.0,\n",
       " 'reasoning': \"Here is my detailed evaluation of the AI's response:\\n\\ni. The final answer is helpful, as it directly provides the latency measurement for the requested website.\\n\\nii. The sequence of using the ping tool to measure latency is logical for this question.\\n\\niii. The ping tool is used in a helpful way, with the website URL provided as input and the output latency measurement extracted.\\n\\niv. Only one step is used, which is appropriate for simply measuring latency. More steps are not needed.\\n\\nv. The ping tool is an appropriate choice to measure latency. \\n\\nIn summary, the AI uses an optimal single step approach with the right tool and extracts the needed output. The final answer directly answers the question in a helpful way.\\n\\nOverall\"}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluation_result = evaluator.evaluate_agent_trajectory(\n",
    "    prediction=result[\"output\"],\n",
    "    input=result[\"input\"],\n",
    "    agent_trajectory=result[\"intermediate_steps\"],\n",
    ")\n",
    "evaluation_result"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95ce4240-f5a0-4810-8d09-b2f4c9e18b7f",
   "metadata": {},
   "source": [
    "## Providing List of Valid Tools\n",
    "\n",
    "By default, the evaluator doesn't take into account the tools the agent is permitted to call. You can provide these to the evaluator via the `agent_tools` argument.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "24c10566-2ef5-45c5-9213-a8fb28e2ca1f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.evaluation import load_evaluator\n",
    "\n",
    "evaluator = load_evaluator(\"trajectory\", agent_tools=[ping, trace_route])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "7b995786-5b78-4d9e-8e8a-1f2a203113e2",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'score': 1.0,\n",
       " 'reasoning': \"i. The final answer is helpful. It directly answers the user's question about the latency for the specified website.\\n\\nii. The AI language model uses a logical sequence of tools to answer the question. In this case, only one tool was needed to answer the question, and the model chose the correct one.\\n\\niii. The AI language model uses the tool in a helpful way. The 'ping' tool was used to determine the latency of the website, which was the information the user was seeking.\\n\\niv. The AI language model does not use too many steps to answer the question. Only one step was needed and used.\\n\\nv. The appropriate tool was used to answer the question. The 'ping' tool is designed to measure latency, which was the information the user was seeking.\\n\\nGiven these considerations, the AI language model's performance in answering this question is excellent.\"}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluation_result = evaluator.evaluate_agent_trajectory(\n",
    "    prediction=result[\"output\"],\n",
    "    input=result[\"input\"],\n",
    "    agent_trajectory=result[\"intermediate_steps\"],\n",
    ")\n",
    "evaluation_result"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~

2023-07-18 08:00:01 +00:00
Delete Old Evals Examples (#8252)

Still retain:
- Comparison Examples
- Data + QA walkthrough
- QA (but really minimize it)

2023-07-27 01:46:54 +00:00
Docs Nits (#7874)

Add links to reference docs

2023-07-18 08:50:14 +00:00
+								   "metadata": {
 								    "tags": []
 								   },
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								   "source": [
 								    "## Evaluate Trajectory\n",
 								    "\n",
-												Docs Nits (#7874)

Add links to reference docs
											
										
										
											2023-07-18 08:50:14 +00:00
+								    "Pass the input, trajectory, and pass to the [evaluate_agent_trajectory](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.schema.AgentTrajectoryEvaluator.html#langchain.evaluation.schema.AgentTrajectoryEvaluator.evaluate_agent_trajectory) method."
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 3,
 								   "id": "8d2c8703-98ed-4068-8a8b-393f0f1f64ea",
 								   "metadata": {
 								    "tags": []
 								   },
 								   "outputs": [
 								    {
 								     "data": {
 								      "text/plain": [
-												Delete Old Evals Examples (#8252)

Still retain:
- Comparison Examples
- Data + QA walkthrough
- QA (but really minimize it)
											
										
										
											2023-07-27 01:46:54 +00:00
+								       "{'score': 1.0,\n",
 								       " 'reasoning': \"i. The final answer is helpful. It directly answers the user's question about the latency for the website https://langchain.com.\\n\\nii. The AI language model uses a logical sequence of tools to answer the question. It uses the 'ping' tool to measure the latency of the website, which is the correct tool for this task.\\n\\niii. The AI language model uses the tool in a helpful way. It inputs the URL into the 'ping' tool and correctly interprets the output to provide the latency in milliseconds.\\n\\niv. The AI language model does not use too many steps to answer the question. It only uses one step, which is appropriate for this type of question.\\n\\nv. The appropriate tool is used to answer the question. The 'ping' tool is the correct tool to measure website latency.\\n\\nGiven these considerations, the AI language model's performance is excellent. It uses the correct tool, interprets the output correctly, and provides a helpful and direct answer to the user's question.\"}"
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								      ]
 								     },
 								     "execution_count": 3,
 								     "metadata": {},
 								     "output_type": "execute_result"
 								    }
 								   ],
 								   "source": [
 								    "evaluation_result = evaluator.evaluate_agent_trajectory(\n",
 								    "    prediction=result[\"output\"],\n",
 								    "    input=result[\"input\"],\n",
 								    "    agent_trajectory=result[\"intermediate_steps\"],\n",
 								    ")\n",
-												Delete Old Evals Examples (#8252)

Still retain:
- Comparison Examples
- Data + QA walkthrough
- QA (but really minimize it)
											
										
										
											2023-07-27 01:46:54 +00:00
+								    "evaluation_result"
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "markdown",
 								   "id": "fc5467c1-ea92-405f-949a-3011388fa9ee",
 								   "metadata": {},
 								   "source": [
 								    "## Configuring the Evaluation LLM\n",
 								    "\n",
-												Docs Nits (#7874)

Add links to reference docs
											
										
										
											2023-07-18 08:50:14 +00:00
+								    "If you don't select an LLM to use for evaluation, the [load_evaluator](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.loading.load_evaluator.html#langchain.evaluation.loading.load_evaluator) function will use `gpt-4` to power the evaluation chain. You can select any chat model for the agent trajectory evaluator as below."
-												Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section
could be further pruned and improved, but this PR adds a couple examples
for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [X] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
											
										
										
											2023-07-18 08:00:01 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 4,
 								   "id": "1f6318f3-642a-4766-bc7a-f91239795ee7",
 								   "metadata": {
 								    "tags": []
 								   },
 								   "outputs": [],
 								   "source": [
 								    "# %pip install anthropic\n",
 								    "# ANTHROPIC_API_KEY=<YOUR ANTHROPIC API KEY>"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 5,
 								   "id": "b2852289-5df9-402e-95b5-7efebf0fc943",
 								   "metadata": {
 								    "tags": []
 								   },
 								   "outputs": [],
 								   "source": [
 								    "from langchain.chat_models import ChatAnthropic\n",
 								    "\n",
 								    "eval_llm = ChatAnthropic(temperature=0)\n",
 								    "evaluator = load_evaluator(\"trajectory\", llm=eval_llm)"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 6,
 								   "id": "ff72d21a-93b9-4c2f-8613-733d9c9330d7",
 								   "metadata": {
 								    "tags": []
 								   },
 								   "outputs": [
 								    {
 								     "data": {
 								      "text/plain": [
-												Delete Old Evals Examples (#8252)

Still retain:
- Comparison Examples
- Data + QA walkthrough
- QA (but really minimize it)
											
										
										
											2023-07-27 01:46:54 +00:00
+								       "{'score': 1.0,\n",
 								       " 'reasoning': \"Here is my detailed evaluation of the AI's response:\\n\\ni. The final answer is helpful, as it directly provides the latency measurement for the requested website.\\n\\nii. The sequence of using the ping tool to measure latency is logical for this question.\\n\\niii. The ping tool is used in a helpful way, with the website URL provided as input and the output latency measurement extracted.\\n\\niv. Only one step is used, which is appropriate for simply measuring latency. More steps are not needed.\\n\\nv. The ping tool is an appropriate choice to measure latency. \\n\\nIn summary, the AI uses an optimal single step approach with the right tool and extracts the needed output. The final answer directly answers the question in a helpful way.\\n\\nOverall\"}"
 								      ]
 								     },
 								     "execution_count": 6,
 								     "metadata": {},
 								     "output_type": "execute_result"
 								    }
 								   ],
 								   "source": [
 								    "evaluation_result = evaluator.evaluate_agent_trajectory(\n",
 								    "    prediction=result[\"output\"],\n",
 								    "    input=result[\"input\"],\n",
 								    "    agent_trajectory=result[\"intermediate_steps\"],\n",
 								    ")\n",
 								    "evaluation_result"
 								   ]
 								  },
 								  {
 								   "cell_type": "markdown",
 								   "id": "95ce4240-f5a0-4810-8d09-b2f4c9e18b7f",
 								   "metadata": {},
 								   "source": [
 								    "## Providing a List of Valid Tools\n",
 								    "\n",
 								    "By default, the evaluator doesn't take into account the tools the agent is permitted to call. You can provide these to the evaluator via the `agent_tools` argument.\n"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 7,
 								   "id": "24c10566-2ef5-45c5-9213-a8fb28e2ca1f",
 								   "metadata": {
 								    "tags": []
 								   },
 								   "outputs": [],
 								   "source": [
 								    "from langchain.evaluation import load_evaluator\n",
 								    "\n",
 								    "evaluator = load_evaluator(\"trajectory\", agent_tools=[ping, trace_route])"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 8,
 								   "id": "7b995786-5b78-4d9e-8e8a-1f2a203113e2",
 								   "metadata": {
 								    "tags": []
 								   },
 								   "outputs": [
 								    {
 								     "data": {
 								      "text/plain": [
 								       "{'score': 1.0,\n",
 								       " 'reasoning': \"i. The final answer is helpful. It directly answers the user's question about the latency for the specified website.\\n\\nii. The AI language model uses a logical sequence of tools to answer the question. In this case, only one tool was needed to answer the question, and the model chose the correct one.\\n\\niii. The AI language model uses the tool in a helpful way. The 'ping' tool was used to determine the latency of the website, which was the information the user was seeking.\\n\\niv. The AI language model does not use too many steps to answer the question. Only one step was needed and used.\\n\\nv. The appropriate tool was used to answer the question. The 'ping' tool is designed to measure latency, which was the information the user was seeking.\\n\\nGiven these considerations, the AI language model's performance in answering this question is excellent.\"}"
 								      ]
 								     },
 								     "execution_count": 8,
 								     "metadata": {},
 								     "output_type": "execute_result"
 								    }
 								   ],
 								   "source": [
 								    "evaluation_result = evaluator.evaluate_agent_trajectory(\n",
 								    "    prediction=result[\"output\"],\n",
 								    "    input=result[\"input\"],\n",
 								    "    agent_trajectory=result[\"intermediate_steps\"],\n",
 								    ")\n",
 								    "evaluation_result"
 								   ]
 								  }
 								 ],
 								 "metadata": {
 								  "kernelspec": {
 								   "display_name": "Python 3 (ipykernel)",
 								   "language": "python",
 								   "name": "python3"
 								  },
 								  "language_info": {
 								   "codemirror_mode": {
 								    "name": "ipython",
 								    "version": 3
 								   },
 								   "file_extension": ".py",
 								   "mimetype": "text/x-python",
 								   "name": "python",
 								   "nbconvert_exporter": "python",
 								   "pygments_lexer": "ipython3",
 								   "version": "3.11.2"
 								  }
 								 },
 								 "nbformat": 4,
 								 "nbformat_minor": 5
 								}