langchain/docs/extras/guides/evaluation/agent_benchmarking.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "984169ca",
   "metadata": {},
   "source": [
    "# Agent Benchmarking: Search + Calculator\n",
    "\n",
    "Here we go over how to benchmark performance of an agent on tasks where it has access to a calculator and a search tool.\n",
    "\n",
    "It is highly reccomended that you do any evaluation/benchmarking with tracing enabled. See [here](https://python.langchain.com/docs/guides/tracing/) for an explanation of what tracing is and how to set it up."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46bf9205",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Comment this out if you are NOT using tracing\n",
    "import os\n",
    "\n",
    "os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a16b75d",
   "metadata": {},
   "source": [
    "## Loading the data\n",
    "First, let's load the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b2d5e98",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.evaluation.loading import load_dataset\n",
    "\n",
    "dataset = load_dataset(\"agent-search-calculator\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ab6a716",
   "metadata": {},
   "source": [
    "## Setting up a chain\n",
    "Now we need to load an agent capable of answering these questions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c18680b5",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.llms import OpenAI\n",
    "from langchain.chains import LLMMathChain\n",
    "from langchain.agents import initialize_agent, Tool, load_tools\n",
    "from langchain.agents import AgentType\n",
    "\n",
    "tools = load_tools([\"serpapi\", \"llm-math\"], llm=OpenAI(temperature=0))\n",
    "agent = initialize_agent(\n",
    "    tools,\n",
    "    OpenAI(temperature=0),\n",
    "    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,\n",
    "    verbose=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68504a8f",
   "metadata": {},
   "source": [
    "## Make a prediction\n",
    "\n",
    "First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows use to explore the outputs in detail, and also is a lot cheaper than running over multiple datapoints"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cbcafc92",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(dataset[0][\"question\"])\n",
    "agent.run(dataset[0][\"question\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0c16cd7",
   "metadata": {},
   "source": [
    "## Make many predictions\n",
    "Now we can make predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bbbbb20e",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "agent.run(dataset[4][\"question\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24b4c66e",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "predictions = []\n",
    "predicted_dataset = []\n",
    "error_dataset = []\n",
    "for data in dataset:\n",
    "    new_data = {\"input\": data[\"question\"], \"answer\": data[\"answer\"]}\n",
    "    try:\n",
    "        predictions.append(agent(new_data))\n",
    "        predicted_dataset.append(new_data)\n",
    "    except Exception as e:\n",
    "        predictions.append({\"output\": str(e), **new_data})\n",
    "        error_dataset.append(new_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49d969fb",
   "metadata": {},
   "source": [
    "## Evaluate performance\n",
    "Now we can evaluate the predictions. The first thing we can do is look at them by eye."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1d583f03",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "predictions[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4783344b",
   "metadata": {},
   "source": [
    "Next, we can use a language model to score them programatically"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0a9341d",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.evaluation.qa import QAEvalChain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1612dec1",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "llm = OpenAI(temperature=0)\n",
    "eval_chain = QAEvalChain.from_llm(llm)\n",
    "graded_outputs = eval_chain.evaluate(\n",
    "    dataset, predictions, question_key=\"question\", prediction_key=\"output\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79587806",
   "metadata": {},
   "source": [
    "We can add in the graded output to the `predictions` dict and then get a count of the grades."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2a689df5",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "for i, prediction in enumerate(predictions):\n",
    "    prediction[\"grade\"] = graded_outputs[i][\"text\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27b61215",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "Counter([pred[\"grade\"] for pred in predictions])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12fe30f4",
   "metadata": {},
   "source": [
    "We can also filter the datapoints to the incorrect examples and look at them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "47c692a1",
   "metadata": {},
   "outputs": [],
   "source": [
    "incorrect = [pred for pred in predictions if pred[\"grade\"] == \" INCORRECT\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ef976c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "incorrect"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3eb948cf-f767-4c87-a12d-275b66eef407",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}