{ "cells": [ { "cell_type": "markdown", "id": "984169ca", "metadata": {}, "source": [ "# Agent Benchmarking: Search + Calculator\n", "\n", "Here we go over how to benchmark performance of an agent on tasks where it has access to a calculator and a search tool.\n", "\n", "It is highly reccomended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up." ] }, { "cell_type": "code", "execution_count": null, "id": "46bf9205", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Comment this out if you are NOT using tracing\n", "import os\n", "os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\"" ] }, { "cell_type": "markdown", "id": "8a16b75d", "metadata": {}, "source": [ "## Loading the data\n", "First, let's load the data." ] }, { "cell_type": "code", "execution_count": null, "id": "5b2d5e98", "metadata": { "tags": [] }, "outputs": [], "source": [ "from langchain.evaluation.loading import load_dataset\n", "dataset = load_dataset(\"agent-search-calculator\")" ] }, { "cell_type": "markdown", "id": "4ab6a716", "metadata": {}, "source": [ "## Setting up a chain\n", "Now we need to load an agent capable of answering these questions." ] }, { "cell_type": "code", "execution_count": null, "id": "c18680b5", "metadata": { "tags": [] }, "outputs": [], "source": [ "from langchain.llms import OpenAI\n", "from langchain.chains import LLMMathChain\n", "from langchain.agents import initialize_agent, Tool, load_tools\n", "from langchain.agents import AgentType\n", "\n", "tools = load_tools(['serpapi', 'llm-math'], llm=OpenAI(temperature=0))\n", "agent = initialize_agent(tools, OpenAI(temperature=0), agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)" ] }, { "cell_type": "markdown", "id": "68504a8f", "metadata": {}, "source": [ "## Make a prediction\n", "\n", "First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows use to explore the outputs in detail, and also is a lot cheaper than running over multiple datapoints" ] }, { "cell_type": "code", "execution_count": null, "id": "cbcafc92", "metadata": { "tags": [] }, "outputs": [], "source": [ "print(dataset[0]['question'])\n", "agent.run(dataset[0]['question'])" ] }, { "cell_type": "markdown", "id": "d0c16cd7", "metadata": {}, "source": [ "## Make many predictions\n", "Now we can make predictions" ] }, { "cell_type": "code", "execution_count": null, "id": "bbbbb20e", "metadata": { "tags": [] }, "outputs": [], "source": [ "agent.run(dataset[4]['question'])" ] }, { "cell_type": "code", "execution_count": null, "id": "24b4c66e", "metadata": { "tags": [] }, "outputs": [], "source": [ "predictions = []\n", "predicted_dataset = []\n", "error_dataset = []\n", "for data in dataset:\n", " new_data = {\"input\": data[\"question\"], \"answer\": data[\"answer\"]}\n", " try:\n", " predictions.append(agent(new_data))\n", " predicted_dataset.append(new_data)\n", " except Exception as e:\n", " predictions.append({\"output\": str(e), **new_data})\n", " error_dataset.append(new_data)" ] }, { "cell_type": "markdown", "id": "49d969fb", "metadata": {}, "source": [ "## Evaluate performance\n", "Now we can evaluate the predictions. The first thing we can do is look at them by eye." 
] }, { "cell_type": "code", "execution_count": null, "id": "1d583f03", "metadata": { "tags": [] }, "outputs": [], "source": [ "predictions[0]" ] }, { "cell_type": "markdown", "id": "4783344b", "metadata": {}, "source": [ "Next, we can use a language model to score them programatically" ] }, { "cell_type": "code", "execution_count": null, "id": "d0a9341d", "metadata": { "tags": [] }, "outputs": [], "source": [ "from langchain.evaluation.qa import QAEvalChain" ] }, { "cell_type": "code", "execution_count": null, "id": "1612dec1", "metadata": { "tags": [] }, "outputs": [], "source": [ "llm = OpenAI(temperature=0)\n", "eval_chain = QAEvalChain.from_llm(llm)\n", "graded_outputs = eval_chain.evaluate(dataset, predictions, question_key=\"question\", prediction_key=\"output\")" ] }, { "cell_type": "markdown", "id": "79587806", "metadata": {}, "source": [ "We can add in the graded output to the `predictions` dict and then get a count of the grades." ] }, { "cell_type": "code", "execution_count": null, "id": "2a689df5", "metadata": { "tags": [] }, "outputs": [], "source": [ "for i, prediction in enumerate(predictions):\n", " prediction['grade'] = graded_outputs[i]['text']" ] }, { "cell_type": "code", "execution_count": null, "id": "27b61215", "metadata": { "tags": [] }, "outputs": [], "source": [ "from collections import Counter\n", "Counter([pred['grade'] for pred in predictions])" ] }, { "cell_type": "markdown", "id": "12fe30f4", "metadata": {}, "source": [ "We can also filter the datapoints to the incorrect examples and look at them." ] }, { "cell_type": "code", "execution_count": null, "id": "47c692a1", "metadata": {}, "outputs": [], "source": [ "incorrect = [pred for pred in predictions if pred['grade'] == \" INCORRECT\"]" ] }, { "cell_type": "code", "execution_count": null, "id": "0ef976c1", "metadata": {}, "outputs": [], "source": [ "incorrect" ] }, { "cell_type": "code", "execution_count": null, "id": "3eb948cf-f767-4c87-a12d-275b66eef407", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.1" } }, "nbformat": 4, "nbformat_minor": 5 }