{ "cells": [ { "cell_type": "markdown", "id": "984169ca", "metadata": {}, "source": [ "# Agent Benchmarking: Search + Calculator\n", "\n", "Here we go over how to benchmark performance of an agent on tasks where it has access to a calculator and a search tool.\n", "\n", "It is highly reccomended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up." ] }, { "cell_type": "code", "execution_count": 1, "id": "46bf9205", "metadata": {}, "outputs": [], "source": [ "# Comment this out if you are NOT using tracing\n", "import os\n", "os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\"" ] }, { "cell_type": "markdown", "id": "8a16b75d", "metadata": {}, "source": [ "## Loading the data\n", "First, let's load the data." ] }, { "cell_type": "code", "execution_count": 2, "id": "5b2d5e98", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Found cached dataset json (/Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--agent-search-calculator-8a025c0ce5fb99d2/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "3a275586643f4ccfba1a8d54be28c351", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/1 [00:00._completion_with_retry in 4.0 seconds as it raised APIConnectionError: Error communicating with OpenAI: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer')).\n" ] } ], "source": [ "predictions = []\n", "predicted_dataset = []\n", "error_dataset = []\n", "for data in dataset:\n", " new_data = {\"input\": data[\"question\"], \"answer\": data[\"answer\"]}\n", " try:\n", " predictions.append(agent(new_data))\n", " predicted_dataset.append(new_data)\n", " except Exception:\n", " error_dataset.append(new_data)" ] }, { "cell_type": "markdown", "id": "49d969fb", "metadata": {}, "source": [ "## Evaluate performance\n", "Now we can evaluate the predictions. The first thing we can do is look at them by eye." ] }, { "cell_type": "code", "execution_count": 9, "id": "1d583f03", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'input': 'How many people live in canada as of 2023?',\n", " 'answer': 'approximately 38,625,801',\n", " 'output': '38,630,316 people live in Canada as of 2023.',\n", " 'intermediate_steps': [(AgentAction(tool='Search', tool_input='Population of Canada 2023', log=' I need to find population data\\nAction: Search\\nAction Input: Population of Canada 2023'),\n", " '38,630,316')]}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictions[0]" ] }, { "cell_type": "markdown", "id": "4783344b", "metadata": {}, "source": [ "Next, we can use a language model to score them programatically" ] }, { "cell_type": "code", "execution_count": 10, "id": "d0a9341d", "metadata": {}, "outputs": [], "source": [ "from langchain.evaluation.qa import QAEvalChain" ] }, { "cell_type": "code", "execution_count": 14, "id": "1612dec1", "metadata": {}, "outputs": [], "source": [ "llm = OpenAI(temperature=0)\n", "eval_chain = QAEvalChain.from_llm(llm)\n", "graded_outputs = eval_chain.evaluate(dataset, predictions, question_key=\"question\", prediction_key=\"output\")" ] }, { "cell_type": "markdown", "id": "79587806", "metadata": {}, "source": [ "We can add in the graded output to the `predictions` dict and then get a count of the grades." ] }, { "cell_type": "code", "execution_count": 15, "id": "2a689df5", "metadata": {}, "outputs": [], "source": [ "for i, prediction in enumerate(predictions):\n", " prediction['grade'] = graded_outputs[i]['text']" ] }, { "cell_type": "code", "execution_count": 16, "id": "27b61215", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({' CORRECT': 4, ' INCORRECT': 6})" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from collections import Counter\n", "Counter([pred['grade'] for pred in predictions])" ] }, { "cell_type": "markdown", "id": "12fe30f4", "metadata": {}, "source": [ "We can also filter the datapoints to the incorrect examples and look at them." ] }, { "cell_type": "code", "execution_count": 17, "id": "47c692a1", "metadata": {}, "outputs": [], "source": [ "incorrect = [pred for pred in predictions if pred['grade'] == \" INCORRECT\"]" ] }, { "cell_type": "code", "execution_count": 18, "id": "0ef976c1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'input': \"who is dua lipa's boyfriend? what is his age raised to the .43 power?\",\n", " 'answer': 'her boyfriend is Romain Gravas. his age raised to the .43 power is approximately 4.9373857399466665',\n", " 'output': \"Isaac Carew, Dua Lipa's boyfriend, is 36 years old and his age raised to the .43 power is 4.6688516567750975.\",\n", " 'intermediate_steps': [(AgentAction(tool='Search', tool_input=\"Dua Lipa's boyfriend\", log=' I need to find out who Dua Lipa\\'s boyfriend is and then calculate his age raised to the .43 power\\nAction: Search\\nAction Input: \"Dua Lipa\\'s boyfriend\"'),\n", " 'Dua and Isaac, a model and a chef, dated on and off from 2013 to 2019. The two first split in early 2017, which is when Dua went on to date LANY ...'),\n", " (AgentAction(tool='Search', tool_input='Isaac Carew age', log=' I need to find out Isaac\\'s age\\nAction: Search\\nAction Input: \"Isaac Carew age\"'),\n", " '36 years'),\n", " (AgentAction(tool='Calculator', tool_input='36^.43', log=' I need to calculate 36 raised to the .43 power\\nAction: Calculator\\nAction Input: 36^.43'),\n", " 'Answer: 4.6688516567750975\\n')],\n", " 'grade': ' INCORRECT'}" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "incorrect[0]" ] }, { "cell_type": "code", "execution_count": null, "id": "7710401a", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.1" } }, "nbformat": 4, "nbformat_minor": 5 }