You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
langchain/docs/extras/guides/evaluation/agent_benchmarking.ipynb

302 lines
6.6 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"id": "984169ca",
"metadata": {},
"source": [
"# Agent Benchmarking: Search + Calculator\n",
"\n",
"Here we go over how to benchmark performance of an agent on tasks where it has access to a calculator and a search tool.\n",
"\n",
"It is highly reccomended that you do any evaluation/benchmarking with tracing enabled. See [here](https://python.langchain.com/docs/guides/tracing/) for an explanation of what tracing is and how to set it up."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "46bf9205",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Comment this out if you are NOT using tracing\n",
"import os\n",
"\n",
"os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
]
},
{
"cell_type": "markdown",
"id": "8a16b75d",
"metadata": {},
"source": [
"## Loading the data\n",
"First, let's load the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b2d5e98",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.evaluation.loading import load_dataset\n",
"\n",
"dataset = load_dataset(\"agent-search-calculator\")"
]
},
{
"cell_type": "markdown",
"id": "4ab6a716",
"metadata": {},
"source": [
"## Setting up a chain\n",
"Now we need to load an agent capable of answering these questions."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c18680b5",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.llms import OpenAI\n",
"from langchain.chains import LLMMathChain\n",
"from langchain.agents import initialize_agent, Tool, load_tools\n",
"from langchain.agents import AgentType\n",
"\n",
"tools = load_tools([\"serpapi\", \"llm-math\"], llm=OpenAI(temperature=0))\n",
"agent = initialize_agent(\n",
" tools,\n",
" OpenAI(temperature=0),\n",
" agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,\n",
" verbose=True,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "68504a8f",
"metadata": {},
"source": [
"## Make a prediction\n",
"\n",
"First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows use to explore the outputs in detail, and also is a lot cheaper than running over multiple datapoints"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cbcafc92",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"print(dataset[0][\"question\"])\n",
"agent.run(dataset[0][\"question\"])"
]
},
{
"cell_type": "markdown",
"id": "d0c16cd7",
"metadata": {},
"source": [
"## Make many predictions\n",
"Now we can make predictions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bbbbb20e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"agent.run(dataset[4][\"question\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "24b4c66e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"predictions = []\n",
"predicted_dataset = []\n",
"error_dataset = []\n",
"for data in dataset:\n",
" new_data = {\"input\": data[\"question\"], \"answer\": data[\"answer\"]}\n",
" try:\n",
" predictions.append(agent(new_data))\n",
" predicted_dataset.append(new_data)\n",
" except Exception as e:\n",
" predictions.append({\"output\": str(e), **new_data})\n",
" error_dataset.append(new_data)"
]
},
{
"cell_type": "markdown",
"id": "49d969fb",
"metadata": {},
"source": [
"## Evaluate performance\n",
"Now we can evaluate the predictions. The first thing we can do is look at them by eye."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1d583f03",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"predictions[0]"
]
},
{
"cell_type": "markdown",
"id": "4783344b",
"metadata": {},
"source": [
"Next, we can use a language model to score them programatically"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d0a9341d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.evaluation.qa import QAEvalChain"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1612dec1",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"llm = OpenAI(temperature=0)\n",
"eval_chain = QAEvalChain.from_llm(llm)\n",
"graded_outputs = eval_chain.evaluate(\n",
" dataset, predictions, question_key=\"question\", prediction_key=\"output\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "79587806",
"metadata": {},
"source": [
"We can add in the graded output to the `predictions` dict and then get a count of the grades."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2a689df5",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"for i, prediction in enumerate(predictions):\n",
" prediction[\"grade\"] = graded_outputs[i][\"text\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "27b61215",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"Counter([pred[\"grade\"] for pred in predictions])"
]
},
{
"cell_type": "markdown",
"id": "12fe30f4",
"metadata": {},
"source": [
"We can also filter the datapoints to the incorrect examples and look at them."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "47c692a1",
"metadata": {},
"outputs": [],
"source": [
"incorrect = [pred for pred in predictions if pred[\"grade\"] == \" INCORRECT\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0ef976c1",
"metadata": {},
"outputs": [],
"source": [
"incorrect"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3eb948cf-f767-4c87-a12d-275b66eef407",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}