mirror of
https://github.com/hwchase17/langchain
synced 2024-11-10 01:10:59 +00:00
5ca7ce77cd
Use numexpr evaluate instead of the python REPL to avoid malicious code injection. Tested against the (limited) math dataset and got the same score as before. For more permissive tools (like the REPL tool itself), other approaches ought to be provided (some combination of Sanitizer + Restricted python + unprivileged-docker + ...), but for a calculator tool, only mathematical expressions should be permitted. See https://github.com/hwchase17/langchain/issues/814
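The idea in the commit message above, accepting only mathematical expressions rather than arbitrary Python, can be illustrated without numexpr. The following is a minimal sketch using only the standard-library `ast` module. It is not the code from this commit, and `safe_eval_math` is a hypothetical helper name introduced here for illustration.

```python
import ast
import operator

# Whitelist of arithmetic operators; every other syntax element is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval_math(expression: str):
    """Evaluate a purely arithmetic expression (hypothetical helper, not a LangChain API)."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Only plain numeric literals are allowed (bool is excluded explicitly).
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access, etc. all end up here.
        raise ValueError(f"Unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression.strip(), mode="eval"))


print(safe_eval_math("2 * (3 + 4)"))
```

Anything that is not a number or a whitelisted operator, such as a function call or a name lookup, raises `ValueError` instead of being executed. A production calculator would likely also want to bound exponent sizes, which this sketch does not do.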
292 lines
6.5 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "984169ca",
   "metadata": {},
   "source": [
    "# Agent Benchmarking: Search + Calculator\n",
    "\n",
    "Here we go over how to benchmark the performance of an agent on tasks where it has access to a calculator and a search tool.\n",
    "\n",
    "It is highly recommended that you do any evaluation/benchmarking with tracing enabled. See [here](https://langchain.readthedocs.io/en/latest/tracing.html) for an explanation of what tracing is and how to set it up."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46bf9205",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Comment this out if you are NOT using tracing\n",
    "import os\n",
    "os.environ[\"LANGCHAIN_HANDLER\"] = \"langchain\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a16b75d",
   "metadata": {},
   "source": [
    "## Loading the data\n",
    "First, let's load the `agent-search-calculator` dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b2d5e98",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.evaluation.loading import load_dataset\n",
    "dataset = load_dataset(\"agent-search-calculator\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ab6a716",
   "metadata": {},
   "source": [
    "## Setting up a chain\n",
    "Now we need to load an agent capable of answering these questions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c18680b5",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.llms import OpenAI\n",
    "from langchain.agents import AgentType, initialize_agent, load_tools\n",
    "\n",
    "# The llm-math (calculator) tool needs an LLM; serpapi provides the search tool\n",
    "tools = load_tools(['serpapi', 'llm-math'], llm=OpenAI(temperature=0))\n",
    "agent = initialize_agent(tools, OpenAI(temperature=0), agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68504a8f",
   "metadata": {},
   "source": [
    "## Make a prediction\n",
    "\n",
    "First, we can make predictions one datapoint at a time. Doing it at this level of granularity allows us to explore the outputs in detail, and it is also a lot cheaper than running over multiple datapoints."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cbcafc92",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(dataset[0]['question'])\n",
    "agent.run(dataset[0]['question'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0c16cd7",
   "metadata": {},
   "source": [
    "## Make many predictions\n",
    "Now we can make predictions over the entire dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bbbbb20e",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "agent.run(dataset[4]['question'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24b4c66e",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "predictions = []\n",
    "predicted_dataset = []\n",
    "error_dataset = []\n",
    "for data in dataset:\n",
    "    new_data = {\"input\": data[\"question\"], \"answer\": data[\"answer\"]}\n",
    "    try:\n",
    "        predictions.append(agent(new_data))\n",
    "        predicted_dataset.append(new_data)\n",
    "    except Exception as e:\n",
    "        # Record the error text as the output so `predictions` stays aligned with `dataset`\n",
    "        predictions.append({\"output\": str(e), **new_data})\n",
    "        error_dataset.append(new_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49d969fb",
   "metadata": {},
   "source": [
    "## Evaluate performance\n",
    "Now we can evaluate the predictions. The first thing we can do is look at them by eye."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1d583f03",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "predictions[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4783344b",
   "metadata": {},
   "source": [
    "Next, we can use a language model to score them programmatically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0a9341d",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.evaluation.qa import QAEvalChain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1612dec1",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "llm = OpenAI(temperature=0)\n",
    "eval_chain = QAEvalChain.from_llm(llm)\n",
    "graded_outputs = eval_chain.evaluate(dataset, predictions, question_key=\"question\", prediction_key=\"output\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79587806",
   "metadata": {},
   "source": [
    "We can add the graded output to each dict in `predictions` and then get a count of the grades."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2a689df5",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "for i, prediction in enumerate(predictions):\n",
    "    prediction['grade'] = graded_outputs[i]['text']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27b61215",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "Counter([pred['grade'] for pred in predictions])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12fe30f4",
   "metadata": {},
   "source": [
    "We can also filter the datapoints to the incorrect examples and look at them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "47c692a1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Strip whitespace so the match does not depend on a leading space in the grade text\n",
    "incorrect = [pred for pred in predictions if pred['grade'].strip() == \"INCORRECT\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ef976c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "incorrect"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3eb948cf-f767-4c87-a12d-275b66eef407",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}