langchain/docs/extras/guides/langsmith/walkthrough.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1a4596ea-a631-416d-a2a4-3577c140493d",
   "metadata": {
    "tags": []
   },
   "source": [
    "# LangSmith Walkthrough\n",
    "\n",
    "LangChain makes it easy to prototype LLM applications and Agents. However, delivering LLM applications to production can be deceptively difficult. You will likely have to heavily customize and iterate on your prompts, chains, and other components to create a high-quality product.\n",
    "\n",
    "To aid in this process, we've launched LangSmith, a unified platform for debugging, testing, and monitoring your LLM applications.\n",
    "\n",
    "When might this come in handy? You may find it useful when you want to:\n",
    "\n",
    "- Quickly debug a new chain, agent, or set of tools\n",
    "- Visualize how components (chains, llms, retrievers, etc.) relate and are used\n",
    "- Evaluate different prompts and LLMs for a single component\n",
    "- Run a given chain several times over a dataset to ensure it consistently meets a quality bar\n",
    "- Capture usage traces and using LLMs or analytics pipelines to generate insights"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "138fbb8f-960d-4d26-9dd5-6d6acab3ee55",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "**[Create a LangSmith account](https://smith.langchain.com/) and create an API key (see bottom left corner). Familiarize yourself with the platform by looking through the [docs](https://docs.smith.langchain.com/)**\n",
    "\n",
    "Note LangSmith is in closed beta; we're in the process of rolling it out to more users. However, you can fill out the form on the website for expedited access.\n",
    "\n",
    "Now, let's get started!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d77d064-41b4-41fb-82e6-2d16461269ec",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Log runs to LangSmith\n",
    "\n",
    "First, configure your environment variables to tell LangChain to log traces. This is done by setting the `LANGCHAIN_TRACING_V2` environment variable to true.\n",
    "You can tell LangChain which project to log to by setting the `LANGCHAIN_PROJECT` environment variable (if this isn't set, runs will be logged to the `default` project). This will automatically create the project for you if it doesn't exiust. You must also set the `LANGCHAIN_ENDPOINT` and `LANGCHAIN_API_KEY` environment variables.\n",
    "\n",
    "For more information on other ways to set up tracing, please reference the [LangSmith documentation](https://docs.smith.langchain.com/docs/)\n",
    "\n",
    "**NOTE:** You must also set your `OPENAI_API_KEY` and `SERPAPI_API_KEY` environment variables in order to run the following tutorial.\n",
    "\n",
    "**NOTE:** You can only access an API key when you first create it. Keep it somewhere safe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "904db9a5-f387-4a57-914c-c8af8d39e249",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import os\n",
    "from uuid import uuid4\n",
    "\n",
    "unique_id = uuid4().hex[0:8]\n",
    "os.environ[\"LANGCHAIN_TRACING_V2\"] = \"true\"\n",
    "os.environ[\"LANGCHAIN_PROJECT\"] = f\"Tracing Walkthrough - {unique_id}\"\n",
    "os.environ[\"LANGCHAIN_ENDPOINT\"] = \"https://api.smith.langchain.com\"\n",
    "os.environ[\"LANGCHAIN_API_KEY\"] = \"\"  # Update to your API key\n",
    "\n",
    "# Used by the agent in this tutorial\n",
    "# os.environ[\"OPENAI_API_KEY\"] = \"<YOUR-OPENAI-API-KEY>\"\n",
    "# os.environ[\"SERPAPI_API_KEY\"] = \"<YOUR-SERPAPI-API-KEY>\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8ee7f34b-b65c-4e09-ad52-e3ace78d0221",
   "metadata": {
    "tags": []
   },
   "source": [
    "Create the langsmith client to interact with the API"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "510b5ca0",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langsmith import Client\n",
    "\n",
    "client = Client()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca27fa11-ddce-4af0-971e-c5c37d5b92ef",
   "metadata": {},
   "source": [
    "Create a LangChain component and log runs to the platform. In this example, we will create a ReAct-style agent with access to Search and Calculator as tools. However, LangSmith works regardless of which type of LangChain component you use (LLMs, Chat Models, Tools, Retrievers, Agents are all supported)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "7c801853-8e96-404d-984c-51ace59cbbef",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.chat_models import ChatOpenAI\n",
    "from langchain.agents import AgentType, initialize_agent, load_tools\n",
    "\n",
    "llm = ChatOpenAI(temperature=0)\n",
    "tools = load_tools([\"serpapi\", \"llm-math\"], llm=llm)\n",
    "agent = initialize_agent(\n",
    "    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=False\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cab51e1e-8270-452c-ba22-22b5b5951899",
   "metadata": {},
   "source": [
    "We are running the agent concurrently on multiple inputs to reduce latency. Runs get logged to LangSmith in the background so execution latency is unaffected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "19537902-b95c-4390-80a4-f6c9a937081e",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import asyncio\n",
    "\n",
    "inputs = [\n",
    "    \"How many people live in canada as of 2023?\",\n",
    "    \"who is dua lipa's boyfriend? what is his age raised to the .43 power?\",\n",
    "    \"what is dua lipa's boyfriend age raised to the .43 power?\",\n",
    "    \"how far is it from paris to boston in miles\",\n",
    "    \"what was the total number of points scored in the 2023 super bowl? what is that number raised to the .23 power?\",\n",
    "    \"what was the total number of points scored in the 2023 super bowl raised to the .23 power?\",\n",
    "    \"how many more points were scored in the 2023 super bowl than in the 2022 super bowl?\",\n",
    "    \"what is 153 raised to .1312 power?\",\n",
    "    \"who is kendall jenner's boyfriend? what is his height (in inches) raised to .13 power?\",\n",
    "    \"what is 1213 divided by 4345?\",\n",
    "]\n",
    "results = []\n",
    "\n",
    "\n",
    "async def arun(agent, input_example):\n",
    "    try:\n",
    "        return await agent.arun(input_example)\n",
    "    except Exception as e:\n",
    "        # The agent sometimes makes mistakes! These will be captured by the tracing.\n",
    "        return e\n",
    "\n",
    "\n",
    "for input_example in inputs:\n",
    "    results.append(arun(agent, input_example))\n",
    "results = await asyncio.gather(*results)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "0405ff30-21fe-413d-85cf-9fa3c649efec",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.callbacks.tracers.langchain import wait_for_all_tracers\n",
    "\n",
    "# Logs are submitted in a background thread to avoid blocking execution.\n",
    "# For the sake of this tutorial, we want to make sure\n",
    "# they've been submitted before moving on. This is also\n",
    "# useful for serverless deployments.\n",
    "wait_for_all_tracers()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "9decb964-be07-4b6c-9802-9825c8be7b64",
   "metadata": {},
   "source": [
    "Assuming you've successfully set up your environment, your agent traces should show up in the `Projects` section in the [app](https://smith.langchain.com/). Congrats!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c43c311-4e09-4d57-9ef3-13afb96ff430",
   "metadata": {},
   "source": [
    "## Evaluate another agent implementation\n",
    "\n",
    "In addition to logging runs, LangSmith also allows you to test and evaluate your LLM applications.\n",
    "\n",
    "In this section, you will leverage LangSmith to create a benchmark dataset and run AI-assisted evaluators on an agent. You will do so in a few steps:\n",
    "\n",
    "1. Create a dataset from pre-existing run inputs and outputs\n",
    "2. Initialize a new agent to benchmark\n",
    "3. Configure evaluators to grade an agent's output\n",
    "4. Run the agent over the dataset and evaluate the results"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "beab1a29-b79d-4a99-b5b1-0870c2d772b1",
   "metadata": {},
   "source": [
    "### 1. Create a LangSmith dataset\n",
    "\n",
    "Below, we use the LangSmith client to create a dataset from the agent runs you just logged above. You will use these later to measure performance for a new agent. This simple taking the inputs and outputs of the runs and saving them as examples to a dataset. A dataset is a collection of examples, which are nothing more than input-output pairs you can use as test cases to your application.\n",
    "\n",
    "**Note: this is a simple, walkthrough example. In a real-world setting, you'd ideally first validate the outputs before adding them to a benchmark dataset to be used for evaluating other agents.**\n",
    "\n",
    "For more information on datasets, including how to create them from CSVs or other files or how to create them in the platform, please refer to the [LangSmith documentation](https://docs.smith.langchain.com/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "17580c4b-bd04-4dde-9d21-9d4edd25b00d",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "dataset_name = f\"calculator-example-dataset-{unique_id}\"\n",
    "\n",
    "dataset = client.create_dataset(\n",
    "    dataset_name, description=\"A calculator example dataset\"\n",
    ")\n",
    "\n",
    "runs = client.list_runs(\n",
    "    project_name=os.environ[\"LANGCHAIN_PROJECT\"],\n",
    "    execution_order=1,  # Only return the top-level runs\n",
    "    error=False,  # Only runs that succeed\n",
    ")\n",
    "for run in runs:\n",
    "    client.create_example(inputs=run.inputs, outputs=run.outputs, dataset_id=dataset.id)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8adfd29c-b258-49e5-94b4-74597a12ba16",
   "metadata": {
    "tags": []
   },
   "source": [
    "### 2. Initialize a new agent to benchmark\n",
    "\n",
    "You can evaluate any LLM, chain, or agent. Since chains can have memory, we will pass in a `chain_factory` (aka a `constructor` ) function to initialize for each call.\n",
    "\n",
    "In this case, we will test an agent that uses OpenAI's function calling endpoints."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "f42d8ecc-d46a-448b-a89c-04b0f6907f75",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.chat_models import ChatOpenAI\n",
    "from langchain.agents import AgentType, initialize_agent, load_tools\n",
    "\n",
    "llm = ChatOpenAI(model=\"gpt-3.5-turbo-0613\", temperature=0)\n",
    "tools = load_tools([\"serpapi\", \"llm-math\"], llm=llm)\n",
    "\n",
    "\n",
    "# Since chains can be stateful (e.g. they can have memory), we provide\n",
    "# a way to initialize a new chain for each row in the dataset. This is done\n",
    "# by passing in a factory function that returns a new chain for each row.\n",
    "def agent_factory():\n",
    "    return initialize_agent(tools, llm, agent=AgentType.OPENAI_FUNCTIONS, verbose=False)\n",
    "\n",
    "\n",
    "# If your chain is NOT stateful, your factory can return the object directly\n",
    "# to improve runtime performance. For example:\n",
    "# chain_factory = lambda: agent"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "9cb9ef53",
   "metadata": {},
   "source": [
    "### 3. Configure evaluation\n",
    "\n",
    "Manually comparing the results of chains in the UI is effective, but it can be time consuming.\n",
    "It can be helpful to use automated metrics and AI-assisted feedback to evaluate your component's performance.\n",
    "\n",
    "Below, we will create some pre-implemented run evaluators that do the following:\n",
    "- Compare results against ground truth labels. (You used the debug outputs above for this)\n",
    "- Measure semantic (dis)similarity using embedding distance\n",
    "- Evaluate 'aspects' of the agent's response in a reference-free manner using custom criteria\n",
    "\n",
    "For a longer discussion of how to select an appropriate evaluator for your use case and how to create your own\n",
    "custom evaluators, please refer to the [LangSmith documentation](https://docs.smith.langchain.com/).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "a25dc281",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.evaluation import EvaluatorType\n",
    "from langchain.smith import RunEvalConfig\n",
    "\n",
    "evaluation_config = RunEvalConfig(\n",
    "    # Evaluators can either be an evaluator type (e.g., \"qa\", \"criteria\", \"embedding_distance\", etc.) or a configuration for that evaluator\n",
    "    evaluators=[\n",
    "        # Measures whether a QA response is \"Correct\", based on a reference answer\n",
    "        # You can also select via the raw string \"qa\"\n",
    "        EvaluatorType.QA,\n",
    "        # Measure the embedding distance between the output and the reference answer\n",
    "        # Equivalent to: EvalConfig.EmbeddingDistance(embeddings=OpenAIEmbeddings())\n",
    "        EvaluatorType.EMBEDDING_DISTANCE,\n",
    "        # Grade whether the output satisfies the stated criteria. You can select a default one such as \"helpfulness\" or provide your own.\n",
    "        RunEvalConfig.LabeledCriteria(\"helpfulness\"),\n",
    "        # Both the Criteria and LabeledCriteria evaluators can be configured with a dictionary of custom criteria.\n",
    "        RunEvalConfig.Criteria(\n",
    "            {\n",
    "                \"fifth-grader-score\": \"Do you have to be smarter than a fifth grader to answer this question?\"\n",
    "            }\n",
    "        ),\n",
    "    ],\n",
    "    # You can add custom StringEvaluator or RunEvaluator objects here as well, which will automatically be\n",
    "    # applied to each prediction. Check out the docs for examples.\n",
    "    custom_evaluators=[],\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "07885b10",
   "metadata": {
    "tags": []
   },
   "source": [
    "### 4. Run the agent and evaluators\n",
    "\n",
    "Use the [arun_on_dataset](https://api.python.langchain.com/en/latest/smith/langchain.smith.evaluation.runner_utils.arun_on_dataset.html#langchain.smith.evaluation.runner_utils.arun_on_dataset) (or synchronous [run_on_dataset](https://api.python.langchain.com/en/latest/smith/langchain.smith.evaluation.runner_utils.run_on_dataset.html#langchain.smith.evaluation.runner_utils.run_on_dataset)) function to evaluate your model. This will:\n",
    "1. Fetch example rows from the specified dataset\n",
    "2. Run your llm or chain on each example.\n",
    "3. Apply evalutors to the resulting run traces and corresponding reference examples to generate automated feedback.\n",
    "\n",
    "The results will be visible in the LangSmith app."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "3733269b-8085-4644-9d5d-baedcff13a2f",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Processed examples: 1\r"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Chain failed for example 85f3a543-0429-48ae-be23-f48f0d903530. Error: LLMMathChain._evaluate(\"\n",
      "age_of_Dua_Lipa_boyfriend ** 0.43\n",
      "\") raised error: 'age_of_Dua_Lipa_boyfriend'. Please try again with a valid numerical expression\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Processed examples: 6\r"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Chain failed for example 97d0d138-e9b3-4825-af2c-42789c66c0d4. Error: Too many arguments to single-input tool Calculator. Args: ['height ^ 0.13', {'height': 72}]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Processed examples: 9\r"
     ]
    }
   ],
   "source": [
    "from langchain.smith import (\n",
    "    arun_on_dataset,\n",
    "    run_on_dataset,  # Available if your chain doesn't support async calls.\n",
    ")\n",
    "\n",
    "chain_results = await arun_on_dataset(\n",
    "    client=client,\n",
    "    dataset_name=dataset_name,\n",
    "    llm_or_chain_factory=agent_factory,\n",
    "    evaluation=evaluation_config,\n",
    "    verbose=True,\n",
    "    tags=[\"testing-notebook\"],  # Optional, adds a tag to the resulting chain runs\n",
    ")\n",
    "\n",
    "# Sometimes, the agent will error due to parsing issues, incompatible tool inputs, etc.\n",
    "# These are logged as warnings here and captured as errors in the tracing UI."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cdacd159-eb4d-49e9-bb2a-c55322c40ed4",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Review the test results\n",
    "\n",
    "You can review the test results tracing UI below by navigating to the \"Datasets & Testing\" page and selecting the **\"calculator-example-dataset-*\"** dataset, clicking on the `Test Runs` tab, then inspecting the runs in the corresponding project. \n",
    "\n",
    "This will show the new runs and the feedback logged from the selected evaluators. Note that runs that error out will not have feedback."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "591c819e-9932-45cf-adab-63727dd49559",
   "metadata": {},
   "source": [
    "## Exporting datasets and runs\n",
    "\n",
    "LangSmith lets you export data to common formats such as CSV or JSONL directly in the web app. You can also use the client to fetch runs for further analysis, to store in your own database, or to share with others. Let's fetch the run traces from the evaluation run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "33bfefde-d1bb-4f50-9f7a-fd572ee76820",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Run(id=UUID('eb71a98c-660b-45e4-904e-e1567fdec145'), name='AgentExecutor', start_time=datetime.datetime(2023, 7, 13, 8, 23, 35, 102907), run_type=<RunTypeEnum.chain: 'chain'>, end_time=datetime.datetime(2023, 7, 13, 8, 23, 37, 793962), extra={'runtime': {'library': 'langchain', 'runtime': 'python', 'platform': 'macOS-13.4.1-arm64-arm-64bit', 'sdk_version': '0.0.5', 'library_version': '0.0.231', 'runtime_version': '3.11.2'}, 'total_tokens': 512, 'prompt_tokens': 451, 'completion_tokens': 61}, error=None, serialized=None, events=[{'name': 'start', 'time': '2023-07-13T08:23:35.102907'}, {'name': 'end', 'time': '2023-07-13T08:23:37.793962'}], inputs={'input': 'what is 1213 divided by 4345?'}, outputs={'output': '1213 divided by 4345 is approximately 0.2792.'}, reference_example_id=UUID('d343add7-2631-417b-905a-dc39361ace69'), parent_run_id=None, tags=['openai-functions', 'testing-notebook'], execution_order=1, session_id=UUID('cc5f4f88-f1bf-495f-8adb-384f66321eb2'), child_run_ids=[UUID('daa9708a-ad08-4be1-9841-e92e2f384cce'), UUID('28b1ada7-3fe8-4853-a5b0-dac8a93a3066'), UUID('dc0b4867-3f3d-46f7-bfb5-f4be10f3cc52'), UUID('58c9494e-2ea6-4291-ab78-73b8ffcdaef5'), UUID('8f5a3e08-ce96-4c81-a6aa-86bf5b3bb590'), UUID('f0447532-7ded-45b6-9d87-f1fa18e381b0')], child_runs=None, feedback_stats={'correctness': {'n': 1, 'avg': 1.0, 'mode': 1}, 'helpfulness': {'n': 1, 'avg': 1.0, 'mode': 1}, 'fifth-grader-score': {'n': 1, 'avg': 0.0, 'mode': 0}, 'embedding_cosine_distance': {'n': 1, 'avg': 0.144522385071361, 'mode': 0.144522385071361}})"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "runs = list(client.list_runs(dataset_name=dataset_name))\n",
    "runs[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "6595c888-1f5c-4ae3-9390-0a559f5575d1",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'correctness': {'n': 7, 'avg': 0.7142857142857143, 'mode': 1},\n",
       " 'helpfulness': {'n': 7, 'avg': 1.0, 'mode': 1},\n",
       " 'fifth-grader-score': {'n': 7, 'avg': 0.7142857142857143, 'mode': 1},\n",
       " 'embedding_cosine_distance': {'n': 7,\n",
       "  'avg': 0.08308464442094905,\n",
       "  'mode': 0.00371031210788608}}"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "client.read_project(project_id=runs[0].session_id).feedback_stats"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "2646f0fb-81d4-43ce-8a9b-54b8e19841e2",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Conclusion\n",
    "\n",
    "Congratulations! You have succesfully traced and evaluated an agent using LangSmith!\n",
    "\n",
    "This was a quick guide to get started, but there are many more ways to use LangSmith to speed up your developer flow and produce better results.\n",
    "\n",
    "For more information on how you can get the most out of LangSmith, check out [LangSmith documentation](https://docs.smith.langchain.com/), and please reach out with questions, feature requests, or feedback at [support@langchain.dev](mailto:support@langchain.dev)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57237f12",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}