{
"cells": [
{
"cell_type": "markdown",
"id": "692f3256",
"metadata": {},
"source": [
"# Evaluating an OpenAPI Chain\n",
"\n",
"This notebook goes over ways to semantically evaluate an [OpenAPI Chain](openapi.ipynb), which calls an endpoint defined by the OpenAPI specification using purely natural language."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "a457106d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.tools import OpenAPISpec, APIOperation\n",
"from langchain.chains import OpenAPIEndpointChain, LLMChain\n",
"from langchain.requests import Requests\n",
"from langchain.llms import OpenAI"
]
},
{
"cell_type": "markdown",
"id": "2c3b0954",
"metadata": {},
"source": [
"## Load the API Chain\n",
"\n",
"Load a wrapper of the spec (so we can work with it more easily). You can load from a url or from a local file."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "794142ba",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Attempting to load an OpenAPI 3.0.1 spec. This may result in degraded performance. Convert your OpenAPI spec to 3.1.* spec for better support.\n"
]
}
],
"source": [
"# Load and parse the OpenAPI Spec\n",
"spec = OpenAPISpec.from_url(\"https://www.klarna.com/us/shopping/public/openai/v0/api-docs/\")\n",
"# Load a single endpoint operation\n",
"operation = APIOperation.from_openapi_spec(spec, '/public/openai/v0/products', \"get\")\n",
"verbose = False\n",
"# Select any LangChain LLM\n",
"llm = OpenAI(temperature=0, max_tokens=1000)\n",
"# Create the endpoint chain\n",
"api_chain = OpenAPIEndpointChain.from_api_operation(\n",
" operation, \n",
" llm, \n",
" requests=Requests(), \n",
" verbose=verbose,\n",
" return_intermediate_steps=True # Return request and response text\n",
")"
]
},
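{
"cell_type": "markdown",
"id": "a1f2c3d4",
"metadata": {},
"source": [
"*Optional*: the spec can also be loaded from a local file instead of a URL. A minimal sketch, assuming you have saved the spec JSON to a file named `klarna_openapi.json` (the filename is hypothetical) and using `OpenAPISpec.from_file`, which accepts a path the way `from_url` accepts a URL:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b5e6f7a8",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical local-file variant of the cell above.\n",
"# The filename is an assumption; point it at any saved copy of the spec.\n",
"# local_spec = OpenAPISpec.from_file(\"klarna_openapi.json\")\n",
"# local_operation = APIOperation.from_openapi_spec(local_spec, '/public/openai/v0/products', \"get\")"
]
},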
{
"cell_type": "markdown",
"id": "6c05ba5b",
"metadata": {},
"source": [
"### *Optional*: Generate Input Questions and Request Ground Truth Queries"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a0c0cb7e",
"metadata": {},
"outputs": [],
"source": [
"# import re\n",
"# from langchain.prompts import PromptTemplate\n",
"\n",
"# template = \"\"\"Below is a service description:\n",
"\n",
"# {spec}\n",
"\n",
"# Imagine you're a new user trying to use {operation} through a search bar. What are 10 different things you want to request?\n",
"# Wants/Questions:\n",
"# 1. \"\"\"\n",
"\n",
"# prompt = PromptTemplate.from_template(template)\n",
"\n",
"# generation_chain = LLMChain(llm=llm, prompt=prompt)\n",
"\n",
"# questions_ = generation_chain.run(spec=operation.to_typescript(), operation=operation.operation_id).split('\\n')\n",
"# # Strip preceding numeric bullets\n",
"# questions = [re.sub(r'^\\d+\\. ', '', q).strip() for q in questions_]\n",
"# questions"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "f3d767ef",
"metadata": {},
"outputs": [],
"source": [
"# ground_truths = [\n",
"# {\"q\": ...} # What are the best queries for each input?\n",
"# ]"
]
},
{
"cell_type": "markdown",
"id": "81098a05",
"metadata": {},
"source": [
"## Run the API Chain\n",
"\n",
"The two simplest questions a user of the API Chain are:\n",
"- Did the chain succesfully access the endpoint?\n",
"- Did the action accomplish the correct result?\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "64bc7ed9",
"metadata": {},
"outputs": [],
"source": [
"from collections import defaultdict\n",
"# Collect metrics to report at completion\n",
"scores = defaultdict(list)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "dfd2d09f",
"metadata": {},
"outputs": [],
"source": [
"questions = [\n",
" 'What iPhone models are available?',\n",
" 'Are there any budget laptops?',\n",
" 'Show me the cheapest gaming PC.',\n",
" 'Are there any tablets under $400?',\n",
" 'What are the best headphones?',\n",
" 'What are the top rated laptops?',\n",
" 'I want to buy some shoes. I like Adidas and Nike.',\n",
" 'I want to buy a new skirt',\n",
" 'My company is asking me to get a professional Deskopt PC - money is no object.',\n",
" 'What are the best budget cameras?'\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "00511f7a",
"metadata": {},
"outputs": [],
"source": [
"## Run the the API chain itself\n",
"raise_error = False # Stop on first failed example - useful for development\n",
"chain_outputs = []\n",
"failed_examples = []\n",
"for question in questions:\n",
" try:\n",
" chain_outputs.append(api_chain(question))\n",
" scores[\"completed\"].append(1.0)\n",
" except Exception as e:\n",
" if raise_error:\n",
" raise e\n",
" failed_examples.append({'q': question, 'error': e})\n",
" scores[\"completed\"].append(0.0)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "f3c9729f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# If the chain failed to run, show the failing examples\n",
"failed_examples"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "914e7587",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['There are currently 10 Apple iPhone models available: Apple iPhone 14 Pro Max 256GB, Apple iPhone 12 128GB, Apple iPhone 13 128GB, Apple iPhone 14 Pro 128GB, Apple iPhone 14 Pro 256GB, Apple iPhone 14 Pro Max 128GB, Apple iPhone 13 Pro Max 128GB, Apple iPhone 14 128GB, Apple iPhone 12 Pro 512GB, and Apple iPhone 12 mini 64GB.',\n",
" 'Yes, there are several budget laptops in the API_RESPONSE. The HP 14-dq0055dx and HP 14-dq0054dx are both priced at $199.99, and the HP 15-dw0083wm is priced at $244.99.',\n",
" 'The cheapest gaming PC available is the Alarco Gaming PC (X_BLACK_GTX750) for $499.99. You can find more information about it here: https://www.klarna.com/us/shopping/pl/cl223/3203154750/Desktop-Computers/Alarco-Gaming-PC-%28X_BLACK_GTX750%29/?utm_source=openai&ref-site=openai_plugin',\n",
" 'Yes, there are several tablets under $400. These include the Apple iPad 10.2\" 32GB (2019), Amazon Fire HD 8\" 32GB (10th Generation), Amazon Fire HD 8 Kids Tablet 8\" - 32GB, and Samsung Galaxy Tab A8 10.5 SM-X200 32GB.',\n",
" 'It looks like there are several great options available. Based on the API response, some of the best headphones include Apple AirPods Pro (2nd generation) 2022, Apple AirPods Max, and Bose Noise Cancelling Headphones 700.',\n",
" 'The top rated laptops from the API_RESPONSE are the Apple MacBook Air (2020) M1 OC 7C GPU 8GB 256GB SSD 13\", Apple MacBook Air (2022) M2 OC 8C GPU 8GB 256GB SSD 13.6\", Apple MacBook Pro (2022) M2 OC 10C GPU 8GB 256GB SSD 13.3\", and Apple MacBook Pro (2020) M1 OC 8C GPU 8GB 256GB SSD 13\".',\n",
" \"I found several Nike and Adidas shoes in the API response. Here are the links to the products: Nike Dunk Low M - Black/White: https://www.klarna.com/us/shopping/pl/cl337/3200177969/Shoes/Nike-Dunk-Low-M-Black-White/?utm_source=openai&ref-site=openai_plugin, Nike Air Jordan 4 Retro M - Midnight Navy: https://www.klarna.com/us/shopping/pl/cl337/3202929835/Shoes/Nike-Air-Jordan-4-Retro-M-Midnight-Navy/?utm_source=openai&ref-site=openai_plugin, Nike Air Force 1 '07 M - White: https://www.klarna.com/us/shopping/pl/cl337/3979297/Shoes/Nike-Air-Force-1-07-M-White/?utm_source=openai&ref-site=openai_plugin, Nike Dunk Low W - White/Black: https://www.klarna.com/us/shopping/pl/cl337/3200134705/Shoes/Nike-Dunk-Low-W-White-Black/?utm_source=openai&ref-site=openai_plugin, Nike Air Jordan 1 Retro High M - White/University Blue/Black: https://www.klarna.com/us/shopping/pl/cl337/3200383658/Shoes/Nike-Air-Jordan-1-Retro-High-M-White-University-Blue-Black/?utm_source=openai&ref-site=openai_plugin, Nike Air Jordan 1 Retro High OG M - True Blue/Cement Grey/White: https://www.klarna.com/us/shopping/pl/cl337/3204655673/Shoes/Nike-Air-Jordan-1-Retro-High-OG-M-True-Blue-Cement-Grey-White/?utm_source=openai&ref-site=openai_plugin, Nike Air Jordan 11 Retro Cherry - White/Varsity Red/Black: https://www.klarna.com/us/shopping/pl/cl337/3202929696/Shoes/Nike-Air-Jordan-11-Retro-Cherry-White-Varsity-Red-Black/?utm_source=openai&ref-site=openai_plugin, Nike Dunk High W - White/Black: https://www.klarna.com/us/shopping/pl/cl337/3201956448/Shoes/Nike-Dunk-High-W-White-Black/?utm_source=openai&ref-site=openai_plugin, Nike Air Jordan 5 Retro M - Black/Taxi/Aquatone: https://www.klarna.com/us/shopping/pl/cl337/3204923084/Shoes/Nike-Air-Jordan-5-Retro-M-Black-Taxi-Aquatone/?utm_source=openai&ref-site=openai_plugin, Nike Court Legacy Lift W: https://www.klarna.com/us/shopping/pl/cl337/3202103728/Shoes/Nike-Court-Legacy-Lift-W/?utm_source=openai&ref-site=openai_plugin\",\n",
" \"I found several skirts that may interest you. Please take a look at the following products: Avenue Plus Size Denim Stretch Skirt, LoveShackFancy Ruffled Mini Skirt - Antique White, Nike Dri-Fit Club Golf Skirt - Active Pink, Skims Soft Lounge Ruched Long Skirt, French Toast Girl's Front Pleated Skirt with Tabs, Alexia Admor Women's Harmonie Mini Skirt Pink Pink, Vero Moda Long Skirt, Nike Court Dri-FIT Victory Flouncy Tennis Skirt Women - White/Black, Haoyuan Mini Pleated Skirts W, and Zimmermann Lyre Midi Skirt.\",\n",
" 'Based on the API response, you may want to consider the Skytech Archangel Gaming Computer PC Desktop, the CyberPowerPC Gamer Master Gaming Desktop, or the ASUS ROG Strix G10DK-RS756, all of which offer powerful processors and plenty of RAM.',\n",
" 'Based on the API response, the best budget cameras are the DJI Mini 2 Dog Camera ($448.50), Insta360 Sphere with Landing Pad ($429.99), DJI FPV Gimbal Camera ($121.06), Parrot Camera & Body ($36.19), and DJI FPV Air Unit ($179.00).']"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"answers = [res['output'] for res in chain_outputs]\n",
"answers"
]
},
{
"cell_type": "markdown",
"id": "484f0587",
"metadata": {},
"source": [
"## Evaluate the requests chain\n",
"\n",
"The API Chain has two main components:\n",
"1. Translate the user query to an API request\n",
"2. Translate the API response to a natural language response\n",
"\n",
"Here, we construct an evaluation chain to grade the request synthesizer against selected human queries "
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "3ea5afd7",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Define ground truth labels\n",
"truth_queries = [\n",
" {\"q\": \"iPhone\"},\n",
" {\"q\": \"laptop\", \"max_price\": 300},\n",
" {\"q\": \"tablet\"},\n",
" {\"q\": \"headphone\"},\n",
" {\"q\": \"laptop\", \"max_price\": 400},\n",
" {\"q\": \"shoe\"},\n",
" {\"q\": \"skirt\"},\n",
" {\"q\": \"professional desktop PC\", \"max_price\": 10000},\n",
" {\"q\": \"camera\", \"max_price\": 300},\n",
"]\n",
"truth_queries = [json.dumps(q) for q in truth_queries]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "e055f24b",
"metadata": {},
"outputs": [],
"source": [
"# Collect the API queries generated by the chain\n",
"predicted_queries = [output[\"intermediate_steps\"][\"request_args\"] for output in chain_outputs]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "7d4f2b88",
"metadata": {},
"outputs": [],
"source": [
"from langchain.prompts import PromptTemplate\n",
"\n",
"template = \"\"\"You are trying to answer the following question by querying an API:\n",
"\n",
"> Question: {question}\n",
"\n",
"The query you know you should be executing against the API is:\n",
"\n",
"> Query: {truth_query}\n",
"\n",
"Is the following predicted query semantically the same (eg likely to produce the same answer)?\n",
"\n",
"> Predicted Query: {predict_query}\n",
"\n",
"Please give the Predicted Query a grade of either an A, B, C, D, or F, along with an explanation of why. End the evaluation with 'Final Grade: <the letter>'\n",
"\n",
"> Explanation: Let's think step by step.\"\"\"\n",
"\n",
"prompt = PromptTemplate.from_template(template)\n",
"\n",
"eval_chain = LLMChain(llm=llm, prompt=prompt, verbose=verbose)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "8cc1b1db",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[' The original query is asking for all iPhone models, so the \"q\" parameter is correct. The \"size\" parameter is not necessary, as it is not relevant to the question. The \"min_price\" and \"max_price\" parameters are also not necessary, as the question does not ask for any pricing information. Therefore, this predicted query is not semantically the same as the original query and does not provide the same answer. Final Grade: F',\n",
" ' The original query is asking for laptops with a maximum price of $300. The predicted query is asking for laptops with a minimum price of $0 and a maximum price of $500. This means that the predicted query is likely to return more results than the original query, as it is asking for a wider range of prices. However, the predicted query is still likely to return some budget laptops, as it is still asking for laptops with a maximum price of $500. Therefore, the predicted query is semantically similar to the original query, but it is likely to return more results. Final Grade: B',\n",
" ' The query is asking for the cheapest gaming PC, so the first two parameters are correct. The third parameter, \"size\", is not necessary for this query, so it should be removed. The fourth parameter, \"min_price\", is also not necessary since the query is asking for the cheapest gaming PC. The fifth parameter, \"max_price\", should be set to null since the query is asking for the cheapest gaming PC and not a specific price range. Final Grade: B',\n",
" ' The original query is asking for any tablets under $400. The predicted query is asking for a tablet, with a size of 10, with a minimum price of 0 and a maximum price of 400. This query is semantically the same as the original query, as it is asking for a tablet with a price range of 0 to 400. Therefore, the predicted query is likely to produce the same answer as the original query. Final Grade: A',\n",
" ' The original query is looking for a laptop with a maximum price of 400. The predicted query is looking for headphones with a minimum price of 0 and a maximum price of 500. The two queries are not semantically the same because they are looking for different items (laptops vs. headphones) and different price ranges. Final Grade: F',\n",
" ' The original query is asking for the top rated laptops, so the first part of the predicted query is correct in that it is asking for laptops. However, the predicted query also includes parameters for size, min_price, and max_price, which are not relevant to the original question. Therefore, the predicted query is not semantically the same as the original query and is not likely to produce the same answer. Final Grade: F',\n",
" ' The original query is asking for shoes, so the predicted query is on the right track. However, the predicted query is adding additional parameters that are not relevant to the original question. The size, min_price, and max_price parameters are not necessary to answer the original question, so they are not semantically the same. Final Grade: C',\n",
" ' The original query is looking for a professional desktop PC with a maximum price of $10,000. The predicted query is looking for a skirt with a size of 10 and a price range of $0 to $500. The two queries are not semantically the same, as they are looking for two different items. The predicted query is not likely to produce the same answer as the original query. Final Grade: F',\n",
" \" The original query is asking for a professional Desktop PC with no price limit. The predicted query is asking for a Desktop PC with a size of 10 and a minimum price of 0, with no maximum price limit. The predicted query is missing the 'professional' keyword, which could lead to results that are not what the original query was asking for. Therefore, the predicted query does not semantically match the original query and should not be used. Final Grade: F\"]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"request_eval_results = []\n",
"for question, predict_query, truth_query in list(zip(questions, predicted_queries, truth_queries)):\n",
" eval_output = eval_chain.run(\n",
" question=question,\n",
" truth_query=truth_query,\n",
" predict_query=predict_query,\n",
" )\n",
" request_eval_results.append(eval_output)\n",
"request_eval_results"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "0d76f8ba",
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"from typing import List\n",
"# Parse the evaluation chain responses into a rubric\n",
"def parse_eval_results(results: List[str]) -> List[float]:\n",
" rubric = {\n",
" \"A\": 1.0,\n",
" \"B\": 0.75,\n",
" \"C\": 0.5,\n",
" \"D\": 0.25,\n",
" \"F\": 0\n",
" }\n",
" return [rubric[re.search(r'Final Grade: (\\w+)', res).group(1)] for res in results]\n",
"\n",
"\n",
"parsed_results = parse_eval_results(request_eval_results)\n",
"# Collect the scores for a final evaluation table\n",
"scores['request_synthesizer'].extend(parsed_results)"
]
},
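{
"cell_type": "markdown",
"id": "c9d0e1f2",
"metadata": {},
"source": [
"*Optional*: `parse_eval_results` raises an `AttributeError` if the grader ever omits the 'Final Grade:' phrase. A more defensive sketch (the function name is ours, not part of the original chain; it reuses `re` and `List` from the cell above) that scores unparseable grades as 0.0:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d3e4f5a6",
"metadata": {},
"outputs": [],
"source": [
"# def parse_eval_results_safe(results: List[str]) -> List[float]:\n",
"#     rubric = {\"A\": 1.0, \"B\": 0.75, \"C\": 0.5, \"D\": 0.25, \"F\": 0.0}\n",
"#     parsed = []\n",
"#     for res in results:\n",
"#         match = re.search(r'Final Grade: ([A-F])', res)\n",
"#         # Score 0.0 when the grade is missing or outside the rubric\n",
"#         parsed.append(rubric.get(match.group(1), 0.0) if match else 0.0)\n",
"#     return parsed"
]
},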
{
"cell_type": "markdown",
"id": "6f3ee8ea",
"metadata": {},
"source": [
"## Evaluate the Response Chain\n",
"\n",
"The second component translated the structured API response to a natural language response.\n",
"Evaluate this against the user's original question."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "8b97847c",
"metadata": {},
"outputs": [],
"source": [
"from langchain.prompts import PromptTemplate\n",
"\n",
"template = \"\"\"You are trying to answer the following question by querying an API:\n",
"\n",
"> Question: {question}\n",
"\n",
"The API returned a response of:\n",
"\n",
"> API result: {api_response}\n",
"\n",
"Your response to the user: {answer}\n",
"\n",
"Please evaluate the accuracy and utility of your response to the user's original question, conditioned on the information available.\n",
"Give a letter grade of either an A, B, C, D, or F, along with an explanation of why. End the evaluation with 'Final Grade: <the letter>'\n",
"\n",
"> Explanation: Let's think step by step.\"\"\"\n",
"\n",
"prompt = PromptTemplate.from_template(template)\n",
"\n",
"eval_chain = LLMChain(llm=llm, prompt=prompt, verbose=verbose)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "642852ce",
"metadata": {},
"outputs": [],
"source": [
"# Extract the API responses from the chain\n",
"api_responses = [output[\"intermediate_steps\"][\"response_text\"] for output in chain_outputs]"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "08a5eb4f",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[' The original query is asking for all iPhone models, so the \"q\" parameter is correct. The \"size\" parameter is not necessary, as it is not relevant to the question. The \"min_price\" and \"max_price\" parameters are also not necessary, as the question does not ask for any pricing information. Therefore, this predicted query is not semantically the same as the original query and does not provide the same answer. Final Grade: F',\n",
" ' The original query is asking for laptops with a maximum price of $300. The predicted query is asking for laptops with a minimum price of $0 and a maximum price of $500. This means that the predicted query is likely to return more results than the original query, as it is asking for a wider range of prices. However, the predicted query is still likely to return some budget laptops, as it is still asking for laptops with a maximum price of $500. Therefore, the predicted query is semantically similar to the original query, but it is likely to return more results. Final Grade: B',\n",
" ' The query is asking for the cheapest gaming PC, so the first two parameters are correct. The third parameter, \"size\", is not necessary for this query, so it should be removed. The fourth parameter, \"min_price\", is also not necessary since the query is asking for the cheapest gaming PC. The fifth parameter, \"max_price\", should be set to null since the query is asking for the cheapest gaming PC and not a specific price range. Final Grade: B',\n",
" ' The original query is asking for any tablets under $400. The predicted query is asking for a tablet, with a size of 10, with a minimum price of 0 and a maximum price of 400. This query is semantically the same as the original query, as it is asking for a tablet with a price range of 0 to 400. Therefore, the predicted query is likely to produce the same answer as the original query. Final Grade: A',\n",
" ' The original query is looking for a laptop with a maximum price of 400. The predicted query is looking for headphones with a minimum price of 0 and a maximum price of 500. The two queries are not semantically the same because they are looking for different items (laptops vs. headphones) and different price ranges. Final Grade: F',\n",
" ' The original query is asking for the top rated laptops, so the first part of the predicted query is correct in that it is asking for laptops. However, the predicted query also includes parameters for size, min_price, and max_price, which are not relevant to the original question. Therefore, the predicted query is not semantically the same as the original query and is not likely to produce the same answer. Final Grade: F',\n",
" ' The original query is asking for shoes, so the predicted query is on the right track. However, the predicted query is adding additional parameters that are not relevant to the original question. The size, min_price, and max_price parameters are not necessary to answer the original question, so they are not semantically the same. Final Grade: C',\n",
" ' The original query is looking for a professional desktop PC with a maximum price of $10,000. The predicted query is looking for a skirt with a size of 10 and a price range of $0 to $500. The two queries are not semantically the same, as they are looking for two different items. The predicted query is not likely to produce the same answer as the original query. Final Grade: F',\n",
" \" The original query is asking for a professional Desktop PC with no price limit. The predicted query is asking for a Desktop PC with a size of 10 and a minimum price of 0, with no maximum price limit. The predicted query is missing the 'professional' keyword, which could lead to results that are not what the original query was asking for. Therefore, the predicted query does not semantically match the original query and should not be used. Final Grade: F\",\n",
" ' The user asked a question about what iPhone models are available, and the API returned a response with 10 different models. The response provided by the user accurately listed all 10 models, so the accuracy of the response is A+. The utility of the response is also A+ since the user was able to get the exact information they were looking for. Final Grade: A+',\n",
" ' The API response provided a list of laptops with their prices and attributes. The response was accurate in that it provided the user with a list of budget laptops that are available. The response was also useful in that it provided the user with the information they needed to make an informed decision. Final Grade: A',\n",
" \" The API response provided the name, price, and URL of the product, which is exactly what the user asked for. The response also provided additional information about the product's attributes, which is useful for the user to make an informed decision. Therefore, the response is accurate and useful. Final Grade: A\",\n",
" \" The API response provided a list of tablets that are under $400. The response accurately answered the user's question. The response also provided useful information such as the product name, price, and attributes. Therefore, the response was accurate and useful. Final Grade: A\",\n",
" ' The API response provided a list of headphones with their respective features and prices. The user asked for the best headphones, so the response should include the best options available. The response provided a list of headphones that include features such as noise cancelling and type of headphone. The response also included the prices of the headphones, which is important for the user to know. Therefore, the response was accurate and useful in providing the user with the best options available. Final Grade: A',\n",
" ' The API response provided a list of laptops with their attributes, which is exactly what the user asked for. The response provided a concise list of the top rated laptops, which is what the user was looking for. The response was accurate and useful, so I would give it an A. Final Grade: A',\n",
" ' The API response provided a list of shoes from both Adidas and Nike, which is exactly what the user asked for. The response also included the product name, price, and attributes for each shoe, which is useful information for the user to make an informed decision. The response also included links to the products, which is helpful for the user to purchase the shoes. Overall, the response was accurate and useful, so I would give it an A. Final Grade: A',\n",
" \" The API response provided a list of skirts that could potentially meet the user's needs. The response also included the name, price, and attributes of each skirt. This is a great start, as it provides the user with a variety of options to choose from. However, the response does not provide any images of the skirts, which would have been helpful for the user to make a decision. Additionally, the response does not provide any information about the availability of the skirts, which could be important for the user. \\n\\nFinal Grade: B\",\n",
" ' The user asked for a professional desktop PC with no budget constraints. The API response provided a list of products that fit the criteria, including the Skytech Archangel Gaming Computer PC Desktop, the CyberPowerPC Gamer Master Gaming Desktop, and the ASUS ROG Strix G10DK-RS756. The response accurately suggested these three products as potential options for the user, and provided useful information about their features and prices. Final Grade: A',\n",
" ' The API response provided a list of cameras with their prices, which is exactly what the user asked for. The response was accurate and provided the user with the information they needed to make an informed decision. The response was also useful, as it provided the user with a list of cameras that fit their budget. Final Grade: A']"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Run the grader chain\n",
"response_eval_results = []\n",
"for question, api_response, answer in list(zip(questions, api_responses, answers)):\n",
" request_eval_results.append(eval_chain.run(question=question, api_response=api_response, answer=answer))\n",
"request_eval_results"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "a144aa9d",
"metadata": {},
"outputs": [],
"source": [
"# Reusing the rubric from above, parse the evaluation chain responses\n",
"parsed_response_results = parse_eval_results(request_eval_results)\n",
"# Collect the scores for a final evaluation table\n",
"scores['result_synthesizer'].extend(parsed_response_results)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "e95042bc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Metric \tMin \tMean \tMax \n",
"completed \t1.00 \t1.00 \t1.00 \n",
"request_synthesizer \t0.00 \t0.33 \t1.00 \n",
"result_synthesizer \t0.00 \t0.67 \t1.00 \n"
]
}
],
"source": [
"# Print out Score statistics for the evaluation session\n",
"header = \"{:<20}\\t{:<10}\\t{:<10}\\t{:<10}\".format(\"Metric\", \"Min\", \"Mean\", \"Max\")\n",
"print(header)\n",
"for metric, metric_scores in scores.items():\n",
" mean_scores = sum(metric_scores) / len(metric_scores) if len(metric_scores) > 0 else float('nan')\n",
" row = \"{:<20}\\t{:<10.2f}\\t{:<10.2f}\\t{:<10.2f}\".format(metric, min(metric_scores), mean_scores, max(metric_scores))\n",
" print(row)\n"
]
},
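{
"cell_type": "markdown",
"id": "e7f8a9b0",
"metadata": {},
"source": [
"*Optional*: if pandas is installed (it is not otherwise used in this notebook), the same summary table can be produced in one line. A sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f1a2b3c4",
"metadata": {},
"outputs": [],
"source": [
"# import pandas as pd\n",
"# # Series of unequal lengths are NaN-padded; agg skips NaNs\n",
"# pd.DataFrame({k: pd.Series(v) for k, v in scores.items()}).agg(['min', 'mean', 'max'])"
]
},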
{
"cell_type": "code",
"execution_count": 20,
"id": "03fe96af",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Re-show the examples for which the chain failed to complete\n",
"failed_examples"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0ee43877",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}