Add to notebook to assist in ground truth question generation (#2523)

At the bottom of the notebook, continue to show how to generate example test cases with the assistance of an LLM
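
The list-parsing step that turns the LLM's numbered completion into test questions can be sketched on its own, roughly as below. This is an illustrative sketch rather than the notebook's exact code: `parse_numbered_list` is a hypothetical name, the sample completion is made up, and skipping blank lines is an added assumption.

```python
import re
from typing import List


def parse_numbered_list(completion: str) -> List[str]:
    """Turn an LLM completion that continues a "1." prompt into a list of strings."""
    items = []
    for line in completion.split("\n"):
        # Drop a leading bullet such as "2. ", then strip whitespace and double quotes.
        # The first line carries no bullet because the prompt already ended with "1."
        cleaned = re.sub(r"^\d+\. ", "", line).strip().strip('"')
        if cleaned:  # assumption: blank lines are skipped rather than kept
            items.append(cleaned)
    return items


# Made-up completion text, for illustration only
sample = (
    ' "How do I say hello in Spanish?"\n'
    '2. "What is the French word for goodbye?"\n'
    "3. Translate thank you into German\n"
)
queries = parse_numbered_list(sample)
print(queries)
```

Because the prompt ends with `1.`, the model is nudged to continue the list, which keeps the output format predictable enough for a simple regular expression.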
1 year ago · 632c65d64b
parent 15cdfa9e7f
commit 632c65d64b
1 changed files with 877 additions and 565 deletions
--- a/docs/modules/chains/examples/openapi_eval.ipynb
+++ b/docs/modules/chains/examples/openapi_eval.ipynb
@ -1,6 +1,5 @@
 {
- "cells": [
-  {
+       "cells": [{
                     "cell_type": "markdown",
                     "id": "692f3256",
                     "metadata": {},
@ -38,15 +37,13 @@
                     "execution_count": 2,
                     "id": "794142ba",
                     "metadata": {},
-   "outputs": [
-    {
+                     "outputs": [{
                            "name": "stderr",
                            "output_type": "stream",
                            "text": [
                                   "Attempting to load an OpenAPI 3.0.1 spec.  This may result in degraded performance. Convert your OpenAPI spec to 3.1.* spec for better support.\n"
                            ]
-    }
-   ],
+                     }],
                     "source": [
                            "# Load and parse the OpenAPI Spec\n",
                            "spec = OpenAPISpec.from_url(\"https://www.klarna.com/us/shopping/public/openai/v0/api-docs/\")\n",
@ -70,7 +67,9 @@
                     "id": "6c05ba5b",
                     "metadata": {},
                     "source": [
-    "### *Optional*: Generate Input Questions and Request Ground Truth Queries"
+                            "### *Optional*: Generate Input Questions and Request Ground Truth Queries\n",
+                            "\n",
+                            "See [Generating Test Datasets](#Generating-Test-Datasets) at the end of this notebook for more details."
                     ]
              },
              {
@ -185,8 +184,7 @@
                     "execution_count": 8,
                     "id": "f3c9729f",
                     "metadata": {},
-   "outputs": [
-    {
+                     "outputs": [{
                            "data": {
                                   "text/plain": [
                                          "[]"
@ -195,8 +193,7 @@
                            "execution_count": 8,
                            "metadata": {},
                            "output_type": "execute_result"
-    }
-   ],
+                     }],
                     "source": [
                            "# If the chain failed to run, show the failing examples\n",
                            "failed_examples"
@ -207,8 +204,7 @@
                     "execution_count": 9,
                     "id": "914e7587",
                     "metadata": {},
-   "outputs": [
-    {
+                     "outputs": [{
                            "data": {
                                   "text/plain": [
                                          "['There are currently 10 Apple iPhone models available: Apple iPhone 14 Pro Max 256GB, Apple iPhone 12 128GB, Apple iPhone 13 128GB, Apple iPhone 14 Pro 128GB, Apple iPhone 14 Pro 256GB, Apple iPhone 14 Pro Max 128GB, Apple iPhone 13 Pro Max 128GB, Apple iPhone 14 128GB, Apple iPhone 12 Pro 512GB, and Apple iPhone 12 mini 64GB.',\n",
@ -226,8 +222,7 @@
                            "execution_count": 9,
                            "metadata": {},
                            "output_type": "execute_result"
-    }
-   ],
+                     }],
                     "source": [
                            "answers = [res['output'] for res in chain_outputs]\n",
                            "answers"
@ -317,8 +312,7 @@
                     "execution_count": 13,
                     "id": "8cc1b1db",
                     "metadata": {},
-   "outputs": [
-    {
+                     "outputs": [{
                            "data": {
                                   "text/plain": [
                                          "[' The original query is asking for all iPhone models, so the \"q\" parameter is correct. The \"size\" parameter is not necessary, as it is not relevant to the question. The \"min_price\" and \"max_price\" parameters are also not necessary, as the question does not ask for any pricing information. Therefore, this predicted query is not semantically the same as the original query and does not provide the same answer. Final Grade: F',\n",
@ -326,17 +320,16 @@
                                          " ' The query is asking for the cheapest gaming PC, so the first two parameters are correct. The third parameter, \"size\", is not necessary for this query, so it should be removed. The fourth parameter, \"min_price\", is also not necessary since the query is asking for the cheapest gaming PC. The fifth parameter, \"max_price\", should be set to null since the query is asking for the cheapest gaming PC and not a specific price range. Final Grade: B',\n",
                                          " ' The original query is asking for any tablets under $400. The predicted query is asking for a tablet, with a size of 10, with a minimum price of 0 and a maximum price of 400. This query is semantically the same as the original query, as it is asking for a tablet with a price range of 0 to 400. Therefore, the predicted query is likely to produce the same answer as the original query. Final Grade: A',\n",
                                          " ' The original query is looking for a laptop with a maximum price of 400. The predicted query is looking for headphones with a minimum price of 0 and a maximum price of 500. The two queries are not semantically the same because they are looking for different items (laptops vs. headphones) and different price ranges. Final Grade: F',\n",
-       " ' The original query is asking for the top rated laptops, so the first part of the predicted query is correct in that it is asking for laptops. However, the predicted query also includes parameters for size, min_price, and max_price, which are not relevant to the original question. Therefore, the predicted query is not semantically the same as the original query and is not likely to produce the same answer. Final Grade: F',\n",
+                                          " ' The original query is asking for the top rated laptops, so the first part of the predicted query is correct in that it is asking for laptops. However, the predicted query also includes parameters for size, min_price, and max_price, which are not relevant to the original question. Therefore, the predicted query is not semantically the same as the original query and is not likely to produce the same answer. Final Grade: D',\n",
                                          " ' The original query is asking for shoes, so the predicted query is on the right track. However, the predicted query is adding additional parameters that are not relevant to the original question. The size, min_price, and max_price parameters are not necessary to answer the original question, so they are not semantically the same. Final Grade: C',\n",
                                          " ' The original query is looking for a professional desktop PC with a maximum price of $10,000. The predicted query is looking for a skirt with a size of 10 and a price range of $0 to $500. The two queries are not semantically the same, as they are looking for two different items. The predicted query is not likely to produce the same answer as the original query. Final Grade: F',\n",
-       " \" The original query is asking for a professional Desktop PC with no price limit. The predicted query is asking for a Desktop PC with a size of 10 and a minimum price of 0, with no maximum price limit. The predicted query is missing the 'professional' keyword, which could lead to results that are not what the original query was asking for. Therefore, the predicted query does not semantically match the original query and should not be used. Final Grade: F\"]"
+                                          " \" The original query is asking for a professional Desktop PC with no price limit. The predicted query is asking for a Desktop PC with a size of 10 and a minimum price of 0, with no maximum price limit. The predicted query is missing the 'professional' keyword, which is important for the query to be semantically the same. Additionally, the size of the Desktop PC is not specified in the original query, so this could be a factor in the results. Therefore, the predicted query is not semantically the same as the original query. Final Grade: D\"]"
                                   ]
                            },
                            "execution_count": 13,
                            "metadata": {},
                            "output_type": "execute_result"
-    }
-   ],
+                     }],
                     "source": [
                            "request_eval_results = []\n",
                            "for question, predict_query, truth_query in list(zip(questions, predicted_queries, truth_queries)):\n",
@ -433,8 +426,7 @@
                     "metadata": {
                            "scrolled": true
                     },
-   "outputs": [
-    {
+                     "outputs": [{
                            "data": {
                                   "text/plain": [
                                          "[' The original query is asking for all iPhone models, so the \"q\" parameter is correct. The \"size\" parameter is not necessary, as it is not relevant to the question. The \"min_price\" and \"max_price\" parameters are also not necessary, as the question does not ask for any pricing information. Therefore, this predicted query is not semantically the same as the original query and does not provide the same answer. Final Grade: F',\n",
@ -442,27 +434,26 @@
                                          " ' The query is asking for the cheapest gaming PC, so the first two parameters are correct. The third parameter, \"size\", is not necessary for this query, so it should be removed. The fourth parameter, \"min_price\", is also not necessary since the query is asking for the cheapest gaming PC. The fifth parameter, \"max_price\", should be set to null since the query is asking for the cheapest gaming PC and not a specific price range. Final Grade: B',\n",
                                          " ' The original query is asking for any tablets under $400. The predicted query is asking for a tablet, with a size of 10, with a minimum price of 0 and a maximum price of 400. This query is semantically the same as the original query, as it is asking for a tablet with a price range of 0 to 400. Therefore, the predicted query is likely to produce the same answer as the original query. Final Grade: A',\n",
                                          " ' The original query is looking for a laptop with a maximum price of 400. The predicted query is looking for headphones with a minimum price of 0 and a maximum price of 500. The two queries are not semantically the same because they are looking for different items (laptops vs. headphones) and different price ranges. Final Grade: F',\n",
-       " ' The original query is asking for the top rated laptops, so the first part of the predicted query is correct in that it is asking for laptops. However, the predicted query also includes parameters for size, min_price, and max_price, which are not relevant to the original question. Therefore, the predicted query is not semantically the same as the original query and is not likely to produce the same answer. Final Grade: F',\n",
+                                          " ' The original query is asking for the top rated laptops, so the first part of the predicted query is correct in that it is asking for laptops. However, the predicted query also includes parameters for size, min_price, and max_price, which are not relevant to the original question. Therefore, the predicted query is not semantically the same as the original query and is not likely to produce the same answer. Final Grade: D',\n",
                                          " ' The original query is asking for shoes, so the predicted query is on the right track. However, the predicted query is adding additional parameters that are not relevant to the original question. The size, min_price, and max_price parameters are not necessary to answer the original question, so they are not semantically the same. Final Grade: C',\n",
                                          " ' The original query is looking for a professional desktop PC with a maximum price of $10,000. The predicted query is looking for a skirt with a size of 10 and a price range of $0 to $500. The two queries are not semantically the same, as they are looking for two different items. The predicted query is not likely to produce the same answer as the original query. Final Grade: F',\n",
-       " \" The original query is asking for a professional Desktop PC with no price limit. The predicted query is asking for a Desktop PC with a size of 10 and a minimum price of 0, with no maximum price limit. The predicted query is missing the 'professional' keyword, which could lead to results that are not what the original query was asking for. Therefore, the predicted query does not semantically match the original query and should not be used. Final Grade: F\",\n",
+                                          " \" The original query is asking for a professional Desktop PC with no price limit. The predicted query is asking for a Desktop PC with a size of 10 and a minimum price of 0, with no maximum price limit. The predicted query is missing the 'professional' keyword, which is important for the query to be semantically the same. Additionally, the size of the Desktop PC is not specified in the original query, so this could be a factor in the results. Therefore, the predicted query is not semantically the same as the original query. Final Grade: D\",\n",
                                          " ' The user asked a question about what iPhone models are available, and the API returned a response with 10 different models. The response provided by the user accurately listed all 10 models, so the accuracy of the response is A+. The utility of the response is also A+ since the user was able to get the exact information they were looking for. Final Grade: A+',\n",
-       " ' The API response provided a list of laptops with their prices and attributes. The response was accurate in that it provided the user with a list of budget laptops that are available. The response was also useful in that it provided the user with the information they needed to make an informed decision. Final Grade: A',\n",
+                                          " \" The API response provided a list of laptops with their prices and attributes. The user asked if there were any budget laptops, and the response provided a list of laptops that are all priced under $500. Therefore, the response was accurate and useful in answering the user's question. Final Grade: A\",\n",
                                          " \" The API response provided the name, price, and URL of the product, which is exactly what the user asked for. The response also provided additional information about the product's attributes, which is useful for the user to make an informed decision. Therefore, the response is accurate and useful. Final Grade: A\",\n",
-       " \" The API response provided a list of tablets that are under $400. The response accurately answered the user's question. The response also provided useful information such as the product name, price, and attributes. Therefore, the response was accurate and useful. Final Grade: A\",\n",
-       " ' The API response provided a list of headphones with their respective features and prices. The user asked for the best headphones, so the response should include the best options available. The response provided a list of headphones that include features such as noise cancelling and type of headphone. The response also included the prices of the headphones, which is important for the user to know. Therefore, the response was accurate and useful in providing the user with the best options available. Final Grade: A',\n",
-       " ' The API response provided a list of laptops with their attributes, which is exactly what the user asked for. The response provided a concise list of the top rated laptops, which is what the user was looking for. The response was accurate and useful, so I would give it an A. Final Grade: A',\n",
-       " ' The API response provided a list of shoes from both Adidas and Nike, which is exactly what the user asked for. The response also included the product name, price, and attributes for each shoe, which is useful information for the user to make an informed decision. The response also included links to the products, which is helpful for the user to purchase the shoes. Overall, the response was accurate and useful, so I would give it an A. Final Grade: A',\n",
+                                          " \" The API response provided a list of tablets that are under $400. The response accurately answered the user's question. The response also provided useful information such as the product name, price, and attributes. The response was clear and concise. Final Grade: A\",\n",
+                                          " ' The API response provided a list of headphones with their respective prices and attributes. The user asked for the best headphones, so the response should include the best headphones based on the criteria provided. The response provided a list of headphones that are all from the same brand (Apple) and all have the same type of headphone (True Wireless, In-Ear). This does not provide the user with enough information to make an informed decision about which headphones are the best. The response should have included a variety of brands and types of headphones to give the user a better understanding of their options. Final Grade: D',\n",
+                                          " ' The API response provided a list of laptops with their attributes, which is exactly what the user asked for. The response provided a comprehensive list of the top rated laptops, which is what the user was looking for. The response was accurate and useful, providing the user with the information they needed. Final Grade: A',\n",
+                                          " ' The API response provided a list of shoes from both Adidas and Nike, which is exactly what the user asked for. The response also included the product name, price, and attributes for each shoe, which is useful information for the user to make an informed decision. The response also included links to the products, which is helpful for the user to purchase the shoes. Therefore, the response was accurate and useful. Final Grade: A',\n",
                                          " \" The API response provided a list of skirts that could potentially meet the user's needs. The response also included the name, price, and attributes of each skirt. This is a great start, as it provides the user with a variety of options to choose from. However, the response does not provide any images of the skirts, which would have been helpful for the user to make a decision. Additionally, the response does not provide any information about the availability of the skirts, which could be important for the user. \\n\\nFinal Grade: B\",\n",
-       " ' The user asked for a professional desktop PC with no budget constraints. The API response provided a list of products that fit the criteria, including the Skytech Archangel Gaming Computer PC Desktop, the CyberPowerPC Gamer Master Gaming Desktop, and the ASUS ROG Strix G10DK-RS756. The response accurately suggested these three products as potential options for the user, and provided useful information about their features and prices. Final Grade: A',\n",
-       " ' The API response provided a list of cameras with their prices, which is exactly what the user asked for. The response was accurate and provided the user with the information they needed to make an informed decision. The response was also useful, as it provided the user with a list of cameras that fit their budget. Final Grade: A']"
+                                          " \" First, the response accurately answers the user's question by providing a list of professional desktop PCs that are available. Second, the response provides the user with a range of options that fit their needs, as each of the PCs listed have powerful processors and plenty of RAM. Finally, the response provides the user with the relevant information they need to make an informed decision, such as the price and attributes of each PC. Overall, the response is accurate and useful, so I would give it an A. Final Grade: A\",\n",
+                                          " \" The API response provided a list of cameras with their prices, which is exactly what the user asked for. The response also included additional information such as features and memory cards, which is not necessary for the user's question but could be useful for further research. The response was accurate and provided the user with the information they needed. Final Grade: A\"]"
                                   ]
                            },
                            "execution_count": 17,
                            "metadata": {},
                            "output_type": "execute_result"
-    }
-   ],
+                     }],
                     "source": [
                            "# Run the grader chain\n",
                            "response_eval_results = []\n",
@ -489,18 +480,16 @@
                     "execution_count": 19,
                     "id": "e95042bc",
                     "metadata": {},
-   "outputs": [
-    {
+                     "outputs": [{
                            "name": "stdout",
                            "output_type": "stream",
                            "text": [
                                   "Metric              \tMin       \tMean      \tMax       \n",
                                   "completed           \t1.00      \t1.00      \t1.00      \n",
-      "request_synthesizer \t0.00      \t0.33      \t1.00      \n",
-      "result_synthesizer  \t0.00      \t0.67      \t1.00      \n"
+                                   "request_synthesizer \t0.00      \t0.39      \t1.00      \n",
+                                   "result_synthesizer  \t0.00      \t0.66      \t1.00      \n"
                            ]
-    }
-   ],
+                     }],
                     "source": [
                            "# Print out Score statistics for the evaluation session\n",
                            "header = \"{:<20}\\t{:<10}\\t{:<10}\\t{:<10}\".format(\"Metric\", \"Min\", \"Mean\", \"Max\")\n",
@ -516,8 +505,7 @@
                     "execution_count": 20,
                     "id": "03fe96af",
                     "metadata": {},
-   "outputs": [
-    {
+                     "outputs": [{
                            "data": {
                                   "text/plain": [
                                          "[]"
@ -526,17 +514,341 @@
                            "execution_count": 20,
                            "metadata": {},
                            "output_type": "execute_result"
-    }
-   ],
+                     }],
                     "source": [
                            "# Re-show the examples for which the chain failed to complete\n",
                            "failed_examples"
                     ]
              },
+              {
+                     "cell_type": "markdown",
+                     "id": "2bb3636d",
+                     "metadata": {},
+                     "source": [
+                            "## Generating Test Datasets\n",
+                            "\n",
+                            "To evaluate a chain against your own endpoint, you'll want to generate a test dataset that conforms to the API.\n",
+                            "\n",
+                            "This section provides an overview of how to bootstrap the process.\n",
+                            "\n",
+                            "First, we'll parse the OpenAPI Spec. For this example, we'll use [Speak](https://www.speak.com/)'s OpenAPI specification."
+                     ]
+              },
+              {
+                     "cell_type": "code",
+                     "execution_count": 21,
+                     "id": "a453eb93",
+                     "metadata": {},
+                     "outputs": [{
+                            "name": "stderr",
+                            "output_type": "stream",
+                            "text": [
+                                   "Attempting to load an OpenAPI 3.0.1 spec.  This may result in degraded performance. Convert your OpenAPI spec to 3.1.* spec for better support.\n",
+                                   "Attempting to load an OpenAPI 3.0.1 spec.  This may result in degraded performance. Convert your OpenAPI spec to 3.1.* spec for better support.\n"
+                            ]
+                     }],
+                     "source": [
+                            "# Load and parse the OpenAPI Spec\n",
+                            "spec = OpenAPISpec.from_url(\"https://api.speak.com/openapi.yaml\")"
+                     ]
+              },
+              {
+                     "cell_type": "code",
+                     "execution_count": 22,
+                     "id": "bb65ffe8",
+                     "metadata": {},
+                     "outputs": [{
+                            "data": {
+                                   "text/plain": [
+                                          "['/v1/public/openai/explain-phrase',\n",
+                                          " '/v1/public/openai/explain-task',\n",
+                                          " '/v1/public/openai/translate']"
+                                   ]
+                            },
+                            "execution_count": 22,
+                            "metadata": {},
+                            "output_type": "execute_result"
+                     }],
+                     "source": [
+                            "# List the paths in the OpenAPI Spec\n",
+                            "paths = sorted(spec.paths.keys())\n",
+                            "paths"
+                     ]
+              },
+              {
+                     "cell_type": "code",
+                     "execution_count": 23,
+                     "id": "0988f01b",
+                     "metadata": {},
+                     "outputs": [{
+                            "data": {
+                                   "text/plain": [
+                                          "['post']"
+                                   ]
+                            },
+                            "execution_count": 23,
+                            "metadata": {},
+                            "output_type": "execute_result"
+                     }],
+                     "source": [
+                            "# See which HTTP Methods are available for a given path\n",
+                            "methods = spec.get_methods_for_path('/v1/public/openai/explain-task')\n",
+                            "methods"
+                     ]
+              },
+              {
+                     "cell_type": "code",
+                     "execution_count": 24,
+                     "id": "e9ef0a77",
+                     "metadata": {},
+                     "outputs": [{
+                            "name": "stdout",
+                            "output_type": "stream",
+                            "text": [
+                                   "type explainTask = (_: {\n",
+                                   "/* Description of the task that the user wants to accomplish or do. For example, \"tell the waiter they messed up my order\" or \"compliment someone on their shirt\" */\n",
+                                   "  task_description?: string,\n",
+                                   "/* The foreign language that the user is learning and asking about. The value can be inferred from question - for example, if the user asks \"how do i ask a girl out in mexico city\", the value should be \"Spanish\" because of Mexico City. Always use the full name of the language (e.g. Spanish, French). */\n",
+                                   "  learning_language?: string,\n",
+                                   "/* The user's native language. Infer this value from the language the user asked their question in. Always use the full name of the language (e.g. Spanish, French). */\n",
+                                   "  native_language?: string,\n",
+                                   "/* A description of any additional context in the user's question that could affect the explanation - e.g. setting, scenario, situation, tone, speaking style and formality, usage notes, or any other qualifiers. */\n",
+                                   "  additional_context?: string,\n",
+                                   "/* Full text of the user's question. */\n",
+                                   "  full_query?: string,\n",
+                                   "}) => any;\n"
+                            ]
+                     }],
+                     "source": [
+                            "# Load a single endpoint operation\n",
+                            "operation = APIOperation.from_openapi_spec(spec, '/v1/public/openai/explain-task', 'post')\n",
+                            "\n",
+                            "# The operation can be serialized as TypeScript\n",
+                            "print(operation.to_typescript())"
+                     ]
+              },
+              {
+                     "cell_type": "code",
+                     "execution_count": 25,
+                     "id": "f1186b6d",
+                     "metadata": {},
+                     "outputs": [],
+                     "source": [
+                            "# Compress the service definition to avoid leaking too much input structure to the sample data\n",
+                            "template = \"\"\"In 20 words or less, what does this service accomplish?\n",
+                            "{spec}\n",
+                            "\n",
+                            "Function: It's designed to \"\"\"\n",
+                            "prompt = PromptTemplate.from_template(template)\n",
+                            "generation_chain = LLMChain(llm=llm, prompt=prompt)\n",
+                            "purpose = generation_chain.run(spec=operation.to_typescript())"
+                     ]
+              },
+              {
+                     "cell_type": "code",
+                     "execution_count": 26,
+                     "id": "a594406a",
+                     "metadata": {},
+                     "outputs": [{
+                            "data": {
+                                   "text/plain": [
+                                          "[\"Can you explain how to say 'hello' in Spanish?\",\n",
+                                          " \"I need help understanding the French word for 'goodbye'.\",\n",
+                                          " \"Can you tell me how to say 'thank you' in German?\",\n",
+                                          " \"I'm trying to learn the Italian word for 'please'.\",\n",
+                                          " \"Can you help me with the pronunciation of 'yes' in Portuguese?\",\n",
+                                          " \"I'm looking for the Dutch word for 'no'.\",\n",
+                                          " \"Can you explain the meaning of 'hello' in Japanese?\",\n",
+                                          " \"I need help understanding the Russian word for 'thank you'.\",\n",
+                                          " \"Can you tell me how to say 'goodbye' in Chinese?\",\n",
+                                          " \"I'm trying to learn the Arabic word for 'please'.\"]"
+                                   ]
+                            },
+                            "execution_count": 26,
+                            "metadata": {},
+                            "output_type": "execute_result"
+                     }],
+                     "source": [
+                            "template = \"\"\"Write a list of {num_to_generate} unique messages users might send to a service designed to{purpose} They must each be completely unique.\n",
+                            "\n",
+                            "1.\"\"\"\n",
+                            "def parse_list(text: str) -> List[str]:\n",
+                            "    # Remove a leading \"N. \" bullet from each line, then strip\n",
+                            "    # surrounding whitespace and double quotes\n",
+                            "    return [re.sub(r'^\\d+\\. ', '', q).strip().strip('\"') for q in text.split('\\n')]\n",
+                            "\n",
+                            "num_to_generate = 10 # How many examples to use for this test set.\n",
+                            "prompt = PromptTemplate.from_template(template)\n",
+                            "generation_chain = LLMChain(llm=llm, prompt=prompt)\n",
+                            "text = generation_chain.run(purpose=purpose,\n",
+                            "                            num_to_generate=num_to_generate)\n",
+                            "# Strip preceding numeric bullets\n",
+                            "queries = parse_list(text)\n",
+                            "queries\n"
+                     ]
+              },
+              {
+                     "cell_type": "code",
+                     "execution_count": 27,
+                     "id": "8dc60f43",
+                     "metadata": {},
+                     "outputs": [{
+                            "data": {
+                                   "text/plain": [
+                                          "['{\"task_description\": \"say \\'hello\\'\", \"learning_language\": \"Spanish\", \"native_language\": \"English\", \"full_query\": \"Can you explain how to say \\'hello\\' in Spanish?\"}',\n",
+                                          " '{\"task_description\": \"understanding the French word for \\'goodbye\\'\", \"learning_language\": \"French\", \"native_language\": \"English\", \"full_query\": \"I need help understanding the French word for \\'goodbye\\'.\"}',\n",
+                                          " '{\"task_description\": \"say \\'thank you\\'\", \"learning_language\": \"German\", \"native_language\": \"English\", \"full_query\": \"Can you tell me how to say \\'thank you\\' in German?\"}',\n",
+                                          " '{\"task_description\": \"Learn the Italian word for \\'please\\'\", \"learning_language\": \"Italian\", \"native_language\": \"English\", \"full_query\": \"I\\'m trying to learn the Italian word for \\'please\\'.\"}',\n",
+                                          " '{\"task_description\": \"Help with pronunciation of \\'yes\\' in Portuguese\", \"learning_language\": \"Portuguese\", \"native_language\": \"English\", \"full_query\": \"Can you help me with the pronunciation of \\'yes\\' in Portuguese?\"}',\n",
+                                          " '{\"task_description\": \"Find the Dutch word for \\'no\\'\", \"learning_language\": \"Dutch\", \"native_language\": \"English\", \"full_query\": \"I\\'m looking for the Dutch word for \\'no\\'.\"}',\n",
+                                          " '{\"task_description\": \"Explain the meaning of \\'hello\\' in Japanese\", \"learning_language\": \"Japanese\", \"native_language\": \"English\", \"full_query\": \"Can you explain the meaning of \\'hello\\' in Japanese?\"}',\n",
+                                          " '{\"task_description\": \"understanding the Russian word for \\'thank you\\'\", \"learning_language\": \"Russian\", \"native_language\": \"English\", \"full_query\": \"I need help understanding the Russian word for \\'thank you\\'.\"}',\n",
+                                          " '{\"task_description\": \"say goodbye\", \"learning_language\": \"Chinese\", \"native_language\": \"English\", \"full_query\": \"Can you tell me how to say \\'goodbye\\' in Chinese?\"}',\n",
+                                          " '{\"task_description\": \"Learn the Arabic word for \\'please\\'\", \"learning_language\": \"Arabic\", \"native_language\": \"English\", \"full_query\": \"I\\'m trying to learn the Arabic word for \\'please\\'.\"}']"
+                                   ]
+                            },
+                            "execution_count": 27,
+                            "metadata": {},
+                            "output_type": "execute_result"
+                     }],
+                     "source": [
+                            "# Define the generation chain to get hypotheses\n",
+                            "api_chain = OpenAPIEndpointChain.from_api_operation(\n",
+                            "    operation, \n",
+                            "    llm, \n",
+                            "    requests=Requests(), \n",
+                            "    verbose=verbose,\n",
+                            "    return_intermediate_steps=True # Return request and response text\n",
+                            ")\n",
+                            "\n",
+                            "predicted_outputs = [api_chain(query) for query in queries]\n",
+                            "request_args = [output[\"intermediate_steps\"][\"request_args\"] for output in predicted_outputs]\n",
+                            "\n",
+                            "# Show the generated request\n",
+                            "request_args"
+                     ]
+              },
+              {
+                     "cell_type": "code",
+                     "execution_count": 28,
+                     "id": "b727e28e",
+                     "metadata": {},
+                     "outputs": [],
+                     "source": [
+                            "# AI-assisted correction\n",
+                            "correction_template = \"\"\"Correct the following API request based on the user's feedback. If the user indicates no changes are needed, output the original without making any changes.\n",
+                            "\n",
+                            "REQUEST: {request}\n",
+                            "\n",
+                            "User Feedback / requested changes: {user_feedback}\n",
+                            "\n",
+                            "Finalized Request: \"\"\"\n",
+                            "\n",
+                            "prompt = PromptTemplate.from_template(correction_template)\n",
+                            "correction_chain = LLMChain(llm=llm, prompt=prompt)\n"
+                     ]
+              },
+              {
+                     "cell_type": "code",
+                     "execution_count": 29,
+                     "id": "c1f4d71f",
+                     "metadata": {},
+                     "outputs": [{
+                            "name": "stdout",
+                            "output_type": "stream",
+                            "text": [
+                                   "Query: Can you explain how to say 'hello' in Spanish?\n",
+                                   "Request: {\"task_description\": \"say 'hello'\", \"learning_language\": \"Spanish\", \"native_language\": \"English\", \"full_query\": \"Can you explain how to say 'hello' in Spanish?\"}\n",
+                                   "Requested changes: \n",
+                                   "Query: I need help understanding the French word for 'goodbye'.\n",
+                                   "Request: {\"task_description\": \"understanding the French word for 'goodbye'\", \"learning_language\": \"French\", \"native_language\": \"English\", \"full_query\": \"I need help understanding the French word for 'goodbye'.\"}\n",
+                                   "Requested changes: \n",
+                                   "Query: Can you tell me how to say 'thank you' in German?\n",
+                                   "Request: {\"task_description\": \"say 'thank you'\", \"learning_language\": \"German\", \"native_language\": \"English\", \"full_query\": \"Can you tell me how to say 'thank you' in German?\"}\n",
+                                   "Requested changes: \n",
+                                   "Query: I'm trying to learn the Italian word for 'please'.\n",
+                                   "Request: {\"task_description\": \"Learn the Italian word for 'please'\", \"learning_language\": \"Italian\", \"native_language\": \"English\", \"full_query\": \"I'm trying to learn the Italian word for 'please'.\"}\n",
+                                   "Requested changes: \n",
+                                   "Query: Can you help me with the pronunciation of 'yes' in Portuguese?\n",
+                                   "Request: {\"task_description\": \"Help with pronunciation of 'yes' in Portuguese\", \"learning_language\": \"Portuguese\", \"native_language\": \"English\", \"full_query\": \"Can you help me with the pronunciation of 'yes' in Portuguese?\"}\n",
+                                   "Requested changes: task should be \"correctly pronounce yes\"\n",
+                                   "Updated request:  {\"task_description\": \"Correctly pronounce 'yes' in Portuguese\", \"learning_language\": \"Portuguese\", \"native_language\": \"English\", \"full_query\": \"Can you help me with the pronunciation of 'yes' in Portuguese?\"}\n",
+                                   "Query: I'm looking for the Dutch word for 'no'.\n",
+                                   "Request: {\"task_description\": \"Find the Dutch word for 'no'\", \"learning_language\": \"Dutch\", \"native_language\": \"English\", \"full_query\": \"I'm looking for the Dutch word for 'no'.\"}\n",
+                                   "Requested changes: task should be \"say no\"\n",
+                                   "Updated request:  {\"task_description\": \"Say no\", \"learning_language\": \"Dutch\", \"native_language\": \"English\", \"full_query\": \"I'm looking for the Dutch word for 'no'.\"}\n",
+                                   "Query: Can you explain the meaning of 'hello' in Japanese?\n",
+                                   "Request: {\"task_description\": \"Explain the meaning of 'hello' in Japanese\", \"learning_language\": \"Japanese\", \"native_language\": \"English\", \"full_query\": \"Can you explain the meaning of 'hello' in Japanese?\"}\n",
+                                   "Requested changes: task should be \"say hello\"\n",
+                                   "Updated request:  {\"task_description\": \"Say hello in Japanese\", \"learning_language\": \"Japanese\", \"native_language\": \"English\", \"full_query\": \"Can you say hello in Japanese?\"}\n",
+                                   "Query: I need help understanding the Russian word for 'thank you'.\n",
+                                   "Request: {\"task_description\": \"understanding the Russian word for 'thank you'\", \"learning_language\": \"Russian\", \"native_language\": \"English\", \"full_query\": \"I need help understanding the Russian word for 'thank you'.\"}\n",
+                                   "Requested changes: task should be \"thanks omeone\"\n",
+                                   "Updated request:  {\"task_description\": \"thanks someone\", \"learning_language\": \"Russian\", \"native_language\": \"English\", \"full_query\": \"I need help understanding the Russian word for 'thank you'.\"}\n",
+                                   "Query: Can you tell me how to say 'goodbye' in Chinese?\n",
+                                   "Request: {\"task_description\": \"say goodbye\", \"learning_language\": \"Chinese\", \"native_language\": \"English\", \"full_query\": \"Can you tell me how to say 'goodbye' in Chinese?\"}\n",
+                                   "Requested changes: \n",
+                                   "Query: I'm trying to learn the Arabic word for 'please'.\n",
+                                   "Request: {\"task_description\": \"Learn the Arabic word for 'please'\", \"learning_language\": \"Arabic\", \"native_language\": \"English\", \"full_query\": \"I'm trying to learn the Arabic word for 'please'.\"}\n",
+                                   "Requested changes: task should be \"ask please\"\n",
+                                   "Updated request:  {\"task_description\": \"Ask please\", \"learning_language\": \"Arabic\", \"native_language\": \"English\", \"full_query\": \"I'm trying to learn the Arabic word for 'please'.\"}\n"
+                            ]
+                     }],
+                     "source": [
+                            "ground_truth = []\n",
+                            "for query, request_arg in zip(queries, request_args):\n",
+                            "    feedback = input(f\"Query: {query}\\nRequest: {request_arg}\\nRequested changes: \")\n",
+                            "    if feedback == 'n' or feedback == 'none' or not feedback:\n",
+                            "        ground_truth.append(request_arg)\n",
+                            "        continue\n",
+                            "    resolved = correction_chain.run(request=request_arg,\n",
+                            "                                    user_feedback=feedback)\n",
+                            "    ground_truth.append(resolved.strip())\n",
+                            "    print(\"Updated request:\", resolved)"
+                     ]
+              },
+              {
+                     "cell_type": "markdown",
+                     "id": "19d68882",
+                     "metadata": {},
+                     "source": [
+                            "**You can now use `ground_truth` as shown above in [Evaluate the Requests Chain](#Evaluate-the-requests-chain)!**"
+                     ]
+              },
+              {
+                     "cell_type": "code",
+                     "execution_count": 30,
+                     "id": "5a596176",
+                     "metadata": {},
+                     "outputs": [{
+                            "data": {
+                                   "text/plain": [
+                                          "['{\"task_description\": \"say \\'hello\\'\", \"learning_language\": \"Spanish\", \"native_language\": \"English\", \"full_query\": \"Can you explain how to say \\'hello\\' in Spanish?\"}',\n",
+                                          " '{\"task_description\": \"understanding the French word for \\'goodbye\\'\", \"learning_language\": \"French\", \"native_language\": \"English\", \"full_query\": \"I need help understanding the French word for \\'goodbye\\'.\"}',\n",
+                                          " '{\"task_description\": \"say \\'thank you\\'\", \"learning_language\": \"German\", \"native_language\": \"English\", \"full_query\": \"Can you tell me how to say \\'thank you\\' in German?\"}',\n",
+                                          " '{\"task_description\": \"Learn the Italian word for \\'please\\'\", \"learning_language\": \"Italian\", \"native_language\": \"English\", \"full_query\": \"I\\'m trying to learn the Italian word for \\'please\\'.\"}',\n",
+                                          " '{\"task_description\": \"Correctly pronounce \\'yes\\' in Portuguese\", \"learning_language\": \"Portuguese\", \"native_language\": \"English\", \"full_query\": \"Can you help me with the pronunciation of \\'yes\\' in Portuguese?\"}',\n",
+                                          " '{\"task_description\": \"Say no\", \"learning_language\": \"Dutch\", \"native_language\": \"English\", \"full_query\": \"I\\'m looking for the Dutch word for \\'no\\'.\"}',\n",
+                                          " '{\"task_description\": \"Say hello in Japanese\", \"learning_language\": \"Japanese\", \"native_language\": \"English\", \"full_query\": \"Can you say hello in Japanese?\"}',\n",
+                                          " '{\"task_description\": \"thanks someone\", \"learning_language\": \"Russian\", \"native_language\": \"English\", \"full_query\": \"I need help understanding the Russian word for \\'thank you\\'.\"}',\n",
+                                          " '{\"task_description\": \"say goodbye\", \"learning_language\": \"Chinese\", \"native_language\": \"English\", \"full_query\": \"Can you tell me how to say \\'goodbye\\' in Chinese?\"}',\n",
+                                          " '{\"task_description\": \"Ask please\", \"learning_language\": \"Arabic\", \"native_language\": \"English\", \"full_query\": \"I\\'m trying to learn the Arabic word for \\'please\\'.\"}']"
+                                   ]
+                            },
+                            "execution_count": 30,
+                            "metadata": {},
+                            "output_type": "execute_result"
+                     }],
+                     "source": [
+                            "# Now you have a new ground truth set to use as shown above!\n",
+                            "ground_truth"
+                     ]
+              },
              {
                     "cell_type": "code",
                     "execution_count": null,
-   "id": "0ee43877",
+                     "id": "b7fe9dfa",
                     "metadata": {},
                     "outputs": [],
                     "source": []