langchain/docs/use_cases/evaluation/question_answering.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "480b7cf8",
   "metadata": {},
   "source": [
    "# Question Answering\n",
    "\n",
    "This notebook covers how to evaluate generic question answering problems: you have examples that each pair a question with its ground truth answer, and you want to measure how well the language model answers those questions."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78e3023b",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "For demonstration purposes, we will evaluate a simple question answering system that relies only on the model's internal knowledge. Please see other notebooks for examples that evaluate question answering over data the model was not trained on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "96710d50",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.prompts import PromptTemplate\n",
    "from langchain.chains import LLMChain\n",
    "from langchain.llms import OpenAI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "e33ccf00",
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt = PromptTemplate(template=\"Question: {question}\\nAnswer:\", input_variables=[\"question\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "172d993a",
   "metadata": {},
   "outputs": [],
   "source": [
    "llm = OpenAI(model_name=\"text-davinci-003\", temperature=0)\n",
    "chain = LLMChain(llm=llm, prompt=prompt)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c584440",
   "metadata": {},
   "source": [
    "## Examples\n",
    "For this purpose, we will just use two simple hardcoded examples, but see other notebooks for tips on how to get and/or generate these examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "87de1d84",
   "metadata": {},
   "outputs": [],
   "source": [
    "examples = [\n",
    "    {\n",
    "        \"question\": \"Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?\",\n",
    "        \"answer\": \"11\"\n",
    "    },\n",
    "    {\n",
    "        \"question\": 'Is the following sentence plausible? \"Joao Moutinho caught the screen pass in the NFC championship.\"',\n",
    "        \"answer\": \"No\"\n",
    "    }\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "143b1155",
   "metadata": {},
   "source": [
    "## Predictions\n",
    "\n",
    "We can now make and inspect the predictions for these questions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "c7bd809c",
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions = chain.apply(examples)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "f06dceab",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'text': ' 11 tennis balls'},\n",
       " {'text': ' No, this sentence is not plausible. Joao Moutinho is a professional soccer player, not an American football player, so it is not likely that he would be catching a screen pass in the NFC championship.'}]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predictions"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45cc2f9d",
   "metadata": {},
   "source": [
    "## Evaluation\n",
    "\n",
    "We can see that if we tried to do an exact match against the ground truth answers (`11` and `No`), they would not match what the language model answered. However, semantically the language model is correct in both cases. To account for this, we can use a language model itself to evaluate the answers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "0cacc65a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.evaluation.qa import QAEvalChain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "5aa6cd65",
   "metadata": {},
   "outputs": [],
   "source": [
    "llm = OpenAI(temperature=0)\n",
    "eval_chain = QAEvalChain.from_llm(llm)\n",
    "graded_outputs = eval_chain.evaluate(examples, predictions, question_key=\"question\", prediction_key=\"text\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "63780020",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Example 0:\n",
      "Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?\n",
      "Real Answer: 11\n",
      "Predicted Answer:  11 tennis balls\n",
      "Predicted Grade:  CORRECT\n",
      "\n",
      "Example 1:\n",
      "Question: Is the following sentence plausible? \"Joao Moutinho caught the screen pass in the NFC championship.\"\n",
      "Real Answer: No\n",
      "Predicted Answer:  No, this sentence is not plausible. Joao Moutinho is a professional soccer player, not an American football player, so it is not likely that he would be catching a screen pass in the NFC championship.\n",
      "Predicted Grade:  CORRECT\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for i, eg in enumerate(examples):\n",
    "    print(f\"Example {i}:\")\n",
    "    print(\"Question: \" + eg['question'])\n",
    "    print(\"Real Answer: \" + eg['answer'])\n",
    "    print(\"Predicted Answer: \" + predictions[i]['text'])\n",
    "    print(\"Predicted Grade: \" + graded_outputs[i]['text'])\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "782ae8c8",
   "metadata": {},
   "source": [
    "## Customize Prompt\n",
    "\n",
    "You can also customize the prompt that is used. Here is an example that prompts the model for a score from 0 to 10.\n",
    "The custom prompt requires 3 input variables: \"query\", \"answer\" and \"result\", where \"query\" is the question, \"answer\" is the ground truth answer, and \"result\" is the predicted answer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "153425c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.prompts.prompt import PromptTemplate\n",
    "\n",
    "_PROMPT_TEMPLATE = \"\"\"You are an expert professor specialized in grading students' answers to questions.\n",
    "You are grading the following question:\n",
    "{query}\n",
    "Here is the real answer:\n",
    "{answer}\n",
    "You are grading the following predicted answer:\n",
    "{result}\n",
    "What grade do you give from 0 to 10, where 0 is the lowest (very low similarity) and 10 is the highest (very high similarity)?\n",
    "\"\"\"\n",
    "\n",
    "PROMPT = PromptTemplate(input_variables=[\"query\", \"answer\", \"result\"], template=_PROMPT_TEMPLATE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a3b0fb7",
   "metadata": {},
   "outputs": [],
   "source": [
    "evalchain = QAEvalChain.from_llm(llm=llm, prompt=PROMPT)\n",
    "evalchain.evaluate(examples, predictions, question_key=\"question\", answer_key=\"answer\", prediction_key=\"text\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aaa61f0c",
   "metadata": {},
   "source": [
    "## Comparing to other evaluation metrics\n",
    "We can compare these evaluation results to other common evaluation metrics. To do this, let's load some evaluation metrics from Hugging Face's `evaluate` package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "d851453b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Some data munging to get the examples into the format the SQuAD metric expects.\n",
    "# Build new lists instead of deleting keys in place: `examples.copy()` is a shallow\n",
    "# copy, so `del` on its dicts would also strip the keys from `examples` itself.\n",
    "new_examples = [\n",
    "    {'id': str(i), 'answers': {'text': [eg['answer']], 'answer_start': [0]}}\n",
    "    for i, eg in enumerate(examples)\n",
    "]\n",
    "predictions = [\n",
    "    {'id': str(i), 'prediction_text': p['text']}\n",
    "    for i, p in enumerate(predictions)\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "c38eb3e9",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from evaluate import load\n",
    "squad_metric = load(\"squad\")\n",
    "results = squad_metric.compute(\n",
    "    references=new_examples,\n",
    "    predictions=predictions,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "07d68f85",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'exact_match': 0.0, 'f1': 28.125}"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3b775150",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  },
  "vscode": {
   "interpreter": {
    "hash": "53f3bc57609c7a84333bb558594977aa5b4026b1d6070b93987956689e367341"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}