{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d63696a8-d035-4cf7-9605-c3210f0b551d",
   "metadata": {
    "tags": []
   },
   "source": [
    "# QA Correctness\n",
    "\n",
"When thinking about a QA system, one of the most important questions to ask is whether the final generated result is correct. The `\"qa\"` evaluator compares a question-answering model's response to a reference answer to provide this level of information. If you are able to annotate a test dataset, this evaluator will be useful.\n",
|
|
"\n",
|
|
"For more details, check out the reference docs for the [QAEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.qa.eval_chain.QAEvalChain.html#langchain.evaluation.qa.eval_chain.QAEvalChain)'s class definition."
|
|
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "9672fdb9-b53f-41e4-8f72-f21d11edbeac",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.chat_models import ChatOpenAI\n",
    "from langchain.evaluation import load_evaluator\n",
    "\n",
    "llm = ChatOpenAI(model=\"gpt-4\", temperature=0)\n",
    "\n",
    "# Note: the eval_llm is optional. A gpt-4 model will be provided by default if not specified\n",
    "evaluator = load_evaluator(\"qa\", eval_llm=llm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "b4db474a-9c9d-473f-81b1-55070ee584a6",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'reasoning': None, 'value': 'CORRECT', 'score': 1}"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluator.evaluate_strings(\n",
    "    input=\"What's last quarter's sales numbers?\",\n",
    "    prediction=\"Last quarter we sold 600,000 total units of product.\",\n",
    "    reference=\"Last quarter we sold 100,000 units of product A, 210,000 units of product B, and 300,000 units of product C.\",\n",
    ")"
   ]
  },
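  {
   "cell_type": "markdown",
   "id": "f3b1a2c4-0d5e-4a6b-9c7d-8e9f0a1b2c3d",
   "metadata": {},
   "source": [
    "If you have several annotated examples, you can apply the same evaluator to each of them. The cell below is a minimal, unexecuted sketch: the `examples` list and its field names are made up for illustration, and it simply loops over the examples with `evaluate_strings` and averages the returned scores."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Hypothetical hand-annotated examples; replace with your own dataset.\n",
    "examples = [\n",
    "    {\n",
    "        \"input\": \"What's last quarter's sales numbers?\",\n",
    "        \"prediction\": \"Last quarter we sold 600,000 total units of product.\",\n",
    "        \"reference\": \"Last quarter we sold 100,000 units of product A, 210,000 units of product B, and 300,000 units of product C.\",\n",
    "    },\n",
    "    # ... more annotated examples ...\n",
    "]\n",
    "\n",
    "results = [\n",
    "    evaluator.evaluate_strings(\n",
    "        input=ex[\"input\"],\n",
    "        prediction=ex[\"prediction\"],\n",
    "        reference=ex[\"reference\"],\n",
    "    )\n",
    "    for ex in examples\n",
    "]\n",
    "\n",
    "# Each result is a dict with a 'value' label and a numeric 'score';\n",
    "# averaging the scores gives a simple accuracy estimate over the dataset.\n",
    "accuracy = sum(r[\"score\"] for r in results) / len(results)\n",
    "accuracy"
   ]
  },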
  {
   "cell_type": "markdown",
   "id": "a5b345aa-7f45-4eea-bedf-9b0d5e824be3",
   "metadata": {},
   "source": [
    "## SQL Correctness\n",
    "\n",
    "You can use an LLM to check whether a predicted SQL query is equivalent to a reference SQL query by loading the `\"qa\"` evaluator with the SQL-specific prompt (`SQL_PROMPT`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "6c803b8c-fe1f-4fb7-8ea0-d9c67b855eb3",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.evaluation.qa.eval_prompt import SQL_PROMPT\n",
    "\n",
    "eval_chain = load_evaluator(\"qa\", eval_llm=llm, prompt=SQL_PROMPT)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "e28b8d07-248f-405c-bcef-e0ebe3a05c3e",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'reasoning': 'The expert answer and the submission are very similar in their structure and logic. Both queries are trying to calculate the sum of sales amounts for the last quarter. They both use the SUM function to add up the sale_amount from the sales table. They also both use the same WHERE clause to filter the sales data to only include sales from the last quarter. The WHERE clause uses the DATEADD function to subtract 1 quarter from the current date (GETDATE()) and only includes sales where the sale_date is greater than or equal to this date and less than the current date.\\n\\nThe main difference between the two queries is that the expert answer uses a subquery to first select the sale_amount from the sales table with the appropriate date filter, and then sums these amounts in the outer query. The submission, on the other hand, does not use a subquery and instead sums the sale_amount directly in the main query with the same date filter.\\n\\nHowever, this difference does not affect the result of the query. Both queries will return the same result, which is the sum of the sales amounts for the last quarter.\\n\\nCORRECT',\n",
       " 'value': 'CORRECT',\n",
       " 'score': 1}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "eval_chain.evaluate_strings(\n",
    "    input=\"What's last quarter's sales numbers?\",\n",
    "    prediction=\"\"\"SELECT SUM(sale_amount) AS last_quarter_sales\n",
    "FROM sales\n",
    "WHERE sale_date >= DATEADD(quarter, -1, GETDATE()) AND sale_date < GETDATE();\n",
    "\"\"\",\n",
    "    reference=\"\"\"SELECT SUM(sub.sale_amount) AS last_quarter_sales\n",
    "FROM (\n",
    "    SELECT sale_amount\n",
    "    FROM sales\n",
    "    WHERE sale_date >= DATEADD(quarter, -1, GETDATE()) AND sale_date < GETDATE()\n",
    ") AS sub;\n",
    "\"\"\",\n",
    ")"
   ]
  },
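  {
   "cell_type": "markdown",
   "id": "3f9c2d71-5b84-4e0a-9d26-7c1a8b5e4f02",
   "metadata": {},
   "source": [
    "To see the other grading outcome, you can pass a query that is not equivalent to the reference. The following cell is an unexecuted sketch: the prediction deliberately sums sales over the last year instead of the last quarter, so the evaluator would be expected to grade it INCORRECT (score 0), although the exact reasoning text will vary from run to run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7d4e8f21-9a3b-4c5d-8e6f-1a2b3c4d5e6f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# The prediction filters on the wrong date range (last year, not last quarter),\n",
    "# so it is not equivalent to the reference query.\n",
    "eval_chain.evaluate_strings(\n",
    "    input=\"What's last quarter's sales numbers?\",\n",
    "    prediction=\"\"\"SELECT SUM(sale_amount) AS last_quarter_sales\n",
    "FROM sales\n",
    "WHERE sale_date >= DATEADD(year, -1, GETDATE()) AND sale_date < GETDATE();\n",
    "\"\"\",\n",
    "    reference=\"\"\"SELECT SUM(sub.sale_amount) AS last_quarter_sales\n",
    "FROM (\n",
    "    SELECT sale_amount\n",
    "    FROM sales\n",
    "    WHERE sale_date >= DATEADD(quarter, -1, GETDATE()) AND sale_date < GETDATE()\n",
    ") AS sub;\n",
    "\"\"\",\n",
    ")"
   ]
  },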
  {
   "cell_type": "markdown",
   "id": "e0c3dcad-408e-4d26-9e25-848ebacac2c4",
   "metadata": {},
   "source": [
    "## Using Context\n",
    "\n",
    "Sometimes reference labels aren't available, but you do have retrieved context that contains the relevant information. That context may even include information that wasn't available to the model you want to evaluate. For this type of scenario, you can use the [ContextQAEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.qa.eval_chain.ContextQAEvalChain.html#langchain.evaluation.qa.eval_chain.ContextQAEvalChain)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "9f3ae116-3a2f-461d-ba6f-7352b42c1b0c",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'reasoning': None, 'value': 'CORRECT', 'score': 1}"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "eval_chain = load_evaluator(\"context_qa\", eval_llm=llm)\n",
    "\n",
    "eval_chain.evaluate_strings(\n",
    "    input=\"Who won the NFC championship game in 2023?\",\n",
    "    prediction=\"Eagles\",\n",
    "    reference=\"NFC Championship Game 2023: Philadelphia Eagles 31, San Francisco 49ers 7\",\n",
    ")"
   ]
  },
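  {
   "cell_type": "markdown",
   "id": "5e2a7c90-1b3d-4f6e-8a9c-2d4f6e8a0b1c",
   "metadata": {},
   "source": [
    "In practice the context usually comes from your retrieval pipeline rather than being written by hand. The cell below is an unexecuted sketch that assumes a hypothetical `retriever` (any LangChain retriever exposing `get_relevant_documents` would work the same way) and joins the retrieved page contents into the `reference` string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0c9d8e7f-6a5b-4c3d-2e1f-9a8b7c6d5e4f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Hypothetical retriever; substitute the retriever from your own pipeline.\n",
    "question = \"Who won the NFC championship game in 2023?\"\n",
    "docs = retriever.get_relevant_documents(question)\n",
    "\n",
    "# Concatenate the retrieved passages and grade the prediction against them.\n",
    "context = \"\\n\\n\".join(doc.page_content for doc in docs)\n",
    "\n",
    "eval_chain.evaluate_strings(\n",
    "    input=question,\n",
    "    prediction=\"Eagles\",\n",
    "    reference=context,\n",
    ")"
   ]
  },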
  {
   "cell_type": "markdown",
   "id": "ba5eac17-08b6-4e4f-a896-79e7fc637018",
   "metadata": {},
   "source": [
    "## CoT With Context\n",
    "\n",
    "Prompting strategies such as chain of thought can also be used to make the evaluation results more reliable.\n",
    "The [CotQAEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.qa.eval_chain.CotQAEvalChain.html#langchain.evaluation.qa.eval_chain.CotQAEvalChain)'s default prompt instructs the evaluation model to reason step by step before grading."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "26e3b686-98f4-45a5-9854-7071ec2893f1",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'reasoning': 'The student\\'s answer is \"Eagles\". The context states that the Philadelphia Eagles won the NFC championship game in 2023. Therefore, the student\\'s answer matches the information provided in the context.',\n",
       " 'value': 'GRADE: CORRECT',\n",
       " 'score': 1}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "eval_chain = load_evaluator(\"cot_qa\", eval_llm=llm)\n",
    "\n",
    "eval_chain.evaluate_strings(\n",
    "    input=\"Who won the NFC championship game in 2023?\",\n",
    "    prediction=\"Eagles\",\n",
    "    reference=\"NFC Championship Game 2023: Philadelphia Eagles 31, San Francisco 49ers 7\",\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}