"This notebook covers how to evaluate generic question answering problems. This is a situation where you have an example containing a question and its corresponding ground truth answer, and you want to measure how well the language model does at answering those questions."
]
},
{
"cell_type": "markdown",
"id": "78e3023b",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"For demonstration purposes, we will just evaluate a simple question answering system that only evaluates the model's internal knowledge. Please see other notebooks for examples where it evaluates how the model does at question answering over data not present in what the model was trained on."
"For this purpose, we will just use two simple hardcoded examples, but see other notebooks for tips on how to get and/or generate these examples."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "87de1d84",
"metadata": {},
"outputs": [],
"source": [
"examples = [\n",
" {\n",
" \"question\": \"Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?\",\n",
" \"answer\": \"11\"\n",
" },\n",
" {\n",
" \"question\": 'Is the following sentence plausible? \"Joao Moutinho caught the screen pass in the NFC championship.\"',\n",
" \"answer\": \"No\"\n",
" }\n",
"]"
]
},
{
"cell_type": "markdown",
"id": "143b1155",
"metadata": {},
"source": [
"## Predictions\n",
"\n",
"We can now make and inspect the predictions for these questions."
" {'text': ' No, this sentence is not plausible. Joao Moutinho is a professional soccer player, not an American football player, so it is not possible for him to catch a screen pass in the NFC championship.'}]"
"We can see that if we tried to just do exact match on the answer answers (`11` and `No`) they would not match what the lanuage model answered. However, semantically the language model is correct in both cases. In order to account for this, we can use a language model itself to evaluate the answers."
"Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?\n",
"Real Answer: 11\n",
"Predicted Answer: 11 tennis balls\n",
"Predicted Grade: CORRECT\n",
"\n",
"Example 1:\n",
"Question: Is the following sentence plausible? \"Joao Moutinho caught the screen pass in the NFC championship.\"\n",
"Real Answer: No\n",
"Predicted Answer: No, this sentence is not plausible. Joao Moutinho is a professional soccer player, not an American football player, so it is not possible for him to catch a screen pass in the NFC championship.\n",
"We can compare the evaluation results we get to other common evaluation metrics. To do this, let's load some evaluation metrics from HuggingFace's `evaluate` package."