{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "cell_id": "83a38f3a8a224a7ab3138f15febbc251",
    "deepnote_app_coordinates": {
     "h": 5,
     "w": 12,
     "x": 0,
     "y": 0
    },
    "deepnote_cell_type": "markdown"
   },
   "source": [
    "# How to evaluate a summarization task\n",
    "\n",
    "In this notebook we delve into the evaluation techniques for abstractive summarization tasks using a simple example. We explore traditional evaluation methods like [ROUGE](https://aclanthology.org/W04-1013/) and [BERTScore](https://arxiv.org/abs/1904.09675), in addition to showcasing a more novel approach using LLMs as evaluators.\n",
    "\n",
    "Evaluating the quality of summaries is a time-consuming process, as it involves several quality dimensions such as coherence, conciseness, readability and content. Traditional automatic evaluation metrics such as `ROUGE` and `BERTScore` are concrete and reliable, but they may not correlate well with the actual quality of summaries: they show relatively low correlation with human judgments, especially for open-ended generation tasks ([Liu et al., 2023](https://arxiv.org/pdf/2303.16634.pdf)). There's a growing need to lean on human evaluations, user feedback, or model-based metrics while being vigilant about potential biases. While human judgment provides invaluable insights, it is often not scalable and can be cost-prohibitive.\n",
    "\n",
    "In addition to these traditional metrics, we showcase a method ([G-Eval](https://arxiv.org/pdf/2303.16634.pdf)) that leverages Large Language Models (LLMs) as a novel, reference-free metric for assessing abstractive summaries. In this case, we use `gpt-4` to score candidate outputs. `gpt-4` has effectively learned an internal model of language quality that allows it to differentiate between fluent, coherent text and low-quality text. Harnessing this internal scoring mechanism allows auto-evaluation of new candidate outputs generated by an LLM.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cell_id": "0c1c7a1190a44c4da1c652f12694b8ce",
    "deepnote_app_coordinates": {
     "h": 5,
     "w": 12,
     "x": 0,
     "y": 0
    },
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 22360,
    "execution_start": 1692080227636,
    "source_hash": "a9d11aa3"
   },
   "outputs": [],
   "source": [
    "# Installing necessary packages for the evaluation\n",
    "# rouge: For evaluating with ROUGE metric\n",
    "# bert_score: For evaluating with BERTScore\n",
    "# openai: To interact with OpenAI's API\n",
    "!pip install rouge --quiet\n",
    "!pip install bert_score --quiet\n",
    "!pip install openai --quiet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "cell_id": "b2e0f0ba05a34b6aa371b1b67d25acc8",
    "deepnote_app_coordinates": {
     "h": 5,
     "w": 12,
     "x": 0,
     "y": 0
    },
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 8,
    "execution_start": 1692082891192,
    "source_hash": "cf469010"
   },
   "outputs": [
   ],
   "source": [
    "from openai import OpenAI\n",
    "import os\n",
    "import re\n",
    "import pandas as pd\n",
    "\n",
    "# Python Implementation of the ROUGE Metric\n",
    "from rouge import Rouge\n",
    "\n",
    "# BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity.\n",
    "from bert_score import BERTScorer\n",
    "\n",
    "client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"<your OpenAI API key if not set as env var>\"))\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "cell_id": "7c8bf29b2e6b4c78b5a50a0f42d093d2",
    "deepnote_app_coordinates": {
     "h": 5,
     "w": 12,
     "x": 0,
     "y": 0
    },
    "deepnote_cell_type": "markdown"
   },
   "source": [
    "## Example task\n",
    "\n",
    "For the purposes of this notebook we'll use the example summarization below. Notice that we provide two generated summaries to compare, and a reference human-written summary, which evaluation metrics like `ROUGE` and `BERTScore` require.\n",
    "\n",
    "Excerpt (`excerpt`):\n",
    "\n",
    "> OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. OpenAI will build safe and beneficial AGI directly, but will also consider its mission fulfilled if its work aids others to achieve this outcome. OpenAI follows several key principles for this purpose. First, broadly distributed benefits - any influence over AGI's deployment will be used for the benefit of all, and to avoid harmful uses or undue concentration of power. Second, long-term safety - OpenAI is committed to doing the research to make AGI safe, and to promote the adoption of such research across the AI community. Third, technical leadership - OpenAI aims to be at the forefront of AI capabilities. Fourth, a cooperative orientation - OpenAI actively cooperates with other research and policy institutions, and seeks to create a global community working together to address AGI's global challenges.\n",
    "\n",
    "Summaries:\n",
    "\n",
    "| Reference Summary /`ref_summary` (human generated)                                                                                                                                                                                                                                                                                                                         | Eval Summary 1 / `eval_summary_1` (system generated)                                                                                                                                                                                                                                                                                                                               | Eval Summary 2 / `eval_summary_2` (system generated)                                                                                                                                                                                                                                                   |\n",
    "| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |\n",
    "| OpenAI aims to ensure artificial general intelligence (AGI) is used for everyone's benefit, avoiding harmful uses or undue power concentration. It is committed to researching AGI safety, promoting such studies among the AI community. OpenAI seeks to lead in AI capabilities and cooperates with global research and policy institutions to address AGI's challenges. | OpenAI aims to AGI benefits all humanity, avoiding harmful uses and power concentration. It pioneers research into safe and beneficial AGI and promotes adoption globally. OpenAI maintains technical leadership in AI while cooperating with global institutions to address AGI challenges. It seeks to lead a collaborative worldwide effort developing AGI for collective good. | OpenAI aims to ensure AGI is for everyone's use, totally avoiding harmful stuff or big power concentration. Committed to researching AGI's safe side, promoting these studies in AI folks. OpenAI wants to be top in AI things and works with worldwide research, policy groups to figure AGI's stuff. |\n",
    "\n",
    "Take a moment to decide which summary you'd personally prefer and which one best captures OpenAI's mission.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "cell_id": "cc5d9f65e8924200bb5134c176c4fd05",
    "deepnote_app_coordinates": {
     "h": 5,
     "w": 12,
     "x": 0,
     "y": 0
    },
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 16,
    "execution_start": 1692083015932,
    "source_hash": "9aa26bd6"
   },
   "outputs": [
   ],
   "source": [
    "excerpt = \"OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. OpenAI will build safe and beneficial AGI directly, but will also consider its mission fulfilled if its work aids others to achieve this outcome. OpenAI follows several key principles for this purpose. First, broadly distributed benefits - any influence over AGI's deployment will be used for the benefit of all, and to avoid harmful uses or undue concentration of power. Second, long-term safety - OpenAI is committed to doing the research to make AGI safe, and to promote the adoption of such research across the AI community. Third, technical leadership - OpenAI aims to be at the forefront of AI capabilities. Fourth, a cooperative orientation - OpenAI actively cooperates with other research and policy institutions, and seeks to create a global community working together to address AGI's global challenges.\"\n",
    "ref_summary = \"OpenAI aims to ensure artificial general intelligence (AGI) is used for everyone's benefit, avoiding harmful uses or undue power concentration. It is committed to researching AGI safety, promoting such studies among the AI community. OpenAI seeks to lead in AI capabilities and cooperates with global research and policy institutions to address AGI's challenges.\"\n",
    "eval_summary_1 = \"OpenAI aims to AGI benefits all humanity, avoiding harmful uses and power concentration. It pioneers research into safe and beneficial AGI and promotes adoption globally. OpenAI maintains technical leadership in AI while cooperating with global institutions to address AGI challenges. It seeks to lead a collaborative worldwide effort developing AGI for collective good.\"\n",
    "eval_summary_2 = \"OpenAI aims to ensure AGI is for everyone's use, totally avoiding harmful stuff or big power concentration. Committed to researching AGI's safe side, promoting these studies in AI folks. OpenAI wants to be top in AI things and works with worldwide research, policy groups to figure AGI's stuff.\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "cell_id": "f3ae350a2e4b47d985843c5b0808e5b6",
    "deepnote_app_coordinates": {
     "h": 5,
     "w": 12,
     "x": 0,
     "y": 0
    },
    "deepnote_cell_type": "markdown"
   },
   "source": [
    "## Evaluating using ROUGE\n",
    "\n",
    "[ROUGE](https://aclanthology.org/W04-1013/), which stands for Recall-Oriented Understudy for Gisting Evaluation, primarily gauges the overlap of words between a generated output and a reference text. It's a prevalent metric for evaluating automatic summarization tasks. Among its variants, `ROUGE-L` is based on the longest common subsequence (LCS) between the system-generated and reference summaries: it rewards words that appear in the same order in both texts, without requiring them to be contiguous. It thus gauges how well the system retains the reference summary's content and word order.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "cell_id": "dbd380ae5135456bb79ee3192128e489",
    "deepnote_app_coordinates": {
     "h": 5,
     "w": 12,
     "x": 0,
     "y": 0
    },
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 86,
    "execution_start": 1692083097056,
    "source_hash": "c50fbd38"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style type=\"text/css\">\n",
       "#T_7e6ac_row0_col0, #T_7e6ac_row1_col1, #T_7e6ac_row2_col0 {\n",
       "  background-color: white;\n",
       "}\n",
       "#T_7e6ac_row0_col1, #T_7e6ac_row1_col0, #T_7e6ac_row2_col1 {\n",
       "  background-color: lightgreen;\n",
       "}\n",
       "</style>\n",
       "<table id=\"T_7e6ac\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th class=\"blank level0\" >&nbsp;</th>\n",
       "      <th id=\"T_7e6ac_level0_col0\" class=\"col_heading level0 col0\" >Summary 1</th>\n",
       "      <th id=\"T_7e6ac_level0_col1\" class=\"col_heading level0 col1\" >Summary 2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th class=\"index_name level0\" >Metric</th>\n",
       "      <th class=\"blank col0\" >&nbsp;</th>\n",
       "      <th class=\"blank col1\" >&nbsp;</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th id=\"T_7e6ac_level0_row0\" class=\"row_heading level0 row0\" >rouge-1 (F-Score)</th>\n",
       "      <td id=\"T_7e6ac_row0_col0\" class=\"data row0 col0\" >0.488889</td>\n",
       "      <td id=\"T_7e6ac_row0_col1\" class=\"data row0 col1\" >0.511628</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_7e6ac_level0_row1\" class=\"row_heading level0 row1\" >rouge-2 (F-Score)</th>\n",
       "      <td id=\"T_7e6ac_row1_col0\" class=\"data row1 col0\" >0.230769</td>\n",
       "      <td id=\"T_7e6ac_row1_col1\" class=\"data row1 col1\" >0.163265</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_7e6ac_level0_row2\" class=\"row_heading level0 row2\" >rouge-l (F-Score)</th>\n",
       "      <td id=\"T_7e6ac_row2_col0\" class=\"data row2 col0\" >0.488889</td>\n",
       "      <td id=\"T_7e6ac_row2_col1\" class=\"data row2 col1\" >0.511628</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n"
      ],
      "text/plain": [
       "<pandas.io.formats.style.Styler at 0x140b7dee0>"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
     }
   ],
   "source": [
    "# Compute ROUGE scores for a candidate summary against a reference summary\n",
    "def get_rouge_scores(hypothesis, reference):\n",
    "    rouge = Rouge()\n",
    "    return rouge.get_scores(hypothesis, reference)\n",
    "\n",
    "\n",
    "rouge_scores_out = []\n",
    "\n",
    "# Calculate the ROUGE scores for both summaries using reference\n",
    "eval_1_rouge = get_rouge_scores(eval_summary_1, ref_summary)\n",
    "eval_2_rouge = get_rouge_scores(eval_summary_2, ref_summary)\n",
    "\n",
    "for metric in [\"rouge-1\", \"rouge-2\", \"rouge-l\"]:\n",
    "    for label in [\"F-Score\"]:\n",
    "        eval_1_score = eval_1_rouge[0][metric][label[0].lower()]\n",
    "        eval_2_score = eval_2_rouge[0][metric][label[0].lower()]\n",
    "\n",
    "        row = {\n",
    "            \"Metric\": f\"{metric} ({label})\",\n",
    "            \"Summary 1\": eval_1_score,\n",
    "            \"Summary 2\": eval_2_score,\n",
    "        }\n",
    "        rouge_scores_out.append(row)\n",
    "\n",
    "\n",
    "def highlight_max(s):\n",
    "    is_max = s == s.max()\n",
    "    return [\n",
    "        \"background-color: lightgreen\" if v else \"background-color: white\"\n",
    "        for v in is_max\n",
    "    ]\n",
    "\n",
    "\n",
    "rouge_scores_out = (\n",
    "    pd.DataFrame(rouge_scores_out)\n",
    "    .set_index(\"Metric\")\n",
    "    .style.apply(highlight_max, axis=1)\n",
    ")\n",
    "\n",
    "rouge_scores_out"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "cell_id": "a0857e829dc64f64a183212bb5aab122",
    "deepnote_app_coordinates": {
     "h": 5,
     "w": 12,
     "x": 0,
     "y": 0
    },
    "deepnote_cell_type": "markdown"
   },
   "source": [
    "The table shows the `ROUGE` F-scores for the two summaries, each evaluated against the reference summary. For `rouge-1`, Summary 2 outperforms Summary 1, indicating a greater overlap of individual words. For `rouge-l`, Summary 2 also scores higher, implying a closer match in the longest common subsequence and thus potentially a better job of capturing the main content and order of the reference. For `rouge-2`, however, Summary 1 scores higher, indicating more overlapping bigrams. Since Summary 2 has many words and short phrases directly lifted from the excerpt, its overlap with the reference summary would likely be higher, leading to higher `ROUGE` scores.\n",
    "\n",
    "While `ROUGE` and similar metrics, such as [BLEU](https://aclanthology.org/P02-1040.pdf) and [METEOR](https://www.cs.cmu.edu/~alavie/METEOR/), offer quantitative measures, they often fail to capture the true essence of a well-generated summary. They also correlate relatively poorly with human judgments. Given the advancements in LLMs, which are adept at producing fluent and coherent summaries, traditional metrics like `ROUGE` may inadvertently penalize these models. This is especially true if the summaries are articulated differently but still encapsulate the core information accurately.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "cell_id": "609cfe2cf2f14cd09e184168b83de274",
    "deepnote_app_coordinates": {
     "h": 5,
     "w": 12,
     "x": 0,
     "y": 0
    },
    "deepnote_cell_type": "markdown"
   },
   "source": [
    "## Evaluating using BERTScore\n",
    "\n",
    "`ROUGE` relies on the exact presence of words in both the predicted and reference texts, and so fails to capture the underlying semantics. This is where [BERTScore](https://arxiv.org/abs/1904.09675) comes in: it leverages contextual embeddings from a BERT-family model to evaluate the similarity between a predicted and a reference sentence. By comparing embeddings from both sentences, `BERTScore` captures semantic similarities that might be missed by traditional n-gram based metrics.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "cell_id": "b966c86ab65744f5a4a6d2e4d534c86e",
    "deepnote_app_coordinates": {
     "h": 5,
     "w": 12,
     "x": 0,
     "y": 0
    },
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 17954,
    "execution_start": 1692083196232,
    "source_hash": "a90f7d76"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.bias']\n",
      "- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
      "- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Summary 1 F1 Score: 0.9227314591407776\n",
      "Summary 2 F1 Score: 0.9189572930335999\n"
     ]
     }
   ],
   "source": [
    "# Instantiate the BERTScorer object for English language\n",
    "scorer = BERTScorer(lang=\"en\")\n",
    "\n",
    "# Calculate BERTScore for summary 1 against the reference summary\n",
    "# P1, R1, F1_1 represent Precision, Recall, and F1 Score respectively\n",
    "P1, R1, F1_1 = scorer.score([eval_summary_1], [ref_summary])\n",
    "\n",
    "# Calculate BERTScore for summary 2 against the reference summary\n",
    "# P2, R2, F2_2 represent Precision, Recall, and F1 Score respectively\n",
    "P2, R2, F2_2 = scorer.score([eval_summary_2], [ref_summary])\n",
    "\n",
    "print(\"Summary 1 F1 Score:\", F1_1.tolist()[0])\n",
    "print(\"Summary 2 F1 Score:\", F2_2.tolist()[0])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "cell_id": "79d07eaa9a344985838133ffc9e9e02b",
    "deepnote_app_coordinates": {
     "h": 5,
     "w": 12,
     "x": 0,
     "y": 0
    },
    "deepnote_cell_type": "markdown"
   },
   "source": [
    "The close F1 scores suggest that the two summaries capture the key information about equally well. However, the small difference between them should be interpreted with caution. Because `BERTScore` may not capture subtleties and high-level concepts that a human evaluator would notice, relying on this metric alone could misrepresent the actual quality and nuances of a summary. Combining `BERTScore` with human judgment and other metrics offers a more reliable evaluation.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "cell_id": "f0d66b7a59334ed3ba51d9bbbcb85890",
    "deepnote_app_coordinates": {
     "h": 5,
     "w": 12,
     "x": 0,
     "y": 0
    },
    "deepnote_cell_type": "markdown"
   },
   "source": [
    "## Evaluating using GPT-4\n",
    "\n",
    "Here we implement an example **reference-free** text evaluator using `gpt-4`, inspired by the [G-Eval](https://arxiv.org/pdf/2303.16634.pdf) framework, which evaluates the quality of generated text using large language models. Unlike metrics such as `ROUGE` or `BERTScore` that rely on comparison to reference summaries, the `gpt-4` based evaluator assesses the quality of generated content based solely on the input prompt and text, without any ground truth references. This makes it applicable to new datasets and tasks where human references are sparse or unavailable. \n",
    "\n",
    "Here's an overview of this method:\n",
    "\n",
    "1. We define four distinct criteria:\n",
    "    1. **Relevance**: Evaluates if the summary includes only important information and excludes redundancies.\n",
    "    2. **Coherence**: Assesses the logical flow and organization of the summary.\n",
    "    3. **Consistency**: Checks if the summary aligns with the facts in the source document.\n",
    "    4. **Fluency**: Rates the grammar and readability of the summary.\n",
    "2. We craft a prompt for each criterion that takes the original document and the summary as inputs, leverages chain-of-thought generation, and guides the model to output a numeric score (1-5 for relevance, coherence, and consistency; 1-3 for fluency). \n",
    "3. We generate scores from `gpt-4` with the defined prompts, comparing them across summaries.\n",
    "\n",
    "In this demonstration, we're using a direct scoring function where `gpt-4` generates a discrete score for each metric (1-5, or 1-3 for fluency). Normalizing the scores and taking a weighted sum could result in more robust, continuous scores that better reflect the quality and diversity of the summaries."
   ]
  },
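  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The normalization and weighting mentioned above can be sketched as follows. This is a minimal illustration rather than part of this notebook's pipeline: `expected_score` and `normalize_score` are hypothetical helpers, and the probabilities in the usage lines are made-up values standing in for whatever distribution over scores you obtain (for example from token log-probabilities or from repeated sampling).\n",
    "\n",
    "```python\n",
    "def expected_score(score_probs: dict[int, float]) -> float:\n",
    "    # Probability-weighted average of the discrete scores\n",
    "    total = sum(score_probs.values())\n",
    "    if total <= 0:\n",
    "        raise ValueError('probabilities must sum to a positive value')\n",
    "    # Divide by the total in case the probabilities do not sum exactly to 1\n",
    "    return sum(score * p for score, p in score_probs.items()) / total\n",
    "\n",
    "\n",
    "def normalize_score(score: float, low: int, high: int) -> float:\n",
    "    # Map a score from the range [low, high] onto [0, 1]\n",
    "    return (score - low) / (high - low)\n",
    "\n",
    "\n",
    "# Hypothetical distribution over the 1-5 scale for one metric\n",
    "coherence_probs = {3: 0.2, 4: 0.5, 5: 0.3}\n",
    "coherence = expected_score(coherence_probs)\n",
    "print(normalize_score(coherence, 1, 5))\n",
    "```\n",
    "\n",
    "Mapping every metric onto the same 0-1 range makes the 1-3 fluency scale comparable with the 1-5 scales before any weighted sum is taken."
   ]
  },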
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "cell_id": "b029621eb5874de78b349d3cf8dd45b4",
    "deepnote_app_coordinates": {
     "h": 5,
     "w": 12,
     "x": 0,
     "y": 0
    },
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 7700,
    "execution_start": 1692083249280,
    "source_hash": "ab0afee3"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style type=\"text/css\">\n",
       "#T_94fab_row0_col0, #T_94fab_row1_col0, #T_94fab_row1_col1, #T_94fab_row2_col0, #T_94fab_row3_col0 {\n",
       "  background-color: lightgreen;\n",
       "}\n",
       "#T_94fab_row0_col1, #T_94fab_row2_col1, #T_94fab_row3_col1 {\n",
       "  background-color: white;\n",
       "}\n",
       "</style>\n",
       "<table id=\"T_94fab\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th class=\"index_name level0\" >Summary Type</th>\n",
       "      <th id=\"T_94fab_level0_col0\" class=\"col_heading level0 col0\" >Summary 1</th>\n",
       "      <th id=\"T_94fab_level0_col1\" class=\"col_heading level0 col1\" >Summary 2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th class=\"index_name level0\" >Evaluation Type</th>\n",
       "      <th class=\"blank col0\" >&nbsp;</th>\n",
       "      <th class=\"blank col1\" >&nbsp;</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th id=\"T_94fab_level0_row0\" class=\"row_heading level0 row0\" >Coherence</th>\n",
       "      <td id=\"T_94fab_row0_col0\" class=\"data row0 col0\" >5</td>\n",
       "      <td id=\"T_94fab_row0_col1\" class=\"data row0 col1\" >3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_94fab_level0_row1\" class=\"row_heading level0 row1\" >Consistency</th>\n",
       "      <td id=\"T_94fab_row1_col0\" class=\"data row1 col0\" >5</td>\n",
       "      <td id=\"T_94fab_row1_col1\" class=\"data row1 col1\" >5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_94fab_level0_row2\" class=\"row_heading level0 row2\" >Fluency</th>\n",
       "      <td id=\"T_94fab_row2_col0\" class=\"data row2 col0\" >3</td>\n",
       "      <td id=\"T_94fab_row2_col1\" class=\"data row2 col1\" >2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_94fab_level0_row3\" class=\"row_heading level0 row3\" >Relevance</th>\n",
       "      <td id=\"T_94fab_row3_col0\" class=\"data row3 col0\" >5</td>\n",
       "      <td id=\"T_94fab_row3_col1\" class=\"data row3 col1\" >4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n"
      ],
      "text/plain": [
       "<pandas.io.formats.style.Styler at 0x143907b50>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Evaluation prompt template based on G-Eval\n",
    "EVALUATION_PROMPT_TEMPLATE = \"\"\"\n",
    "You will be given one summary written for an article. Your task is to rate the summary on one metric.\n",
    "Please make sure you read and understand these instructions very carefully. \n",
    "Please keep this document open while reviewing, and refer to it as needed.\n",
    "\n",
    "Evaluation Criteria:\n",
    "\n",
    "{criteria}\n",
    "\n",
    "Evaluation Steps:\n",
    "\n",
    "{steps}\n",
    "\n",
    "Example:\n",
    "\n",
    "Source Text:\n",
    "\n",
    "{document}\n",
    "\n",
    "Summary:\n",
    "\n",
    "{summary}\n",
    "\n",
    "Evaluation Form (scores ONLY):\n",
    "\n",
    "- {metric_name}\n",
    "\"\"\"\n",
    "\n",
    "# Metric 1: Relevance\n",
    "\n",
    "RELEVANCY_SCORE_CRITERIA = \"\"\"\n",
    "Relevance(1-5) - selection of important content from the source. \\\n",
    "The summary should include only important information from the source document. \\\n",
    "Annotators were instructed to penalize summaries which contained redundancies and excess information.\n",
    "\"\"\"\n",
    "\n",
    "RELEVANCY_SCORE_STEPS = \"\"\"\n",
    "1. Read the summary and the source document carefully.\n",
    "2. Compare the summary to the source document and identify the main points of the article.\n",
    "3. Assess how well the summary covers the main points of the article, and how much irrelevant or redundant information it contains.\n",
    "4. Assign a relevance score from 1 to 5.\n",
    "\"\"\"\n",
    "\n",
    "# Metric 2: Coherence\n",
    "\n",
    "COHERENCE_SCORE_CRITERIA = \"\"\"\n",
    "Coherence(1-5) - the collective quality of all sentences. \\\n",
    "We align this dimension with the DUC quality question of structure and coherence \\\n",
    "whereby \"the summary should be well-structured and well-organized. \\\n",
    "The summary should not just be a heap of related information, but should build from sentence to sentence to a \\\n",
    "coherent body of information about a topic.\"\n",
    "\"\"\"\n",
    "\n",
    "COHERENCE_SCORE_STEPS = \"\"\"\n",
    "1. Read the article carefully and identify the main topic and key points.\n",
    "2. Read the summary and compare it to the article. Check if the summary covers the main topic and key points of the article,\n",
    "and if it presents them in a clear and logical order.\n",
    "3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.\n",
    "\"\"\"\n",
    "\n",
    "# Metric 3: Consistency\n",
    "\n",
    "CONSISTENCY_SCORE_CRITERIA = \"\"\"\n",
    "Consistency(1-5) - the factual alignment between the summary and the summarized source. \\\n",
    "A factually consistent summary contains only statements that are entailed by the source document. \\\n",
    "Annotators were also asked to penalize summaries that contained hallucinated facts.\n",
    "\"\"\"\n",
    "\n",
    "CONSISTENCY_SCORE_STEPS = \"\"\"\n",
    "1. Read the article carefully and identify the main facts and details it presents.\n",
    "2. Read the summary and compare it to the article. Check if the summary contains any factual errors that are not supported by the article.\n",
    "3. Assign a score for consistency based on the Evaluation Criteria.\n",
    "\"\"\"\n",
    "\n",
    "# Metric 4: Fluency\n",
    "\n",
    "FLUENCY_SCORE_CRITERIA = \"\"\"\n",
    "Fluency(1-3): the quality of the summary in terms of grammar, spelling, punctuation, word choice, and sentence structure.\n",
    "1: Poor. The summary has many errors that make it hard to understand or sound unnatural.\n",
    "2: Fair. The summary has some errors that affect the clarity or smoothness of the text, but the main points are still comprehensible.\n",
    "3: Good. The summary has few or no errors and is easy to read and follow.\n",
    "\"\"\"\n",
    "\n",
    "FLUENCY_SCORE_STEPS = \"\"\"\n",
    "Read the summary and evaluate its fluency based on the given criteria. Assign a fluency score from 1 to 3.\n",
    "\"\"\"\n",
    "\n",
    "\n",
    "def get_geval_score(\n",
    "    criteria: str, steps: str, document: str, summary: str, metric_name: str\n",
    "):\n",
    "    prompt = EVALUATION_PROMPT_TEMPLATE.format(\n",
    "        criteria=criteria,\n",
    "        steps=steps,\n",
    "        metric_name=metric_name,\n",
    "        document=document,\n",
    "        summary=summary,\n",
    "    )\n",
    "    response = client.chat.completions.create(\n",
    "        model=\"gpt-4\",\n",
    "        messages=[{\"role\": \"user\", \"content\": prompt}],\n",
    "        temperature=0,\n",
    "        max_tokens=5,\n",
    "        top_p=1,\n",
    "        frequency_penalty=0,\n",
    "        presence_penalty=0,\n",
    "    )\n",
    "    return response.choices[0].message.content\n",
    "\n",
    "\n",
    "evaluation_metrics = {\n",
    "    \"Relevance\": (RELEVANCY_SCORE_CRITERIA, RELEVANCY_SCORE_STEPS),\n",
    "    \"Coherence\": (COHERENCE_SCORE_CRITERIA, COHERENCE_SCORE_STEPS),\n",
    "    \"Consistency\": (CONSISTENCY_SCORE_CRITERIA, CONSISTENCY_SCORE_STEPS),\n",
    "    \"Fluency\": (FLUENCY_SCORE_CRITERIA, FLUENCY_SCORE_STEPS),\n",
    "}\n",
    "\n",
    "summaries = {\"Summary 1\": eval_summary_1, \"Summary 2\": eval_summary_2}\n",
    "\n",
    "data = {\"Evaluation Type\": [], \"Summary Type\": [], \"Score\": []}\n",
    "\n",
    "for eval_type, (criteria, steps) in evaluation_metrics.items():\n",
    "    for summ_type, summary in summaries.items():\n",
    "        data[\"Evaluation Type\"].append(eval_type)\n",
    "        data[\"Summary Type\"].append(summ_type)\n",
    "        result = get_geval_score(criteria, steps, excerpt, summary, eval_type)\n",
    "        score_num = int(result.strip())\n",
    "        data[\"Score\"].append(score_num)\n",
    "\n",
    "pivot_df = pd.DataFrame(data, index=None).pivot(\n",
    "    index=\"Evaluation Type\", columns=\"Summary Type\", values=\"Score\"\n",
    ")\n",
    "styled_pivot_df = pivot_df.style.apply(highlight_max, axis=1)\n",
    "display(styled_pivot_df)"
   ]
  },
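  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The loop above converts the model's reply with `int(result.strip())`, which raises a `ValueError` if the reply contains anything other than a bare integer. A minimal sketch of more defensive parsing is shown below; `parse_score` is a hypothetical helper, not part of the G-Eval method, and it assumes the first run of digits in the reply is the score.\n",
    "\n",
    "```python\n",
    "import re\n",
    "from typing import Optional\n",
    "\n",
    "\n",
    "def parse_score(reply: str, low: int, high: int) -> Optional[int]:\n",
    "    # Find the first run of digits in the reply, if there is one\n",
    "    match = re.search('[0-9]+', reply)\n",
    "    if match is None:\n",
    "        return None\n",
    "    score = int(match.group())\n",
    "    # Reject values outside the scale the prompt asked for\n",
    "    return score if low <= score <= high else None\n",
    "\n",
    "\n",
    "print(parse_score('4', 1, 5))\n",
    "print(parse_score('- Fluency: 2', 1, 3))\n",
    "print(parse_score('N/A', 1, 5))\n",
    "```\n",
    "\n",
    "Returning `None` instead of raising lets the caller decide whether to retry the request or skip the sample. Because the helper takes the first digits it finds, a reply that repeats the scale (such as `Coherence(1-5): 4`) would be misread, so the prompt should still ask for the score only."
   ]
  },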
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "cell_id": "03a7682297624a71ae448cf67192d8fd",
    "deepnote_app_coordinates": {
     "h": 5,
     "w": 12,
     "x": 0,
     "y": 0
    },
    "deepnote_cell_type": "markdown"
   },
   "source": [
    "\n",
    "Overall, Summary 1 outperforms Summary 2 in three of the four categories (Coherence, Relevance, and Fluency). Both summaries receive the same Consistency score, meaning both are judged factually consistent with the source. These results suggest that Summary 1 is generally preferable under the given evaluation criteria.\n",
    "\n",
    "### Limitations\n",
    "\n",
    "Note that LLM-based metrics could have a bias towards preferring LLM-generated texts over human-written texts. Additionally, LLM-based metrics are sensitive to system messages and prompts. We recommend experimenting with other techniques that can help improve performance and/or produce consistent scores, striking the right balance between expensive, high-quality evaluation and automated evaluation. It is also worth noting that this scoring methodology is currently limited by `gpt-4`'s context window.\n",
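    "\n",
    "One way to check how consistent the scores are is to repeat the same evaluation several times and look at the spread. The sketch below assumes you have already collected repeated integer scores for one summary and one metric; `summarize_runs` and the sample values are hypothetical, not results from this notebook.\n",
    "\n",
    "```python\n",
    "from statistics import mean, pstdev\n",
    "\n",
    "\n",
    "def summarize_runs(scores: list[int]) -> dict[str, float]:\n",
    "    # The mean is the aggregated score; the spread shows how stable the runs were\n",
    "    return {'mean': mean(scores), 'spread': pstdev(scores)}\n",
    "\n",
    "\n",
    "# Hypothetical scores from five repeated evaluations of one summary\n",
    "runs = [4, 5, 4, 4, 5]\n",
    "print(summarize_runs(runs))\n",
    "```\n",
    "\n",
    "A spread close to zero suggests the evaluator is stable for that input, while a large spread is a signal to revisit the prompt or average over more runs.\n",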
    "\n",
    "## Conclusion\n",
    "\n",
    "Evaluating abstractive summarization remains an open area for further improvement. Traditional metrics like `ROUGE`, `BLEU`, and `BERTScore` provide useful automatic evaluation but have limitations in capturing semantic similarity and nuanced aspects of summarization quality. Moreover, they require reference outputs which can be expensive to collect/label. LLM-based metrics offer promise as a reference-free method of evaluating coherence, fluency, and relevance. However, they too have potential biases favoring text generated by LLMs. Ultimately, a combination of automatic metrics and human evaluation is ideal for reliably assessing abstractive summarization systems. While human evaluation is indispensable for gaining a comprehensive understanding of summary quality, it should be complemented with automated evaluation to enable efficient, large-scale testing. The field will continue to evolve more robust evaluation techniques, balancing quality, scalability, and fairness. Advancing evaluation methods is crucial for driving progress in production applications.\n",
    "\n",
    "## References\n",
    "\n",
    "- [G-EVAL: NLG Evaluation Using GPT-4 with Better Human Alignment](https://arxiv.org/pdf/2303.16634.pdf) - Liu Y, Iter D, Xu Y, Wang S, Xu R, Zhu C. Published May, 2023.\n",
    "- [BERTScore: Evaluating Text Generation with BERT](https://arxiv.org/abs/1904.09675) - Zhang T, Kishore V, Wu F, Weinberger KQ, Artzi Y. Published online February, 2020.\n",
    "- [ROUGE: A Package for Automatic Evaluation of Summaries](https://aclanthology.org/W04-1013/) - Lin CY. Published July, 2004.\n",
    "- [SummEval: Re-evaluating Summarization Evaluation](https://aclanthology.org/2021.tacl-1.24) - Fabbri et al. Published April, 2021.\n"
   ]
  }
 ],
 "metadata": {
  "deepnote": {},
  "deepnote_app_layout": "powerful-article",
  "deepnote_execution_queue": [],
  "deepnote_full_width": true,
  "deepnote_notebook_id": "20f885ddefe84c16bd1250151b5a5e1f",
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "venv"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}