diff --git a/examples/evaluation/How_to_eval_abstractive_summarization.ipynb b/examples/evaluation/How_to_eval_abstractive_summarization.ipynb new file mode 100644 index 00000000..bd131c42 --- /dev/null +++ b/examples/evaluation/How_to_eval_abstractive_summarization.ipynb @@ -0,0 +1,820 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "cell_id": "83a38f3a8a224a7ab3138f15febbc251", + "deepnote_app_coordinates": { + "h": 5, + "w": 12, + "x": 0, + "y": 0 + }, + "deepnote_cell_type": "markdown" + }, + "source": [ + "## Evaluating Abstractive Summarization\n", + "\n", + "In this notebook we delve into evaluation techniques for abstractive summarization tasks using a simple example. We explore traditional evaluation methods like [ROUGE](https://aclanthology.org/W04-1013/) and [BERTScore](https://arxiv.org/abs/1904.09675), in addition to showcasing a more novel approach using LLMs as evaluators.\n", + "\n", + "Evaluating the quality of summaries is a time-consuming process, as it involves different quality metrics such as coherence, conciseness, readability and content. Traditional automatic evaluation metrics such as `ROUGE`, `BERTScore`, and others are concrete and reliable, but they may not correlate well with the actual quality of summaries. They show relatively low correlation with human judgments, especially for open-ended generation tasks (Liu et al., 2023). There's a growing need to lean on human evaluations, user feedback, or model-based metrics while being vigilant about potential biases. While human judgment provides invaluable insights, it is often not scalable and can be cost-prohibitive.\n", + "\n", + "In addition to these traditional metrics, we showcase a method ([G-Eval](https://arxiv.org/pdf/2303.16634.pdf)) that leverages Large Language Models (LLMs) as a novel, reference-free metric for assessing abstractive summaries. In this case, we use `gpt-4` to score candidate outputs. 
`gpt-4` has effectively learned an internal model of language quality that allows it to differentiate between fluent, coherent text and low-quality text. Harnessing this internal scoring mechanism allows auto-evaluation of new candidate outputs generated by an LLM.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setup\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cell_id": "0c1c7a1190a44c4da1c652f12694b8ce", + "deepnote_app_coordinates": { + "h": 5, + "w": 12, + "x": 0, + "y": 0 + }, + "deepnote_cell_type": "code", + "deepnote_to_be_reexecuted": false, + "execution_millis": 22360, + "execution_start": 1692080227636, + "source_hash": "a9d11aa3" + }, + "outputs": [], + "source": [ + "# Installing necessary packages for the evaluation\n", + "# rouge: For evaluating with ROUGE metric\n", + "# bert_score: For evaluating with BERTScore\n", + "# openai: To interact with OpenAI's API\n", + "!pip install rouge --quiet\n", + "!pip install bert_score --quiet\n", + "!pip install openai --quiet" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "cell_id": "b2e0f0ba05a34b6aa371b1b67d25acc8", + "deepnote_app_coordinates": { + "h": 5, + "w": 12, + "x": 0, + "y": 0 + }, + "deepnote_cell_type": "code", + "deepnote_to_be_reexecuted": false, + "execution_millis": 8, + "execution_start": 1692082891192, + "source_hash": "cf469010" + }, + "outputs": [ + { + "data": { + "application/javascript": [ + "\n", + " setTimeout(function() {\n", + " var nbb_cell_id = 23;\n", + " var nbb_unformatted_code = \"import openai\\nimport os\\nimport re\\nimport pandas as pd\\n\\n# Python Implementation of the ROUGE Metric\\nfrom rouge import Rouge\\n\\n# BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity.\\nfrom bert_score import BERTScorer\\n\\nopenai.api_key = os.environ.get(\\\"OPENAI_API_KEY\\\")\";\n", + " 
var nbb_formatted_code = \"import openai\\nimport os\\nimport re\\nimport pandas as pd\\n\\n# Python Implementation of the ROUGE Metric\\nfrom rouge import Rouge\\n\\n# BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity.\\nfrom bert_score import BERTScorer\\n\\nopenai.api_key = os.environ.get(\\\"OPENAI_API_KEY\\\")\";\n", + " var nbb_cells = Jupyter.notebook.get_cells();\n", + " for (var i = 0; i < nbb_cells.length; ++i) {\n", + " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", + " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", + " nbb_cells[i].set_text(nbb_formatted_code);\n", + " }\n", + " break;\n", + " }\n", + " }\n", + " }, 500);\n", + " " + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import openai\n", + "import os\n", + "import re\n", + "import pandas as pd\n", + "\n", + "# Python Implementation of the ROUGE Metric\n", + "from rouge import Rouge\n", + "\n", + "# BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity.\n", + "from bert_score import BERTScorer\n", + "\n", + "openai.api_key = os.environ.get(\"OPENAI_API_KEY\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "cell_id": "7c8bf29b2e6b4c78b5a50a0f42d093d2", + "deepnote_app_coordinates": { + "h": 5, + "w": 12, + "x": 0, + "y": 0 + }, + "deepnote_cell_type": "markdown" + }, + "source": [ + "## Evaluating an abstractive summary\n", + "\n", + "Here is a task for evaluating an abstractive summary for the given excerpt. Note that evaluation metrics like `ROUGE` and `BERTScore` require associated reference output.\n", + "\n", + "Excerpt (`excerpt`):\n", + "\n", + "> OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. 
OpenAI will build safe and beneficial AGI directly, but will also consider its mission fulfilled if its work aids others to achieve this outcome. OpenAI follows several key principles for this purpose. First, broadly distributed benefits - any influence over AGI's deployment will be used for the benefit of all, and to avoid harmful uses or undue concentration of power. Second, long-term safety - OpenAI is committed to doing the research to make AGI safe, and to promote the adoption of such research across the AI community. Third, technical leadership - OpenAI aims to be at the forefront of AI capabilities. Fourth, a cooperative orientation - OpenAI actively cooperates with other research and policy institutions, and seeks to create a global community working together to address AGI's global challenges.\n", + "\n", + "Summaries:\n", + "\n", + "| Reference Summary /`ref_summary` (human generated) | Eval Summary 1 / `eval_summary_1` (system generated) | Eval Summary 2 / `eval_summary_2` (system generated) |\n", + "| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |\n", + "| OpenAI aims to ensure artificial general intelligence (AGI) is used for everyone's benefit, avoiding harmful uses or undue power concentration. It is committed to researching AGI safety, promoting such studies among the AI community. OpenAI seeks to lead in AI capabilities and cooperates with global research and policy institutions to address AGI's challenges. | OpenAI aims to AGI benefits all humanity, avoiding harmful uses and power concentration. It pioneers research into safe and beneficial AGI and promotes adoption globally. OpenAI maintains technical leadership in AI while cooperating with global institutions to address AGI challenges. It seeks to lead a collaborative worldwide effort developing AGI for collective good. | OpenAI aims to ensure AGI is for everyone's use, totally avoiding harmful stuff or big power concentration. Committed to researching AGI's safe side, promoting these studies in AI folks. OpenAI wants to be top in AI things and works with worldwide research, policy groups to figure AGI's stuff. 
|\n", + "\n", + "Take a moment to figure out which summary you'd personally prefer and the one that captures OpenAI's mission really well.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "cell_id": "cc5d9f65e8924200bb5134c176c4fd05", + "deepnote_app_coordinates": { + "h": 5, + "w": 12, + "x": 0, + "y": 0 + }, + "deepnote_cell_type": "code", + "deepnote_to_be_reexecuted": false, + "execution_millis": 16, + "execution_start": 1692083015932, + "source_hash": "9aa26bd6" + }, + "outputs": [ + { + "data": { + "application/javascript": [ + "\n", + " setTimeout(function() {\n", + " var nbb_cell_id = 9;\n", + " var nbb_unformatted_code = \"excerpt = \\\"OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. OpenAI will build safe and beneficial AGI directly, but will also consider its mission fulfilled if its work aids others to achieve this outcome. OpenAI follows several key principles for this purpose. First, broadly distributed benefits - any influence over AGI's deployment will be used for the benefit of all, and to avoid harmful uses or undue concentration of power. Second, long-term safety - OpenAI is committed to doing the research to make AGI safe, and to promote the adoption of such research across the AI community. Third, technical leadership - OpenAI aims to be at the forefront of AI capabilities. Fourth, a cooperative orientation - OpenAI actively cooperates with other research and policy institutions, and seeks to create a global community working together to address AGI's global challenges.\\\"\\nref_summary = \\\"OpenAI aims to ensure artificial general intelligence (AGI) is used for everyone's benefit, avoiding harmful uses or undue power concentration. It is committed to researching AGI safety, promoting such studies among the AI community. 
OpenAI seeks to lead in AI capabilities and cooperates with global research and policy institutions to address AGI's challenges.\\\"\\neval_summary_1 = \\\"OpenAI aims to AGI benefits all humanity, avoiding harmful uses and power concentration. It pioneers research into safe and beneficial AGI and promotes adoption globally. OpenAI maintains technical leadership in AI while cooperating with global institutions to address AGI challenges. It seeks to lead a collaborative worldwide effort developing AGI for collective good.\\\"\\neval_summary_2 = \\\"OpenAI aims to ensure AGI is for everyone's use, totally avoiding harmful stuff or big power concentration. Committed to researching AGI's safe side, promoting these studies in AI folks. OpenAI wants to be top in AI things and works with worldwide research, policy groups to figure AGI's stuff.\\\"\";\n", + " var nbb_formatted_code = \"excerpt = \\\"OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. OpenAI will build safe and beneficial AGI directly, but will also consider its mission fulfilled if its work aids others to achieve this outcome. OpenAI follows several key principles for this purpose. First, broadly distributed benefits - any influence over AGI's deployment will be used for the benefit of all, and to avoid harmful uses or undue concentration of power. Second, long-term safety - OpenAI is committed to doing the research to make AGI safe, and to promote the adoption of such research across the AI community. Third, technical leadership - OpenAI aims to be at the forefront of AI capabilities. Fourth, a cooperative orientation - OpenAI actively cooperates with other research and policy institutions, and seeks to create a global community working together to address AGI's global challenges.\\\"\\nref_summary = \\\"OpenAI aims to ensure artificial general intelligence (AGI) is used for everyone's benefit, avoiding harmful uses or undue power concentration. 
It is committed to researching AGI safety, promoting such studies among the AI community. OpenAI seeks to lead in AI capabilities and cooperates with global research and policy institutions to address AGI's challenges.\\\"\\neval_summary_1 = \\\"OpenAI aims to AGI benefits all humanity, avoiding harmful uses and power concentration. It pioneers research into safe and beneficial AGI and promotes adoption globally. OpenAI maintains technical leadership in AI while cooperating with global institutions to address AGI challenges. It seeks to lead a collaborative worldwide effort developing AGI for collective good.\\\"\\neval_summary_2 = \\\"OpenAI aims to ensure AGI is for everyone's use, totally avoiding harmful stuff or big power concentration. Committed to researching AGI's safe side, promoting these studies in AI folks. OpenAI wants to be top in AI things and works with worldwide research, policy groups to figure AGI's stuff.\\\"\";\n", + " var nbb_cells = Jupyter.notebook.get_cells();\n", + " for (var i = 0; i < nbb_cells.length; ++i) {\n", + " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", + " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", + " nbb_cells[i].set_text(nbb_formatted_code);\n", + " }\n", + " break;\n", + " }\n", + " }\n", + " }, 500);\n", + " " + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "excerpt = \"OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. OpenAI will build safe and beneficial AGI directly, but will also consider its mission fulfilled if its work aids others to achieve this outcome. OpenAI follows several key principles for this purpose. First, broadly distributed benefits - any influence over AGI's deployment will be used for the benefit of all, and to avoid harmful uses or undue concentration of power. 
Second, long-term safety - OpenAI is committed to doing the research to make AGI safe, and to promote the adoption of such research across the AI community. Third, technical leadership - OpenAI aims to be at the forefront of AI capabilities. Fourth, a cooperative orientation - OpenAI actively cooperates with other research and policy institutions, and seeks to create a global community working together to address AGI's global challenges.\"\n", + "ref_summary = \"OpenAI aims to ensure artificial general intelligence (AGI) is used for everyone's benefit, avoiding harmful uses or undue power concentration. It is committed to researching AGI safety, promoting such studies among the AI community. OpenAI seeks to lead in AI capabilities and cooperates with global research and policy institutions to address AGI's challenges.\"\n", + "eval_summary_1 = \"OpenAI aims to AGI benefits all humanity, avoiding harmful uses and power concentration. It pioneers research into safe and beneficial AGI and promotes adoption globally. OpenAI maintains technical leadership in AI while cooperating with global institutions to address AGI challenges. It seeks to lead a collaborative worldwide effort developing AGI for collective good.\"\n", + "eval_summary_2 = \"OpenAI aims to ensure AGI is for everyone's use, totally avoiding harmful stuff or big power concentration. Committed to researching AGI's safe side, promoting these studies in AI folks. 
OpenAI wants to be top in AI things and works with worldwide research, policy groups to figure AGI's stuff.\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "cell_id": "f3ae350a2e4b47d985843c5b0808e5b6", + "deepnote_app_coordinates": { + "h": 5, + "w": 12, + "x": 0, + "y": 0 + }, + "deepnote_cell_type": "markdown" + }, + "source": [ + "### Evaluating using ROUGE\n", + "\n", + "[ROUGE](https://aclanthology.org/W04-1013/), which stands for Recall-Oriented Understudy for Gisting Evaluation, primarily gauges the overlap of words between a generated output and a reference text. It's a prevalent metric for evaluating automatic summarization tasks. Among its variants, `ROUGE-L` is based on the longest common subsequence shared by the system-generated and reference summaries (the matched words must appear in the same order but need not be contiguous), gauging how well the system retains the original summary's essence.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "cell_id": "dbd380ae5135456bb79ee3192128e489", + "deepnote_app_coordinates": { + "h": 5, + "w": 12, + "x": 0, + "y": 0 + }, + "deepnote_cell_type": "code", + "deepnote_to_be_reexecuted": false, + "execution_millis": 86, + "execution_start": 1692083097056, + "source_hash": "c50fbd38" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| Metric | Summary 1 | Summary 2 |
| --- | --- | --- |
| rouge-1 (F-Score) | 0.488889 | 0.511628 |
| rouge-2 (F-Score) | 0.230769 | 0.163265 |
| rouge-l (F-Score) | 0.488889 | 0.511628 |
\n" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "application/javascript": [ + "\n", + " setTimeout(function() {\n", + " var nbb_cell_id = 10;\n", + " var nbb_unformatted_code = \"# function to calculate the Rouge score\\ndef get_rouge_scores(text1, text2):\\n rouge = Rouge()\\n return rouge.get_scores(text1, text2)\\n\\n\\nrouge_scores_out = []\\n\\n# Calculate the ROUGE scores for both summaries using reference\\neval_1_rouge = get_rouge_scores(eval_summary_1, ref_summary)\\neval_2_rouge = get_rouge_scores(eval_summary_2, ref_summary)\\n\\nfor metric in [\\\"rouge-1\\\", \\\"rouge-2\\\", \\\"rouge-l\\\"]:\\n for label in [\\\"F-Score\\\"]:\\n eval_1_score = eval_1_rouge[0][metric][label[0].lower()]\\n eval_2_score = eval_2_rouge[0][metric][label[0].lower()]\\n\\n row = {\\n \\\"Metric\\\": f\\\"{metric} ({label})\\\",\\n \\\"Summary 1\\\": eval_1_score,\\n \\\"Summary 2\\\": eval_2_score,\\n }\\n rouge_scores_out.append(row)\\n\\n\\ndef highlight_max(s):\\n is_max = s == s.max()\\n return [\\n \\\"background-color: lightgreen\\\" if v else \\\"background-color: white\\\"\\n for v in is_max\\n ]\\n\\n\\nrouge_scores_out = (\\n pd.DataFrame(rouge_scores_out)\\n .set_index(\\\"Metric\\\")\\n .style.apply(highlight_max, axis=1)\\n)\\n\\nrouge_scores_out\";\n", + " var nbb_formatted_code = \"# function to calculate the Rouge score\\ndef get_rouge_scores(text1, text2):\\n rouge = Rouge()\\n return rouge.get_scores(text1, text2)\\n\\n\\nrouge_scores_out = []\\n\\n# Calculate the ROUGE scores for both summaries using reference\\neval_1_rouge = get_rouge_scores(eval_summary_1, ref_summary)\\neval_2_rouge = get_rouge_scores(eval_summary_2, ref_summary)\\n\\nfor metric in [\\\"rouge-1\\\", \\\"rouge-2\\\", \\\"rouge-l\\\"]:\\n for label in [\\\"F-Score\\\"]:\\n eval_1_score = eval_1_rouge[0][metric][label[0].lower()]\\n eval_2_score = eval_2_rouge[0][metric][label[0].lower()]\\n\\n 
row = {\\n \\\"Metric\\\": f\\\"{metric} ({label})\\\",\\n \\\"Summary 1\\\": eval_1_score,\\n \\\"Summary 2\\\": eval_2_score,\\n }\\n rouge_scores_out.append(row)\\n\\n\\ndef highlight_max(s):\\n is_max = s == s.max()\\n return [\\n \\\"background-color: lightgreen\\\" if v else \\\"background-color: white\\\"\\n for v in is_max\\n ]\\n\\n\\nrouge_scores_out = (\\n pd.DataFrame(rouge_scores_out)\\n .set_index(\\\"Metric\\\")\\n .style.apply(highlight_max, axis=1)\\n)\\n\\nrouge_scores_out\";\n", + " var nbb_cells = Jupyter.notebook.get_cells();\n", + " for (var i = 0; i < nbb_cells.length; ++i) {\n", + " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", + " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", + " nbb_cells[i].set_text(nbb_formatted_code);\n", + " }\n", + " break;\n", + " }\n", + " }\n", + " }, 500);\n", + " " + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# function to calculate the Rouge score\n", + "def get_rouge_scores(text1, text2):\n", + " rouge = Rouge()\n", + " return rouge.get_scores(text1, text2)\n", + "\n", + "\n", + "rouge_scores_out = []\n", + "\n", + "# Calculate the ROUGE scores for both summaries using reference\n", + "eval_1_rouge = get_rouge_scores(eval_summary_1, ref_summary)\n", + "eval_2_rouge = get_rouge_scores(eval_summary_2, ref_summary)\n", + "\n", + "for metric in [\"rouge-1\", \"rouge-2\", \"rouge-l\"]:\n", + " for label in [\"F-Score\"]:\n", + " eval_1_score = eval_1_rouge[0][metric][label[0].lower()]\n", + " eval_2_score = eval_2_rouge[0][metric][label[0].lower()]\n", + "\n", + " row = {\n", + " \"Metric\": f\"{metric} ({label})\",\n", + " \"Summary 1\": eval_1_score,\n", + " \"Summary 2\": eval_2_score,\n", + " }\n", + " rouge_scores_out.append(row)\n", + "\n", + "\n", + "def highlight_max(s):\n", + " is_max = s == s.max()\n", + " return [\n", + " \"background-color: lightgreen\" if v else \"background-color: white\"\n", + " for v in 
is_max\n", + " ]\n", + "\n", + "\n", + "rouge_scores_out = (\n", + " pd.DataFrame(rouge_scores_out)\n", + " .set_index(\"Metric\")\n", + " .style.apply(highlight_max, axis=1)\n", + ")\n", + "\n", + "rouge_scores_out" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "cell_id": "a0857e829dc64f64a183212bb5aab122", + "deepnote_app_coordinates": { + "h": 5, + "w": 12, + "x": 0, + "y": 0 + }, + "deepnote_cell_type": "markdown" + }, + "source": [ + "The table shows the `ROUGE` scores for evaluating two different summaries against a reference text. For `rouge-1`, Summary 2 outperforms Summary 1, indicating a better overlap of individual words. For `rouge-l`, Summary 2 also has a higher score, implying a closer match in the longest common subsequences, and thus a potentially better overall summarization in capturing the main content and order of the original text. Summary 1 scores higher only on `rouge-2`, which measures bigram overlap. Since Summary 2 has many words and short phrases directly lifted from the excerpt, its overlap with the reference summary would likely be higher, leading to higher `ROUGE` scores.\n", + "\n", + "While `ROUGE` and similar metrics, such as [BLEU](https://aclanthology.org/P02-1040.pdf) and [METEOR](https://www.cs.cmu.edu/~alavie/METEOR/), offer quantitative measures, they often fail to capture the true essence of a well-generated summary. They also correlate relatively poorly with human judgments. Given the advancements in LLMs, which are adept at producing fluent and coherent summaries, traditional metrics like `ROUGE` may inadvertently penalize these models. 
This is especially true if the summaries are articulated differently but still encapsulate the core information accurately.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "cell_id": "609cfe2cf2f14cd09e184168b83de274", + "deepnote_app_coordinates": { + "h": 5, + "w": 12, + "x": 0, + "y": 0 + }, + "deepnote_cell_type": "markdown" + }, + "source": [ + "### Evaluating using BERTScore\n", + "\n", + "ROUGE has a limitation as it relies on the exact presence of words in both the predicted and reference texts, failing to interpret the underlying semantics. This is where [BERTScore](https://arxiv.org/abs/1904.09675) comes in and leverages the contextual embeddings from the BERT model, aiming to evaluate the similarity between a predicted and a reference sentence in the context of machine-generated text. By comparing embeddings from both sentences, `BERTScore` captures semantic similarities that might be missed by traditional n-gram based metrics.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "cell_id": "b966c86ab65744f5a4a6d2e4d534c86e", + "deepnote_app_coordinates": { + "h": 5, + "w": 12, + "x": 0, + "y": 0 + }, + "deepnote_cell_type": "code", + "deepnote_to_be_reexecuted": false, + "execution_millis": 17954, + "execution_start": 1692083196232, + "source_hash": "a90f7d76" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.bias']\n", + "- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. 
initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", + "- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Summary 1 F1 Score: 0.9227314591407776\n", + "Summary 2 F1 Score: 0.9189572930335999\n" + ] + }, + { + "data": { + "application/javascript": [ + "\n", + " setTimeout(function() {\n", + " var nbb_cell_id = 11;\n", + " var nbb_unformatted_code = \"# Instantiate the BERTScorer object for English language\\nscorer = BERTScorer(lang=\\\"en\\\")\\n\\n# Calculate BERTScore for the summary 1 against the excerpt\\n# P1, R1, F1_1 represent Precision, Recall, and F1 Score respectively\\nP1, R1, F1_1 = scorer.score([eval_summary_1], [ref_summary])\\n\\n# Calculate BERTScore for summary 2 against the excerpt\\n# P2, R2, F2_2 represent Precision, Recall, and F1 Score respectively\\nP2, R2, F2_2 = scorer.score([eval_summary_2], [ref_summary])\\n\\nprint(\\\"Summary 1 F1 Score:\\\", F1_1.tolist()[0])\\nprint(\\\"Summary 2 F1 Score:\\\", F2_2.tolist()[0])\";\n", + " var nbb_formatted_code = \"# Instantiate the BERTScorer object for English language\\nscorer = BERTScorer(lang=\\\"en\\\")\\n\\n# Calculate BERTScore for the summary 1 against the excerpt\\n# P1, R1, F1_1 represent Precision, Recall, and F1 Score respectively\\nP1, R1, F1_1 = scorer.score([eval_summary_1], [ref_summary])\\n\\n# Calculate BERTScore for summary 2 against the excerpt\\n# P2, R2, F2_2 represent Precision, Recall, and F1 Score respectively\\nP2, R2, F2_2 = scorer.score([eval_summary_2], [ref_summary])\\n\\nprint(\\\"Summary 1 F1 Score:\\\", F1_1.tolist()[0])\\nprint(\\\"Summary 2 F1 Score:\\\", F2_2.tolist()[0])\";\n", + " var nbb_cells = Jupyter.notebook.get_cells();\n", + " for (var i = 0; i < nbb_cells.length; ++i) 
{\n", + " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", + " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", + " nbb_cells[i].set_text(nbb_formatted_code);\n", + " }\n", + " break;\n", + " }\n", + " }\n", + " }, 500);\n", + " " + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Instantiate the BERTScorer object for English language\n", + "scorer = BERTScorer(lang=\"en\")\n", + "\n", + "# Calculate BERTScore for summary 1 against the reference summary\n", + "# P1, R1, F1_1 represent Precision, Recall, and F1 Score respectively\n", + "P1, R1, F1_1 = scorer.score([eval_summary_1], [ref_summary])\n", + "\n", + "# Calculate BERTScore for summary 2 against the reference summary\n", + "# P2, R2, F2_2 represent Precision, Recall, and F1 Score respectively\n", + "P2, R2, F2_2 = scorer.score([eval_summary_2], [ref_summary])\n", + "\n", + "print(\"Summary 1 F1 Score:\", F1_1.tolist()[0])\n", + "print(\"Summary 2 F1 Score:\", F2_2.tolist()[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "cell_id": "79d07eaa9a344985838133ffc9e9e02b", + "deepnote_app_coordinates": { + "h": 5, + "w": 12, + "x": 0, + "y": 0 + }, + "deepnote_cell_type": "markdown" + }, + "source": [ + "The close F1 Scores between the summaries indicate that they may perform similarly in capturing the key information. However, this small difference should be interpreted with caution. Since `BERTScore` may not fully grasp subtleties and high-level concepts that a human evaluator might understand, reliance solely on this metric could lead to misinterpreting the actual quality and nuances of the summary. 
An integrated approach combining `BERTScore` with human judgment and other metrics could offer a more reliable evaluation.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "cell_id": "f0d66b7a59334ed3ba51d9bbbcb85890", + "deepnote_app_coordinates": { + "h": 5, + "w": 12, + "x": 0, + "y": 0 + }, + "deepnote_cell_type": "markdown" + }, + "source": [ + "## Using GPT-4 for reference-free evaluation\n", + "\n", + "We implement a reference-free text evaluator using `gpt-4`, inspired by the G-Eval framework which evaluates the quality of generated text using large language models. Unlike metrics like `ROUGE` or `BERTScore` that rely on comparison to reference summaries, the `gpt-4` based evaluator is reference-free - it assesses the quality of generated content based solely on the input prompt and text, without any ground truth references. This makes it applicable to new datasets and tasks where human references are sparse or unavailable. It has the following key components:\n", + "\n", + "* Prompt design: A prompt designed to define the NLG task and specify the desired evaluation criteria for the summarization task.\n", + "* Chain-of-thought generation: a detailed chain-of-thought with step-by-step evaluation instructions.\n", + "* Scoring function: The LLM fills out an evaluation form by generating scores for metrics like coherence, consistency, etc. based on the prompt and chain-of-thought.\n", + "\n", + "**In this demonstration, we're using the direct scoring function where GPT-4 generates a discrete score like 1-5 for each metric. Normalizing the scores and taking a weighted sum could result in more robust, continuous scores that better reflect the quality and diversity of the summaries. 
In the future, implementing the probability weighting would likely improve the correlation with human judgments.**\n", + "\n", + "_Source: [G-EVAL: NLG Evaluation Using GPT-4 with Better Human Alignment.](https://arxiv.org/pdf/2303.16634.pdf) - Liu Y, Iter D, Xu Y, Wang S, Xu R, Zhu C._\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "cell_id": "b029621eb5874de78b349d3cf8dd45b4", + "deepnote_app_coordinates": { + "h": 5, + "w": 12, + "x": 0, + "y": 0 + }, + "deepnote_cell_type": "code", + "deepnote_to_be_reexecuted": false, + "execution_millis": 7700, + "execution_start": 1692083249280, + "source_hash": "ab0afee3" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| Evaluation Type | Summary 1 | Summary 2 |
| --- | --- | --- |
| Coherence | 5 | 3 |
| Consistency | 5 | 5 |
| Fluency | 3 | 2 |
| Relevance | 5 | 4 |
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Evaluation prompt template based on G-Eval\n", + "EVALUATION_PROMPT_TEMPLATE = \"\"\"\n", + "You will be given one summary written for an article. Your task is to rate the summary on one metric.\n", + "Please make sure you read and understand these instructions very carefully. \n", + "Please keep this document open while reviewing, and refer to it as needed.\n", + "\n", + "Evaluation Criteria:\n", + "\n", + "{criteria}\n", + "\n", + "Evaluation Steps:\n", + "\n", + "{steps}\n", + "\n", + "Example:\n", + "\n", + "Source Text:\n", + "\n", + "{document}\n", + "\n", + "Summary:\n", + "\n", + "{summary}\n", + "\n", + "Evaluation Form (scores ONLY):\n", + "\n", + "- {metric_name}\n", + "\"\"\"\n", + "\n", + "# Metric 1: Relevance\n", + "\n", + "RELEVANCY_SCORE_CRITERIA = \"\"\"\n", + "Relevance(1-5) - selection of important content from the source. \\\n", + "The summary should include only important information from the source document. \\\n", + "Annotators were instructed to penalize summaries which contained redundancies and excess information.\n", + "\"\"\"\n", + "\n", + "RELEVANCY_SCORE_STEPS = \"\"\"\n", + "1. Read the summary and the source document carefully.\n", + "2. Compare the summary to the source document and identify the main points of the article.\n", + "3. Assess how well the summary covers the main points of the article, and how much irrelevant or redundant information it contains.\n", + "4. Assign a relevance score from 1 to 5.\n", + "\"\"\"\n", + "\n", + "# Metric 2: Coherence\n", + "\n", + "COHERENCE_SCORE_CRITERIA = \"\"\"\n", + "Coherence(1-5) - the collective quality of all sentences. \\\n", + "We align this dimension with the DUC quality question of structure and coherence \\\n", + "whereby \"the summary should be well-structured and well-organized. 
\\\n", + "The summary should not just be a heap of related information, but should build from sentence to sentence to a \\\n", + "coherent body of information about a topic.\"\n", + "\"\"\"\n", + "\n", + "COHERENCE_SCORE_STEPS = \"\"\"\n", + "1. Read the article carefully and identify the main topic and key points.\n", + "2. Read the summary and compare it to the article. Check if the summary covers the main topic and key points of the article,\n", + "and if it presents them in a clear and logical order.\n", + "3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.\n", + "\"\"\"\n", + "\n", + "# Metric 3: Consistency\n", + "\n", + "CONSISTENCY_SCORE_CRITERIA = \"\"\"\n", + "Consistency(1-5) - the factual alignment between the summary and the summarized source. \\\n", + "A factually consistent summary contains only statements that are entailed by the source document. \\\n", + "Annotators were also asked to penalize summaries that contained hallucinated facts.\n", + "\"\"\"\n", + "\n", + "CONSISTENCY_SCORE_STEPS = \"\"\"\n", + "1. Read the article carefully and identify the main facts and details it presents.\n", + "2. Read the summary and compare it to the article. Check if the summary contains any factual errors that are not supported by the article.\n", + "3. Assign a score for consistency based on the Evaluation Criteria.\n", + "\"\"\"\n", + "\n", + "# Metric 4: Fluency\n", + "\n", + "FLUENCY_SCORE_CRITERIA = \"\"\"\n", + "Fluency(1-3): the quality of the summary in terms of grammar, spelling, punctuation, word choice, and sentence structure.\n", + "1: Poor. The summary has many errors that make it hard to understand or sound unnatural.\n", + "2: Fair. The summary has some errors that affect the clarity or smoothness of the text, but the main points are still comprehensible.\n", + "3: Good. 
The summary has few or no errors and is easy to read and follow.\n", + "\"\"\"\n", + "\n", + "FLUENCY_SCORE_STEPS = \"\"\"\n", + "Read the summary and evaluate its fluency based on the given criteria. Assign a fluency score from 1 to 3.\n", + "\"\"\"\n", + "\n", + "\n", + "def get_geval_score(\n", + " criteria: str, steps: str, document: str, summary: str, metric_name: str\n", + "):\n", + " prompt = EVALUATION_PROMPT_TEMPLATE.format(\n", + " criteria=criteria,\n", + " steps=steps,\n", + " metric_name=metric_name,\n", + " document=document,\n", + " summary=summary,\n", + " )\n", + " response = openai.ChatCompletion.create(\n", + " model=\"gpt-4\",\n", + " messages=[{\"role\": \"user\", \"content\": prompt}],\n", + " temperature=0,\n", + " max_tokens=5,\n", + " top_p=1,\n", + " frequency_penalty=0,\n", + " presence_penalty=0,\n", + " )\n", + " return response.choices[0].message.content\n", + "\n", + "\n", + "evaluation_metrics = {\n", + " \"Relevance\": (RELEVANCY_SCORE_CRITERIA, RELEVANCY_SCORE_STEPS),\n", + " \"Coherence\": (COHERENCE_SCORE_CRITERIA, COHERENCE_SCORE_STEPS),\n", + " \"Consistency\": (CONSISTENCY_SCORE_CRITERIA, CONSISTENCY_SCORE_STEPS),\n", + " \"Fluency\": (FLUENCY_SCORE_CRITERIA, FLUENCY_SCORE_STEPS),\n", + "}\n", + "\n", + "summaries = {\"Summary 1\": eval_summary_1, \"Summary 2\": eval_summary_2}\n", + "\n", + "data = {\"Evaluation Type\": [], \"Summary Type\": [], \"Score\": []}\n", + "\n", + "for eval_type, (criteria, steps) in evaluation_metrics.items():\n", + " for summ_type, summary in summaries.items():\n", + " data[\"Evaluation Type\"].append(eval_type)\n", + " data[\"Summary Type\"].append(summ_type)\n", + " result = get_geval_score(criteria, steps, excerpt, summary, eval_type)\n", + " score_num = int(result.strip())\n", + " data[\"Score\"].append(score_num)\n", + "\n", + "pivot_df = pd.DataFrame(data, index=None).pivot(\n", + " index=\"Evaluation Type\", columns=\"Summary Type\", values=\"Score\"\n", + ")\n", + "styled_pivot_df = 
pivot_df.style.apply(highlight_max, axis=1)\n", + "display(styled_pivot_df)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "cell_id": "03a7682297624a71ae448cf67192d8fd", + "deepnote_app_coordinates": { + "h": 5, + "w": 12, + "x": 0, + "y": 0 + }, + "deepnote_cell_type": "markdown" + }, + "source": [ + "- **Coherence**: Summary 1 received a score of 5, compared to 3 for Summary 2. This indicates that Summary 1 is perceived to be more logically structured and better organized. Coherence here measures the collective quality of all sentences, that is, whether the summary builds from sentence to sentence into a coherent body of information.\n", + "- **Consistency**: Both summaries received a score of 5. This indicates that neither summary contains statements that contradict or are unsupported by the source text.\n", + "- **Fluency**: Summary 1 scored 3 on the 1-3 scale, whereas Summary 2 scored 2. This means that Summary 1 is considered more fluent and reads more naturally. Fluency here covers grammar, spelling, punctuation, word choice, and sentence structure.\n", + "- **Relevance**: Summary 1 scored 5, compared to 4 for Summary 2. Relevance measures how well the summary selects the important content from the source while avoiding redundant or excess information.\n", + "\n", + "Overall, Summary 1 outperforms Summary 2 in three of the four categories (Coherence, Relevance and Fluency), and the two summaries tie on Consistency. The result suggests that Summary 1 is generally preferable based on the given evaluation criteria.\n", + "\n", + "##### Limitations\n", + "\n", + "Note that LLM-based metrics could have a bias towards preferring LLM-generated texts over human-written texts. Additionally, LLM-based metrics are sensitive to system messages/prompts. 
We recommend experimenting with other techniques that can help improve performance and/or get consistent scores, striking the right balance between high-quality expensive evaluation and automated evaluations.\n", + "\n", + "### Conclusion\n", + "\n", + "Evaluating abstractive summarization remains an open area for further improvement. Traditional metrics like `ROUGE`, `BLEU`, and `BERTScore` provide useful automatic evaluation but have limitations in capturing semantic similarity and nuanced aspects of summarization quality. Moreover, they require reference outputs which can be expensive to collect/label. LLM-based metrics offer promise as a reference-free method of evaluating coherence, fluency, and relevance. However, they too have potential biases favoring text generated by LLMs. Ultimately, a combination of automatic metrics and human evaluation is ideal for reliably assessing abstractive summarization systems. While human evaluation is indispensable for gaining a comprehensive understanding of summary quality, it should be complemented with automated evaluation to enable efficient, large-scale testing. The field will continue to evolve more robust evaluation techniques, balancing quality, scalability, and fairness. Advancing evaluation methods is crucial for driving progress in production applications.\n", + "\n", + "### References\n", + "\n", + "- [G-EVAL: NLG Evaluation Using GPT-4 with Better Human Alignment](https://arxiv.org/pdf/2303.16634.pdf) - Liu Y, Iter D, Xu Y, Wang S, Xu R, Zhu C.\n", + "- [BERTScore: Evaluating Text Generation with BERT](https://arxiv.org/abs/1904.09675) - Zhang T, Kishore V, Wu F, Weinberger KQ, Artzi Y. Published online February 24, 2020.\n", + "- [ROUGE: A Package for Automatic Evaluation of Summaries](https://aclanthology.org/W04-1013/) - Lin CY. 
Published July 1, 2004.\n" + ] + } + ], + "metadata": { + "deepnote": {}, + "deepnote_app_layout": "powerful-article", + "deepnote_execution_queue": [], + "deepnote_full_width": true, + "deepnote_notebook_id": "20f885ddefe84c16bd1250151b5a5e1f", + "kernelspec": { + "display_name": "venv", + "language": "python", + "name": "venv" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.13" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +}
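The G-Eval discussion in this notebook notes that probability weighting would likely improve correlation with human judgments. The weighting itself can be sketched as below: each candidate score is multiplied by the probability the model assigns to it, and the results are summed. This is a minimal sketch under stated assumptions, not the notebook's implementation. `weighted_score` and its `token_logprobs` argument are hypothetical names, the caller is assumed to already have log-probabilities for candidate score tokens, and renormalizing over the valid score tokens is a choice made for this sketch.

```python
import math
from typing import Dict


def weighted_score(
    token_logprobs: Dict[str, float], min_score: int = 1, max_score: int = 5
) -> float:
    """Return the probability-weighted score, sum(p(s) * s) over valid scores.

    Assumption: token_logprobs maps candidate output tokens (e.g. "4", "5")
    to their log-probabilities. How these are obtained is outside this sketch.
    """
    valid = {str(s) for s in range(min_score, max_score + 1)}
    probs: Dict[int, float] = {}
    for token, logprob in token_logprobs.items():
        text = token.strip()
        if text in valid:  # ignore tokens that are not a score in range
            probs[int(text)] = probs.get(int(text), 0.0) + math.exp(logprob)
    total = sum(probs.values())
    if total == 0.0:
        raise ValueError("no valid score tokens found")
    # Renormalize so that the probabilities of the valid scores sum to 1
    return sum(score * p for score, p in probs.items()) / total


# Illustrative input only; these probabilities are made up for the example.
example = {"5": math.log(0.6), "4": math.log(0.3), "3": math.log(0.1)}
print(round(weighted_score(example), 3))
```

Compared with parsing a single integer from the response, a weighted score is continuous, so it can separate two summaries that would otherwise both round to the same rating. For Fluency, pass `max_score=3` to match its 1-3 scale.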