diff --git a/examples/evaluation/How_to_eval_abstractive_summarization.ipynb b/examples/evaluation/How_to_eval_abstractive_summarization.ipynb index bd131c42..52df4d06 100644 --- a/examples/evaluation/How_to_eval_abstractive_summarization.ipynb +++ b/examples/evaluation/How_to_eval_abstractive_summarization.ipynb @@ -1,6 +1,7 @@ { "cells": [ { + "attachments": {}, "cell_type": "markdown", "metadata": { "cell_id": "83a38f3a8a224a7ab3138f15febbc251", @@ -13,20 +14,21 @@ "deepnote_cell_type": "markdown" }, "source": [ - "## Evaluating Abstractive Summarization\n", + "# Evaluating Abstractive Summarization\n", "\n", "In this notebook we delve into the evaluation techniques for abstractive summarization tasks using a simple example. We explore traditional evaluation methods like [ROUGE](https://aclanthology.org/W04-1013/) and [BERTScore](https://arxiv.org/abs/1904.09675), in addition to showcasing a more novel approach using LLMs as evaluators.\n", "\n", - "Evaluating the quality of summaries is a time-consuming process, as it involves different quality metrics such as coherence, conciseness, readability and content. Traditional automatic evaluation metrics such as `ROUGE` and `BERTScore` and others are concrete and reliable, but they may not correlate well with the actual quality of summaries. They show relatively low correlation with human judgments, especially for open-ended generation tasks (Liu et al., 2023). There's a growing need to lean on human evaluations, user feedback, or model-based metrics while being vigilant about potential biases. While human judgment provides invaluable insights, it is often not scalable and can be cost-prohibitive.\n", + "Evaluating the quality of summaries is a time-consuming process, as it involves different quality metrics such as coherence, conciseness, readability and content. Traditional automatic evaluation metrics such as `ROUGE` and `BERTScore` and others are concrete and reliable, but they may not correlate well with the actual quality of summaries. They show relatively low correlation with human judgments, especially for open-ended generation tasks ([Liu et al., 2023](https://arxiv.org/pdf/2303.16634.pdf)). There's a growing need to lean on human evaluations, user feedback, or model-based metrics while being vigilant about potential biases. While human judgment provides invaluable insights, it is often not scalable and can be cost-prohibitive.\n", "\n", "In addition to these traditional metrics, we showcase a method ([G-Eval](https://arxiv.org/pdf/2303.16634.pdf)) that leverages Large Language Models (LLMs) as a novel, reference-free metric for assessing abstractive summaries. In this case, we use `gpt-4` to score candidate outputs. `gpt-4` has effectively learned an internal model of language quality that allows it to differentiate between fluent, coherent text and low-quality text. 
Harnessing this internal scoring mechanism allows auto-evaluation of new candidate outputs generated by an LLM.\n" ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "### Setup\n" + "## Setup\n" ] }, { @@ -77,24 +79,7 @@ "outputs": [ { "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 23;\n", - " var nbb_unformatted_code = \"import openai\\nimport os\\nimport re\\nimport pandas as pd\\n\\n# Python Implementation of the ROUGE Metric\\nfrom rouge import Rouge\\n\\n# BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity.\\nfrom bert_score import BERTScorer\\n\\nopenai.api_key = os.environ.get(\\\"OPENAI_API_KEY\\\")\";\n", - " var nbb_formatted_code = \"import openai\\nimport os\\nimport re\\nimport pandas as pd\\n\\n# Python Implementation of the ROUGE Metric\\nfrom rouge import Rouge\\n\\n# BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity.\\nfrom bert_score import BERTScorer\\n\\nopenai.api_key = os.environ.get(\\\"OPENAI_API_KEY\\\")\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], + "application/javascript": "\n setTimeout(function() {\n var nbb_cell_id = 23;\n var nbb_unformatted_code = \"import openai\\nimport os\\nimport re\\nimport pandas as pd\\n\\n# Python Implementation of the ROUGE Metric\\nfrom rouge import Rouge\\n\\n# BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity.\\nfrom bert_score import BERTScorer\\n\\nopenai.api_key = os.environ.get(\\\"OPENAI_API_KEY\\\")\";\n var nbb_formatted_code = \"import openai\\nimport os\\nimport re\\nimport pandas as pd\\n\\n# Python Implementation of the ROUGE Metric\\nfrom rouge import Rouge\\n\\n# BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity.\\nfrom bert_score import BERTScorer\\n\\nopenai.api_key = os.environ.get(\\\"OPENAI_API_KEY\\\")\";\n var nbb_cells = Jupyter.notebook.get_cells();\n for (var i = 0; i < nbb_cells.length; ++i) {\n if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n nbb_cells[i].set_text(nbb_formatted_code);\n }\n break;\n }\n }\n }, 500);\n ", "text/plain": [ "" ] @@ -119,6 +104,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": { "cell_id": "7c8bf29b2e6b4c78b5a50a0f42d093d2", @@ -131,9 +117,9 @@ "deepnote_cell_type": "markdown" }, "source": [ - "## Evaluating an abstractive summary\n", + "## Example task\n", "\n", - "Here is a task for evaluating an abstractive summary for the given excerpt. Note that evaluation metrics like `ROUGE` and `BERTScore` require associated reference output.\n", + "For the purposes of this notebook we'll use the example summarization below. 
Notice that we provide two generated summaries to compare, and a reference human-written summary, which evaluation metrics like `ROUGE` and `BERTScore` require.\n", "\n", "Excerpt (`excerpt`):\n", "\n", @@ -168,24 +154,7 @@ "outputs": [ { "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 9;\n", - " var nbb_unformatted_code = \"excerpt = \\\"OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. OpenAI will build safe and beneficial AGI directly, but will also consider its mission fulfilled if its work aids others to achieve this outcome. OpenAI follows several key principles for this purpose. First, broadly distributed benefits - any influence over AGI's deployment will be used for the benefit of all, and to avoid harmful uses or undue concentration of power. Second, long-term safety - OpenAI is committed to doing the research to make AGI safe, and to promote the adoption of such research across the AI community. Third, technical leadership - OpenAI aims to be at the forefront of AI capabilities. Fourth, a cooperative orientation - OpenAI actively cooperates with other research and policy institutions, and seeks to create a global community working together to address AGI's global challenges.\\\"\\nref_summary = \\\"OpenAI aims to ensure artificial general intelligence (AGI) is used for everyone's benefit, avoiding harmful uses or undue power concentration. It is committed to researching AGI safety, promoting such studies among the AI community. OpenAI seeks to lead in AI capabilities and cooperates with global research and policy institutions to address AGI's challenges.\\\"\\neval_summary_1 = \\\"OpenAI aims to AGI benefits all humanity, avoiding harmful uses and power concentration. It pioneers research into safe and beneficial AGI and promotes adoption globally. OpenAI maintains technical leadership in AI while cooperating with global institutions to address AGI challenges. It seeks to lead a collaborative worldwide effort developing AGI for collective good.\\\"\\neval_summary_2 = \\\"OpenAI aims to ensure AGI is for everyone's use, totally avoiding harmful stuff or big power concentration. Committed to researching AGI's safe side, promoting these studies in AI folks. OpenAI wants to be top in AI things and works with worldwide research, policy groups to figure AGI's stuff.\\\"\";\n", - " var nbb_formatted_code = \"excerpt = \\\"OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. OpenAI will build safe and beneficial AGI directly, but will also consider its mission fulfilled if its work aids others to achieve this outcome. OpenAI follows several key principles for this purpose. First, broadly distributed benefits - any influence over AGI's deployment will be used for the benefit of all, and to avoid harmful uses or undue concentration of power. Second, long-term safety - OpenAI is committed to doing the research to make AGI safe, and to promote the adoption of such research across the AI community. Third, technical leadership - OpenAI aims to be at the forefront of AI capabilities. Fourth, a cooperative orientation - OpenAI actively cooperates with other research and policy institutions, and seeks to create a global community working together to address AGI's global challenges.\\\"\\nref_summary = \\\"OpenAI aims to ensure artificial general intelligence (AGI) is used for everyone's benefit, avoiding harmful uses or undue power concentration. 
It is committed to researching AGI safety, promoting such studies among the AI community. OpenAI seeks to lead in AI capabilities and cooperates with global research and policy institutions to address AGI's challenges.\\\"\\neval_summary_1 = \\\"OpenAI aims to AGI benefits all humanity, avoiding harmful uses and power concentration. It pioneers research into safe and beneficial AGI and promotes adoption globally. OpenAI maintains technical leadership in AI while cooperating with global institutions to address AGI challenges. It seeks to lead a collaborative worldwide effort developing AGI for collective good.\\\"\\neval_summary_2 = \\\"OpenAI aims to ensure AGI is for everyone's use, totally avoiding harmful stuff or big power concentration. Committed to researching AGI's safe side, promoting these studies in AI folks. OpenAI wants to be top in AI things and works with worldwide research, policy groups to figure AGI's stuff.\\\"\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], + "application/javascript": "\n setTimeout(function() {\n var nbb_cell_id = 9;\n var nbb_unformatted_code = \"excerpt = \\\"OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. OpenAI will build safe and beneficial AGI directly, but will also consider its mission fulfilled if its work aids others to achieve this outcome. OpenAI follows several key principles for this purpose. First, broadly distributed benefits - any influence over AGI's deployment will be used for the benefit of all, and to avoid harmful uses or undue concentration of power. Second, long-term safety - OpenAI is committed to doing the research to make AGI safe, and to promote the adoption of such research across the AI community. Third, technical leadership - OpenAI aims to be at the forefront of AI capabilities. Fourth, a cooperative orientation - OpenAI actively cooperates with other research and policy institutions, and seeks to create a global community working together to address AGI's global challenges.\\\"\\nref_summary = \\\"OpenAI aims to ensure artificial general intelligence (AGI) is used for everyone's benefit, avoiding harmful uses or undue power concentration. It is committed to researching AGI safety, promoting such studies among the AI community. OpenAI seeks to lead in AI capabilities and cooperates with global research and policy institutions to address AGI's challenges.\\\"\\neval_summary_1 = \\\"OpenAI aims to AGI benefits all humanity, avoiding harmful uses and power concentration. It pioneers research into safe and beneficial AGI and promotes adoption globally. OpenAI maintains technical leadership in AI while cooperating with global institutions to address AGI challenges. It seeks to lead a collaborative worldwide effort developing AGI for collective good.\\\"\\neval_summary_2 = \\\"OpenAI aims to ensure AGI is for everyone's use, totally avoiding harmful stuff or big power concentration. Committed to researching AGI's safe side, promoting these studies in AI folks. 
OpenAI wants to be top in AI things and works with worldwide research, policy groups to figure AGI's stuff.\\\"\";\n var nbb_formatted_code = \"excerpt = \\\"OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. OpenAI will build safe and beneficial AGI directly, but will also consider its mission fulfilled if its work aids others to achieve this outcome. OpenAI follows several key principles for this purpose. First, broadly distributed benefits - any influence over AGI's deployment will be used for the benefit of all, and to avoid harmful uses or undue concentration of power. Second, long-term safety - OpenAI is committed to doing the research to make AGI safe, and to promote the adoption of such research across the AI community. Third, technical leadership - OpenAI aims to be at the forefront of AI capabilities. Fourth, a cooperative orientation - OpenAI actively cooperates with other research and policy institutions, and seeks to create a global community working together to address AGI's global challenges.\\\"\\nref_summary = \\\"OpenAI aims to ensure artificial general intelligence (AGI) is used for everyone's benefit, avoiding harmful uses or undue power concentration. It is committed to researching AGI safety, promoting such studies among the AI community. OpenAI seeks to lead in AI capabilities and cooperates with global research and policy institutions to address AGI's challenges.\\\"\\neval_summary_1 = \\\"OpenAI aims to AGI benefits all humanity, avoiding harmful uses and power concentration. It pioneers research into safe and beneficial AGI and promotes adoption globally. OpenAI maintains technical leadership in AI while cooperating with global institutions to address AGI challenges. It seeks to lead a collaborative worldwide effort developing AGI for collective good.\\\"\\neval_summary_2 = \\\"OpenAI aims to ensure AGI is for everyone's use, totally avoiding harmful stuff or big power concentration. Committed to researching AGI's safe side, promoting these studies in AI folks. OpenAI wants to be top in AI things and works with worldwide research, policy groups to figure AGI's stuff.\\\"\";\n var nbb_cells = Jupyter.notebook.get_cells();\n for (var i = 0; i < nbb_cells.length; ++i) {\n if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n nbb_cells[i].set_text(nbb_formatted_code);\n }\n break;\n }\n }\n }, 500);\n ", "text/plain": [ "" ] @@ -202,6 +171,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": { "cell_id": "f3ae350a2e4b47d985843c5b0808e5b6", @@ -214,7 +184,7 @@ "deepnote_cell_type": "markdown" }, "source": [ - "### Evaluating using ROUGE\n", + "## Evaluating using ROUGE\n", "\n", "[ROUGE](https://aclanthology.org/W04-1013/), which stands for Recall-Oriented Understudy for Gisting Evaluation, primarily gauges the overlap of words between a generated output and a reference text. It's a prevalent metric for evaluating automatic summarization tasks. 
Among its variants, `ROUGE-L` offers insights into the longest contiguous match between system-generated and reference summaries, gauging how well the system retains the original summary's essence.\n" ] @@ -290,24 +260,7 @@ }, { "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 10;\n", - " var nbb_unformatted_code = \"# function to calculate the Rouge score\\ndef get_rouge_scores(text1, text2):\\n rouge = Rouge()\\n return rouge.get_scores(text1, text2)\\n\\n\\nrouge_scores_out = []\\n\\n# Calculate the ROUGE scores for both summaries using reference\\neval_1_rouge = get_rouge_scores(eval_summary_1, ref_summary)\\neval_2_rouge = get_rouge_scores(eval_summary_2, ref_summary)\\n\\nfor metric in [\\\"rouge-1\\\", \\\"rouge-2\\\", \\\"rouge-l\\\"]:\\n for label in [\\\"F-Score\\\"]:\\n eval_1_score = eval_1_rouge[0][metric][label[0].lower()]\\n eval_2_score = eval_2_rouge[0][metric][label[0].lower()]\\n\\n row = {\\n \\\"Metric\\\": f\\\"{metric} ({label})\\\",\\n \\\"Summary 1\\\": eval_1_score,\\n \\\"Summary 2\\\": eval_2_score,\\n }\\n rouge_scores_out.append(row)\\n\\n\\ndef highlight_max(s):\\n is_max = s == s.max()\\n return [\\n \\\"background-color: lightgreen\\\" if v else \\\"background-color: white\\\"\\n for v in is_max\\n ]\\n\\n\\nrouge_scores_out = (\\n pd.DataFrame(rouge_scores_out)\\n .set_index(\\\"Metric\\\")\\n .style.apply(highlight_max, axis=1)\\n)\\n\\nrouge_scores_out\";\n", - " var nbb_formatted_code = \"# function to calculate the Rouge score\\ndef get_rouge_scores(text1, text2):\\n rouge = Rouge()\\n return rouge.get_scores(text1, text2)\\n\\n\\nrouge_scores_out = []\\n\\n# Calculate the ROUGE scores for both summaries using reference\\neval_1_rouge = get_rouge_scores(eval_summary_1, ref_summary)\\neval_2_rouge = get_rouge_scores(eval_summary_2, ref_summary)\\n\\nfor metric in [\\\"rouge-1\\\", \\\"rouge-2\\\", \\\"rouge-l\\\"]:\\n for label in [\\\"F-Score\\\"]:\\n eval_1_score = eval_1_rouge[0][metric][label[0].lower()]\\n eval_2_score = eval_2_rouge[0][metric][label[0].lower()]\\n\\n row = {\\n \\\"Metric\\\": f\\\"{metric} ({label})\\\",\\n \\\"Summary 1\\\": eval_1_score,\\n \\\"Summary 2\\\": eval_2_score,\\n }\\n rouge_scores_out.append(row)\\n\\n\\ndef highlight_max(s):\\n is_max = s == s.max()\\n return [\\n \\\"background-color: lightgreen\\\" if v else \\\"background-color: white\\\"\\n for v in is_max\\n ]\\n\\n\\nrouge_scores_out = (\\n pd.DataFrame(rouge_scores_out)\\n .set_index(\\\"Metric\\\")\\n .style.apply(highlight_max, axis=1)\\n)\\n\\nrouge_scores_out\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], + "application/javascript": "\n setTimeout(function() {\n var nbb_cell_id = 10;\n var nbb_unformatted_code = \"# function to calculate the Rouge score\\ndef get_rouge_scores(text1, text2):\\n rouge = Rouge()\\n return rouge.get_scores(text1, text2)\\n\\n\\nrouge_scores_out = []\\n\\n# Calculate the ROUGE scores for both summaries using reference\\neval_1_rouge = get_rouge_scores(eval_summary_1, ref_summary)\\neval_2_rouge = get_rouge_scores(eval_summary_2, ref_summary)\\n\\nfor metric in [\\\"rouge-1\\\", \\\"rouge-2\\\", \\\"rouge-l\\\"]:\\n for label in [\\\"F-Score\\\"]:\\n eval_1_score 
= eval_1_rouge[0][metric][label[0].lower()]\\n eval_2_score = eval_2_rouge[0][metric][label[0].lower()]\\n\\n row = {\\n \\\"Metric\\\": f\\\"{metric} ({label})\\\",\\n \\\"Summary 1\\\": eval_1_score,\\n \\\"Summary 2\\\": eval_2_score,\\n }\\n rouge_scores_out.append(row)\\n\\n\\ndef highlight_max(s):\\n is_max = s == s.max()\\n return [\\n \\\"background-color: lightgreen\\\" if v else \\\"background-color: white\\\"\\n for v in is_max\\n ]\\n\\n\\nrouge_scores_out = (\\n pd.DataFrame(rouge_scores_out)\\n .set_index(\\\"Metric\\\")\\n .style.apply(highlight_max, axis=1)\\n)\\n\\nrouge_scores_out\";\n var nbb_formatted_code = \"# function to calculate the Rouge score\\ndef get_rouge_scores(text1, text2):\\n rouge = Rouge()\\n return rouge.get_scores(text1, text2)\\n\\n\\nrouge_scores_out = []\\n\\n# Calculate the ROUGE scores for both summaries using reference\\neval_1_rouge = get_rouge_scores(eval_summary_1, ref_summary)\\neval_2_rouge = get_rouge_scores(eval_summary_2, ref_summary)\\n\\nfor metric in [\\\"rouge-1\\\", \\\"rouge-2\\\", \\\"rouge-l\\\"]:\\n for label in [\\\"F-Score\\\"]:\\n eval_1_score = eval_1_rouge[0][metric][label[0].lower()]\\n eval_2_score = eval_2_rouge[0][metric][label[0].lower()]\\n\\n row = {\\n \\\"Metric\\\": f\\\"{metric} ({label})\\\",\\n \\\"Summary 1\\\": eval_1_score,\\n \\\"Summary 2\\\": eval_2_score,\\n }\\n rouge_scores_out.append(row)\\n\\n\\ndef highlight_max(s):\\n is_max = s == s.max()\\n return [\\n \\\"background-color: lightgreen\\\" if v else \\\"background-color: white\\\"\\n for v in is_max\\n ]\\n\\n\\nrouge_scores_out = (\\n pd.DataFrame(rouge_scores_out)\\n .set_index(\\\"Metric\\\")\\n .style.apply(highlight_max, axis=1)\\n)\\n\\nrouge_scores_out\";\n var nbb_cells = Jupyter.notebook.get_cells();\n for (var i = 0; i < nbb_cells.length; ++i) {\n if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n nbb_cells[i].set_text(nbb_formatted_code);\n }\n break;\n }\n }\n }, 500);\n ", "text/plain": [ "" ] @@ -360,6 +313,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": { "cell_id": "a0857e829dc64f64a183212bb5aab122", @@ -378,6 +332,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": { "cell_id": "609cfe2cf2f14cd09e184168b83de274", @@ -390,9 +345,9 @@ "deepnote_cell_type": "markdown" }, "source": [ - "### Evaluating using BERTScore\n", + "## Evaluating using BERTScore\n", "\n", - "ROUGE has a limitation as it relies on the exact presence of words in both the predicted and reference texts, failing to interpret the underlying semantics. This is where [BERTScore](https://arxiv.org/abs/1904.09675) comes in and leverages the contextual embeddings from the BERT model, aiming to evaluate the similarity between a predicted and a reference sentence in the context of machine-generated text. By comparing embeddings from both sentences, `BERTScore` captures semantic similarities that might be missed by traditional n-gram based metrics.\n" + "ROUGE relies on the exact presence of words in both the predicted and reference texts, failing to interpret the underlying semantics. This is where [BERTScore](https://arxiv.org/abs/1904.09675) comes in and leverages the contextual embeddings from the BERT model, aiming to evaluate the similarity between a predicted and a reference sentence in the context of machine-generated text. 
By comparing embeddings from both sentences, `BERTScore` captures semantic similarities that might be missed by traditional n-gram based metrics.\n" ] }, { @@ -432,24 +387,7 @@ }, { "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 11;\n", - " var nbb_unformatted_code = \"# Instantiate the BERTScorer object for English language\\nscorer = BERTScorer(lang=\\\"en\\\")\\n\\n# Calculate BERTScore for the summary 1 against the excerpt\\n# P1, R1, F1_1 represent Precision, Recall, and F1 Score respectively\\nP1, R1, F1_1 = scorer.score([eval_summary_1], [ref_summary])\\n\\n# Calculate BERTScore for summary 2 against the excerpt\\n# P2, R2, F2_2 represent Precision, Recall, and F1 Score respectively\\nP2, R2, F2_2 = scorer.score([eval_summary_2], [ref_summary])\\n\\nprint(\\\"Summary 1 F1 Score:\\\", F1_1.tolist()[0])\\nprint(\\\"Summary 2 F1 Score:\\\", F2_2.tolist()[0])\";\n", - " var nbb_formatted_code = \"# Instantiate the BERTScorer object for English language\\nscorer = BERTScorer(lang=\\\"en\\\")\\n\\n# Calculate BERTScore for the summary 1 against the excerpt\\n# P1, R1, F1_1 represent Precision, Recall, and F1 Score respectively\\nP1, R1, F1_1 = scorer.score([eval_summary_1], [ref_summary])\\n\\n# Calculate BERTScore for summary 2 against the excerpt\\n# P2, R2, F2_2 represent Precision, Recall, and F1 Score respectively\\nP2, R2, F2_2 = scorer.score([eval_summary_2], [ref_summary])\\n\\nprint(\\\"Summary 1 F1 Score:\\\", F1_1.tolist()[0])\\nprint(\\\"Summary 2 F1 Score:\\\", F2_2.tolist()[0])\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], + "application/javascript": "\n setTimeout(function() {\n var nbb_cell_id = 11;\n var nbb_unformatted_code = \"# Instantiate the BERTScorer object for English language\\nscorer = BERTScorer(lang=\\\"en\\\")\\n\\n# Calculate BERTScore for the summary 1 against the excerpt\\n# P1, R1, F1_1 represent Precision, Recall, and F1 Score respectively\\nP1, R1, F1_1 = scorer.score([eval_summary_1], [ref_summary])\\n\\n# Calculate BERTScore for summary 2 against the excerpt\\n# P2, R2, F2_2 represent Precision, Recall, and F1 Score respectively\\nP2, R2, F2_2 = scorer.score([eval_summary_2], [ref_summary])\\n\\nprint(\\\"Summary 1 F1 Score:\\\", F1_1.tolist()[0])\\nprint(\\\"Summary 2 F1 Score:\\\", F2_2.tolist()[0])\";\n var nbb_formatted_code = \"# Instantiate the BERTScorer object for English language\\nscorer = BERTScorer(lang=\\\"en\\\")\\n\\n# Calculate BERTScore for the summary 1 against the excerpt\\n# P1, R1, F1_1 represent Precision, Recall, and F1 Score respectively\\nP1, R1, F1_1 = scorer.score([eval_summary_1], [ref_summary])\\n\\n# Calculate BERTScore for summary 2 against the excerpt\\n# P2, R2, F2_2 represent Precision, Recall, and F1 Score respectively\\nP2, R2, F2_2 = scorer.score([eval_summary_2], [ref_summary])\\n\\nprint(\\\"Summary 1 F1 Score:\\\", F1_1.tolist()[0])\\nprint(\\\"Summary 2 F1 Score:\\\", F2_2.tolist()[0])\";\n var nbb_cells = Jupyter.notebook.get_cells();\n for (var i = 0; i < nbb_cells.length; ++i) {\n if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n nbb_cells[i].set_text(nbb_formatted_code);\n }\n 
break;\n }\n }\n }, 500);\n ", "text/plain": [ "" ] @@ -475,6 +413,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": { "cell_id": "79d07eaa9a344985838133ffc9e9e02b", @@ -491,6 +430,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": { "cell_id": "f0d66b7a59334ed3ba51d9bbbcb85890", @@ -503,17 +443,21 @@ "deepnote_cell_type": "markdown" }, "source": [ - "## Using GPT-4 for reference-free evaluation\n", + "## Evaluating using GPT-4\n", "\n", - "We implement a reference-free text evaluator using `gpt-4`, inspired by the G-Eval framework which evaluates the quality of generated text using large language models. Unlike metrics like `ROUGE` or `BERTScore` that rely on comparison to reference summaries, the `gpt-4` based evaluator is reference-free - it assesses the quality of generated content based solely on the input prompt and text, without any ground truth references. This makes it applicable to new datasets and tasks where human references are sparse or unavailable. It has the following key components:\n", + "Here we implement an example **reference-free** text evaluator using `gpt-4`, inspired by the [G-Eval](https://arxiv.org/pdf/2303.16634.pdf) framework which evaluates the quality of generated text using large language models. Unlike metrics like `ROUGE` or `BERTScore` that rely on comparison to reference summaries, the `gpt-4` based evaluator assesses the quality of generated content based solely on the input prompt and text, without any ground truth references. This makes it applicable to new datasets and tasks where human references are sparse or unavailable. \n", "\n", - "* Prompt design: A prompt designed to define the NLG task and specify the desired evaluation criteria for the summarization task.\n", "* Chain-of-thought generation: a detailed chain-of-thought with step-by-step evaluation instructions.\n", "* Scoring function: The LLM fills out an evaluation form by generating scores for metrics like coherence, consistency, etc. based on the prompt and chain-of-thought.\n", + "Here's an overview of this method:\n", "\n", - "**In this demonstration, we're using the direct scoring function where GPT-4 generates a discrete score like 1-5 for each metric. Normalizing the scores and taking a weighted sum could result in more robust, continuous scores that better reflect the quality and diversity of the summaries. In the future, implementing the probability weighting would likely improve the correlation with human judgments.**\n", + "1. We define four distinct criteria:\n", "    1. **Relevance**: Evaluates if the summary includes only important information and excludes redundancies.\n", "    2. **Coherence**: Assesses the logical flow and organization of the summary.\n", "    3. **Consistency**: Checks if the summary aligns with the facts in the source document.\n", "    4. **Fluency**: Rates the grammar and readability of the summary.\n", "2. We craft prompts for each of these criteria, taking the original document and the summary as inputs, and leveraging chain-of-thought generation and guiding the model to output a numeric score from 1-5 for each criterion. \n", "3. 
We generate scores from `gpt-4` with the defined prompts, comparing them across summaries.\n", "\n", - "_Source: [G-EVAL: NLG Evaluation Using GPT-4 with Better Human Alignment.](https://arxiv.org/pdf/2303.16634.pdf) - Liu Y, Iter D, Xu Y, Wang S, Xu R, Zhu C._\n" + "In this demonstration, we're using a direct scoring function where `gpt-4` generates a discrete score (1-5) for each metric. Normalizing the scores and taking a weighted sum could result in more robust, continuous scores that better reflect the quality and diversity of the summaries." ] }, { @@ -591,24 +535,7 @@ }, { "data": { - "application/javascript": [ - "\n", - " setTimeout(function() {\n", - " var nbb_cell_id = 15;\n", - " var nbb_unformatted_code = \"# Evaluation prompt template based on G-Eval\\nEVALUATION_PROMPT_TEMPLATE = \\\"\\\"\\\"\\nYou will be given one summary written for an article. Your task is to rate the summary on one metric.\\nPlease make sure you read and understand these instructions very carefully. \\nPlease keep this document open while reviewing, and refer to it as needed.\\n\\nEvaluation Criteria:\\n\\n{criteria}\\n\\nEvaluation Steps:\\n\\n{steps}\\n\\nExample:\\n\\nSource Text:\\n\\n{document}\\n\\nSummary:\\n\\n{summary}\\n\\nEvaluation Form (scores ONLY):\\n\\n- {metric_name}\\n\\\"\\\"\\\"\\n\\n# Metric 1: Relevance\\n\\nRELEVANCY_SCORE_CRITERIA = \\\"\\\"\\\"\\nRelevance(1-5) - selection of important content from the source. \\\\\\nThe summary should include only important information from the source document. \\\\\\nAnnotators were instructed to penalize summaries which contained redundancies and excess information.\\n\\\"\\\"\\\"\\n\\nRELEVANCY_SCORE_STEPS = \\\"\\\"\\\"\\n1. Read the summary and the source document carefully.\\n2. Compare the summary to the source document and identify the main points of the article.\\n3. Assess how well the summary covers the main points of the article, and how much irrelevant or redundant information it contains.\\n4. Assign a relevance score from 1 to 5.\\n\\\"\\\"\\\"\\n\\n# Metric 2: Coherence\\n\\nCOHERENCE_SCORE_CRITERIA = \\\"\\\"\\\"\\nCoherence(1-5) - the collective quality of all sentences. \\\\\\nWe align this dimension with the DUC quality question of structure and coherence \\\\\\nwhereby \\\"the summary should be well-structured and well-organized. \\\\\\nThe summary should not just be a heap of related information, but should build from sentence to a\\\\\\ncoherent body of information about a topic.\\\"\\n\\\"\\\"\\\"\\n\\nCOHERENCE_SCORE_STEPS = \\\"\\\"\\\"\\n1. Read the article carefully and identify the main topic and key points.\\n2. Read the summary and compare it to the article. Check if the summary covers the main topic and key points of the article,\\nand if it presents them in a clear and logical order.\\n3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.\\n\\\"\\\"\\\"\\n\\n# Metric 3: Consistency\\n\\nCONSISTENCY_SCORE_CRITERIA = \\\"\\\"\\\"\\nConsistency(1-5) - the factual alignment between the summary and the summarized source. \\\\\\nA factually consistent summary contains only statements that are entailed by the source document. \\\\\\nAnnotators were also asked to penalize summaries that contained hallucinated facts.\\n\\\"\\\"\\\"\\n\\nCONSISTENCY_SCORE_STEPS = \\\"\\\"\\\"\\n1. Read the article carefully and identify the main facts and details it presents.\\n2. Read the summary and compare it to the article. 
Check if the summary contains any factual errors that are not supported by the article.\\n3. Assign a score for consistency based on the Evaluation Criteria.\\n\\\"\\\"\\\"\\n\\n# Metric 4: Fluency\\n\\nFLUENCY_SCORE_CRITERIA = \\\"\\\"\\\"\\nFluency(1-3): the quality of the summary in terms of grammar, spelling, punctuation, word choice, and sentence structure.\\n1: Poor. The summary has many errors that make it hard to understand or sound unnatural.\\n2: Fair. The summary has some errors that affect the clarity or smoothness of the text, but the main points are still comprehensible.\\n3: Good. The summary has few or no errors and is easy to read and follow.\\n\\\"\\\"\\\"\\n\\nFLUENCY_SCORE_STEPS = \\\"\\\"\\\"\\nRead the summary and evaluate its fluency based on the given criteria. Assign a fluency score from 1 to 3.\\n\\\"\\\"\\\"\\n\\n\\ndef get_geval_score(\\n criteria: str, steps: str, document: str, summary: str, metric_name: str\\n):\\n prompt = EVALUATION_PROMPT_TEMPLATE.format(\\n criteria=criteria,\\n steps=steps,\\n metric_name=metric_name,\\n document=document,\\n summary=summary,\\n )\\n response = openai.ChatCompletion.create(\\n model=\\\"gpt-4\\\",\\n messages=[{\\\"role\\\": \\\"user\\\", \\\"content\\\": prompt}],\\n temperature=0,\\n max_tokens=5,\\n top_p=1,\\n frequency_penalty=0,\\n presence_penalty=0,\\n )\\n return response.choices[0].message.content\\n\\n\\nevaluation_metrics = {\\n \\\"Relevance\\\": (RELEVANCY_SCORE_CRITERIA, RELEVANCY_SCORE_STEPS),\\n \\\"Coherence\\\": (COHERENCE_SCORE_CRITERIA, COHERENCE_SCORE_STEPS),\\n \\\"Consistency\\\": (CONSISTENCY_SCORE_CRITERIA, CONSISTENCY_SCORE_STEPS),\\n \\\"Fluency\\\": (FLUENCY_SCORE_CRITERIA, FLUENCY_SCORE_STEPS),\\n}\\n\\nsummaries = {\\\"Summary 1\\\": eval_summary_1, \\\"Summary 2\\\": eval_summary_2}\\n\\ndata = {\\\"Evaluation Type\\\": [], \\\"Summary Type\\\": [], \\\"Score\\\": []}\\n\\nfor eval_type, (criteria, steps) in evaluation_metrics.items():\\n for summ_type, summary in summaries.items():\\n data[\\\"Evaluation Type\\\"].append(eval_type)\\n data[\\\"Summary Type\\\"].append(summ_type)\\n result = get_geval_score(criteria, steps, excerpt, summary, eval_type)\\n score_num = int(result.strip())\\n data[\\\"Score\\\"].append(score_num)\\n\\npivot_df = pd.DataFrame(data, index=None).pivot(\\n index=\\\"Evaluation Type\\\", columns=\\\"Summary Type\\\", values=\\\"Score\\\"\\n)\\nstyled_pivot_df = pivot_df.style.apply(highlight_max, axis=1)\\ndisplay(styled_pivot_df)\";\n", - " var nbb_formatted_code = \"# Evaluation prompt template based on G-Eval\\nEVALUATION_PROMPT_TEMPLATE = \\\"\\\"\\\"\\nYou will be given one summary written for an article. Your task is to rate the summary on one metric.\\nPlease make sure you read and understand these instructions very carefully. \\nPlease keep this document open while reviewing, and refer to it as needed.\\n\\nEvaluation Criteria:\\n\\n{criteria}\\n\\nEvaluation Steps:\\n\\n{steps}\\n\\nExample:\\n\\nSource Text:\\n\\n{document}\\n\\nSummary:\\n\\n{summary}\\n\\nEvaluation Form (scores ONLY):\\n\\n- {metric_name}\\n\\\"\\\"\\\"\\n\\n# Metric 1: Relevance\\n\\nRELEVANCY_SCORE_CRITERIA = \\\"\\\"\\\"\\nRelevance(1-5) - selection of important content from the source. \\\\\\nThe summary should include only important information from the source document. \\\\\\nAnnotators were instructed to penalize summaries which contained redundancies and excess information.\\n\\\"\\\"\\\"\\n\\nRELEVANCY_SCORE_STEPS = \\\"\\\"\\\"\\n1. 
Read the summary and the source document carefully.\\n2. Compare the summary to the source document and identify the main points of the article.\\n3. Assess how well the summary covers the main points of the article, and how much irrelevant or redundant information it contains.\\n4. Assign a relevance score from 1 to 5.\\n\\\"\\\"\\\"\\n\\n# Metric 2: Coherence\\n\\nCOHERENCE_SCORE_CRITERIA = \\\"\\\"\\\"\\nCoherence(1-5) - the collective quality of all sentences. \\\\\\nWe align this dimension with the DUC quality question of structure and coherence \\\\\\nwhereby \\\"the summary should be well-structured and well-organized. \\\\\\nThe summary should not just be a heap of related information, but should build from sentence to a\\\\\\ncoherent body of information about a topic.\\\"\\n\\\"\\\"\\\"\\n\\nCOHERENCE_SCORE_STEPS = \\\"\\\"\\\"\\n1. Read the article carefully and identify the main topic and key points.\\n2. Read the summary and compare it to the article. Check if the summary covers the main topic and key points of the article,\\nand if it presents them in a clear and logical order.\\n3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.\\n\\\"\\\"\\\"\\n\\n# Metric 3: Consistency\\n\\nCONSISTENCY_SCORE_CRITERIA = \\\"\\\"\\\"\\nConsistency(1-5) - the factual alignment between the summary and the summarized source. \\\\\\nA factually consistent summary contains only statements that are entailed by the source document. \\\\\\nAnnotators were also asked to penalize summaries that contained hallucinated facts.\\n\\\"\\\"\\\"\\n\\nCONSISTENCY_SCORE_STEPS = \\\"\\\"\\\"\\n1. Read the article carefully and identify the main facts and details it presents.\\n2. Read the summary and compare it to the article. Check if the summary contains any factual errors that are not supported by the article.\\n3. Assign a score for consistency based on the Evaluation Criteria.\\n\\\"\\\"\\\"\\n\\n# Metric 4: Fluency\\n\\nFLUENCY_SCORE_CRITERIA = \\\"\\\"\\\"\\nFluency(1-3): the quality of the summary in terms of grammar, spelling, punctuation, word choice, and sentence structure.\\n1: Poor. The summary has many errors that make it hard to understand or sound unnatural.\\n2: Fair. The summary has some errors that affect the clarity or smoothness of the text, but the main points are still comprehensible.\\n3: Good. The summary has few or no errors and is easy to read and follow.\\n\\\"\\\"\\\"\\n\\nFLUENCY_SCORE_STEPS = \\\"\\\"\\\"\\nRead the summary and evaluate its fluency based on the given criteria. 
Assign a fluency score from 1 to 3.\\n\\\"\\\"\\\"\\n\\n\\ndef get_geval_score(\\n criteria: str, steps: str, document: str, summary: str, metric_name: str\\n):\\n prompt = EVALUATION_PROMPT_TEMPLATE.format(\\n criteria=criteria,\\n steps=steps,\\n metric_name=metric_name,\\n document=document,\\n summary=summary,\\n )\\n response = openai.ChatCompletion.create(\\n model=\\\"gpt-4\\\",\\n messages=[{\\\"role\\\": \\\"user\\\", \\\"content\\\": prompt}],\\n temperature=0,\\n max_tokens=5,\\n top_p=1,\\n frequency_penalty=0,\\n presence_penalty=0,\\n )\\n return response.choices[0].message.content\\n\\n\\nevaluation_metrics = {\\n \\\"Relevance\\\": (RELEVANCY_SCORE_CRITERIA, RELEVANCY_SCORE_STEPS),\\n \\\"Coherence\\\": (COHERENCE_SCORE_CRITERIA, COHERENCE_SCORE_STEPS),\\n \\\"Consistency\\\": (CONSISTENCY_SCORE_CRITERIA, CONSISTENCY_SCORE_STEPS),\\n \\\"Fluency\\\": (FLUENCY_SCORE_CRITERIA, FLUENCY_SCORE_STEPS),\\n}\\n\\nsummaries = {\\\"Summary 1\\\": eval_summary_1, \\\"Summary 2\\\": eval_summary_2}\\n\\ndata = {\\\"Evaluation Type\\\": [], \\\"Summary Type\\\": [], \\\"Score\\\": []}\\n\\nfor eval_type, (criteria, steps) in evaluation_metrics.items():\\n for summ_type, summary in summaries.items():\\n data[\\\"Evaluation Type\\\"].append(eval_type)\\n data[\\\"Summary Type\\\"].append(summ_type)\\n result = get_geval_score(criteria, steps, excerpt, summary, eval_type)\\n score_num = int(result.strip())\\n data[\\\"Score\\\"].append(score_num)\\n\\npivot_df = pd.DataFrame(data, index=None).pivot(\\n index=\\\"Evaluation Type\\\", columns=\\\"Summary Type\\\", values=\\\"Score\\\"\\n)\\nstyled_pivot_df = pivot_df.style.apply(highlight_max, axis=1)\\ndisplay(styled_pivot_df)\";\n", - " var nbb_cells = Jupyter.notebook.get_cells();\n", - " for (var i = 0; i < nbb_cells.length; ++i) {\n", - " if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n", - " if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n", - " nbb_cells[i].set_text(nbb_formatted_code);\n", - " }\n", - " break;\n", - " }\n", - " }\n", - " }, 500);\n", - " " - ], + "application/javascript": "\n setTimeout(function() {\n var nbb_cell_id = 15;\n var nbb_unformatted_code = \"# Evaluation prompt template based on G-Eval\\nEVALUATION_PROMPT_TEMPLATE = \\\"\\\"\\\"\\nYou will be given one summary written for an article. Your task is to rate the summary on one metric.\\nPlease make sure you read and understand these instructions very carefully. \\nPlease keep this document open while reviewing, and refer to it as needed.\\n\\nEvaluation Criteria:\\n\\n{criteria}\\n\\nEvaluation Steps:\\n\\n{steps}\\n\\nExample:\\n\\nSource Text:\\n\\n{document}\\n\\nSummary:\\n\\n{summary}\\n\\nEvaluation Form (scores ONLY):\\n\\n- {metric_name}\\n\\\"\\\"\\\"\\n\\n# Metric 1: Relevance\\n\\nRELEVANCY_SCORE_CRITERIA = \\\"\\\"\\\"\\nRelevance(1-5) - selection of important content from the source. \\\\\\nThe summary should include only important information from the source document. \\\\\\nAnnotators were instructed to penalize summaries which contained redundancies and excess information.\\n\\\"\\\"\\\"\\n\\nRELEVANCY_SCORE_STEPS = \\\"\\\"\\\"\\n1. Read the summary and the source document carefully.\\n2. Compare the summary to the source document and identify the main points of the article.\\n3. Assess how well the summary covers the main points of the article, and how much irrelevant or redundant information it contains.\\n4. 
Assign a relevance score from 1 to 5.\\n\\\"\\\"\\\"\\n\\n# Metric 2: Coherence\\n\\nCOHERENCE_SCORE_CRITERIA = \\\"\\\"\\\"\\nCoherence(1-5) - the collective quality of all sentences. \\\\\\nWe align this dimension with the DUC quality question of structure and coherence \\\\\\nwhereby \\\"the summary should be well-structured and well-organized. \\\\\\nThe summary should not just be a heap of related information, but should build from sentence to a\\\\\\ncoherent body of information about a topic.\\\"\\n\\\"\\\"\\\"\\n\\nCOHERENCE_SCORE_STEPS = \\\"\\\"\\\"\\n1. Read the article carefully and identify the main topic and key points.\\n2. Read the summary and compare it to the article. Check if the summary covers the main topic and key points of the article,\\nand if it presents them in a clear and logical order.\\n3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.\\n\\\"\\\"\\\"\\n\\n# Metric 3: Consistency\\n\\nCONSISTENCY_SCORE_CRITERIA = \\\"\\\"\\\"\\nConsistency(1-5) - the factual alignment between the summary and the summarized source. \\\\\\nA factually consistent summary contains only statements that are entailed by the source document. \\\\\\nAnnotators were also asked to penalize summaries that contained hallucinated facts.\\n\\\"\\\"\\\"\\n\\nCONSISTENCY_SCORE_STEPS = \\\"\\\"\\\"\\n1. Read the article carefully and identify the main facts and details it presents.\\n2. Read the summary and compare it to the article. Check if the summary contains any factual errors that are not supported by the article.\\n3. Assign a score for consistency based on the Evaluation Criteria.\\n\\\"\\\"\\\"\\n\\n# Metric 4: Fluency\\n\\nFLUENCY_SCORE_CRITERIA = \\\"\\\"\\\"\\nFluency(1-3): the quality of the summary in terms of grammar, spelling, punctuation, word choice, and sentence structure.\\n1: Poor. The summary has many errors that make it hard to understand or sound unnatural.\\n2: Fair. The summary has some errors that affect the clarity or smoothness of the text, but the main points are still comprehensible.\\n3: Good. The summary has few or no errors and is easy to read and follow.\\n\\\"\\\"\\\"\\n\\nFLUENCY_SCORE_STEPS = \\\"\\\"\\\"\\nRead the summary and evaluate its fluency based on the given criteria. 
Assign a fluency score from 1 to 3.\\n\\\"\\\"\\\"\\n\\n\\ndef get_geval_score(\\n criteria: str, steps: str, document: str, summary: str, metric_name: str\\n):\\n prompt = EVALUATION_PROMPT_TEMPLATE.format(\\n criteria=criteria,\\n steps=steps,\\n metric_name=metric_name,\\n document=document,\\n summary=summary,\\n )\\n response = openai.ChatCompletion.create(\\n model=\\\"gpt-4\\\",\\n messages=[{\\\"role\\\": \\\"user\\\", \\\"content\\\": prompt}],\\n temperature=0,\\n max_tokens=5,\\n top_p=1,\\n frequency_penalty=0,\\n presence_penalty=0,\\n )\\n return response.choices[0].message.content\\n\\n\\nevaluation_metrics = {\\n \\\"Relevance\\\": (RELEVANCY_SCORE_CRITERIA, RELEVANCY_SCORE_STEPS),\\n \\\"Coherence\\\": (COHERENCE_SCORE_CRITERIA, COHERENCE_SCORE_STEPS),\\n \\\"Consistency\\\": (CONSISTENCY_SCORE_CRITERIA, CONSISTENCY_SCORE_STEPS),\\n \\\"Fluency\\\": (FLUENCY_SCORE_CRITERIA, FLUENCY_SCORE_STEPS),\\n}\\n\\nsummaries = {\\\"Summary 1\\\": eval_summary_1, \\\"Summary 2\\\": eval_summary_2}\\n\\ndata = {\\\"Evaluation Type\\\": [], \\\"Summary Type\\\": [], \\\"Score\\\": []}\\n\\nfor eval_type, (criteria, steps) in evaluation_metrics.items():\\n for summ_type, summary in summaries.items():\\n data[\\\"Evaluation Type\\\"].append(eval_type)\\n data[\\\"Summary Type\\\"].append(summ_type)\\n result = get_geval_score(criteria, steps, excerpt, summary, eval_type)\\n score_num = int(result.strip())\\n data[\\\"Score\\\"].append(score_num)\\n\\npivot_df = pd.DataFrame(data, index=None).pivot(\\n index=\\\"Evaluation Type\\\", columns=\\\"Summary Type\\\", values=\\\"Score\\\"\\n)\\nstyled_pivot_df = pivot_df.style.apply(highlight_max, axis=1)\\ndisplay(styled_pivot_df)\";\n var nbb_formatted_code = \"# Evaluation prompt template based on G-Eval\\nEVALUATION_PROMPT_TEMPLATE = \\\"\\\"\\\"\\nYou will be given one summary written for an article. Your task is to rate the summary on one metric.\\nPlease make sure you read and understand these instructions very carefully. \\nPlease keep this document open while reviewing, and refer to it as needed.\\n\\nEvaluation Criteria:\\n\\n{criteria}\\n\\nEvaluation Steps:\\n\\n{steps}\\n\\nExample:\\n\\nSource Text:\\n\\n{document}\\n\\nSummary:\\n\\n{summary}\\n\\nEvaluation Form (scores ONLY):\\n\\n- {metric_name}\\n\\\"\\\"\\\"\\n\\n# Metric 1: Relevance\\n\\nRELEVANCY_SCORE_CRITERIA = \\\"\\\"\\\"\\nRelevance(1-5) - selection of important content from the source. \\\\\\nThe summary should include only important information from the source document. \\\\\\nAnnotators were instructed to penalize summaries which contained redundancies and excess information.\\n\\\"\\\"\\\"\\n\\nRELEVANCY_SCORE_STEPS = \\\"\\\"\\\"\\n1. Read the summary and the source document carefully.\\n2. Compare the summary to the source document and identify the main points of the article.\\n3. Assess how well the summary covers the main points of the article, and how much irrelevant or redundant information it contains.\\n4. Assign a relevance score from 1 to 5.\\n\\\"\\\"\\\"\\n\\n# Metric 2: Coherence\\n\\nCOHERENCE_SCORE_CRITERIA = \\\"\\\"\\\"\\nCoherence(1-5) - the collective quality of all sentences. \\\\\\nWe align this dimension with the DUC quality question of structure and coherence \\\\\\nwhereby \\\"the summary should be well-structured and well-organized. 
\\\\\\nThe summary should not just be a heap of related information, but should build from sentence to a\\\\\\ncoherent body of information about a topic.\\\"\\n\\\"\\\"\\\"\\n\\nCOHERENCE_SCORE_STEPS = \\\"\\\"\\\"\\n1. Read the article carefully and identify the main topic and key points.\\n2. Read the summary and compare it to the article. Check if the summary covers the main topic and key points of the article,\\nand if it presents them in a clear and logical order.\\n3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.\\n\\\"\\\"\\\"\\n\\n# Metric 3: Consistency\\n\\nCONSISTENCY_SCORE_CRITERIA = \\\"\\\"\\\"\\nConsistency(1-5) - the factual alignment between the summary and the summarized source. \\\\\\nA factually consistent summary contains only statements that are entailed by the source document. \\\\\\nAnnotators were also asked to penalize summaries that contained hallucinated facts.\\n\\\"\\\"\\\"\\n\\nCONSISTENCY_SCORE_STEPS = \\\"\\\"\\\"\\n1. Read the article carefully and identify the main facts and details it presents.\\n2. Read the summary and compare it to the article. Check if the summary contains any factual errors that are not supported by the article.\\n3. Assign a score for consistency based on the Evaluation Criteria.\\n\\\"\\\"\\\"\\n\\n# Metric 4: Fluency\\n\\nFLUENCY_SCORE_CRITERIA = \\\"\\\"\\\"\\nFluency(1-3): the quality of the summary in terms of grammar, spelling, punctuation, word choice, and sentence structure.\\n1: Poor. The summary has many errors that make it hard to understand or sound unnatural.\\n2: Fair. The summary has some errors that affect the clarity or smoothness of the text, but the main points are still comprehensible.\\n3: Good. The summary has few or no errors and is easy to read and follow.\\n\\\"\\\"\\\"\\n\\nFLUENCY_SCORE_STEPS = \\\"\\\"\\\"\\nRead the summary and evaluate its fluency based on the given criteria. 
Assign a fluency score from 1 to 3.\\n\\\"\\\"\\\"\\n\\n\\ndef get_geval_score(\\n criteria: str, steps: str, document: str, summary: str, metric_name: str\\n):\\n prompt = EVALUATION_PROMPT_TEMPLATE.format(\\n criteria=criteria,\\n steps=steps,\\n metric_name=metric_name,\\n document=document,\\n summary=summary,\\n )\\n response = openai.ChatCompletion.create(\\n model=\\\"gpt-4\\\",\\n messages=[{\\\"role\\\": \\\"user\\\", \\\"content\\\": prompt}],\\n temperature=0,\\n max_tokens=5,\\n top_p=1,\\n frequency_penalty=0,\\n presence_penalty=0,\\n )\\n return response.choices[0].message.content\\n\\n\\nevaluation_metrics = {\\n \\\"Relevance\\\": (RELEVANCY_SCORE_CRITERIA, RELEVANCY_SCORE_STEPS),\\n \\\"Coherence\\\": (COHERENCE_SCORE_CRITERIA, COHERENCE_SCORE_STEPS),\\n \\\"Consistency\\\": (CONSISTENCY_SCORE_CRITERIA, CONSISTENCY_SCORE_STEPS),\\n \\\"Fluency\\\": (FLUENCY_SCORE_CRITERIA, FLUENCY_SCORE_STEPS),\\n}\\n\\nsummaries = {\\\"Summary 1\\\": eval_summary_1, \\\"Summary 2\\\": eval_summary_2}\\n\\ndata = {\\\"Evaluation Type\\\": [], \\\"Summary Type\\\": [], \\\"Score\\\": []}\\n\\nfor eval_type, (criteria, steps) in evaluation_metrics.items():\\n for summ_type, summary in summaries.items():\\n data[\\\"Evaluation Type\\\"].append(eval_type)\\n data[\\\"Summary Type\\\"].append(summ_type)\\n result = get_geval_score(criteria, steps, excerpt, summary, eval_type)\\n score_num = int(result.strip())\\n data[\\\"Score\\\"].append(score_num)\\n\\npivot_df = pd.DataFrame(data, index=None).pivot(\\n index=\\\"Evaluation Type\\\", columns=\\\"Summary Type\\\", values=\\\"Score\\\"\\n)\\nstyled_pivot_df = pivot_df.style.apply(highlight_max, axis=1)\\ndisplay(styled_pivot_df)\";\n var nbb_cells = Jupyter.notebook.get_cells();\n for (var i = 0; i < nbb_cells.length; ++i) {\n if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n nbb_cells[i].set_text(nbb_formatted_code);\n }\n break;\n }\n }\n }, 500);\n ", "text/plain": [ "" ] @@ -756,6 +683,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": { "cell_id": "03a7682297624a71ae448cf67192d8fd", @@ -768,22 +696,18 @@ "deepnote_cell_type": "markdown" }, "source": [ - "- **Coherence**: Summary 1 received a higher score of 5, compared to the score for Summary 2. This indicates that the Summary 1 is perceived to be more logically structured and coherent. Coherence is a metric used to assess the quality of the learned topics in a text or document.\n", - "- **Consistency**: Both the summaries received equal scores of 5. This shows that both summaries maintain a uniform tone and content, without contradictions.\n", - "- **Fluency**: The Summary 1 scored higher again with a score of 3, whereas Summary 2 scored 2. This means that the Summary 1 is considered more fluent and reads more naturally. Fluency is a metric used to evaluate the smoothness and naturalness of a text or speech\n", - "- **Relevance**: Lastly, relevance, where Summary 1 scored higher, is a metric used to assess the degree to which something is related or applicable to a particular topic or context. Both summaries perform equally well.\n", "\n", "Overall, the Summary 1 appears to outperform Summary 2 in three of the four categories (Coherence, Relevance and Fluency). Both summaries are found to be consistent with each other. 
The result might suggest that Summary 1 is generally preferable based on the given evaluation criteria.\n", "\n", - "##### Limitations\n", + "### Limitations\n", "\n", - "Note that LLM-based metrics could have a bias towards preferring LLM-generated texts over human-written texts. Additionally LLM based metrics are sensitive to system messages/prompts. We recommend experimenting with other techniques that can help improve performance and/or get consistent scores, striking the right balance between high-quality expensive evaluation and automated evaluations.\n", + "Note that LLM-based metrics could have a bias towards preferring LLM-generated texts over human-written texts. Additionally, LLM-based metrics are sensitive to system messages/prompts. We recommend experimenting with other techniques that can help improve performance and/or get consistent scores, striking the right balance between high-quality, expensive evaluation and automated evaluations. It is also worth noting that this scoring methodology is currently limited by `gpt-4`'s context window.\n", "\n", - "### Conclusion\n", + "## Conclusion\n", "\n", "Evaluating abstractive summarization remains an open area for further improvement. Traditional metrics like `ROUGE`, `BLEU`, and `BERTScore` provide useful automatic evaluation but have limitations in capturing semantic similarity and nuanced aspects of summarization quality. Moreover, they require reference outputs which can be expensive to collect/label. LLM-based metrics offer promise as a reference-free method of evaluating coherence, fluency, and relevance. However, they too have potential biases favoring text generated by LLMs. Ultimately, a combination of automatic metrics and human evaluation is ideal for reliably assessing abstractive summarization systems. While human evaluation is indispensable for gaining a comprehensive understanding of summary quality, it should be complemented with automated evaluation to enable efficient, large-scale testing. The field will continue to evolve more robust evaluation techniques, balancing quality, scalability, and fairness. Advancing evaluation methods is crucial for driving progress in production applications.\n", "\n", - "### References\n", + "## References\n", "\n", "- [G-EVAL: NLG Evaluation Using GPT-4 with Better Human Alignment](https://arxiv.org/pdf/2303.16634.pdf) - Liu Y, Iter D, Xu Y, Wang S, Xu R, Zhu C.\n", "- [BERTScore: Evaluating Text Generation with BERT](https://arxiv.org/abs/1904.09675) - Zhang T, Kishore V, Wu F, Weinberger KQ, Artzi Y. Published online February 24, 2020.\n",