add synthetic data cookbook

pull/1106/head
Shahules786 3 months ago
parent 812a2dea93
commit 2905dd4094

@ -0,0 +1,549 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "45b95acd-543f-4248-be8a-28e7379d2470",
"metadata": {},
"source": [
"# Introduction\n",
"\n",
"Ragas is the de-facto opensource standard for RAG evaluations. Ragas provides features and methods to help evaluate RAG applications. In this notebook we will build a synthetic test dataset using Ragas to evaluate your RAG. \n",
"\n",
"### Contents\n",
"- [Prerequisites]()\n",
"- [Dataset preparation]()\n",
"- [Evaluation]()"
]
},
{
"cell_type": "markdown",
"id": "36edfc55-b18a-44db-bac1-c1ec0a91c9db",
"metadata": {},
"source": [
"### Prerequisites\n",
"- Ragas is a python package and we can install it using pip\n",
"- For creating QA pairs, you will need some documents from which you intend to create it. For the sake of this notebook, I am using few papers regarding prompt engineering\n",
"- Ragas uses model guided techniques underneath to produce scores for each metric. In this tutorial, we will use OpenAI `gpt-3.5-turbo` and `text-embedding-ada-002`. These are the default models used in ragas but you can use any LLM or Embedding of your choice by referring to this [guide](https://docs.ragas.io/en/stable/howtos/customisations/bring-your-own-llm-or-embs.html). I highly recommend that you try this notebook with open-ai so that you get a feel of it with ease.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bc320c4f-2367-4ecc-b2a7-5df941e07bf9",
"metadata": {},
"outputs": [],
"source": [
"! pip install ragas"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "50779956",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cloning into 'moe-papers-collection'...\n",
"remote: Enumerating objects: 15, done.\u001b[K\n",
"remote: Counting objects: 100% (12/12), done.\u001b[K\n",
"remote: Compressing objects: 100% (12/12), done.\u001b[K\n",
"remote: Total 15 (delta 1), reused 0 (delta 0), pack-reused 3\u001b[K\n",
"Unpacking objects: 100% (15/15), 2.70 MiB | 11.71 MiB/s, done.\n",
"Filtering content: 100% (2/2), 8.11 MiB | 5.72 MiB/s, done.\n"
]
}
],
"source": [
"!git clone https://huggingface.co/datasets/explodinggradients/prompt-engineering-guide-papers"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "8dbfaeda-49a2-437f-8543-dd242c6422b2",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"OPENAI_API_KEY\"] = \"<your-open-api-key>\""
]
},
{
"cell_type": "markdown",
"id": "de2bf933-50cb-4d79-ad34-bed8db5a5872",
"metadata": {},
"source": [
"### Data preparation\n",
"\n",
"Here I am loading and parsing each of our documents to a `Document` object using langchain document loaders. You can also use llama-index so that same. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "8dc30b79",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import DirectoryLoader\n",
"from ragas.testset.generator import TestsetGenerator\n",
"from ragas.testset.evolutions import simple, reasoning, multi_context, conditional\n",
"\n",
"loader = DirectoryLoader(\"./prompt-engineering-guide-papers\", use_multithreading=True, silent_errors=True,sample_size=5)\n",
"documents = loader.load()\n",
"\n",
"for document in documents:\n",
" document.metadata['filename'] = document.metadata['source']"
]
},
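The loop above copies langchain's `source` metadata into a `filename` key, which the ragas test set generator reads. A minimal stand-in sketch of that step, using `SimpleNamespace` in place of a real langchain `Document` and a made-up path:

```python
from types import SimpleNamespace

# SimpleNamespace stands in for a langchain Document; the path is hypothetical.
documents = [SimpleNamespace(metadata={"source": "papers/chain-of-thought.pdf"})]

# Mirror each document's source path into the 'filename' key.
for document in documents:
    document.metadata["filename"] = document.metadata["source"]

print(documents[0].metadata["filename"])  # papers/chain-of-thought.pdf
```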
{
"cell_type": "markdown",
"id": "1e73de7f-b983-419a-9bd1-b60aae48dc67",
"metadata": {},
"source": [
"### Test set generation\n",
"\n",
"Ragas aims to create high quality and diverse test dataset containing questions of different difficulty levels and types. For this we use a paradigm inspired from the idea of question evolution. One can create test dataset with different types of questions that can be synthetised by ragas, which is controlled using `distributions` parameter. Here I am creating some sample with uniform distribution of each question type.\n",
"\n",
"**Note:** *To know more about the underlying paradigm refer to our [docs](https://docs.ragas.io/en/stable/concepts/testset_generation.html).*"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "782f15f8-0503-48a7-9b38-5e59ce692c3e",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/var/folders/ww/sk5dkfhn673234cmy5w7008r0000gn/T/ipykernel_51325/2981689800.py:2: DeprecationWarning: The function with_openai was deprecated in 0.1.4, and will be removed in the 0.2.0 release. Use from_langchain instead.\n",
" generator = TestsetGenerator.with_openai()\n"
]
}
],
"source": [
"generator = TestsetGenerator.with_openai()\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "360880ab-d5c7-485a-8ca0-fee1e639c8f6",
"metadata": {},
"outputs": [],
"source": [
"distributions = {simple: 0.25, reasoning: 0.25, multi_context: 0.25, conditional:0.25}"
]
},
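The weights passed to `distributions` are the fractions of the final test set each question type should make up, so they should sum to 1. A quick sanity check of that invariant (plain strings stand in here for the ragas evolution objects `simple`, `reasoning`, `multi_context`, and `conditional`):

```python
import math

# Stand-in keys for the ragas evolution objects; the values are the
# fractions of the test set each question type should make up.
distributions = {"simple": 0.25, "reasoning": 0.25, "multi_context": 0.25, "conditional": 0.25}

total = sum(distributions.values())
assert math.isclose(total, 1.0), f"weights should sum to 1, got {total}"
print("distribution ok")
```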
{
"cell_type": "code",
"execution_count": 6,
"id": "438335a5",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"embedding nodes: 0%| | 0/286 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "508f7c85484b49efadee68da0030eeec",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Generating: 0%| | 0/25 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"testset = generator.generate_with_langchain_docs(documents, test_size=25, \n",
" raise_exceptions=False, with_debugging_logs=False,\n",
" distributions=distributions) "
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c603d429",
"metadata": {},
"outputs": [],
"source": [
"df = testset.to_pandas()"
]
},
{
"cell_type": "markdown",
"id": "165e010d-3f8f-4201-bf1d-7cc3c0a13413",
"metadata": {},
"source": [
"And Wola! That's it. You now have a test dataset. Let's inspect and save it"
]
},
{
"cell_type": "markdown",
"id": "cb3721b0-1e04-4b25-9348-71c251c0eff9",
"metadata": {},
"source": [
"### Saving results"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "fc0f24ad-645a-4923-93ee-1e05acf0a47e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>question</th>\n",
" <th>contexts</th>\n",
" <th>ground_truth</th>\n",
" <th>evolution_type</th>\n",
" <th>episode_done</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>How does instruction tuning affect the zero-sh...</td>\n",
" <td>[ tasks (see Table 2 in the Appendix), FLAN on...</td>\n",
" <td>For larger models on the order of 100B paramet...</td>\n",
" <td>simple</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>What is the Zero-shot-CoT method and how does ...</td>\n",
" <td>[ prompts have also focused on per-task engine...</td>\n",
" <td>Zero-shot-CoT is a zero-shot template-based pr...</td>\n",
" <td>simple</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>How does prompt tuning affect model performanc...</td>\n",
" <td>[080.863.867.439.249.4\\n\\nTask Cluster:# datas...</td>\n",
" <td>Prompt tuning improves model performance in im...</td>\n",
" <td>simple</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>What is the purpose of instruction tuning in l...</td>\n",
" <td>[ via natural language instructions, such as “...</td>\n",
" <td>The purpose of instruction tuning in language ...</td>\n",
" <td>reasoning</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>What distinguishes Zero-shot-CoT from Few-shot...</td>\n",
" <td>[ prompts have also focused on per-task engine...</td>\n",
" <td>Zero-shot-CoT differs from Few-shot-CoT in tha...</td>\n",
" <td>reasoning</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Which language models were used in the experim...</td>\n",
" <td>[list\\n\\n1. For all authors...\\n\\n(a) Do the m...</td>\n",
" <td>The language models used in the experiment 'Ex...</td>\n",
" <td>reasoning</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>How does Zero-shot-CoT differ from previous fe...</td>\n",
" <td>[ prompts have also focused on per-task engine...</td>\n",
" <td>Zero-shot-CoT differs from previous few-shot a...</td>\n",
" <td>reasoning</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>What are the stages in the Zero-shot-CoT metho...</td>\n",
" <td>[ it differs from most of the prior template p...</td>\n",
" <td>The Zero-shot-CoT method for reasoning and ans...</td>\n",
" <td>reasoning</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>What are the main approaches for inducing LLMs...</td>\n",
" <td>[2 2 0 2\\n\\nt c O 7\\n\\n] L C . s c [\\n\\n1 v 3 ...</td>\n",
" <td>The main approaches for inducing LLMs to perfo...</td>\n",
" <td>reasoning</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Which sorting method has the most impact on Au...</td>\n",
" <td>[ t a R\\n\\n30\\n\\n20\\n\\n%\\n\\n(\\n\\ne t a R\\n\\n40...</td>\n",
" <td>The sorting method that has the most impact on...</td>\n",
" <td>multi_context</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>What are the pros and cons of prompting method...</td>\n",
" <td>[ prompts have also focused on per-task engine...</td>\n",
" <td>Our work is based on prompting methods for lar...</td>\n",
" <td>multi_context</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>What are the stages in Zero-shot-CoT for reaso...</td>\n",
" <td>[ it differs from most of the prior template p...</td>\n",
" <td>Zero-shot-CoT involves two stages: reasoning e...</td>\n",
" <td>multi_context</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>How does the number of datasets and templates ...</td>\n",
" <td>[oze\\n\\n94.8a 90.0 92.0 90.0 89.0 [10] 91.0 92...</td>\n",
" <td>Using more datasets per task cluster improves ...</td>\n",
" <td>multi_context</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>What technique surpasses zero-shot large langu...</td>\n",
" <td>[3 2 0 2\\n\\nn a J\\n\\n9 2\\n\\n] L C . s c [\\n\\n4...</td>\n",
" <td>Chain of thought (CoT) prompting</td>\n",
" <td>multi_context</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>How does language model scale impact instructi...</td>\n",
" <td>[ tasks (see Table 2 in the Appendix), FLAN on...</td>\n",
" <td>For larger language models on the order of 100...</td>\n",
" <td>conditional</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>What's the advantage of using Zero-shot-CoT pr...</td>\n",
" <td>[ prompts have also focused on per-task engine...</td>\n",
" <td>Zero-shot-CoT prompts offer the advantage of n...</td>\n",
" <td>conditional</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>What's the difference in unresolving rate betw...</td>\n",
" <td>[-Q-CoT.\\n\\nTo begin with, we invoke Zero-Shot...</td>\n",
" <td>The unresolving rate of Retrieval-Q-CoT is 46....</td>\n",
" <td>conditional</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>What are the pros and cons of prompting method...</td>\n",
" <td>[ prompts have also focused on per-task engine...</td>\n",
" <td>Prompting methods for large language models ha...</td>\n",
" <td>conditional</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>What are the stages and processes in the Auto-...</td>\n",
" <td>[ wrong demonstrations may be eliminated with ...</td>\n",
" <td>The Auto-CoT method for constructing demonstra...</td>\n",
" <td>conditional</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>How are genes passed from one generation to th...</td>\n",
" <td>[ Penguin is a kind of bird. Knowledge: Clouds...</td>\n",
" <td>Genes are passed from parent to offspring.</td>\n",
" <td>reasoning</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" question \\\n",
"0 How does instruction tuning affect the zero-sh... \n",
"1 What is the Zero-shot-CoT method and how does ... \n",
"2 How does prompt tuning affect model performanc... \n",
"3 What is the purpose of instruction tuning in l... \n",
"4 What distinguishes Zero-shot-CoT from Few-shot... \n",
"5 Which language models were used in the experim... \n",
"6 How does Zero-shot-CoT differ from previous fe... \n",
"7 What are the stages in the Zero-shot-CoT metho... \n",
"8 What are the main approaches for inducing LLMs... \n",
"9 Which sorting method has the most impact on Au... \n",
"10 What are the pros and cons of prompting method... \n",
"11 What are the stages in Zero-shot-CoT for reaso... \n",
"12 How does the number of datasets and templates ... \n",
"13 What technique surpasses zero-shot large langu... \n",
"14 How does language model scale impact instructi... \n",
"15 What's the advantage of using Zero-shot-CoT pr... \n",
"16 What's the difference in unresolving rate betw... \n",
"17 What are the pros and cons of prompting method... \n",
"18 What are the stages and processes in the Auto-... \n",
"19 How are genes passed from one generation to th... \n",
"\n",
" contexts \\\n",
"0 [ tasks (see Table 2 in the Appendix), FLAN on... \n",
"1 [ prompts have also focused on per-task engine... \n",
"2 [080.863.867.439.249.4\\n\\nTask Cluster:# datas... \n",
"3 [ via natural language instructions, such as “... \n",
"4 [ prompts have also focused on per-task engine... \n",
"5 [list\\n\\n1. For all authors...\\n\\n(a) Do the m... \n",
"6 [ prompts have also focused on per-task engine... \n",
"7 [ it differs from most of the prior template p... \n",
"8 [2 2 0 2\\n\\nt c O 7\\n\\n] L C . s c [\\n\\n1 v 3 ... \n",
"9 [ t a R\\n\\n30\\n\\n20\\n\\n%\\n\\n(\\n\\ne t a R\\n\\n40... \n",
"10 [ prompts have also focused on per-task engine... \n",
"11 [ it differs from most of the prior template p... \n",
"12 [oze\\n\\n94.8a 90.0 92.0 90.0 89.0 [10] 91.0 92... \n",
"13 [3 2 0 2\\n\\nn a J\\n\\n9 2\\n\\n] L C . s c [\\n\\n4... \n",
"14 [ tasks (see Table 2 in the Appendix), FLAN on... \n",
"15 [ prompts have also focused on per-task engine... \n",
"16 [-Q-CoT.\\n\\nTo begin with, we invoke Zero-Shot... \n",
"17 [ prompts have also focused on per-task engine... \n",
"18 [ wrong demonstrations may be eliminated with ... \n",
"19 [ Penguin is a kind of bird. Knowledge: Clouds... \n",
"\n",
" ground_truth evolution_type \\\n",
"0 For larger models on the order of 100B paramet... simple \n",
"1 Zero-shot-CoT is a zero-shot template-based pr... simple \n",
"2 Prompt tuning improves model performance in im... simple \n",
"3 The purpose of instruction tuning in language ... reasoning \n",
"4 Zero-shot-CoT differs from Few-shot-CoT in tha... reasoning \n",
"5 The language models used in the experiment 'Ex... reasoning \n",
"6 Zero-shot-CoT differs from previous few-shot a... reasoning \n",
"7 The Zero-shot-CoT method for reasoning and ans... reasoning \n",
"8 The main approaches for inducing LLMs to perfo... reasoning \n",
"9 The sorting method that has the most impact on... multi_context \n",
"10 Our work is based on prompting methods for lar... multi_context \n",
"11 Zero-shot-CoT involves two stages: reasoning e... multi_context \n",
"12 Using more datasets per task cluster improves ... multi_context \n",
"13 Chain of thought (CoT) prompting multi_context \n",
"14 For larger language models on the order of 100... conditional \n",
"15 Zero-shot-CoT prompts offer the advantage of n... conditional \n",
"16 The unresolving rate of Retrieval-Q-CoT is 46.... conditional \n",
"17 Prompting methods for large language models ha... conditional \n",
"18 The Auto-CoT method for constructing demonstra... conditional \n",
"19 Genes are passed from parent to offspring. reasoning \n",
"\n",
" episode_done \n",
"0 True \n",
"1 True \n",
"2 True \n",
"3 True \n",
"4 True \n",
"5 True \n",
"6 True \n",
"7 True \n",
"8 True \n",
"9 True \n",
"10 True \n",
"11 True \n",
"12 True \n",
"13 True \n",
"14 True \n",
"15 True \n",
"16 True \n",
"17 True \n",
"18 True \n",
"19 True "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = df[df['ground_truth']!=\"nan\"].reset_index(drop=True)\n",
"df"
]
},
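The comparison above is against the literal string `"nan"` (not a float NaN): the assumption is that failed generations land in `ground_truth` as that string, so a plain inequality filter drops them. A toy sketch of the same filter on made-up data:

```python
import pandas as pd

# Toy frame mimicking two of the testset columns; the "nan" row imitates a
# failed generation stored as a literal string rather than a float NaN.
df = pd.DataFrame({
    "question": ["q1", "q2", "q3"],
    "ground_truth": ["answer 1", "nan", "answer 3"],
})

clean = df[df["ground_truth"] != "nan"].reset_index(drop=True)
print(len(clean))  # 2 rows survive the filter
```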
{
"cell_type": "code",
"execution_count": 19,
"id": "ad315aee-3029-46c2-812c-edf821e3f033",
"metadata": {},
"outputs": [],
"source": [
"df.to_csv(\"synthetic_test_dataset.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "58e5c4c8-47dc-4195-8332-453f96e1a6d2",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "ragas",
"language": "python",
"name": "ragas"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}