@ -0,0 +1,497 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "18c253af-fdb3-414b-bc68-8bd18004f5cc",
"metadata": {},
"source": [
"# Introduction\n",
"\n",
"<a target=\"_blank\" href=\"https://colab.research.google.com/github/shahules786/openai-cookbook/blob/ragas/examples/evaluation/ragas/openai-ragas-eval-cookbook.ipynb\">\n",
" <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
"</a>\n",
"\n",
"Ragas is the de-facto opensource standard for RAG evaluations. Ragas provides features and methods to help evaluate RAG applications. In this notebook we will cover basic steps for evaluating your RAG application with Ragas. \n",
"\n",
"### Contents\n",
"- [Prerequisites]()\n",
"- [Dataset preparation]()\n",
"- [Evaluation]()\n",
"- [Analysis]()"
]
},
{
"cell_type": "markdown",
"id": "73c40aa9-aa04-44fc-8ef3-2ab7bd132c36",
"metadata": {},
"source": [
"### Prerequisites\n",
"- Ragas is a python package and we can install it using pip\n",
"- Some documents to build our simple RAG pipeline\n",
"- Ragas uses model guided techniques underneath to produce scores for each metric. In this tutorial, we will use OpenAI `gpt-3.5-turbo` and `text-embedding-ada-002`. These are the default models used in ragas but you can use any LLM or Embedding of your choice by referring to this [guide](https://docs.ragas.io/en/stable/howtos/customisations/bring-your-own-llm-or-embs.html). I highly recommend that you try this notebook with open-ai so that you get a feel of it with ease."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "59db1ff3-b618-4dca-924b-035a2f5def0c",
"metadata": {},
"outputs": [],
"source": [
"! pip install -q ragas llama-index"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d0e43825-9c28-417e-a18a-03543383f3bd",
"metadata": {},
"outputs": [],
"source": [
"!git clone https://huggingface.co/datasets/explodinggradients/prompt-engineering-guide-papers"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "f9e283b4-ae3f-4e76-b990-5b890b5364fa",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"OPENAI_API_KEY\"] = \"<your-openai-key>\"\n",
"\n",
"try:\n",
" import google.colab\n",
" PATH = \"/content/prompt-engineering-guide-papers\"\"\n",
"except:\n",
" PATH = \"./prompt-engineering-guide-papers\""
]
},
{
"cell_type": "markdown",
"id": "8ae41c3a-4fc2-4596-93f2-c763adeef56e",
"metadata": {},
"source": [
"And that's it. You're ready to go."
]
},
{
"cell_type": "markdown",
"id": "b7b14d57-68cc-4beb-9b99-76827687db88",
"metadata": {},
"source": [
"## Dataset preparation\n",
"\n",
"Evaluating any ML pipeline will require several data points that constitues a test dataset. For Ragas, the data points required for evaluating your RAG completely are\n",
"\n",
"- `question`: A question or query that is relevant to your RAG.\n",
"- `contexts`: The retrieved contexts corresponding to each question. This is a `list[list]` since each question can retrieve multiple text chunks.\n",
"- `answer`: The answer generated by your RAG corresponding to each question.\n",
"- `ground_truth`: The expected correct answer corresponding to each question.\n",
"\n",
"For the purpose of this notebook, I have this dataset prepared from a simple RAG that I created myself to help me with NLP research. Let's use it."
]
},
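{
"cell_type": "markdown",
"id": "dataset-schema-note",
"metadata": {},
"source": [
"To make that schema concrete, here is a minimal, purely illustrative sketch of a single evaluation sample with all four attributes filled in (the values below are made up; we will collect the real `contexts` and `answer` from our RAG pipeline shortly)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dataset-schema-sketch",
"metadata": {},
"outputs": [],
"source": [
"from datasets import Dataset\n",
"\n",
"# Illustrative sample only: one hand-written row with the four columns ragas expects\n",
"example_ds = Dataset.from_dict(\n",
"    {\n",
"        \"question\": [\"What is instruction tuning?\"],\n",
"        \"contexts\": [[\"Instruction tuning finetunes a language model on a collection of tasks phrased as instructions...\"]],\n",
"        \"answer\": [\"Instruction tuning finetunes a model on tasks described via natural language instructions.\"],\n",
"        \"ground_truth\": [\"Instruction tuning finetunes language models on instruction-phrased tasks to improve zero-shot performance.\"],\n",
"    }\n",
")\n",
"example_ds"
]
},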
{
"cell_type": "code",
"execution_count": 3,
"id": "326f3dc1-775f-4ec5-8f27-afd76e9b5b22",
"metadata": {},
"outputs": [],
"source": [
"from datasets import load_dataset"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "1ee49bc5-4661-4435-8463-197877c18fa3",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Found cached dataset json (/Users/shahules/.cache/huggingface/datasets/explodinggradients___json/explodinggradients--prompt-engineering-guide-papers-9147f70034f5334d/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "89035d84b3e04d489f59ef9673fa716a",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>question</th>\n",
" <th>ground_truth</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>How does instruction tuning affect the zero-sh...</td>\n",
" <td>For larger models on the order of 100B paramet...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>What is the Zero-shot-CoT method and how does ...</td>\n",
" <td>Zero-shot-CoT is a zero-shot template-based pr...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>How does prompt tuning affect model performanc...</td>\n",
" <td>Prompt tuning improves model performance in im...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>What is the purpose of instruction tuning in l...</td>\n",
" <td>The purpose of instruction tuning in language ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>What distinguishes Zero-shot-CoT from Few-shot...</td>\n",
" <td>Zero-shot-CoT differs from Few-shot-CoT in tha...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" question ground_truth\n",
"0 How does instruction tuning affect the zero-sh... For larger models on the order of 100B paramet...\n",
"1 What is the Zero-shot-CoT method and how does ... Zero-shot-CoT is a zero-shot template-based pr...\n",
"2 How does prompt tuning affect model performanc... Prompt tuning improves model performance in im...\n",
"3 What is the purpose of instruction tuning in l... The purpose of instruction tuning in language ...\n",
"4 What distinguishes Zero-shot-CoT from Few-shot... Zero-shot-CoT differs from Few-shot-CoT in tha..."
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eval_dataset = load_dataset(\"explodinggradients/prompt-engineering-guide-papers\")\n",
"eval_dataset = eval_dataset['test'].to_pandas()\n",
"eval_dataset.head()"
]
},
{
"cell_type": "markdown",
"id": "84ae0719-82bc-4103-8299-3df7021951e1",
"metadata": {},
"source": [
"As you can see, the dataset contains two of the required attributes mentioned,that is `question` and `ground_truth` answers. Now we can move on our next step to collect the other two attributes.\n",
"\n",
"**Note:**\n",
"*We know that it's hard to formulate a test data containing Question and ground truth answer pairs when starting out. We have the perfect solution for this in this form of a ragas synthetic test data generation feature. The questions and ground truth answers were created by [ragas synthetic data generation](https://colab.research.google.com/github/shahules786/openai-cookbook/blob/ragas/examples/evaluation/ragas/openai-ragas-synthetic-test.ipynb) feature. Check it out here once you finish this notebook*"
]
},
{
"cell_type": "markdown",
"id": "6184b6b5-7373-4665-9754-b4fc08929000",
"metadata": {},
"source": [
"#### Simple RAG pipeline\n",
"\n",
"Now with the above step we have two attributes needed for evaluation, that is `question` and `ground_truth` answers. We now need to feed these test questions to our RAG pipeline to collect the other two attributes, ie `contexts` and `answer`. Let's build a simple RAG using llama-index to do that. "
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "d7bbceb1-5e05-422d-8690-f49fc71245d4",
"metadata": {},
"outputs": [],
"source": [
"import nest_asyncio\n",
"from llama_index.core.indices import VectorStoreIndex\n",
"from llama_index.core.readers import SimpleDirectoryReader\n",
"from llama_index.core.service_context import ServiceContext\n",
"from datasets import Dataset\n",
"\n",
"nest_asyncio.apply()\n",
"\n",
"\n",
"def build_query_engine(documents):\n",
" vector_index = VectorStoreIndex.from_documents(\n",
" documents, service_context=ServiceContext.from_defaults(chunk_size=512),\n",
" )\n",
"\n",
" query_engine = vector_index.as_query_engine(similarity_top_k=3)\n",
" return query_engine\n",
"\n",
"# Function to evaluate as Llama index does not support async evaluation for HFInference API\n",
"def generate_responses(query_engine, test_questions, test_answers):\n",
" responses = [query_engine.query(q) for q in test_questions]\n",
"\n",
" answers = []\n",
" contexts = []\n",
" for r in responses:\n",
" answers.append(r.response)\n",
" contexts.append([c.node.get_content() for c in r.source_nodes])\n",
" dataset_dict = {\n",
" \"question\": test_questions,\n",
" \"answer\": answers,\n",
" \"contexts\": contexts,\n",
" }\n",
" if test_answers is not None:\n",
" dataset_dict[\"ground_truth\"] = test_answers\n",
" ds = Dataset.from_dict(dataset_dict)\n",
" return ds"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "d299259a-1064-44d4-9c96-5ae423d9e2f8",
"metadata": {},
"outputs": [],
"source": [
"reader = SimpleDirectoryReader(PATH,num_files_limit=30, required_exts=[\".pdf\"])\n",
"documents = reader.load_data()\n"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "546f7b5c-36b7-4ffc-90d6-bea27df01aa5",
"metadata": {},
"outputs": [],
"source": [
"test_questions = eval_dataset['question'].values.tolist()\n",
"test_answers = eval_dataset['ground_truth'].values.tolist()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "52a67da6-2eeb-452c-a90c-5c1ea860545b",
"metadata": {},
"outputs": [],
"source": [
"query_engine1 = build_query_engine(documents)\n",
"result_ds = generate_responses(query_engine1, test_questions, test_answers)"
]
},
{
"cell_type": "markdown",
"id": "6be10812-894e-43ba-857d-36627eb54dc8",
"metadata": {},
"source": [
"## Evaluation\n",
"For evaluation ragas provides several metrics which is aimed to quantify the end-end performance of the pipeline and also the component wise performance of the pipeline. For this tutorial let's consider few of them\n",
"\n",
"**Note**: *Refer to our [metrics](https://docs.ragas.io/en/stable/concepts/metrics/index.html) docs to read more about different metrics.*"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "98b6cbba-fecb-4b92-8cd9-839d80025b22",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "3c413241484641c6984b6b95af0367c9",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Evaluating: 0%| | 0/40 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from ragas.metrics import answer_correctness, faithfulness \n",
"from ragas import evaluate\n",
"\n",
"ragas_results = evaluate(result_ds, metrics=[answer_correctness, faithfulness ])"
]
},
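{
"cell_type": "markdown",
"id": "aggregate-scores-note",
"metadata": {},
"source": [
"The object returned by `evaluate` holds the aggregate score for each metric, so displaying it is a quick way to check how the pipeline does overall before drilling down into individual samples. This is just a sketch; the exact representation of the result object may vary slightly across ragas versions."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aggregate-scores-cell",
"metadata": {},
"outputs": [],
"source": [
"# Aggregate score per metric, averaged over all evaluated samples\n",
"ragas_results"
]
},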
{
"cell_type": "markdown",
"id": "7fa8eaaa-b6b9-4ae9-a2df-f65f0559b565",
"metadata": {},
"source": [
"## Analysis\n",
"You can export the individual scores to dataframe and analyse it. You can also add [callbacks and tracing](https://docs.ragas.io/en/latest/howtos/applications/tracing.html) to ragas to do indepth analysis."
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "2ff74280-02c4-4992-998d-4a9689e47b89",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>question</th>\n",
" <th>answer</th>\n",
" <th>contexts</th>\n",
" <th>ground_truth</th>\n",
" <th>answer_correctness</th>\n",
" <th>faithfulness</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>How does instruction tuning affect the zero-sh...</td>\n",
" <td>Instruction tuning enhances the zero-shot perf...</td>\n",
" <td>[34\\nthe effectiveness of different constructi...</td>\n",
" <td>For larger models on the order of 100B paramet...</td>\n",
" <td>0.781983</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>What is the Zero-shot-CoT method and how does ...</td>\n",
" <td>Zero-shot-CoT is a method that involves append...</td>\n",
" <td>[Plan-and-Solve Prompting: Improving Zero-Shot...</td>\n",
" <td>Zero-shot-CoT is a zero-shot template-based pr...</td>\n",
" <td>0.667026</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>How does prompt tuning affect model performanc...</td>\n",
" <td>Prompt tuning can impact model performance in ...</td>\n",
" <td>[4 C. Liu et al.\\nto generate results directly...</td>\n",
" <td>Prompt tuning improves model performance in im...</td>\n",
" <td>0.396040</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>What is the purpose of instruction tuning in l...</td>\n",
" <td>The purpose of instruction tuning in language ...</td>\n",
" <td>[In practice,\\ninstruction tuning offers a gen...</td>\n",
" <td>The purpose of instruction tuning in language ...</td>\n",
" <td>0.694074</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>What distinguishes Zero-shot-CoT from Few-shot...</td>\n",
" <td>Zero-shot-CoT conditions the LM on a single pr...</td>\n",
" <td>[Wei et al. (2022b ) observe that the success ...</td>\n",
" <td>Zero-shot-CoT differs from Few-shot-CoT in tha...</td>\n",
" <td>0.530018</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" question ... faithfulness\n",
"0 How does instruction tuning affect the zero-sh... ... 1.0\n",
"1 What is the Zero-shot-CoT method and how does ... ... 1.0\n",
"2 How does prompt tuning affect model performanc... ... 1.0\n",
"3 What is the purpose of instruction tuning in l... ... 1.0\n",
"4 What distinguishes Zero-shot-CoT from Few-shot... ... 1.0\n",
"\n",
"[5 rows x 6 columns]"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ragas_results.to_pandas().head(5)"
]
},
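{
"cell_type": "markdown",
"id": "low-score-analysis-note",
"metadata": {},
"source": [
"One simple way to analyse these per-sample scores is to surface the rows where a metric is lowest and inspect them manually. A minimal sketch of that idea follows; it only assumes the dataframe produced above and the `answer_correctness` column from the metric we computed."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "low-score-analysis-cell",
"metadata": {},
"outputs": [],
"source": [
"results_df = ragas_results.to_pandas()\n",
"\n",
"# Look at the samples with the lowest answer_correctness first; these are the most likely failure cases\n",
"results_df.sort_values(\"answer_correctness\").head(3)"
]
},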
{
"cell_type": "markdown",
"id": "d7e7242c-a785-4ac3-94c8-2c1e795bb53a",
"metadata": {},
"source": [
"**If you liked this tutorial, checkout [ragas](https://github.com/explodinggradients/ragas) and consider leaving a star!**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c020fec7-8451-46ae-8b2d-192ae468428e",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "ragas",
"language": "python",
"name": "ragas"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,570 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "45b95acd-543f-4248-be8a-28e7379d2470",
"metadata": {},
"source": [
"# Introduction\n",
"\n",
"<a target=\"_blank\" href=\"https://colab.research.google.com/github/shahules786/openai-cookbook/blob/ragas/examples/evaluation/ragas/openai-ragas-synthetic-test.ipynb\">\n",
" <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
"</a>\n",
"\n",
"Ragas is the de-facto opensource standard for RAG evaluations. Ragas provides features and methods to help evaluate RAG applications. In this notebook we will build a synthetic test dataset using Ragas to evaluate your RAG. \n",
"\n",
"### Contents\n",
"- [Prerequisites]()\n",
"- [Dataset preparation]()\n",
"- [Evaluation]()"
]
},
{
"cell_type": "markdown",
"id": "36edfc55-b18a-44db-bac1-c1ec0a91c9db",
"metadata": {},
"source": [
"### Prerequisites\n",
"- Ragas is a python package and we can install it using pip\n",
"- For creating QA pairs, you will need some documents from which you intend to create it. For the sake of this notebook, I am using few papers regarding prompt engineering\n",
"- Ragas uses model guided techniques underneath to produce scores for each metric. In this tutorial, we will use OpenAI `gpt-3.5-turbo` and `text-embedding-ada-002`. These are the default models used in ragas but you can use any LLM or Embedding of your choice by referring to this [guide](https://docs.ragas.io/en/stable/howtos/customisations/bring-your-own-llm-or-embs.html). I highly recommend that you try this notebook with open-ai so that you get a feel of it with ease.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "bc320c4f-2367-4ecc-b2a7-5df941e07bf9",
"metadata": {},
"outputs": [],
"source": [
"! pip install -q ragas"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "50779956",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cloning into 'moe-papers-collection'...\n",
"remote: Enumerating objects: 15, done.\u001b[K\n",
"remote: Counting objects: 100% (12/12), done.\u001b[K\n",
"remote: Compressing objects: 100% (12/12), done.\u001b[K\n",
"remote: Total 15 (delta 1), reused 0 (delta 0), pack-reused 3\u001b[K\n",
"Unpacking objects: 100% (15/15), 2.70 MiB | 11.71 MiB/s, done.\n",
"Filtering content: 100% (2/2), 8.11 MiB | 5.72 MiB/s, done.\n"
]
}
],
"source": [
"!git clone https://huggingface.co/datasets/explodinggradients/prompt-engineering-guide-papers"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "8dbfaeda-49a2-437f-8543-dd242c6422b2",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"OPENAI_API_KEY\"] = \"<your-open-api-key>\"\n",
"\n",
"try:\n",
" import google.colab\n",
" PATH = \"/content/prompt-engineering-guide-papers\"\"\n",
"except:\n",
" PATH = \"./prompt-engineering-guide-papers\""
]
},
{
"cell_type": "markdown",
"id": "de2bf933-50cb-4d79-ad34-bed8db5a5872",
"metadata": {},
"source": [
"### Data preparation\n",
"\n",
"Here I am loading and parsing each of our documents to a `Document` object using langchain document loaders. You can also use llama-index so that same. "
]
},
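{
"cell_type": "markdown",
"id": "llamaindex-loader-note",
"metadata": {},
"source": [
"Purely as a sketch of the llama-index alternative mentioned above (and assuming `llama-index` is installed), loading the same PDFs would look like the cell below. The langchain loader in the following cell is what we actually use in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "llamaindex-loader-sketch",
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: load the same PDFs with llama-index instead of langchain's DirectoryLoader\n",
"from llama_index.core.readers import SimpleDirectoryReader\n",
"\n",
"li_documents = SimpleDirectoryReader(PATH, required_exts=[\".pdf\"]).load_data()\n",
"# ragas also provides generate_with_llamaindex_docs for documents loaded this way (see the ragas docs)"
]
},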
{
"cell_type": "code",
"execution_count": 1,
"id": "8dc30b79",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import DirectoryLoader\n",
"from ragas.testset.generator import TestsetGenerator\n",
"from ragas.testset.evolutions import simple, reasoning, multi_context, conditional\n",
"\n",
"loader = DirectoryLoader(PATH, use_multithreading=True, silent_errors=True,sample_size=5)\n",
"documents = loader.load()\n",
"\n",
"for document in documents:\n",
" document.metadata['filename'] = document.metadata['source']"
]
},
{
"cell_type": "markdown",
"id": "1e73de7f-b983-419a-9bd1-b60aae48dc67",
"metadata": {},
"source": [
"### Test set generation\n",
"\n",
"Ragas aims to create high quality and diverse test dataset containing questions of different difficulty levels and types. For this we use a paradigm inspired from the idea of question evolution. One can create test dataset with different types of questions that can be synthetised by ragas, which is controlled using `distributions` parameter. Here I am creating some sample with uniform distribution of each question type.\n",
"\n",
"**Note:** *To know more about the underlying paradigm refer to our [docs](https://docs.ragas.io/en/stable/concepts/testset_generation.html).*"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "782f15f8-0503-48a7-9b38-5e59ce692c3e",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/var/folders/ww/sk5dkfhn673234cmy5w7008r0000gn/T/ipykernel_51325/2981689800.py:2: DeprecationWarning: The function with_openai was deprecated in 0.1.4, and will be removed in the 0.2.0 release. Use from_langchain instead.\n",
" generator = TestsetGenerator.with_openai()\n"
]
}
],
"source": [
"generator = TestsetGenerator.with_openai()\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "360880ab-d5c7-485a-8ca0-fee1e639c8f6",
"metadata": {},
"outputs": [],
"source": [
"distributions = {simple: 0.25, reasoning: 0.25, multi_context: 0.25, conditional:0.25}"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "438335a5",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"embedding nodes: 0%| | 0/286 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "508f7c85484b49efadee68da0030eeec",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Generating: 0%| | 0/25 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"testset = generator.generate_with_langchain_docs(documents, test_size=25, \n",
" raise_exceptions=False, with_debugging_logs=False,\n",
" distributions=distributions) "
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c603d429",
"metadata": {},
"outputs": [],
"source": [
"df = testset.to_pandas()"
]
},
{
"cell_type": "markdown",
"id": "165e010d-3f8f-4201-bf1d-7cc3c0a13413",
"metadata": {},
"source": [
"And Wola! That's it. You now have a test dataset. Let's inspect and save it"
]
},
{
"cell_type": "markdown",
"id": "cb3721b0-1e04-4b25-9348-71c251c0eff9",
"metadata": {},
"source": [
"### Saving results\n",
"- filter some samples that have no (nan) answers before saving"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "fc0f24ad-645a-4923-93ee-1e05acf0a47e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>question</th>\n",
" <th>contexts</th>\n",
" <th>ground_truth</th>\n",
" <th>evolution_type</th>\n",
" <th>episode_done</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>How does instruction tuning affect the zero-sh...</td>\n",
" <td>[ tasks (see Table 2 in the Appendix), FLAN on...</td>\n",
" <td>For larger models on the order of 100B paramet...</td>\n",
" <td>simple</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>What is the Zero-shot-CoT method and how does ...</td>\n",
" <td>[ prompts have also focused on per-task engine...</td>\n",
" <td>Zero-shot-CoT is a zero-shot template-based pr...</td>\n",
" <td>simple</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>How does prompt tuning affect model performanc...</td>\n",
" <td>[080.863.867.439.249.4\\n\\nTask Cluster:# datas...</td>\n",
" <td>Prompt tuning improves model performance in im...</td>\n",
" <td>simple</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>What is the purpose of instruction tuning in l...</td>\n",
" <td>[ via natural language instructions, such as “...</td>\n",
" <td>The purpose of instruction tuning in language ...</td>\n",
" <td>reasoning</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>What distinguishes Zero-shot-CoT from Few-shot...</td>\n",
" <td>[ prompts have also focused on per-task engine...</td>\n",
" <td>Zero-shot-CoT differs from Few-shot-CoT in tha...</td>\n",
" <td>reasoning</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Which language models were used in the experim...</td>\n",
" <td>[list\\n\\n1. For all authors...\\n\\n(a) Do the m...</td>\n",
" <td>The language models used in the experiment 'Ex...</td>\n",
" <td>reasoning</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>How does Zero-shot-CoT differ from previous fe...</td>\n",
" <td>[ prompts have also focused on per-task engine...</td>\n",
" <td>Zero-shot-CoT differs from previous few-shot a...</td>\n",
" <td>reasoning</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>What are the stages in the Zero-shot-CoT metho...</td>\n",
" <td>[ it differs from most of the prior template p...</td>\n",
" <td>The Zero-shot-CoT method for reasoning and ans...</td>\n",
" <td>reasoning</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>What are the main approaches for inducing LLMs...</td>\n",
" <td>[2 2 0 2\\n\\nt c O 7\\n\\n] L C . s c [\\n\\n1 v 3 ...</td>\n",
" <td>The main approaches for inducing LLMs to perfo...</td>\n",
" <td>reasoning</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Which sorting method has the most impact on Au...</td>\n",
" <td>[ t a R\\n\\n30\\n\\n20\\n\\n%\\n\\n(\\n\\ne t a R\\n\\n40...</td>\n",
" <td>The sorting method that has the most impact on...</td>\n",
" <td>multi_context</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>What are the pros and cons of prompting method...</td>\n",
" <td>[ prompts have also focused on per-task engine...</td>\n",
" <td>Our work is based on prompting methods for lar...</td>\n",
" <td>multi_context</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>What are the stages in Zero-shot-CoT for reaso...</td>\n",
" <td>[ it differs from most of the prior template p...</td>\n",
" <td>Zero-shot-CoT involves two stages: reasoning e...</td>\n",
" <td>multi_context</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>How does the number of datasets and templates ...</td>\n",
" <td>[oze\\n\\n94.8a 90.0 92.0 90.0 89.0 [10] 91.0 92...</td>\n",
" <td>Using more datasets per task cluster improves ...</td>\n",
" <td>multi_context</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>What technique surpasses zero-shot large langu...</td>\n",
" <td>[3 2 0 2\\n\\nn a J\\n\\n9 2\\n\\n] L C . s c [\\n\\n4...</td>\n",
" <td>Chain of thought (CoT) prompting</td>\n",
" <td>multi_context</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>How does language model scale impact instructi...</td>\n",
" <td>[ tasks (see Table 2 in the Appendix), FLAN on...</td>\n",
" <td>For larger language models on the order of 100...</td>\n",
" <td>conditional</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>What's the advantage of using Zero-shot-CoT pr...</td>\n",
" <td>[ prompts have also focused on per-task engine...</td>\n",
" <td>Zero-shot-CoT prompts offer the advantage of n...</td>\n",
" <td>conditional</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>What's the difference in unresolving rate betw...</td>\n",
" <td>[-Q-CoT.\\n\\nTo begin with, we invoke Zero-Shot...</td>\n",
" <td>The unresolving rate of Retrieval-Q-CoT is 46....</td>\n",
" <td>conditional</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>What are the pros and cons of prompting method...</td>\n",
" <td>[ prompts have also focused on per-task engine...</td>\n",
" <td>Prompting methods for large language models ha...</td>\n",
" <td>conditional</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>What are the stages and processes in the Auto-...</td>\n",
" <td>[ wrong demonstrations may be eliminated with ...</td>\n",
" <td>The Auto-CoT method for constructing demonstra...</td>\n",
" <td>conditional</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>How are genes passed from one generation to th...</td>\n",
" <td>[ Penguin is a kind of bird. Knowledge: Clouds...</td>\n",
" <td>Genes are passed from parent to offspring.</td>\n",
" <td>reasoning</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" question \\\n",
"0 How does instruction tuning affect the zero-sh... \n",
"1 What is the Zero-shot-CoT method and how does ... \n",
"2 How does prompt tuning affect model performanc... \n",
"3 What is the purpose of instruction tuning in l... \n",
"4 What distinguishes Zero-shot-CoT from Few-shot... \n",
"5 Which language models were used in the experim... \n",
"6 How does Zero-shot-CoT differ from previous fe... \n",
"7 What are the stages in the Zero-shot-CoT metho... \n",
"8 What are the main approaches for inducing LLMs... \n",
"9 Which sorting method has the most impact on Au... \n",
"10 What are the pros and cons of prompting method... \n",
"11 What are the stages in Zero-shot-CoT for reaso... \n",
"12 How does the number of datasets and templates ... \n",
"13 What technique surpasses zero-shot large langu... \n",
"14 How does language model scale impact instructi... \n",
"15 What's the advantage of using Zero-shot-CoT pr... \n",
"16 What's the difference in unresolving rate betw... \n",
"17 What are the pros and cons of prompting method... \n",
"18 What are the stages and processes in the Auto-... \n",
"19 How are genes passed from one generation to th... \n",
"\n",
" contexts \\\n",
"0 [ tasks (see Table 2 in the Appendix), FLAN on... \n",
"1 [ prompts have also focused on per-task engine... \n",
"2 [080.863.867.439.249.4\\n\\nTask Cluster:# datas... \n",
"3 [ via natural language instructions, such as “... \n",
"4 [ prompts have also focused on per-task engine... \n",
"5 [list\\n\\n1. For all authors...\\n\\n(a) Do the m... \n",
"6 [ prompts have also focused on per-task engine... \n",
"7 [ it differs from most of the prior template p... \n",
"8 [2 2 0 2\\n\\nt c O 7\\n\\n] L C . s c [\\n\\n1 v 3 ... \n",
"9 [ t a R\\n\\n30\\n\\n20\\n\\n%\\n\\n(\\n\\ne t a R\\n\\n40... \n",
"10 [ prompts have also focused on per-task engine... \n",
"11 [ it differs from most of the prior template p... \n",
"12 [oze\\n\\n94.8a 90.0 92.0 90.0 89.0 [10] 91.0 92... \n",
"13 [3 2 0 2\\n\\nn a J\\n\\n9 2\\n\\n] L C . s c [\\n\\n4... \n",
"14 [ tasks (see Table 2 in the Appendix), FLAN on... \n",
"15 [ prompts have also focused on per-task engine... \n",
"16 [-Q-CoT.\\n\\nTo begin with, we invoke Zero-Shot... \n",
"17 [ prompts have also focused on per-task engine... \n",
"18 [ wrong demonstrations may be eliminated with ... \n",
"19 [ Penguin is a kind of bird. Knowledge: Clouds... \n",
"\n",
" ground_truth evolution_type \\\n",
"0 For larger models on the order of 100B paramet... simple \n",
"1 Zero-shot-CoT is a zero-shot template-based pr... simple \n",
"2 Prompt tuning improves model performance in im... simple \n",
"3 The purpose of instruction tuning in language ... reasoning \n",
"4 Zero-shot-CoT differs from Few-shot-CoT in tha... reasoning \n",
"5 The language models used in the experiment 'Ex... reasoning \n",
"6 Zero-shot-CoT differs from previous few-shot a... reasoning \n",
"7 The Zero-shot-CoT method for reasoning and ans... reasoning \n",
"8 The main approaches for inducing LLMs to perfo... reasoning \n",
"9 The sorting method that has the most impact on... multi_context \n",
"10 Our work is based on prompting methods for lar... multi_context \n",
"11 Zero-shot-CoT involves two stages: reasoning e... multi_context \n",
"12 Using more datasets per task cluster improves ... multi_context \n",
"13 Chain of thought (CoT) prompting multi_context \n",
"14 For larger language models on the order of 100... conditional \n",
"15 Zero-shot-CoT prompts offer the advantage of n... conditional \n",
"16 The unresolving rate of Retrieval-Q-CoT is 46.... conditional \n",
"17 Prompting methods for large language models ha... conditional \n",
"18 The Auto-CoT method for constructing demonstra... conditional \n",
"19 Genes are passed from parent to offspring. reasoning \n",
"\n",
" episode_done \n",
"0 True \n",
"1 True \n",
"2 True \n",
"3 True \n",
"4 True \n",
"5 True \n",
"6 True \n",
"7 True \n",
"8 True \n",
"9 True \n",
"10 True \n",
"11 True \n",
"12 True \n",
"13 True \n",
"14 True \n",
"15 True \n",
"16 True \n",
"17 True \n",
"18 True \n",
"19 True "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = df[df['ground_truth']!=\"nan\"].reset_index(drop=True)\n",
"df.sample(5)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "ad315aee-3029-46c2-812c-edf821e3f033",
"metadata": {},
"outputs": [],
"source": [
"df.to_csv(\"synthetic_test_dataset.csv\")"
]
},
{
"cell_type": "markdown",
"id": "840a2213-7e50-4774-b702-1b0c82c54d4f",
"metadata": {},
"source": [
"Upnext we are going into dive into how to use this to [evaluate your RAG](https://github.com/openai/openai-cookbook/examples/evaluation/ragas/openai-ragas-eval-cookbook.ipynb).\n",
"\n",
"**If you liked this tutorial, checkout [ragas](https://github.com/explodinggradients/ragas) and consider leaving a star**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "234a94b3-6527-47ee-af0c-cb1160da2c9b",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "ragas",
"language": "python",
"name": "ragas"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,6 @@
## Ragas x OpenAI
- [Ragas evaluation](https://github.com/openai/openai-cookbook/examples/evaluation/ragas/openai-ragas-eval-cookbook.ipynb)
- [Ragas synthetic test data generation](https://github.com/openai/openai-cookbook/examples/evaluation/ragas/openai-ragas-synthetic-test.ipynb)

For more information on Ragas, refer to the [Ragas documentation](https://ragas.readthedocs.io/en/latest/).