WIP: Evals starter (#1107)

Co-authored-by: royziv11 <103690170+royziv11@users.noreply.github.com>
Co-authored-by: Roy Ziv <roy@openai.com>
pull/1119/head
Shyamal H Anadkat 1 month ago committed by GitHub
parent bed41103a2
commit 6333678834

@ -81,4 +81,9 @@ katiagg:
jbeutler-openai:
name: "Joe Beutler"
website: "https://joebeutler.com"
avatar: "https://avatars.githubusercontent.com/u/156261485?v=4"
avatar: "https://avatars.githubusercontent.com/u/156261485?v=4"
royziv11:
name: "Roy Ziv"
website: "https://www.linkedin.com/in/roy-ziv-a46001149/"
avatar: "https://media.licdn.com/dms/image/D5603AQHkaEOOGZWtbA/profile-displayphoto-shrink_400_400/0/1699500606122?e=1716422400&v=beta&t=wKEIx-vTEqm9wnqoC7-xr1WqJjghvcjjlMt034hXY_4"

@ -0,0 +1,981 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"# Getting Started with OpenAI Evals"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"The [OpenAI Evals](https://github.com/openai/evals/tree/main) framework consists of\n",
"1. A framework to evaluate an LLM (large language model) or a system built on top of an LLM.\n",
"2. An open-source registry of challenging evals\n",
"\n",
"This notebook will cover:\n",
"* Introduction to Evaluation and the [OpenAI Evals](https://github.com/openai/evals/tree/main) library\n",
"* Building an Eval\n",
"* Running an Eval\n",
"\n",
"#### What are evaluations/ `evals`?\n",
"\n",
"Evaluation is the process of validating and testing the outputs that your LLM applications are producing. Having strong evaluations (\"evals\") will mean a more stable, reliable application that is resilient to code and model changes. An eval is a task used to measure the quality of the output of an LLM or LLM system. Given an input prompt, an output is generated. We evaluate this output with a set of ideal answers and find the quality of the LLM system.\n",
"\n",
"#### Importance of Evaluations\n",
"\n",
"If you are building with foundational models like `GPT-4`, creating high quality evals is one of the most impactful things you can do. Developing AI solutions involves an iterative design process. [Without evals, it can be very difficult and time intensive to understand](https://youtu.be/XGJNo8TpuVA?feature=shared&t=1089) how different model versions and prompts might affect your use case.\n",
"\n",
"With OpenAIs [continuous model upgrades](https://platform.openai.com/docs/models/continuous-model-upgrades), evals allow you to efficiently test model performance for your use cases in a standardized way. Developing a suite of evals customized to your objectives will help you quickly and effectively understand how new models may perform for your use cases. You can also make evals a part of your CI/CD pipeline to make sure you achieve the desired accuracy before deploying.\n",
"\n",
"#### Types of evals\n",
"\n",
"There are two main ways we can evaluate/grade completions: writing some validation logic in code\n",
"or using the model itself to inspect the answer. Well introduce each with some examples.\n",
"\n",
"**Writing logic for answer checking**\n",
"\n",
"The simplest and most common type of eval has an input and an ideal response or answer. For example,\n",
"we can have an eval sample where the input is \"What year was Obama elected president for the first\n",
"time?\" and the ideal answer is \"2008\". We feed the input to a model and get the completion. If the model\n",
"says \"2008\", it is then graded as correct. We can write a string match to check if the completion includes the phrase \"2008\". If it does, we consider it correct.\n",
"\n",
"Consider another eval where the input is to generate valid JSON: We can write some code that\n",
"attempts to parse the completion as JSON and then considers the completion correct if it is\n",
"parsable.\n",
"\n",
"**Model grading: A two stage process where the model first answers the question, then we ask a\n",
"model to look at the response to check if its correct.**\n",
"\n",
"Consider an input that asks the model to write a funny joke. The model then generates a\n",
"completion. We then create a new input to the model to answer the question: \"Is this following\n",
"joke funny? First reason step by step, then answer yes or no\" that includes the completion.\" We\n",
"finally consider the original completion correct if the new model completion ends with \"yes\".\n",
"\n",
"Model grading works best with the latest, most powerful models like `GPT-4` and if we give them the ability\n",
"to reason before making a judgment. Model grading will have an error rate, so it is important to validate\n",
"the performance with human evaluation before running the evals at scale. For best results, it makes\n",
"sense to use a different model to do grading from the one that did the completion, like using `GPT-4` to\n",
"grade `GPT-3.5` answers.\n",
"\n",
"#### OpenAI Eval Templates\n",
"\n",
"In using evals, we have discovered several \"templates\" that accommodate many different benchmarks. We have implemented these templates in the OpenAI Evals library to simplify the development of new evals. For example, we have defined 2 types of eval templates that can be used out of the box:\n",
"\n",
"* **Basic Eval Templates**: These contain deterministic functions to compare the output to the ideal_answers. In cases where the desired model response has very little variation, such as answering multiple choice questions or simple questions with a straightforward answer, we have found this following templates to be useful.\n",
"\n",
"* **Model-Graded Templates**: These contain functions where an LLM compares the output to the ideal_answers and attempts to judge the accuracy. In cases where the desired model response can contain significant variation, such as answering an open-ended question, we have found that using the model to grade itself is a viable strategy for automated evaluation.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Getting Setup\n",
"\n",
"First, go to [github.com/openai/evals](https://github.com/openai/evals), clone the repository with `git clone git@github.com:openai/evals.git` and go through the [setup instructions](https://github.com/openai/evals). \n",
"\n",
"To run evals later in this notebook, you will need to set up and specify your OpenAI API key. After you obtain an API key, specify it using the `OPENAI_API_KEY` environment variable. \n",
"\n",
"Please be aware of the costs associated with using the API when running evals."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"import pandas as pd\n",
"import os\n",
"import json\n",
"\n",
"client = OpenAI()"
]
},
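{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before running anything, you can optionally confirm that the `OPENAI_API_KEY` environment variable is set; the `OpenAI()` client reads it automatically. A minimal check:\n",
"\n",
"```python\n",
"import os\n",
"\n",
"assert os.getenv(\"OPENAI_API_KEY\"), \"Set the OPENAI_API_KEY environment variable before running this notebook.\"\n",
"```"
]
},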
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"## Building an evaluation for OpenAI Evals framework\n",
"\n",
"At its core, an eval is a dataset and an eval class that is defined in a YAML file. To start creating an eval, we need\n",
"\n",
"1. The test dataset in the `jsonl` format.\n",
"2. The eval template to be used\n",
"\n",
"### Creating the eval dataset\n",
"Lets create a dataset for a use case where we are evaluating the model's ability to generate syntactically correct SQL. In this use case, we have a series of tables that are related to car manufacturing\n",
"\n",
"First we will need to create a system prompt that we would like to evaluate. We will pass in instructions for the model as well as an overview of the table structure:\n",
"```\n",
"\"TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]\"\n",
"```\n",
"\n",
"For this prompt, we can ask a specific question:\n",
"```\n",
"\"Q: how many car makers are their in germany?\"\n",
"```\n",
"\n",
"And we have an expected answer:\n",
"```\n",
"\"A: SELECT count ( * ) FROM CAR_MAKERS AS T1 JOIN COUNTRIES AS T2 ON T1.Country = T2.CountryId WHERE T2.CountryName = 'germany'\"\n",
"```\n",
"\n",
"The dataset needs to be in the following format:\n",
"```\n",
"\"input\": [{\"role\": \"system\", \"content\": \"<input prompt>\"}, {\"role\": \"user\", \"content\": <user input>}, \"ideal\": \"correct answer\"]\n",
"```\n",
"\n",
"Putting it all together, we get:\n",
"```\n",
"{\"input\": [{\"role\": \"system\", \"content\": \"TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]\\n\"}, {\"role\": \"system\", \"content\": \"Q: how many car makers are their in germany\"}, \"ideal\": [\"A: SELECT count ( * ) FROM CAR_MAKERS AS T1 JOIN COUNTRIES AS T2 ON T1.Country = T2.CountryId WHERE T2.CountryName = 'germany'\"]}\n",
"```\n",
"\n",
"\n",
"One way to speed up the process of building eval datasets, is to use `GPT-4` to generate synthetic data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2024-03-18T07:23:04.862331Z",
"start_time": "2024-03-18T07:23:04.717601Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Q: What is the average horsepower for cars made by makers in Europe?\n",
"A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\n",
"\n",
"Q: What is the average weight of cars produced by makers from the continent of Europe?\n",
"A: SELECT AVG(cars_data.Weight) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\n",
"\n",
"Q: What is the average MPG for cars made in countries in the continent of Europe?\n",
"A: SELECT AVG(cars_data.MPG) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.MakeId = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\n",
"\n",
"Q: What is the average horsepower for cars made by a maker from Europe?\n",
"\n",
"A: SELECT AVG(cars_data.Horsepower) AS AverageHorsepower FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\n",
"\n",
"Q: What is the average horsepower for cars made by makers in the continent of Europe?\n",
"A: SELECT avg(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\n",
"\n"
]
}
],
"source": [
"## Use GPT-4 to generate synthetic data\n",
"# Define the system prompt and user input (these should be filled as per the specific use case)\n",
"system_prompt = \"\"\"You are a helpful assistant that can ask questions about a database table and write SQL queries to answer the question.\n",
" A user will pass in a table schema and your job is to return a question answer pairing. The question should relevant to the schema of the table,\n",
" and you can speculate on its contents. You will then have to generate a SQL query to answer the question. Below are some examples of what this should look like.\n",
"\n",
" Example 1\n",
" ```````````\n",
" User input: Table museum, columns = [*,Museum_ID,Name,Num_of_Staff,Open_Year]\\nTable visit, columns = [*,Museum_ID,visitor_ID,Num_of_Ticket,Total_spent]\\nTable visitor, columns = [*,ID,Name,Level_of_membership,Age]\\nForeign_keys = [visit.visitor_ID = visitor.ID,visit.Museum_ID = museum.Museum_ID]\\n\n",
" Assistant Response:\n",
" Q: How many visitors have visited the museum with the most staff?\n",
" A: SELECT count ( * ) FROM VISIT AS T1 JOIN MUSEUM AS T2 ON T1.Museum_ID = T2.Museum_ID WHERE T2.Num_of_Staff = ( SELECT max ( Num_of_Staff ) FROM MUSEUM ) \n",
" ```````````\n",
"\n",
" Example 2\n",
" ```````````\n",
" User input: Table museum, columns = [*,Museum_ID,Name,Num_of_Staff,Open_Year]\\nTable visit, columns = [*,Museum_ID,visitor_ID,Num_of_Ticket,Total_spent]\\nTable visitor, columns = [*,ID,Name,Level_of_membership,Age]\\nForeign_keys = [visit.visitor_ID = visitor.ID,visit.Museum_ID = museum.Museum_ID]\\n\n",
" Assistant Response:\n",
" Q: What are the names who have a membership level higher than 4?\n",
" A: SELECT Name FROM VISITOR AS T1 WHERE T1.Level_of_membership > 4 \n",
" ```````````\n",
"\n",
" Example 3\n",
" ```````````\n",
" User input: Table museum, columns = [*,Museum_ID,Name,Num_of_Staff,Open_Year]\\nTable visit, columns = [*,Museum_ID,visitor_ID,Num_of_Ticket,Total_spent]\\nTable visitor, columns = [*,ID,Name,Level_of_membership,Age]\\nForeign_keys = [visit.visitor_ID = visitor.ID,visit.Museum_ID = museum.Museum_ID]\\n\n",
" Assistant Response:\n",
" Q: How many tickets of customer id 5?\n",
" A: SELECT count ( * ) FROM VISIT AS T1 JOIN VISITOR AS T2 ON T1.visitor_ID = T2.ID WHERE T2.ID = 5 \n",
" ```````````\n",
" \"\"\"\n",
"\n",
"user_input = \"Table car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]\"\n",
"\n",
"messages = [{\n",
" \"role\": \"system\",\n",
" \"content\": system_prompt\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": user_input\n",
" }\n",
"]\n",
"\n",
"completion = client.chat.completions.create(\n",
" model=\"gpt-4-turbo-preview\",\n",
" messages=messages,\n",
" temperature=0.7,\n",
" n=5\n",
")\n",
"\n",
"for choice in completion.choices:\n",
" print(choice.message.content + \"\\n\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we have the synthetic data, we need to convert it to match the format of the eval dataset."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made by makers in Europe?'}], 'ideal': \"SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\"}\n",
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average weight of cars produced by makers from the continent of Europe?'}], 'ideal': \"SELECT AVG(cars_data.Weight) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\"}\n",
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average MPG for cars made in countries in the continent of Europe?'}], 'ideal': \"SELECT AVG(cars_data.MPG) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.MakeId = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\"}\n",
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made by a maker from Europe?'}], 'ideal': \"SELECT AVG(cars_data.Horsepower) AS AverageHorsepower FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\"}\n",
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made by makers in the continent of Europe?'}], 'ideal': \"SELECT avg(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\"}\n"
]
}
],
"source": [
"eval_data = []\n",
"input_prompt = \"TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]\"\n",
"\n",
"for choice in completion.choices:\n",
" question = choice.message.content.split(\"Q: \")[1].split(\"\\n\")[0] # Extracting the question\n",
" answer = choice.message.content.split(\"\\nA: \")[1].split(\"\\n\")[0] # Extracting the answer\n",
" eval_data.append({\n",
" \"input\": [\n",
" {\"role\": \"system\", \"content\": input_prompt},\n",
" {\"role\": \"user\", \"content\": question},\n",
" ],\n",
" \"ideal\": answer\n",
" })\n",
"\n",
"for item in eval_data:\n",
" print(item)"
]
},
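{
"cell_type": "markdown",
"metadata": {},
"source": [
"To use this dataset with the framework, it needs to be written out as a `.jsonl` file (one JSON object per line). Below is a minimal sketch that saves the `eval_data` list built above; the file name is hypothetical, and the file should ultimately live under the eval registry's data directory (see the registry section below):\n",
"\n",
"```python\n",
"import json\n",
"\n",
"output_path = \"car_sql_eval.jsonl\"  # hypothetical file name\n",
"\n",
"with open(output_path, \"w\") as f:\n",
"    for sample in eval_data:\n",
"        f.write(json.dumps(sample) + \"\\n\")\n",
"```"
]
},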
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we need to create the eval registry to run it in the framework.\n",
"\n",
"The evals framework requires a `.yaml` file structured with the following properties:\n",
"* `id` - An identifier for your eval\n",
"* `description` - A short description of your eval\n",
"* `disclaimer` - An additional notes about your eval\n",
"* `metrics` - There are three types of eval metrics we can choose from: match, includes, fuzzyMatch\n",
"\n",
"For our eval, we will configure the following:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2024-03-18T07:23:04.716044Z",
"start_time": "2024-03-18T07:23:04.708437Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'\\nspider-sql:\\n id: spider-sql.dev.v0\\n metrics: [accuracy]\\n description: Eval that scores SQL code from 194 examples in the Spider Text-to-SQL test dataset. The problems are selected by taking the first 10 problems for each database that appears in the test set.\\n Yu, Tao, et al. \"Spider; A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task.\" Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, https://doi.org/10.18653/v1/d18-1425.\\n disclaimer: Problems are solved zero-shot with no prompting other than the schema; performance may improve with training examples, fine tuning, or a different schema format. Evaluation is currently done through model-grading, where SQL code is not actually executed; the model may judge correct SQL to be incorrect, or vice-versa.\\n\\n '"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"\"\"\n",
"spider-sql:\n",
" id: spider-sql.dev.v0\n",
" metrics: [accuracy]\n",
" description: Eval that scores SQL code from 194 examples in the Spider Text-to-SQL test dataset. The problems are selected by taking the first 10 problems for each database that appears in the test set.\n",
" Yu, Tao, et al. \\\"Spider; A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task.\\\" Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, https://doi.org/10.18653/v1/d18-1425.\n",
" disclaimer: Problems are solved zero-shot with no prompting other than the schema; performance may improve with training examples, fine tuning, or a different schema format. Evaluation is currently done through model-grading, where SQL code is not actually executed; the model may judge correct SQL to be incorrect, or vice-versa.\n",
"\n",
" \"\"\"\"\""
]
},
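{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the custom car-manufacturing SQL eval built above, a registry entry could follow the same pattern. The sketch below is illustrative: the eval name `car-sql` and the dataset path are hypothetical, while the `class` and `args` mirror the configuration used by the built-in `spider-sql` eval (visible in the run spec printed later in this notebook). The YAML file would go under `evals/registry/evals/` and the `.jsonl` dataset under `evals/registry/data/`:\n",
"\n",
"```yaml\n",
"# Hypothetical registry entry, e.g. evals/registry/evals/car-sql.yaml\n",
"car-sql:\n",
"  id: car-sql.dev.v0\n",
"  description: Eval that scores SQLite SQL written against the car manufacturing schema, using model grading.\n",
"  disclaimer: SQL is not executed; a grading model judges correctness, so some correct queries may be flagged as incorrect (or vice-versa).\n",
"  metrics: [accuracy]\n",
"\n",
"car-sql.dev.v0:\n",
"  class: evals.elsuite.modelgraded.classify:ModelBasedClassify\n",
"  args:\n",
"    samples_jsonl: sql/car_sql_eval.jsonl\n",
"    eval_type: cot_classify\n",
"    modelgraded_spec: sql\n",
"```\n",
"\n",
"With the dataset saved to `evals/registry/data/sql/car_sql_eval.jsonl`, this eval could then be run the same way as `spider-sql`, e.g. `oaieval gpt-3.5-turbo car-sql`."
]
},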
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"## Running an evaluation\n",
"\n",
"We can run this eval using the `oaieval` CLI. To get setup, install the library: `pip install .` (if you are running the [OpenAI Evals library](github.com/openai/evals) locally) or `pip install oaieval` if you are running an existing eval.\n",
"\n",
"Then, run the eval using the CLI: `oaieval gpt-3.5-turbo spider-sql`\n",
"\n",
"This command expects a model name and an eval set name. Note that we provide two command line interfaces (CLIs): `oaieval` for running a single eval and `oaievalset` for running a set of evals. The valid eval names are specified in the YAML files under `evals/registry/evals` and their corresponding implementations can be found in `evals/elsuite`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2024-03-18T07:29:03.774758Z",
"start_time": "2024-03-18T07:26:29.321664Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"# !pip install evals"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `oaieval` CLI can accept various flags to modify the default behavior. You can run `oaieval --help` to see a full list of CLI options. \n",
"\n",
"After running that command, youll see the final report of accuracy printed to the console, as well as a file path to a temporary file that contains the full report."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2024-03-18T07:31:42.602736Z",
"start_time": "2024-03-18T07:29:03.776339Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-03-25 13:23:36,497] [registry.py:257] Loading registry from /Users/roy/Documents/Github/openai-cookbook/.venv/lib/python3.9/site-packages/evals/registry/evals\n",
"[2024-03-25 13:23:38,131] [registry.py:257] Loading registry from /Users/roy/.evals/evals\n",
"[2024-03-25 13:23:38,133] [oaieval.py:189] \u001b[1;35mRun started: 2403252023385ZVJZ3UF\u001b[0m\n",
"[2024-03-25 13:23:38,143] [registry.py:257] Loading registry from /Users/roy/Documents/Github/openai-cookbook/.venv/lib/python3.9/site-packages/evals/registry/modelgraded\n",
"[2024-03-25 13:23:38,217] [registry.py:257] Loading registry from /Users/roy/.evals/modelgraded\n",
"[2024-03-25 13:23:38,218] [data.py:90] Fetching /Users/roy/Documents/Github/openai-cookbook/.venv/lib/python3.9/site-packages/evals/registry/data/sql/spider_sql.jsonl\n",
"[2024-03-25 13:23:38,224] [eval.py:36] Evaluating 20 samples\n",
"[2024-03-25 13:23:38,282] [eval.py:144] Running in threaded mode with 10 threads!\n",
" 0%| | 0/20 [00:00<?, ?it/s][2024-03-25 13:23:38,795] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:38,836] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:38,839] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:38,862] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:38,875] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:38,981] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:39,070] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:39,581] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:39,829] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:40,234] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:40,593] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 5%|██▏ | 1/20 [00:02<00:43, 2.31s/it][2024-03-25 13:23:40,868] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 10%|████▍ | 2/20 [00:02<00:20, 1.11s/it][2024-03-25 13:23:41,090] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 15%|██████▌ | 3/20 [00:02<00:12, 1.41it/s][2024-03-25 13:23:41,356] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:41,707] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:42,223] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 20%|████████▊ | 4/20 [00:03<00:13, 1.14it/s][2024-03-25 13:23:42,342] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 25%|███████████ | 5/20 [00:04<00:09, 1.66it/s][2024-03-25 13:23:42,532] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 30%|█████████████▏ | 6/20 [00:04<00:06, 2.17it/s][2024-03-25 13:23:42,787] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 35%|███████████████▍ | 7/20 [00:04<00:05, 2.54it/s][2024-03-25 13:23:42,963] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:42,984] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 40%|█████████████████▌ | 8/20 [00:04<00:03, 3.02it/s][2024-03-25 13:23:43,056] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:43,108] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:43,127] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 45%|███████████████████▊ | 9/20 [00:04<00:02, 3.67it/s][2024-03-25 13:23:43,585] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 50%|█████████████████████▌ | 10/20 [00:05<00:03, 3.04it/s][2024-03-25 13:23:43,653] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:43,699] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:43,839] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:43,927] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:44,946] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:45,205] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:45,213] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 55%|███████████████████████▋ | 11/20 [00:06<00:06, 1.38it/s][2024-03-25 13:23:45,485] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 65%|███████████████████████████▉ | 13/20 [00:07<00:03, 2.21it/s][2024-03-25 13:23:45,611] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 70%|██████████████████████████████ | 14/20 [00:07<00:02, 2.70it/s][2024-03-25 13:23:45,730] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 75%|████████████████████████████████▎ | 15/20 [00:07<00:01, 3.28it/s][2024-03-25 13:23:45,769] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-25 13:23:46,265] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 85%|████████████████████████████████████▌ | 17/20 [00:07<00:00, 3.46it/s][2024-03-25 13:23:46,393] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 90%|██████████████████████████████████████▋ | 18/20 [00:08<00:00, 3.99it/s][2024-03-25 13:23:47,284] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 95%|████████████████████████████████████████▊ | 19/20 [00:09<00:00, 2.43it/s][2024-03-25 13:23:49,136] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"100%|███████████████████████████████████████████| 20/20 [00:10<00:00, 1.84it/s]\n",
"[2024-03-25 13:23:49,153] [record.py:360] Final report: {'counts/Correct': 17, 'counts/Incorrect': 3, 'score': 0.85}. Logged to /tmp/evallogs/2403252023385ZVJZ3UF_gpt-3.5-turbo_spider-sql.jsonl\n",
"[2024-03-25 13:23:49,154] [oaieval.py:229] Final report:\n",
"[2024-03-25 13:23:49,154] [oaieval.py:231] counts/Correct: 17\n",
"[2024-03-25 13:23:49,154] [oaieval.py:231] counts/Incorrect: 3\n",
"[2024-03-25 13:23:49,154] [oaieval.py:231] score: 0.85\n",
"[2024-03-25 13:23:49,176] [record.py:349] Logged 60 rows of events to /tmp/evallogs/2403252023385ZVJZ3UF_gpt-3.5-turbo_spider-sql.jsonl: insert_time=20.087ms\n"
]
}
],
"source": [
"!oaieval gpt-3.5-turbo spider-sql --max_samples 20"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`oaievalset` expects a model name and an eval set name, for which the valid options are specified in the YAML files under `evals/registry/eval_sets`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Going through eval logs\n",
"\n",
"The eval logs are located at `/tmp/evallogs` and different log files are created for each evaluation run. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2024-03-18T20:37:01.920497Z",
"start_time": "2024-03-18T20:37:01.553288Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>spec</th>\n",
" <th>final_report</th>\n",
" <th>run_id</th>\n",
" <th>event_id</th>\n",
" <th>sample_id</th>\n",
" <th>type</th>\n",
" <th>data</th>\n",
" <th>created_by</th>\n",
" <th>created_at</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>{'completion_fns': ['gpt-3.5-turbo'], 'eval_na...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" <td>{'counts/Correct': 17, 'counts/Incorrect': 3, ...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2403252023385ZVJZ3UF</td>\n",
" <td>0.0</td>\n",
" <td>spider-sql.dev.117</td>\n",
" <td>sampling</td>\n",
" <td>{'prompt': [{'content': 'Answer the following ...</td>\n",
" <td></td>\n",
" <td>2024-03-25 20:23:38.803226+00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2403252023385ZVJZ3UF</td>\n",
" <td>1.0</td>\n",
" <td>spider-sql.dev.72</td>\n",
" <td>sampling</td>\n",
" <td>{'prompt': [{'content': 'Answer the following ...</td>\n",
" <td></td>\n",
" <td>2024-03-25 20:23:38.840276+00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2403252023385ZVJZ3UF</td>\n",
" <td>2.0</td>\n",
" <td>spider-sql.dev.88</td>\n",
" <td>sampling</td>\n",
" <td>{'prompt': [{'content': 'Answer the following ...</td>\n",
" <td></td>\n",
" <td>2024-03-25 20:23:38.841729+00:00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" spec \\\n",
"0 {'completion_fns': ['gpt-3.5-turbo'], 'eval_na... \n",
"1 NaN \n",
"2 NaN \n",
"3 NaN \n",
"4 NaN \n",
"\n",
" final_report run_id \\\n",
"0 NaN NaN \n",
"1 {'counts/Correct': 17, 'counts/Incorrect': 3, ... NaN \n",
"2 NaN 2403252023385ZVJZ3UF \n",
"3 NaN 2403252023385ZVJZ3UF \n",
"4 NaN 2403252023385ZVJZ3UF \n",
"\n",
" event_id sample_id type \\\n",
"0 NaN NaN NaN \n",
"1 NaN NaN NaN \n",
"2 0.0 spider-sql.dev.117 sampling \n",
"3 1.0 spider-sql.dev.72 sampling \n",
"4 2.0 spider-sql.dev.88 sampling \n",
"\n",
" data created_by \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 {'prompt': [{'content': 'Answer the following ... \n",
"3 {'prompt': [{'content': 'Answer the following ... \n",
"4 {'prompt': [{'content': 'Answer the following ... \n",
"\n",
" created_at \n",
"0 NaT \n",
"1 NaT \n",
"2 2024-03-25 20:23:38.803226+00:00 \n",
"3 2024-03-25 20:23:38.840276+00:00 \n",
"4 2024-03-25 20:23:38.841729+00:00 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"log_name = '2403252023385ZVJZ3UF_gpt-3.5-turbo_spider-sql.jsonl' # \"EDIT THIS\" - copy from above\n",
"events = f\"/tmp/evallogs/{log_name}\"\n",
"display(pd.read_json(events, lines=True).head(5))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"# processing the log events generated by oaieval\n",
"\n",
"with open(events, \"r\") as f:\n",
" events_df = pd.read_json(f, lines=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This file will contain structured logs of the evaluation. The first entry provides a detailed specification of the evaluation, including the completion functions, evaluation name, run configuration, creators name, run ID, and creation timestamp."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'completion_fns': ['gpt-3.5-turbo'],\n",
" 'eval_name': 'spider-sql.dev.v0',\n",
" 'base_eval': 'spider-sql',\n",
" 'split': 'dev',\n",
" 'run_config': {'completion_fns': ['gpt-3.5-turbo'],\n",
" 'eval_spec': {'cls': 'evals.elsuite.modelgraded.classify:ModelBasedClassify',\n",
" 'registry_path': '/Users/roy/Documents/Github/openai-cookbook/.venv/lib/python3.9/site-packages/evals/registry',\n",
" 'args': {'samples_jsonl': 'sql/spider_sql.jsonl',\n",
" 'eval_type': 'cot_classify',\n",
" 'modelgraded_spec': 'sql'},\n",
" 'key': 'spider-sql.dev.v0',\n",
" 'group': 'sql'},\n",
" 'seed': 20220722,\n",
" 'max_samples': 20,\n",
" 'command': '/Users/roy/Documents/Github/openai-cookbook/.venv/bin/oaieval gpt-3.5-turbo spider-sql --max_samples 20',\n",
" 'initial_settings': {'visible': False}},\n",
" 'created_by': '',\n",
" 'run_id': '2403252023385ZVJZ3UF',\n",
" 'created_at': '2024-03-25 20:23:38.132021'}"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(events_df.iloc[0].spec)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's also look at the entry which provides the final report of the evaluation."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'counts/Correct': 17, 'counts/Incorrect': 3, 'score': 0.85}"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(events_df.dropna(subset=['final_report']).iloc[0]['final_report'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also review individual evaluation events that provide specific samples (`sample_id`), results, event types, and metadata."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"run_id 2403252023385ZVJZ3UF\n",
"event_id 0.0\n",
"sample_id spider-sql.dev.117\n",
"type sampling\n",
"data {'prompt': [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\n",
"Use only the following tables and columns:\n",
"Table: TV_Channel. Columns: id (text), series_name (text), Country (text), Language (text), Content (text), Pixel_aspect_ratio_PAR (text), Hight_definition_TV (text), Pay_per_view_PPV (text), Package_Option (text)\n",
"Table: TV_series. Columns: id (number), Episode (text), Air_Date (text), Rating (text), Share (number), 18_49_Rating_Share (text), Viewers_m (text), Weekly_Rank (number), Channel (text)\n",
"Table: Cartoon. Columns: id (number), Title (text), Directed_by (text), Written_by (text), Original_air_date (text), Production_code (number), Channel (text)\n",
"\n",
"Question: What is the name and directors of all the cartoons that are ordered by air date?\n",
"', 'role': 'system'}], 'sampled': ['SELECT Title, Directed_by\n",
"FROM Cartoon\n",
"ORDER BY Original_air_date;']}\n",
"created_at 2024-03-25 20:23:38.803226+00:00\n",
"Name: 2, dtype: object"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pd.set_option('display.max_colwidth', None) # None means no truncation\n",
"display(events_df.iloc[2][['run_id', 'event_id', 'sample_id', 'type', 'data', 'created_at']])"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Prompt: [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\\nUse only the following tables and columns:\\nTable: TV_Channel. Columns: id (text), series_name (text), Country (text), Language (text), Content (text), Pixel_aspect_ratio_PAR (text), Hight_definition_TV (text), Pay_per_view_PPV (text), Package_Option (text)\\nTable: TV_series. Columns: id (number), Episode (text), Air_Date (text), Rating (text), Share (number), 18_49_Rating_Share (text), Viewers_m (text), Weekly_Rank (number), Channel (text)\\nTable: Cartoon. Columns: id (number), Title (text), Directed_by (text), Written_by (text), Original_air_date (text), Production_code (number), Channel (text)\\n\\nQuestion: What is the name and directors of all the cartoons that are ordered by air date?\\n', 'role': 'system'}]\n",
"Sampled: ['SELECT Title, Directed_by\\nFROM Cartoon\\nORDER BY Original_air_date;']\n",
"----------\n",
"Prompt: [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\\nUse only the following tables and columns:\\nTable: museum. Columns: Museum_ID (number), Name (text), Num_of_Staff (number), Open_Year (text)\\nTable: visitor. Columns: ID (number), Name (text), Level_of_membership (number), Age (number)\\nTable: visit. Columns: Museum_ID (number), visitor_ID (text), Num_of_Ticket (number), Total_spent (number)\\n\\nQuestion: What is the average age of the visitors whose membership level is not higher than 4?\\n', 'role': 'system'}]\n",
"Sampled: ['SELECT AVG(Age) \\nFROM visitor \\nWHERE Level_of_membership <= 4;']\n",
"----------\n",
"Prompt: [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\\nUse only the following tables and columns:\\nTable: players. Columns: player_id (number), first_name (text), last_name (text), hand (text), birth_date (time), country_code (text)\\nTable: matches. Columns: best_of (number), draw_size (number), loser_age (number), loser_entry (text), loser_hand (text), loser_ht (number), loser_id (number), loser_ioc (text), loser_name (text), loser_rank (number), loser_rank_points (number), loser_seed (number), match_num (number), minutes (number), round (text), score (text), surface (text), tourney_date (time), tourney_id (text), tourney_level (text), tourney_name (text), winner_age (number), winner_entry (text), winner_hand (text), winner_ht (number), winner_id (number), winner_ioc (text), winner_name (text), winner_rank (number), winner_rank_points (number), winner_seed (number), year (number)\\nTable: rankings. Columns: ranking_date (time), ranking (number), player_id (number), ranking_points (number), tours (number)\\n\\nQuestion: Find the average rank of winners in all matches.\\n', 'role': 'system'}]\n",
"Sampled: ['SELECT AVG(winner_rank) AS average_winner_rank\\nFROM matches;']\n",
"----------\n",
"Prompt: [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\\nUse only the following tables and columns:\\nTable: continents. Columns: ContId (number), Continent (text)\\nTable: countries. Columns: CountryId (number), CountryName (text), Continent (number)\\nTable: car_makers. Columns: Id (number), Maker (text), FullName (text), Country (text)\\nTable: model_list. Columns: ModelId (number), Maker (number), Model (text)\\nTable: car_names. Columns: MakeId (number), Model (text), Make (text)\\nTable: cars_data. Columns: Id (number), MPG (text), Cylinders (number), Edispl (number), Horsepower (text), Weight (number), Accelerate (number), Year (number)\\n\\nQuestion: How many countries exist?\\n', 'role': 'system'}]\n",
"Sampled: ['```sql\\nSELECT COUNT(*) AS TotalCountries\\nFROM countries;\\n```']\n",
"----------\n",
"Prompt: [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\\nUse only the following tables and columns:\\nTable: city. Columns: ID (number), Name (text), CountryCode (text), District (text), Population (number)\\nTable: sqlite_sequence. Columns: name (text), seq (text)\\nTable: country. Columns: Code (text), Name (text), Continent (text), Region (text), SurfaceArea (number), IndepYear (number), Population (number), LifeExpectancy (number), GNP (number), GNPOld (number), LocalName (text), GovernmentForm (text), HeadOfState (text), Capital (number), Code2 (text)\\nTable: countrylanguage. Columns: CountryCode (text), Language (text), IsOfficial (text), Percentage (number)\\n\\nQuestion: How many countries have a republic as their form of government?\\n', 'role': 'system'}]\n",
"Sampled: [\"```sql\\nSELECT COUNT(*) \\nFROM country \\nWHERE GovernmentForm = 'Republic';\\n```\"]\n",
"----------\n"
]
}
],
"source": [
"# Inspect samples\n",
"for i, row in events_df[events_df['type'] == 'sampling'].head(5).iterrows():\n",
" data = pd.json_normalize(row['data'])\n",
" print(f\"Prompt: {data['prompt'].iloc[0]}\")\n",
" print(f\"Sampled: {data['sampled'].iloc[0]}\")\n",
" print(\"-\" * 10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's review our failures to understand which tests did not succeed."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"def pretty_print_text(prompt):\n",
" # Define markers for the start of each section\n",
" question_marker = \"[Question]:\"\n",
" expert_marker = \"[Expert]:\"\n",
" submission_marker = \"[Submission]:\"\n",
"\n",
" # Find the start indices of each section\n",
" question_start = prompt.find(question_marker) + len(question_marker)\n",
" expert_start = prompt.find(expert_marker) + len(expert_marker)\n",
" submission_start = prompt.find(submission_marker) + len(submission_marker)\n",
"\n",
" # Find the end index for the question and expert sections by looking for the next section's start\n",
" question_end = prompt.find(expert_marker)\n",
" expert_end = prompt.find(submission_marker)\n",
" submission_end = prompt.find('[END DATA]')\n",
"\n",
" # Extract the text for each section\n",
" question_text = prompt[question_start:question_end].strip()\n",
" expert_answer_text = prompt[expert_start:expert_end].strip()\n",
" submission_text = prompt[submission_start:submission_end].strip().replace(\"```sql\", \"\").replace(\"```\", \"\").strip()\n",
"\n",
" # Remove table definitions from the question text\n",
" question_text = question_text.split(\"\\n\\nQuestion:\")[1].strip() if \"\\n\\nQuestion:\" in question_text else question_text\n",
"\n",
" # Define ANSI color codes for readability\n",
" color_question = '\\033[94m' # Blue\n",
" color_expert = '\\033[92m' # Green\n",
" color_submission = '\\033[93m' # Yellow\n",
" color_end = '\\033[0m' # Reset to default color\n",
"\n",
" # Print with section headers and colors\n",
" print(f\"{color_question}QUESTION:\\n{question_text}{color_end}\")\n",
" print(f\"{color_expert}EXPECTED:\\n{expert_answer_text}{color_end}\")\n",
" print(f\"{color_submission}SUBMISSION:\\n{submission_text}{color_end}\")\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[94mQUESTION:\n",
"Return the document id, template id, and description for the document with the name Robbin CV.\n",
"\n",
"************\u001b[0m\n",
"\u001b[92mEXPECTED:\n",
"SELECT document_id , template_id , Document_Description FROM Documents WHERE document_name = \"Robbin CV\"\n",
"************\u001b[0m\n",
"\u001b[93mSUBMISSION:\n",
"SELECT Documents.Document_ID, Documents.Template_ID, Documents.Document_Description\n",
"FROM Documents\n",
"JOIN Templates ON Documents.Template_ID = Templates.Template_ID\n",
"WHERE Documents.Document_Name = 'Robbin CV';\n",
"\n",
"************\u001b[0m\n",
"----------------------------------------\n",
"\u001b[94mQUESTION:\n",
"What country is Jetblue Airways affiliated with?\n",
"\n",
"************\u001b[0m\n",
"\u001b[92mEXPECTED:\n",
"SELECT Country FROM AIRLINES WHERE Airline = \"JetBlue Airways\"\n",
"************\u001b[0m\n",
"\u001b[93mSUBMISSION:\n",
"SELECT Country\n",
"FROM airlines\n",
"WHERE Airline = 'Jetblue Airways';\n",
"************\u001b[0m\n",
"----------------------------------------\n",
"\u001b[94mQUESTION:\n",
"Find the maximum weight for each type of pet. List the maximum weight and pet type.\n",
"\n",
"************\u001b[0m\n",
"\u001b[92mEXPECTED:\n",
"SELECT max(weight) , petType FROM pets GROUP BY petType\n",
"************\u001b[0m\n",
"\u001b[93mSUBMISSION:\n",
"SELECT PetType, MAX(weight) AS max_weight\n",
"FROM Pets\n",
"GROUP BY PetType;\n",
"\n",
"************\u001b[0m\n",
"----------------------------------------\n"
]
}
],
"source": [
"# Inspect metrics where choice is made and print only the prompt, result, and expected result if the choice is incorrect\n",
"for i, row in events_df[events_df['type'] == 'metrics'].iterrows():\n",
" if row['data']['choice'] == 'Incorrect':\n",
" # Get the previous row's data, which contains the prompt and the expected result\n",
" prev_row = events_df.iloc[i-1]\n",
" prompt = prev_row['data']['prompt'][0]['content'] if 'prompt' in prev_row['data'] and len(prev_row['data']['prompt']) > 0 else \"Prompt not available\"\n",
" expected_result = prev_row['data'].get('ideal', 'Expected result not provided')\n",
" \n",
" # Current row's data will be the actual result\n",
" result = row['data'].get('result', 'Actual result not provided')\n",
" \n",
" pretty_print_text(prompt)\n",
" print(\"-\" * 40)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reviewing each of these failures we see the following:\n",
"* The first incorrect answer had an unnecessary join with the 'Templates' table. Our eval was able to accurately identify this and flag this as incorrect. \n",
"* The following two answers are technically correct and would succeeed if we compared the results, however they have minor syntax differences that caused the answers to get flagged.\n",
" * In situations like this, it would be worthwhile exploring whether we should continue iterating on the prompt to ensure certain stylistic choices, or if we should modify the evaluation suite to capture this variation.\n",
" * This type of failure hints at the potential need for model-graded evals as a way to ensure accuracy in grading the results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Building out effective evals is a core part of the development cycle of LLM-based applications. The OpenAI Evals framework provides the core structure of building evals out of the box, and allows you to quickly spin up new tests for your various use cases. In this guide, we demonstrated step-by-step how to create an eval, run it, and analyze the results.\n",
"\n",
"The example shown in this guide represent a straightfoward use case for evals. As you continue to explore this framework, we recommend you explore creating more complex model-graded evals for actual production use cases. Happy evaluating!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

@ -475,6 +475,15 @@
- embeddings
- completions
- title: Getting Started with OpenAI Evals
path: examples/evaluation/Getting_Started_with_OpenAI_Evals.ipynb
date: 2024-03-21
authors:
- royziv11
- shyamal-anadkat
tags:
- completions
- title: Fine-Tuned Q&A - collect data
path: examples/fine-tuned_qa/olympics-1-collect-data.ipynb
date: 2022-03-10
