pull/1113/head
Shyamal H Anadkat 3 months ago
parent 4448f9b7e9
commit 92e4e5e283

@ -26,29 +26,28 @@
"2. An open-source registry of challenging evals\n",
"\n",
"This notebook will cover:\n",
"* What are Evals\n",
"* Introduction to [OpenAI Evals](https://github.com/openai/evals/tree/main) library\n",
"* Introduction to Evaluation and the [OpenAI Evals](https://github.com/openai/evals/tree/main) library\n",
"* Building an Eval\n",
"* Running an Eval\n",
"\n",
"#### What are evaluations/ `evals`?\n",
"\n",
"Evaluation is the process of validating and testing the outputs that your LLM applications are producing. Having strong evaluations (\"evals\") will mean a more stable, reliable application which is resilient to code and model changes. An eval is basically a task used to measure the quality of output of an LLM or LLM system. Given an input prompt, an output is generated. We evaluate this output with a set of ideal_answers and find the quality of the LLM system.\n",
"Evaluation is the process of validating and testing the outputs that your LLM applications are producing. Having strong evaluations (\"evals\") will mean a more stable, reliable application that is resilient to code and model changes. An eval is basically a task used to measure the quality of the output of an LLM or LLM system. Given an input prompt, an output is generated. We evaluate this output with a set of ideal answers and find the quality of the LLM system.\n",
"\n",
"#### Why is it important to evaluate?\n",
"#### Importance of Evaluations\n",
"\n",
"If you are building with foundational models like `GPT-4`, creating high quality evals is one of the most impactful things you can do. Developing AI solutions involves an iterative design process. Without evals, it can be very difficult and time intensive to understand how different model versions and prompts might affect your use case. \n",
"If you are building with foundational models like `GPT-4`, creating high quality evals is one of the most impactful things you can do. Developing AI solutions involves an iterative design process. [Without evals, it can be very difficult and time intensive to understand](https://youtu.be/XGJNo8TpuVA?feature=shared&t=1089) how different model versions and prompts might affect your use case. \n",
"\n",
"With OpenAIs new continuous model upgrades, evals allow you to efficiently test model performance for your use cases in a standardized way. Developing a suite of evals customized to your objectives will help you quickly and effectively understand how new models may perform for your use cases. You can also make evals a part of your CI/CD pipeline (recommended) to make sure you achieve the desired accuracy before deploying.\n",
"With OpenAIs [continuous model upgrades](https://platform.openai.com/docs/models/continuous-model-upgrades), evals allow you to efficiently test model performance for your use cases in a standardized way. Developing a suite of evals customized to your objectives will help you quickly and effectively understand how new models may perform for your use cases. You can also make evals a part of your CI/CD pipeline to make sure you achieve the desired accuracy before deploying.\n",
"\n",
"#### Types of Evals\n",
"\n",
"The simplest and most common type of eval has an input and an ideal response or answer. For example,\n",
"we can have an eval sample where the input is `“What year was Obama elected president for the first\n",
"time?”` and the ideal answer is `“2008”`. We feed the input to a model and get the completion. If the model\n",
"says `“2008”`, it is then graded as correct. Eval samples are aggregated into an eval dataset that can\n",
"we can have an eval sample where the input is \"What year was Obama elected president for the first\n",
"time?\" and the ideal answer is \"2008\". We feed the input to a model and get the completion. If the model\n",
"says \"2008\", it is then graded as correct. Eval samples are aggregated into an eval dataset that can\n",
"quantify overall performance within a certain topic. For example, this eval sample may be part of a\n",
"“president-election-years” eval that checks for every U.S. President, what year they were first elected.\n",
"`president-election-years` eval that checks for every U.S. President, what year they were first elected.\n",
"Evals are not restricted to checking factual accuracy: all that is needed is a reproducible way to grade a\n",
"completion. Here are some other examples of valid evals:\n",
"\n",
@ -68,8 +67,8 @@
"\n",
"**Writing logic for answer checking**\n",
"\n",
"* Consider the Obama example from above, where the ideal response is `\"2008\"`. We can write a\n",
"string match to check if the completion includes the phrase `“2008”`. If it does, we consider it\n",
"* Consider the Obama example from above, where the ideal response is \"2008\". We can write a\n",
"string match to check if the completion includes the phrase \"2008\". If it does, we consider it\n",
"correct.\n",
"* Consider another eval where the input is to generate valid JSON: We can write some code that\n",
"attempts to parse the completion as JSON and then considers the completion correct if it is\n",
@ -79,14 +78,14 @@
"model to look at the response to check if its correct.**\n",
"\n",
"* Consider an input that asks the model to write a funny joke. The model then generates a\n",
"completion. We then create a new input to the model to answer the question: `“Is this following\n",
"joke funny? First reason step by step, then answer yes or no”` that includes the completion. We\n",
"finally consider the original completion correct if the new model completion ends with `“yes”`.\n",
"completion. We then create a new input to the model to answer the question: \"Is this following\n",
"joke funny? First reason step by step, then answer yes or no that includes the completion\". We\n",
"finally consider the original completion correct if the new model completion ends with \"yes\".\n",
"\n",
"Model grading works best with the latest, most powerful models like `GPT-4` and if we give them the ability\n",
"to reason before making a judgment. Model grading will have an error rate, so it is important to validate\n",
"the performance with human evaluation before running the evals at scale. For best results, it makes\n",
"sense to use a different model to do grading from the one that did the completion, like using GPT-4 to\n",
"sense to use a different model to do grading from the one that did the completion, like using `GPT-4` to\n",
"grade `GPT-3.5` answers.\n",
"\n",
"#### OpenAI Eval Tempplates\n",
@ -95,7 +94,7 @@
"\n",
"* **Basic Eval Templates**: These contain deterministic functions to compare the output to the ideal_answers. In cases where the desired model response has very little variation, such as answering multiple choice questions or simple questions with a straightforward answer, we have found this following templates to be useful.\n",
"\n",
"* **Model-Graded Templates**: These contain functions where an LLM compares the output to the ideal_answers and attempts to judge the accuracy. In cases where the desired model response can contain significant variation, such as answering an open-ended question, we have found that using the model to grade itself is a viable strategy for automated evaluation. I\n"
"* **Model-Graded Templates**: These contain functions where an LLM compares the output to the ideal_answers and attempts to judge the accuracy. In cases where the desired model response can contain significant variation, such as answering an open-ended question, we have found that using the model to grade itself is a viable strategy for automated evaluation.\n"
]
},
{
@ -137,7 +136,7 @@
"\n",
"To start creating an eval, we need\n",
"\n",
"1. The test dataset in the JSONL format.\n",
"1. The test dataset in the `jsonl` format.\n",
"2. The eval template to be used\n",
"\n",
"### Creating the eval dataset\n",
@ -158,7 +157,7 @@
"\"A: SELECT count ( * ) FROM CAR_MAKERS AS T1 JOIN COUNTRIES AS T2 ON T1.Country = T2.CountryId WHERE T2.CountryName = 'germany'\"\n",
"```\n",
"\n",
"The dataset needs to be in the followingformat\"\n",
"The dataset needs to be in the following format:\n",
"```\n",
"\"input\": [{\"role\": \"system\", \"content\": \"<input prompt>\"}, {\"role\": \"user\", \"content\": <user input>}, \"ideal\": \"correct answer\"]\n",
"```\n",
@ -169,7 +168,7 @@
"```\n",
"\n",
"\n",
"One way to speed up the process of building eval datasets, is to use GPT-4 to generate synthetic data"
"One way to speed up the process of building eval datasets, is to use `GPT-4` to generate synthetic data"
]
},
{
@ -190,20 +189,44 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Q: What is the average horsepower for cars made in countries in the continent of Europe?\n",
"A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.MakeId = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\n",
"Q: Which continent has the highest average horsepower for their cars?\n",
"A: \n",
"```sql\n",
"SELECT continents.Continent, AVG(cars_data.Horsepower) AS AvgHorsepower\n",
"FROM continents\n",
"JOIN countries ON continents.ContId = countries.Continent\n",
"JOIN car_makers ON countries.CountryId = car_makers.Country\n",
"JOIN model_list ON car_makers.Id = model_list.Maker\n",
"JOIN car_names ON model_list.Model = car_names.Model\n",
"JOIN cars_data ON car_names.MakeId = cars_data.Id\n",
"GROUP BY continents.Continent\n",
"ORDER BY AvgHorsepower DESC\n",
"LIMIT 1\n",
"```\n",
"\n",
"Q: What is the average MPG for cars made by makers from the continent of Europe?\n",
"A: SELECT AVG(cars_data.MPG) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\n",
"Q: What is the average MPG (Miles Per Gallon) for cars made by makers in Europe?\n",
"A: SELECT AVG(cars_data.MPG) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.MakeId = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\n",
"\n",
"Q: What is the average horsepower for cars made in the USA?\n",
"A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.MakeId = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId WHERE countries.CountryName = 'USA'\n",
"Q: Which continent has the highest average horsepower for its cars?\n",
"A: \n",
"```sql\n",
"SELECT continents.Continent, AVG(cars_data.Horsepower) as AvgHorsepower\n",
"FROM cars_data\n",
"JOIN car_names ON car_names.MakeId = cars_data.Id\n",
"JOIN model_list ON model_list.Model = car_names.Model\n",
"JOIN car_makers ON car_makers.Id = model_list.Maker\n",
"JOIN countries ON car_makers.Country = countries.CountryId\n",
"JOIN continents ON countries.Continent = continents.ContId\n",
"GROUP BY continents.Continent\n",
"ORDER BY AvgHorsepower DESC\n",
"LIMIT 1\n",
"```\n",
"\n",
"Q: What is the average horsepower for cars made in countries from the continent with ID 3?\n",
"A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.ContId = 3\n",
"Q: What is the average horsepower for cars made by a maker from Japan?\n",
"A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId WHERE countries.CountryName = 'Japan'\n",
"\n",
"Q: What is the average horsepower for cars made in countries located in Europe?\n",
"A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.Make = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\n",
"Q: What is the average horsepower for cars made in the USA?\n",
"A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.Make = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId WHERE countries.CountryName = 'USA'\n",
"\n"
]
}
@ -242,22 +265,22 @@
"\n",
"user_input = \"Table car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]\"\n",
"\n",
"messages = []\n",
"messages.append({\n",
" \"role\": \"system\",\n",
" \"content\": system_prompt\n",
"})\n",
"\n",
"messages.append({\n",
" \"role\": \"user\",\n",
" \"content\": user_input\n",
"})\n",
"messages = [\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": system_prompt\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": user_input\n",
" }\n",
"]\n",
"\n",
"completion = client.chat.completions.create(\n",
" model=\"gpt-4-turbo-preview\",\n",
" messages=messages,\n",
" temperature=0.7,\n",
" n=5\n",
" model=\"gpt-4-turbo-preview\",\n",
" messages=messages,\n",
" temperature=0.7,\n",
" n=5\n",
")\n",
"\n",
"for choice in completion.choices:\n",
@ -280,11 +303,11 @@
"name": "stdout",
"output_type": "stream",
"text": [
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made in countries in the continent of Europe?'}], 'ideal': \"SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.MakeId = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\"}\n",
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average MPG for cars made by makers from the continent of Europe?'}], 'ideal': \"SELECT AVG(cars_data.MPG) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\"}\n",
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made in the USA?'}], 'ideal': \"SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.MakeId = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId WHERE countries.CountryName = 'USA'\"}\n",
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made in countries from the continent with ID 3?'}], 'ideal': 'SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.ContId = 3'}\n",
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made in countries located in Europe?'}], 'ideal': \"SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.Make = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\"}\n"
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'Which continent has the highest average horsepower for their cars?'}], 'ideal': ''}\n",
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average MPG (Miles Per Gallon) for cars made by makers in Europe?'}], 'ideal': \"SELECT AVG(cars_data.MPG) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.MakeId = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\"}\n",
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'Which continent has the highest average horsepower for its cars?'}], 'ideal': ''}\n",
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made by a maker from Japan?'}], 'ideal': \"SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId WHERE countries.CountryName = 'Japan'\"}\n",
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made in the USA?'}], 'ideal': \"SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.Make = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId WHERE countries.CountryName = 'USA'\"}\n"
]
}
],
@ -304,7 +327,7 @@
" })\n",
"\n",
"for item in eval_data:\n",
" print(item)\n"
" print(item)"
]
},
{
@ -366,13 +389,11 @@
"source": [
"## Running an evaluation\n",
"\n",
"We can run this eval using the `oaieval` CLI:\n",
"\n",
"First, install the library: `pip install .` (if you are running the [OpenAI Evals library](github.com/openai/evals) locally) or `pip install oaieval` if you are running an existing eval. \n",
"We can run this eval using the `oaieval` CLI. To get setup, install the library: `pip install .` (if you are running the [OpenAI Evals library](github.com/openai/evals) locally) or `pip install oaieval` if you are running an existing eval. \n",
"\n",
"Then, run the eval using the CLI: `oaieval gpt-3.5-turbo spider-sql`\n",
"\n",
"The valid eval names are specified in the YAML files under `evals/registry/evals` and their corresponding implementations can be found in `evals/elsuite`."
"This command expects a model name and an eval set name. Note that we provide two command line interfaces (CLIs): `oaieval` for running a single eval and `oaievalset` for running a set of evals. The valid eval names are specified in the YAML files under `evals/registry/evals` and their corresponding implementations can be found in `evals/elsuite`."
]
},
{
@ -420,61 +441,61 @@
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-03-18 14:59:44,243] [registry.py:257] Loading registry from /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/evals\n",
"[2024-03-18 14:59:44,882] [registry.py:257] Loading registry from /Users/shyamal/.evals/evals\n",
"[2024-03-18 14:59:44,885] [oaieval.py:189] \u001b[1;35mRun started: 240318215944ESN7L5HJ\u001b[0m\n",
"[2024-03-18 14:59:44,888] [registry.py:257] Loading registry from /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/modelgraded\n",
"[2024-03-18 14:59:44,930] [registry.py:257] Loading registry from /Users/shyamal/.evals/modelgraded\n",
"[2024-03-18 14:59:44,930] [data.py:90] Fetching /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/data/sql/spider_sql.jsonl\n",
"[2024-03-18 14:59:44,932] [eval.py:36] Evaluating 20 samples\n",
"[2024-03-18 14:59:44,951] [eval.py:144] Running in threaded mode with 10 threads!\n",
" 0%| | 0/20 [00:00<?, ?it/s][2024-03-18 14:59:45,634] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:45,647] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:45,648] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:45,685] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:45,757] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:45,793] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:45,872] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:46,056] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:46,081] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:46,404] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:46,991] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 5%|██▏ | 1/20 [00:02<00:38, 2.04s/it][2024-03-18 14:59:47,003] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:47,311] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 15%|██████▌ | 3/20 [00:02<00:10, 1.55it/s][2024-03-18 14:59:47,499] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 20%|████████▊ | 4/20 [00:02<00:07, 2.03it/s][2024-03-18 14:59:47,561] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:47,572] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:47,688] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 35%|███████████████▍ | 7/20 [00:02<00:03, 4.29it/s][2024-03-18 14:59:47,714] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:47,928] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:47,983] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 40%|█████████████████▌ | 8/20 [00:03<00:02, 4.05it/s][2024-03-18 14:59:48,089] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:48,115] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:48,130] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 45%|███████████████████▊ | 9/20 [00:03<00:02, 4.50it/s][2024-03-18 14:59:48,184] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:48,268] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 50%|█████████████████████▌ | 10/20 [00:03<00:02, 4.96it/s][2024-03-18 14:59:48,306] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:48,460] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:48,549] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:48,699] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:49,194] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:49,765] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 55%|███████████████████████▋ | 11/20 [00:04<00:04, 1.83it/s][2024-03-18 14:59:49,842] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:50,000] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 65%|███████████████████████████▉ | 13/20 [00:05<00:02, 2.77it/s][2024-03-18 14:59:50,006] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:50,033] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:50,476] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 80%|██████████████████████████████████▍ | 16/20 [00:05<00:01, 3.79it/s][2024-03-18 14:59:50,575] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 14:59:50,901] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 90%|██████████████████████████████████████▋ | 18/20 [00:05<00:00, 4.04it/s][2024-03-18 14:59:51,044] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 95%|████████████████████████████████████████▊ | 19/20 [00:06<00:00, 4.36it/s][2024-03-18 14:59:51,342] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"100%|███████████████████████████████████████████| 20/20 [00:06<00:00, 3.13it/s]\n",
"[2024-03-18 14:59:51,354] [record.py:360] Final report: {'counts/Correct': 18, 'counts/Incorrect': 2, 'score': 0.9}. Logged to /tmp/evallogs/240318215944ESN7L5HJ_gpt-3.5-turbo_spider-sql.jsonl\n",
"[2024-03-18 14:59:51,355] [oaieval.py:229] Final report:\n",
"[2024-03-18 14:59:51,355] [oaieval.py:231] counts/Correct: 18\n",
"[2024-03-18 14:59:51,355] [oaieval.py:231] counts/Incorrect: 2\n",
"[2024-03-18 14:59:51,355] [oaieval.py:231] score: 0.9\n",
"[2024-03-18 14:59:51,393] [record.py:349] Logged 60 rows of events to /tmp/evallogs/240318215944ESN7L5HJ_gpt-3.5-turbo_spider-sql.jsonl: insert_time=34.696ms\n"
"[2024-03-18 20:45:46,391] [registry.py:257] Loading registry from /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/evals\n",
"[2024-03-18 20:45:50,433] [registry.py:257] Loading registry from /Users/shyamal/.evals/evals\n",
"[2024-03-18 20:45:50,444] [oaieval.py:189] \u001b[1;35mRun started: 240319034550VLDKMJVL\u001b[0m\n",
"[2024-03-18 20:45:50,466] [registry.py:257] Loading registry from /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/modelgraded\n",
"[2024-03-18 20:45:50,592] [registry.py:257] Loading registry from /Users/shyamal/.evals/modelgraded\n",
"[2024-03-18 20:45:50,593] [data.py:90] Fetching /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/data/sql/spider_sql.jsonl\n",
"[2024-03-18 20:45:50,608] [eval.py:36] Evaluating 20 samples\n",
"[2024-03-18 20:45:50,691] [eval.py:144] Running in threaded mode with 10 threads!\n",
" 0%| | 0/20 [00:00<?, ?it/s][2024-03-18 20:45:51,459] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:51,462] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:51,462] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:51,478] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:51,562] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:51,592] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:51,624] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:51,779] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:52,052] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:52,059] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:53,062] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:53,063] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:53,065] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 5%|██▏ | 1/20 [00:02<00:45, 2.38s/it][2024-03-18 20:45:53,094] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:53,471] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 25%|███████████ | 5/20 [00:02<00:06, 2.27it/s][2024-03-18 20:45:53,620] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 30%|█████████████▏ | 6/20 [00:02<00:05, 2.69it/s][2024-03-18 20:45:53,674] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:53,787] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 40%|█████████████████▌ | 8/20 [00:03<00:03, 3.90it/s][2024-03-18 20:45:53,891] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 45%|███████████████████▊ | 9/20 [00:03<00:02, 4.49it/s][2024-03-18 20:45:53,992] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:54,009] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:54,196] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:54,218] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:54,384] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 50%|█████████████████████▌ | 10/20 [00:03<00:02, 3.47it/s][2024-03-18 20:45:54,430] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:54,632] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:54,683] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:54,731] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:55,214] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:55,292] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:55,725] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 55%|███████████████████████▋ | 11/20 [00:05<00:05, 1.79it/s][2024-03-18 20:45:56,006] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 60%|█████████████████████████▊ | 12/20 [00:05<00:03, 2.07it/s][2024-03-18 20:45:56,201] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:56,206] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 65%|███████████████████████████▉ | 13/20 [00:05<00:02, 2.48it/s][2024-03-18 20:45:56,400] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 75%|████████████████████████████████▎ | 15/20 [00:05<00:01, 3.72it/s][2024-03-18 20:45:56,644] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 80%|██████████████████████████████████▍ | 16/20 [00:05<00:01, 3.80it/s][2024-03-18 20:45:56,837] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 85%|████████████████████████████████████▌ | 17/20 [00:06<00:00, 4.08it/s][2024-03-18 20:45:57,111] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 90%|██████████████████████████████████████▋ | 18/20 [00:06<00:00, 3.95it/s][2024-03-18 20:45:57,262] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 95%|████████████████████████████████████████▊ | 19/20 [00:06<00:00, 4.44it/s][2024-03-18 20:45:57,304] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"100%|███████████████████████████████████████████| 20/20 [00:06<00:00, 3.02it/s]\n",
"[2024-03-18 20:45:57,316] [record.py:360] Final report: {'counts/Correct': 18, 'counts/Incorrect': 2, 'score': 0.9}. Logged to /tmp/evallogs/240319034550VLDKMJVL_gpt-3.5-turbo_spider-sql.jsonl\n",
"[2024-03-18 20:45:57,316] [oaieval.py:229] Final report:\n",
"[2024-03-18 20:45:57,317] [oaieval.py:231] counts/Correct: 18\n",
"[2024-03-18 20:45:57,317] [oaieval.py:231] counts/Incorrect: 2\n",
"[2024-03-18 20:45:57,317] [oaieval.py:231] score: 0.9\n",
"[2024-03-18 20:45:57,342] [record.py:349] Logged 60 rows of events to /tmp/evallogs/240319034550VLDKMJVL_gpt-3.5-turbo_spider-sql.jsonl: insert_time=21.218ms\n"
]
}
],
@ -482,13 +503,6 @@
"!oaieval gpt-3.5-turbo spider-sql --max_samples 20"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`oaievalset` expects a model name and an eval set name, for which the valid options are specified in the YAML files under `evals/registry/eval_sets`."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -500,7 +514,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2024-03-18T20:37:01.920497Z",
@ -547,7 +561,7 @@
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>{'completion_fns': ['gpt-3.5-turbo'], 'eval_name': 'spider-sql.dev.v0', 'base_eval': 'spider-sql', 'split': 'dev', 'run_config': {'completion_fns': ['gpt-3.5-turbo'], 'eval_spec': {'cls': 'evals.elsuite.modelgraded.classify:ModelBasedClassify', 'registry_path': '/Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry', 'args': {'samples_jsonl': 'sql/spider_sql.jsonl', 'eval_type': 'cot_classify', 'modelgraded_spec': 'sql'}, 'key': 'spider-sql.dev.v0', 'group': 'sql'}, 'seed': 20220722, 'max_samples': 20, 'command': '/Users/shyamal/.virtualenvs/openai/bin/oaieval gpt-3.5-turbo spider-sql --max_samples 20', 'initial_settings': {'visible': False}}, 'created_by': '', 'run_id': '240318215944ESN7L5HJ', 'created_at': '2024-03-18 21:59:44.882930'}</td>\n",
" <td>{'completion_fns': ['gpt-3.5-turbo'], 'eval_na...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
@ -560,7 +574,7 @@
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" <td>{'counts/Correct': 18, 'counts/Incorrect': 2, 'score': 0.9}</td>\n",
" <td>{'counts/Correct': 18, 'counts/Incorrect': 2, ...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
@ -577,15 +591,7 @@
" <td>0.0</td>\n",
" <td>spider-sql.dev.94</td>\n",
" <td>sampling</td>\n",
" <td>{'prompt': [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\n",
"Use only the following tables and columns:\n",
"Table: battle. Columns: id (number), name (text), date (text), bulgarian_commander (text), latin_commander (text), result (text)\n",
"Table: ship. Columns: lost_in_battle (number), id (number), name (text), tonnage (text), ship_type (text), location (text), disposition_of_ship (text)\n",
"Table: death. Columns: caused_by_ship_id (number), id (number), note (text), killed (number), injured (number)\n",
"\n",
"Question: What is the average number of injuries caused each time?\n",
"', 'role': 'system'}], 'sampled': ['SELECT AVG(injured) AS average_injuries_caused\n",
"FROM death;']}</td>\n",
" <td>{'prompt': [{'content': 'Answer the following ...</td>\n",
" <td></td>\n",
" <td>2024-03-18 21:59:45.655060+00:00</td>\n",
" </tr>\n",
@ -597,20 +603,7 @@
" <td>1.0</td>\n",
" <td>spider-sql.dev.25</td>\n",
" <td>sampling</td>\n",
" <td>{'prompt': [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\n",
"Use only the following tables and columns:\n",
"Table: continents. Columns: ContId (number), Continent (text)\n",
"Table: countries. Columns: CountryId (number), CountryName (text), Continent (number)\n",
"Table: car_makers. Columns: Id (number), Maker (text), FullName (text), Country (text)\n",
"Table: model_list. Columns: ModelId (number), Maker (number), Model (text)\n",
"Table: car_names. Columns: MakeId (number), Model (text), Make (text)\n",
"Table: cars_data. Columns: Id (number), MPG (text), Cylinders (number), Edispl (number), Horsepower (text), Weight (number), Accelerate (number), Year (number)\n",
"\n",
"Question: How many countries exist?\n",
"', 'role': 'system'}], 'sampled': ['```sql\n",
"SELECT COUNT(*) AS TotalCountries\n",
"FROM countries;\n",
"```']}</td>\n",
" <td>{'prompt': [{'content': 'Answer the following ...</td>\n",
" <td></td>\n",
" <td>2024-03-18 21:59:45.656165+00:00</td>\n",
" </tr>\n",
@ -622,15 +615,7 @@
" <td>2.0</td>\n",
" <td>spider-sql.dev.82</td>\n",
" <td>sampling</td>\n",
" <td>{'prompt': [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\n",
"Use only the following tables and columns:\n",
"Table: players. Columns: player_id (number), first_name (text), last_name (text), hand (text), birth_date (time), country_code (text)\n",
"Table: matches. Columns: best_of (number), draw_size (number), loser_age (number), loser_entry (text), loser_hand (text), loser_ht (number), loser_id (number), loser_ioc (text), loser_name (text), loser_rank (number), loser_rank_points (number), loser_seed (number), match_num (number), minutes (number), round (text), score (text), surface (text), tourney_date (time), tourney_id (text), tourney_level (text), tourney_name (text), winner_age (number), winner_entry (text), winner_hand (text), winner_ht (number), winner_id (number), winner_ioc (text), winner_name (text), winner_rank (number), winner_rank_points (number), winner_seed (number), year (number)\n",
"Table: rankings. Columns: ranking_date (time), ranking (number), player_id (number), ranking_points (number), tours (number)\n",
"\n",
"Question: Find the total number of matches.\n",
"', 'role': 'system'}], 'sampled': ['SELECT COUNT(*) AS total_matches\n",
"FROM matches;']}</td>\n",
" <td>{'prompt': [{'content': 'Answer the following ...</td>\n",
" <td></td>\n",
" <td>2024-03-18 21:59:45.656846+00:00</td>\n",
" </tr>\n",
@ -639,69 +624,40 @@
"</div>"
],
"text/plain": [
" spec \\\n",
"0 {'completion_fns': ['gpt-3.5-turbo'], 'eval_name': 'spider-sql.dev.v0', 'base_eval': 'spider-sql', 'split': 'dev', 'run_config': {'completion_fns': ['gpt-3.5-turbo'], 'eval_spec': {'cls': 'evals.elsuite.modelgraded.classify:ModelBasedClassify', 'registry_path': '/Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry', 'args': {'samples_jsonl': 'sql/spider_sql.jsonl', 'eval_type': 'cot_classify', 'modelgraded_spec': 'sql'}, 'key': 'spider-sql.dev.v0', 'group': 'sql'}, 'seed': 20220722, 'max_samples': 20, 'command': '/Users/shyamal/.virtualenvs/openai/bin/oaieval gpt-3.5-turbo spider-sql --max_samples 20', 'initial_settings': {'visible': False}}, 'created_by': '', 'run_id': '240318215944ESN7L5HJ', 'created_at': '2024-03-18 21:59:44.882930'} \n",
"1 NaN \n",
"2 NaN \n",
"3 NaN \n",
"4 NaN \n",
"\n",
" final_report \\\n",
"0 NaN \n",
"1 {'counts/Correct': 18, 'counts/Incorrect': 2, 'score': 0.9} \n",
"2 NaN \n",
"3 NaN \n",
"4 NaN \n",
"\n",
" run_id event_id sample_id type \\\n",
"0 NaN NaN NaN NaN \n",
"1 NaN NaN NaN NaN \n",
"2 240318215944ESN7L5HJ 0.0 spider-sql.dev.94 sampling \n",
"3 240318215944ESN7L5HJ 1.0 spider-sql.dev.25 sampling \n",
"4 240318215944ESN7L5HJ 2.0 spider-sql.dev.82 sampling \n",
"\n",
" data \\\n",
"0 NaN \n",
"1 NaN \n",
"2 {'prompt': [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\n",
"Use only the following tables and columns:\n",
"Table: battle. Columns: id (number), name (text), date (text), bulgarian_commander (text), latin_commander (text), result (text)\n",
"Table: ship. Columns: lost_in_battle (number), id (number), name (text), tonnage (text), ship_type (text), location (text), disposition_of_ship (text)\n",
"Table: death. Columns: caused_by_ship_id (number), id (number), note (text), killed (number), injured (number)\n",
" spec \\\n",
"0 {'completion_fns': ['gpt-3.5-turbo'], 'eval_na... \n",
"1 NaN \n",
"2 NaN \n",
"3 NaN \n",
"4 NaN \n",
"\n",
"Question: What is the average number of injuries caused each time?\n",
"', 'role': 'system'}], 'sampled': ['SELECT AVG(injured) AS average_injuries_caused\n",
"FROM death;']} \n",
"3 {'prompt': [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\n",
"Use only the following tables and columns:\n",
"Table: continents. Columns: ContId (number), Continent (text)\n",
"Table: countries. Columns: CountryId (number), CountryName (text), Continent (number)\n",
"Table: car_makers. Columns: Id (number), Maker (text), FullName (text), Country (text)\n",
"Table: model_list. Columns: ModelId (number), Maker (number), Model (text)\n",
"Table: car_names. Columns: MakeId (number), Model (text), Make (text)\n",
"Table: cars_data. Columns: Id (number), MPG (text), Cylinders (number), Edispl (number), Horsepower (text), Weight (number), Accelerate (number), Year (number)\n",
" final_report run_id \\\n",
"0 NaN NaN \n",
"1 {'counts/Correct': 18, 'counts/Incorrect': 2, ... NaN \n",
"2 NaN 240318215944ESN7L5HJ \n",
"3 NaN 240318215944ESN7L5HJ \n",
"4 NaN 240318215944ESN7L5HJ \n",
"\n",
"Question: How many countries exist?\n",
"', 'role': 'system'}], 'sampled': ['```sql\n",
"SELECT COUNT(*) AS TotalCountries\n",
"FROM countries;\n",
"```']} \n",
"4 {'prompt': [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\n",
"Use only the following tables and columns:\n",
"Table: players. Columns: player_id (number), first_name (text), last_name (text), hand (text), birth_date (time), country_code (text)\n",
"Table: matches. Columns: best_of (number), draw_size (number), loser_age (number), loser_entry (text), loser_hand (text), loser_ht (number), loser_id (number), loser_ioc (text), loser_name (text), loser_rank (number), loser_rank_points (number), loser_seed (number), match_num (number), minutes (number), round (text), score (text), surface (text), tourney_date (time), tourney_id (text), tourney_level (text), tourney_name (text), winner_age (number), winner_entry (text), winner_hand (text), winner_ht (number), winner_id (number), winner_ioc (text), winner_name (text), winner_rank (number), winner_rank_points (number), winner_seed (number), year (number)\n",
"Table: rankings. Columns: ranking_date (time), ranking (number), player_id (number), ranking_points (number), tours (number)\n",
" event_id sample_id type \\\n",
"0 NaN NaN NaN \n",
"1 NaN NaN NaN \n",
"2 0.0 spider-sql.dev.94 sampling \n",
"3 1.0 spider-sql.dev.25 sampling \n",
"4 2.0 spider-sql.dev.82 sampling \n",
"\n",
"Question: Find the total number of matches.\n",
"', 'role': 'system'}], 'sampled': ['SELECT COUNT(*) AS total_matches\n",
"FROM matches;']} \n",
" data created_by \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 {'prompt': [{'content': 'Answer the following ... \n",
"3 {'prompt': [{'content': 'Answer the following ... \n",
"4 {'prompt': [{'content': 'Answer the following ... \n",
"\n",
" created_by created_at \n",
"0 NaN NaT \n",
"1 NaN NaT \n",
"2 2024-03-18 21:59:45.655060+00:00 \n",
"3 2024-03-18 21:59:45.656165+00:00 \n",
"4 2024-03-18 21:59:45.656846+00:00 "
" created_at \n",
"0 NaT \n",
"1 NaT \n",
"2 2024-03-18 21:59:45.655060+00:00 \n",
"3 2024-03-18 21:59:45.656165+00:00 \n",
"4 2024-03-18 21:59:45.656846+00:00 "
]
},
"metadata": {},
@ -709,13 +665,14 @@
}
],
"source": [
"# display only few lines of jsonl\n",
"display(pd.read_json('/tmp/evallogs/240318215944ESN7L5HJ_gpt-3.5-turbo_spider-sql.jsonl', lines=True).head(5))"
"log_name = '240318215944ESN7L5HJ_gpt-3.5-turbo_spider-sql.jsonl' # \"EDIT THIS\" - copy from above\n",
"events = f\"/tmp/evallogs/{log_name}\"\n",
"display(pd.read_json(events, lines=True).head(5))"
]
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 8,
"metadata": {
"collapsed": false,
"jupyter": {
@ -724,12 +681,7 @@
},
"outputs": [],
"source": [
"# How to process the log events generated by oaieval\n",
"\n",
"# log_name = \"EDIT THIS\" # copy from above\n",
"log_name = '240318215944ESN7L5HJ_gpt-3.5-turbo_spider-sql.jsonl'\n",
"events = f\"/tmp/evallogs/{log_name}\"\n",
"\n",
"# processing the log events generated by oaieval\n",
"with open(events, \"r\") as f:\n",
" events_df = pd.read_json(f, lines=True)"
]
@ -743,7 +695,7 @@
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 9,
"metadata": {},
"outputs": [
{
@ -787,7 +739,7 @@
},
{
"cell_type": "code",
"execution_count": 16,
"execution_count": 10,
"metadata": {},
"outputs": [
{
@ -808,12 +760,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also review individual evaluation events that provide spefific samples (`sample_id`), results, event types, and timestamps."
"We can also review individual evaluation events that provide specific samples (`sample_id`), results, event types, and other metadata."
]
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": 11,
"metadata": {},
"outputs": [
{
@ -847,7 +799,7 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 12,
"metadata": {},
"outputs": [
{
