
pull/1113/head
Roy Ziv 3 months ago
parent 83e563c019
commit 8a4ab09ef1

@@ -26,28 +26,29 @@
"2. An open-source registry of challenging evals\n",
"\n",
"This notebook will cover:\n",
"* Introduction to Evaluation and the [OpenAI Evals](https://github.com/openai/evals/tree/main) library\n",
"* What are Evals\n",
"* Introduction to the [OpenAI Evals](https://github.com/openai/evals/tree/main) library\n",
"* Building an Eval\n",
"* Running an Eval\n",
"\n",
"#### What are evaluations (`evals`)?\n",
"\n",
"Evaluation is the process of validating and testing the outputs that your LLM applications are producing. Having strong evaluations (\"evals\") will mean a more stable, reliable application that is resilient to code and model changes. An eval is basically a task used to measure the quality of the output of an LLM or LLM system. Given an input prompt, an output is generated. We evaluate this output with a set of ideal answers and find the quality of the LLM system.\n",
"Evaluation is the process of validating and testing the outputs that your LLM applications are producing. Having strong evaluations (\"evals\") will mean a more stable, reliable application which is resilient to code and model changes. An eval is essentially a task used to measure the quality of the output of an LLM or LLM system. Given an input prompt, an output is generated. We evaluate this output against a set of ideal answers to determine the quality of the LLM system.\n",
"\n",
"#### Importance of Evaluations\n",
"#### Why is it important to evaluate?\n",
"\n",
"If you are building with foundational models like `GPT-4`, creating high quality evals is one of the most impactful things you can do. Developing AI solutions involves an iterative design process. [Without evals, it can be very difficult and time intensive to understand](https://youtu.be/XGJNo8TpuVA?feature=shared&t=1089) how different model versions and prompts might affect your use case. \n",
"If you are building with foundational models like `GPT-4`, creating high quality evals is one of the most impactful things you can do. Developing AI solutions involves an iterative design process. Without evals, it can be very difficult and time intensive to understand how different model versions and prompts might affect your use case. \n",
"\n",
"With OpenAI's [continuous model upgrades](https://platform.openai.com/docs/models/continuous-model-upgrades), evals allow you to efficiently test model performance for your use cases in a standardized way. Developing a suite of evals customized to your objectives will help you quickly and effectively understand how new models may perform for your use cases. You can also make evals a part of your CI/CD pipeline to make sure you achieve the desired accuracy before deploying.\n",
"With OpenAI's new continuous model upgrades, evals allow you to efficiently test model performance for your use cases in a standardized way. Developing a suite of evals customized to your objectives will help you quickly and effectively understand how new models may perform for your use cases. You can also make evals a part of your CI/CD pipeline (recommended) to make sure you achieve the desired accuracy before deploying.\n",
"\n",
"#### Types of Evals\n",
"\n",
"The simplest and most common type of eval has an input and an ideal response or answer. For example,\n",
"we can have an eval sample where the input is \"What year was Obama elected president for the first\n",
"time?\" and the ideal answer is \"2008\". We feed the input to a model and get the completion. If the model\n",
"says \"2008\", it is then graded as correct. Eval samples are aggregated into an eval dataset that can\n",
"we can have an eval sample where the input is `“What year was Obama elected president for the first\n",
"time?”` and the ideal answer is `“2008”`. We feed the input to a model and get the completion. If the model\n",
"says `“2008”`, it is then graded as correct. Eval samples are aggregated into an eval dataset that can\n",
"quantify overall performance within a certain topic. For example, this eval sample may be part of a\n",
"`president-election-years` eval that checks for every U.S. President, what year they were first elected.\n",
"“president-election-years” eval that checks for every U.S. President, what year they were first elected.\n",
"Evals are not restricted to checking factual accuracy: all that is needed is a reproducible way to grade a\n",
"completion. Here are some other examples of valid evals:\n",
"\n",
@@ -67,8 +68,8 @@
"\n",
"**Writing logic for answer checking**\n",
"\n",
"* Consider the Obama example from above, where the ideal response is \"2008\". We can write a\n",
"string match to check if the completion includes the phrase \"2008\". If it does, we consider it\n",
"* Consider the Obama example from above, where the ideal response is `\"2008\"`. We can write a\n",
"string match to check if the completion includes the phrase `\"2008\"`. If it does, we consider it\n",
"correct.\n",
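The string-match check described above can be sketched as a small function. This is a minimal illustration; `grade_sample` and the sample dict are hypothetical names for this notebook, not part of the Evals library:

```python
def grade_sample(completion: str, ideal: str) -> bool:
    # Mark the completion correct if it contains the ideal answer string.
    return ideal in completion

sample = {
    "input": "What year was Obama elected president for the first time?",
    "ideal": "2008",
}

print(grade_sample("Barack Obama was first elected in 2008.", sample["ideal"]))  # True
print(grade_sample("He was elected in 2004.", sample["ideal"]))                  # False
```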
"* Consider another eval where the input is to generate valid JSON: We can write some code that\n",
"attempts to parse the completion as JSON and then considers the completion correct if it is\n",
@@ -78,14 +79,14 @@
"model to look at the response to check if it's correct.**\n",
"\n",
"* Consider an input that asks the model to write a funny joke. The model then generates a\n",
"completion. We then create a new input to the model to answer the question: \"Is this following\n",
"joke funny? First reason step by step, then answer yes or no that includes the completion\". We\n",
"finally consider the original completion correct if the new model completion ends with \"yes\".\n",
"completion. We then create a new input to the model to answer the question: `“Is this following\n",
"joke funny? First reason step by step, then answer yes or no”` that includes the completion. We\n",
"finally consider the original completion correct if the new model completion ends with `“yes”`.\n",
"\n",
"Model grading works best with the latest, most powerful models like `GPT-4` and if we give them the ability\n",
"to reason before making a judgment. Model grading will have an error rate, so it is important to validate\n",
"the performance with human evaluation before running the evals at scale. For best results, it makes\n",
"sense to use a different model to do grading from the one that did the completion, like using `GPT-4` to\n",
"sense to use a different model to do grading from the one that did the completion, like using `GPT-4` to\n",
"grade `GPT-3.5` answers.\n",
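The model-grading flow above can be sketched as follows. `GRADING_PROMPT` and `verdict_is_yes` are illustrative names invented for this sketch (not Evals APIs), and the call to the grading model is left as a comment since it requires an API key:

```python
# Prompt template for the grading model, following the joke example above.
GRADING_PROMPT = (
    "Is this following joke funny? First reason step by step, "
    "then answer yes or no.\n\nJoke: {completion}"
)

def verdict_is_yes(grader_output: str) -> bool:
    # The original completion is graded correct if the grader's answer ends with "yes".
    return grader_output.strip().rstrip(".").lower().endswith("yes")

# In practice you would send GRADING_PROMPT.format(completion=...) to a stronger
# model (e.g. GPT-4 via client.chat.completions.create) and pass its reply here.
print(verdict_is_yes("The joke relies on a pun, so the answer is yes."))  # True
print(verdict_is_yes("It restates the setup without a punchline. No"))    # False
```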
"\n",
"#### OpenAI Eval Templates\n",
@@ -119,6 +120,7 @@
"from openai import OpenAI\n",
"import pandas as pd\n",
"import os\n",
"import json\n",
"\n",
"client = OpenAI()"
]
@@ -147,7 +149,7 @@
"\n",
"To start creating an eval, we need\n",
"\n",
"1. The test dataset in the `jsonl` format.\n",
"1. The test dataset in the JSONL format.\n",
"2. The eval template to be used\n",
"\n",
"### Creating the eval dataset\n",
@@ -168,7 +170,7 @@
"\"A: SELECT count ( * ) FROM CAR_MAKERS AS T1 JOIN COUNTRIES AS T2 ON T1.Country = T2.CountryId WHERE T2.CountryName = 'germany'\"\n",
"```\n",
"\n",
"The dataset needs to be in the following format:\n",
"The dataset needs to be in the following format:\n",
"```\n",
"\"input\": [{\"role\": \"system\", \"content\": \"<input prompt>\"}, {\"role\": \"user\", \"content\": <user input>}, \"ideal\": \"correct answer\"]\n",
"```\n",
@@ -179,12 +181,12 @@
"```\n",
"\n",
"\n",
"One way to speed up the process of building eval datasets is to use `GPT-4` to generate synthetic data"
"One way to speed up the process of building eval datasets is to use GPT-4 to generate synthetic data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2024-03-18T07:23:04.862331Z",
@@ -200,44 +202,21 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Q: Which continent has the highest average horsepower for their cars?\n",
"A: \n",
"```sql\n",
"SELECT continents.Continent, AVG(cars_data.Horsepower) AS AvgHorsepower\n",
"FROM continents\n",
"JOIN countries ON continents.ContId = countries.Continent\n",
"JOIN car_makers ON countries.CountryId = car_makers.Country\n",
"JOIN model_list ON car_makers.Id = model_list.Maker\n",
"JOIN car_names ON model_list.Model = car_names.Model\n",
"JOIN cars_data ON car_names.MakeId = cars_data.Id\n",
"GROUP BY continents.Continent\n",
"ORDER BY AvgHorsepower DESC\n",
"LIMIT 1\n",
"```\n",
"Q: What is the average MPG for cars made by makers from the continent of Europe?\n",
"A: SELECT AVG(cars_data.MPG) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\n",
"\n",
"Q: What is the average MPG (Miles Per Gallon) for cars made by makers in Europe?\n",
"A: SELECT AVG(cars_data.MPG) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.MakeId = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\n",
"Q: What are the names of car makers based in Europe?\n",
"A: SELECT Maker FROM car_makers AS cm JOIN countries AS c ON cm.Country = c.CountryId JOIN continents AS cont ON c.Continent = cont.ContId WHERE cont.Continent = 'Europe'\n",
"\n",
"Q: Which continent has the highest average horsepower for its cars?\n",
"A: \n",
"```sql\n",
"SELECT continents.Continent, AVG(cars_data.Horsepower) as AvgHorsepower\n",
"FROM cars_data\n",
"JOIN car_names ON car_names.MakeId = cars_data.Id\n",
"JOIN model_list ON model_list.Model = car_names.Model\n",
"JOIN car_makers ON car_makers.Id = model_list.Maker\n",
"JOIN countries ON car_makers.Country = countries.CountryId\n",
"JOIN continents ON countries.Continent = continents.ContId\n",
"GROUP BY continents.Continent\n",
"ORDER BY AvgHorsepower DESC\n",
"LIMIT 1\n",
"```\n",
"Q: Which car maker has the highest average horsepower for their cars?\n",
"\n",
"A: SELECT cm.Maker, AVG(cd.Horsepower) as AvgHorsepower FROM car_makers AS cm JOIN car_names AS cn ON cm.Id = cn.MakeId JOIN cars_data AS cd ON cn.MakeId = cd.Id GROUP BY cm.Maker ORDER BY AvgHorsepower DESC LIMIT 1\n",
"\n",
"Q: What is the average horsepower for cars made in countries belonging to the continent with ID 3?\n",
"A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.ContId = 3\n",
"\n",
"Q: What is the average horsepower for cars made by a maker from Japan?\n",
"A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId WHERE countries.CountryName = 'Japan'\n",
"\n",
"Q: What is the average horsepower for cars made in the USA?\n",
"A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.Make = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId WHERE countries.CountryName = 'USA'\n",
"\n"
]
}
@@ -276,22 +255,22 @@
"\n",
"user_input = \"Table car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]\"\n",
"\n",
"messages = [\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": system_prompt\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": user_input\n",
" }\n",
"]\n",
"messages = []\n",
"messages.append({\n",
" \"role\": \"system\",\n",
" \"content\": system_prompt\n",
"})\n",
"\n",
"messages.append({\n",
" \"role\": \"user\",\n",
" \"content\": user_input\n",
"})\n",
"\n",
"completion = client.chat.completions.create(\n",
" model=\"gpt-4-turbo-preview\",\n",
" messages=messages,\n",
" temperature=0.7,\n",
" n=5\n",
" model=\"gpt-4-turbo-preview\",\n",
" messages=messages,\n",
" temperature=0.7,\n",
" n=5\n",
")\n",
"\n",
"for choice in completion.choices:\n",
@@ -314,11 +293,11 @@
"name": "stdout",
"output_type": "stream",
"text": [
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'Which continent has the highest average horsepower for their cars?'}], 'ideal': ''}\n",
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average MPG (Miles Per Gallon) for cars made by makers in Europe?'}], 'ideal': \"SELECT AVG(cars_data.MPG) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.MakeId = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\"}\n",
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'Which continent has the highest average horsepower for its cars?'}], 'ideal': ''}\n",
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made by a maker from Japan?'}], 'ideal': \"SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId WHERE countries.CountryName = 'Japan'\"}\n",
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made in the USA?'}], 'ideal': \"SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.Make = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId WHERE countries.CountryName = 'USA'\"}\n"
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average MPG for cars made by makers from the continent of Europe?'}], 'ideal': \"SELECT AVG(cars_data.MPG) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\"}\n",
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What are the names of car makers based in Europe?'}], 'ideal': \"SELECT Maker FROM car_makers AS cm JOIN countries AS c ON cm.Country = c.CountryId JOIN continents AS cont ON c.Continent = cont.ContId WHERE cont.Continent = 'Europe'\"}\n",
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'Which car maker has the highest average horsepower for their cars?'}], 'ideal': 'SELECT cm.Maker, AVG(cd.Horsepower) as AvgHorsepower FROM car_makers AS cm JOIN car_names AS cn ON cm.Id = cn.MakeId JOIN cars_data AS cd ON cn.MakeId = cd.Id GROUP BY cm.Maker ORDER BY AvgHorsepower DESC LIMIT 1'}\n",
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made in countries belonging to the continent with ID 3?'}], 'ideal': 'SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.ContId = 3'}\n",
"{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made by a maker from Japan?'}], 'ideal': \"SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId WHERE countries.CountryName = 'Japan'\"}\n"
]
}
],
@@ -338,7 +317,7 @@
" })\n",
"\n",
"for item in eval_data:\n",
" print(item)"
" print(item)\n"
]
},
{
@@ -400,11 +379,13 @@
"source": [
"## Running an evaluation\n",
"\n",
"We can run this eval using the `oaieval` CLI. To get set up, install the library: `pip install .` (if you are running the [OpenAI Evals library](https://github.com/openai/evals) locally) or `pip install oaieval` if you are running an existing eval. \n",
"We can run this eval using the `oaieval` CLI:\n",
"\n",
"First, install the library: `pip install .` (if you are running the [OpenAI Evals library](https://github.com/openai/evals) locally) or `pip install oaieval` if you are running an existing eval. \n",
"\n",
"Then, run the eval using the CLI: `oaieval gpt-3.5-turbo spider-sql`\n",
"\n",
"This command expects a model name and an eval set name. Note that we provide two command line interfaces (CLIs): `oaieval` for running a single eval and `oaievalset` for running a set of evals. The valid eval names are specified in the YAML files under `evals/registry/evals` and their corresponding implementations can be found in `evals/elsuite`."
"The valid eval names are specified in the YAML files under `evals/registry/evals` and their corresponding implementations can be found in `evals/elsuite`."
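For the CI/CD use suggested earlier, one possible gate is to compare the final report's score against a threshold. The dict below mirrors the `Final report` line that `oaieval` prints at the end of a run, and `passes_threshold` is a hypothetical helper for this sketch, not part of the library:

```python
def passes_threshold(final_report: dict, threshold: float = 0.9) -> bool:
    # Fail the pipeline step if the eval score falls below the threshold;
    # final_report mirrors the dict oaieval logs, e.g.
    # {'counts/Correct': 18, 'counts/Incorrect': 2, 'score': 0.9}.
    return final_report.get("score", 0.0) >= threshold

report = {"counts/Correct": 18, "counts/Incorrect": 2, "score": 0.9}
# The score is simply accuracy: correct / (correct + incorrect) = 18 / 20 = 0.9.
print(passes_threshold(report))  # True
```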
]
},
{
@@ -461,61 +442,61 @@
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-03-18 20:45:46,391] [registry.py:257] Loading registry from /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/evals\n",
"[2024-03-18 20:45:50,433] [registry.py:257] Loading registry from /Users/shyamal/.evals/evals\n",
"[2024-03-18 20:45:50,444] [oaieval.py:189] \u001b[1;35mRun started: 240319034550VLDKMJVL\u001b[0m\n",
"[2024-03-18 20:45:50,466] [registry.py:257] Loading registry from /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/modelgraded\n",
"[2024-03-18 20:45:50,592] [registry.py:257] Loading registry from /Users/shyamal/.evals/modelgraded\n",
"[2024-03-18 20:45:50,593] [data.py:90] Fetching /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/data/sql/spider_sql.jsonl\n",
"[2024-03-18 20:45:50,608] [eval.py:36] Evaluating 20 samples\n",
"[2024-03-18 20:45:50,691] [eval.py:144] Running in threaded mode with 10 threads!\n",
" 0%| | 0/20 [00:00<?, ?it/s][2024-03-18 20:45:51,459] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:51,462] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:51,462] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:51,478] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:51,562] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:51,592] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:51,624] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:51,779] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:52,052] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:52,059] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:53,062] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:53,063] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:53,065] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 5%|██▏ | 1/20 [00:02<00:45, 2.38s/it][2024-03-18 20:45:53,094] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:53,471] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 25%|███████████ | 5/20 [00:02<00:06, 2.27it/s][2024-03-18 20:45:53,620] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 30%|█████████████▏ | 6/20 [00:02<00:05, 2.69it/s][2024-03-18 20:45:53,674] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:53,787] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 40%|█████████████████▌ | 8/20 [00:03<00:03, 3.90it/s][2024-03-18 20:45:53,891] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 45%|███████████████████▊ | 9/20 [00:03<00:02, 4.49it/s][2024-03-18 20:45:53,992] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:54,009] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:54,196] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:54,218] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:54,384] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 50%|█████████████████████▌ | 10/20 [00:03<00:02, 3.47it/s][2024-03-18 20:45:54,430] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:54,632] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:54,683] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:54,731] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:55,214] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:55,292] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:55,725] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 55%|███████████████████████▋ | 11/20 [00:05<00:05, 1.79it/s][2024-03-18 20:45:56,006] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 60%|█████████████████████████▊ | 12/20 [00:05<00:03, 2.07it/s][2024-03-18 20:45:56,201] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 20:45:56,206] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 65%|███████████████████████████▉ | 13/20 [00:05<00:02, 2.48it/s][2024-03-18 20:45:56,400] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 75%|████████████████████████████████▎ | 15/20 [00:05<00:01, 3.72it/s][2024-03-18 20:45:56,644] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 80%|██████████████████████████████████▍ | 16/20 [00:05<00:01, 3.80it/s][2024-03-18 20:45:56,837] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 85%|████████████████████████████████████▌ | 17/20 [00:06<00:00, 4.08it/s][2024-03-18 20:45:57,111] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 90%|██████████████████████████████████████▋ | 18/20 [00:06<00:00, 3.95it/s][2024-03-18 20:45:57,262] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 95%|████████████████████████████████████████▊ | 19/20 [00:06<00:00, 4.44it/s][2024-03-18 20:45:57,304] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"100%|███████████████████████████████████████████| 20/20 [00:06<00:00, 3.02it/s]\n",
"[2024-03-18 20:45:57,316] [record.py:360] Final report: {'counts/Correct': 18, 'counts/Incorrect': 2, 'score': 0.9}. Logged to /tmp/evallogs/240319034550VLDKMJVL_gpt-3.5-turbo_spider-sql.jsonl\n",
"[2024-03-18 20:45:57,316] [oaieval.py:229] Final report:\n",
"[2024-03-18 20:45:57,317] [oaieval.py:231] counts/Correct: 18\n",
"[2024-03-18 20:45:57,317] [oaieval.py:231] counts/Incorrect: 2\n",
"[2024-03-18 20:45:57,317] [oaieval.py:231] score: 0.9\n",
"[2024-03-18 20:45:57,342] [record.py:349] Logged 60 rows of events to /tmp/evallogs/240319034550VLDKMJVL_gpt-3.5-turbo_spider-sql.jsonl: insert_time=21.218ms\n"
"[2024-03-18 21:45:20,283] [registry.py:257] Loading registry from /Users/roy/Documents/Github/openai-cookbook/.venv/lib/python3.9/site-packages/evals/registry/evals\n",
"[2024-03-18 21:45:22,245] [registry.py:257] Loading registry from /Users/roy/.evals/evals\n",
"[2024-03-18 21:45:22,248] [oaieval.py:189] \u001b[1;35mRun started: 2403190445227UCQH3DZ\u001b[0m\n",
"[2024-03-18 21:45:22,265] [registry.py:257] Loading registry from /Users/roy/Documents/Github/openai-cookbook/.venv/lib/python3.9/site-packages/evals/registry/modelgraded\n",
"[2024-03-18 21:45:22,380] [registry.py:257] Loading registry from /Users/roy/.evals/modelgraded\n",
"[2024-03-18 21:45:22,380] [data.py:90] Fetching /Users/roy/Documents/Github/openai-cookbook/.venv/lib/python3.9/site-packages/evals/registry/data/sql/spider_sql.jsonl\n",
"[2024-03-18 21:45:22,385] [eval.py:36] Evaluating 20 samples\n",
"[2024-03-18 21:45:22,416] [eval.py:144] Running in threaded mode with 10 threads!\n",
" 0%| | 0/20 [00:00<?, ?it/s][2024-03-18 21:45:23,192] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:23,229] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:23,243] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:23,389] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:23,438] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:23,447] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:23,622] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:23,799] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:23,814] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:24,675] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:24,935] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:24,937] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 5%|██▏ | 1/20 [00:02<00:48, 2.53s/it][2024-03-18 21:45:25,008] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:25,282] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 20%|████████▊ | 4/20 [00:02<00:09, 1.75it/s][2024-03-18 21:45:25,489] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 25%|███████████ | 5/20 [00:03<00:07, 2.13it/s][2024-03-18 21:45:25,642] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 30%|█████████████▏ | 6/20 [00:03<00:05, 2.64it/s][2024-03-18 21:45:25,661] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:25,725] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:25,954] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 40%|█████████████████▌ | 8/20 [00:03<00:03, 3.59it/s][2024-03-18 21:45:26,024] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:26,042] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:26,066] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 45%|███████████████████▊ | 9/20 [00:03<00:02, 4.20it/s][2024-03-18 21:45:26,091] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:26,139] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:26,163] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:26,486] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 50%|█████████████████████▌ | 10/20 [00:04<00:02, 3.51it/s][2024-03-18 21:45:26,538] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:26,569] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:27,064] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:27,555] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:27,567] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 55%|███████████████████████▋ | 11/20 [00:05<00:04, 2.00it/s][2024-03-18 21:45:27,777] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 60%|█████████████████████████▊ | 12/20 [00:05<00:03, 2.38it/s][2024-03-18 21:45:27,918] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 65%|███████████████████████████▉ | 13/20 [00:05<00:02, 2.95it/s][2024-03-18 21:45:28,117] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 70%|██████████████████████████████ | 14/20 [00:05<00:01, 3.34it/s][2024-03-18 21:45:28,170] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:28,476] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 80%|██████████████████████████████████▍ | 16/20 [00:06<00:00, 4.08it/s][2024-03-18 21:45:28,639] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 85%|████████████████████████████████████▌ | 17/20 [00:06<00:00, 4.45it/s][2024-03-18 21:45:28,660] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"[2024-03-18 21:45:28,790] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
" 95%|████████████████████████████████████████▊ | 19/20 [00:06<00:00, 6.12it/s][2024-03-18 21:45:30,143] [_client.py:1013] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
"100%|███████████████████████████████████████████| 20/20 [00:07<00:00, 2.59it/s]\n",
"[2024-03-18 21:45:30,153] [record.py:360] Final report: {'counts/Correct': 18, 'counts/Incorrect': 2, 'score': 0.9}. Logged to /tmp/evallogs/2403190445227UCQH3DZ_gpt-3.5-turbo_spider-sql.jsonl\n",
"[2024-03-18 21:45:30,153] [oaieval.py:229] Final report:\n",
"[2024-03-18 21:45:30,153] [oaieval.py:231] counts/Correct: 18\n",
"[2024-03-18 21:45:30,153] [oaieval.py:231] counts/Incorrect: 2\n",
"[2024-03-18 21:45:30,153] [oaieval.py:231] score: 0.9\n",
"[2024-03-18 21:45:30,166] [record.py:349] Logged 60 rows of events to /tmp/evallogs/2403190445227UCQH3DZ_gpt-3.5-turbo_spider-sql.jsonl: insert_time=12.300ms\n"
]
}
],
"!oaieval gpt-3.5-turbo spider-sql --max_samples 20"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`oaievalset` expects a model name and an eval set name, for which the valid options are specified in the YAML files under `evals/registry/eval_sets`."
]
},
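{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "As a sketch, a run might look like the following (here `test` is a hypothetical eval set name; substitute one defined in your registry):"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "# Hypothetical example: replace `test` with an eval set name from evals/registry/eval_sets\n",
  "!oaievalset gpt-3.5-turbo test"
 ]
},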
{
"cell_type": "markdown",
"metadata": {},
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>240318215944ESN7L5HJ</td>\n",
" <td>240319032709UOPMNPEQ</td>\n",
" <td>0.0</td>\n",
" <td>spider-sql.dev.94</td>\n",
" <td>spider-sql.dev.25</td>\n",
" <td>sampling</td>\n",
" <td>{'prompt': [{'content': 'Answer the following ...</td>\n",
" <td></td>\n",
" <td>2024-03-18 21:59:45.655060+00:00</td>\n",
" <td>2024-03-19 03:27:10.017992+00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>240318215944ESN7L5HJ</td>\n",
" <td>240319032709UOPMNPEQ</td>\n",
" <td>1.0</td>\n",
" <td>spider-sql.dev.25</td>\n",
" <td>spider-sql.dev.88</td>\n",
" <td>sampling</td>\n",
" <td>{'prompt': [{'content': 'Answer the following ...</td>\n",
" <td></td>\n",
" <td>2024-03-18 21:59:45.656165+00:00</td>\n",
" <td>2024-03-19 03:27:10.171886+00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>240318215944ESN7L5HJ</td>\n",
" <td>240319032709UOPMNPEQ</td>\n",
" <td>2.0</td>\n",
" <td>spider-sql.dev.82</td>\n",
" <td>spider-sql.dev.72</td>\n",
" <td>sampling</td>\n",
" <td>{'prompt': [{'content': 'Answer the following ...</td>\n",
" <td></td>\n",
" <td>2024-03-18 21:59:45.656846+00:00</td>\n",
" <td>2024-03-19 03:27:10.183700+00:00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
" final_report run_id \\\n",
"0 NaN NaN \n",
"1 {'counts/Correct': 18, 'counts/Incorrect': 2, ... NaN \n",
"2 NaN 240318215944ESN7L5HJ \n",
"3 NaN 240318215944ESN7L5HJ \n",
"4 NaN 240318215944ESN7L5HJ \n",
"2 NaN 240319032709UOPMNPEQ \n",
"3 NaN 240319032709UOPMNPEQ \n",
"4 NaN 240319032709UOPMNPEQ \n",
"\n",
" event_id sample_id type \\\n",
"0 NaN NaN NaN \n",
"1 NaN NaN NaN \n",
"2 0.0 spider-sql.dev.94 sampling \n",
"3 1.0 spider-sql.dev.25 sampling \n",
"4 2.0 spider-sql.dev.82 sampling \n",
"2 0.0 spider-sql.dev.25 sampling \n",
"3 1.0 spider-sql.dev.88 sampling \n",
"4 2.0 spider-sql.dev.72 sampling \n",
"\n",
" data created_by \\\n",
"0 NaN NaN \n",
" created_at \n",
"0 NaT \n",
"1 NaT \n",
"2 2024-03-18 21:59:45.655060+00:00 \n",
"3 2024-03-18 21:59:45.656165+00:00 \n",
"4 2024-03-18 21:59:45.656846+00:00 "
"2 2024-03-19 03:27:10.017992+00:00 \n",
"3 2024-03-19 03:27:10.171886+00:00 \n",
"4 2024-03-19 03:27:10.183700+00:00 "
]
},
"metadata": {},
}
],
"source": [
"log_name = '240319032709UOPMNPEQ_gpt-3.5-turbo_spider-sql.jsonl' # \"EDIT THIS\" - copy from above\n",
"events = f\"/tmp/evallogs/{log_name}\"\n",
"display(pd.read_json(events, lines=True).head(5))"
]
},
"outputs": [],
"source": [
"# processing the log events generated by oaieval\n",
"# How to process the log events generated by oaieval\n",
"\n",
"# log_name = \"EDIT THIS\" # copy from above\n",
"log_name = '240319032709UOPMNPEQ_gpt-3.5-turbo_spider-sql.jsonl'\n",
"events = f\"/tmp/evallogs/{log_name}\"\n",
"\n",
"with open(events, \"r\") as f:\n",
" events_df = pd.read_json(f, lines=True)"
]
" 'split': 'dev',\n",
" 'run_config': {'completion_fns': ['gpt-3.5-turbo'],\n",
" 'eval_spec': {'cls': 'evals.elsuite.modelgraded.classify:ModelBasedClassify',\n",
" 'registry_path': '/Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry',\n",
" 'registry_path': '/Users/roy/Documents/Github/openai-cookbook/.venv/lib/python3.9/site-packages/evals/registry',\n",
" 'args': {'samples_jsonl': 'sql/spider_sql.jsonl',\n",
" 'eval_type': 'cot_classify',\n",
" 'modelgraded_spec': 'sql'},\n",
" 'group': 'sql'},\n",
" 'seed': 20220722,\n",
" 'max_samples': 20,\n",
" 'command': '/Users/shyamal/.virtualenvs/openai/bin/oaieval gpt-3.5-turbo spider-sql --max_samples 20',\n",
" 'command': '/Users/roy/Documents/Github/openai-cookbook/.venv/bin/oaieval gpt-3.5-turbo spider-sql --max_samples 20',\n",
" 'initial_settings': {'visible': False}},\n",
" 'created_by': '',\n",
" 'run_id': '240318215944ESN7L5HJ',\n",
" 'created_at': '2024-03-18 21:59:44.882930'}"
" 'run_id': '240319032709UOPMNPEQ',\n",
" 'created_at': '2024-03-19 03:27:09.329431'}"
]
},
"metadata": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also review individual evaluation events that provide specific samples (`sample_id`), results, event types, and other metadata."
"We can also review individual evaluation events that provide spefific samples (`sample_id`), results, event types, and timestamps."
]
},
{
{
"data": {
"text/plain": [
"run_id 240318215944ESN7L5HJ\n",
"event_id 0.0\n",
"sample_id spider-sql.dev.94\n",
"type sampling\n",
"run_id 240319032709UOPMNPEQ\n",
"event_id 0.0\n",
"sample_id spider-sql.dev.25\n",
"type sampling\n",
"data {'prompt': [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\n",
"Use only the following tables and columns:\n",
"Table: battle. Columns: id (number), name (text), date (text), bulgarian_commander (text), latin_commander (text), result (text)\n",
"Table: ship. Columns: lost_in_battle (number), id (number), name (text), tonnage (text), ship_type (text), location (text), disposition_of_ship (text)\n",
"Table: death. Columns: caused_by_ship_id (number), id (number), note (text), killed (number), injured (number)\n",
"Table: continents. Columns: ContId (number), Continent (text)\n",
"Table: countries. Columns: CountryId (number), CountryName (text), Continent (number)\n",
"Table: car_makers. Columns: Id (number), Maker (text), FullName (text), Country (text)\n",
"Table: model_list. Columns: ModelId (number), Maker (number), Model (text)\n",
"Table: car_names. Columns: MakeId (number), Model (text), Make (text)\n",
"Table: cars_data. Columns: Id (number), MPG (text), Cylinders (number), Edispl (number), Horsepower (text), Weight (number), Accelerate (number), Year (number)\n",
"\n",
"Question: What is the average number of injuries caused each time?\n",
"', 'role': 'system'}], 'sampled': ['SELECT AVG(injured) AS average_injuries_caused\n",
"FROM death;']}\n",
"created_at 2024-03-18 21:59:45.655060+00:00\n",
"Question: How many countries exist?\n",
"', 'role': 'system'}], 'sampled': ['SELECT COUNT(*) AS TotalCountries\n",
"FROM countries;']}\n",
"created_at 2024-03-19 03:27:10.017992+00:00\n",
"Name: 2, dtype: object"
]
},
"name": "stdout",
"output_type": "stream",
"text": [
"Prompt: [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\\nUse only the following tables and columns:\\nTable: continents. Columns: ContId (number), Continent (text)\\nTable: countries. Columns: CountryId (number), CountryName (text), Continent (number)\\nTable: car_makers. Columns: Id (number), Maker (text), FullName (text), Country (text)\\nTable: model_list. Columns: ModelId (number), Maker (number), Model (text)\\nTable: car_names. Columns: MakeId (number), Model (text), Make (text)\\nTable: cars_data. Columns: Id (number), MPG (text), Cylinders (number), Edispl (number), Horsepower (text), Weight (number), Accelerate (number), Year (number)\\n\\nQuestion: How many countries exist?\\n', 'role': 'system'}]\n",
"Sampled: ['SELECT COUNT(*) AS TotalCountries\\nFROM countries;']\n",
"----------\n",
"Prompt: [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\\nUse only the following tables and columns:\\nTable: players. Columns: player_id (number), first_name (text), last_name (text), hand (text), birth_date (time), country_code (text)\\nTable: matches. Columns: best_of (number), draw_size (number), loser_age (number), loser_entry (text), loser_hand (text), loser_ht (number), loser_id (number), loser_ioc (text), loser_name (text), loser_rank (number), loser_rank_points (number), loser_seed (number), match_num (number), minutes (number), round (text), score (text), surface (text), tourney_date (time), tourney_id (text), tourney_level (text), tourney_name (text), winner_age (number), winner_entry (text), winner_hand (text), winner_ht (number), winner_id (number), winner_ioc (text), winner_name (text), winner_rank (number), winner_rank_points (number), winner_seed (number), year (number)\\nTable: rankings. Columns: ranking_date (time), ranking (number), player_id (number), ranking_points (number), tours (number)\\n\\nQuestion: Find the average rank of winners in all matches.\\n', 'role': 'system'}]\n",
"Sampled: ['SELECT AVG(winner_rank) AS average_rank_of_winners\\nFROM matches;']\n",
"----------\n",
"Prompt: [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\\nUse only the following tables and columns:\\nTable: museum. Columns: Museum_ID (number), Name (text), Num_of_Staff (number), Open_Year (text)\\nTable: visitor. Columns: ID (number), Name (text), Level_of_membership (number), Age (number)\\nTable: visit. Columns: Museum_ID (number), visitor_ID (text), Num_of_Ticket (number), Total_spent (number)\\n\\nQuestion: What is the average age of the visitors whose membership level is not higher than 4?\\n', 'role': 'system'}]\n",
"Sampled: ['SELECT AVG(Age) \\nFROM visitor \\nWHERE Level_of_membership <= 4;']\n",
"----------\n",
"Prompt: [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\\nUse only the following tables and columns:\\nTable: city. Columns: ID (number), Name (text), CountryCode (text), District (text), Population (number)\\nTable: sqlite_sequence. Columns: name (text), seq (text)\\nTable: country. Columns: Code (text), Name (text), Continent (text), Region (text), SurfaceArea (number), IndepYear (number), Population (number), LifeExpectancy (number), GNP (number), GNPOld (number), LocalName (text), GovernmentForm (text), HeadOfState (text), Capital (number), Code2 (text)\\nTable: countrylanguage. Columns: CountryCode (text), Language (text), IsOfficial (text), Percentage (number)\\n\\nQuestion: How many countries have a republic as their form of government?\\n', 'role': 'system'}]\n",
"Sampled: [\"```sql\\nSELECT COUNT(*) \\nFROM country \\nWHERE GovernmentForm = 'Republic';\\n```\"]\n",
"Sampled: [\"```sql\\nSELECT COUNT(*) \\nFROM country \\nWHERE GovernmentForm LIKE '%Republic%';\\n```\"]\n",
"----------\n",
"Prompt: [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\\nUse only the following tables and columns:\\nTable: battle. Columns: id (number), name (text), date (text), bulgarian_commander (text), latin_commander (text), result (text)\\nTable: ship. Columns: lost_in_battle (number), id (number), name (text), tonnage (text), ship_type (text), location (text), disposition_of_ship (text)\\nTable: death. Columns: caused_by_ship_id (number), id (number), note (text), killed (number), injured (number)\\n\\nQuestion: What is the average number of injuries caused each time?\\n', 'role': 'system'}]\n",
"Sampled: ['SELECT AVG(injured) AS average_injuries_caused\\nFROM death;']\n",
"----------\n"
]
}
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conclusion"
"Let's review our failures to understand which tests did not succeed."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": []
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"You are comparing a submitted answer to an expert answer on a given SQL coding question. Here is the data:\n",
"[BEGIN DATA]\n",
"************\n",
"[Question]: Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\n",
"Use only the following tables and columns:\n",
"Table: Ref_Template_Types. Columns: Template_Type_Code (text), Template_Type_Description (text)\n",
"Table: Templates. Columns: Template_ID (number), Version_Number (number), Template_Type_Code (text), Date_Effective_From (time), Date_Effective_To (time), Template_Details (text)\n",
"Table: Documents. Columns: Document_ID (number), Template_ID (number), Document_Name (text), Document_Description (text), Other_Details (text)\n",
"Table: Paragraphs. Columns: Paragraph_ID (number), Document_ID (number), Paragraph_Text (text), Other_Details (text)\n",
"\n",
"Question: Return the document id, template id, and description for the document with the name Robbin CV.\n",
"\n",
"************\n",
"[Expert]: SELECT document_id , template_id , Document_Description FROM Documents WHERE document_name = \"Robbin CV\"\n",
"************\n",
"[Submission]: ```sql\n",
"SELECT Documents.Document_ID, Documents.Template_ID, Documents.Document_Description\n",
"FROM Documents\n",
"JOIN Templates ON Documents.Template_ID = Templates.Template_ID\n",
"WHERE Documents.Document_Name = 'Robbin CV';\n",
"```\n",
"************\n",
"[END DATA]\n",
"\n",
"Compare the content and correctness of the submitted SQL with the expert answer. Ignore any differences in whitespace, style, or output column names.\n",
"The submitted answer may either be correct or incorrect. Determine which case applies. Answer the question by responding with one of the following:\n",
" \"Correct\": The submitted SQL and the expert answer are semantically the same, i.e. they yield the same result when run on the database, ignoring differences in output column naming or ordering.\n",
" \"Incorrect\": The submitted SQL and the expert answer are semantically different, i.e. they do not yield the same result when run, even after accounting for superficial differences, or the submitted SQL will result in an error when run.\n",
"\n",
"First, write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset. Then print only a single choice from \"Correct\" or \"Incorrect\" (without quotes or punctuation) on its own line corresponding to the correct answer. At the end, repeat just the answer by itself on a new line.\n",
"\n",
"Reasoning:\n",
"--------------------\n",
"You are comparing a submitted answer to an expert answer on a given SQL coding question. Here is the data:\n",
"[BEGIN DATA]\n",
"************\n",
"[Question]: Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\n",
"Use only the following tables and columns:\n",
"Table: continents. Columns: ContId (number), Continent (text)\n",
"Table: countries. Columns: CountryId (number), CountryName (text), Continent (number)\n",
"Table: car_makers. Columns: Id (number), Maker (text), FullName (text), Country (text)\n",
"Table: model_list. Columns: ModelId (number), Maker (number), Model (text)\n",
"Table: car_names. Columns: MakeId (number), Model (text), Make (text)\n",
"Table: cars_data. Columns: Id (number), MPG (text), Cylinders (number), Edispl (number), Horsepower (text), Weight (number), Accelerate (number), Year (number)\n",
"\n",
"Question: Which model of the car has the minimum horsepower?\n",
"\n",
"************\n",
"[Expert]: SELECT T1.Model FROM CAR_NAMES AS T1 JOIN CARS_DATA AS T2 ON T1.MakeId = T2.Id ORDER BY T2.horsepower ASC LIMIT 1;\n",
"************\n",
"[Submission]: ```sql\n",
"SELECT Model\n",
"FROM model_list\n",
"WHERE ModelId = (\n",
" SELECT ModelId\n",
" FROM cars_data\n",
" WHERE Horsepower = (\n",
" SELECT MIN(Horsepower)\n",
" FROM cars_data\n",
" )\n",
")\n",
"```\n",
"************\n",
"[END DATA]\n",
"\n",
"Compare the content and correctness of the submitted SQL with the expert answer. Ignore any differences in whitespace, style, or output column names.\n",
"The submitted answer may either be correct or incorrect. Determine which case applies. Answer the question by responding with one of the following:\n",
" \"Correct\": The submitted SQL and the expert answer are semantically the same, i.e. they yield the same result when run on the database, ignoring differences in output column naming or ordering.\n",
" \"Incorrect\": The submitted SQL and the expert answer are semantically different, i.e. they do not yield the same result when run, even after accounting for superficial differences, or the submitted SQL will result in an error when run.\n",
"\n",
"First, write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset. Then print only a single choice from \"Correct\" or \"Incorrect\" (without quotes or punctuation) on its own line corresponding to the correct answer. At the end, repeat just the answer by itself on a new line.\n",
"\n",
"Reasoning:\n",
"--------------------\n"
]
}
],
"source": [
"# Inspect metrics where choice is made and print only the prompt, result, and expected result if the choice is incorrect\n",
"for i, row in events_df[events_df['type'] == 'metrics'].iterrows():\n",
" if row['data']['choice'] == 'Incorrect':\n",
" # Get the previous row's data, which contains the prompt and the expected result\n",
" prev_row = events_df.iloc[i-1]\n",
" prompt = prev_row['data']['prompt'][0]['content'] if 'prompt' in prev_row['data'] and len(prev_row['data']['prompt']) > 0 else \"Prompt not available\"\n",
" expected_result = prev_row['data'].get('ideal', 'Expected result not provided')\n",
" \n",
" # Current row's data will be the actual result\n",
" result = row['data'].get('result', 'Actual result not provided')\n",
" \n",
" print(prompt)\n",
" print(\"-\" * 20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reviewing each of these failures we see the following:\n",
"* In the first failure, the model returns an unnecessary JOIN to the TEMPLATES table despite it not being used. Everything else in the query appears to be correct. To improve our results here, we could potentially prompt engineer our system prompt and add a clause to state that it should write queries as efficiently as possible, without unnecessary joins.\n",
"* In the second failure, both answers are technically correct, but are structured in different ways. This begs the question if evaluating SQL based on exact syntax matches is the most efficient method. For a situation like this, we could potentially considered using a model graded eval to understand the structure of the expected vs. actual and determine if they are the same"
]
},
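{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "As a quick sanity check (assuming `events_df` was loaded as above), we can also tally the grader's choices directly from the log; the counts should match the final report:"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "# Tally the model-graded choices recorded in the metrics events\n",
  "choices = events_df[events_df['type'] == 'metrics']['data'].apply(lambda d: d['choice'])\n",
  "print(choices.value_counts())"
 ]
},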
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Building out effective evals is a core part of the development cycle of LLM-based applications. The OpenAI Evals framework provides the core structure of building evals out of the box, and allows you to quickly spin up new tests for your various use cases. In this guide, we demonstrated step-by-step how to create an eval, run it, and analyze the results.\n",
"\n",
"The example shown in this guide represent a straightfoward use case for evals. As you continue to explore this framework, we recommend you explore creating more complex model-graded evals for actual production use cases. "
]
}
],
"metadata": {
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
"version": "3.9.6"
}
},
"nbformat": 4,
