Eval Registry (#1105)

3 months ago · c4bad1e088
parent 266b7be8b3
commit c4bad1e088
1 changed file with 98 additions and 21 deletions
--- a/examples/evaluation/Getting_Started_with_OpenAI_Evals.ipynb
+++ b/examples/evaluation/Getting_Started_with_OpenAI_Evals.ipynb
@@ -17,7 +17,7 @@
   "source": [
    "\n",
    "This notebook will go over:\n",
-    "* Introduction to OpenAI Evals library [enter link]\n",
+    "* Introduction to the [OpenAI Evals library](https://github.com/openai/evals/tree/main)\n",
    "* What are Evals\n",
    "* Building an Eval\n",
    "* Running an Eval\n",
@@ -30,7 +30,7 @@
    "\n",
    "*Why is it important to evaluate?*\n",
    "\n",
-    "If you are building with LLMs, creating high quality evals is one of the most impactful things you can do. Without evals, it can be very difficult and time intensive to understand how different model versions might affect your use case. With OpenAI’s new continuous model upgrades, evals allow you to efficiently test model performance for your use cases in a standardized way. Developing a suite of evals customized to your objectives will help you quickly and effectively understand how new models may perform for your use cases.\n",
+    "If you are building with LLMs, creating high-quality evals is one of the most impactful things you can do. Developing AI solutions involves an iterative design process. Without evals, it can be very difficult and time-intensive to understand how different model versions and prompts might affect your use case. With OpenAI’s new continuous model upgrades, evals allow you to efficiently test model performance for your use cases in a standardized way. Developing a suite of evals customized to your objectives will help you quickly and effectively understand how new models may perform for your use cases.\n",
    "\n",
    "*Types of Evals*\n",
    "\n",
@@ -113,7 +113,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
@@ -122,20 +122,54 @@
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "Q: Which maker has the highest number of models available in the dataset?\n",
-      "A: SELECT Maker, COUNT(Model) AS ModelCount FROM model_list GROUP BY Maker ORDER BY ModelCount DESC LIMIT 1\n",
+      "Q: Which continent has the highest average horsepower for its cars?\n",
      "\n",
-      "Q: What is the average horsepower of cars made by a maker from the continent 'Europe'?\n",
-      "A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\n",
+      "A: SELECT continents.Continent, AVG(cars_data.Horsepower) AS AvgHorsepower FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId GROUP BY continents.Continent ORDER BY AvgHorsepower DESC LIMIT 1\n",
      "\n",
-      "Q: What are the average horsepower and weight for cars made by makers from the continent of Europe?\n",
-      "A: SELECT AVG(cars_data.Horsepower) AS AVG_Horsepower, AVG(cars_data.Weight) AS AVG_Weight FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\n",
+      "Q: Which country has produced the most car models according to this database?\n",
+      "A: SELECT countries.CountryName, COUNT(*) AS number_of_models \n",
+      "   FROM car_makers \n",
+      "   JOIN countries ON car_makers.Country = countries.CountryId \n",
+      "   JOIN model_list ON car_makers.Id = model_list.Maker \n",
+      "   GROUP BY countries.CountryName \n",
+      "   ORDER BY number_of_models DESC \n",
+      "   LIMIT 1\n",
      "\n",
-      "Q: Which car maker has the most models with horsepower greater than 200?\n",
-      "A: SELECT Maker, count(*) as ModelCount FROM car_names AS cn JOIN model_list AS ml ON cn.Model = ml.Model JOIN car_makers AS cm ON ml.Maker = cm.Id JOIN cars_data AS cd ON cn.MakeId = cd.Id WHERE cd.Horsepower > 200 GROUP BY Maker ORDER BY ModelCount DESC LIMIT 1\n",
+      "Q: What is the average horsepower of cars made by makers from the continent with the highest number of car makers?\n",
      "\n",
-      "Q: What is the average MPG (Miles Per Gallon) for cars made by manufacturers from Europe?\n",
-      "A: SELECT AVG(cars_data.MPG) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\n",
+      "A: \n",
+      "```sql\n",
+      "SELECT AVG(cars_data.Horsepower) AS Average_Horsepower\n",
+      "FROM cars_data\n",
+      "JOIN car_names ON cars_data.Id = car_names.MakeId\n",
+      "JOIN car_makers ON car_names.Make = car_makers.Id\n",
+      "JOIN countries ON car_makers.Country = countries.CountryId\n",
+      "JOIN continents ON countries.Continent = continents.ContId\n",
+      "WHERE continents.ContId = (\n",
+      "    SELECT continents.ContId\n",
+      "    FROM car_makers\n",
+      "    JOIN countries ON car_makers.Country = countries.CountryId\n",
+      "    JOIN continents ON countries.Continent = continents.ContId\n",
+      "    GROUP BY continents.ContId\n",
+      "    ORDER BY COUNT(car_makers.Id) DESC\n",
+      "    LIMIT 1\n",
+      ")\n",
+      "```\n",
+      "\n",
+      "Q: Which continent produces the most models with a horsepower greater than 200?\n",
+      "A: SELECT continents.Continent FROM continents JOIN countries ON continents.ContId = countries.Continent JOIN car_makers ON countries.CountryId = car_makers.Country JOIN model_list ON car_makers.Id = model_list.Maker JOIN car_names ON model_list.Model = car_names.Model JOIN cars_data ON car_names.MakeId = cars_data.Id WHERE cars_data.Horsepower > 200 GROUP BY continents.Continent ORDER BY COUNT(model_list.Model) DESC LIMIT 1\n",
+      "\n",
+      "Q: Which car maker based in Europe has the highest number of models?\n",
+      "\n",
+      "A: SELECT car_makers.FullName, COUNT(model_list.Model) AS NumberOfModels\n",
+      "   FROM car_makers\n",
+      "   JOIN countries ON car_makers.Country = countries.CountryId\n",
+      "   JOIN continents ON countries.Continent = continents.ContId\n",
+      "   JOIN model_list ON car_makers.Id = model_list.Maker\n",
+      "   WHERE continents.Continent = 'Europe'\n",
+      "   GROUP BY car_makers.FullName\n",
+      "   ORDER BY NumberOfModels DESC\n",
+      "   LIMIT 1\n",
      "\n"
     ]
    }
@@ -209,18 +243,18 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'Which maker has the highest number of models available in the dataset?'}], 'ideal': 'SELECT Maker, COUNT(Model) AS ModelCount FROM model_list GROUP BY Maker ORDER BY ModelCount DESC LIMIT 1'}\n",
-      "{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': \"What is the average horsepower of cars made by a maker from the continent 'Europe'?\"}], 'ideal': \"SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\"}\n",
-      "{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What are the average horsepower and weight for cars made by makers from the continent of Europe?'}], 'ideal': \"SELECT AVG(cars_data.Horsepower) AS AVG_Horsepower, AVG(cars_data.Weight) AS AVG_Weight FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\"}\n",
-      "{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'Which car maker has the most models with horsepower greater than 200?'}], 'ideal': 'SELECT Maker, count(*) as ModelCount FROM car_names AS cn JOIN model_list AS ml ON cn.Model = ml.Model JOIN car_makers AS cm ON ml.Maker = cm.Id JOIN cars_data AS cd ON cn.MakeId = cd.Id WHERE cd.Horsepower > 200 GROUP BY Maker ORDER BY ModelCount DESC LIMIT 1'}\n",
-      "{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average MPG (Miles Per Gallon) for cars made by manufacturers from Europe?'}], 'ideal': \"SELECT AVG(cars_data.MPG) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'\"}\n"
+      "{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'Which continent has the highest average horsepower for its cars?'}], 'ideal': 'SELECT continents.Continent, AVG(cars_data.Horsepower) AS AvgHorsepower FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId GROUP BY continents.Continent ORDER BY AvgHorsepower DESC LIMIT 1'}\n",
+      "{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'Which country has produced the most car models according to this database?'}], 'ideal': 'SELECT countries.CountryName, COUNT(*) AS number_of_models '}\n",
+      "{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower of cars made by makers from the continent with the highest number of car makers?'}], 'ideal': ''}\n",
+      "{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'Which continent produces the most models with a horsepower greater than 200?'}], 'ideal': 'SELECT continents.Continent FROM continents JOIN countries ON continents.ContId = countries.Continent JOIN car_makers ON countries.CountryId = car_makers.Country JOIN model_list ON car_makers.Id = model_list.Maker JOIN car_names ON model_list.Model = car_names.Model JOIN cars_data ON car_names.MakeId = cars_data.Id WHERE cars_data.Horsepower > 200 GROUP BY continents.Continent ORDER BY COUNT(model_list.Model) DESC LIMIT 1'}\n",
+      "{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\\nTable car_names, columns = [*,MakeId,Model,Make]\\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\\nTable continents, columns = [*,ContId,Continent]\\nTable countries, columns = [*,CountryId,CountryName,Continent]\\nTable model_list, columns = [*,ModelId,Maker,Model]\\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'Which car maker based in Europe has the highest number of models?'}], 'ideal': 'SELECT car_makers.FullName, COUNT(model_list.Model) AS NumberOfModels'}\n"
     ]
    }
   ],
@@ -243,6 +277,49 @@
    "    print(item)\n"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next, we need to register the eval so that the framework can run it.\n",
+    "\n",
+    "The evals framework requires a .yaml file structured with the following properties:\n",
+    "* id - An identifier for your eval\n",
+    "* description - A short description of your eval\n",
+    "* disclaimer - Additional notes about your eval\n",
+    "* metrics - There are three types of eval metrics we can choose from: match, includes, fuzzyMatch\n",
+    "\n",
+    "For our eval, we will configure the following:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'\\nspider-sql:\\n  id: spider-sql.dev.v0\\n  metrics: [accuracy]\\n  description: Eval that scores SQL code from 194 examples in the Spider Text-to-SQL test dataset. The problems are selected by taking the first 10 problems for each database that appears in the test set.\\n    Yu, Tao, et al. \"Spider; A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task.\" Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, https://doi.org/10.18653/v1/d18-1425.\\n  disclaimer: Problems are solved zero-shot with no prompting other than the schema; performance may improve with training examples, fine tuning, or a different schema format. Evaluation is currently done through model-grading, where SQL code is not actually executed; the model may judge correct SQL to be incorrect, or vice-versa.\\n\\n  '"
+      ]
+     },
+     "execution_count": 24,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "\"\"\"\n",
+    "spider-sql:\n",
+    "  id: spider-sql.dev.v0\n",
+    "  metrics: [accuracy]\n",
+    "  description: Eval that scores SQL code from 194 examples in the Spider Text-to-SQL test dataset. The problems are selected by taking the first 10 problems for each database that appears in the test set.\n",
+    "    Yu, Tao, et al. \\\"Spider; A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task.\\\" Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, https://doi.org/10.18653/v1/d18-1425.\n",
+    "  disclaimer: Problems are solved zero-shot with no prompting other than the schema; performance may improve with training examples, fine tuning, or a different schema format. Evaluation is currently done through model-grading, where SQL code is not actually executed; the model may judge correct SQL to be incorrect, or vice-versa.\n",
+    "\n",
+    "  \"\"\""
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {
@@ -251,10 +328,10 @@
   "source": [
    "## Running an evaluation\n",
    "\n",
-    "we can run this eval using the oaieval CLI like this\n",
+    "We can run this eval using the oaieval CLI like this:\n",
    "\n",
    "pip install .\n",
-    "oaieval gpt-3.5-turbo <name of eval>\n",
+    "oaieval gpt-3.5-turbo spider-sql\n",
    "\n",
    "### Going through eval logs"
   ]