Creating Eval dataset

pull/1099/head
Roy Ziv 3 months ago
parent d32e7aa18d
commit 10a09374b8

@@ -2,15 +2,18 @@
"cells": [
{
"cell_type": "markdown",
"source": [
"# Getting Started with OpenAI Evals"
],
"metadata": {
"collapsed": false
}
},
"source": [
"# Getting Started with OpenAI Evals"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"\n",
"This notebook will go over:\n",
@@ -40,7 +43,7 @@
"Evals are not restricted to checking factual accuracy: all that is needed is a reproducible way to grade a\n",
"completion. Here are some other examples of valid evals:\n",
"* The input asks to write a short essay on a topic. The grading criteria is to check if the essay is of\n",
"* particular length or if certain keywords or themes are present in the completion.\n",
"particular length or if certain keywords or themes are present in the completion.\n",
"* The input is to write a funny joke, and the grading criteria is to check how funny it was.\n",
"* The input is to follow a sequence of instructions, and the grading ensures that all instructions\n",
"were followed.\n",
@@ -71,13 +74,13 @@
"the performance with human evaluation before running the evals at scale. For best results, it makes\n",
"sense to use a different model to do grading from the one that did the completion, like using GPT-4 to\n",
"grade GPT-3.5 answers.\n"
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## Building an evaluation for the OpenAI Evals framework\n",
"\n",
@@ -87,25 +90,43 @@
"2/ The eval template to be used\n",
"\n",
"### Creating the eval dataset\n",
"Let's create a dataset for a use case where we are evaluating the model's ability to generate syntactically correct SQL.\n",
"\n",
"format\n",
"`\"input\": [{\"role\": \"system\", \"content\": \"<input prompt>\",\"name\":\"example-user\"}, \"ideal\": \"correct answer\"]`"
],
"metadata": {
"collapsed": false
}
"First we will need to create a system prompt that we would like to evaluate. We will pass in instructions for the model as well as an overview of the table structure:\n",
"`\"TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable city, columns = [*,ID,Name,CountryCode,District,Population]\\nTable country, columns = [*,Code,Name,Continent,Region,SurfaceArea,IndepYear,Population,LifeExpectancy,GNP,GNPOld,LocalName,GovernmentForm,HeadOfState,Capital,Code2]\\nTable countrylanguage, columns = [*,CountryCode,Language,IsOfficial,Percentage]\\nTable sqlite_sequence, columns = [*,name,seq]\\nForeign_keys = [city.CountryCode = country.Code,countrylanguage.CountryCode = country.Code]\\n\"`\n",
"\n",
"For this prompt, we can ask a specific question:\n",
"`\"Q: What is the GNP of Afghanistan?\"`\n",
"\n",
"And we have an expected answer:\n",
"`\"A: SELECT GNP FROM country WHERE name = \\\"Afghanistan\\\"\"`\n",
"\n",
"The dataset needs to be in the following format:\n",
"`{\"input\": [{\"role\": \"system\", \"content\": \"<input prompt>\", \"name\": \"example-user\"}], \"ideal\": \"correct answer\"}`\n",
"\n",
"Putting it all together, we get:\n",
"`{\"input\": [{\"role\": \"system\", \"content\": \"TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable city, columns = [*,ID,Name,CountryCode,District,Population]\\nTable country, columns = [*,Code,Name,Continent,Region,SurfaceArea,IndepYear,Population,LifeExpectancy,GNP,GNPOld,LocalName,GovernmentForm,HeadOfState,Capital,Code2]\\nTable countrylanguage, columns = [*,CountryCode,Language,IsOfficial,Percentage]\\nTable sqlite_sequence, columns = [*,name,seq]\\nForeign_keys = [city.CountryCode = country.Code,countrylanguage.CountryCode = country.Code]\\n\"}, {\"role\": \"user\", \"content\": \"Q: What is the GNP of Afghanistan?\"}], \"ideal\": [\"A: SELECT GNP FROM country WHERE name = \\\"Afghanistan\\\"\"]}`\n",
"\n",
"\n",
"One way to speed up the process of building eval datasets is to use GPT-4 to generate synthetic data."
]
},
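The dataset line shown above can be produced programmatically rather than written by hand. A minimal sketch (the helper name `make_eval_record` and the JSONL filename are illustrative assumptions, not part of the notebook):

```python
import json

def make_eval_record(system_prompt: str, question: str, ideal: str) -> str:
    # Build one line of the JSONL eval dataset in the format shown above:
    # {"input": [system message, user message], "ideal": [correct answer]}
    record = {
        "input": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        "ideal": [ideal],
    }
    return json.dumps(record)

line = make_eval_record(
    "TASK: Answer the following question with syntactically correct SQLite SQL.",
    "Q: What is the GNP of Afghanistan?",
    'A: SELECT GNP FROM country WHERE name = "Afghanistan"',
)
# Append each record as one line of a JSONL file to build the dataset.
```

Each call produces one self-contained JSON object, so records can be appended one per line to the dataset file.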
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [],
"metadata": {
"collapsed": false
}
},
"outputs": [],
"source": [
"## Use GPT-4 to generate synthetic data"
]
},
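The code cell for this step is empty in this commit. One way the synthetic-data generation could be sketched is to ask GPT-4 for new question/SQL pairs over the same schema; the prompt wording and the helper below are assumptions, and the actual API call is left commented out since it requires a configured client and key:

```python
def build_generation_messages(schema: str, n: int) -> list[dict]:
    # Chat messages asking the model to emit n question/SQL pairs,
    # one JSON object per line, ready to convert into eval records.
    return [
        {"role": "system",
         "content": "You write question/SQL pairs for an eval dataset."},
        {"role": "user",
         "content": f"Given this schema:\n{schema}\n"
                    f"Write {n} questions with correct SQLite answers, "
                    "one JSON object per line with keys 'question' and 'answer'."},
    ]

# messages = build_generation_messages(schema, 10)
# response = client.chat.completions.create(model="gpt-4", messages=messages)
```

The generated pairs would then be converted into the `{"input": ..., "ideal": ...}` JSONL format described above.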
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## Running an evaluation\n",
"\n",
@@ -115,19 +136,16 @@
"oaieval gpt-3.5-turbo <name of eval>\n",
"\n",
"### Going through eval logs"
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [],
"metadata": {
"collapsed": false
}
},
"outputs": [],
"source": []
}
],
"metadata": {
