Creating Eval dataset

pull/1099/head
Roy Ziv 3 months ago
parent d32e7aa18d
commit 10a09374b8

@@ -2,15 +2,18 @@
"cells": [
{
"cell_type": "markdown",
"source": [
"# Getting Started with OpenAI Evals"
],
"metadata": {
"collapsed": false
}
},
"source": [
"# Getting Started with OpenAI Evals"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"\n",
"This notebook will go over:\n",
@@ -40,7 +43,7 @@
"Evals are not restricted to checking factual accuracy: all that is needed is a reproducible way to grade a\n",
"completion. Here are some other examples of valid evals:\n",
"* The input asks to write a short essay on a topic. The grading criteria is to check if the essay is of\n",
"* particular length or if certain keywords or themes are present in the completion.\n",
"particular length or if certain keywords or themes are present in the completion.\n",
"* The input is to write a funny joke, and the grading criteria is to check how funny it was.\n",
"* The input is to follow a sequence of instructions, and the grading ensures that all instructions\n",
"were followed.\n",
@@ -71,13 +74,13 @@
"the performance with human evaluation before running the evals at scale. For best results, it makes\n",
"sense to use a different model to do grading from the one that did the completion, like using GPT-4 to\n",
"grade GPT-3.5 answers.\n"
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## Building an evaluation for the OpenAI Evals framework\n",
"\n",
@@ -87,25 +90,43 @@
"2/ The eval template to be used\n",
"\n",
"### Creating the eval dataset\n",
"Let's create a dataset for a use case where we are evaluating the model's ability to generate syntactically correct SQL.\n",
"\n",
"format\n",
"`\"input\": [{\"role\": \"system\", \"content\": \"<input prompt>\",\"name\":\"example-user\"}, \"ideal\": \"correct answer\"]`"
],
"metadata": {
"collapsed": false
}
"First we will need to create a system prompt that we would like to evaluate. We will pass in instructions for the model as well as an overview of the table structure:\n",
"`\"TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable city, columns = [*,ID,Name,CountryCode,District,Population]\\nTable country, columns = [*,Code,Name,Continent,Region,SurfaceArea,IndepYear,Population,LifeExpectancy,GNP,GNPOld,LocalName,GovernmentForm,HeadOfState,Capital,Code2]\\nTable countrylanguage, columns = [*,CountryCode,Language,IsOfficial,Percentage]\\nTable sqlite_sequence, columns = [*,name,seq]\\nForeign_keys = [city.CountryCode = country.Code,countrylanguage.CountryCode = country.Code]\\n\"`\n",
"\n",
"For this prompt, we can ask a specific question:\n",
"`\"Q: What is the GNP of Afghanistan?\"`\n",
"\n",
"And we have an expected answer:\n",
"`\"A: SELECT GNP FROM country WHERE name = \\\"Afghanistan\\\"\"`\n",
"\n",
"The dataset needs to be in the following format:\n",
"`{\"input\": [{\"role\": \"system\", \"content\": \"<input prompt>\", \"name\": \"example-user\"}], \"ideal\": \"correct answer\"}`\n",
"\n",
"Putting it all together, we get:\n",
"`{\"input\": [{\"role\": \"system\", \"content\": \"TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\\nTable city, columns = [*,ID,Name,CountryCode,District,Population]\\nTable country, columns = [*,Code,Name,Continent,Region,SurfaceArea,IndepYear,Population,LifeExpectancy,GNP,GNPOld,LocalName,GovernmentForm,HeadOfState,Capital,Code2]\\nTable countrylanguage, columns = [*,CountryCode,Language,IsOfficial,Percentage]\\nTable sqlite_sequence, columns = [*,name,seq]\\nForeign_keys = [city.CountryCode = country.Code,countrylanguage.CountryCode = country.Code]\\n\"}, {\"role\": \"user\", \"content\": \"Q: What is the GNP of Afghanistan?\"}], \"ideal\": [\"A: SELECT GNP FROM country WHERE name = \\\"Afghanistan\\\"\"]}`\n",
"\n",
"\n",
"One way to speed up the process of building eval datasets is to use GPT-4 to generate synthetic data."
]
},
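The dataset line shown above can be produced programmatically rather than written by hand. A minimal sketch (the helper name `make_eval_record` and the JSONL filename are illustrative assumptions, not part of the notebook):

```python
import json

def make_eval_record(system_prompt: str, question: str, ideal: str) -> str:
    # Build one line of the JSONL eval dataset in the format shown above:
    # {"input": [system message, user message], "ideal": [correct answer]}
    record = {
        "input": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        "ideal": [ideal],
    }
    return json.dumps(record)

line = make_eval_record(
    "TASK: Answer the following question with syntactically correct SQLite SQL.",
    "Q: What is the GNP of Afghanistan?",
    'A: SELECT GNP FROM country WHERE name = "Afghanistan"',
)
# Append each record as one line of a JSONL file to build the dataset.
```

Each call produces one self-contained JSON object, so records can be appended one per line to the dataset file.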
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [],
"metadata": {
"collapsed": false
}
},
"outputs": [],
"source": [
"## Use GPT-4 to generate synthetic data"
]
},
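The code cell for this step is empty in this commit. One way the synthetic-data generation could be sketched is to ask GPT-4 for new question/SQL pairs over the same schema; the prompt wording and the helper below are assumptions, and the actual API call is left commented out since it requires a configured client and key:

```python
def build_generation_messages(schema: str, n: int) -> list[dict]:
    # Chat messages asking the model to emit n question/SQL pairs,
    # one JSON object per line, ready to convert into eval records.
    return [
        {"role": "system",
         "content": "You write question/SQL pairs for an eval dataset."},
        {"role": "user",
         "content": f"Given this schema:\n{schema}\n"
                    f"Write {n} questions with correct SQLite answers, "
                    "one JSON object per line with keys 'question' and 'answer'."},
    ]

# messages = build_generation_messages(schema, 10)
# response = client.chat.completions.create(model="gpt-4", messages=messages)
```

The generated pairs would then be converted into the `{"input": ..., "ideal": ...}` JSONL format described above.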
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## Running an evaluation\n",
"\n",
@@ -115,19 +136,16 @@
"oaieval gpt-3.5-turbo <name of eval>\n",
"\n",
"### Going through eval logs"
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [],
"metadata": {
"collapsed": false
}
},
"outputs": [],
"source": []
}
],
"metadata": {
