Create Getting_Started_with_OpenAI_Evals.ipynb

pull/1099/head
Shyamal H Anadkat 3 months ago
parent ed6194e621
commit 9969b97f16

@ -0,0 +1,137 @@
{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Getting Started with OpenAI Evals"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"\n",
"This notebook will go over:\n",
"* Introduction to OpenAI Evals library [enter link]\n",
"* What are Evals\n",
"* Building an Eval\n",
"* Running an Eval\n",
"\n",
"Evaluation is the process of validating and testing the outputs that your LLM applications are producing. Having strong evaluations (“evals”) will mean a more stable, reliable application which is resilient to code and model changes.An eval is basically a task used to measure the quality of output of an LLM or LLM system. Given an input prompt, an output is generated. We evaluate this output with a set of ideal_answers and find the quality of the LLM system.\n",
"\n",
"OpenAI Evals consists of:\n",
"1. A framework to evaluate an LLM (large language model) or a system built on top of an LLM.\n",
"2. An open-source registry of challenging evals\n",
"\n",
"*Why is it important to evaluate?*\n",
"\n",
"If you are building with LLMs, creating high quality evals is one of the most impactful things you can do. Without evals, it can be very difficult and time intensive to understand how different model versions might affect your use case. With OpenAIs new continuous model upgrades, evals allow you to efficiently test model performance for your use cases in a standardized way. Developing a suite of evals customized to your objectives will help you quickly and effectively understand how new models may perform for your use cases.\n",
"\n",
"*Types of Evals*\n",
"\n",
"The simplest and most common type of eval has an input and an ideal response or answer. For example,\n",
"we can have an eval sample where the input is “What year was Obama elected president for the first\n",
"time?” and the ideal answer is “2008”. We feed the input to a model and get the completion. If the model\n",
"says “2008”, it is then graded as correct. Eval samples are aggregated into an eval dataset that can\n",
"quantify overall performance within a certain topic. For example, this eval sample may be part of a\n",
"“president-election-years” eval that checks for every U.S. President, what year they were first elected.\n",
"Evals are not restricted to checking factual accuracy: all that is needed is a reproducible way to grade a\n",
"completion. Here are some other examples of valid evals:\n",
"* The input asks to write a short essay on a topic. The grading criteria is to check if the essay is of\n",
"* particular length or if certain keywords or themes are present in the completion.\n",
"* The input is to write a funny joke, and the grading criteria is to check how funny it was.\n",
"* The input is to follow a sequence of instructions, and the grading ensures that all instructions\n",
"were followed.\n",
"\n",
"In a naive implementation, we could just grade each completion by hand based on the criteria. Ideally,\n",
"wed like to automate the grading process to let these experiments scale to huge datasets. In the next\n",
"section, well talk about the ways in which weve automated eval grading.\n",
"Grading evals\n",
"\n",
"There are two main ways we can automatically grade completions: writing some validation logic in code\n",
"or using the model itself to inspect the answer. Well introduce each with some examples.\n",
"Writing logic for answer checking\n",
"\n",
"* Consider the Obama example from above, where the ideal response is 2008. We can write a\n",
"string match to check if the completion includes the phrase “2008”. If it does, we consider it\n",
"correct.\n",
"* Consider another eval where the input is to generate valid JSON: We can write some code that\n",
"attempts to parse the completion as JSON and then considers the completion correct if it is\n",
"parsable.\n",
"Model grading: A two stage process where the model first answers the question, then we ask a\n",
"model to look at the response to check if its correct.\n",
"* Consider an input that asks the model to write a funny joke. The model then generates a\n",
"completion. We then create a new input to the model to answer the question: “Is this following\n",
"joke funny? First reason step by step, then answer yes or no” that includes the completion. We\n",
"finally consider the original completion correct if the new model completion ends with “yes”.\n",
"Model grading works best with the latest, most powerful models like GPT-4 and if we give them the ability\n",
"to reason before making a judgment. Model grading will have an error rate, so it is important to validate\n",
"the performance with human evaluation before running the evals at scale. For best results, it makes\n",
"sense to use a different model to do grading from the one that did the completion, like using GPT-4 to\n",
"grade GPT-3.5 answers.\n"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## Building an evaluation for the OpenAI Evals framework\n"
],
"metadata": {
"collapsed": false
}
},
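{
"cell_type": "markdown",
"source": [
"The basic building blocks of an eval are a JSONL file of samples and a registry YAML entry that points an eval class at those samples. The cell below is a sketch, not a definitive recipe: it writes both files to a local `registry_sketch/` directory, mirrors the layout used by the openai/evals registry (`data/<eval_name>/samples.jsonl` plus an eval YAML), and reuses the hypothetical `president-election-years` example from above. Exact class names and registry layout may differ across versions of the evals repo.\n"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"import json\n",
"from pathlib import Path\n",
"\n",
"# Hypothetical eval name, reused from the example above.\n",
"eval_name = \"president-election-years\"\n",
"\n",
"# 1. Samples: each JSONL line has a chat-formatted input and an ideal answer.\n",
"samples = [\n",
"    {\n",
"        \"input\": [\n",
"            {\"role\": \"system\", \"content\": \"Answer with the year only.\"},\n",
"            {\"role\": \"user\", \"content\": \"What year was Obama elected president for the first time?\"},\n",
"        ],\n",
"        \"ideal\": \"2008\",\n",
"    },\n",
"    {\n",
"        \"input\": [\n",
"            {\"role\": \"system\", \"content\": \"Answer with the year only.\"},\n",
"            {\"role\": \"user\", \"content\": \"What year was George W. Bush first elected president?\"},\n",
"        ],\n",
"        \"ideal\": \"2000\",\n",
"    },\n",
"]\n",
"\n",
"data_dir = Path(\"registry_sketch\") / \"data\" / eval_name\n",
"data_dir.mkdir(parents=True, exist_ok=True)\n",
"with open(data_dir / \"samples.jsonl\", \"w\") as f:\n",
"    for sample in samples:\n",
"        f.write(json.dumps(sample) + \"\\n\")\n",
"\n",
"# 2. Registry entry for a basic exact-match eval (class path follows the\n",
"#    openai/evals conventions; treat it as an assumption for this sketch).\n",
"registry_yaml = \"\"\"\n",
"president-election-years:\n",
"  id: president-election-years.dev.v0\n",
"  metrics: [accuracy]\n",
"president-election-years.dev.v0:\n",
"  class: evals.elsuite.basic.match:Match\n",
"  args:\n",
"    samples_jsonl: president-election-years/samples.jsonl\n",
"\"\"\"\n",
"evals_dir = Path(\"registry_sketch\") / \"evals\"\n",
"evals_dir.mkdir(parents=True, exist_ok=True)\n",
"(evals_dir / (eval_name + \".yaml\")).write_text(registry_yaml.lstrip())\n",
"\n",
"print((data_dir / \"samples.jsonl\").read_text())\n"
],
"metadata": {
"collapsed": false
}
},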
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## Running an evaluation"
],
"metadata": {
"collapsed": false
}
},
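{
"cell_type": "markdown",
"source": [
"Evals are typically run from the command line with the `oaieval` CLI that ships with the openai/evals repo. The cell below is a sketch: it assumes the evals repo is installed (for example, `pip install -e .` from a clone of https://github.com/openai/evals) and that `OPENAI_API_KEY` is set in the environment, and it uses the repo's `test-match` example eval as a stand-in; swap in your own eval name, such as the hypothetical `president-election-years` built above, once it is registered.\n"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Sketch: run a registered eval against a completion model with the oaieval CLI.\n",
"# Replace \"test-match\" with your own eval name once it is in the registry.\n",
"!oaieval gpt-3.5-turbo test-match\n"
],
"metadata": {
"collapsed": false
}
},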
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [],
"metadata": {
"collapsed": false
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}