Add perplexity example to the `logprobs` user guide (#1071)

Add perplexity example to using logprobs cookbook
ankur-oai 2 months ago committed by GitHub
parent 76ed3e46f9
commit ed6194e621

@ -29,8 +29,11 @@
"3. Autocomplete\n",
"* `logprobs` could help us decide how to suggest words as a user is typing.\n",
"\n",
"4. Token highlighting and outputting bytes\\n\",\n",
"* Users can easily create a token highlighter using the built in tokenization that comes with enabling `logprobs`. Additionally, the bytes parameter includes the ASCII encoding of each output character, which is particularly useful for reproducing emojis and special characters.\""
"4. Token highlighting and outputting bytes\n",
"* Users can easily create a token highlighter using the built in tokenization that comes with enabling `logprobs`. Additionally, the bytes parameter includes the ASCII encoding of each output character, which is particularly useful for reproducing emojis and special characters.\n",
"\n",
"5. Calculating perplexity\n",
"* `logprobs` can be used to help us assess the model's overall confidence in a result and help us compare the confidence of results from different prompts."
]
},
{
@ -42,7 +45,7 @@
},
{
"cell_type": "code",
"execution_count": 264,
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
@ -57,7 +60,7 @@
},
{
"cell_type": "code",
"execution_count": 265,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
@ -764,7 +767,82 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Conclusion"
"## 5. Calculating perplexity\n",
"\n",
"When looking to assess the model's confidence in a result, it can be useful to calculate perplexity, which is a measure of the uncertainty. Perplexity can be calculated by exponentiating the negative of the average of the logprobs. Generally, a higher perplexity indicates a more uncertain result, and a lower perplexity indicates a more confident result. As such, perplexity can be used to both assess the result of an individual model run and also to compare the relative confidence of results between model runs. While a high confidence doesn't guarantee result accuracy, it can be a helpful signal that can be paired with other evaluation metrics to build a better understanding of your prompt's behavior.\n",
"\n",
"For example, let's say that I want to use `gpt-3.5-turbo` to learn more about artificial intelligence. I could ask a question about recent history and a question about the future:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Prompt: In a short sentence, has artifical intelligence grown in the last decade?\n",
"Response: Yes, artificial intelligence has grown significantly in the last decade. \n",
"\n",
"Tokens: Yes , artificial intelligence has grown significantly in the last decade .\n",
"Logprobs: -0.00 -0.00 -0.00 -0.00 -0.00 -0.53 -0.11 -0.00 -0.00 -0.01 -0.00 -0.00\n",
"Perplexity: 1.0564125277713383 \n",
"\n",
"Prompt: In a short sentence, what are your thoughts on the future of artificial intelligence?\n",
"Response: The future of artificial intelligence holds great potential for transforming industries and improving efficiency, but also raises ethical and societal concerns that must be carefully addressed. \n",
"\n",
"Tokens: The future of artificial intelligence holds great potential for transforming industries and improving efficiency , but also raises ethical and societal concerns that must be carefully addressed .\n",
"Logprobs: -0.19 -0.03 -0.00 -0.00 -0.00 -0.30 -0.51 -0.24 -0.03 -1.45 -0.23 -0.03 -0.22 -0.83 -0.48 -0.01 -0.38 -0.07 -0.47 -0.63 -0.18 -0.26 -0.01 -0.14 -0.00 -0.59 -0.55 -0.00\n",
"Perplexity: 1.3220795252314004 \n",
"\n"
]
}
],
"source": [
"prompts = [\n",
" \"In a short sentence, has artifical intelligence grown in the last decade?\",\n",
" \"In a short sentence, what are your thoughts on the future of artificial intelligence?\",\n",
"]\n",
"\n",
"for prompt in prompts:\n",
" API_RESPONSE = get_completion(\n",
" [{\"role\": \"user\", \"content\": prompt}],\n",
" model=\"gpt-3.5-turbo\",\n",
" logprobs=True,\n",
" )\n",
"\n",
" logprobs = [token.logprob for token in API_RESPONSE.choices[0].logprobs.content]\n",
" response_text = API_RESPONSE.choices[0].message.content\n",
" response_text_tokens = [token.token for token in API_RESPONSE.choices[0].logprobs.content]\n",
" max_starter_length = max(len(s) for s in [\"Prompt:\", \"Response:\", \"Tokens:\", \"Logprobs:\", \"Perplexity:\"])\n",
" max_token_length = max(len(s) for s in response_text_tokens)\n",
" \n",
"\n",
" formatted_response_tokens = [s.rjust(max_token_length) for s in response_text_tokens]\n",
" formatted_lps = [f\"{lp:.2f}\".rjust(max_token_length) for lp in logprobs]\n",
"\n",
" perplexity_score = np.exp(-np.mean(logprobs))\n",
" print(\"Prompt:\".ljust(max_starter_length), prompt)\n",
" print(\"Response:\".ljust(max_starter_length), response_text, \"\\n\")\n",
" print(\"Tokens:\".ljust(max_starter_length), \" \".join(formatted_response_tokens))\n",
" print(\"Logprobs:\".ljust(max_starter_length), \" \".join(formatted_lps))\n",
" print(\"Perplexity:\".ljust(max_starter_length), perplexity_score, \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, `gpt-3.5-turbo` returned a lower perplexity score for a more deterministic question about recent history, and a higher perplexity score for a more speculative assessment about the near future. Again, while these differences don't guarantee accuracy, they help point the way for our interpretation of the model's results and our future use of them."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Conclusion"
]
},
{
@ -778,7 +856,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Possible extensions"
"## 7. Possible extensions"
]
},
{
@ -786,7 +864,6 @@
"metadata": {},
"source": [
"There are many other use cases for `logprobs` that are not covered in this cookbook. We can use `logprobs` for:\n",
" - Evaluations (e.g.: calculate `perplexity` of outputs, which is the evaluation metric of uncertainty or surprise of the model at its outcomes)\n",
" - Moderation\n",
" - Keyword selection\n",
" - Improve prompts and interpretability of outputs\n",
@ -811,9 +888,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
"version": "3.11.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}
