openai-cookbook/examples/Using_logprobs.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using logprobs for classification and Q&A evaluation\n",
    "\n",
    "This notebook demonstrates the use of the `logprobs` parameter in the Chat Completions API. When `logprobs` is enabled, the API returns the log probabilities of each output token, along with a limited number of the most likely tokens at each token position and their log probabilities. The relevant request parameters are:\n",
    "* `logprobs`: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the content of message. This option is currently not available on the `gpt-4-vision-preview` model.\n",
    "* `top_logprobs`: An integer between 0 and 5 specifying the number of most likely tokens to return at each token position, each with an associated log probability. `logprobs` must be set to true if this parameter is used.\n",
    "\n",
    "Log probabilities of output tokens indicate the likelihood of each token occurring in the sequence given the context. To simplify, a logprob is `log(p)`, where `p` = probability of a token occurring at a specific position based on the previous tokens in the context. Some key points about `logprobs`:\n",
    "* Higher log probabilities suggest a higher likelihood of the token in that context. This allows users to gauge the model's confidence in its output or explore alternative responses the model considered.\n",
    "* Logprob can be any negative number or `0.0`. `0.0` corresponds to 100% probability.\n",
    "* Logprobs allow us to compute the joint probability of a sequence as the sum of the logprobs of the individual tokens. This is useful for scoring and ranking model outputs. Another common approach is to take the average per-token logprob of a sentence to choose the best generation.\n",
    "* We can examine the `logprobs` assigned to different candidate tokens to understand what options the model considered plausible or implausible.\n",
    "\n",
    "While there are a wide array of use cases for `logprobs`, this notebook will focus on its use for:\n",
    "\n",
    "1. Classification tasks\n",
    "\n",
    "* Large Language Models excel at many classification tasks, but accurately measuring the model's confidence in its outputs can be challenging. `logprobs` provide a probability associated with each class prediction, enabling users to set their own classification or confidence thresholds.\n",
    "\n",
    "2. Retrieval (Q&A) evaluation\n",
    "\n",
    "* `logprobs` can assist with self-evaluation in retrieval applications. In the Q&A example, the model outputs a contrived `has_sufficient_context_for_answer` boolean, which can serve as a confidence score of whether the answer is contained in the retrieved content. Evaluations of this type can reduce retrieval-based hallucinations and enhance accuracy.\n",
    "\n",
    "3. Autocomplete\n",
    "*  `logprobs` could help us decide how to suggest words as a user is typing.\n",
    "\n",
    "4. Token highlighting and outputting bytes\n",
    "*  Users can easily create a token highlighter using the built in tokenization that comes with enabling `logprobs`. Additionally, the bytes parameter includes the ASCII encoding of each output character, which is particularly useful for reproducing emojis and special characters.\n",
    "\n",
    "5. Calculating perplexity\n",
    "* `logprobs` can be used to help us assess the model's overall confidence in a result and help us compare the confidence of results from different prompts."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0. Imports and utils"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from openai import OpenAI\n",
    "from math import exp\n",
    "import numpy as np\n",
    "from IPython.display import display, HTML\n",
    "import os\n",
    "\n",
    "client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"<your OpenAI API key if not set as env var>\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_completion(\n",
    "    messages: list[dict[str, str]],\n",
    "    model: str = \"gpt-4\",\n",
    "    max_tokens=500,\n",
    "    temperature=0,\n",
    "    stop=None,\n",
    "    seed=123,\n",
    "    tools=None,\n",
    "    logprobs=None,  # whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the content of message..\n",
    "    top_logprobs=None,\n",
    ") -> str:\n",
    "    params = {\n",
    "        \"model\": model,\n",
    "        \"messages\": messages,\n",
    "        \"max_tokens\": max_tokens,\n",
    "        \"temperature\": temperature,\n",
    "        \"stop\": stop,\n",
    "        \"seed\": seed,\n",
    "        \"logprobs\": logprobs,\n",
    "        \"top_logprobs\": top_logprobs,\n",
    "    }\n",
    "    if tools:\n",
    "        params[\"tools\"] = tools\n",
    "\n",
    "    completion = client.chat.completions.create(**params)\n",
    "    return completion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Using `logprobs` to assess confidence for classification tasks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's say we want to create a system to classify news articles into a set of pre-defined categories. Without `logprobs`, we can use Chat Completions to do this, but it is much more difficult to assess the certainty with which the model made its classifications.\n",
    "\n",
    "Now, with `logprobs` enabled, we can see exactly how confident the model is in its predictions, which is crucial for creating an accurate and trustworthy classifier. For example, if the log probability for the chosen category is high, this suggests the model is quite confident in its classification. If it's low, this suggests the model is less confident. This can be particularly useful in cases where the model's classification is not what you expected, or when the model's output needs to be reviewed or validated by a human."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll begin with a prompt that presents the model with four categories: **Technology, Politics, Sports, and Arts**. The model is then tasked with classifying articles into these categories based solely on their headlines."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 266,
   "metadata": {},
   "outputs": [],
   "source": [
    "CLASSIFICATION_PROMPT = \"\"\"You will be given a headline of a news article.\n",
    "Classify the article into one of the following categories: Technology, Politics, Sports, and Art.\n",
    "Return only the name of the category, and nothing else.\n",
    "MAKE SURE your output is one of the four categories stated.\n",
    "Article headline: {headline}\"\"\"\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at three sample headlines, and first begin with a standard Chat Completions output, without `logprobs`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 267,
   "metadata": {},
   "outputs": [],
   "source": [
    "headlines = [\n",
    "    \"Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.\",\n",
    "    \"Local Mayor Launches Initiative to Enhance Urban Public Transport.\",\n",
    "    \"Tennis Champion Showcases Hidden Talents in Symphony Orchestra Debut\",\n",
    "]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 268,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Headline: Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.\n",
      "Category: Technology\n",
      "\n",
      "\n",
      "Headline: Local Mayor Launches Initiative to Enhance Urban Public Transport.\n",
      "Category: Politics\n",
      "\n",
      "\n",
      "Headline: Tennis Champion Showcases Hidden Talents in Symphony Orchestra Debut\n",
      "Category: Art\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for headline in headlines:\n",
    "    print(f\"\\nHeadline: {headline}\")\n",
    "    API_RESPONSE = get_completion(\n",
    "        [{\"role\": \"user\", \"content\": CLASSIFICATION_PROMPT.format(headline=headline)}],\n",
    "        model=\"gpt-4\",\n",
    "    )\n",
    "    print(f\"Category: {API_RESPONSE.choices[0].message.content}\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we can see the selected category for each headline. However, we have no visibility into the confidence of the model in its predictions. Let's rerun the same prompt but with `logprobs` enabled, and `top_logprobs` set to 2 (this will show us the 2 most likely output tokens for each token). Additionally we can also output the linear probability of each output token, in order to convert the log probability to the more easily interprable scale of 0-100%. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 269,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Headline: Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<span style='color: cyan'>Output token 1:</span> Technology, <span style='color: darkorange'>logprobs:</span> -2.4584822e-06, <span style='color: magenta'>linear probability:</span> 100.0%<br><span style='color: cyan'>Output token 2:</span> Techn, <span style='color: darkorange'>logprobs:</span> -13.781253, <span style='color: magenta'>linear probability:</span> 0.0%<br>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\n",
      "Headline: Local Mayor Launches Initiative to Enhance Urban Public Transport.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<span style='color: cyan'>Output token 1:</span> Politics, <span style='color: darkorange'>logprobs:</span> -2.4584822e-06, <span style='color: magenta'>linear probability:</span> 100.0%<br><span style='color: cyan'>Output token 2:</span> Technology, <span style='color: darkorange'>logprobs:</span> -13.937503, <span style='color: magenta'>linear probability:</span> 0.0%<br>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\n",
      "Headline: Tennis Champion Showcases Hidden Talents in Symphony Orchestra Debut\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<span style='color: cyan'>Output token 1:</span> Art, <span style='color: darkorange'>logprobs:</span> -0.009169078, <span style='color: magenta'>linear probability:</span> 99.09%<br><span style='color: cyan'>Output token 2:</span> Sports, <span style='color: darkorange'>logprobs:</span> -4.696669, <span style='color: magenta'>linear probability:</span> 0.91%<br>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for headline in headlines:\n",
    "    print(f\"\\nHeadline: {headline}\")\n",
    "    API_RESPONSE = get_completion(\n",
    "        [{\"role\": \"user\", \"content\": CLASSIFICATION_PROMPT.format(headline=headline)}],\n",
    "        model=\"gpt-4\",\n",
    "        logprobs=True,\n",
    "        top_logprobs=2,\n",
    "    )\n",
    "    top_two_logprobs = API_RESPONSE.choices[0].logprobs.content[0].top_logprobs\n",
    "    html_content = \"\"\n",
    "    for i, logprob in enumerate(top_two_logprobs, start=1):\n",
    "        html_content += (\n",
    "            f\"<span style='color: cyan'>Output token {i}:</span> {logprob.token}, \"\n",
    "            f\"<span style='color: darkorange'>logprobs:</span> {logprob.logprob}, \"\n",
    "            f\"<span style='color: magenta'>linear probability:</span> {np.round(np.exp(logprob.logprob)*100,2)}%<br>\"\n",
    "        )\n",
    "    display(HTML(html_content))\n",
    "    print(\"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected from the first two headlines, `gpt-4` is nearly 100% confident in its classifications, as the content is clearly technology and politics focused respectively. However, the third headline combines both sports and art-related themes, so we see the model is less confident in its selection.\n",
    "\n",
    "This shows how important using `logprobs` can be, as if we are using LLMs for classification tasks we can set confidence theshholds, or output several potential output tokens if the log probability of the selected output is not sufficiently high. For instance, if we are creating a recommendation engine to tag articles, we can automatically classify headlines crossing a certain threshold, and send the less certain headlines for manual review."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Retrieval confidence scoring to reduce hallucinations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To reduce hallucinations, and the performance of our RAG-based Q&A system, we can use `logprobs` to evaluate how confident the model is in its retrieval."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's say we have built a retrieval system using RAG for Q&A, but are struggling with hallucinated answers to our questions. *Note:* we will use a hardcoded article for this example, but see other entries in the cookbook for tutorials on using RAG for Q&A."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 270,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Article retrieved\n",
    "ada_lovelace_article = \"\"\"Augusta Ada King, Countess of Lovelace (née Byron; 10 December 1815 – 27 November 1852) was an English mathematician and writer, chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine. She was the first to recognise that the machine had applications beyond pure calculation.\n",
    "Ada Byron was the only legitimate child of poet Lord Byron and reformer Lady Byron. All Lovelace's half-siblings, Lord Byron's other children, were born out of wedlock to other women. Byron separated from his wife a month after Ada was born and left England forever. He died in Greece when Ada was eight. Her mother was anxious about her upbringing and promoted Ada's interest in mathematics and logic in an effort to prevent her from developing her father's perceived insanity. Despite this, Ada remained interested in him, naming her two sons Byron and Gordon. Upon her death, she was buried next to him at her request. Although often ill in her childhood, Ada pursued her studies assiduously. She married William King in 1835. King was made Earl of Lovelace in 1838, Ada thereby becoming Countess of Lovelace.\n",
    "Her educational and social exploits brought her into contact with scientists such as Andrew Crosse, Charles Babbage, Sir David Brewster, Charles Wheatstone, Michael Faraday, and the author Charles Dickens, contacts which she used to further her education. Ada described her approach as \"poetical science\" and herself as an \"Analyst (& Metaphysician)\".\n",
    "When she was eighteen, her mathematical talents led her to a long working relationship and friendship with fellow British mathematician Charles Babbage, who is known as \"the father of computers\". She was in particular interested in Babbage's work on the Analytical Engine. Lovelace first met him in June 1833, through their mutual friend, and her private tutor, Mary Somerville.\n",
    "Between 1842 and 1843, Ada translated an article by the military engineer Luigi Menabrea (later Prime Minister of Italy) about the Analytical Engine, supplementing it with an elaborate set of seven notes, simply called \"Notes\".\n",
    "Lovelace's notes are important in the early history of computers, especially since the seventh one contained what many consider to be the first computer program—that is, an algorithm designed to be carried out by a machine. Other historians reject this perspective and point out that Babbage's personal notes from the years 1836/1837 contain the first programs for the engine. She also developed a vision of the capability of computers to go beyond mere calculating or number-crunching, while many others, including Babbage himself, focused only on those capabilities. Her mindset of \"poetical science\" led her to ask questions about the Analytical Engine (as shown in her notes) examining how individuals and society relate to technology as a collaborative tool.\n",
    "\"\"\"\n",
    "\n",
    "# Questions that can be easily answered given the article\n",
    "easy_questions = [\n",
    "    \"What nationality was Ada Lovelace?\",\n",
    "    \"What was an important finding from Lovelace's seventh note?\",\n",
    "]\n",
    "\n",
    "# Questions that are not fully covered in the article\n",
    "medium_questions = [\n",
    "    \"Did Lovelace collaborate with Charles Dickens\",\n",
    "    \"What concepts did Lovelace build with Charles Babbage\",\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, what we can do is ask the model to respond to the question, but then also evaluate its response. Specifically, we will ask the model to output a boolean `has_sufficient_context_for_answer`. We can then evaluate the `logprobs` to see just how confident the model is that its answer was contained in the provided context"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 271,
   "metadata": {},
   "outputs": [],
   "source": [
    "PROMPT = \"\"\"You retrieved this article: {article}. The question is: {question}.\n",
    "Before even answering the question, consider whether you have sufficient information in the article to answer the question fully.\n",
    "Your output should JUST be the boolean true or false, of if you have sufficient information in the article to answer the question.\n",
    "Respond with just one word, the boolean true or false. You must output the word 'True', or the word 'False', nothing else.\n",
    "\"\"\"\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 272,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "Questions clearly answered in article<p style=\"color:green\">Question: What nationality was Ada Lovelace?</p><p style=\"color:cyan\">has_sufficient_context_for_answer: True, <span style=\"color:darkorange\">logprobs: -3.1281633e-07, <span style=\"color:magenta\">linear probability: 100.0%</span></p><p style=\"color:green\">Question: What was an important finding from Lovelace's seventh note?</p><p style=\"color:cyan\">has_sufficient_context_for_answer: True, <span style=\"color:darkorange\">logprobs: -7.89631e-07, <span style=\"color:magenta\">linear probability: 100.0%</span></p>Questions only partially covered in the article<p style=\"color:green\">Question: Did Lovelace collaborate with Charles Dickens</p><p style=\"color:cyan\">has_sufficient_context_for_answer: True, <span style=\"color:darkorange\">logprobs: -0.06993677, <span style=\"color:magenta\">linear probability: 93.25%</span></p><p style=\"color:green\">Question: What concepts did Lovelace build with Charles Babbage</p><p style=\"color:cyan\">has_sufficient_context_for_answer: False, <span style=\"color:darkorange\">logprobs: -0.61807257, <span style=\"color:magenta\">linear probability: 53.9%</span></p>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "html_output = \"\"\n",
    "html_output += \"Questions clearly answered in article\"\n",
    "\n",
    "for question in easy_questions:\n",
    "    API_RESPONSE = get_completion(\n",
    "        [\n",
    "            {\n",
    "                \"role\": \"user\",\n",
    "                \"content\": PROMPT.format(\n",
    "                    article=ada_lovelace_article, question=question\n",
    "                ),\n",
    "            }\n",
    "        ],\n",
    "        model=\"gpt-4\",\n",
    "        logprobs=True,\n",
    "    )\n",
    "    html_output += f'<p style=\"color:green\">Question: {question}</p>'\n",
    "    for logprob in API_RESPONSE.choices[0].logprobs.content:\n",
    "        html_output += f'<p style=\"color:cyan\">has_sufficient_context_for_answer: {logprob.token}, <span style=\"color:darkorange\">logprobs: {logprob.logprob}, <span style=\"color:magenta\">linear probability: {np.round(np.exp(logprob.logprob)*100,2)}%</span></p>'\n",
    "\n",
    "html_output += \"Questions only partially covered in the article\"\n",
    "\n",
    "for question in medium_questions:\n",
    "    API_RESPONSE = get_completion(\n",
    "        [\n",
    "            {\n",
    "                \"role\": \"user\",\n",
    "                \"content\": PROMPT.format(\n",
    "                    article=ada_lovelace_article, question=question\n",
    "                ),\n",
    "            }\n",
    "        ],\n",
    "        model=\"gpt-4\",\n",
    "        logprobs=True,\n",
    "        top_logprobs=3,\n",
    "    )\n",
    "    html_output += f'<p style=\"color:green\">Question: {question}</p>'\n",
    "    for logprob in API_RESPONSE.choices[0].logprobs.content:\n",
    "        html_output += f'<p style=\"color:cyan\">has_sufficient_context_for_answer: {logprob.token}, <span style=\"color:darkorange\">logprobs: {logprob.logprob}, <span style=\"color:magenta\">linear probability: {np.round(np.exp(logprob.logprob)*100,2)}%</span></p>'\n",
    "\n",
    "display(HTML(html_output))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the first two questions, our model asserts with (near) 100% confidence that the article has sufficient context to answer the posed questions.<br><br>\n",
    "On the other hand, for the more tricky questions which are less clearly answered in the article, the model is less confident that it has sufficient context. This is a great guardrail to help ensure our retrieved content is sufficient.<br><br>\n",
    "This self-evaluation can help reduce hallucinations, as you can restrict answers or re-prompt the user when your `sufficient_context_for_answer` log probability is below a certain threshold. Methods like this have been shown to significantly reduce RAG for Q&A hallucinations and errors ([Example](https://jfan001.medium.com/how-we-cut-the-rate-of-gpt-hallucinations-from-20-to-less-than-2-f3bfcc10e4ec)) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Autocomplete"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another use case for `logprobs` are autocomplete systems. Without creating the entire autocomplete system end-to-end, let's demonstrate how `logprobs` could help us decide how to suggest words as a user is typing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's come up with a sample sentence: `\"My least favorite TV show is Breaking Bad.\"` Let's say we want it to dynamically recommend the next word or token as we are typing the sentence, but *only* if the model is quite sure of what the next word will be. To demonstrate this, let's break up the sentence into sequential components."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 273,
   "metadata": {},
   "outputs": [],
   "source": [
    "sentence_list = [\n",
    "    \"My\",\n",
    "    \"My least\",\n",
    "    \"My least favorite\",\n",
    "    \"My least favorite TV\",\n",
    "    \"My least favorite TV show\",\n",
    "    \"My least favorite TV show is\",\n",
    "    \"My least favorite TV show is Breaking Bad\",\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can ask `gpt-3.5-turbo` to act as an autocomplete engine with whatever context the model is given. We can enable `logprobs` and can see how confident the model is in its prediction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 274,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<p>Sentence: My</p><p style=\"color:cyan\">Predicted next token: favorite, <span style=\"color:darkorange\">logprobs: -0.18245785, <span style=\"color:magenta\">linear probability: 83.32%</span></p><p style=\"color:cyan\">Predicted next token: dog, <span style=\"color:darkorange\">logprobs: -2.397172, <span style=\"color:magenta\">linear probability: 9.1%</span></p><p style=\"color:cyan\">Predicted next token: ap, <span style=\"color:darkorange\">logprobs: -3.8732424, <span style=\"color:magenta\">linear probability: 2.08%</span></p><br><p>Sentence: My least</p><p style=\"color:cyan\">Predicted next token: favorite, <span style=\"color:darkorange\">logprobs: -0.0146376295, <span style=\"color:magenta\">linear probability: 98.55%</span></p><p style=\"color:cyan\">Predicted next token: My, <span style=\"color:darkorange\">logprobs: -4.2417912, <span style=\"color:magenta\">linear probability: 1.44%</span></p><p style=\"color:cyan\">Predicted next token:  favorite, <span style=\"color:darkorange\">logprobs: -9.748788, <span style=\"color:magenta\">linear probability: 0.01%</span></p><br><p>Sentence: My least favorite</p><p style=\"color:cyan\">Predicted next token: food, <span style=\"color:darkorange\">logprobs: -0.9481721, <span style=\"color:magenta\">linear probability: 38.74%</span></p><p style=\"color:cyan\">Predicted next token: My, <span style=\"color:darkorange\">logprobs: -1.3447137, <span style=\"color:magenta\">linear probability: 26.06%</span></p><p style=\"color:cyan\">Predicted next token: color, <span style=\"color:darkorange\">logprobs: -1.3887696, <span style=\"color:magenta\">linear probability: 24.94%</span></p><br><p>Sentence: My least favorite TV</p><p style=\"color:cyan\">Predicted next token: show, <span style=\"color:darkorange\">logprobs: -0.0007898556, <span style=\"color:magenta\">linear probability: 99.92%</span></p><p style=\"color:cyan\">Predicted next token: My, <span style=\"color:darkorange\">logprobs: -7.711523, <span style=\"color:magenta\">linear probability: 0.04%</span></p><p style=\"color:cyan\">Predicted next token: series, <span style=\"color:darkorange\">logprobs: -9.348547, <span style=\"color:magenta\">linear probability: 0.01%</span></p><br><p>Sentence: My least favorite TV show</p><p style=\"color:cyan\">Predicted next token: is, <span style=\"color:darkorange\">logprobs: -0.2851253, <span style=\"color:magenta\">linear probability: 75.19%</span></p><p style=\"color:cyan\">Predicted next token: of, <span style=\"color:darkorange\">logprobs: -1.55335, <span style=\"color:magenta\">linear probability: 21.15%</span></p><p style=\"color:cyan\">Predicted next token: My, <span style=\"color:darkorange\">logprobs: -3.4928775, <span style=\"color:magenta\">linear probability: 3.04%</span></p><br><p>Sentence: My least favorite TV show is</p><p style=\"color:cyan\">Predicted next token: \"My, <span style=\"color:darkorange\">logprobs: -0.69349754, <span style=\"color:magenta\">linear probability: 49.98%</span></p><p style=\"color:cyan\">Predicted next token: \"The, <span style=\"color:darkorange\">logprobs: -1.2899293, <span style=\"color:magenta\">linear probability: 27.53%</span></p><p style=\"color:cyan\">Predicted next token: My, <span style=\"color:darkorange\">logprobs: -2.4170141, <span style=\"color:magenta\">linear probability: 8.92%</span></p><br><p>Sentence: My least favorite TV show is Breaking Bad</p><p style=\"color:cyan\">Predicted next token: because, <span style=\"color:darkorange\">logprobs: -0.17786823, <span style=\"color:magenta\">linear probability: 83.71%</span></p><p style=\"color:cyan\">Predicted next token: ,, <span style=\"color:darkorange\">logprobs: -2.3946173, <span style=\"color:magenta\">linear probability: 9.12%</span></p><p style=\"color:cyan\">Predicted next token: ., <span style=\"color:darkorange\">logprobs: -3.1861975, <span style=\"color:magenta\">linear probability: 4.13%</span></p><br>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "high_prob_completions = {}\n",
    "low_prob_completions = {}\n",
    "html_output = \"\"\n",
    "\n",
    "for sentence in sentence_list:\n",
    "    PROMPT = \"\"\"Complete this sentence. You are acting as auto-complete. Simply complete the sentence to the best of your ability, make sure it is just ONE sentence: {sentence}\"\"\"\n",
    "    API_RESPONSE = get_completion(\n",
    "        [{\"role\": \"user\", \"content\": PROMPT.format(sentence=sentence)}],\n",
    "        model=\"gpt-3.5-turbo\",\n",
    "        logprobs=True,\n",
    "        top_logprobs=3,\n",
    "    )\n",
    "    html_output += f'<p>Sentence: {sentence}</p>'\n",
    "    first_token = True\n",
    "    for token in API_RESPONSE.choices[0].logprobs.content[0].top_logprobs:\n",
    "        html_output += f'<p style=\"color:cyan\">Predicted next token: {token.token}, <span style=\"color:darkorange\">logprobs: {token.logprob}, <span style=\"color:magenta\">linear probability: {np.round(np.exp(token.logprob)*100,2)}%</span></p>'\n",
    "        if first_token:\n",
    "            if np.exp(token.logprob) > 0.95:\n",
    "                high_prob_completions[sentence] = token.token\n",
    "            if np.exp(token.logprob) < 0.60:\n",
    "                low_prob_completions[sentence] = token.token\n",
    "        first_token = False\n",
    "    html_output += \"<br>\"\n",
    "\n",
    "display(HTML(html_output))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the high confidence autocompletions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 275,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'My least': 'favorite', 'My least favorite TV': 'show'}"
      ]
     },
     "execution_count": 275,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "high_prob_completions\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These look reasonable! We can feel confident in those suggestions. It's pretty likely you want to write 'show' after writing 'My least favorite TV'! Now let's look at the autocompletion suggestions the model was less confident about:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 276,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'My least favorite': 'food', 'My least favorite TV show is': '\"My'}"
      ]
     },
     "execution_count": 276,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "low_prob_completions\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These are logical as well. It's pretty unclear what the user is going to say with just the prefix 'my least favorite', and it's really anyone's guess what the author's favorite TV show is. <br><br>\n",
    "So, using `gpt-3.5-turbo`, we can create the root of a dynamic autocompletion engine with `logprobs`!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Highlighter and bytes parameter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's quickly touch on creating a simple token highlighter with `logprobs`, and using the bytes parameter. First, we can create a function that counts and highlights each token. While this doesn't use the log probabilities, it uses the built in tokenization that comes with enabling `logprobs`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 277,
   "metadata": {},
   "outputs": [],
   "source": [
    "PROMPT = \"\"\"What's the longest word in the English language?\"\"\"\n",
    "\n",
    "API_RESPONSE = get_completion(\n",
    "    [{\"role\": \"user\", \"content\": PROMPT}], model=\"gpt-4\", logprobs=True, top_logprobs=5\n",
    ")\n",
    "\n",
    "\n",
    "def highlight_text(api_response):\n",
    "    colors = [\n",
    "        \"#FF00FF\",  # Magenta\n",
    "        \"#008000\",  # Green\n",
    "        \"#FF8C00\",  # Dark Orange\n",
    "        \"#FF0000\",  # Red\n",
    "        \"#0000FF\",  # Blue\n",
    "    ]\n",
    "    tokens = api_response.choices[0].logprobs.content\n",
    "\n",
    "    color_idx = 0  # Initialize color index\n",
    "    html_output = \"\"  # Initialize HTML output\n",
    "    for t in tokens:\n",
    "        token_str = bytes(t.bytes).decode(\"utf-8\")  # Decode bytes to string\n",
    "\n",
    "        # Add colored token to HTML output\n",
    "        html_output += f\"<span style='color: {colors[color_idx]}'>{token_str}</span>\"\n",
    "\n",
    "        # Move to the next color\n",
    "        color_idx = (color_idx + 1) % len(colors)\n",
    "    display(HTML(html_output))  # Display HTML output\n",
    "    print(f\"Total number of tokens: {len(tokens)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 278,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<span style='color: #FF00FF'>The</span><span style='color: #008000'> longest</span><span style='color: #FF8C00'> word</span><span style='color: #FF0000'> in</span><span style='color: #0000FF'> the</span><span style='color: #FF00FF'> English</span><span style='color: #008000'> language</span><span style='color: #FF8C00'>,</span><span style='color: #FF0000'> according</span><span style='color: #0000FF'> to</span><span style='color: #FF00FF'> the</span><span style='color: #008000'> Guinness</span><span style='color: #FF8C00'> World</span><span style='color: #FF0000'> Records</span><span style='color: #0000FF'>,</span><span style='color: #FF00FF'> is</span><span style='color: #008000'> '</span><span style='color: #FF8C00'>p</span><span style='color: #FF0000'>ne</span><span style='color: #0000FF'>um</span><span style='color: #FF00FF'>on</span><span style='color: #008000'>oul</span><span style='color: #FF8C00'>tram</span><span style='color: #FF0000'>icro</span><span style='color: #0000FF'>sc</span><span style='color: #FF00FF'>op</span><span style='color: #008000'>ics</span><span style='color: #FF8C00'>il</span><span style='color: #FF0000'>ic</span><span style='color: #0000FF'>ov</span><span style='color: #FF00FF'>ol</span><span style='color: #008000'>cano</span><span style='color: #FF8C00'>con</span><span style='color: #FF0000'>iosis</span><span style='color: #0000FF'>'.</span><span style='color: #FF00FF'> It</span><span style='color: #008000'> is</span><span style='color: #FF8C00'> a</span><span style='color: #FF0000'> type</span><span style='color: #0000FF'> of</span><span style='color: #FF00FF'> lung</span><span style='color: #008000'> disease</span><span style='color: #FF8C00'> caused</span><span style='color: #FF0000'> by</span><span style='color: #0000FF'> inh</span><span style='color: #FF00FF'>aling</span><span style='color: #008000'> ash</span><span style='color: #FF8C00'> and</span><span style='color: #FF0000'> sand</span><span style='color: #0000FF'> dust</span><span style='color: #FF00FF'>.</span>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total number of tokens: 51\n"
     ]
    }
   ],
   "source": [
    "highlight_text(API_RESPONSE)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's reconstruct a sentence using the bytes parameter. With `logprobs` enabled, we are given both each token and the ASCII (decimal utf-8) values of the token string. These ASCII values can be helpful when handling tokens of or containing emojis or special characters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 279,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Token: \\xf0\\x9f\\x92\n",
      "Log prob: -0.0003056686\n",
      "Linear prob: 99.97 %\n",
      "Bytes: [240, 159, 146] \n",
      "\n",
      "Token: \\x99\n",
      "Log prob: 0.0\n",
      "Linear prob: 100.0 %\n",
      "Bytes: [153] \n",
      "\n",
      "Token:  -\n",
      "Log prob: -0.0096905725\n",
      "Linear prob: 99.04 %\n",
      "Bytes: [32, 45] \n",
      "\n",
      "Token:  Blue\n",
      "Log prob: -0.00042042506\n",
      "Linear prob: 99.96 %\n",
      "Bytes: [32, 66, 108, 117, 101] \n",
      "\n",
      "Token:  Heart\n",
      "Log prob: -7.302705e-05\n",
      "Linear prob: 99.99 %\n",
      "Bytes: [32, 72, 101, 97, 114, 116] \n",
      "\n",
      "Bytes array: [240, 159, 146, 153, 32, 45, 32, 66, 108, 117, 101, 32, 72, 101, 97, 114, 116]\n",
      "Decoded bytes: 💙 - Blue Heart\n",
      "Joint prob: 98.96 %\n"
     ]
    }
   ],
   "source": [
    "PROMPT = \"\"\"Output the blue heart emoji and its name.\"\"\"\n",
    "API_RESPONSE = get_completion(\n",
    "    [{\"role\": \"user\", \"content\": PROMPT}], model=\"gpt-4\", logprobs=True\n",
    ")\n",
    "\n",
    "aggregated_bytes = []\n",
    "joint_logprob = 0.0\n",
    "\n",
    "# Iterate over tokens, aggregate bytes and calculate joint logprob\n",
    "for token in API_RESPONSE.choices[0].logprobs.content:\n",
    "    print(\"Token:\", token.token)\n",
    "    print(\"Log prob:\", token.logprob)\n",
    "    print(\"Linear prob:\", np.round(exp(token.logprob) * 100, 2), \"%\")\n",
    "    print(\"Bytes:\", token.bytes, \"\\n\")\n",
    "    aggregated_bytes += token.bytes\n",
    "    joint_logprob += token.logprob\n",
    "\n",
    "# Decode the aggregated bytes to text\n",
    "aggregated_text = bytes(aggregated_bytes).decode(\"utf-8\")\n",
    "\n",
    "# Assert that the decoded text is the same as the message content\n",
    "assert API_RESPONSE.choices[0].message.content == aggregated_text\n",
    "\n",
    "# Print the results\n",
    "print(\"Bytes array:\", aggregated_bytes)\n",
    "print(f\"Decoded bytes: {aggregated_text}\")\n",
    "print(\"Joint prob:\", np.round(exp(joint_logprob) * 100, 2), \"%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we see that while the first token was `\\xf0\\x9f\\x92'`, we can get its ASCII value and append it to a bytes array. Then, we can easily decode this array into a full sentence, and validate with our assert statement that the decoded bytes is the same as our completion message!\n",
    "\n",
    "Additionally, we can get the joint probability of the entire completion, which is the exponentiated product of each token's log probability. This gives us how `likely` this given completion is given the prompt. Since, our prompt is quite directive (asking for a certain emoji and its name), the joint probability of this output is high! If we ask for a random output however, we'll see a much lower joint probability. This can also be a good tactic for developers during prompt engineering. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Calculating perplexity\n",
    "\n",
    "When looking to assess the model's confidence in a result, it can be useful to calculate perplexity, which is a measure of the uncertainty. Perplexity can be calculated by exponentiating the negative of the average of the logprobs. Generally, a higher perplexity indicates a more uncertain result, and a lower perplexity indicates a more confident result. As such, perplexity can be used to both assess the result of an individual model run and also to compare the relative confidence of results between model runs. While a high confidence doesn't guarantee result accuracy, it can be a helpful signal that can be paired with other evaluation metrics to build a better understanding of your prompt's behavior.\n",
    "\n",
    "For example, let's say that I want to use `gpt-3.5-turbo` to learn more about artificial intelligence. I could ask a question about recent history and a question about the future:"
   ]
  },
  {
    "cell_type": "code",
    "execution_count": 4,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "Prompt:     In a short sentence, has artifical intelligence grown in the last decade?\n",
       "Response:   Yes, artificial intelligence has grown significantly in the last decade. \n",
       "\n",
       "Tokens:                Yes              ,     artificial   intelligence            has          grown  significantly             in            the           last         decade              .\n",
       "Logprobs:            -0.00          -0.00          -0.00          -0.00          -0.00          -0.53          -0.11          -0.00          -0.00          -0.01          -0.00          -0.00\n",
       "Perplexity: 1.0564125277713383 \n",
       "\n",
       "Prompt:     In a short sentence, what are your thoughts on the future of artificial intelligence?\n",
       "Response:   The future of artificial intelligence holds great potential for transforming industries and improving efficiency, but also raises ethical and societal concerns that must be carefully addressed. \n",
       "\n",
       "Tokens:               The        future            of    artificial  intelligence         holds         great     potential           for  transforming    industries           and     improving    efficiency             ,           but          also        raises       ethical           and      societal      concerns          that          must            be     carefully     addressed             .\n",
       "Logprobs:           -0.19         -0.03         -0.00         -0.00         -0.00         -0.30         -0.51         -0.24         -0.03         -1.45         -0.23         -0.03         -0.22         -0.83         -0.48         -0.01         -0.38         -0.07         -0.47         -0.63         -0.18         -0.26         -0.01         -0.14         -0.00         -0.59         -0.55         -0.00\n",
       "Perplexity: 1.3220795252314004 \n",
       "\n"
      ]
     }
    ],
    "source": [
     "prompts = [\n",
     "    \"In a short sentence, has artifical intelligence grown in the last decade?\",\n",
     "    \"In a short sentence, what are your thoughts on the future of artificial intelligence?\",\n",
     "]\n",
     "\n",
     "for prompt in prompts:\n",
     "    API_RESPONSE = get_completion(\n",
     "        [{\"role\": \"user\", \"content\": prompt}],\n",
     "        model=\"gpt-3.5-turbo\",\n",
     "        logprobs=True,\n",
     "    )\n",
     "\n",
     "    logprobs = [token.logprob for token in API_RESPONSE.choices[0].logprobs.content]\n",
     "    response_text = API_RESPONSE.choices[0].message.content\n",
     "    response_text_tokens = [token.token for token in API_RESPONSE.choices[0].logprobs.content]\n",
     "    max_starter_length = max(len(s) for s in [\"Prompt:\", \"Response:\", \"Tokens:\", \"Logprobs:\", \"Perplexity:\"])\n",
     "    max_token_length = max(len(s) for s in response_text_tokens)\n",
     "    \n",
     "\n",
     "    formatted_response_tokens = [s.rjust(max_token_length) for s in response_text_tokens]\n",
     "    formatted_lps = [f\"{lp:.2f}\".rjust(max_token_length) for lp in logprobs]\n",
     "\n",
     "    perplexity_score = np.exp(-np.mean(logprobs))\n",
     "    print(\"Prompt:\".ljust(max_starter_length), prompt)\n",
     "    print(\"Response:\".ljust(max_starter_length), response_text, \"\\n\")\n",
     "    print(\"Tokens:\".ljust(max_starter_length), \" \".join(formatted_response_tokens))\n",
     "    print(\"Logprobs:\".ljust(max_starter_length), \" \".join(formatted_lps))\n",
     "    print(\"Perplexity:\".ljust(max_starter_length), perplexity_score, \"\\n\")"
    ]
   },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, `gpt-3.5-turbo` returned a lower perplexity score for a more deterministic question about recent history, and a higher perplexity score for a more speculative assessment about the near future. Again, while these differences don't guarantee accuracy, they help point the way for our interpretation of the model's results and our future use of them."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Conclusion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Nice! We were able to use the `logprobs` parameter to build a more robust classifier, evaluate our retrieval for Q&A system, and encode and decode each 'byte' of our tokens! `logprobs` adds useful information and signal to our completions output, and we are excited to see how developers incorporate it to improve applications."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Possible extensions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are many other use cases for `logprobs` that are not covered in this cookbook. We can use `logprobs` for:\n",
    "  - Moderation\n",
    "  - Keyword selection\n",
    "  - Improve prompts and interpretability of outputs\n",
    "  - Token healing\n",
    "  - and more!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "openai",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}