updates token counting guide

pull/278/head
Ted Sanders 1 year ago
parent afa9436334
commit b45d2b2346

@@ -11,7 +11,7 @@
"\n",
"Given a text string (e.g., `\"tiktoken is great!\"`) and an encoding (e.g., `\"cl100k_base\"`), a tokenizer can split the text string into a list of tokens (e.g., `[\"t\", \"ik\", \"token\", \" is\", \" great\", \"!\"]`).\n",
"\n",
"Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token). Different models use different encodings.\n",
"Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token).\n",
"\n",
"\n",
"## Encodings\n",
@@ -22,8 +22,8 @@
"\n",
"| Encoding name | OpenAI models |\n",
"|-------------------------|-----------------------------------------------------|\n",
"| `cl100k_base` | ChatGPT models, `text-embedding-ada-002` |\n",
"| `p50k_base` | Code models, `text-davinci-002`, `text-davinci-003` |\n",
"| `cl100k_base` | `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002` |\n",
"| `p50k_base` | Codex models, `text-davinci-002`, `text-davinci-003`|\n",
"| `r50k_base` (or `gpt2`) | GPT-3 models like `davinci` |\n",
"\n",
"You can retrieve the encoding for a model using `tiktoken.encoding_for_model()` as follows:\n",
@@ -31,8 +31,7 @@
"encoding = tiktoken.encoding_for_model('gpt-3.5-turbo')\n",
"```\n",
"\n",
"`p50k_base` overlaps substantially with `r50k_base`, and for non-code applications, they will usually give the same tokens.\n",
"\n",
"Note that `p50k_base` overlaps substantially with `r50k_base`, and for non-code applications, they will usually give the same tokens.\n",
"\n",
"## Tokenizer libraries by language\n",
"\n",
@@ -61,7 +60,7 @@
"source": [
"## 0. Install `tiktoken`\n",
"\n",
"Install `tiktoken` with `pip`:"
"If needed, install `tiktoken` with `pip`:"
]
},
{
@@ -76,8 +75,8 @@
"Requirement already satisfied: tiktoken in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (0.3.2)\n",
"Requirement already satisfied: regex>=2022.1.18 in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (from tiktoken) (2022.10.31)\n",
"Requirement already satisfied: requests>=2.26.0 in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (from tiktoken) (2.28.2)\n",
"Requirement already satisfied: idna<4,>=2.5 in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (from requests>=2.26.0->tiktoken) (3.3)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (from requests>=2.26.0->tiktoken) (2.0.9)\n",
"Requirement already satisfied: idna<4,>=2.5 in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (from requests>=2.26.0->tiktoken) (3.3)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (from requests>=2.26.0->tiktoken) (2021.10.8)\n",
"Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (from requests>=2.26.0->tiktoken) (1.26.7)\n",
"Note: you may need to restart the kernel to use updated packages.\n"
@@ -98,7 +97,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
@@ -119,7 +118,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
@@ -136,7 +135,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
@@ -162,7 +161,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 4,
"metadata": {},
"outputs": [
{
@@ -171,7 +170,7 @@
"[83, 1609, 5963, 374, 2294, 0]"
]
},
"execution_count": 5,
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
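The diff shows only this cell's output; per the guide's example string, the call that produces it is `encoding.encode()`:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
print(encoding.encode("tiktoken is great!"))
# -> [83, 1609, 5963, 374, 2294, 0]
```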
@@ -190,7 +189,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
@@ -203,7 +202,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 6,
"metadata": {},
"outputs": [
{
@@ -212,7 +211,7 @@
"6"
]
},
"execution_count": 7,
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
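The `6` above is a token count for the example string. A sketch of the kind of helper that produces it (the name `num_tokens_from_string` and its signature are assumptions here, not shown in this diff):

```python
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Return the number of tokens used to encode a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(string))

print(num_tokens_from_string("tiktoken is great!", "cl100k_base"))  # -> 6
```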
@@ -239,7 +238,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 7,
"metadata": {},
"outputs": [
{
@@ -248,7 +247,7 @@
"'tiktoken is great!'"
]
},
"execution_count": 8,
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
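This output is the round trip back from tokens to text; a sketch, assuming the token list from the earlier cell:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
print(encoding.decode([83, 1609, 5963, 374, 2294, 0]))
# -> 'tiktoken is great!'
```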
@@ -275,7 +274,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 8,
"metadata": {},
"outputs": [
{
@@ -284,7 +283,7 @@
"[b't', b'ik', b'token', b' is', b' great', b'!']"
]
},
"execution_count": 9,
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
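The bytes above come from decoding tokens one at a time, which shows how each individual token maps to a byte string; a sketch using `decode_single_token_bytes()`:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
print([encoding.decode_single_token_bytes(token) for token in [83, 1609, 5963, 374, 2294, 0]])
# -> [b't', b'ik', b'token', b' is', b' great', b'!']
```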
@@ -308,12 +307,12 @@
"source": [
"## 5. Comparing encodings\n",
"\n",
"Different encodings can vary in how they split words, group spaces, and handle non-English characters. Using the methods above, we can compare different encodings on a few example strings."
"Different encodings vary in how they split words, group spaces, and handle non-English characters. Using the methods above, we can compare different encodings on a few example strings."
]
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
@@ -336,7 +335,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 10,
"metadata": {},
"outputs": [
{
@@ -366,7 +365,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 11,
"metadata": {},
"outputs": [
{
@@ -396,7 +395,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 12,
"metadata": {},
"outputs": [
{
@@ -431,16 +430,16 @@
"source": [
"## 6. Counting tokens for chat API calls\n",
"\n",
"ChatGPT models like `gpt-3.5-turbo` use tokens in the same way as past completions models, but because of their message-based formatting, it's more difficult to count how many tokens will be used by a conversation.\n",
"ChatGPT models like `gpt-3.5-turbo` and `gpt-4` use tokens in the same way as older completions models, but because of their message-based formatting, it's more difficult to count how many tokens will be used by a conversation.\n",
"\n",
"Below is an example function for counting tokens for messages passed to `gpt-3.5-turbo-0301` or `gpt-4-0314`.\n",
"\n",
"Note that the exact way that messages are converted into tokens may change from model to model. So when future model versions are released, the answers returned by this function may be only approximate. The [ChatML documentation](https://github.com/openai/openai-python/blob/main/chatml.md) explains in more detail how the OpenAI API converts messages into tokens."
"Note that the exact way that messages are converted into tokens may change from model to model, and may even change over time for the same model. Therefore, the counts returned by the function below should be considered an estimate, not a guarantee."
]
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
@@ -458,7 +457,7 @@
" print(\"Warning: gpt-4 may change over time. Returning num tokens assuming gpt-4-0314.\")\n",
" return num_tokens_from_messages(messages, model=\"gpt-4-0314\")\n",
" elif model == \"gpt-3.5-turbo-0301\":\n",
" tokens_per_message = 4 # every message follows <im_start>{role/name}\\n{content}<im_end>\\n\n",
" tokens_per_message = 4 # every message follows <|start|>{role/name}\\n{content}<|end|>\\n\n",
" tokens_per_name = -1 # if there's a name, the role is omitted\n",
" elif model == \"gpt-4-0314\":\n",
" tokens_per_message = 3\n",
@@ -472,13 +471,13 @@
" num_tokens += len(encoding.encode(value))\n",
" if key == \"name\":\n",
" num_tokens += tokens_per_name\n",
" num_tokens += 2 # every reply is primed with <im_start>assistant\n",
" num_tokens += 3 # every reply is primed with <|start|>assistant<|message|>\n",
" return num_tokens\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 14,
"metadata": {},
"outputs": [
{
@@ -486,12 +485,12 @@
"output_type": "stream",
"text": [
"gpt-3.5-turbo-0301\n",
"126 prompt tokens counted by num_tokens_from_messages().\n",
"126 prompt tokens counted by the OpenAI API.\n",
"127 prompt tokens counted by num_tokens_from_messages().\n",
"127 prompt tokens counted by the OpenAI API.\n",
"\n",
"gpt-4-0314\n",
"128 prompt tokens counted by num_tokens_from_messages().\n",
"128 prompt tokens counted by the OpenAI API.\n",
"129 prompt tokens counted by num_tokens_from_messages().\n",
"129 prompt tokens counted by the OpenAI API.\n",
"\n"
]
}
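The output above compares the local count against the API's reported `prompt_tokens`. A sketch of that verification, assuming the pre-1.0 `openai` client of this notebook's era and the `num_tokens_from_messages()` sketch above; `example_messages` is a hypothetical stand-in, as the notebook's actual example messages (127/129 prompt tokens) are not shown in this diff:

```python
import openai  # assumes the pre-1.0 openai client (openai.ChatCompletion API)

# Hypothetical stand-in for the notebook's example messages
example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

for model in ["gpt-3.5-turbo-0301", "gpt-4-0314"]:
    print(model)
    # count tokens locally
    print(f"{num_tokens_from_messages(example_messages, model)} prompt tokens counted by num_tokens_from_messages().")
    # confirm with a real API call (max_tokens=1 keeps the reply minimal)
    response = openai.ChatCompletion.create(
        model=model, messages=example_messages, temperature=0, max_tokens=1
    )
    print(f'{response["usage"]["prompt_tokens"]} prompt tokens counted by the OpenAI API.\n')
```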
