openai-cookbook/examples/How_to_count_tokens_with_tiktoken.ipynb

{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# How to count tokens with tiktoken\n",
    "\n",
    "[`tiktoken`](https://github.com/openai/tiktoken/blob/main/README.md) is a fast open-source tokenizer by OpenAI.\n",
    "\n",
    "Given a text string (e.g., `\"tiktoken is great!\"`) and an encoding (e.g., `\"gpt2\"`), a tokenizer can split the text string into a list of tokens (e.g., `[\"t\", \"ik\", \"token\", \" is\", \" great\", \"!\"]`).\n",
    "\n",
    "Splitting text strings into tokens is useful because models like GPT-3 see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token). Different models use different encodings.\n",
    "\n",
    "`tiktoken` supports three encodings used by OpenAI models:\n",
    "\n",
    "| Encoding name           | OpenAI models                                       |\n",
    "|-------------------------|-----------------------------------------------------|\n",
    "| `gpt2` (or `r50k_base`) | Most GPT-3 models                                   |\n",
    "| `p50k_base`             | Code models, `text-davinci-002`, `text-davinci-003` |\n",
    "| `cl100k_base`           | `text-embedding-ada-002`                            |\n",
    "\n",
    "`p50k_base` overlaps substantially with `gpt2`, and for non-code applications, they will usually give the same tokens.\n",
    "\n",
    "## Tokenizer libraries and languages\n",
    "\n",
    "For `gpt2` encodings, tokenizers are available in many languages.\n",
    "- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md) (or alternatively [GPT2TokenizerFast](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast))\n",
    "- JavaScript: [gpt-3-encoder](https://www.npmjs.com/package/gpt-3-encoder)\n",
    "- .NET / C#: [GPT Tokenizer](https://github.com/dluc/openai-tools)\n",
    "- Java: [gpt2-tokenizer-java](https://github.com/hyunwoongko/gpt2-tokenizer-java)\n",
    "- PHP: [GPT-3-Encoder-PHP](https://github.com/CodeRevolutionPlugins/GPT-3-Encoder-PHP)\n",
    "\n",
    "(OpenAI makes no endorsements or guarantees of third-party libraries.)\n",
    "\n",
    "For `p50k_base` and `cl100k_base` encodings, `tiktoken` is the only tokenizer available as of January 2023.\n",
    "- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md)\n",
    "\n",
    "## How strings are typically tokenized\n",
    "\n",
    "In English, tokens commonly range in length from one character to one word (e.g., `\"t\"` or `\" great\"`), though in some languages tokens can be shorter than one character or longer than one word. Spaces are usually grouped with the starts of words (e.g., `\" is\"` instead of `\"is \"` or `\" \"`+`\"is\"`). You can quickly check how a string is tokenized at the [OpenAI Tokenizer](https://beta.openai.com/tokenizer)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0. Install `tiktoken`\n",
    "\n",
    "In your terminal, install `tiktoken` with `pip`:\n",
    "\n",
    "```bash\n",
    "pip install tiktoken\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Import `tiktoken`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tiktoken\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Load an encoding\n",
    "\n",
    "Use `tiktoken.get_encoding()` to load an encoding by name.\n",
    "\n",
    "The first time this runs, it will require an internet connection to download. Later runs won't need an internet connection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "encoding = tiktoken.get_encoding(\"gpt2\")\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Turn text into tokens with `encoding.encode()`\n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `.encode()` method converts a text string into a list of token integers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[83, 1134, 30001, 318, 1049, 0]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "encoding.encode(\"tiktoken is great!\")\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Count tokens by counting the length of the list returned by `.encode()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def num_tokens_from_string(string: str, encoding_name: str) -> int:\n",
    "    \"\"\"Returns the number of tokens in a text string.\"\"\"\n",
    "    encoding = tiktoken.get_encoding(encoding_name)\n",
    "    num_tokens = len(encoding.encode(string))\n",
    "    return num_tokens\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "num_tokens_from_string(\"tiktoken is great!\", \"gpt2\")\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Turn tokens into text with `encoding.decode()`"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`.decode()` converts a list of token integers to a string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'tiktoken is great!'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "encoding.decode([83, 1134, 30001, 318, 1049, 0])\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Warning: although `.decode()` can be applied to single tokens, beware that it can be lossy for tokens that aren't on utf-8 boundaries."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For single tokens, `.decode_single_token_bytes()` safely converts a single integer token to the bytes it represents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[b't', b'ik', b'token', b' is', b' great', b'!']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[encoding.decode_single_token_bytes(token) for token in [83, 1134, 30001, 318, 1049, 0]]\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(The `b` in front of the strings indicates that the strings are byte strings.)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Comparing encodings\n",
    "\n",
    "Different encodings can vary in how they split words, group spaces, and handle non-English characters. Using the methods above, we can compare different encodings on a few example strings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compare_encodings(example_string: str) -> None:\n",
    "    \"\"\"Prints a comparison of three string encodings.\"\"\"\n",
    "    # print the example string\n",
    "    print(f'\\nExample string: \"{example_string}\"')\n",
    "    # for each encoding, print the # of tokens, the token integers, and the token bytes\n",
    "    for encoding_name in [\"gpt2\", \"p50k_base\", \"cl100k_base\"]:\n",
    "        encoding = tiktoken.get_encoding(encoding_name)\n",
    "        token_integers = encoding.encode(example_string)\n",
    "        num_tokens = len(token_integers)\n",
    "        token_bytes = [encoding.decode_single_token_bytes(token) for token in token_integers]\n",
    "        print()\n",
    "        print(f\"{encoding_name}: {num_tokens} tokens\")\n",
    "        print(f\"token integers: {token_integers}\")\n",
    "        print(f\"token bytes: {token_bytes}\")\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Example string: \"antidisestablishmentarianism\"\n",
      "\n",
      "gpt2: 5 tokens\n",
      "token integers: [415, 29207, 44390, 3699, 1042]\n",
      "token bytes: [b'ant', b'idis', b'establishment', b'arian', b'ism']\n",
      "\n",
      "p50k_base: 5 tokens\n",
      "token integers: [415, 29207, 44390, 3699, 1042]\n",
      "token bytes: [b'ant', b'idis', b'establishment', b'arian', b'ism']\n",
      "\n",
      "cl100k_base: 6 tokens\n",
      "token integers: [519, 85342, 34500, 479, 8997, 2191]\n",
      "token bytes: [b'ant', b'idis', b'establish', b'ment', b'arian', b'ism']\n"
     ]
    }
   ],
   "source": [
    "compare_encodings(\"antidisestablishmentarianism\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Example string: \"2 + 2 = 4\"\n",
      "\n",
      "gpt2: 5 tokens\n",
      "token integers: [17, 1343, 362, 796, 604]\n",
      "token bytes: [b'2', b' +', b' 2', b' =', b' 4']\n",
      "\n",
      "p50k_base: 5 tokens\n",
      "token integers: [17, 1343, 362, 796, 604]\n",
      "token bytes: [b'2', b' +', b' 2', b' =', b' 4']\n",
      "\n",
      "cl100k_base: 7 tokens\n",
      "token integers: [17, 489, 220, 17, 284, 220, 19]\n",
      "token bytes: [b'2', b' +', b' ', b'2', b' =', b' ', b'4']\n"
     ]
    }
   ],
   "source": [
    "compare_encodings(\"2 + 2 = 4\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Example string: \"お誕生日おめでとう\"\n",
      "\n",
      "gpt2: 14 tokens\n",
      "token integers: [2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 223, 30640, 30201, 29557]\n",
      "token bytes: [b'\\xe3\\x81', b'\\x8a', b'\\xe8\\xaa', b'\\x95', b'\\xe7\\x94\\x9f', b'\\xe6\\x97', b'\\xa5', b'\\xe3\\x81', b'\\x8a', b'\\xe3\\x82', b'\\x81', b'\\xe3\\x81\\xa7', b'\\xe3\\x81\\xa8', b'\\xe3\\x81\\x86']\n",
      "\n",
      "p50k_base: 14 tokens\n",
      "token integers: [2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 223, 30640, 30201, 29557]\n",
      "token bytes: [b'\\xe3\\x81', b'\\x8a', b'\\xe8\\xaa', b'\\x95', b'\\xe7\\x94\\x9f', b'\\xe6\\x97', b'\\xa5', b'\\xe3\\x81', b'\\x8a', b'\\xe3\\x82', b'\\x81', b'\\xe3\\x81\\xa7', b'\\xe3\\x81\\xa8', b'\\xe3\\x81\\x86']\n",
      "\n",
      "cl100k_base: 9 tokens\n",
      "token integers: [33334, 45918, 243, 21990, 9080, 33334, 62004, 16556, 78699]\n",
      "token bytes: [b'\\xe3\\x81\\x8a', b'\\xe8\\xaa', b'\\x95', b'\\xe7\\x94\\x9f', b'\\xe6\\x97\\xa5', b'\\xe3\\x81\\x8a', b'\\xe3\\x82\\x81', b'\\xe3\\x81\\xa7', b'\\xe3\\x81\\xa8\\xe3\\x81\\x86']\n"
     ]
    }
   ],
   "source": [
    "compare_encodings(\"お誕生日おめでとう\")\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "openai",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.9"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
adds tiktoken example 2022-12-16 20:40:41 +00:00			`{`
			`"cells": [`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# How to count tokens with tiktoken\n",`
			`"\n",`
			"[`tiktoken`](https://github.com/openai/tiktoken/blob/main/README.md) is a fast open-source tokenizer by OpenAI.\n",
			`"\n",`
			"Given a text string (e.g., `\"tiktoken is great!\"`) and an encoding (e.g., `\"gpt2\"`), a tokenizer can split the text string into a list of tokens (e.g., `[\"t\", \"ik\", \"token\", \" is\", \" great\", \"!\"]`).\n",
			`"\n",`
			`"Splitting text strings into tokens is useful because models like GPT-3 see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token). Different models use different encodings.\n",`
			`"\n",`
			"`tiktoken` supports three encodings used by OpenAI models:\n",
			`"\n",`
			`"\| Encoding name \| OpenAI models \|\n",`
			`"\|-------------------------\|-----------------------------------------------------\|\n",`
			"\| `gpt2` (or `r50k_base`) \| Most GPT-3 models \|\n",
			"\| `p50k_base` \| Code models, `text-davinci-002`, `text-davinci-003` \|\n",
			"\| `cl100k_base` \| `text-embedding-ada-002` \|\n",
			`"\n",`
			"`p50k_base` overlaps substantially with `gpt2`, and for non-code applications, they will usually give the same tokens.\n",
			`"\n",`
			`"## Tokenizer libraries and languages\n",`
			`"\n",`
			"For `gpt2` encodings, tokenizers are available in many languages.\n",
			`"- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md) (or alternatively [GPT2TokenizerFast](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast))\n",`
			`"- JavaScript: [gpt-3-encoder](https://www.npmjs.com/package/gpt-3-encoder)\n",`
			`"- .NET / C#: [GPT Tokenizer](https://github.com/dluc/openai-tools)\n",`
			`"- Java: [gpt2-tokenizer-java](https://github.com/hyunwoongko/gpt2-tokenizer-java)\n",`
			`"- PHP: [GPT-3-Encoder-PHP](https://github.com/CodeRevolutionPlugins/GPT-3-Encoder-PHP)\n",`
			`"\n",`
			`"(OpenAI makes no endorsements or guarantees of third-party libraries.)\n",`
			`"\n",`
			"For `p50k_base` and `cl100k_base` encodings, `tiktoken` is the only tokenizer available as of January 2023.\n",
			`"- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md)\n",`
			`"\n",`
			`"## How strings are typically tokenized\n",`
			`"\n",`
			"In English, tokens commonly range in length from one character to one word (e.g., `\"t\"` or `\" great\"`), though in some languages tokens can be shorter than one character or longer than one word. Spaces are usually grouped with the starts of words (e.g., `\" is\"` instead of `\"is \"` or `\" \"`+`\"is\"`). You can quickly check how a string is tokenized at the [OpenAI Tokenizer](https://beta.openai.com/tokenizer)."
			`]`
			`},`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			"## 0. Install `tiktoken`\n",
			`"\n",`
			"In your terminal, install `tiktoken` with `pip`:\n",
			`"\n",`
			"```bash\n",
			`"pip install tiktoken\n",`
			"```"
			`]`
			`},`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			"## 1. Import `tiktoken`"
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 1,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"import tiktoken\n"`
			`]`
			`},`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## 2. Load an encoding\n",`
			`"\n",`
			"Use `tiktoken.get_encoding()` to load an encoding by name.\n",
			`"\n",`
			`"The first time this runs, it will require an internet connection to download. Later runs won't need an internet connection."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 2,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"encoding = tiktoken.get_encoding(\"gpt2\")\n"`
			`]`
			`},`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			"## 3. Turn text into tokens with `encoding.encode()`\n",
			`"\n"`
			`]`
			`},`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			"The `.encode()` method converts a text string into a list of token integers."
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 3,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"[83, 1134, 30001, 318, 1049, 0]"`
			`]`
			`},`
			`"execution_count": 3,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"encoding.encode(\"tiktoken is great!\")\n"`
			`]`
			`},`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			"Count tokens by counting the length of the list returned by `.encode()`."
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 4,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"def num_tokens_from_string(string: str, encoding_name: str) -> int:\n",`
			`" \"\"\"Returns the number of tokens in a text string.\"\"\"\n",`
			`" encoding = tiktoken.get_encoding(encoding_name)\n",`
			`" num_tokens = len(encoding.encode(string))\n",`
			`" return num_tokens\n"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 5,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"6"`
			`]`
			`},`
			`"execution_count": 5,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"num_tokens_from_string(\"tiktoken is great!\", \"gpt2\")\n"`
			`]`
			`},`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			"## 4. Turn tokens into text with `encoding.decode()`"
			`]`
			`},`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			"`.decode()` converts a list of token integers to a string."
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 6,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"'tiktoken is great!'"`
			`]`
			`},`
			`"execution_count": 6,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"encoding.decode([83, 1134, 30001, 318, 1049, 0])\n"`
			`]`
			`},`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			"Warning: although `.decode()` can be applied to single tokens, beware that it can be lossy for tokens that aren't on utf-8 boundaries."
			`]`
			`},`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			"For single tokens, `.decode_single_token_bytes()` safely converts a single integer token to the bytes it represents."
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 7,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"[b't', b'ik', b'token', b' is', b' great', b'!']"`
			`]`
			`},`
			`"execution_count": 7,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"[encoding.decode_single_token_bytes(token) for token in [83, 1134, 30001, 318, 1049, 0]]\n"`
			`]`
			`},`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			"(The `b` in front of the strings indicates that the strings are byte strings.)"
			`]`
			`},`
			`{`
			`"attachments": {},`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## 5. Comparing encodings\n",`
			`"\n",`
			`"Different encodings can vary in how they split words, group spaces, and handle non-English characters. Using the methods above, we can compare different encodings on a few example strings."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 8,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"def compare_encodings(example_string: str) -> None:\n",`
			`" \"\"\"Prints a comparison of three string encodings.\"\"\"\n",`
			`" # print the example string\n",`
			`" print(f'\\nExample string: \"{example_string}\"')\n",`
			`" # for each encoding, print the # of tokens, the token integers, and the token bytes\n",`
			`" for encoding_name in [\"gpt2\", \"p50k_base\", \"cl100k_base\"]:\n",`
			`" encoding = tiktoken.get_encoding(encoding_name)\n",`
			`" token_integers = encoding.encode(example_string)\n",`
			`" num_tokens = len(token_integers)\n",`
			`" token_bytes = [encoding.decode_single_token_bytes(token) for token in token_integers]\n",`
			`" print()\n",`
			`" print(f\"{encoding_name}: {num_tokens} tokens\")\n",`
			`" print(f\"token integers: {token_integers}\")\n",`
			`" print(f\"token bytes: {token_bytes}\")\n",`
			`" "`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 9,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"\n",`
			`"Example string: \"antidisestablishmentarianism\"\n",`
			`"\n",`
			`"gpt2: 5 tokens\n",`
			`"token integers: [415, 29207, 44390, 3699, 1042]\n",`
			`"token bytes: [b'ant', b'idis', b'establishment', b'arian', b'ism']\n",`
			`"\n",`
			`"p50k_base: 5 tokens\n",`
			`"token integers: [415, 29207, 44390, 3699, 1042]\n",`
			`"token bytes: [b'ant', b'idis', b'establishment', b'arian', b'ism']\n",`
			`"\n",`
			`"cl100k_base: 6 tokens\n",`
			`"token integers: [519, 85342, 34500, 479, 8997, 2191]\n",`
			`"token bytes: [b'ant', b'idis', b'establish', b'ment', b'arian', b'ism']\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"compare_encodings(\"antidisestablishmentarianism\")\n"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 10,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"\n",`
			`"Example string: \"2 + 2 = 4\"\n",`
			`"\n",`
			`"gpt2: 5 tokens\n",`
			`"token integers: [17, 1343, 362, 796, 604]\n",`
			`"token bytes: [b'2', b' +', b' 2', b' =', b' 4']\n",`
			`"\n",`
			`"p50k_base: 5 tokens\n",`
			`"token integers: [17, 1343, 362, 796, 604]\n",`
			`"token bytes: [b'2', b' +', b' 2', b' =', b' 4']\n",`
			`"\n",`
			`"cl100k_base: 7 tokens\n",`
			`"token integers: [17, 489, 220, 17, 284, 220, 19]\n",`
			`"token bytes: [b'2', b' +', b' ', b'2', b' =', b' ', b'4']\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"compare_encodings(\"2 + 2 = 4\")\n"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 11,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"\n",`
			`"Example string: \"お誕生日おめでとう\"\n",`
			`"\n",`
			`"gpt2: 14 tokens\n",`
			`"token integers: [2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 223, 30640, 30201, 29557]\n",`
			`"token bytes: [b'\\xe3\\x81', b'\\x8a', b'\\xe8\\xaa', b'\\x95', b'\\xe7\\x94\\x9f', b'\\xe6\\x97', b'\\xa5', b'\\xe3\\x81', b'\\x8a', b'\\xe3\\x82', b'\\x81', b'\\xe3\\x81\\xa7', b'\\xe3\\x81\\xa8', b'\\xe3\\x81\\x86']\n",`
			`"\n",`
			`"p50k_base: 14 tokens\n",`
			`"token integers: [2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 223, 30640, 30201, 29557]\n",`
			`"token bytes: [b'\\xe3\\x81', b'\\x8a', b'\\xe8\\xaa', b'\\x95', b'\\xe7\\x94\\x9f', b'\\xe6\\x97', b'\\xa5', b'\\xe3\\x81', b'\\x8a', b'\\xe3\\x82', b'\\x81', b'\\xe3\\x81\\xa7', b'\\xe3\\x81\\xa8', b'\\xe3\\x81\\x86']\n",`
			`"\n",`
			`"cl100k_base: 9 tokens\n",`
			`"token integers: [33334, 45918, 243, 21990, 9080, 33334, 62004, 16556, 78699]\n",`
			`"token bytes: [b'\\xe3\\x81\\x8a', b'\\xe8\\xaa', b'\\x95', b'\\xe7\\x94\\x9f', b'\\xe6\\x97\\xa5', b'\\xe3\\x81\\x8a', b'\\xe3\\x82\\x81', b'\\xe3\\x81\\xa7', b'\\xe3\\x81\\xa8\\xe3\\x81\\x86']\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"compare_encodings(\"お誕生日おめでとう\")\n"`
			`]`
			`}`
			`],`
			`"metadata": {`
			`"kernelspec": {`
			`"display_name": "openai",`
			`"language": "python",`
			`"name": "python3"`
			`},`
			`"language_info": {`
			`"codemirror_mode": {`
			`"name": "ipython",`
			`"version": 3`
			`},`
			`"file_extension": ".py",`
			`"mimetype": "text/x-python",`
			`"name": "python",`
			`"nbconvert_exporter": "python",`
			`"pygments_lexer": "ipython3",`
			`"version": "3.9.9"`
			`},`
			`"orig_nbformat": 4,`
			`"vscode": {`
			`"interpreter": {`
			`"hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"`
			`}`
			`}`
			`},`
			`"nbformat": 4,`
			`"nbformat_minor": 2`
			`}`