diff --git a/examples/How_to_count_tokens_with_tiktoken.ipynb b/examples/How_to_count_tokens_with_tiktoken.ipynb new file mode 100644 index 00000000..9a821972 --- /dev/null +++ b/examples/How_to_count_tokens_with_tiktoken.ipynb @@ -0,0 +1,406 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# How to count tokens with tiktoken\n", + "\n", + "[`tiktoken`](https://github.com/openai/tiktoken/blob/main/README.md) is a fast open-source tokenizer by OpenAI.\n", + "\n", + "Given a text string (e.g., `\"tiktoken is great!\"`) and an encoding (e.g., `\"gpt2\"`), a tokenizer can split the text string into a list of tokens (e.g., `[\"t\", \"ik\", \"token\", \" is\", \" great\", \"!\"]`).\n", + "\n", + "Splitting text strings into tokens is useful because models like GPT-3 see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token). 
Different models use different encodings.\n", + "\n", + "`tiktoken` supports three encodings used by OpenAI models:\n", + "\n", + "| Encoding name | OpenAI models |\n", + "|-------------------------|-----------------------------------------------------|\n", + "| `gpt2` (or `r50k_base`) | Most GPT-3 models |\n", + "| `p50k_base` | Code models, `text-davinci-002`, `text-davinci-003` |\n", + "| `cl100k_base` | `text-embedding-ada-002` |\n", + "\n", + "`p50k_base` overlaps substantially with `gpt2`, and for non-code applications, they will usually give the same tokens.\n", + "\n", + "## Tokenizer libraries and languages\n", + "\n", + "For `gpt2` encodings, tokenizers are available in many languages.\n", + "- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md) (or alternatively [GPT2TokenizerFast](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast))\n", + "- JavaScript: [gpt-3-encoder](https://www.npmjs.com/package/gpt-3-encoder)\n", + "- .NET / C#: [GPT Tokenizer](https://github.com/dluc/openai-tools)\n", + "- Java: [gpt2-tokenizer-java](https://github.com/hyunwoongko/gpt2-tokenizer-java)\n", + "- PHP: [GPT-3-Encoder-PHP](https://github.com/CodeRevolutionPlugins/GPT-3-Encoder-PHP)\n", + "\n", + "(OpenAI makes no endorsements or guarantees of third-party libraries.)\n", + "\n", + "For `p50k_base` and `cl100k_base` encodings, `tiktoken` is the only tokenizer available as of January 2023.\n", + "- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md)\n", + "\n", + "## How strings are typically tokenized\n", + "\n", + "In English, tokens commonly range in length from one character to one word (e.g., `\"t\"` or `\" great\"`), though in some languages tokens can be shorter than one character or longer than one word. Spaces are usually grouped with the starts of words (e.g., `\" is\"` instead of `\"is \"` or `\" \"`+`\"is\"`). 
You can quickly check how a string is tokenized at the [OpenAI Tokenizer](https://beta.openai.com/tokenizer)." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 0. Install `tiktoken`\n", + "\n", + "In your terminal, install `tiktoken` with `pip`:\n", + "\n", + "```bash\n", + "pip install tiktoken\n", + "```" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Import `tiktoken`" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import tiktoken\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Load an encoding\n", + "\n", + "Use `tiktoken.get_encoding()` to load an encoding by name.\n", + "\n", + "The first time this runs, it will require an internet connection to download. Later runs won't need an internet connection." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "encoding = tiktoken.get_encoding(\"gpt2\")\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Turn text into tokens with `encoding.encode()`\n", + "\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `.encode()` method converts a text string into a list of token integers." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[83, 1134, 30001, 318, 1049, 0]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "encoding.encode(\"tiktoken is great!\")\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Count tokens by counting the length of the list returned by `.encode()`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "def num_tokens_from_string(string: str, encoding_name: str) -> int:\n", + "    \"\"\"Returns the number of tokens in a text string.\"\"\"\n", + "    encoding = tiktoken.get_encoding(encoding_name)\n", + "    num_tokens = len(encoding.encode(string))\n", + "    return num_tokens\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "6" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "num_tokens_from_string(\"tiktoken is great!\", \"gpt2\")\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Turn tokens into text with `encoding.decode()`" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`.decode()` converts a list of token integers to a string." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'tiktoken is great!'" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "encoding.decode([83, 1134, 30001, 318, 1049, 0])\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Warning: although `.decode()` can be applied to single tokens, it can be lossy for tokens that aren't on UTF-8 boundaries." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For single tokens, `.decode_single_token_bytes()` safely converts a single integer token to the bytes it represents." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[b't', b'ik', b'token', b' is', b' great', b'!']" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "[encoding.decode_single_token_bytes(token) for token in [83, 1134, 30001, 318, 1049, 0]]\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "(The `b` in front of the strings indicates that the strings are byte strings.)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Comparing encodings\n", + "\n", + "Different encodings can vary in how they split words, group spaces, and handle non-English characters. Using the methods above, we can compare different encodings on a few example strings." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "def compare_encodings(example_string: str) -> None:\n", + "    \"\"\"Prints a comparison of three string encodings.\"\"\"\n", + "    # print the example string\n", + "    print(f'\\nExample string: \"{example_string}\"')\n", + "    # for each encoding, print the # of tokens, the token integers, and the token bytes\n", + "    for encoding_name in [\"gpt2\", \"p50k_base\", \"cl100k_base\"]:\n", + "        encoding = tiktoken.get_encoding(encoding_name)\n", + "        token_integers = encoding.encode(example_string)\n", + "        num_tokens = len(token_integers)\n", + "        token_bytes = [encoding.decode_single_token_bytes(token) for token in token_integers]\n", + "        print()\n", + "        print(f\"{encoding_name}: {num_tokens} tokens\")\n", + "        print(f\"token integers: {token_integers}\")\n", + "        print(f\"token bytes: {token_bytes}\")\n", + "    " + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Example string: 
\"antidisestablishmentarianism\"\n", + "\n", + "gpt2: 5 tokens\n", + "token integers: [415, 29207, 44390, 3699, 1042]\n", + "token bytes: [b'ant', b'idis', b'establishment', b'arian', b'ism']\n", + "\n", + "p50k_base: 5 tokens\n", + "token integers: [415, 29207, 44390, 3699, 1042]\n", + "token bytes: [b'ant', b'idis', b'establishment', b'arian', b'ism']\n", + "\n", + "cl100k_base: 6 tokens\n", + "token integers: [519, 85342, 34500, 479, 8997, 2191]\n", + "token bytes: [b'ant', b'idis', b'establish', b'ment', b'arian', b'ism']\n" + ] + } + ], + "source": [ + "compare_encodings(\"antidisestablishmentarianism\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Example string: \"2 + 2 = 4\"\n", + "\n", + "gpt2: 5 tokens\n", + "token integers: [17, 1343, 362, 796, 604]\n", + "token bytes: [b'2', b' +', b' 2', b' =', b' 4']\n", + "\n", + "p50k_base: 5 tokens\n", + "token integers: [17, 1343, 362, 796, 604]\n", + "token bytes: [b'2', b' +', b' 2', b' =', b' 4']\n", + "\n", + "cl100k_base: 7 tokens\n", + "token integers: [17, 489, 220, 17, 284, 220, 19]\n", + "token bytes: [b'2', b' +', b' ', b'2', b' =', b' ', b'4']\n" + ] + } + ], + "source": [ + "compare_encodings(\"2 + 2 = 4\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Example string: \"お誕生日おめでとう\"\n", + "\n", + "gpt2: 14 tokens\n", + "token integers: [2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 223, 30640, 30201, 29557]\n", + "token bytes: [b'\\xe3\\x81', b'\\x8a', b'\\xe8\\xaa', b'\\x95', b'\\xe7\\x94\\x9f', b'\\xe6\\x97', b'\\xa5', b'\\xe3\\x81', b'\\x8a', b'\\xe3\\x82', b'\\x81', b'\\xe3\\x81\\xa7', b'\\xe3\\x81\\xa8', b'\\xe3\\x81\\x86']\n", + "\n", + "p50k_base: 14 tokens\n", + "token integers: [2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 
223, 30640, 30201, 29557]\n", + "token bytes: [b'\\xe3\\x81', b'\\x8a', b'\\xe8\\xaa', b'\\x95', b'\\xe7\\x94\\x9f', b'\\xe6\\x97', b'\\xa5', b'\\xe3\\x81', b'\\x8a', b'\\xe3\\x82', b'\\x81', b'\\xe3\\x81\\xa7', b'\\xe3\\x81\\xa8', b'\\xe3\\x81\\x86']\n", + "\n", + "cl100k_base: 9 tokens\n", + "token integers: [33334, 45918, 243, 21990, 9080, 33334, 62004, 16556, 78699]\n", + "token bytes: [b'\\xe3\\x81\\x8a', b'\\xe8\\xaa', b'\\x95', b'\\xe7\\x94\\x9f', b'\\xe6\\x97\\xa5', b'\\xe3\\x81\\x8a', b'\\xe3\\x82\\x81', b'\\xe3\\x81\\xa7', b'\\xe3\\x81\\xa8\\xe3\\x81\\x86']\n" + ] + } + ], + "source": [ + "compare_encodings(\"お誕生日おめでとう\")\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "openai", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.9" + }, + "orig_nbformat": 4, + "vscode": { + "interpreter": { + "hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97" + } + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}