{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# How to count tokens with tiktoken\n",
"\n",
"[`tiktoken`](https://github.com/openai/tiktoken/blob/main/README.md) is a fast open-source tokenizer by OpenAI.\n",
"\n",
"Given a text string (e.g., `\"tiktoken is great!\"`) and an encoding (e.g., `\"cl100k_base\"`), a tokenizer can split the text string into a list of tokens (e.g., `[\"t\", \"ik\", \"token\", \" is\", \" great\", \"!\"]`).\n",
"\n",
"Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token).\n",
"\n",
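"Both uses reduce to simple arithmetic once a token count is known. A minimal sketch follows; the helper names `fits_in_context` and `estimate_cost` and every number passed to them are hypothetical placeholders, not real model limits or OpenAI prices.\n",
"\n",
"```python\n",
"def fits_in_context(num_tokens: int, context_limit: int, reserved_for_reply: int = 0) -> bool:\n",
"    \"\"\"Return True if the prompt plus the reserved reply budget fits within the limit.\"\"\"\n",
"    return num_tokens + reserved_for_reply <= context_limit\n",
"\n",
"\n",
"def estimate_cost(num_tokens: int, price_per_1k_tokens: float) -> float:\n",
"    \"\"\"Return the cost of num_tokens at a given price per 1,000 tokens.\"\"\"\n",
"    return num_tokens / 1000 * price_per_1k_tokens\n",
"\n",
"\n",
"# placeholder values for illustration only; substitute the real limit and price for your model\n",
"print(fits_in_context(num_tokens=3500, context_limit=4000, reserved_for_reply=600))\n",
"print(estimate_cost(num_tokens=3500, price_per_1k_tokens=0.5))\n",
"```\n",
"\n",
"Reserving part of the limit for the reply matters because, in this sketch, the prompt and the completion are assumed to share one context window.\n",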
"\n",
"## Encodings\n",
"\n",
"Encodings specify how text is converted into tokens. Different models use different encodings.\n",
"\n",
"`tiktoken` supports three encodings used by OpenAI models:\n",
"\n",
"| Encoding name | OpenAI models |\n",
"|-------------------------|-----------------------------------------------------|\n",
"| `cl100k_base` | ChatGPT models, `text-embedding-ada-002` |\n",
"| `p50k_base` | Code models, `text-davinci-002`, `text-davinci-003` |\n",
"| `r50k_base` (or `gpt2`) | GPT-3 models like `davinci` |\n",
"\n",
"You can retrieve the encoding for a model using `tiktoken.encoding_for_model()` as follows:\n",
"```python\n",
"encoding = tiktoken.encoding_for_model('gpt-3.5-turbo')\n",
"```\n",
"\n",
"`p50k_base` overlaps substantially with `r50k_base`, and for non-code applications, they will usually give the same tokens.\n",
"\n",
"\n",
"## Tokenizer libraries by language\n",
"\n",
"For `cl100k_base` and `p50k_base` encodings, `tiktoken` is the only tokenizer available as of March 2023.\n",
"- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md)\n",
"\n",
"For `r50k_base` (`gpt2`) encodings, tokenizers are available in many languages.\n",
"- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md) (or alternatively [GPT2TokenizerFast](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast))\n",
"- JavaScript: [gpt-3-encoder](https://www.npmjs.com/package/gpt-3-encoder)\n",
"- .NET / C#: [GPT Tokenizer](https://github.com/dluc/openai-tools)\n",
"- Java: [gpt2-tokenizer-java](https://github.com/hyunwoongko/gpt2-tokenizer-java)\n",
"- PHP: [GPT-3-Encoder-PHP](https://github.com/CodeRevolutionPlugins/GPT-3-Encoder-PHP)\n",
"\n",
"(OpenAI makes no endorsements or guarantees of third-party libraries.)\n",
"\n",
"\n",
"## How strings are typically tokenized\n",
"\n",
"In English, tokens commonly range in length from one character to one word (e.g., `\"t\"` or `\" great\"`), though in some languages tokens can be shorter than one character or longer than one word. Spaces are usually grouped with the starts of words (e.g., `\" is\"` instead of `\"is \"` or `\" \"`+`\"is\"`). You can quickly check how a string is tokenized at the [OpenAI Tokenizer](https://beta.openai.com/tokenizer)."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 0. Install `tiktoken`\n",
"\n",
"Install `tiktoken` with `pip`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade tiktoken"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Import `tiktoken`"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import tiktoken\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Load an encoding\n",
"\n",
"Use `tiktoken.get_encoding()` to load an encoding by name.\n",
"\n",
"The first time this runs, it needs an internet connection to download the encoding. Later runs won't need one."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"encoding = tiktoken.get_encoding(\"cl100k_base\")\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Use `tiktoken.encoding_for_model()` to automatically load the correct encoding for a given model name."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"encoding = tiktoken.encoding_for_model(\"gpt-3.5-turbo\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Turn text into tokens with `encoding.encode()`\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The `.encode()` method converts a text string into a list of token integers."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[83, 1609, 5963, 374, 2294, 0]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"encoding.encode(\"tiktoken is great!\")\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Count tokens by counting the length of the list returned by `.encode()`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def num_tokens_from_string(string: str, encoding_name: str) -> int:\n",
"    \"\"\"Returns the number of tokens in a text string.\"\"\"\n",
"    encoding = tiktoken.get_encoding(encoding_name)\n",
"    num_tokens = len(encoding.encode(string))\n",
"    return num_tokens\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"num_tokens_from_string(\"tiktoken is great!\", \"cl100k_base\")\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Turn tokens into text with `encoding.decode()`"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"`.decode()` converts a list of token integers to a string."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'tiktoken is great!'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"encoding.decode([83, 1609, 5963, 374, 2294, 0])\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Warning: although `.decode()` can be applied to single tokens, beware that it can be lossy for tokens that aren't on UTF-8 boundaries."
]
},
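{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The loss comes from UTF-8 itself rather than from anything specific to `tiktoken`, so it can be illustrated with plain Python bytes. The sketch below assumes that a three-byte character has been split across two tokens and that invalid bytes are replaced during decoding rather than raising an error; the variable names are illustrative.\n",
"\n",
"```python\n",
"# the character U+304A is three bytes in UTF-8; assume a tokenizer split them 2 + 1\n",
"first, second = b'\\xe3\\x81', b'\\x8a'\n",
"\n",
"# decoding each piece on its own yields replacement characters, not the original text\n",
"print(repr(first.decode('utf-8', errors='replace')))\n",
"print(repr(second.decode('utf-8', errors='replace')))\n",
"\n",
"# joining the raw bytes first and decoding once recovers the character\n",
"print((first + second).decode('utf-8'))\n",
"```\n",
"\n",
"This is why working with bytes is the safer choice when inspecting tokens one at a time."
]
},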
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"For single tokens, `.decode_single_token_bytes()` safely converts a single integer token to the bytes it represents."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[b't', b'ik', b'token', b' is', b' great', b'!']"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[encoding.decode_single_token_bytes(token) for token in [83, 1609, 5963, 374, 2294, 0]]\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"(The `b` in front of the strings indicates that the strings are byte strings.)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Comparing encodings\n",
"\n",
"Different encodings can vary in how they split words, group spaces, and handle non-English characters. Using the methods above, we can compare different encodings on a few example strings."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"def compare_encodings(example_string: str) -> None:\n",
"    \"\"\"Prints a comparison of three string encodings.\"\"\"\n",
"    # print the example string\n",
"    print(f'\\nExample string: \"{example_string}\"')\n",
"    # for each encoding, print the # of tokens, the token integers, and the token bytes\n",
"    for encoding_name in [\"gpt2\", \"p50k_base\", \"cl100k_base\"]:\n",
"        encoding = tiktoken.get_encoding(encoding_name)\n",
"        token_integers = encoding.encode(example_string)\n",
"        num_tokens = len(token_integers)\n",
"        token_bytes = [encoding.decode_single_token_bytes(token) for token in token_integers]\n",
"        print()\n",
"        print(f\"{encoding_name}: {num_tokens} tokens\")\n",
"        print(f\"token integers: {token_integers}\")\n",
"        print(f\"token bytes: {token_bytes}\")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Example string: \"antidisestablishmentarianism\"\n",
"\n",
"gpt2: 5 tokens\n",
"token integers: [415, 29207, 44390, 3699, 1042]\n",
"token bytes: [b'ant', b'idis', b'establishment', b'arian', b'ism']\n",
"\n",
"p50k_base: 5 tokens\n",
"token integers: [415, 29207, 44390, 3699, 1042]\n",
"token bytes: [b'ant', b'idis', b'establishment', b'arian', b'ism']\n",
"\n",
"cl100k_base: 6 tokens\n",
"token integers: [519, 85342, 34500, 479, 8997, 2191]\n",
"token bytes: [b'ant', b'idis', b'establish', b'ment', b'arian', b'ism']\n"
]
}
],
"source": [
"compare_encodings(\"antidisestablishmentarianism\")\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Example string: \"2 + 2 = 4\"\n",
"\n",
"gpt2: 5 tokens\n",
"token integers: [17, 1343, 362, 796, 604]\n",
"token bytes: [b'2', b' +', b' 2', b' =', b' 4']\n",
"\n",
"p50k_base: 5 tokens\n",
"token integers: [17, 1343, 362, 796, 604]\n",
"token bytes: [b'2', b' +', b' 2', b' =', b' 4']\n",
"\n",
"cl100k_base: 7 tokens\n",
"token integers: [17, 489, 220, 17, 284, 220, 19]\n",
"token bytes: [b'2', b' +', b' ', b'2', b' =', b' ', b'4']\n"
]
}
],
"source": [
"compare_encodings(\"2 + 2 = 4\")\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Example string: \"お誕生日おめでとう\"\n",
"\n",
"gpt2: 14 tokens\n",
"token integers: [2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 223, 30640, 30201, 29557]\n",
"token bytes: [b'\\xe3\\x81', b'\\x8a', b'\\xe8\\xaa', b'\\x95', b'\\xe7\\x94\\x9f', b'\\xe6\\x97', b'\\xa5', b'\\xe3\\x81', b'\\x8a', b'\\xe3\\x82', b'\\x81', b'\\xe3\\x81\\xa7', b'\\xe3\\x81\\xa8', b'\\xe3\\x81\\x86']\n",
"\n",
"p50k_base: 14 tokens\n",
"token integers: [2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 223, 30640, 30201, 29557]\n",
"token bytes: [b'\\xe3\\x81', b'\\x8a', b'\\xe8\\xaa', b'\\x95', b'\\xe7\\x94\\x9f', b'\\xe6\\x97', b'\\xa5', b'\\xe3\\x81', b'\\x8a', b'\\xe3\\x82', b'\\x81', b'\\xe3\\x81\\xa7', b'\\xe3\\x81\\xa8', b'\\xe3\\x81\\x86']\n",
"\n",
"cl100k_base: 9 tokens\n",
"token integers: [33334, 45918, 243, 21990, 9080, 33334, 62004, 16556, 78699]\n",
"token bytes: [b'\\xe3\\x81\\x8a', b'\\xe8\\xaa', b'\\x95', b'\\xe7\\x94\\x9f', b'\\xe6\\x97\\xa5', b'\\xe3\\x81\\x8a', b'\\xe3\\x82\\x81', b'\\xe3\\x81\\xa7', b'\\xe3\\x81\\xa8\\xe3\\x81\\x86']\n"
]
}
],
"source": [
"compare_encodings(\"お誕生日おめでとう\")\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Counting tokens for chat API calls\n",
"\n",
"ChatGPT models like `gpt-3.5-turbo` use tokens in the same way as other models, but because of their message-based formatting, it's more difficult to count how many tokens will be used by a conversation.\n",
"\n",
"Below is an example function for counting tokens for messages passed to `gpt-3.5-turbo-0301`.\n",
"\n",
"The exact way that messages are converted into tokens may change from model to model. So when future model versions are released, the answers returned by this function may be only approximate. The [ChatML documentation](https://github.com/openai/openai-python/blob/main/chatml.md) explains how messages are converted into tokens by the OpenAI API, and may be useful for writing your own function."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"def num_tokens_from_messages(messages, model=\"gpt-3.5-turbo-0301\"):\n",
"    \"\"\"Returns the number of tokens used by a list of messages.\"\"\"\n",
"    try:\n",
"        encoding = tiktoken.encoding_for_model(model)\n",
"    except KeyError:\n",
"        encoding = tiktoken.get_encoding(\"cl100k_base\")\n",
"    if model == \"gpt-3.5-turbo-0301\":  # note: future models may deviate from this\n",
"        num_tokens = 0\n",
"        for message in messages:\n",
"            num_tokens += 4  # every message follows <im_start>{role/name}\\n{content}<im_end>\\n\n",
"            for key, value in message.items():\n",
"                num_tokens += len(encoding.encode(value))\n",
"                if key == \"name\":  # if there's a name, the role is omitted\n",
"                    num_tokens += -1  # role is always required and always 1 token\n",
"        num_tokens += 2  # every reply is primed with <im_start>assistant\n",
"        return num_tokens\n",
"    else:\n",
"        raise NotImplementedError(f\"\"\"num_tokens_from_messages() is not presently implemented for model {model}.\n",
"See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.\"\"\")\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"messages = [\n",
"    {\"role\": \"system\", \"content\": \"You are a helpful, pattern-following assistant that translates corporate jargon into plain English.\"},\n",
"    {\"role\": \"system\", \"name\": \"example_user\", \"content\": \"New synergies will help drive top-line growth.\"},\n",
"    {\"role\": \"system\", \"name\": \"example_assistant\", \"content\": \"Things working well together will increase revenue.\"},\n",
"    {\"role\": \"system\", \"name\": \"example_user\", \"content\": \"Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage.\"},\n",
"    {\"role\": \"system\", \"name\": \"example_assistant\", \"content\": \"Let's talk later when we're less busy about how to do better.\"},\n",
"    {\"role\": \"user\", \"content\": \"This late pivot means we don't have time to boil the ocean for the client deliverable.\"},\n",
"]\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"126 prompt tokens counted.\n"
]
}
],
"source": [
"# example token count from the function defined above\n",
"model = \"gpt-3.5-turbo-0301\"\n",
"\n",
"print(f\"{num_tokens_from_messages(messages, model)} prompt tokens counted.\")\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"126 prompt tokens used.\n"
]
}
],
"source": [
"# example token count from the OpenAI API\n",
"import openai\n",
"\n",
"\n",
"response = openai.ChatCompletion.create(\n",
"    model=model,\n",
"    messages=messages,\n",
"    temperature=0,\n",
")\n",
"\n",
"print(f'{response[\"usage\"][\"prompt_tokens\"]} prompt tokens used.')\n"
]
},
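{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Once a conversation can be counted, the count can be used to keep it under a token budget. The sketch below is one possible approach, not an OpenAI recommendation: `trim_to_budget` and `word_count` are hypothetical names, and `word_count` is a crude stand-in counter used only to keep the example self-contained.\n",
"\n",
"```python\n",
"from typing import Callable, Dict, List\n",
"\n",
"Message = Dict[str, str]\n",
"\n",
"\n",
"def trim_to_budget(\n",
"    messages: List[Message],\n",
"    max_tokens: int,\n",
"    count_tokens: Callable[[List[Message]], int],\n",
") -> List[Message]:\n",
"    \"\"\"Drop the oldest non-system messages until the count fits max_tokens.\"\"\"\n",
"    kept = list(messages)  # copy so the caller's list is not modified\n",
"    while count_tokens(kept) > max_tokens:\n",
"        # locate the oldest message whose role is not 'system'\n",
"        index = next((i for i, m in enumerate(kept) if m['role'] != 'system'), None)\n",
"        if index is None:\n",
"            break  # only system messages remain, so nothing more can be dropped\n",
"        del kept[index]\n",
"    return kept\n",
"\n",
"\n",
"def word_count(msgs: List[Message]) -> int:\n",
"    \"\"\"Stand-in counter: one 'token' per whitespace-separated word of content.\"\"\"\n",
"    return sum(len(m['content'].split()) for m in msgs)\n",
"\n",
"\n",
"history = [\n",
"    {'role': 'system', 'content': 'Be brief.'},\n",
"    {'role': 'user', 'content': 'First question about tokens'},\n",
"    {'role': 'user', 'content': 'Second question'},\n",
"]\n",
"print(len(trim_to_budget(history, max_tokens=4, count_tokens=word_count)))\n",
"```\n",
"\n",
"In this notebook, `num_tokens_from_messages` could be passed as `count_tokens` in place of the stand-in counter."
]
},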
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "openai",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}