"Given a text string (e.g., `\"tiktoken is great!\"`) and an encoding (e.g., `\"cl100k_base\"`), a tokenizer can split the text string into a list of tokens (e.g., `[\"t\", \"ik\", \"token\", \" is\", \" great\", \"!\"]`).\n",
"\n",
"Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token).\n",
"Note that `p50k_base` overlaps substantially with `r50k_base`, and for non-code applications, they will usually give the same tokens.\n",
"\n",
"## Tokenizer libraries by language\n",
"\n",
"source": [
"## 0. Install `tiktoken`\n",
"\n",
"If needed, install `tiktoken` with `pip`:"
]
},
{
"Requirement already satisfied: tiktoken in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (0.3.2)\n",
"Requirement already satisfied: regex>=2022.1.18 in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (from tiktoken) (2022.10.31)\n",
"Requirement already satisfied: requests>=2.26.0 in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (from tiktoken) (2.28.2)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (from requests>=2.26.0->tiktoken) (2.0.9)\n",
"Requirement already satisfied: idna<4,>=2.5 in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (from requests>=2.26.0->tiktoken) (3.3)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (from requests>=2.26.0->tiktoken) (2021.10.8)\n",
"Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (from requests>=2.26.0->tiktoken) (1.26.7)\n",
"Note: you may need to restart the kernel to use updated packages.\n"
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"[83, 1609, 5963, 374, 2294, 0]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"6"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"'tiktoken is great!'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"[b't', b'ik', b'token', b' is', b' great', b'!']"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
"source": [
"## 5. Comparing encodings\n",
"\n",
"Different encodings vary in how they split words, group spaces, and handle non-English characters. Using the methods above, we can compare different encodings on a few example strings."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"source": [
"## 6. Counting tokens for chat API calls\n",
"\n",
"ChatGPT models like `gpt-3.5-turbo` and `gpt-4` use tokens in the same way as older completions models, but because of their message-based formatting, it's more difficult to count how many tokens will be used by a conversation.\n",
"\n",
"Below is an example function for counting tokens for messages passed to `gpt-3.5-turbo-0301` or `gpt-4-0314`.\n",
"\n",
"Note that the exact way that messages are converted into tokens may change from model to model, and may even change over time for the same model. Therefore, the counts returned by the function below should be considered an estimate, not a guarantee."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
" print(\"Warning: gpt-4 may change over time. Returning num tokens assuming gpt-4-0314.\")\n",