polishes text and re-runs notebook

Ted Sanders 2023-01-19 10:04:31 -08:00
parent ee69beb8cd
commit 14262d47e8
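
The notebook text touched by this commit describes two ways to handle over-long inputs: truncating the token sequence, and chunking it and then averaging the chunk embeddings weighted by chunk size. The notebook's own code cells are elided in this diff, so the following is only a minimal sketch of those ideas, not the notebook's implementation. The names `truncate_tokens`, `weighted_average_embedding`, and `toy_embed` are hypothetical, the integer "tokens" and 2-dimensional "embeddings" are toy stand-ins for a real tokenizer and the embeddings API, and rescaling the average to unit length is an assumption rather than something the diff confirms.

```python
from itertools import islice
import math


def batched(iterable, n):
    """Yield successive tuples of up to n items (adapted from the itertools recipes)."""
    if n < 1:
        raise ValueError("n must be at least one")
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch


def truncate_tokens(tokens, max_tokens):
    """Approach 1: keep only the first max_tokens tokens and discard the rest."""
    return list(tokens)[:max_tokens]


def weighted_average_embedding(chunk_embeddings, chunk_lengths):
    """Approach 2: average chunk embeddings weighted by chunk size.

    Assumption: the averaged vector is rescaled to unit length afterwards.
    """
    total = sum(chunk_lengths)
    dim = len(chunk_embeddings[0])
    mean = [
        sum(vec[i] * w for vec, w in zip(chunk_embeddings, chunk_lengths)) / total
        for i in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm else mean


def toy_embed(chunk):
    """Hypothetical stand-in for an embeddings API call; returns a fake 2-d vector."""
    return [float(len(chunk)), 1.0]


# Toy data: ten integer "tokens" and a maximum context length of four tokens.
tokens = list(range(10))
max_tokens = 4

truncated = truncate_tokens(tokens, max_tokens)
chunks = list(batched(tokens, max_tokens))
chunk_embeddings = [toy_embed(chunk) for chunk in chunks]
chunk_lengths = [len(chunk) for chunk in chunks]
average = weighted_average_embedding(chunk_embeddings, chunk_lengths)
```

With a real model, the tokens would come from the model's tokenizer and each chunk would be sent to the embeddings endpoint in place of `toy_embed`. Weighting by chunk length keeps a short final chunk from counting as much as a full one.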

@ -1,28 +1,30 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Embedding texts that are longer than the model's context length\n",
"# Embedding texts that are longer than the model's maximum context length\n",
"\n",
"All models have a maximum context length for the input text they take in. However, this maximum length is defined in terms of _tokens_ instead of string length. If you are unfamiliar with tokenization, you can check out the [\"How to count tokens with tiktoken\"](How_to_count_tokens_with_tiktoken.ipynb) notebook in this same cookbook.\n",
"OpenAI's embedding models cannot embed text that exceeds a maximum length. The maximum length varies by model, and is measured by _tokens_, not string length. If you are unfamiliar with tokenization, check out [How to count tokens with tiktoken](How_to_count_tokens_with_tiktoken.ipynb).\n",
"\n",
"In this notebook, we will go over how to handle texts that are larger than a model's context length. In these examples, we will focus on embedding texts using the `text-embedding-ada-002`, but similar approaches can also be applied to other models and tasks. To learn about how to embed a text, check out the [Get embeddings](Get_embeddings.ipynb) notebook and the OpenAI [embeddings page](https://beta.openai.com/docs/guides/embeddings).\n"
"This notebook shows how to handle texts that are longer than a model's maximum context length. We'll demonstrate using embeddings from `text-embedding-ada-002`, but the same ideas can be applied to other models and tasks. To learn more about embeddings, check out the [OpenAI Embeddings Guide](https://beta.openai.com/docs/guides/embeddings).\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Model context length\n",
"\n",
"First, let us define the model we will be working with and a funciton to get embeddings from the API."
"First, we select the model and define a function to get embeddings from the API."
]
},
{
"cell_type": "code",
"execution_count": 25,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
@ -49,7 +51,7 @@
},
{
"cell_type": "code",
"execution_count": 26,
"execution_count": 2,
"metadata": {},
"outputs": [
{
@ -69,24 +71,26 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Clearly we want to avoid these errors, particularly when handling programatically with a large number of embeddings. Yet, we still might be faced with texts that are longer than the maximum context length. Below we will describe and provide recipes for the main approaches to handling these longer texts: (1) simply truncating the text to the maximum allowed length, and (2) chunking the text and embeddings each chunk individually."
"Clearly we want to avoid these errors, particularly when programmatically handling a large number of embeddings. Yet we may still be faced with texts that are longer than the maximum context length. Below we describe and provide recipes for the main approaches to handling these longer texts: (1) simply truncating the text to the maximum allowed length, and (2) chunking the text and embedding each chunk individually."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Truncating the input text\n",
"\n",
"The simplest solution is to truncate the input text to the maximum allowed length. Since the context length is in terms of tokens, we have to first tokenize the text before truncating it. The API accepts inputs both in the form of text or tokens, thus as long as you are careful that you are using the appropriate encoding, there is no need to convert the tokens back into string form. Below is an example of such a truncation function."
"The simplest solution is to truncate the input text to the maximum allowed length. Because the context length is measured in tokens, we have to first tokenize the text before truncating it. The API accepts inputs in the form of either text or tokens, so as long as you are careful to use the appropriate encoding, there is no need to convert the tokens back into string form. Below is an example of such a truncation function."
]
},
{
"cell_type": "code",
"execution_count": 27,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
@ -99,22 +103,25 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Our example from before now works."
"Our example from before now works without error."
]
},
{
"cell_type": "code",
"execution_count": 32,
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "1536"
"text/plain": [
"1536"
]
},
"execution_count": 32,
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
@ -125,19 +132,20 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Chunking the input text\n",
"\n",
"Though the option above works, it has the clear drawback of simply discarding all text after the maximum context is filled. Another possible approach that addresses this issue is to in fact divide the input text into chunks and then embed each chunk individually. We can then either use the chunk embeddings separately, or combine them in some way, such as for example calculating their average (weighted by the size of each chunk).\n",
"Though truncation works, discarding potentially relevant text is a clear drawback. Another approach is to divide the input text into chunks and then embed each chunk individually. Then, we can either use the chunk embeddings separately, or combine them in some way, such as averaging (weighted by the size of each chunk).\n",
"\n",
"We will first take a function from [python's own cookbook](https://docs.python.org/3/library/itertools.html#itertools-recipes) that breaks up a sequence into chunks."
"We will take a function from [Python's own cookbook](https://docs.python.org/3/library/itertools.html#itertools-recipes) that breaks up a sequence into chunks."
]
},
{
"cell_type": "code",
"execution_count": 29,
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
@ -154,15 +162,16 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's define a function that encodes a string into tokens and then breaks it up into chunks."
"Now we define a function that encodes a string into tokens and then breaks it up into chunks."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
@ -182,7 +191,7 @@
},
{
"cell_type": "code",
"execution_count": 104,
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
@ -200,23 +209,24 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Once again, we can verify that we can now handle long input texts."
"Once again, we can now handle long input texts."
]
},
{
"cell_type": "code",
"execution_count": 105,
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Setting reduce=None gives us 2 embedding vectors.\n",
"Setting reduce='average' gives us 1 embedding vector.\n"
"Setting average=True gives us a single 1536-dimensional embedding vector for our long text.\n",
"Setting average=False gives us 2 embedding vectors, one for each of the chunks.\n"
]
}
],
@ -227,6 +237,14 @@
"print(f\"Setting average=True gives us a single {len(average_embedding_vector)}-dimensional embedding vector for our long text.\")\n",
"print(f\"Setting average=False gives us {len(chunks_embedding_vectors)} embedding vectors, one for each of the chunks.\")\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"In some cases, it may make sense to split chunks on paragraph boundaries or sentence boundaries to help preserve the meaning of the text."
]
}
],
"metadata": {