{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Embedding texts that are longer than the model's context length\n",
    "\n",
    "All models have a maximum context length for the input text they take in. However, this maximum length is defined in terms of _tokens_ instead of string length. If you are unfamiliar with tokenization, you can check out the [\"How to count tokens with tiktoken\"](How_to_count_tokens_with_tiktoken.ipynb) notebook in this same cookbook.\n",
    "\n",
    "In this notebook, we will go over how to handle texts that are longer than a model's context length. In these examples, we will focus on embedding texts using the `text-embedding-ada-002` model, but similar approaches can also be applied to other models and tasks. To learn how to embed a text, check out the [Get embeddings](Get_embeddings.ipynb) notebook and the OpenAI [embeddings page](https://beta.openai.com/docs/guides/embeddings).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model context length\n",
    "\n",
    "First, let us define the model we will be working with and a function to get embeddings from the API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "import openai\n",
    "from tenacity import retry, wait_random_exponential, stop_after_attempt, retry_if_not_exception_type\n",
    "\n",
    "\n",
    "EMBEDDING_MODEL = 'text-embedding-ada-002'\n",
    "EMBEDDING_CTX_LENGTH = 8191\n",
    "EMBEDDING_ENCODING = 'cl100k_base'\n",
    "\n",
    "# let's make sure to not retry on an invalid request, because that is what we want to demonstrate\n",
    "@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6), retry=retry_if_not_exception_type(openai.InvalidRequestError))\n",
    "def get_embedding(text_or_tokens, model=EMBEDDING_MODEL):\n",
    "    return openai.Embedding.create(input=text_or_tokens, model=model)[\"data\"][0][\"embedding\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `text-embedding-ada-002` model has a context length of 8191 tokens with the `cl100k_base` encoding, and we can see that going over that limit causes an error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "This model's maximum context length is 8191 tokens, however you requested 10001 tokens (10001 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.\n"
     ]
    }
   ],
   "source": [
    "long_text = 'AGI ' * 5000\n",
    "try:\n",
    "    get_embedding(long_text)\n",
    "except openai.InvalidRequestError as e:\n",
    "    print(e)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Clearly we want to avoid these errors, particularly when programmatically handling a large number of embeddings. Yet, we might still be faced with texts that are longer than the maximum context length. Below we describe and provide recipes for the main approaches to handling these longer texts: (1) simply truncating the text to the maximum allowed length, and (2) chunking the text and embedding each chunk individually."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Truncating the input text\n",
    "\n",
    "The simplest solution is to truncate the input text to the maximum allowed length. Since the context length is measured in tokens, we first have to tokenize the text before truncating it. The API accepts input either as text or as tokens, so as long as you use the appropriate encoding, there is no need to convert the tokens back into string form. Below is an example of such a truncation function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tiktoken\n",
    "\n",
    "def truncate_text_tokens(text, encoding_name=EMBEDDING_ENCODING, max_tokens=EMBEDDING_CTX_LENGTH):\n",
    "    \"\"\"Encode a string with the given encoding and keep at most the first `max_tokens` tokens.\"\"\"\n",
    "    encoding = tiktoken.get_encoding(encoding_name)\n",
    "    return encoding.encode(text)[:max_tokens]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our example from before now works."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "1536"
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "truncated = truncate_text_tokens(long_text)\n",
    "len(get_embedding(truncated))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Chunking the input text\n",
    "\n",
    "Though the option above works, it has the clear drawback of simply discarding all text after the maximum context is filled. Another approach that addresses this issue is to divide the input text into chunks and then embed each chunk individually. We can then either use the chunk embeddings separately, or combine them in some way, for example by calculating their average (weighted by the size of each chunk).\n",
    "\n",
    "We will first take a function from [Python's own cookbook](https://docs.python.org/3/library/itertools.html#itertools-recipes) that breaks up a sequence into chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "from itertools import islice\n",
    "\n",
    "def batched(iterable, n):\n",
    "    \"\"\"Batch data into tuples of length n. The last batch may be shorter.\"\"\"\n",
    "    # batched('ABCDEFG', 3) --> ABC DEF G\n",
    "    if n < 1:\n",
    "        raise ValueError('n must be at least one')\n",
    "    it = iter(iterable)\n",
    "    while (batch := tuple(islice(it, n))):\n",
    "        yield batch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's define a function that encodes a string into tokens and then breaks it up into chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def chunked_tokens(text, encoding_name, chunk_length):\n",
    "    encoding = tiktoken.get_encoding(encoding_name)\n",
    "    tokens = encoding.encode(text)\n",
    "    chunks_iterator = batched(tokens, chunk_length)\n",
    "    yield from chunks_iterator"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can write a function that safely handles embedding requests, even when the input text is longer than the maximum context length, by chunking the input tokens and embedding each chunk individually. The `average` flag can be set to `True` to return the weighted average of the chunk embeddings, or `False` to simply return the unmodified list of chunk embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "def len_safe_get_embedding(text, model=EMBEDDING_MODEL, max_tokens=EMBEDDING_CTX_LENGTH, encoding_name=EMBEDDING_ENCODING, average=True):\n",
    "    chunk_embeddings = []\n",
    "    chunk_lens = []  # number of tokens in each chunk, used as averaging weights\n",
    "    for chunk in chunked_tokens(text, encoding_name=encoding_name, chunk_length=max_tokens):\n",
    "        chunk_embeddings.append(get_embedding(chunk, model=model))\n",
    "        chunk_lens.append(len(chunk))\n",
    "\n",
    "    if average:\n",
    "        # weight each chunk embedding by its token count, not by the embedding dimension\n",
    "        chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens).tolist()\n",
    "    return chunk_embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once again, we can verify that we can now handle long input texts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Setting average=True gives us a single 1536-dimensional embedding vector for our long text.\n",
      "Setting average=False gives us 2 embedding vectors, one for each of the chunks.\n"
     ]
    }
   ],
   "source": [
    "average_embedding_vector = len_safe_get_embedding(long_text, average=True)\n",
    "chunks_embedding_vectors = len_safe_get_embedding(long_text, average=False)\n",
    "\n",
    "print(f\"Setting average=True gives us a single {len(average_embedding_vector)}-dimensional embedding vector for our long text.\")\n",
    "print(f\"Setting average=False gives us {len(chunks_embedding_vectors)} embedding vectors, one for each of the chunks.\")\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.9"
  },
  "vscode": {
   "interpreter": {
    "hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}