{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Embedding texts that are longer than the model's context length\n",
"\n",
"All models have a maximum context length for the input text they take in. However, this maximum length is defined in terms of _tokens_ instead of string length. If you are unfamiliar with tokenization, you can check out the [\"How to count tokens with tiktoken\"](How_to_count_tokens_with_tiktoken.ipynb) notebook in this same cookbook.\n",
"\n",
"In this notebook, we will go over how to deal with texts that are larger than a model's context length. In these examples, we will focus on embedding texts using the `text-embedding-ada-002`, but similar approaches can also be applied to other models and tasks. To learn about how to embed a text, check out the [Get embeddings](Get_embeddings.ipynb) notebook and the OpenAI [embeddings page](https://beta.openai.com/docs/guides/embeddings).\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Model context length\n",
"\n",
"First, let us define the model we will be working with and a funciton to get embeddings from the API."
]
},
{
"cell_type": "code",
"execution_count": 85,
"outputs": [],
"source": [
"import openai\n",
"from tenacity import retry, wait_random_exponential, stop_after_attempt\n",
"\n",
"\n",
"EMBEDDING_MODEL = 'text-embedding-ada-002'\n",
"EMBEDDING_CTX_LENGTH = 8191\n",
"EMBEDDING_ENCODING = 'cl100k_base'\n",
"\n",
"\n",
"@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))\n",
"def get_embedding(text_or_tokens, model=EMBEDDING_MODEL):\n",
" return openai.Embedding.create(input=text_or_tokens, model=model)[\"data\"][0][\"embedding\"]"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"The `text-embedding-ada-002` model has a context length of 8191 tokens with the `cl100k_base` encoding, and we can see that going over that limit causes an error."
],
"metadata": {
"collapsed": false
}
},
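{
"cell_type": "markdown",
"source": [
"As a quick, illustrative sanity check (not part of the original flow), we can count the tokens locally with `tiktoken` before calling the API: `'AGI '` repeated 5000 times comes out to roughly 10,000 tokens with `cl100k_base`, well over the 8191-token limit."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"import tiktoken\n",
"\n",
"# Count tokens locally before calling the API; this should report\n",
"# roughly 10,000 tokens, well over the 8191-token context length.\n",
"long_text = 'AGI ' * 5000\n",
"len(tiktoken.get_encoding(EMBEDDING_ENCODING).encode(long_text))"
],
"metadata": {
"collapsed": false
}
},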
{
"cell_type": "code",
"execution_count": 94,
"outputs": [
{
"ename": "InvalidRequestError",
"evalue": "This model's maximum context length is 8191 tokens, however you requested 10001 tokens (10001 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.",
"output_type": "error",
"traceback": [
"\u001B[0;31m---------------------------------------------------------------------------\u001B[0m",
"\u001B[0;31mInvalidRequestError\u001B[0m Traceback (most recent call last)",
"Cell \u001B[0;32mIn [94], line 4\u001B[0m\n\u001B[1;32m 1\u001B[0m \u001B[38;5;28;01mimport\u001B[39;00m \u001B[38;5;21;01mopenai\u001B[39;00m\n\u001B[1;32m 3\u001B[0m long_text \u001B[38;5;241m=\u001B[39m \u001B[38;5;124m'\u001B[39m\u001B[38;5;124mAGI \u001B[39m\u001B[38;5;124m'\u001B[39m \u001B[38;5;241m*\u001B[39m \u001B[38;5;241m5000\u001B[39m\n\u001B[0;32m----> 4\u001B[0m get_embedding(\u001B[43mopenai\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mEmbedding\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcreate\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;28;43minput\u001B[39;49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mlong_text\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mmodel\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mEMBEDDING_MODEL\u001B[49m\u001B[43m)\u001B[49m)\n",
"File \u001B[0;32m~/code/openai-python/openai/api_resources/embedding.py:33\u001B[0m, in \u001B[0;36mEmbedding.create\u001B[0;34m(cls, *args, **kwargs)\u001B[0m\n\u001B[1;32m 31\u001B[0m \u001B[38;5;28;01mwhile\u001B[39;00m \u001B[38;5;28;01mTrue\u001B[39;00m:\n\u001B[1;32m 32\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[0;32m---> 33\u001B[0m response \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43msuper\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcreate\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43margs\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43mkwargs\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 35\u001B[0m \u001B[38;5;66;03m# If a user specifies base64, we'll just return the encoded string.\u001B[39;00m\n\u001B[1;32m 36\u001B[0m \u001B[38;5;66;03m# This is only for the default case.\u001B[39;00m\n\u001B[1;32m 37\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m user_provided_encoding_format:\n",
"File \u001B[0;32m~/code/openai-python/openai/api_resources/abstract/engine_api_resource.py:153\u001B[0m, in \u001B[0;36mEngineAPIResource.create\u001B[0;34m(cls, api_key, api_base, api_type, request_id, api_version, organization, **params)\u001B[0m\n\u001B[1;32m 127\u001B[0m \u001B[38;5;129m@classmethod\u001B[39m\n\u001B[1;32m 128\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mcreate\u001B[39m(\n\u001B[1;32m 129\u001B[0m \u001B[38;5;28mcls\u001B[39m,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 136\u001B[0m \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mparams,\n\u001B[1;32m 137\u001B[0m ):\n\u001B[1;32m 138\u001B[0m (\n\u001B[1;32m 139\u001B[0m deployment_id,\n\u001B[1;32m 140\u001B[0m engine,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 150\u001B[0m api_key, api_base, api_type, api_version, organization, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mparams\n\u001B[1;32m 151\u001B[0m )\n\u001B[0;32m--> 153\u001B[0m response, _, api_key \u001B[38;5;241m=\u001B[39m \u001B[43mrequestor\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mrequest\u001B[49m\u001B[43m(\u001B[49m\n\u001B[1;32m 154\u001B[0m \u001B[43m \u001B[49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[38;5;124;43mpost\u001B[39;49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[43m,\u001B[49m\n\u001B[1;32m 155\u001B[0m \u001B[43m \u001B[49m\u001B[43murl\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 156\u001B[0m \u001B[43m \u001B[49m\u001B[43mparams\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mparams\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 157\u001B[0m \u001B[43m \u001B[49m\u001B[43mheaders\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mheaders\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 158\u001B[0m \u001B[43m \u001B[49m\u001B[43mstream\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mstream\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 159\u001B[0m \u001B[43m \u001B[49m\u001B[43mrequest_id\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mrequest_id\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 160\u001B[0m \u001B[43m \u001B[49m\u001B[43mrequest_timeout\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mrequest_timeout\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 161\u001B[0m \u001B[43m \u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 163\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m stream:\n\u001B[1;32m 164\u001B[0m \u001B[38;5;66;03m# must be an iterator\u001B[39;00m\n\u001B[1;32m 165\u001B[0m \u001B[38;5;28;01massert\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(response, OpenAIResponse)\n",
"File \u001B[0;32m~/code/openai-python/openai/api_requestor.py:227\u001B[0m, in \u001B[0;36mAPIRequestor.request\u001B[0;34m(self, method, url, params, headers, files, stream, request_id, request_timeout)\u001B[0m\n\u001B[1;32m 206\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mrequest\u001B[39m(\n\u001B[1;32m 207\u001B[0m \u001B[38;5;28mself\u001B[39m,\n\u001B[1;32m 208\u001B[0m method,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 215\u001B[0m request_timeout: Optional[Union[\u001B[38;5;28mfloat\u001B[39m, Tuple[\u001B[38;5;28mfloat\u001B[39m, \u001B[38;5;28mfloat\u001B[39m]]] \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;01mNone\u001B[39;00m,\n\u001B[1;32m 216\u001B[0m ) \u001B[38;5;241m-\u001B[39m\u001B[38;5;241m>\u001B[39m Tuple[Union[OpenAIResponse, Iterator[OpenAIResponse]], \u001B[38;5;28mbool\u001B[39m, \u001B[38;5;28mstr\u001B[39m]:\n\u001B[1;32m 217\u001B[0m result \u001B[38;5;241m=\u001B[39m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mrequest_raw(\n\u001B[1;32m 218\u001B[0m method\u001B[38;5;241m.\u001B[39mlower(),\n\u001B[1;32m 219\u001B[0m url,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 225\u001B[0m request_timeout\u001B[38;5;241m=\u001B[39mrequest_timeout,\n\u001B[1;32m 226\u001B[0m )\n\u001B[0;32m--> 227\u001B[0m resp, got_stream \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_interpret_response\u001B[49m\u001B[43m(\u001B[49m\u001B[43mresult\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mstream\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 228\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m resp, got_stream, \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mapi_key\n",
"File \u001B[0;32m~/code/openai-python/openai/api_requestor.py:620\u001B[0m, in \u001B[0;36mAPIRequestor._interpret_response\u001B[0;34m(self, result, stream)\u001B[0m\n\u001B[1;32m 612\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m (\n\u001B[1;32m 613\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_interpret_response_line(\n\u001B[1;32m 614\u001B[0m line, result\u001B[38;5;241m.\u001B[39mstatus_code, result\u001B[38;5;241m.\u001B[39mheaders, stream\u001B[38;5;241m=\u001B[39m\u001B[38;5;28;01mTrue\u001B[39;00m\n\u001B[1;32m 615\u001B[0m )\n\u001B[1;32m 616\u001B[0m \u001B[38;5;28;01mfor\u001B[39;00m line \u001B[38;5;129;01min\u001B[39;00m parse_stream(result\u001B[38;5;241m.\u001B[39miter_lines())\n\u001B[1;32m 617\u001B[0m ), \u001B[38;5;28;01mTrue\u001B[39;00m\n\u001B[1;32m 618\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[1;32m 619\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m (\n\u001B[0;32m--> 620\u001B[0m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_interpret_response_line\u001B[49m\u001B[43m(\u001B[49m\n\u001B[1;32m 621\u001B[0m \u001B[43m \u001B[49m\u001B[43mresult\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcontent\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mdecode\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[38;5;124;43mutf-8\u001B[39;49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[43m)\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 622\u001B[0m \u001B[43m \u001B[49m\u001B[43mresult\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mstatus_code\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 623\u001B[0m \u001B[43m \u001B[49m\u001B[43mresult\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mheaders\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 624\u001B[0m \u001B[43m \u001B[49m\u001B[43mstream\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[38;5;28;43;01mFalse\u001B[39;49;00m\u001B[43m,\u001B[49m\n\u001B[1;32m 625\u001B[0m \u001B[43m \u001B[49m\u001B[43m)\u001B[49m,\n\u001B[1;32m 626\u001B[0m \u001B[38;5;28;01mFalse\u001B[39;00m,\n\u001B[1;32m 627\u001B[0m )\n",
"File \u001B[0;32m~/code/openai-python/openai/api_requestor.py:680\u001B[0m, in \u001B[0;36mAPIRequestor._interpret_response_line\u001B[0;34m(self, rbody, rcode, rheaders, stream)\u001B[0m\n\u001B[1;32m 678\u001B[0m stream_error \u001B[38;5;241m=\u001B[39m stream \u001B[38;5;129;01mand\u001B[39;00m \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124merror\u001B[39m\u001B[38;5;124m\"\u001B[39m \u001B[38;5;129;01min\u001B[39;00m resp\u001B[38;5;241m.\u001B[39mdata\n\u001B[1;32m 679\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m stream_error \u001B[38;5;129;01mor\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;241m200\u001B[39m \u001B[38;5;241m<\u001B[39m\u001B[38;5;241m=\u001B[39m rcode \u001B[38;5;241m<\u001B[39m \u001B[38;5;241m300\u001B[39m:\n\u001B[0;32m--> 680\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mhandle_error_response(\n\u001B[1;32m 681\u001B[0m rbody, rcode, resp\u001B[38;5;241m.\u001B[39mdata, rheaders, stream_error\u001B[38;5;241m=\u001B[39mstream_error\n\u001B[1;32m 682\u001B[0m )\n\u001B[1;32m 683\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m resp\n",
"\u001B[0;31mInvalidRequestError\u001B[0m: This model's maximum context length is 8191 tokens, however you requested 10001 tokens (10001 in your prompt; 0 for the completion). Please reduce your prompt; or completion length."
]
}
],
"source": [
"long_text = 'AGI ' * 5000\n",
"get_embedding(input=long_text)"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"Clearly we want to avoid these errors, particularly when dealing programatically with a large number of embeddings. Yet, we still might be faced with texts that are longer than the maximum context length. Below we will describe and provide recipes for the main approaches to dealing with these longer texts: (1) simply truncating the text to the maximum allowed length, and (2) chunking the text and embeddings each chunk individually."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## 1. Truncating the input text\n",
"\n",
"The simplest solution is to truncate the input text to the maximum allowed length. Since the context length is in terms of tokens, we have to first tokenize the text before truncating it. The API accepts inputs both in the form of text or tokens, thus as long as you are careful that you are using the appropriate encoding, there is no need to convert the tokens back into string form. Below is an example of such a truncation function."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 98,
"outputs": [],
"source": [
"import tiktoken\n",
"\n",
"def truncate_text_tokens(text, encoding_name=EMBEDDING_ENCODING, max_tokens=EMBEDDING_CTX_LENGTH):\n",
" \"\"\"Truncate a string to have `max_tokens` according to the given encoding.\"\"\"\n",
" encoding = tiktoken.get_encoding(encoding_name)\n",
" return encoding.encode(text)[:max_tokens]"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"Our example from before now works."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 97,
"outputs": [
{
"data": {
"text/plain": "[-0.015384314581751823,\n 0.0031692360062152147,\n -0.007302511017769575,\n -0.02778581902384758,\n -0.013409210368990898,\n 0.0029592972714453936,\n -0.019119545817375183,\n -0.0004874778969679028,\n -0.010721994563937187,\n -0.023486273363232613,\n 0.016351712867617607,\n 0.005532307084649801,\n -0.009136536158621311,\n -0.014282556250691414,\n 0.005122506525367498,\n 0.02888757921755314,\n 0.020973725244402885,\n 0.009136536158621311,\n 0.003303596982732415,\n -0.013382338918745518,\n -0.024749264121055603,\n 0.03904525563120842,\n -0.01699664443731308,\n -0.010312194004654884,\n -0.009029047563672066,\n -0.001587137347087264,\n 0.017036953940987587,\n -0.056915245950222015,\n -0.011084768921136856,\n -0.006375421304255724,\n 0.011145230382680893,\n -0.01094368938356638,\n -0.010184550657868385,\n -0.009546336717903614,\n -0.012105911038815975,\n -0.004675756674259901,\n 0.002245505340397358,\n -0.0015040015568956733,\n -0.007457026280462742,\n 0.0029206685721874237,\n 0.03993203863501549,\n -0.02390279248356819,\n 0.003399329027161002,\n -0.02109465003013611,\n -0.026590008288621902,\n 0.004457420669496059,\n -0.03638491407036781,\n -0.018958313390612602,\n 0.002221992239356041,\n -0.007846672087907791,\n -0.0106548136100173,\n 0.0019096032483503222,\n -0.015451495535671711,\n -0.00783995445817709,\n 0.016821976751089096,\n 0.007409999612718821,\n -0.017601268365979195,\n 0.01502154115587473,\n -0.026119744405150414,\n -0.011333336122334003,\n -0.017184749245643616,\n -0.0352562814950943,\n -0.002327801426872611,\n 0.015666473656892776,\n -0.023069754242897034,\n -0.016821976751089096,\n -0.0005298855248838663,\n 0.0010933612938970327,\n 0.0048571438528597355,\n -0.034503862261772156,\n 0.007712311577051878,\n 0.038024116307497025,\n -0.017856555059552193,\n -0.02415807731449604,\n 0.020664695650339127,\n -0.01742659881711006,\n 0.012072320096194744,\n 0.015249954536557198,\n -0.008357243612408638,\n 0.001610650448128581,\n 0.018017787486314774,\n -0.02247856743633747,\n -3.219936115783639e-05,\n 0.02421182207763195,\n 0.010594351217150688,\n 0.01800435036420822,\n -0.019777914509177208,\n 0.024695521220564842,\n 0.0013805575435981154,\n -0.0138122932985425,\n 0.02132306434214115,\n 0.023325040936470032,\n 0.027597714215517044,\n 0.06062360480427742,\n -0.019562937319278717,\n 0.009559772908687592,\n -0.02183363400399685,\n 0.0173728559166193,\n -0.028242645785212517,\n -0.03058052435517311,\n 0.01847461424767971,\n -0.026536263525485992,\n -0.007947443053126335,\n -0.007517488207668066,\n -0.026616880670189857,\n 0.009183562360703945,\n 0.01872989907860756,\n -0.022075483575463295,\n 0.019589809700846672,\n -0.023916227743029594,\n 0.019347960129380226,\n 0.02378186769783497,\n 0.019764477387070656,\n -0.0202616136521101,\n -0.019401703029870987,\n 0.006335113197565079,\n 0.015209645964205265,\n -0.029935592785477638,\n -0.007013635244220495,\n -0.0363042950630188,\n 0.00704050762578845,\n 0.01616360805928707,\n 0.014981232583522797,\n -0.0013931537978351116,\n 0.030661141499876976,\n 0.01389290951192379,\n 0.007712311577051878,\n -0.01910611055791378,\n -0.0020792337600141764,\n -0.008404269814491272,\n 0.024722393602132797,\n 0.01699664443731308,\n 0.008350525982677937,\n 0.009727723896503448,\n -0.010695122182369232,\n 0.006560167297720909,\n -0.031386688351631165,\n 0.0263078510761261,\n -0.0001876852911664173,\n -0.01816558465361595,\n 0.019482320174574852,\n 0.023190679028630257,\n -0.015115593560039997,\n -0.015384314581751823,\n -0.005233354400843382,\n 
0.004225648008286953,\n -0.0011555030941963196,\n -0.012092474848031998,\n 0.011602058075368404,\n -0.02179332636296749,\n -0.003029836807399988,\n 0.0030382343102246523,\n -0.011151948943734169,\n 0.007430153898894787,\n 0.001625766046345234,\n 0.010795892216265202,\n 0.0033136738929897547,\n 0.013167361728847027,\n -0.027033399790525436,\n 0.002052361611276865,\n 0.015061848796904087,\n 0.017762500792741776,\n 0.014349736273288727,\n -0.007047225721180439,\n 0.014887180179357529,\n 0.023190679028630257,\n 0.0055289482697844505,\n 0.018
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"truncated = truncate_text_tokens(long_text)\n",
"get_embedding(truncated)"
],
"metadata": {
"collapsed": false
}
},
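{
"cell_type": "markdown",
"source": [
"As a quick check, the truncated token list should be exactly `EMBEDDING_CTX_LENGTH` tokens long; everything past that point in the original text has simply been discarded."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# The truncated input is capped at the context length; tokens beyond\n",
"# position EMBEDDING_CTX_LENGTH are dropped outright.\n",
"len(truncate_text_tokens(long_text)) == EMBEDDING_CTX_LENGTH"
],
"metadata": {
"collapsed": false
}
},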
{
"cell_type": "markdown",
"source": [
"## 2. Chunking the input text\n",
"\n",
"Though the option above works, it has the clear drawback of simply discarding all text after the maximum context is filled. Another possible approach that addresses this issue is to in fact divide the input text into chunks and then embed each chunk individually. We can then either use the chunk embeddings separately, or combine them in some way, such as for example calculating their average (weighted by the size of each chunk).\n",
"\n",
"We will first take a function from python's own cookbook that breaks up a sequence into chunks."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 91,
"outputs": [],
"source": [
"from itertools import islice\n",
"\n",
"# From: https://docs.python.org/3/library/itertools.html#itertools-recipes\n",
"def batched(iterable, n):\n",
" \"\"\"Batch data into tuples of length n. The last batch may be shorter.\"\"\"\n",
" # batched('ABCDEFG', 3) --> ABC DEF G\n",
" if n < 1:\n",
" raise ValueError('n must be at least one')\n",
" it = iter(iterable)\n",
" while (batch := tuple(islice(it, n))):\n",
" yield batch"
],
"metadata": {
"collapsed": false
}
},
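{
"cell_type": "markdown",
"source": [
"For instance, batching a 10-element range into groups of 4 yields two full tuples and one shorter remainder:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# batched() yields fixed-size tuples; only the final batch may be shorter.\n",
"list(batched(range(10), 4))  # [(0, 1, 2, 3), (4, 5, 6, 7), (8, 9)]"
],
"metadata": {
"collapsed": false
}
},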
{
"cell_type": "markdown",
"source": [
"Now let's define a function that encodes a string into tokens and then breaks it up into chunks."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"def chunked_tokens(text, encoding_name, chunk_length):\n",
" encoding = tiktoken.get_encoding(encoding_name)\n",
" tokens = encoding.encode(text)\n",
" chunks_iterator = batched(tokens, chunk_length)\n",
" yield from chunks_iterator"
],
"metadata": {
"collapsed": false
}
},
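{
"cell_type": "markdown",
"source": [
"As an illustrative check, our roughly 10,000-token `long_text` should split into two chunks: one full chunk of 8191 tokens and a shorter remainder."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Every chunk except possibly the last has exactly EMBEDDING_CTX_LENGTH tokens.\n",
"[len(chunk) for chunk in chunked_tokens(long_text, EMBEDDING_ENCODING, EMBEDDING_CTX_LENGTH)]"
],
"metadata": {
"collapsed": false
}
},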
{
"cell_type": "markdown",
"source": [
"Finally, we can write a function that safely handles embedding requests, even when the input text is longer than the maximum context length, by chunking the input tokens and embedding each chunk individually. The `reduction` flag can be set to either `'average'`, to return the weighted average of the chunk embeddings, or `None`, to simply return the unmodified list of chunk embeddings."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 101,
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def len_safe_get_embedding(text, model=EMBEDDING_MODEL, max_tokens=EMBEDDING_CTX_LENGTH, encoding_name=EMBEDDING_ENCODING, reduction=None):\n",
" chunk_embeddings = []\n",
" for chunk in chunked_tokens(text, encoding_name=encoding_name, chunk_length=max_tokens):\n",
" chunk_embeddings.append(get_embedding(chunk, model=model))\n",
"\n",
" if reduction is None:\n",
" return chunk_embeddings\n",
" elif reduction == 'average':\n",
" return [np.average(chunk_embeddings, axis=0, weights=[len(c) for c in chunk_embeddings]).tolist()]\n",
" else:\n",
" raise ValueError(f'reduction {reduction} not valid.')\n",
"\n",
"\n",
"\n"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"Once again, we can verify that we can now handle long input texts."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 102,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Setting reduction=None gives us 2 embedding vectors.\n",
"Setting reduction='average' gives us 1 embedding vectors.\n"
]
}
],
"source": [
"embedding_vectors_no_reduce = len_safe_get_embedding(long_text, reduction=None)\n",
"average_embedding_vector = len_safe_get_embedding(long_text, reduction='average')\n",
"\n",
"print(f\"Setting reduction=None gives us {len(embedding_vectors_no_reduce)} embedding vectors.\")\n",
"print(f\"Setting reduction='average' gives us {len(average_embedding_vector)} embedding vector.\")\n"
],
"metadata": {
"collapsed": false
}
}
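,
{
"cell_type": "markdown",
"source": [
"One caveat worth noting: OpenAI embeddings are normalized to unit length, but a weighted average of unit vectors is generally not. If your downstream use relies on that property (for example, treating dot products as cosine similarities), you may want to re-normalize the averaged vector. A minimal sketch, reusing `average_embedding_vector` from above:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Rescale the averaged embedding back to unit length so that dot\n",
"# products behave like cosine similarities again.\n",
"vec = np.array(average_embedding_vector[0])\n",
"normalized_vector = (vec / np.linalg.norm(vec)).tolist()"
],
"metadata": {
"collapsed": false
}
}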
],
"metadata": {
"kernelspec": {
"display_name": "openai",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}