{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Embedding texts that are longer than the model's context length\n",
"\n",
"All models have a maximum context length for the input text they take in. However, this maximum length is defined in terms of _tokens_ instead of string length. If you are unfamiliar with tokenization, you can check out the [\"How to count tokens with tiktoken\"](How_to_count_tokens_with_tiktoken.ipynb) notebook in this same cookbook.\n",
"\n",
"In this notebook, we will go over how to handle texts that are larger than a model's context length. In these examples, we will focus on embedding texts using the `text-embedding-ada-002`, but similar approaches can also be applied to other models and tasks. To learn about how to embed a text, check out the [Get embeddings](Get_embeddings.ipynb) notebook and the OpenAI [embeddings page](https://beta.openai.com/docs/guides/embeddings).\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Model context length\n",
"\n",
"First, let us define the model we will be working with and a funciton to get embeddings from the API."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"from tenacity import retry, wait_random_exponential, stop_after_attempt, retry_if_not_exception_type\n",
"\n",
"\n",
"EMBEDDING_MODEL = 'text-embedding-ada-002'\n",
"EMBEDDING_CTX_LENGTH = 8191\n",
"EMBEDDING_ENCODING = 'cl100k_base'\n",
"\n",
"# let's make sure to not retry on an invalid request, because that is what we want to demonstrate\n",
"@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6), retry=retry_if_not_exception_type(openai.InvalidRequestError))\n",
"def get_embedding(text_or_tokens, model=EMBEDDING_MODEL):\n",
" return openai.Embedding.create(input=text_or_tokens, model=model)[\"data\"][0][\"embedding\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `text-embedding-ada-002` model has a context length of 8191 tokens with the `cl100k_base` encoding, and we can see that going over that limit causes an error."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"ename": "InvalidRequestError",
"evalue": "This model's maximum context length is 8191 tokens, however you requested 10001 tokens (10001 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.",
"output_type": "error",
"traceback": [
"\u001B[0;31m---------------------------------------------------------------------------\u001B[0m",
"\u001B[0;31mInvalidRequestError\u001B[0m Traceback (most recent call last)",
"Cell \u001B[0;32mIn [18], line 2\u001B[0m\n\u001B[1;32m 1\u001B[0m long_text \u001B[38;5;241m=\u001B[39m \u001B[38;5;124m'\u001B[39m\u001B[38;5;124mAGI \u001B[39m\u001B[38;5;124m'\u001B[39m \u001B[38;5;241m*\u001B[39m \u001B[38;5;241m5000\u001B[39m\n\u001B[0;32m----> 2\u001B[0m \u001B[43mget_embedding\u001B[49m\u001B[43m(\u001B[49m\u001B[43mlong_text\u001B[49m\u001B[43m)\u001B[49m\n",
"File \u001B[0;32m~/.virtualenvs/openai/lib/python3.9/site-packages/tenacity/__init__.py:326\u001B[0m, in \u001B[0;36mBaseRetrying.wraps.<locals>.wrapped_f\u001B[0;34m(*args, **kw)\u001B[0m\n\u001B[1;32m 324\u001B[0m \u001B[38;5;129m@functools\u001B[39m\u001B[38;5;241m.\u001B[39mwraps(f)\n\u001B[1;32m 325\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mwrapped_f\u001B[39m(\u001B[38;5;241m*\u001B[39margs: t\u001B[38;5;241m.\u001B[39mAny, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mkw: t\u001B[38;5;241m.\u001B[39mAny) \u001B[38;5;241m-\u001B[39m\u001B[38;5;241m>\u001B[39m t\u001B[38;5;241m.\u001B[39mAny:\n\u001B[0;32m--> 326\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28;43mself\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43mf\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43margs\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43mkw\u001B[49m\u001B[43m)\u001B[49m\n",
"File \u001B[0;32m~/.virtualenvs/openai/lib/python3.9/site-packages/tenacity/__init__.py:406\u001B[0m, in \u001B[0;36mRetrying.__call__\u001B[0;34m(self, fn, *args, **kwargs)\u001B[0m\n\u001B[1;32m 404\u001B[0m retry_state \u001B[38;5;241m=\u001B[39m RetryCallState(retry_object\u001B[38;5;241m=\u001B[39m\u001B[38;5;28mself\u001B[39m, fn\u001B[38;5;241m=\u001B[39mfn, args\u001B[38;5;241m=\u001B[39margs, kwargs\u001B[38;5;241m=\u001B[39mkwargs)\n\u001B[1;32m 405\u001B[0m \u001B[38;5;28;01mwhile\u001B[39;00m \u001B[38;5;28;01mTrue\u001B[39;00m:\n\u001B[0;32m--> 406\u001B[0m do \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43miter\u001B[49m\u001B[43m(\u001B[49m\u001B[43mretry_state\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mretry_state\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 407\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(do, DoAttempt):\n\u001B[1;32m 408\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n",
"File \u001B[0;32m~/.virtualenvs/openai/lib/python3.9/site-packages/tenacity/__init__.py:351\u001B[0m, in \u001B[0;36mBaseRetrying.iter\u001B[0;34m(self, retry_state)\u001B[0m\n\u001B[1;32m 349\u001B[0m is_explicit_retry \u001B[38;5;241m=\u001B[39m retry_state\u001B[38;5;241m.\u001B[39moutcome\u001B[38;5;241m.\u001B[39mfailed \u001B[38;5;129;01mand\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(retry_state\u001B[38;5;241m.\u001B[39moutcome\u001B[38;5;241m.\u001B[39mexception(), TryAgain)\n\u001B[1;32m 350\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m (is_explicit_retry \u001B[38;5;129;01mor\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mretry(retry_state\u001B[38;5;241m=\u001B[39mretry_state)):\n\u001B[0;32m--> 351\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mfut\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mresult\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 353\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mafter \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n\u001B[1;32m 354\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mafter(retry_state)\n",
"File \u001B[0;32m~/.pyenv/versions/3.9.9/lib/python3.9/concurrent/futures/_base.py:438\u001B[0m, in \u001B[0;36mFuture.result\u001B[0;34m(self, timeout)\u001B[0m\n\u001B[1;32m 436\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m CancelledError()\n\u001B[1;32m 437\u001B[0m \u001B[38;5;28;01melif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_state \u001B[38;5;241m==\u001B[39m FINISHED:\n\u001B[0;32m--> 438\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m__get_result\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 440\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_condition\u001B[38;5;241m.\u001B[39mwait(timeout)\n\u001B[1;32m 442\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_state \u001B[38;5;129;01min\u001B[39;00m [CANCELLED, CANCELLED_AND_NOTIFIED]:\n",
"File \u001B[0;32m~/.pyenv/versions/3.9.9/lib/python3.9/concurrent/futures/_base.py:390\u001B[0m, in \u001B[0;36mFuture.__get_result\u001B[0;34m(self)\u001B[0m\n\u001B[1;32m 388\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_exception:\n\u001B[1;32m 389\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[0;32m--> 390\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_exception\n\u001B[1;32m 391\u001B[0m \u001B[38;5;28;01mfinally\u001B[39;00m:\n\u001B[1;32m 392\u001B[0m \u001B[38;5;66;03m# Break a reference cycle with the exception in self._exception\u001B[39;00m\n\u001B[1;32m 393\u001B[0m \u001B[38;5;28mself\u001B[39m \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;01mNone\u001B[39;00m\n",
"File \u001B[0;32m~/.virtualenvs/openai/lib/python3.9/site-packages/tenacity/__init__.py:409\u001B[0m, in \u001B[0;36mRetrying.__call__\u001B[0;34m(self, fn, *args, **kwargs)\u001B[0m\n\u001B[1;32m 407\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(do, DoAttempt):\n\u001B[1;32m 408\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[0;32m--> 409\u001B[0m result \u001B[38;5;241m=\u001B[39m \u001B[43mfn\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43margs\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43mkwargs\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 410\u001B[0m \u001B[38;5;28;01mexcept\u001B[39;00m \u001B[38;5;167;01mBaseException\u001B[39;00m: \u001B[38;5;66;03m# noqa: B902\u001B[39;00m\n\u001B[1;32m 411\u001B[0m retry_state\u001B[38;5;241m.\u001B[39mset_exception(sys\u001B[38;5;241m.\u001B[39mexc_info())\n",
"Cell \u001B[0;32mIn [16], line 12\u001B[0m, in \u001B[0;36mget_embedding\u001B[0;34m(text_or_tokens, model)\u001B[0m\n\u001B[1;32m 10\u001B[0m \u001B[38;5;129m@retry\u001B[39m(wait\u001B[38;5;241m=\u001B[39mwait_random_exponential(\u001B[38;5;28mmin\u001B[39m\u001B[38;5;241m=\u001B[39m\u001B[38;5;241m1\u001B[39m, \u001B[38;5;28mmax\u001B[39m\u001B[38;5;241m=\u001B[39m\u001B[38;5;241m20\u001B[39m), stop\u001B[38;5;241m=\u001B[39mstop_after_attempt(\u001B[38;5;241m6\u001B[39m), retry\u001B[38;5;241m=\u001B[39mretry_if_not_exception_type(openai\u001B[38;5;241m.\u001B[39mInvalidRequestError))\n\u001B[1;32m 11\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mget_embedding\u001B[39m(text_or_tokens, model\u001B[38;5;241m=\u001B[39mEMBEDDING_MODEL):\n\u001B[0;32m---> 12\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mopenai\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mEmbedding\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcreate\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;28;43minput\u001B[39;49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mtext_or_tokens\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mmodel\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mmodel\u001B[49m\u001B[43m)\u001B[49m[\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mdata\u001B[39m\u001B[38;5;124m\"\u001B[39m][\u001B[38;5;241m0\u001B[39m][\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124membedding\u001B[39m\u001B[38;5;124m\"\u001B[39m]\n",
"File \u001B[0;32m~/code/openai-python/openai/api_resources/embedding.py:33\u001B[0m, in \u001B[0;36mEmbedding.create\u001B[0;34m(cls, *args, **kwargs)\u001B[0m\n\u001B[1;32m 31\u001B[0m \u001B[38;5;28;01mwhile\u001B[39;00m \u001B[38;5;28;01mTrue\u001B[39;00m:\n\u001B[1;32m 32\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[0;32m---> 33\u001B[0m response \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43msuper\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcreate\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43margs\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43mkwargs\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 35\u001B[0m \u001B[38;5;66;03m# If a user specifies base64, we'll just return the encoded string.\u001B[39;00m\n\u001B[1;32m 36\u001B[0m \u001B[38;5;66;03m# This is only for the default case.\u001B[39;00m\n\u001B[1;32m 37\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m user_provided_encoding_format:\n",
"File \u001B[0;32m~/code/openai-python/openai/api_resources/abstract/engine_api_resource.py:153\u001B[0m, in \u001B[0;36mEngineAPIResource.create\u001B[0;34m(cls, api_key, api_base, api_type, request_id, api_version, organization, **params)\u001B[0m\n\u001B[1;32m 127\u001B[0m \u001B[38;5;129m@classmethod\u001B[39m\n\u001B[1;32m 128\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mcreate\u001B[39m(\n\u001B[1;32m 129\u001B[0m \u001B[38;5;28mcls\u001B[39m,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 136\u001B[0m \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mparams,\n\u001B[1;32m 137\u001B[0m ):\n\u001B[1;32m 138\u001B[0m (\n\u001B[1;32m 139\u001B[0m deployment_id,\n\u001B[1;32m 140\u001B[0m engine,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 150\u001B[0m api_key, api_base, api_type, api_version, organization, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mparams\n\u001B[1;32m 151\u001B[0m )\n\u001B[0;32m--> 153\u001B[0m response, _, api_key \u001B[38;5;241m=\u001B[39m \u001B[43mrequestor\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mrequest\u001B[49m\u001B[43m(\u001B[49m\n\u001B[1;32m 154\u001B[0m \u001B[43m \u001B[49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[38;5;124;43mpost\u001B[39;49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[43m,\u001B[49m\n\u001B[1;32m 155\u001B[0m \u001B[43m \u001B[49m\u001B[43murl\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 156\u001B[0m \u001B[43m \u001B[49m\u001B[43mparams\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mparams\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 157\u001B[0m \u001B[43m \u001B[49m\u001B[43mheaders\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mheaders\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 158\u001B[0m \u001B[43m \u001B[49m\u001B[43mstream\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mstream\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 159\u001B[0m \u001B[43m \u001B[49m\u001B[43mrequest_id\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mrequest_id\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 160\u001B[0m \u001B[43m \u001B[49m\u001B[43mrequest_timeout\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mrequest_timeout\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 161\u001B[0m \u001B[43m \u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 163\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m stream:\n\u001B[1;32m 164\u001B[0m \u001B[38;5;66;03m# must be an iterator\u001B[39;00m\n\u001B[1;32m 165\u001B[0m \u001B[38;5;28;01massert\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(response, OpenAIResponse)\n",
"File \u001B[0;32m~/code/openai-python/openai/api_requestor.py:227\u001B[0m, in \u001B[0;36mAPIRequestor.request\u001B[0;34m(self, method, url, params, headers, files, stream, request_id, request_timeout)\u001B[0m\n\u001B[1;32m 206\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mrequest\u001B[39m(\n\u001B[1;32m 207\u001B[0m \u001B[38;5;28mself\u001B[39m,\n\u001B[1;32m 208\u001B[0m method,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 215\u001B[0m request_timeout: Optional[Union[\u001B[38;5;28mfloat\u001B[39m, Tuple[\u001B[38;5;28mfloat\u001B[39m, \u001B[38;5;28mfloat\u001B[39m]]] \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;01mNone\u001B[39;00m,\n\u001B[1;32m 216\u001B[0m ) \u001B[38;5;241m-\u001B[39m\u001B[38;5;241m>\u001B[39m Tuple[Union[OpenAIResponse, Iterator[OpenAIResponse]], \u001B[38;5;28mbool\u001B[39m, \u001B[38;5;28mstr\u001B[39m]:\n\u001B[1;32m 217\u001B[0m result \u001B[38;5;241m=\u001B[39m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mrequest_raw(\n\u001B[1;32m 218\u001B[0m method\u001B[38;5;241m.\u001B[39mlower(),\n\u001B[1;32m 219\u001B[0m url,\n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 225\u001B[0m request_timeout\u001B[38;5;241m=\u001B[39mrequest_timeout,\n\u001B[1;32m 226\u001B[0m )\n\u001B[0;32m--> 227\u001B[0m resp, got_stream \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_interpret_response\u001B[49m\u001B[43m(\u001B[49m\u001B[43mresult\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mstream\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 228\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m resp, got_stream, \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mapi_key\n",
"File \u001B[0;32m~/code/openai-python/openai/api_requestor.py:620\u001B[0m, in \u001B[0;36mAPIRequestor._interpret_response\u001B[0;34m(self, result, stream)\u001B[0m\n\u001B[1;32m 612\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m (\n\u001B[1;32m 613\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_interpret_response_line(\n\u001B[1;32m 614\u001B[0m line, result\u001B[38;5;241m.\u001B[39mstatus_code, result\u001B[38;5;241m.\u001B[39mheaders, stream\u001B[38;5;241m=\u001B[39m\u001B[38;5;28;01mTrue\u001B[39;00m\n\u001B[1;32m 615\u001B[0m )\n\u001B[1;32m 616\u001B[0m \u001B[38;5;28;01mfor\u001B[39;00m line \u001B[38;5;129;01min\u001B[39;00m parse_stream(result\u001B[38;5;241m.\u001B[39miter_lines())\n\u001B[1;32m 617\u001B[0m ), \u001B[38;5;28;01mTrue\u001B[39;00m\n\u001B[1;32m 618\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[1;32m 619\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m (\n\u001B[0;32m--> 620\u001B[0m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_interpret_response_line\u001B[49m\u001B[43m(\u001B[49m\n\u001B[1;32m 621\u001B[0m \u001B[43m \u001B[49m\u001B[43mresult\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcontent\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mdecode\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[38;5;124;43mutf-8\u001B[39;49m\u001B[38;5;124;43m\"\u001B[39;49m\u001B[43m)\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 622\u001B[0m \u001B[43m \u001B[49m\u001B[43mresult\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mstatus_code\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 623\u001B[0m \u001B[43m \u001B[49m\u001B[43mresult\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mheaders\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m 624\u001B[0m \u001B[43m \u001B[49m\u001B[43mstream\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[38;5;28;43;01mFalse\u001B[39;49;00m\u001B[43m,\u001B[49m\n\u001B[1;32m 625\u001B[0m \u001B[43m \u001B[49m\u001B[43m)\u001B[49m,\n\u001B[1;32m 626\u001B[0m \u001B[38;5;28;01mFalse\u001B[39;00m,\n\u001B[1;32m 627\u001B[0m )\n",
"File \u001B[0;32m~/code/openai-python/openai/api_requestor.py:680\u001B[0m, in \u001B[0;36mAPIRequestor._interpret_response_line\u001B[0;34m(self, rbody, rcode, rheaders, stream)\u001B[0m\n\u001B[1;32m 678\u001B[0m stream_error \u001B[38;5;241m=\u001B[39m stream \u001B[38;5;129;01mand\u001B[39;00m \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124merror\u001B[39m\u001B[38;5;124m\"\u001B[39m \u001B[38;5;129;01min\u001B[39;00m resp\u001B[38;5;241m.\u001B[39mdata\n\u001B[1;32m 679\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m stream_error \u001B[38;5;129;01mor\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;241m200\u001B[39m \u001B[38;5;241m<\u001B[39m\u001B[38;5;241m=\u001B[39m rcode \u001B[38;5;241m<\u001B[39m \u001B[38;5;241m300\u001B[39m:\n\u001B[0;32m--> 680\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mhandle_error_response(\n\u001B[1;32m 681\u001B[0m rbody, rcode, resp\u001B[38;5;241m.\u001B[39mdata, rheaders, stream_error\u001B[38;5;241m=\u001B[39mstream_error\n\u001B[1;32m 682\u001B[0m )\n\u001B[1;32m 683\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m resp\n",
"\u001B[0;31mInvalidRequestError\u001B[0m: This model's maximum context length is 8191 tokens, however you requested 10001 tokens (10001 in your prompt; 0 for the completion). Please reduce your prompt; or completion length."
]
}
],
"source": [
"long_text = 'AGI ' * 5000\n",
"get_embedding(long_text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Clearly we want to avoid these errors, particularly when handling programatically with a large number of embeddings. Yet, we still might be faced with texts that are longer than the maximum context length. Below we will describe and provide recipes for the main approaches to handling these longer texts: (1) simply truncating the text to the maximum allowed length, and (2) chunking the text and embeddings each chunk individually."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Truncating the input text\n",
"\n",
"The simplest solution is to truncate the input text to the maximum allowed length. Since the context length is in terms of tokens, we have to first tokenize the text before truncating it. The API accepts inputs both in the form of text or tokens, thus as long as you are careful that you are using the appropriate encoding, there is no need to convert the tokens back into string form. Below is an example of such a truncation function."
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [],
"source": [
"import tiktoken\n",
"\n",
"def truncate_text_tokens(text, encoding_name=EMBEDDING_ENCODING, max_tokens=EMBEDDING_CTX_LENGTH):\n",
" \"\"\"Truncate a string to have `max_tokens` according to the given encoding.\"\"\"\n",
" encoding = tiktoken.get_encoding(encoding_name)\n",
" return encoding.encode(text)[:max_tokens]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our example from before now works."
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[-0.015384314581751823,\n",
" 0.0031692360062152147,\n",
" -0.007302511017769575,\n",
" -0.02778581902384758,\n",
" -0.013409210368990898,\n",
" 0.0029592972714453936,\n",
" -0.019119545817375183,\n",
" -0.0004874778969679028,\n",
" -0.010721994563937187,\n",
" -0.023486273363232613,\n",
" 0.016351712867617607,\n",
" 0.005532307084649801,\n",
" -0.009136536158621311,\n",
" -0.014282556250691414,\n",
" 0.005122506525367498,\n",
" 0.02888757921755314,\n",
" 0.020973725244402885,\n",
" 0.009136536158621311,\n",
" 0.003303596982732415,\n",
" -0.013382338918745518,\n",
" -0.024749264121055603,\n",
" 0.03904525563120842,\n",
" -0.01699664443731308,\n",
" -0.010312194004654884,\n",
" -0.009029047563672066,\n",
" -0.001587137347087264,\n",
" 0.017036953940987587,\n",
" -0.056915245950222015,\n",
" -0.011084768921136856,\n",
" -0.006375421304255724,\n",
" 0.011145230382680893,\n",
" -0.01094368938356638,\n",
" -0.010184550657868385,\n",
" -0.009546336717903614,\n",
" -0.012105911038815975,\n",
" -0.004675756674259901,\n",
" 0.002245505340397358,\n",
" -0.0015040015568956733,\n",
" -0.007457026280462742,\n",
" 0.0029206685721874237,\n",
" 0.03993203863501549,\n",
" -0.02390279248356819,\n",
" 0.003399329027161002,\n",
" -0.02109465003013611,\n",
" -0.026590008288621902,\n",
" 0.004457420669496059,\n",
" -0.03638491407036781,\n",
" -0.018958313390612602,\n",
" 0.002221992239356041,\n",
" -0.007846672087907791,\n",
" -0.0106548136100173,\n",
" 0.0019096032483503222,\n",
" -0.015451495535671711,\n",
" -0.00783995445817709,\n",
" 0.016821976751089096,\n",
" 0.007409999612718821,\n",
" -0.017601268365979195,\n",
" 0.01502154115587473,\n",
" -0.026119744405150414,\n",
" -0.011333336122334003,\n",
" -0.017184749245643616,\n",
" -0.0352562814950943,\n",
" -0.002327801426872611,\n",
" 0.015666473656892776,\n",
" -0.023069754242897034,\n",
" -0.016821976751089096,\n",
" -0.0005298855248838663,\n",
" 0.0010933612938970327,\n",
" 0.0048571438528597355,\n",
" -0.034503862261772156,\n",
" 0.007712311577051878,\n",
" 0.038024116307497025,\n",
" -0.017856555059552193,\n",
" -0.02415807731449604,\n",
" 0.020664695650339127,\n",
" -0.01742659881711006,\n",
" 0.012072320096194744,\n",
" 0.015249954536557198,\n",
" -0.008357243612408638,\n",
" 0.001610650448128581,\n",
" 0.018017787486314774,\n",
" -0.02247856743633747,\n",
" -3.219936115783639e-05,\n",
" 0.02421182207763195,\n",
" 0.010594351217150688,\n",
" 0.01800435036420822,\n",
" -0.019777914509177208,\n",
" 0.024695521220564842,\n",
" 0.0013805575435981154,\n",
" -0.0138122932985425,\n",
" 0.02132306434214115,\n",
" 0.023325040936470032,\n",
" 0.027597714215517044,\n",
" 0.06062360480427742,\n",
" -0.019562937319278717,\n",
" 0.009559772908687592,\n",
" -0.02183363400399685,\n",
" 0.0173728559166193,\n",
" -0.028242645785212517,\n",
" -0.03058052435517311,\n",
" 0.01847461424767971,\n",
" -0.026536263525485992,\n",
" -0.007947443053126335,\n",
" -0.007517488207668066,\n",
" -0.026616880670189857,\n",
" 0.009183562360703945,\n",
" 0.01872989907860756,\n",
" -0.022075483575463295,\n",
" 0.019589809700846672,\n",
" -0.023916227743029594,\n",
" 0.019347960129380226,\n",
" 0.02378186769783497,\n",
" 0.019764477387070656,\n",
" -0.0202616136521101,\n",
" -0.019401703029870987,\n",
" 0.006335113197565079,\n",
" 0.015209645964205265,\n",
" -0.029935592785477638,\n",
" -0.007013635244220495,\n",
" -0.0363042950630188,\n",
" 0.00704050762578845,\n",
" 0.01616360805928707,\n",
" 0.014981232583522797,\n",
" -0.0013931537978351116,\n",
" 0.030661141499876976,\n",
" 0.01389290951192379,\n",
" 0.007712311577051878,\n",
" -0.01910611055791378,\n",
" -0.0020792337600141764,\n",
" -0.008404269814491272,\n",
" 0.024722393602132797,\n",
" 0.01699664443731308,\n",
" 0.008350525982677937,\n",
" 0.009727723896503448,\n",
" -0.010695122182369232,\n",
" 0.006560167297720909,\n",
" -0.031386688351631165,\n",
" 0.0263078510761261,\n",
" -0.0001876852911664173,\n",
" -0.01816558465361595,\n",
" 0.019482320174574852,\n",
" 0.023190679028630257,\n",
" -0.015115593560039997,\n",
" -0.015384314581751823,\n",
" -0.005233354400843382,\n",
" 0.004225648008286953,\n",
" -0.0011555030941963196,\n",
" -0.012092474848031998,\n",
" 0.011602058075368404,\n",
" -0.02179332636296749,\n",
" -0.003029836807399988,\n",
" 0.0030382343102246523,\n",
" -0.011151948943734169,\n",
" 0.007430153898894787,\n",
" 0.001625766046345234,\n",
" 0.010795892216265202,\n",
" 0.0033136738929897547,\n",
" 0.013167361728847027,\n",
" -0.027033399790525436,\n",
" 0.002052361611276865,\n",
" 0.015061848796904087,\n",
" 0.017762500792741776,\n",
" 0.014349736273288727,\n",
" -0.007047225721180439,\n",
" 0.014887180179357529,\n",
" 0.023190679028630257,\n",
" 0.0055289482697844505,\n",
" 0.018084967508912086,\n",
" -0.0014888859586790204,\n",
" -0.003711717901751399,\n",
" 0.008290063589811325,\n",
" 0.03740605339407921,\n",
" 0.007960879243910313,\n",
" 0.01809840463101864,\n",
" 0.010916817001998425,\n",
" 0.03504130616784096,\n",
" 0.0031138123013079166,\n",
" -0.005303893703967333,\n",
" -0.022868212312459946,\n",
" -0.01373839471489191,\n",
" -0.013933218084275723,\n",
" 0.008525194600224495,\n",
" 0.05304565653204918,\n",
" 0.014537842012941837,\n",
" 0.006230983417481184,\n",
" -0.004920965526252985,\n",
" 0.002856847131624818,\n",
" -0.015868013724684715,\n",
" 0.006835607346147299,\n",
" -0.027449917048215866,\n",
" -0.0049041700549423695,\n",
" 0.009808340109884739,\n",
" 0.028914449736475945,\n",
" -0.017386291176080704,\n",
" -0.6199946403503418,\n",
" -0.02336534857749939,\n",
" -0.018353689461946487,\n",
" -0.0028131799772381783,\n",
" 0.019804786890745163,\n",
" 0.04409722611308098,\n",
" 0.005280380602926016,\n",
" 0.011702828109264374,\n",
" -0.024829881265759468,\n",
" 0.01465876679867506,\n",
" -0.013395775109529495,\n",
" 0.025877896696329117,\n",
" -0.01636514998972416,\n",
" 0.01199842244386673,\n",
" -0.01084291934967041,\n",
" -0.008827506564557552,\n",
" 0.00870658177882433,\n",
" -0.020086944103240967,\n",
" -0.0006025243201293051,\n",
" 0.027812691405415535,\n",
" -0.03404703363776207,\n",
" 0.0019079238409176469,\n",
" 0.0024403284769505262,\n",
" -0.006099981721490622,\n",
" 0.009082792326807976,\n",
" -0.0050519672222435474,\n",
" 0.014309428632259369,\n",
" -0.022921957075595856,\n",
" -0.02199486829340458,\n",
" 0.003607588354498148,\n",
" -0.008518476970493793,\n",
" 0.00871329940855503,\n",
" 0.02747678942978382,\n",
" -0.020906545221805573,\n",
" 0.04253863915801048,\n",
" 0.000455147324828431,\n",
" 0.014484097249805927,\n",
" 0.033079635351896286,\n",
" 0.026590008288621902,\n",
" 0.05258882790803909,\n",
" -0.025971949100494385,\n",
" 0.010493581183254719,\n",
" 0.026455648243427277,\n",
" -0.008511758409440517,\n",
" -0.019025493413209915,\n",
" 0.020019764080643654,\n",
" 0.01937483251094818,\n",
" -0.013415928930044174,\n",
" -0.0027863075956702232,\n",
" -0.007860108278691769,\n",
" 0.011662520468235016,\n",
" -0.007255484815686941,\n",
" 0.0033707772381603718,\n",
" -0.01479312777519226,\n",
" 0.009358231909573078,\n",
" 0.007100970018655062,\n",
" 0.02388935536146164,\n",
" -0.017171313986182213,\n",
" -0.008726735599339008,\n",
" 0.010977279394865036,\n",
" -0.003943490330129862,\n",
" 0.004695910960435867,\n",
" -0.003323751036077738,\n",
" 0.011642365716397762,\n",
" -0.014510969631373882,\n",
" 0.0063888574950397015,\n",
" -0.006832248065620661,\n",
" 0.01937483251094818,\n",
" 0.0011857342906296253,\n",
" 0.006308240815997124,\n",
" -0.0029643357265740633,\n",
" 0.012569455429911613,\n",
" -0.013677932322025299,\n",
" 0.01015767827630043,\n",
" -0.002561253262683749,\n",
" -0.0055994875729084015,\n",
" 0.024144640192389488,\n",
" 0.0076988753862679005,\n",
" -0.01128630992025137,\n",
" -0.022277025505900383,\n",
" 0.013422646559774876,\n",
" 0.00892155896872282,\n",
" 0.0036613326519727707,\n",
" -0.009438848122954369,\n",
" 0.04151749610900879,\n",
" -0.005727130454033613,\n",
" -0.00863268319517374,\n",
" -0.012804587371647358,\n",
" 0.011138512752950191,\n",
" -0.003283442696556449,\n",
" -0.00783995445817709,\n",
" 0.028538240119814873,\n",
" 0.00030609077657572925,\n",
" 0.006113417912274599,\n",
" 0.0205303356051445,\n",
" 0.0037721802946180105,\n",
" -0.02425212971866131,\n",
" 0.013771984726190567,\n",
" 0.0034833045210689306,\n",
" -0.01748034358024597,\n",
" -0.0062444196082651615,\n",
" -0.005653231870383024,\n",
" 0.011037741787731647,\n",
" 0.02684529311954975,\n",
" -0.023822175338864326,\n",
" 0.041598111391067505,\n",
" -0.02915629930794239,\n",
" -0.009895674884319305,\n",
" 0.03240783140063286,\n",
" -0.022639799863100052,\n",
" 0.01879708096385002,\n",
" -0.03727169334888458,\n",
" -0.02415807731449604,\n",
" -0.02132306434214115,\n",
" 0.014940924011170864,\n",
" -0.03536377102136612,\n",
" 0.012925512157380581,\n",
" 0.012421658262610435,\n",
" 0.017117569223046303,\n",
" -0.01281130500137806,\n",
" -0.014269120059907436,\n",
" -0.010144243016839027,\n",
" 0.0049075293354690075,\n",
" -0.01338905654847622,\n",
" 0.00038628740003332496,\n",
" -7.085434481268749e-05,\n",
" -0.00503853103145957,\n",
" -0.024507414549589157,\n",
" -0.022653235122561455,\n",
" 0.02374155819416046,\n",
" 0.03141356259584427,\n",
" -0.003390931524336338,\n",
" 0.015653036534786224,\n",
" -0.024386489763855934,\n",
" 0.05546415224671364,\n",
" 0.015438059344887733,\n",
" 0.02504485845565796,\n",
" -0.001958309207111597,\n",
" 0.013684650883078575,\n",
" -0.01979134976863861,\n",
" -0.022706979885697365,\n",
" 0.013825729489326477,\n",
" 0.008753607980906963,\n",
" -0.014537842012941837,\n",
" 0.01672792248427868,\n",
" -0.01663387008011341,\n",
" -0.014121322892606258,\n",
" -0.015451495535671711,\n",
" -0.005186328198760748,\n",
" 0.03512192144989967,\n",
" -0.008746890351176262,\n",
" 0.029693743214011192,\n",
" -0.016486072912812233,\n",
" 0.026482518762350082,\n",
" -0.023969972506165504,\n",
" -0.037916626781225204,\n",
" -0.017722193151712418,\n",
" -0.0202616136521101,\n",
" 0.023298168554902077,\n",
" 0.0012461966834962368,\n",
" 0.024695521220564842,\n",
" 0.021658966317772865,\n",
" -0.010936971753835678,\n",
" 0.002561253262683749,\n",
" -0.005156097002327442,\n",
" -0.01057419739663601,\n",
" 0.009754596278071404,\n",
" 0.03125232830643654,\n",
" -0.021282754838466644,\n",
" -0.031924132257699966,\n",
" 0.016647307202219963,\n",
" 0.013355466537177563,\n",
" -0.00647283298894763,\n",
" 0.019334523007273674,\n",
" 0.012032012455165386,\n",
" 0.02778581902384758,\n",
" -0.008001187816262245,\n",
" 0.011071332730352879,\n",
" 0.0004958754288963974,\n",
" 0.006368703208863735,\n",
" -0.002013732912018895,\n",
" -0.011951396241784096,\n",
" -0.02372812293469906,\n",
" 0.0049075293354690075,\n",
" 0.032810915261507034,\n",
" -0.010963844135403633,\n",
" -0.013986961916089058,\n",
" 0.041974324733018875,\n",
" -0.018541794270277023,\n",
" 0.022586055099964142,\n",
" -0.003758744103834033,\n",
" 0.020355666056275368,\n",
" 0.022129228338599205,\n",
" 0.004353290889412165,\n",
" -0.03531002625823021,\n",
" 0.0034564323723316193,\n",
" 0.00438352208584547,\n",
" -0.0040207477286458015,\n",
" -0.002181683899834752,\n",
" 0.005488639697432518,\n",
" 0.013227824121713638,\n",
" -0.002729204250499606,\n",
" 0.02899506688117981,\n",
" -0.027328992262482643,\n",
" 0.019536064937710762,\n",
" -0.01735941879451275,\n",
" 0.007235330529510975,\n",
" -0.02126931957900524,\n",
" 0.026388466358184814,\n",
" 0.016284532845020294,\n",
" 0.0048806569539010525,\n",
" -0.018971748650074005,\n",
" -0.008384115993976593,\n",
" 0.00697332713752985,\n",
" 0.023647505789995193,\n",
" 0.011843906715512276,\n",
" -0.004353290889412165,\n",
" 0.013059872202575207,\n",
" -0.014846871607005596,\n",
" -0.0016291250940412283,\n",
" 0.010896663181483746,\n",
" -0.003557202871888876,\n",
" 0.0031524410005658865,\n",
" -0.0016249263426288962,\n",
" -0.03509504720568657,\n",
" 0.005639795679599047,\n",
" 0.00781980063766241,\n",
" 0.007786210160702467,\n",
" 0.007907134480774403,\n",
" -0.01518277358263731,\n",
" 0.0005798509810119867,\n",
" 0.006738195661455393,\n",
" 0.014578149653971195,\n",
" 0.023553453385829926,\n",
" 0.013395775109529495,\n",
" 0.015706781297922134,\n",
" 0.019119545817375183,\n",
" -0.006818812340497971,\n",
" 0.01567990891635418,\n",
" -0.0037251540925353765,\n",
" 0.003856155788525939,\n",
" 0.009425411932170391,\n",
" 0.012052166275680065,\n",
" -0.030392419546842575,\n",
" 0.03697609901428223,\n",
" -0.0009875521063804626,\n",
" 0.05073465034365654,\n",
" -0.012119347229599953,\n",
" -0.03520253673195839,\n",
" 0.027758946642279625,\n",
" -0.012240272015333176,\n",
" 0.018246199935674667,\n",
" 0.0012361196568235755,\n",
" -0.002601561602205038,\n",
" 0.02625410631299019,\n",
" -0.03541751578450203,\n",
" 0.0018071531085297465,\n",
" 0.010795892216265202,\n",
" 0.002294211182743311,\n",
" 0.023486273363232613,\n",
" 0.02646908350288868,\n",
" 0.006677733268588781,\n",
" 0.0025394195690751076,\n",
" 0.010218140669167042,\n",
" 0.013200951740145683,\n",
" 0.02325785905122757,\n",
" 0.020328793674707413,\n",
" -0.004642166662961245,\n",
" -0.015558984130620956,\n",
" 0.008323653601109982,\n",
" -0.003956926520913839,\n",
" -0.015196209773421288,\n",
" -0.006422447506338358,\n",
" -0.010513735003769398,\n",
" 0.009794904850423336,\n",
" -0.011434106156229973,\n",
" -0.018555231392383575,\n",
" -0.014833435416221619,\n",
" 0.008934995159506798,\n",
" -0.00547184469178319,\n",
" -0.0012193245347589254,\n",
" -0.01504841260612011,\n",
" 0.022693544626235962,\n",
" 0.02642877586185932,\n",
" 0.001154663390479982,\n",
" -0.01336890272796154,\n",
" -0.004890734329819679,\n",
" 0.025071730837225914,\n",
" -0.005928671453148127,\n",
" 0.004323059692978859,\n",
" -0.012018576264381409,\n",
" 0.030822373926639557,\n",
" 0.007517488207668066,\n",
" 0.011675955727696419,\n",
" 0.004951196722686291,\n",
" 0.015827706083655357,\n",
" 0.014094451442360878,\n",
" -0.039018385112285614,\n",
" 0.003486663568764925,\n",
" 0.0029206685721874237,\n",
" 0.02284134179353714,\n",
" 0.00033863127464428544,\n",
" -0.014403481036424637,\n",
" -0.001502322033047676,\n",
" 0.0244671069085598,\n",
" 0.0013881153427064419,\n",
" 0.0076316953636705875,\n",
" -0.009700851514935493,\n",
" -0.02435961924493313,\n",
" -0.045494578778743744,\n",
" 0.010124088265001774,\n",
" 0.0009220512001775205,\n",
" -0.01584114134311676,\n",
" -0.005663308780640364,\n",
" -0.012549301609396935,\n",
" -0.0027980643790215254,\n",
" -0.017292238771915436,\n",
" -0.013241259381175041,\n",
" 0.005199763923883438,\n",
" -0.0024604827631264925,\n",
" 0.011541595682501793,\n",
" -0.005219918210059404,\n",
" -0.01762814074754715,\n",
" -0.0006478711147792637,\n",
" 0.09980322420597076,\n",
" 0.02805454097688198,\n",
" 0.004746296443045139,\n",
" 0.021699273958802223,\n",
" 0.006892710458487272,\n",
" 0.011971550062298775,\n",
" -0.007779492065310478,\n",
" -0.010426400229334831,\n",
" 0.009768032468855381,\n",
" -0.023029446601867676,\n",
" 0.01742659881711006,\n",
" 0.0013268132461234927,\n",
" 0.002284134039655328,\n",
" 0.007766055874526501,\n",
" 0.018461177125573158,\n",
" -0.016244225203990936,\n",
" -0.021336499601602554,\n",
" -0.0093447957187891,\n",
" 0.004245802294462919,\n",
" -0.004783245734870434,\n",
" -0.009808340109884739,\n",
" 0.014631894417107105,\n",
" -0.02362063340842724,\n",
" 0.013651059940457344,\n",
" 0.021954558789730072,\n",
" -0.02114839479327202,\n",
" 0.0031591590959578753,\n",
" 0.003164197551086545,\n",
" -0.010769020766019821,\n",
" -0.006855761166661978,\n",
" -0.016969772055745125,\n",
" -0.00590515835210681,\n",
" -0.015411186963319778,\n",
" 0.0001366701617371291,\n",
" -0.015196209773421288,\n",
" 0.0011731380363926291,\n",
" 0.009855367243289948,\n",
" -0.018071532249450684,\n",
" 0.03936772421002388,\n",
" -0.027342429384589195,\n",
" 0.029451893642544746,\n",
" 0.0027476788964122534,\n",
" -0.009828494861721992,\n",
" -0.02257261984050274,\n",
" 0.01479312777519226,\n",
" -0.026119744405150414,\n",
" -0.01007706206291914,\n",
" 0.009559772908687592,\n",
" -0.014752819202840328,\n",
" -0.03135981783270836,\n",
" 0.014322864823043346,\n",
" -0.0008481527329422534,\n",
" -0.01502154115587473,\n",
" 0.004148390609771013,\n",
" 0.010856354609131813,\n",
" 0.013919781893491745,\n",
" -0.03135981783270836,\n",
" -0.010688403621315956,\n",
" 0.008827506564557552,\n",
" -0.017251931130886078,\n",
" -0.009700851514935493,\n",
" -0.012925512157380581,\n",
" 0.010466708801686764,\n",
" -0.019831659272313118,\n",
" 0.009721006266772747,\n",
" -0.028403880074620247,\n",
" -0.0027325632981956005,\n",
" 0.00016994545876514167,\n",
" 0.0021850429475307465,\n",
" 0.004924324341118336,\n",
" 0.02958625555038452,\n",
" 0.008216165006160736,\n",
" -0.01915985345840454,\n",
" -0.0005004940903745592,\n",
" -0.004598499275743961,\n",
" 0.02642877586185932,\n",
" 0.011689391918480396,\n",
" -0.0024201744236052036,\n",
" 0.008384115993976593,\n",
" 0.02309662662446499,\n",
" 0.023378783836960793,\n",
" -0.040657587349414825,\n",
" -0.015908321365714073,\n",
" -0.0019129622960463166,\n",
" -0.019966019317507744,\n",
" 0.009156690910458565,\n",
" -0.01279115118086338,\n",
" -0.0228950846940279,\n",
" 0.002087631495669484,\n",
" -0.0008225401979871094,\n",
" 0.013281567953526974,\n",
" -0.0075779506005346775,\n",
" 0.00553902518004179,\n",
" -0.019038928672671318,\n",
" -0.02327129617333412,\n",
" 0.002831654390320182,\n",
" 0.023177243769168854,\n",
" -0.02531358040869236,\n",
" 0.0001918840571306646,\n",
" -0.004662320949137211,\n",
" 0.0281620305031538,\n",
" -0.00968741625547409,\n",
" 0.0023210833314806223,\n",
" -0.011561749503016472,\n",
" -0.0007293273811228573,\n",
" 0.018770208582282066,\n",
" 0.005458408500999212,\n",
" 0.0038628738839179277,\n",
" 0.00013467573444359004,\n",
" -0.0010253411019220948,\n",
" 0.014161631464958191,\n",
" 0.003374136285856366,\n",
" 0.012065602466464043,\n",
" -0.013617469929158688,\n",
" -0.0018323458498343825,\n",
" 0.005045249126851559,\n",
" 0.0011084768921136856,\n",
" -0.006785221863538027,\n",
" 0.009069356136023998,\n",
" -0.0005160295404493809,\n",
" -0.012636636383831501,\n",
" -0.00517289200797677,\n",
" 0.022357642650604248,\n",
" -0.006953172851353884,\n",
" -0.029666870832443237,\n",
" 0.0021296192426234484,\n",
" -0.0006881793960928917,\n",
" -0.002552855759859085,\n",
" 0.0004912567674182355,\n",
" -0.004255879204720259,\n",
" -0.0008040656102821231,\n",
" 0.018810516223311424,\n",
" -0.015196209773421288,\n",
" -0.015962066128849983,\n",
" -0.008565503172576427,\n",
" 0.0014930847100913525,\n",
" -0.023338476195931435,\n",
" 0.012092474848031998,\n",
" -0.025743534788489342,\n",
" -0.010957125574350357,\n",
" -0.033079635351896286,\n",
" -0.00035395679879002273,\n",
" 0.022760724648833275,\n",
" -0.003933413419872522,\n",
" -0.03563249111175537,\n",
" -0.04122190177440643,\n",
" 0.0057943109422922134,\n",
" 0.017077261582016945,\n",
" -0.008478168398141861,\n",
" 0.023862482979893684,\n",
" -0.004820194561034441,\n",
" -0.004339854698628187,\n",
" -0.007947443053126335,\n",
" -0.01799091510474682,\n",
" -0.01946888491511345,\n",
" -0.027812691405415535,\n",
" -0.019656989723443985,\n",
" 0.01883738860487938,\n",
" 0.02663031592965126,\n",
" 0.028484495356678963,\n",
" 0.02800079621374607,\n",
" 0.020328793674707413,\n",
" 0.02125588245689869,\n",
" -5.395427069743164e-05,\n",
" 0.02556886523962021,\n",
" -0.012052166275680065,\n",
" 0.03243470564484596,\n",
" 0.004974709823727608,\n",
" -0.0105204526335001,\n",
" 0.011897651478648186,\n",
" 0.013335312716662884,\n",
" -0.013825729489326477,\n",
" 0.014363172464072704,\n",
" -0.01811183989048004,\n",
" -0.0024403284769505262,\n",
" 0.03727169334888458,\n",
" -0.012092474848031998,\n",
" -0.016284532845020294,\n",
" -0.008148984052240849,\n",
" -0.031091095879673958,\n",
" -0.01621735282242298,\n",
" 0.006654220167547464,\n",
" 0.020879672840237617,\n",
" 0.005364356096833944,\n",
" -0.03436949849128723,\n",
" -0.025931639596819878,\n",
" 0.013590597547590733,\n",
" 0.008639401756227016,\n",
" 0.017278803512454033,\n",
" -0.012992692179977894,\n",
" 0.021766453981399536,\n",
" -0.003293519839644432,\n",
" 0.013919781893491745,\n",
" 0.009438848122954369,\n",
" 0.015505239367485046,\n",
" -0.02374155819416046,\n",
" -0.010863073170185089,\n",
" -0.0030936580151319504,\n",
" 0.0107891745865345,\n",
" 0.017668448388576508,\n",
" 0.005965620744973421,\n",
" 0.005928671453148127,\n",
" 0.002952579176053405,\n",
" -0.016714487224817276,\n",
" -0.017036953940987587,\n",
" 0.024964241310954094,\n",
" -0.01173641812056303,\n",
" -0.003752026241272688,\n",
" 0.01094368938356638,\n",
" -0.022747289389371872,\n",
" 0.00047992009785957634,\n",
" -0.01778937317430973,\n",
" -0.05425490438938141,\n",
" -0.01731911115348339,\n",
" 0.020812492817640305,\n",
" -0.0032431345898658037,\n",
" -0.0292100440710783,\n",
" -0.004279392305761576,\n",
" 0.012482120655477047,\n",
" -0.03541751578450203,\n",
" 0.002704011742025614,\n",
" -0.007759337779134512,\n",
" 0.02304288186132908,\n",
" 0.012199963442981243,\n",
" 0.028538240119814873,\n",
" 0.014860307797789574,\n",
" -0.012307452037930489,\n",
" -0.01936139538884163,\n",
" -0.0033607003279030323,\n",
" -0.004014029633253813,\n",
" -0.007638412993401289,\n",
" -0.010271885432302952,\n",
" 0.008021341636776924,\n",
" 0.0010925214737653732,\n",
" -0.0373791828751564,\n",
" 0.0024923933669924736,\n",
" 0.008021341636776924,\n",
" -0.00739656388759613,\n",
" -0.02410433255136013,\n",
" 0.025246400386095047,\n",
" 0.005374433007091284,\n",
" 0.010762302204966545,\n",
" -0.006627347785979509,\n",
" -0.015142465941607952,\n",
" -0.050439056009054184,\n",
" 0.04108754172921181,\n",
" 0.03869592025876045,\n",
" 0.004007311537861824,\n",
" 0.003866232931613922,\n",
" 0.004413753282278776,\n",
" 0.015129029750823975,\n",
" 0.023298168554902077,\n",
" -0.024064024910330772,\n",
" 0.0011177140986546874,\n",
" -0.009270897135138512,\n",
" 0.0016266057500615716,\n",
" 0.017386291176080704,\n",
" -0.013745113275945187,\n",
" 0.01694290153682232,\n",
" 0.003973721526563168,\n",
" -0.012011858634650707,\n",
" -7.295373507076874e-05,\n",
" -0.016324840486049652,\n",
" 0.011555030941963196,\n",
" 0.014725946821272373,\n",
" 0.003930054139345884,\n",
" -0.012253707274794579,\n",
" -0.01537087932229042,\n",
" 0.0050519672222435474,\n",
" 0.016136735677719116,\n",
" -0.04573642462491989,\n",
" -0.009647107683122158,\n",
" -0.014201940037310123,\n",
" -0.006543372292071581,\n",
" 0.017655013129115105,\n",
" 0.0035168947651982307,\n",
" -0.00868642795830965,\n",
" 0.011165385134518147,\n",
" -0.023768430575728416,\n",
" -0.011763290502130985,\n",
" 0.03350958973169327,\n",
" 0.003799052443355322,\n",
" 0.0060966224409639835,\n",
" 0.0007314267568290234,\n",
" -0.004679115954786539,\n",
" -0.003631101455539465,\n",
" 0.007705593481659889,\n",
" 0.010433118790388107,\n",
" 0.029021939262747765,\n",
" -0.008390833623707294,\n",
" -0.023929663002490997,\n",
" -0.010963844135403633,\n",
" 0.00109504081774503,\n",
" -0.0034161240328103304,\n",
" 0.009304487146437168,\n",
" -0.014282556250691414,\n",
" -0.00626121461391449,\n",
" 0.03221972659230232,\n",
" -0.04299546405673027,\n",
" 0.007121123839169741,\n",
" 0.014551278203725815,\n",
" -0.012206681072711945,\n",
" -0.008169138804078102,\n",
" 0.001264671329408884,\n",
" -0.004766450263559818,\n",
" 0.00836396124213934,\n",
" 0.04237740486860275,\n",
" 0.003034875262528658,\n",
" -0.01231416966766119,\n",
" -0.01523651834577322,\n",
" 0.017775937914848328,\n",
" 0.03990516811609268,\n",
" -0.002383225131779909,\n",
" 0.004830271936953068,\n",
" 0.013563726097345352,\n",
" 0.000969917222391814,\n",
" 0.01346967276185751,\n",
" 0.002389943227171898,\n",
" -0.014806563034653664,\n",
" 0.007436871994286776,\n",
" -0.039448339492082596,\n",
" 0.009015611372888088,\n",
" 0.0007436032174155116,\n",
" -0.004622012376785278,\n",
" 0.004222289193421602,\n",
" 0.016244225203990936,\n",
" 0.01831338182091713,\n",
" 0.005146019626408815,\n",
" -0.013691368512809277,\n",
" -0.03904525563120842,\n",
" -0.024695521220564842,\n",
" -0.019562937319278717,\n",
" -0.013852601870894432,\n",
" -0.009385104291141033,\n",
" 0.003081901464611292,\n",
" -0.013019564561545849,\n",
" -0.025851024314761162,\n",
" 0.011440824717283249,\n",
" -0.02679155021905899,\n",
" -0.025219528004527092,\n",
" 0.01173641812056303,\n",
" 0.01402727048844099,\n",
" 0.02888757921755314,\n",
" 0.020503463223576546,\n",
" -0.007759337779134512,\n",
" -0.013852601870894432,\n",
" -0.005596128758043051,\n",
" 0.0010958805214613676,\n",
" -0.05113773047924042,\n",
" -0.022236717864871025,\n",
" -0.0123679144307971,\n",
" 0.021954558789730072,\n",
" 0.015196209773421288,\n",
" -0.03004308231174946,\n",
" -0.03135981783270836,\n",
" -0.016284532845020294,\n",
" -0.05863506719470024,\n",
" -0.018138712272047997,\n",
" 0.006852402351796627,\n",
" 0.014282556250691414,\n",
" 0.016459202393889427,\n",
" -0.013006128370761871,\n",
" 0.009613517671823502,\n",
" 0.020705003291368484,\n",
" 0.0090760737657547,\n",
" 0.0022656593937426805,\n",
" -0.006879274267703295,\n",
" -0.02109465003013611,\n",
" -0.003799052443355322,\n",
" -0.006419088691473007,\n",
" 0.000651650014333427,\n",
" -0.01878364384174347,\n",
" 0.002342917025089264,\n",
" -0.015572420321404934,\n",
" 0.010453272610902786,\n",
" -0.015962066128849983,\n",
" -0.00675163185223937,\n",
" 0.021229011937975883,\n",
" 0.0007910493877716362,\n",
" -0.004830271936953068,\n",
" -0.015518675558269024,\n",
" 0.007087533827871084,\n",
" 0.013295004144310951,\n",
" 0.025877896696329117,\n",
" 0.007873544469475746,\n",
" -0.027973923832178116,\n",
" -0.0028232568874955177,\n",
" 0.008303498849272728,\n",
" -0.0018491409718990326,\n",
" -0.014712510630488396,\n",
" -0.010056908242404461,\n",
" 0.0013133770553395152,\n",
" 0.0015375918010249734,\n",
" 0.025394197553396225,\n",
" -0.0009573209099471569,\n",
" -0.0033640593755990267,\n",
" -0.011749854311347008,\n",
" -0.01386603806167841,\n",
" 0.02336534857749939,\n",
" 0.010809328407049179,\n",
" 0.010695122182369232,\n",
" 0.0006445121252909303,\n",
" -0.010668249800801277,\n",
" 0.0041886987164616585,\n",
" 0.013630906119942665,\n",
" 0.015397750772535801,\n",
" -0.015505239367485046,\n",
" -0.0025142270606011152,\n",
" 0.02105434238910675,\n",
" -0.005992493126541376,\n",
" 0.019025493413209915,\n",
" -0.04425845667719841,\n",
" 0.007114405743777752,\n",
" -0.023754995316267014,\n",
" -0.010473426431417465,\n",
" 0.012764278799295425,\n",
" -0.012018576264381409,\n",
" -0.04624699801206589,\n",
" 0.022586055099964142,\n",
" 0.00040014335536397994,\n",
" -0.009257460944354534,\n",
" -0.006214188411831856,\n",
" 0.011252719908952713,\n",
" -0.011467697098851204,\n",
" -0.00610669981688261,\n",
" -0.02998933754861355,\n",
" 0.017184749245643616,\n",
" -0.026482518762350082,\n",
" -0.02105434238910675,\n",
" -0.006083186715841293,\n",
" -0.00826319120824337,\n",
" -0.001481328159570694,\n",
" -0.010983997955918312,\n",
" 0.03326774016022682,\n",
" 0.0012226835824549198,\n",
" -0.004769809544086456,\n",
" 0.19713421165943146,\n",
" -0.014269120059907436,\n",
" -0.0032179418485611677,\n",
" 0.012105911038815975,\n",
" -0.028511367738246918,\n",
" -0.017131006345152855,\n",
" 0.044177841395139694,\n",
" 0.0017853195313364267,\n",
" -0.0024302515666931868,\n",
" 0.03783601149916649,\n",
" -0.00892155896872282,\n",
" 0.012549301609396935,\n",
" 0.027275249361991882,\n",
" -0.01050029881298542,\n",
" 0.012723970226943493,\n",
" -0.02199486829340458,\n",
" -0.03536377102136612,\n",
" -0.02347283624112606,\n",
" 0.009519464336335659,\n",
" 0.005925312638282776,\n",
" 0.018703026697039604,\n",
" 0.0024755983613431454,\n",
" -0.01386603806167841,\n",
" -0.011481133289635181,\n",
" 0.0009497631108388305,\n",
" 0.000136355243739672,\n",
" -0.007302511017769575,\n",
" -0.00022925317171029747,\n",
" 0.010896663181483746,\n",
" -0.0023731482215225697,\n",
" -0.008934995159506798,\n",
" -0.02426556497812271,\n",
" -0.005135942716151476,\n",
" 0.014040706679224968,\n",
" -4.135794370085932e-05,\n",
" 0.003822565544396639,\n",
" 0.022236717864871025,\n",
" 0.01215293724089861,\n",
" 0.02794705331325531,\n",
" 0.013476391322910786,\n",
" 0.02304288186132908,\n",
" 0.016297968104481697,\n",
" 0.007846672087907791,\n",
" -0.035229410976171494,\n",
" -0.018716463819146156,\n",
" 0.03547126054763794,\n",
" ...]"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"truncated = truncate_text_tokens(long_text)\n",
"get_embedding(truncated)"
]
},
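{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you would rather keep working with strings than token lists, tiktoken's `encoding.decode` can map the truncated tokens back into text. The `truncate_text` helper below is a minimal sketch of our own, not something the API requires:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def truncate_text(text, encoding_name=EMBEDDING_ENCODING, max_tokens=EMBEDDING_CTX_LENGTH):\n",
"    \"\"\"Truncate a string to at most `max_tokens` tokens and return it as a string.\"\"\"\n",
"    encoding = tiktoken.get_encoding(encoding_name)\n",
"    # decode the truncated token ids back into text\n",
"    return encoding.decode(encoding.encode(text)[:max_tokens])"
]
},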
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Chunking the input text\n",
"\n",
"Though the option above works, it has the clear drawback of simply discarding all text after the maximum context is filled. Another possible approach that addresses this issue is to in fact divide the input text into chunks and then embed each chunk individually. We can then either use the chunk embeddings separately, or combine them in some way, such as for example calculating their average (weighted by the size of each chunk).\n",
"\n",
"We will first take a function from [python's own cookbook](https://docs.python.org/3/library/itertools.html#itertools-recipes) that breaks up a sequence into chunks."
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [],
"source": [
"from itertools import islice\n",
"\n",
"def batched(iterable, n):\n",
" \"\"\"Batch data into tuples of length n. The last batch may be shorter.\"\"\"\n",
" # batched('ABCDEFG', 3) --> ABC DEF G\n",
" if n < 1:\n",
" raise ValueError('n must be at least one')\n",
" it = iter(iterable)\n",
" while (batch := tuple(islice(it, n))):\n",
" yield batch"
]
},
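{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration (not needed for the rest of the notebook), `batched` yields fixed-size tuples, with the last one possibly shorter:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for batch in batched('ABCDEFG', 3):\n",
"    print(batch)"
]
},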
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's define a function that encodes a string into tokens and then breaks it up into chunks."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def chunked_tokens(text, encoding_name, chunk_length):\n",
" encoding = tiktoken.get_encoding(encoding_name)\n",
" tokens = encoding.encode(text)\n",
" chunks_iterator = batched(tokens, chunk_length)\n",
" yield from chunks_iterator"
]
},
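{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check of our own, we can count how many chunks the roughly 10,000-token `long_text` from before produces; with a chunk length of 8191 tokens we expect 2:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n_chunks = sum(1 for _ in chunked_tokens(long_text, EMBEDDING_ENCODING, EMBEDDING_CTX_LENGTH))\n",
"print(f\"{n_chunks} chunks\")"
]
},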
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we can write a function that safely handles embedding requests, even when the input text is longer than the maximum context length, by chunking the input tokens and embedding each chunk individually. The `average` flag can be set to `True` to return the weighted average of the chunk embeddings, or `False` to simply return the unmodified list of chunk embeddings."
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def len_safe_get_embedding(text, model=EMBEDDING_MODEL, max_tokens=EMBEDDING_CTX_LENGTH, encoding_name=EMBEDDING_ENCODING, average=True):\n",
" chunk_embeddings = []\n",
" for chunk in chunked_tokens(text, encoding_name=encoding_name, chunk_length=max_tokens):\n",
" chunk_embeddings.append(get_embedding(chunk, model=model))\n",
"\n",
" if average:\n",
" chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=[len(c) for c in chunk_embeddings]).tolist()\n",
" return chunk_embeddings"
]
},
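{
"cell_type": "markdown",
"metadata": {},
"source": [
"One caveat: OpenAI embeddings are normalized to length 1, but a weighted average of unit vectors generally is not. If you plan to compare the averaged embedding using cosine similarity or dot products, you may want to rescale it back to unit length. The `normalize_embedding` helper below is a minimal sketch of our own:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def normalize_embedding(embedding):\n",
"    \"\"\"Rescale an embedding vector to unit length.\"\"\"\n",
"    vector = np.array(embedding)\n",
"    return (vector / np.linalg.norm(vector)).tolist()"
]
},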
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once again, we can verify that we can now handle long input texts."
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Setting reduce=None gives us 2 embedding vectors.\n",
"Setting reduce='average' gives us 1 embedding vector.\n"
]
}
],
"source": [
"average_embedding_vector = len_safe_get_embedding(long_text, average=True)\n",
"chunks_embedding_vectors = len_safe_get_embedding(long_text, average=False)\n",
"\n",
"print(f\"Setting average=True gives us a single {len(average_embedding_vector)}-dimensional embedding vector for our long text.\")\n",
"print(f\"Setting average=False gives us {len(chunks_embedding_vectors)} embedding vectors, one for each of the chunks.\")\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
},
"vscode": {
"interpreter": {
"hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}