@ -446,11 +446,11 @@ Embeddings can be used for search either by themselves or as a feature in a larg
The simplest way to use embeddings for search is as follows:
The simplest way to use embeddings for search is as follows:
* Before the search (precompute):
* Before the search (precompute):
* Split your text corpus into chunks smaller than the token limit (e.g., ~2,000 tokens)
* Split your text corpus into chunks smaller than the token limit (e.g., <8,000 tokens)
* Embed each chunk using a 'doc' model (e.g., `text-search-curie-doc-001`)
* Embed each chunk
* Store those embeddings in your own database or in a vector search provider like [Pinecone](https://www.pinecone.io) or [Weaviate](https://weaviate.io)
* Store those embeddings in your own database or in a vector search provider like [Pinecone](https://www.pinecone.io) or [Weaviate](https://weaviate.io)
* At the time of the search (live compute):
* At the time of the search (live compute):
* Embed the search query using the corresponding 'query' model (e.g. `text-search-curie-query-001`)
* Embed the search query
* Find the closest embeddings in your database
* Find the closest embeddings in your database
* Return the top results, ranked by cosine similarity
* Return the top results, ranked by cosine similarity
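
The live-compute steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the cookbook's own implementation: the vectors are hand-written toy values standing in for real embeddings, no embedding API is called, and `cosine_similarity` and `top_k` are hypothetical helper names introduced here.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def top_k(query_embedding, chunk_embeddings, k=3):
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    scores = [
        (i, cosine_similarity(query_embedding, emb))
        for i, emb in enumerate(chunk_embeddings)
    ]
    # Highest cosine similarity first
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy stand-ins for precomputed chunk embeddings (assumption: real ones
# would come from an embedding model and have far more dimensions).
chunk_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])

# Toy stand-in for the embedded search query
query_embedding = np.array([0.9, 0.1, 0.0])

results = top_k(query_embedding, chunk_embeddings, k=2)
for index, score in results:
    print(f"chunk {index}: cosine similarity {score:.3f}")
```

A brute-force scan like this is usually adequate for small corpora; a vector search provider becomes more attractive as the number of stored embeddings grows.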
@ -460,7 +460,7 @@ In more advanced search systems, the cosine similarity of embeddings can be
#### Recommendations
#### Recommendations
Recommendations are quite similar to search, except that instead of a free-form text query, the inputs are items in a set. And instead of using pairs of doc-query models, you can use a single symmetric similarity model (e.g., `text-similarity-curie-001`).
Recommendations are quite similar to search, except that instead of a free-form text query, the inputs are items in a set.
An example of how to use embeddings for recommendations is shown in [Recommendation_using_embeddings.ipynb](examples/Recommendation_using_embeddings.ipynb).
An example of how to use embeddings for recommendations is shown in [Recommendation_using_embeddings.ipynb](examples/Recommendation_using_embeddings.ipynb).
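
As a rough sketch of the idea (separate from the linked notebook, and not its implementation), the snippet below takes one item from a set and returns the indices of the other items whose embeddings are closest by cosine similarity. The embeddings are toy values and `recommend` is a hypothetical helper name introduced for illustration.

```python
import numpy as np


def recommend(item_index, embeddings, k=2):
    """Return indices of the k items most similar to the item at item_index."""
    embeddings = np.asarray(embeddings, dtype=float)
    # Normalize rows so that a dot product equals cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarities = normed @ normed[item_index]
    # Exclude the source item itself from the recommendations
    similarities[item_index] = -np.inf
    order = np.argsort(-similarities)
    return [int(i) for i in order[:k]]


# Toy stand-ins for item embeddings (assumption: real ones would come
# from an embedding model).
item_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
]

recommended = recommend(0, item_embeddings, k=2)
print(recommended)
```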
"We index our own openai-python code repository, and show how it can be searched. We implement a simple version of file parsing and extracting of functions from python files."
"We index our own [openai-python code repository](https://github.com/openai/openai-python) and show how it can be searched. We implement a simple version of file parsing and function extraction from Python files."
]
]
},
},
{
{
@ -18,8 +19,8 @@
"name": "stdout",
"name": "stdout",
"output_type": "stream",
"output_type": "stream",
"text": [
"text": [
"Total number of py files: 40\n",
"Total number of py files: 51\n",
"Total number of functions extracted: 64\n"
"Total number of functions extracted: 97\n"
]
]
}
}
],
],
@ -63,18 +64,24 @@
"\n",
"\n",
"# get user root directory\n",
"# get user root directory\n",
"root_dir = os.path.expanduser(\"~\")\n",
"root_dir = os.path.expanduser(\"~\")\n",
"# note: for this code to work, the openai-python repo must be downloaded and placed in your root directory\n",
"\n",
"\n",
"# path to code repository directory\n",
"# path to code repository directory\n",
"code_root = root_dir + \"/openai-python\"\n",
"code_root = root_dir + \"/openai-python\"\n",
"\n",
"code_files = [y for x in os.walk(code_root) for y in glob(os.path.join(x[0], '*.py'))]\n",
"code_files = [y for x in os.walk(code_root) for y in glob(os.path.join(x[0], '*.py'))]\n",
"print(\"Total number of py files:\", len(code_files))\n",
"print(\"Total number of py files:\", len(code_files))\n",
"\n",
"if len(code_files) == 0:\n",
" print(\"Double check that you have downloaded the openai-python repo and set the code_root variable correctly.\")\n",
"\n",
"all_funcs = []\n",
"all_funcs = []\n",
"for code_file in code_files:\n",
"for code_file in code_files:\n",
" funcs = list(get_functions(code_file))\n",
" funcs = list(get_functions(code_file))\n",
" for func in funcs:\n",
" for func in funcs:\n",
" all_funcs.append(func)\n",
" all_funcs.append(func)\n",
"\n",
"\n",
"print(\"Total number of functions extracted:\", len(all_funcs))\n"
"print(\"Total number of functions extracted:\", len(all_funcs))"
" This validator will suggest to add a common suffix to the completion if one doesn't already exist in case of classification or conditional generation.\n",
" This validator will suggest to add a common suffix to the completion if one doesn't already exist in case of classification or conditional generation.\n",
"We will combine the review summary and review text into a single combined text. The model will encode this combined text and it will output a single vector embedding."
"We will combine the review summary and review text into a single combined text. The model will encode this combined text and it will output a single vector embedding."
]
]
},
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To run this notebook, you will need to install: pandas, openai, transformers, plotly, matplotlib, scikit-learn, torch (a dependency of transformers), torchvision, and scipy."
"datafile_path = \"https://cdn.openai.com/API/examples/data/fine_food_reviews_with_embeddings_1k.csv\" # for your convenience, we precomputed the embeddings\n",
"# If you have not run the \"Obtain_dataset.ipynb\" notebook, you can download the datafile from here: https://cdn.openai.com/API/examples/data/fine_food_reviews_with_embeddings_1k.csv\n",
"print(f\"Babbage similarity embedding performance on 1k Amazon reviews: mse={mse:.2f}, mae={mae:.2f}\")\n"
"print(f\"Ada similarity embedding performance on 1k Amazon reviews: mse={mse:.2f}, mae={mae:.2f}\")\n"
]
]
},
},
{
{
@ -57,7 +59,7 @@
"name": "stdout",
"name": "stdout",
"output_type": "stream",
"output_type": "stream",
"text": [
"text": [
"Dummy mean prediction performance on Amazon reviews: mse=1.81, mae=1.08\n"
"Dummy mean prediction performance on Amazon reviews: mse=1.73, mae=1.03\n"
]
]
}
}
],
],
@ -70,10 +72,11 @@
]
]
},
},
{
{
"attachments": {},
"cell_type": "markdown",
"cell_type": "markdown",
"metadata": {},
"metadata": {},
"source": [
"source": [
"We can see that the embeddings are able to predict the scores with an average error of 0.39 per score prediction. This is roughly equivalent to predicting 2 out of 3 reviews perfectly, and 1 out of three reviews by a one star error."
"We can see that the embeddings are able to predict the scores with an average error of 0.60 per score prediction. This is roughly equivalent to predicting 1 out of 3 reviews perfectly, and 1 out of 2 reviews with a one-star error."
"datafile_path = \"https://cdn.openai.com/API/examples/data/fine_food_reviews_with_embeddings_1k.csv\" # for your convenience, we precomputed the embeddings\n",
"# If you have not run the \"Obtain_dataset.ipynb\" notebook, you can download the datafile from here: https://cdn.openai.com/API/examples/data/fine_food_reviews_with_embeddings_1k.csv\n",
"Fantastic Instant Refried beans: Fantastic Instant Refried Beans have been a staple for my family now for nearly 20 years. All 7 of us love it and my grown kids are passing on the tradition.\n",
"Good Buy: I liked the beans. They were vacuum sealed, plump and moist. Would recommend them for any use. I personally split and stuck them in some vodka to make vanilla extract. Yum!\n",
"\n",
"\n",
"Jamaican Blue beans: Excellent coffee bean for roasting. Our family just purchased another 5 pounds for more roasting. Plenty of flavor and mild on acidity when roasted to a dark brown bean and befor\n",
"Jamaican Blue beans: Excellent coffee bean for roasting. Our family just purchased another 5 pounds for more roasting. Plenty of flavor and mild on acidity when roasted to a dark brown bean and befor\n",
"sooo good: tastes so good. Worth the money. My boyfriend hates wheat pasta and LOVES this. cooks fast tastes great.I love this brand and started buying more of their pastas. Bulk is best.\n",
"\n",
"Tasty and Quick Pasta: Barilla Whole Grain Fusilli with Vegetable Marinara is tasty and has an excellent chunky vegetable marinara. I just wish there was more of it. If you aren't starving or on a \n",
"Tasty and Quick Pasta: Barilla Whole Grain Fusilli with Vegetable Marinara is tasty and has an excellent chunky vegetable marinara. I just wish there was more of it. If you aren't starving or on a \n",
"\n",
"\n",
"Rustichella ROCKS!: Anything this company makes is worthwhile eating! My favorite is their Trenne.<br />Their whole wheat pasta is the best I have ever had.\n",
"sooo good: tastes so good. Worth the money. My boyfriend hates wheat pasta and LOVES this. cooks fast tastes great.I love this brand and started buying more of their pastas. Bulk is best.\n",
"\n",
"Handy: Love the idea of ready in a minute pasta and for that alone this product gets praise. The pasta is whole grain so that's a big plus and it actually comes out al dente. The vegetable marinara\n",
"Good food: The only dry food my queen cat will eat. Helps prevent hair balls. Good packaging. Arrives promptly. Recommended by a friend who sells pet food.\n",
"Good food: The only dry food my queen cat will eat. Helps prevent hair balls. Good packaging. Arrives promptly. Recommended by a friend who sells pet food.\n",
"\n",
"\n",
"Good product: I like that this is a better product for my pets but really for the price of it I couldn't afford to buy this all the time. My cat isn't very picky usually and she ate this, we usually \n",
"The cats like it: My 7 cats like this food but it is a little yucky for the human. Pieces of mackerel swimming in a dark broth. It is billed as a \"complete\" food and contains carrots, peas and pasta.\n",