updates embedding examples with new embedding model

pull/42/head
Logan Kilpatrick authored 1 year ago, committed by Ted Sanders
parent 7de3d50816
commit fd181ec78f

@@ -446,11 +446,11 @@ Embeddings can be used for search either by themselves or as a feature in a larg
The simplest way to use embeddings for search is as follows:
* Before the search (precompute):
* Split your text corpus into chunks smaller than the token limit (e.g., ~2,000 tokens)
* Embed each chunk using a 'doc' model (e.g., `text-search-curie-doc-001`)
* Split your text corpus into chunks smaller than the token limit (e.g., <8,000 tokens)
* Embed each chunk
* Store those embeddings in your own database or in a vector search provider like [Pinecone](https://www.pinecone.io) or [Weaviate](https://weaviate.io)
* At the time of the search (live compute):
* Embed the search query using the corresponding 'query' model (e.g. `text-search-curie-query-001`)
* Embed the search query
* Find the closest embeddings in your database
* Return the top results, ranked by cosine similarity
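
A minimal sketch of the two phases, assuming the in-memory list below stands in for the database or vector store (the `chunks` strings are placeholder data; `openai.Embedding.create` is the same call used elsewhere in this diff):

```python
import numpy as np
import openai

def embed(text: str) -> list[float]:
    # one model now covers both documents and queries
    return openai.Embedding.create(
        input=[text], engine="text-embedding-ada-002"
    )["data"][0]["embedding"]

# precompute: embed each chunk of the corpus and store the vectors
chunks = ["First document chunk...", "Second document chunk..."]
chunk_embeddings = [np.array(embed(c)) for c in chunks]

def search(query: str, n: int = 3):
    # live compute: embed the query, then rank chunks by cosine similarity
    q = np.array(embed(query))
    scores = [
        float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
        for e in chunk_embeddings
    ]
    return sorted(zip(scores, chunks), reverse=True)[:n]
```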
@@ -460,7 +460,7 @@ In more advanced search systems, the cosine similarity of embeddings can be
#### Recommendations
Recommendations are quite similar to search, except that instead of a free-form text query, the inputs are items in a set. And instead of using pairs of doc-query models, you can use a single symmetric similarity model (e.g., `text-similarity-curie-001`).
Recommendations are quite similar to search, except that instead of a free-form text query, the inputs are items in a set.
An example of how to use embeddings for recommendations is shown in [Recommendation_using_embeddings.ipynb](examples/Recommendation_using_embeddings.ipynb).
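
A hedged sketch of the item-to-item version, reusing `get_embedding` and `cosine_similarity` from `openai.embeddings_utils` as the notebooks in this diff do (the `items` list is made-up example data):

```python
from openai.embeddings_utils import get_embedding, cosine_similarity

items = [
    "Organic whole wheat fusilli pasta",
    "Gluten-free brown rice spaghetti",
    "Stainless steel travel mug",
]
embeddings = [get_embedding(item, engine="text-embedding-ada-002") for item in items]

def recommend(index: int, n: int = 2):
    # rank the other items by cosine similarity to the chosen one
    scored = [
        (cosine_similarity(embeddings[index], emb), item)
        for i, (emb, item) in enumerate(zip(embeddings, items))
        if i != index
    ]
    return sorted(scored, reverse=True)[:n]

print(recommend(0))  # the pasta items should rank above the travel mug
```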

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

@@ -1,12 +1,13 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Code search\n",
"\n",
"We index our own openai-python code repository, and show how it can be searched. We implement a simple version of file parsing and extracting of functions from python files."
"We index our own [openai-python code repository](https://github.com/openai/openai-python), and show how it can be searched. We implement a simple version of file parsing and extracting of functions from python files."
]
},
{
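
(Editorial note: the notebook's own `get_functions` is defined in a cell outside this hunk; the sketch below is one simple way to implement it, matching the `code`/`function_name`/`filepath` fields that appear in the dataframe further down, and is an assumption rather than the notebook's exact parser.)

```python
def get_functions(filepath):
    # naive parser: a function starts at a top-level "def " line and
    # runs until the next non-indented, non-blank line
    with open(filepath) as f:
        lines = f.read().split("\n")
    i = 0
    while i < len(lines):
        if lines[i].startswith("def "):
            name = lines[i][len("def "):].split("(")[0]
            body = [lines[i]]
            i += 1
            while i < len(lines) and (not lines[i] or lines[i][0] in " \t"):
                body.append(lines[i])
                i += 1
            yield {"code": "\n".join(body), "function_name": name, "filepath": filepath}
        else:
            i += 1
```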
@@ -18,8 +19,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Total number of py files: 40\n",
"Total number of functions extracted: 64\n"
"Total number of py files: 51\n",
"Total number of functions extracted: 97\n"
]
}
],
@@ -63,18 +64,24 @@
"\n",
"# get user root directory\n",
"root_dir = os.path.expanduser(\"~\")\n",
"# note: for this code to work, the openai-python repo must be downloaded and placed in your root directory\n",
"\n",
"# path to code repository directory\n",
"code_root = root_dir + \"/openai-python\"\n",
"\n",
"code_files = [y for x in os.walk(code_root) for y in glob(os.path.join(x[0], '*.py'))]\n",
"print(\"Total number of py files:\", len(code_files))\n",
"\n",
"if len(code_files) == 0:\n",
" print(\"Double check that you have downloaded the openai-python repo and set the code_root variable correctly.\")\n",
"\n",
"all_funcs = []\n",
"for code_file in code_files:\n",
" funcs = list(get_functions(code_file))\n",
" for func in funcs:\n",
" all_funcs.append(func)\n",
"\n",
"print(\"Total number of functions extracted:\", len(all_funcs))\n"
"print(\"Total number of functions extracted:\", len(all_funcs))"
]
},
{
@@ -119,64 +126,57 @@
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>def semantic_search(engine, query, documents):...</td>\n",
" <td>semantic_search</td>\n",
" <td>/examples/semanticsearch/semanticsearch.py</td>\n",
" <td>[-0.038976121693849564, -0.0031428150832653046...</td>\n",
" <td>def _console_log_level():\\n if openai.log i...</td>\n",
" <td>_console_log_level</td>\n",
" <td>/openai/util.py</td>\n",
" <td>[0.03389773145318031, -0.004390408284962177, 0...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>def main():\\n parser = argparse.ArgumentPar...</td>\n",
" <td>main</td>\n",
" <td>/examples/semanticsearch/semanticsearch.py</td>\n",
" <td>[-0.024289356544613838, -0.017748363316059113,...</td>\n",
" <td>def log_debug(message, **params):\\n msg = l...</td>\n",
" <td>log_debug</td>\n",
" <td>/openai/util.py</td>\n",
" <td>[-0.004034275189042091, 0.004895383026450872, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>def get_candidates(\\n prompt: str,\\n sto...</td>\n",
" <td>get_candidates</td>\n",
" <td>/examples/codex/backtranslation.py</td>\n",
" <td>[-0.04161201789975166, -0.0169310811907053, 0....</td>\n",
" <td>def log_info(message, **params):\\n msg = lo...</td>\n",
" <td>log_info</td>\n",
" <td>/openai/util.py</td>\n",
" <td>[0.004882764536887407, 0.0033515947870910168, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>def rindex(lst: List, value: str) -&gt; int:\\n ...</td>\n",
" <td>rindex</td>\n",
" <td>/examples/codex/backtranslation.py</td>\n",
" <td>[-0.027255680412054062, -0.007931121625006199,...</td>\n",
" <td>def log_warn(message, **params):\\n msg = lo...</td>\n",
" <td>log_warn</td>\n",
" <td>/openai/util.py</td>\n",
" <td>[0.002535992069169879, -0.010829543694853783, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>def eval_candidate(\\n candidate_answer: str...</td>\n",
" <td>eval_candidate</td>\n",
" <td>/examples/codex/backtranslation.py</td>\n",
" <td>[-0.00999179296195507, -0.01640152558684349, 0...</td>\n",
" <td>def logfmt(props):\\n def fmt(key, val):\\n ...</td>\n",
" <td>logfmt</td>\n",
" <td>/openai/util.py</td>\n",
" <td>[0.016732551157474518, 0.017367802560329437, 0...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" code function_name \\\n",
"0 def semantic_search(engine, query, documents):... semantic_search \n",
"1 def main():\\n parser = argparse.ArgumentPar... main \n",
"2 def get_candidates(\\n prompt: str,\\n sto... get_candidates \n",
"3 def rindex(lst: List, value: str) -> int:\\n ... rindex \n",
"4 def eval_candidate(\\n candidate_answer: str... eval_candidate \n",
" code function_name \\\n",
"0 def _console_log_level():\\n if openai.log i... _console_log_level \n",
"1 def log_debug(message, **params):\\n msg = l... log_debug \n",
"2 def log_info(message, **params):\\n msg = lo... log_info \n",
"3 def log_warn(message, **params):\\n msg = lo... log_warn \n",
"4 def logfmt(props):\\n def fmt(key, val):\\n ... logfmt \n",
"\n",
" filepath \\\n",
"0 /examples/semanticsearch/semanticsearch.py \n",
"1 /examples/semanticsearch/semanticsearch.py \n",
"2 /examples/codex/backtranslation.py \n",
"3 /examples/codex/backtranslation.py \n",
"4 /examples/codex/backtranslation.py \n",
"\n",
" code_embedding \n",
"0 [-0.038976121693849564, -0.0031428150832653046... \n",
"1 [-0.024289356544613838, -0.017748363316059113,... \n",
"2 [-0.04161201789975166, -0.0169310811907053, 0.... \n",
"3 [-0.027255680412054062, -0.007931121625006199,... \n",
"4 [-0.00999179296195507, -0.01640152558684349, 0... "
" filepath code_embedding \n",
"0 /openai/util.py [0.03389773145318031, -0.004390408284962177, 0... \n",
"1 /openai/util.py [-0.004034275189042091, 0.004895383026450872, ... \n",
"2 /openai/util.py [0.004882764536887407, 0.0033515947870910168, ... \n",
"3 /openai/util.py [0.002535992069169879, -0.010829543694853783, ... \n",
"4 /openai/util.py [0.016732551157474518, 0.017367802560329437, 0... "
]
},
"execution_count": 2,
@@ -188,44 +188,44 @@
"from openai.embeddings_utils import get_embedding\n",
"\n",
"df = pd.DataFrame(all_funcs)\n",
"df['code_embedding'] = df['code'].apply(lambda x: get_embedding(x, engine='code-search-babbage-code-001'))\n",
"df['code_embedding'] = df['code'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))\n",
"df['filepath'] = df['filepath'].apply(lambda x: x.replace(code_root, \"\"))\n",
"df.to_csv(\"output/code_search_openai-python.csv\", index=False)\n",
"df.to_csv(\"data/code_search_openai-python.csv\", index=False)\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/openai/tests/test_endpoints.py:test_completions_multiple_prompts score=0.681\n",
"def test_completions_multiple_prompts():\n",
" result = openai.Completion.create(\n",
" prompt=[\"This was a test\", \"This was another test\"], n=5, engine=\"ada\"\n",
" )\n",
" assert len(result.choices) == 10\n",
"\n",
"----------------------------------------------------------------------\n",
"/openai/tests/test_endpoints.py:test_completions score=0.675\n",
"/openai/tests/test_endpoints.py:test_completions score=0.826\n",
"def test_completions():\n",
" result = openai.Completion.create(prompt=\"This was a test\", n=5, engine=\"ada\")\n",
" assert len(result.choices) == 5\n",
"\n",
"\n",
"----------------------------------------------------------------------\n",
"/openai/tests/test_api_requestor.py:test_requestor_sets_request_id score=0.635\n",
"def test_requestor_sets_request_id(mocker: MockerFixture) -> None:\n",
" # Fake out 'requests' and confirm that the X-Request-Id header is set.\n",
"/openai/tests/test_endpoints.py:test_completions_model score=0.811\n",
"def test_completions_model():\n",
" result = openai.Completion.create(prompt=\"This was a test\", n=5, model=\"ada\")\n",
" assert len(result.choices) == 5\n",
" assert result.model.startswith(\"ada\")\n",
"\n",
"\n",
"----------------------------------------------------------------------\n",
"/openai/tests/test_endpoints.py:test_completions_multiple_prompts score=0.808\n",
"def test_completions_multiple_prompts():\n",
" result = openai.Completion.create(\n",
" prompt=[\"This was a test\", \"This was another test\"], n=5, engine=\"ada\"\n",
" )\n",
" assert len(result.choices) == 10\n",
"\n",
" got_headers = {}\n",
"\n",
" def fake_request(self, *args, **kwargs):\n",
" nonlocal got_headers\n",
"----------------------------------------------------------------------\n"
]
}
@@ -234,7 +234,7 @@
"from openai.embeddings_utils import cosine_similarity\n",
"\n",
"def search_functions(df, code_query, n=3, pprint=True, n_lines=7):\n",
" embedding = get_embedding(code_query, engine='code-search-babbage-text-001')\n",
" embedding = get_embedding(code_query, engine='text-embedding-ada-002')\n",
" df['similarities'] = df.code_embedding.apply(lambda x: cosine_similarity(x, embedding))\n",
"\n",
" res = df.sort_values('similarities', ascending=False).head(n)\n",
@@ -244,19 +244,20 @@
" print(\"\\n\".join(r[1].code.split(\"\\n\")[:n_lines]))\n",
" print('-'*70)\n",
" return res\n",
"res = search_functions(df, 'Completions API tests', n=3)\n"
"\n",
"res = search_functions(df, 'Completions API tests', n=3)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/openai/validators.py:format_inferrer_validator score=0.655\n",
"/openai/validators.py:format_inferrer_validator score=0.751\n",
"def format_inferrer_validator(df):\n",
" \"\"\"\n",
" This validator will infer the likely fine-tuning format of the data, and display it to the user if it is classification.\n",
@@ -265,23 +266,23 @@
" ft_type = infer_task_type(df)\n",
" immediate_msg = None\n",
"----------------------------------------------------------------------\n",
"/openai/validators.py:long_examples_validator score=0.649\n",
"def long_examples_validator(df):\n",
" \"\"\"\n",
" This validator will suggest to the user to remove examples that are too long.\n",
" \"\"\"\n",
" immediate_msg = None\n",
" optional_msg = None\n",
" optional_fn = None\n",
"/openai/validators.py:get_validators score=0.748\n",
"def get_validators():\n",
" return [\n",
" num_examples_validator,\n",
" lambda x: necessary_column_validator(x, \"prompt\"),\n",
" lambda x: necessary_column_validator(x, \"completion\"),\n",
" additional_column_validator,\n",
" non_empty_field_validator,\n",
"----------------------------------------------------------------------\n",
"/openai/validators.py:non_empty_completion_validator score=0.646\n",
"def non_empty_completion_validator(df):\n",
"/openai/validators.py:infer_task_type score=0.738\n",
"def infer_task_type(df):\n",
" \"\"\"\n",
" This validator will ensure that no completion is empty.\n",
" Infer the likely fine-tuning task type from the data\n",
" \"\"\"\n",
" necessary_msg = None\n",
" necessary_fn = None\n",
" immediate_msg = None\n",
" CLASSIFICATION_THRESHOLD = 3 # min_average instances of each class\n",
" if sum(df.prompt.str.len()) == 0:\n",
" return \"open-ended generation\"\n",
"----------------------------------------------------------------------\n"
]
}
@@ -292,14 +293,26 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/openai/validators.py:common_completion_suffix_validator score=0.665\n",
"/openai/validators.py:get_common_xfix score=0.793\n",
"def get_common_xfix(series, xfix=\"suffix\"):\n",
" \"\"\"\n",
" Finds the longest common suffix or prefix of all the values in a series\n",
" \"\"\"\n",
" common_xfix = \"\"\n",
" while True:\n",
" common_xfixes = (\n",
" series.str[-(len(common_xfix) + 1) :]\n",
" if xfix == \"suffix\"\n",
" else series.str[: len(common_xfix) + 1]\n",
"----------------------------------------------------------------------\n",
"/openai/validators.py:common_completion_suffix_validator score=0.778\n",
"def common_completion_suffix_validator(df):\n",
" \"\"\"\n",
" This validator will suggest to add a common suffix to the completion if one doesn't already exist in case of classification or conditional generation.\n",
@@ -310,18 +323,6 @@
" optional_fn = None\n",
"\n",
" ft_type = infer_task_type(df)\n",
"----------------------------------------------------------------------\n",
"/openai/validators.py:get_outfnames score=0.66\n",
"def get_outfnames(fname, split):\n",
" suffixes = [\"_train\", \"_valid\"] if split else [\"\"]\n",
" i = 0\n",
" while True:\n",
" index_suffix = f\" ({i})\" if i > 0 else \"\"\n",
" candidate_fnames = [\n",
" fname.split(\".\")[0] + \"_prepared\" + suffix + index_suffix + \".jsonl\"\n",
" for suffix in suffixes\n",
" ]\n",
" if not any(os.path.isfile(f) for f in candidate_fnames):\n",
"----------------------------------------------------------------------\n"
]
}
@@ -332,14 +333,14 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/openai/cli.py:tools_register score=0.651\n",
"/openai/cli.py:tools_register score=0.773\n",
"def tools_register(parser):\n",
" subparsers = parser.add_subparsers(\n",
" title=\"Tools\", help=\"Convenience client side tools\"\n",
@@ -374,8 +375,9 @@
"hash": "be4b5d5b73a21c599de40d6deb1129796d12dc1cc33a738f7bac13269cfcafe8"
},
"kernelspec": {
"display_name": "Python 3.7.3 64-bit ('base': conda)",
"name": "python3"
"display_name": "openai-cookbook",
"language": "python",
"name": "openai-cookbook"
},
"language_info": {
"codemirror_mode": {
@@ -387,7 +389,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
"version": "3.9.6"
},
"orig_nbformat": 4
},

@@ -17,7 +17,7 @@
{
"data": {
"text/plain": [
"12288"
"1536"
]
},
"execution_count": 1,
@@ -29,8 +29,8 @@
"import openai\n",
"\n",
"embedding = openai.Embedding.create(\n",
" input=\"Sample document text goes here\",\n",
" engine=\"text-similarity-davinci-001\"\n",
" input=\"Your text goes here\",\n",
" engine=\"text-embedding-ada-002\"\n",
")[\"data\"][0][\"embedding\"]\n",
"len(embedding)\n"
]
@@ -44,7 +44,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"1024\n"
"1536\n"
]
}
],
@@ -54,7 +54,7 @@
"\n",
"\n",
"@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))\n",
"def get_embedding(text: str, engine=\"text-similarity-davinci-001\") -> list[float]:\n",
"def get_embedding(text: str, engine=\"text-embedding-ada-002\") -> list[float]:\n",
"\n",
" # replace newlines, which can negatively affect performance.\n",
" text = text.replace(\"\\n\", \" \")\n",
@@ -62,25 +62,7 @@
" return openai.Embedding.create(input=[text], engine=engine)[\"data\"][0][\"embedding\"]\n",
"\n",
"\n",
"embedding = get_embedding(\"Sample query text goes here\", engine=\"text-search-ada-query-001\")\n",
"print(len(embedding))\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1024\n"
]
}
],
"source": [
"embedding = get_embedding(\"Sample document text goes here\", engine=\"text-search-ada-doc-001\")\n",
"embedding = get_embedding(\"Your text goes here\", engine=\"text-embedding-ada-002\")\n",
"print(len(embedding))\n"
]
}

@@ -11,6 +11,14 @@
"We will combine the review summary and review text into a single combined text. The model will encode this combined text and it will output a single vector embedding."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To run this notebook, you will need to install: pandas, openai, transformers, plotly, matplotlib, scikit-learn, torch (transformer dep), torchvision, and scipy."
]
},
{
"cell_type": "code",
"execution_count": 1,
@@ -131,7 +139,7 @@
"\n",
"# remove reviews that are too long\n",
"df['n_tokens'] = df.combined.apply(lambda x: len(tokenizer.encode(x)))\n",
"df = df[df.n_tokens<2000].tail(1_000)\n",
"df = df[df.n_tokens<8000].tail(1_000)\n",
"len(df)"
]
},
@@ -148,20 +156,22 @@
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"from openai.embeddings_utils import get_embedding\n",
"# Ensure you have your API key set in your environment per the README: https://github.com/openai/openai-python#usage\n",
"\n",
"# This will take just under 10 minutes\n",
"df['babbage_similarity'] = df.combined.apply(lambda x: get_embedding(x, engine='text-similarity-babbage-001'))\n",
"df['babbage_search'] = df.combined.apply(lambda x: get_embedding(x, engine='text-search-babbage-doc-001'))\n",
"# This will take just between 5 and 10 minutes\n",
"df['ada_similarity'] = df.combined.apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))\n",
"df['ada_search'] = df.combined.apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))\n",
"df.to_csv('data/fine_food_reviews_with_embeddings_1k.csv')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.9 ('openai')",
"display_name": "openai-cookbook",
"language": "python",
"name": "python3"
"name": "openai-cookbook"
},
"language_info": {
"codemirror_mode": {
@@ -173,12 +183,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
"version": "3.9.6"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},

File diff suppressed because it is too large

@@ -20,7 +20,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Babbage similarity embedding performance on 1k Amazon reviews: mse=0.39, mae=0.38\n"
"Ada similarity embedding performance on 1k Amazon reviews: mse=0.60, mae=0.51\n"
]
}
],
@@ -32,11 +32,13 @@
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import mean_squared_error, mean_absolute_error\n",
"\n",
"datafile_path = \"https://cdn.openai.com/API/examples/data/fine_food_reviews_with_embeddings_1k.csv\" # for your convenience, we precomputed the embeddings\n",
"# If you have not run the \"Obtain_dataset.ipynb\" notebook, you can download the datafile from here: https://cdn.openai.com/API/examples/data/fine_food_reviews_with_embeddings_1k.csv\n",
"datafile_path = \"./data/fine_food_reviews_with_embeddings_1k.csv\"\n",
"\n",
"df = pd.read_csv(datafile_path)\n",
"df[\"babbage_similarity\"] = df.babbage_similarity.apply(eval).apply(np.array)\n",
"df[\"ada_similarity\"] = df.ada_similarity.apply(eval).apply(np.array)\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(list(df.babbage_similarity.values), df.Score, test_size=0.2, random_state=42)\n",
"X_train, X_test, y_train, y_test = train_test_split(list(df.ada_similarity.values), df.Score, test_size=0.2, random_state=42)\n",
"\n",
"rfr = RandomForestRegressor(n_estimators=100)\n",
"rfr.fit(X_train, y_train)\n",
@@ -45,7 +47,7 @@
"mse = mean_squared_error(y_test, preds)\n",
"mae = mean_absolute_error(y_test, preds)\n",
"\n",
"print(f\"Babbage similarity embedding performance on 1k Amazon reviews: mse={mse:.2f}, mae={mae:.2f}\")\n"
"print(f\"Ada similarity embedding performance on 1k Amazon reviews: mse={mse:.2f}, mae={mae:.2f}\")\n"
]
},
{
@@ -57,7 +59,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Dummy mean prediction performance on Amazon reviews: mse=1.81, mae=1.08\n"
"Dummy mean prediction performance on Amazon reviews: mse=1.73, mae=1.03\n"
]
}
],
@@ -70,10 +72,11 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the embeddings are able to predict the scores with an average error of 0.39 per score prediction. This is roughly equivalent to predicting 2 out of 3 reviews perfectly, and 1 out of three reviews by a one star error."
"We can see that the embeddings are able to predict the scores with an average error of 0.60 per score prediction. This is roughly equivalent to predicting 1 out of 3 reviews perfectly, and 1 out of two reviews by a one star error."
]
},
{
@@ -86,9 +89,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.9 ('openai')",
"display_name": "openai-cookbook",
"language": "python",
"name": "python3"
"name": "openai-cookbook"
},
"language_info": {
"codemirror_mode": {
@@ -100,7 +103,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
"version": "3.9.6"
},
"orig_nbformat": 4,
"vscode": {

@@ -18,9 +18,11 @@
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"datafile_path = \"https://cdn.openai.com/API/examples/data/fine_food_reviews_with_embeddings_1k.csv\" # for your convenience, we precomputed the embeddings\n",
"# If you have not run the \"Obtain_dataset.ipynb\" notebook, you can download the datafile from here: https://cdn.openai.com/API/examples/data/fine_food_reviews_with_embeddings_1k.csv\n",
"datafile_path = \"./data/fine_food_reviews_with_embeddings_1k.csv\"\n",
"\n",
"df = pd.read_csv(datafile_path)\n",
"df[\"babbage_search\"] = df.babbage_search.apply(eval).apply(np.array)\n"
"df[\"ada_search\"] = df.ada_search.apply(eval).apply(np.array)\n"
]
},
{
@@ -39,7 +41,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Fantastic Instant Refried beans: Fantastic Instant Refried Beans have been a staple for my family now for nearly 20 years. All 7 of us love it and my grown kids are passing on the tradition.\n",
"Good Buy: I liked the beans. They were vacuum sealed, plump and moist. Would recommend them for any use. I personally split and stuck them in some vodka to make vanilla extract. Yum!\n",
"\n",
"Jamaican Blue beans: Excellent coffee bean for roasting. Our family just purchased another 5 pounds for more roasting. Plenty of flavor and mild on acidity when roasted to a dark brown bean and befor\n",
"\n",
@@ -55,9 +57,9 @@
"def search_reviews(df, product_description, n=3, pprint=True):\n",
" embedding = get_embedding(\n",
" product_description,\n",
" engine=\"text-search-babbage-query-001\"\n",
" engine=\"text-embedding-ada-002\"\n",
" )\n",
" df[\"similarities\"] = df.babbage_search.apply(lambda x: cosine_similarity(x, embedding))\n",
" df[\"similarities\"] = df.ada_search.apply(lambda x: cosine_similarity(x, embedding))\n",
"\n",
" res = (\n",
" df.sort_values(\"similarities\", ascending=False)\n",
@@ -84,17 +86,17 @@
"name": "stdout",
"output_type": "stream",
"text": [
"sooo good: tastes so good. Worth the money. My boyfriend hates wheat pasta and LOVES this. cooks fast tastes great.I love this brand and started buying more of their pastas. Bulk is best.\n",
"\n",
"Tasty and Quick Pasta: Barilla Whole Grain Fusilli with Vegetable Marinara is tasty and has an excellent chunky vegetable marinara. I just wish there was more of it. If you aren't starving or on a \n",
"\n",
"Rustichella ROCKS!: Anything this company makes is worthwhile eating! My favorite is their Trenne.<br />Their whole wheat pasta is the best I have ever had.\n",
"sooo good: tastes so good. Worth the money. My boyfriend hates wheat pasta and LOVES this. cooks fast tastes great.I love this brand and started buying more of their pastas. Bulk is best.\n",
"\n",
"Handy: Love the idea of ready in a minute pasta and for that alone this product gets praise. The pasta is whole grain so that's a big plus and it actually comes out al dente. The vegetable marinara\n",
"\n"
]
}
],
"source": [
"res = search_reviews(df, \"whole wheat pasta\", n=3)\n"
"res = search_reviews(df, \"whole wheat pasta\", n=3)"
]
},
{
@@ -119,7 +121,7 @@
}
],
"source": [
"res = search_reviews(df, \"bad delivery\", n=1)\n"
"res = search_reviews(df, \"bad delivery\", n=1)"
]
},
{
@@ -144,7 +146,7 @@
}
],
"source": [
"res = search_reviews(df, \"spoilt\", n=1)\n"
"res = search_reviews(df, \"spoilt\", n=1)"
]
},
{
@@ -158,21 +160,21 @@
"text": [
"Good food: The only dry food my queen cat will eat. Helps prevent hair balls. Good packaging. Arrives promptly. Recommended by a friend who sells pet food.\n",
"\n",
"Good product: I like that this is a better product for my pets but really for the price of it I couldn't afford to buy this all the time. My cat isn't very picky usually and she ate this, we usually \n",
"The cats like it: My 7 cats like this food but it is a little yucky for the human. Pieces of mackerel swimming in a dark broth. It is billed as a \"complete\" food and contains carrots, peas and pasta.\n",
"\n"
]
}
],
"source": [
"res = search_reviews(df, \"pet food\", n=2)\n"
"res = search_reviews(df, \"pet food\", n=2)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.9 ('openai')",
"display_name": "openai-cookbook",
"language": "python",
"name": "python3"
"name": "openai-cookbook"
},
"language_info": {
"codemirror_mode": {
@@ -184,12 +186,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
"version": "3.9.6"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long