|
|
@ -0,0 +1,545 @@
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cells": [
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"# Semantic search using OpenSearch and OpenAI\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"This notebook guides you through an example of using [Aiven for OpenSearch](https://go.aiven.io/openai-opensearch-os) as a backend vector database for OpenAI embeddings and how to perform semantic search.\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"## Why using OpenSearch as backend vector database\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"OpenSearch is a widely adopted open source search/analytics engine. It allows to store, query and transform documents in a variety of shapes and provides fast and scalable functionalities to perform both accurate and [fuzzy text search](https://opensearch.org/docs/latest/query-dsl/term/fuzzy/). Using OpenSearch as vector database enables you to mix and match semantic and text search queries on top of a performant and scalable engine.\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"## Prerequisites\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"Before you begin, ensure to follow the prerequisites:\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"1. An [Aiven Account](https://go.aiven.io/openai-opensearch-signup). You can create an account and start a free trial with Aiven by navigating to the [signup page](https://go.aiven.io/openai-opensearch-signup) and creating a user.\n",
|
|
|
|
|
|
|
|
"2. An [Aiven for OpenSearch service](https://go.aiven.io/openai-opensearch-os). You can spin up an Aiven for OpenSearch service in minutes in the [Aiven Console](https://go.aiven.io/openai-opensearch-console) with the following steps \n",
|
|
|
|
|
|
|
|
" * Click on **Create service**\n",
|
|
|
|
|
|
|
|
" * Select **OpenSearch**\n",
|
|
|
|
|
|
|
|
" * Choose the **Cloud Provider and Region**\n",
|
|
|
|
|
|
|
|
" * Select the **Service plan** (the `hobbyist` plan is enough for the notebook)\n",
|
|
|
|
|
|
|
|
" * Provide the **Service name**\n",
|
|
|
|
|
|
|
|
" * Click on **Create service**\n",
|
|
|
|
|
|
|
|
"3. The OpenSearch **Connection String**. The connection string is visible as **Service URI** in the Aiven for OpenSearch service overview page.\n",
|
|
|
|
|
|
|
|
"4. Your [OpenAI API key](https://platform.openai.com/account/api-keys)\n",
|
|
|
|
|
|
|
|
"5. Python and `pip`.\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"## Installing dependencies\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"The notebook requires the following packages:\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"* `openai`\n",
|
|
|
|
|
|
|
|
"* `pandas`\n",
|
|
|
|
|
|
|
|
"* `wget`\n",
|
|
|
|
|
|
|
|
"* `python-dotenv`\n",
|
|
|
|
|
|
|
|
"* `opensearch-py`\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"You can install the above packages with:"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
|
|
|
"metadata": {
|
|
|
|
|
|
|
|
"vscode": {
|
|
|
|
|
|
|
|
"languageId": "shellscript"
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
"outputs": [],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"pip install openai pandas wget python-dotenv opensearch-py"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"## OpenAI key settings\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"We'll use OpenAI to create embeddings starting from a set of documents, therefore an OpenAI API key is needed. You can get one from the [OpenAI API Key page](https://platform.openai.com/account/api-keys) after logging in.\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"To avoid leaking the OpenAI key, you can store it as an environment variable named `OPENAI_API_KEY`. \n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
 For more information">
"> For more information on how to perform the same task across other operating systems, refer to [Best Practices for API Key Safety](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety).\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"To store safely the information, create a `.env` file in the same folder where the notebook is located and add the following line, replacing the `<INSERT_YOUR_API_KEY_HERE>` with your OpenAI API Key.\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"```bash\n",
|
|
|
|
|
|
|
|
"OPENAI_API_KEY=<INSERT_YOUR_API_KEY_HERE>\n",
|
|
|
|
|
|
|
|
"```\n"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"## Connect to Aiven for OpenSearch\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"Once the Aiven for OpenSearch service is in `RUNNING` state, we can retrieve the connection string from the Aiven for Opensearch service page, by copying the **Service URI** parameter. We can store it in the same `.env` file created above, after replacing the `https://USER:PASSWORD@HOST:PORT` string with the Service URI.\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"```bash\n",
|
|
|
|
|
|
|
|
"OPENSEARCH_URI=https://USER:PASSWORD@HOST:PORT\n",
|
|
|
|
|
|
|
|
"```"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
" We can now connect to Aiven for OpenSearch with:"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
|
|
|
"metadata": {
|
|
|
|
|
|
|
|
"vscode": {
|
|
|
|
|
|
|
|
"languageId": "shellscript"
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
"outputs": [],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"import os\n",
|
|
|
|
|
|
|
|
"from opensearchpy import OpenSearch\n",
|
|
|
|
|
|
|
|
"from dotenv import load_dotenv\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"# Load environment variables from .env file\n",
|
|
|
|
|
|
|
|
"load_dotenv()\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"connection_string = os.getenv(\"OPENSEARCH_URI\")\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"# Create the client with SSL/TLS enabled, but hostname verification disabled.\n",
|
|
|
|
|
|
|
|
"client = OpenSearch(connection_string, use_ssl=True, timeout=100)\n"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
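{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (not required for the rest of the notebook), we can verify the connection by asking the cluster for its basic information:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Retrieve and print basic cluster information to verify the connection\n",
"print(client.info())"
]
},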
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"## Download the dataset\n",
|
|
|
|
|
|
|
|
"To save us from having to recalculate embeddings on a huge dataset, we are using a pre-calculated OpenAI embeddings dataset covering wikipedia articles. We can get the file and unzip it with:"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"import wget\n",
|
|
|
|
|
|
|
|
"import zipfile\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n",
|
|
|
|
|
|
|
|
"wget.download(embeddings_url)\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\",\n",
|
|
|
|
|
|
|
|
"\"r\") as zip_ref:\n",
|
|
|
|
|
|
|
|
" zip_ref.extractall(\"data\")"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"Let's load the file in a dataframe and check the content with:"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"import pandas as pd\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"wikipedia_dataframe = pd.read_csv(\"data/vector_database_wikipedia_articles_embedded.csv\")\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"wikipedia_dataframe.head()"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"The file contains:\n",
|
|
|
|
|
|
|
|
"* `id` a unique Wikipedia article identifier\n",
|
|
|
|
|
|
|
|
"* `url` the Wikipedia article URL\n",
|
|
|
|
|
|
|
|
"* `title` the title of the Wikipedia page\n",
|
|
|
|
|
|
|
|
"* `text` the text of the article\n",
|
|
|
|
|
|
|
|
"* `title_vector` and `content_vector` the embedding calculated on the title and content of the wikipedia article respectively\n",
|
|
|
|
|
|
|
|
"* `vector_id` the id of the vector\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"We can create an OpenSearch mapping optimized for the storage of these information with:"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"index_settings ={\n",
|
|
|
|
|
|
|
|
" \"index\": {\n",
|
|
|
|
|
|
|
|
" \"knn\": True,\n",
|
|
|
|
|
|
|
|
" \"knn.algo_param.ef_search\": 100\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"index_mapping= {\n",
|
|
|
|
|
|
|
|
" \"properties\": {\n",
|
|
|
|
|
|
|
|
" \"title_vector\": {\n",
|
|
|
|
|
|
|
|
" \"type\": \"knn_vector\",\n",
|
|
|
|
|
|
|
|
" \"dimension\": 1536,\n",
|
|
|
|
|
|
|
|
" \"method\": {\n",
|
|
|
|
|
|
|
|
" \"name\": \"hnsw\",\n",
|
|
|
|
|
|
|
|
" \"space_type\": \"l2\",\n",
|
|
|
|
|
|
|
|
" \"engine\": \"faiss\"\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
" },\n",
|
|
|
|
|
|
|
|
" \"content_vector\": {\n",
|
|
|
|
|
|
|
|
" \"type\": \"knn_vector\",\n",
|
|
|
|
|
|
|
|
" \"dimension\": 1536,\n",
|
|
|
|
|
|
|
|
" \"method\": {\n",
|
|
|
|
|
|
|
|
" \"name\": \"hnsw\",\n",
|
|
|
|
|
|
|
|
" \"space_type\": \"l2\",\n",
|
|
|
|
|
|
|
|
" \"engine\": \"faiss\"\n",
|
|
|
|
|
|
|
|
" },\n",
|
|
|
|
|
|
|
|
" },\n",
|
|
|
|
|
|
|
|
" \"text\": {\"type\": \"text\"},\n",
|
|
|
|
|
|
|
|
" \"title\": {\"type\": \"text\"},\n",
|
|
|
|
|
|
|
|
" \"url\": { \"type\": \"keyword\"},\n",
|
|
|
|
|
|
|
|
" \"vector_id\": {\"type\": \"long\"}\n",
|
|
|
|
|
|
|
|
" \n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
"}"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"And create an index in Aiven for OpenSearch with:"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"index_name = \"openai_wikipedia_index\"\n",
|
|
|
|
|
|
|
|
"client.indices.create(index=index_name, body={\"settings\": index_settings, \"mappings\":index_mapping})"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"## Index data into OpenSearch\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"Now it's time to parse the the pandas dataframe and index the data into OpenSearch using Bulk APIs. The following function indexes a set of rows in the dataframe:"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"def dataframe_to_bulk_actions(df):\n",
|
|
|
|
|
|
|
|
" for index, row in df.iterrows():\n",
|
|
|
|
|
|
|
|
" yield {\n",
|
|
|
|
|
|
|
|
" \"_index\": index_name,\n",
|
|
|
|
|
|
|
|
" \"_id\": row['id'],\n",
|
|
|
|
|
|
|
|
" \"_source\": {\n",
|
|
|
|
|
|
|
|
" 'url' : row[\"url\"],\n",
|
|
|
|
|
|
|
|
" 'title' : row[\"title\"],\n",
|
|
|
|
|
|
|
|
" 'text' : row[\"text\"],\n",
|
|
|
|
|
|
|
|
" 'title_vector' : json.loads(row[\"title_vector\"]),\n",
|
|
|
|
|
|
|
|
" 'content_vector' : json.loads(row[\"content_vector\"]),\n",
|
|
|
|
|
|
|
|
" 'vector_id' : row[\"vector_id\"]\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
" }"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"We don't want to index all the dataset at once, since it's way too large, so we'll load it in batches of `200` rows."
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"from opensearchpy import helpers\n",
|
|
|
|
|
|
|
|
"import json\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"start = 0\n",
|
|
|
|
|
|
|
|
"end = len(wikipedia_dataframe)\n",
|
|
|
|
|
|
|
|
"batch_size = 200\n",
|
|
|
|
|
|
|
|
"for batch_start in range(start, end, batch_size):\n",
|
|
|
|
|
|
|
|
" batch_end = min(batch_start + batch_size, end)\n",
|
|
|
|
|
|
|
|
" batch_dataframe = wikipedia_dataframe.iloc[batch_start:batch_end]\n",
|
|
|
|
|
|
|
|
" actions = dataframe_to_bulk_actions(batch_dataframe)\n",
|
|
|
|
|
|
|
|
" helpers.bulk(client, actions)"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
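{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional check, we can count the documents in the index to confirm the bulk load succeeded. A refresh makes the freshly indexed documents visible to the count:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Refresh the index so all indexed documents are visible, then count them\n",
"client.indices.refresh(index=index_name)\n",
"print(client.count(index=index_name)[\"count\"])"
]
},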
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"Once all the documents are indexed, let's try a query to retrieve the documents containing `Pizza`:"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"res = client.search(index=index_name, body={\n",
|
|
|
|
|
|
|
|
" \"_source\": {\n",
|
|
|
|
|
|
|
|
" \"excludes\": [\"title_vector\", \"content_vector\"]\n",
|
|
|
|
|
|
|
|
" },\n",
|
|
|
|
|
|
|
|
" \"query\": {\n",
|
|
|
|
|
|
|
|
" \"match\": {\n",
|
|
|
|
|
|
|
|
" \"text\": {\n",
|
|
|
|
|
|
|
|
" \"query\": \"Pizza\"\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
"})\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"print(res[\"hits\"][\"hits\"][0][\"_source\"][\"text\"])"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"# Encode questions with OpenAI\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"To perform a semantic search, we need to calculate questions encodings with the same embedding model used to encode the documents at index time. In this example, we need to use the `text-embedding-3-small` model."
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"from openai import OpenAI\n",
|
|
|
|
|
|
|
|
"import os\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"# Define model\n",
|
|
|
|
|
|
|
|
"EMBEDDING_MODEL = \"text-embedding-ada-002\"\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"# Define the Client\n",
|
|
|
|
|
|
|
|
"openaiclient = OpenAI(\n",
|
|
|
|
|
|
|
|
" # This is the default and can be omitted\n",
|
|
|
|
|
|
|
|
" api_key=os.getenv(\"OPENAI_API_KEY\"),\n",
|
|
|
|
|
|
|
|
")\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"# Define question\n",
|
|
|
|
|
|
|
|
"question = 'is Pineapple a good ingredient for Pizza?'\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"# Create embedding\n",
|
|
|
|
|
|
|
|
"question_embedding = openaiclient.embeddings.create(input=question, model=EMBEDDING_MODEL)"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
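{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional check, we can confirm that the length of the returned embedding matches the `dimension` of `1536` defined in the index mapping:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The embedding length must match the knn_vector dimension in the mapping\n",
"print(len(question_embedding.data[0].embedding))"
]
},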
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"# Run semantic search queries with OpenSearch\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"With the above embedding calculated, we can now run semantic searches against the OpenSearch index. We're using `knn` as query type and scan the content of the `content_vector` field"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": 21,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"name": "stdout",
|
|
|
|
|
|
|
|
"output_type": "stream",
|
|
|
|
|
|
|
|
"text": [
|
|
|
|
|
|
|
|
"Id:66079\n",
|
|
|
|
|
|
|
|
"Score: 0.71338785\n",
|
|
|
|
|
|
|
|
"Title: Pizza Pizza\n",
|
|
|
|
|
|
|
|
"Text: Pizza Pizza Limited (PPL), doing business as Pizza Pizza (), is a franchised Canadian pizza fast foo\n",
|
|
|
|
|
|
|
|
"Id:15719\n",
|
|
|
|
|
|
|
|
"Score: 0.7115042\n",
|
|
|
|
|
|
|
|
"Title: Pineapple\n",
|
|
|
|
|
|
|
|
"Text: The pineapple is a fruit. It is native to South America, Central America and the Caribbean. The word\n",
|
|
|
|
|
|
|
|
"Id:13967\n",
|
|
|
|
|
|
|
|
"Score: 0.7106797\n",
|
|
|
|
|
|
|
|
"Title: Pizza\n",
|
|
|
|
|
|
|
|
"Text: Pizza is an Italian food that was created in Italy (The Naples area). It is made with different topp\n",
|
|
|
|
|
|
|
|
"Id:13968\n",
|
|
|
|
|
|
|
|
"Score: 0.69487476\n",
|
|
|
|
|
|
|
|
"Title: Pepperoni\n",
|
|
|
|
|
|
|
|
"Text: Pepperoni is a meat food that is sometimes sliced thin and put on pizza. It is a kind of salami, whi\n",
|
|
|
|
|
|
|
|
"Id:40989\n",
|
|
|
|
|
|
|
|
"Score: 0.6696015\n",
|
|
|
|
|
|
|
|
"Title: Coprophagia\n",
|
|
|
|
|
|
|
|
"Text: Coprophagia is the eating of faeces. Many animals eat faeces, either their own or that of other anim\n",
|
|
|
|
|
|
|
|
"Id:90918\n",
|
|
|
|
|
|
|
|
"Score: 0.66611433\n",
|
|
|
|
|
|
|
|
"Title: Pizza Hut\n",
|
|
|
|
|
|
|
|
"Text: Pizza Hut is an American pizza restaurant, or pizza parlor. Pizza Hut also serves salads, pastas and\n",
|
|
|
|
|
|
|
|
"Id:433\n",
|
|
|
|
|
|
|
|
"Score: 0.66609937\n",
|
|
|
|
|
|
|
|
"Title: Lanai\n",
|
|
|
|
|
|
|
|
"Text: Lanai (or Lānaʻi) is sixth largest of the Hawaiian Islands, in the United States. It is also known a\n",
|
|
|
|
|
|
|
|
"Id:45877\n",
|
|
|
|
|
|
|
|
"Score: 0.66580874\n",
|
|
|
|
|
|
|
|
"Title: Papaya\n",
|
|
|
|
|
|
|
|
"Text: Papaya is a tall herbaceous plant in the genus Carica; its edible fruit is also called papaya. It is\n",
|
|
|
|
|
|
|
|
"Id:41467\n",
|
|
|
|
|
|
|
|
"Score: 0.6646078\n",
|
|
|
|
|
|
|
|
"Title: Te Puke\n",
|
|
|
|
|
|
|
|
"Text: Te Puke is a small town in the Bay of Plenty in New Zealand. 6670 people live there. It is famous fo\n",
|
|
|
|
|
|
|
|
"Id:31270\n",
|
|
|
|
|
|
|
|
"Score: 0.65891963\n",
|
|
|
|
|
|
|
|
"Title: Afelia\n",
|
|
|
|
|
|
|
|
"Text: Afelia is a Greek food. It is popular in the island nation of Cyprus. Afelia is made from pork, red \n",
|
|
|
|
|
|
|
|
"Id:61037\n",
|
|
|
|
|
|
|
|
"Score: 0.6569093\n",
|
|
|
|
|
|
|
|
"Title: Dough\n",
|
|
|
|
|
|
|
|
"Text: Dough is a thick, malleable and sometimes elastic paste made out of flour by mixing it with a small \n",
|
|
|
|
|
|
|
|
"Id:76670\n",
|
|
|
|
|
|
|
|
"Score: 0.6560743\n",
|
|
|
|
|
|
|
|
"Title: Lycopene\n",
|
|
|
|
|
|
|
|
"Text: Lycopene is the pigment of tomato. Its chemical formula is (6E,8E,10E,12E,14E,16E,18E,20E,22E,24E,26\n",
|
|
|
|
|
|
|
|
"Id:32248\n",
|
|
|
|
|
|
|
|
"Score: 0.653606\n",
|
|
|
|
|
|
|
|
"Title: Pie\n",
|
|
|
|
|
|
|
|
"Text: A pie is a baked food that is made from pastry crust with or without a pastry top. The common filli\n",
|
|
|
|
|
|
|
|
"Id:79026\n",
|
|
|
|
|
|
|
|
"Score: 0.65358526\n",
|
|
|
|
|
|
|
|
"Title: Pectin\n",
|
|
|
|
|
|
|
|
"Text: Pectin is a food supplement. It is a source of dietary fiber. It is used to make jellies and jams. U\n",
|
|
|
|
|
|
|
|
"Id:63962\n",
|
|
|
|
|
|
|
|
"Score: 0.6528203\n",
|
|
|
|
|
|
|
|
"Title: Sprite\n",
|
|
|
|
|
|
|
|
"Text: Sprite is a lemon-lime soda, similar to 7 UP and Sierra Mist. It is made by the Coca-Cola Company. I\n"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"response = client.search(\n",
|
|
|
|
|
|
|
|
" index = index_name,\n",
|
|
|
|
|
|
|
|
" body = {\n",
|
|
|
|
|
|
|
|
" \"size\": 15,\n",
|
|
|
|
|
|
|
|
" \"query\" : {\n",
|
|
|
|
|
|
|
|
" \"knn\" : {\n",
|
|
|
|
|
|
|
|
" \"content_vector\":{\n",
|
|
|
|
|
|
|
|
" \"vector\": question_embedding.data[0].embedding,\n",
|
|
|
|
|
|
|
|
" \"k\": 3\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
")\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"for result in response[\"hits\"][\"hits\"]:\n",
|
|
|
|
|
|
|
|
" print(\"Id:\" + str(result['_id']))\n",
|
|
|
|
|
|
|
|
" print(\"Score: \" + str(result[\"_score\"]))\n",
|
|
|
|
|
|
|
|
" print(\"Title: \" + str(result[\"_source\"][\"title\"]))\n",
|
|
|
|
|
|
|
|
" print(\"Text: \" + result[\"_source\"][\"text\"][0:100])\n"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"## Use OpenAI Chat Completions API to generate a reply\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"The step above retrieves the content semantically similar to the question, now let's use OpenAI chat `completions` to generate a reply based on the information retrieved."
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": 22,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"name": "stdout",
|
|
|
|
|
|
|
|
"output_type": "stream",
|
|
|
|
|
|
|
|
"text": [
|
|
|
|
|
|
|
|
"------------------------------------------------------------\n",
|
|
|
|
|
|
|
|
"Pineapple is a contentious ingredient for pizza, and it is commonly used at Pizza Pizza Limited (PPL), a Canadian fast-food restaurant with locations throughout Ontario. The menu includes a variety of toppings such as pepperoni, pineapples, mushrooms, and other non-exotic produce. Pizza Pizza has been a staple in the area for over 30 years, with over 500 locations in Ontario and expanding across the nation, including recent openings in Montreal and British Columbia.\n",
|
|
|
|
|
|
|
|
"------------------------------------------------------------\n"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"# Retrieve the text of the first result in the above dataset\n",
|
|
|
|
|
|
|
|
"top_hit_summary = response['hits']['hits'][0]['_source']['text']\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"# Craft a reply\n",
|
|
|
|
|
|
|
|
"response = openaiclient.chat.completions.create(\n",
|
|
|
|
|
|
|
|
" model=\"gpt-3.5-turbo\",\n",
|
|
|
|
|
|
|
|
" messages=[\n",
|
|
|
|
|
|
|
|
" {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
|
|
|
|
|
|
|
|
" {\"role\": \"user\", \"content\": \"Answer the following question:\" \n",
|
|
|
|
|
|
|
|
" + question \n",
|
|
|
|
|
|
|
|
" + \"by using the following text:\" \n",
|
|
|
|
|
|
|
|
" + top_hit_summary\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
" ]\n",
|
|
|
|
|
|
|
|
" )\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"choices = response.choices\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"for choice in choices:\n",
|
|
|
|
|
|
|
|
" print(\"------------------------------------------------------------\")\n",
|
|
|
|
|
|
|
|
" print(choice.message.content)\n",
|
|
|
|
|
|
|
|
" print(\"------------------------------------------------------------\")\n"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"## Conclusion\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"OpenSearch is a powerful tool providing both text and vector search capabilities. Used alongside OpenAI APIs allows you to craft personalized AI applications able to augment the context based on semantic search.\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"You can try Aiven for OpenSearch, or any of the other Open Source tools, in the Aiven platform free trial by [signing up](https://go.aiven.io/openai-opensearch-signup)."
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
],
|
|
|
|
|
|
|
|
"metadata": {
|
|
|
|
|
|
|
|
"kernelspec": {
|
|
|
|
|
|
|
|
"display_name": "Python 3",
|
|
|
|
|
|
|
|
"language": "python",
|
|
|
|
|
|
|
|
"name": "python3"
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
"language_info": {
|
|
|
|
|
|
|
|
"codemirror_mode": {
|
|
|
|
|
|
|
|
"name": "ipython",
|
|
|
|
|
|
|
|
"version": 3
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
"file_extension": ".py",
|
|
|
|
|
|
|
|
"mimetype": "text/x-python",
|
|
|
|
|
|
|
|
"name": "python",
|
|
|
|
|
|
|
|
"nbconvert_exporter": "python",
|
|
|
|
|
|
|
|
"pygments_lexer": "ipython3",
|
|
|
|
|
|
|
|
"version": "3.12.2"
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
"nbformat": 4,
|
|
|
|
|
|
|
|
"nbformat_minor": 2
|
|
|
|
|
|
|
|
}
|