Added OpenSearch

Added OpenSearch to vector databases example
1 month ago · c8a0c91f21
parent dc0e64aedf
commit c8a0c91f21
4 changed files with 579 additions and 1 deletions
--- a/authors.yaml
+++ b/authors.yaml
@@ -93,3 +93,8 @@ royziv11:
  website: "https://www.linkedin.com/in/roy-ziv-a46001149/"
  avatar: "https://media.licdn.com/dms/image/D5603AQHkaEOOGZWtbA/profile-displayphoto-shrink_400_400/0/1699500606122?e=1716422400&v=beta&t=wKEIx-vTEqm9wnqoC7-xr1WqJjghvcjjlMt034hXY_4"

+
+ftisiot:
+  name: "Francesco Tisiot"
+  website: "https://ftisiot.net"
+  avatar: "https://ftisiot.net/images/ftisiot.png"
--- a/examples/vector_databases/opensearch/README.md
+++ b/examples/vector_databases/opensearch/README.md
@@ -0,0 +1,20 @@
+# OpenSearch
+
+OpenSearch is a popular open-source search/analytics engine and [vector database](https://opensearch.org/platform/search/vector-database.html).
+With OpenSearch you can efficiently store and query any kind of data, including your vector embeddings, at scale.
+
+[Aiven](https://go.aiven.io/openai-opensearch-aiven) provides a way to experience the best of open source data technologies, including OpenSearch, in a secure, well-integrated, scalable, and trustworthy data platform. [Aiven for OpenSearch](https://aiven.io/opensearch) lets you start using OpenSearch in minutes, on all major cloud vendors and regions, supported by a self-healing platform with a 99.99% SLA.
+
+For technical details, refer to the [OpenSearch documentation](https://opensearch.org/docs/latest/search-plugins/knn/index/).
+
+## OpenAI cookbook notebooks 📒
+
+Check out our notebooks in this repo for working with OpenAI, using OpenSearch as your vector database.
+
+### [Semantic search](aiven-opensearch-vector-search.ipynb)
+
+In this notebook you'll learn how to:
+
+ - Index the OpenAI Wikipedia embeddings dataset into OpenSearch
+ - Encode a question with the `text-embedding-ada-002` model
+ - Perform a semantic search
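The three steps listed in the README end in a k-NN query. As a rough sketch (not part of this commit), the request body for that search could be built with a small helper. The function name `build_knn_query` and its default values are assumptions made for illustration; the `knn` query shape and the `content_vector`, `title_vector` field names mirror the notebook added in this commit.

```python
def build_knn_query(vector, field="content_vector", k=3, size=15):
    """Return an OpenSearch k-NN search body (hypothetical helper)."""
    if not vector:
        raise ValueError("vector must be a non-empty list of floats")
    return {
        "size": size,
        # Leave the large embedding fields out of the returned documents.
        "_source": {"excludes": ["title_vector", "content_vector"]},
        "query": {"knn": {field: {"vector": list(vector), "k": k}}},
    }


# A tiny 3-dimensional vector stands in for a real 1536-dimensional embedding.
body = build_knn_query([0.1, 0.2, 0.3], k=2, size=5)
print(body["query"]["knn"]["content_vector"]["k"])
```

The resulting dictionary could then be passed as `body` to the `client.search(index=..., body=...)` call used in the notebook.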
--- a/examples/vector_databases/opensearch/aiven-opensearch-vector-search.ipynb
+++ b/examples/vector_databases/opensearch/aiven-opensearch-vector-search.ipynb
@@ -0,0 +1,545 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Semantic search using OpenSearch and OpenAI\n",
+    "\n",
+    "This notebook shows how to use [Aiven for OpenSearch](https://go.aiven.io/openai-opensearch-os) as a backend vector database for OpenAI embeddings and how to perform semantic search.\n",
+    "\n",
+    "\n",
+    "## Why use OpenSearch as a backend vector database\n",
+    "\n",
+    "OpenSearch is a widely adopted open source search/analytics engine. It lets you store, query, and transform documents in a variety of shapes, and it provides fast, scalable functionality for both exact and [fuzzy text search](https://opensearch.org/docs/latest/query-dsl/term/fuzzy/). Using OpenSearch as a vector database lets you mix and match semantic and text search queries on top of a performant and scalable engine.\n",
+    "\n",
+    "## Prerequisites\n",
+    "\n",
+    "Before you begin, make sure you have the following prerequisites:\n",
+    "\n",
+    "1. An [Aiven Account](https://go.aiven.io/openai-opensearch-signup). You can create an account and start a free trial with Aiven by navigating to the [signup page](https://go.aiven.io/openai-opensearch-signup) and creating a user.\n",
+    "2. An [Aiven for OpenSearch service](https://go.aiven.io/openai-opensearch-os). You can spin up an Aiven for OpenSearch service in minutes in the [Aiven Console](https://go.aiven.io/openai-opensearch-console) with the following steps:\n",
+    "    * Click on **Create service**\n",
+    "    * Select **OpenSearch**\n",
+    "    * Choose the **Cloud Provider and Region**\n",
+    "    * Select the **Service plan** (the `hobbyist` plan is enough for the notebook)\n",
+    "    * Provide the **Service name**\n",
+    "    * Click on **Create service**\n",
+    "3. The OpenSearch **Connection String**. The connection string is visible as **Service URI** in the Aiven for OpenSearch service overview page.\n",
+    "4. Your [OpenAI API key](https://platform.openai.com/account/api-keys)\n",
+    "5. Python and `pip`.\n",
+    "\n",
+    "## Installing dependencies\n",
+    "\n",
+    "The notebook requires the following packages:\n",
+    "\n",
+    "* `openai`\n",
+    "* `pandas`\n",
+    "* `wget`\n",
+    "* `python-dotenv`\n",
+    "* `opensearch-py`\n",
+    "\n",
+    "You can install the above packages with:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "shellscript"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "%pip install openai pandas wget python-dotenv opensearch-py"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## OpenAI key settings\n",
+    "\n",
+    "We'll use OpenAI to create embeddings starting from a set of documents, so an OpenAI API key is needed. You can get one from the [OpenAI API Key page](https://platform.openai.com/account/api-keys) after logging in.\n",
+    "\n",
+    "To avoid leaking the OpenAI key, you can store it as an environment variable named `OPENAI_API_KEY`. \n",
+    "\n",
+    "> For more information on how to perform the same task on other operating systems, refer to [Best Practices for API Key Safety](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety).\n",
+    "\n",
+    "To store the key safely, create a `.env` file in the same folder as the notebook and add the following line, replacing `<INSERT_YOUR_API_KEY_HERE>` with your OpenAI API key.\n",
+    "\n",
+    "```bash\n",
+    "OPENAI_API_KEY=<INSERT_YOUR_API_KEY_HERE>\n",
+    "```\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Connect to Aiven for OpenSearch\n",
+    "\n",
+    "Once the Aiven for OpenSearch service is in the `RUNNING` state, we can retrieve the connection string from the Aiven for OpenSearch service page by copying the **Service URI** parameter. We can store it in the same `.env` file created above, replacing the `https://USER:PASSWORD@HOST:PORT` string with the Service URI.\n",
+    "\n",
+    "```bash\n",
+    "OPENSEARCH_URI=https://USER:PASSWORD@HOST:PORT\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can now connect to Aiven for OpenSearch with:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "python"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from opensearchpy import OpenSearch\n",
+    "from dotenv import load_dotenv\n",
+    "\n",
+    "# Load environment variables from .env file\n",
+    "load_dotenv()\n",
+    "\n",
+    "connection_string = os.getenv(\"OPENSEARCH_URI\")\n",
+    "\n",
+    "# Create the client with SSL/TLS enabled; host, port and credentials come from the connection string.\n",
+    "client = OpenSearch(connection_string, use_ssl=True, timeout=100)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Download the dataset\n",
+    "To avoid recalculating embeddings on a huge dataset, we use a pre-calculated OpenAI embeddings dataset covering Wikipedia articles. We can download the file and unzip it with:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import wget\n",
+    "import zipfile\n",
+    "\n",
+    "embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n",
+    "wget.download(embeddings_url)\n",
+    "\n",
+    "with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\",\n",
+    "                     \"r\") as zip_ref:\n",
+    "    zip_ref.extractall(\"data\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's load the file in a dataframe and check the content with:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "wikipedia_dataframe = pd.read_csv(\"data/vector_database_wikipedia_articles_embedded.csv\")\n",
+    "\n",
+    "wikipedia_dataframe.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The file contains:\n",
+    "* `id` a unique Wikipedia article identifier\n",
+    "* `url` the Wikipedia article URL\n",
+    "* `title` the title of the Wikipedia page\n",
+    "* `text` the text of the article\n",
+    "* `title_vector` and `content_vector` the embeddings calculated on the title and content of the Wikipedia article, respectively\n",
+    "* `vector_id` the id of the vector\n",
+    "\n",
+    "We can create an OpenSearch mapping optimized for storing this information with:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "index_settings = {\n",
+    "    \"index\": {\n",
+    "      \"knn\": True,\n",
+    "      \"knn.algo_param.ef_search\": 100\n",
+    "    }\n",
+    "  }\n",
+    "\n",
+    "index_mapping = {\n",
+    "    \"properties\": {\n",
+    "      \"title_vector\": {\n",
+    "          \"type\": \"knn_vector\",\n",
+    "          \"dimension\": 1536,\n",
+    "          \"method\": {\n",
+    "            \"name\": \"hnsw\",\n",
+    "            \"space_type\": \"l2\",\n",
+    "            \"engine\": \"faiss\"\n",
+    "        }\n",
+    "      },\n",
+    "      \"content_vector\": {\n",
+    "          \"type\": \"knn_vector\",\n",
+    "          \"dimension\": 1536,\n",
+    "          \"method\": {\n",
+    "            \"name\": \"hnsw\",\n",
+    "            \"space_type\": \"l2\",\n",
+    "            \"engine\": \"faiss\"\n",
+    "        },\n",
+    "      },\n",
+    "      \"text\": {\"type\": \"text\"},\n",
+    "      \"title\": {\"type\": \"text\"},\n",
+    "      \"url\": { \"type\": \"keyword\"},\n",
+    "      \"vector_id\": {\"type\": \"long\"}\n",
+    "      \n",
+    "    }\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "And create an index in Aiven for OpenSearch with:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "index_name = \"openai_wikipedia_index\"\n",
+    "client.indices.create(index=index_name, body={\"settings\": index_settings, \"mappings\": index_mapping})"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Index data into OpenSearch\n",
+    "\n",
+    "Now it's time to parse the pandas dataframe and index the data into OpenSearch using the Bulk API. The following generator converts a set of dataframe rows into bulk index actions (it relies on the `json` module, which is imported in the next code cell):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def dataframe_to_bulk_actions(df):\n",
+    "    for index, row in df.iterrows():\n",
+    "        yield {\n",
+    "            \"_index\": index_name,\n",
+    "            \"_id\": row['id'],\n",
+    "            \"_source\": {\n",
+    "                'url' : row[\"url\"],\n",
+    "                'title' : row[\"title\"],\n",
+    "                'text' : row[\"text\"],\n",
+    "                'title_vector' : json.loads(row[\"title_vector\"]),\n",
+    "                'content_vector' : json.loads(row[\"content_vector\"]),\n",
+    "                'vector_id' : row[\"vector_id\"]\n",
+    "            }\n",
+    "        }"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We don't want to index the whole dataset at once, since it's too large, so we'll load it in batches of `200` rows."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from opensearchpy import helpers\n",
+    "import json\n",
+    "\n",
+    "start = 0\n",
+    "end = len(wikipedia_dataframe)\n",
+    "batch_size = 200\n",
+    "for batch_start in range(start, end, batch_size):\n",
+    "    batch_end = min(batch_start + batch_size, end)\n",
+    "    batch_dataframe = wikipedia_dataframe.iloc[batch_start:batch_end]\n",
+    "    actions = dataframe_to_bulk_actions(batch_dataframe)\n",
+    "    helpers.bulk(client, actions)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Once all the documents are indexed, let's try a query to retrieve the documents containing `Pizza`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "res = client.search(index=index_name, body={\n",
+    "    \"_source\": {\n",
+    "        \"excludes\": [\"title_vector\", \"content_vector\"]\n",
+    "    },\n",
+    "    \"query\": {\n",
+    "        \"match\": {\n",
+    "            \"text\": {\n",
+    "                \"query\": \"Pizza\"\n",
+    "            }\n",
+    "        }\n",
+    "    }\n",
+    "})\n",
+    "\n",
+    "print(res[\"hits\"][\"hits\"][0][\"_source\"][\"text\"])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Encode questions with OpenAI\n",
+    "\n",
+    "To perform a semantic search, we need to encode the question with the same embedding model used to encode the documents at index time. In this example, that is the `text-embedding-ada-002` model, matching the `EMBEDDING_MODEL` value set in the code."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from openai import OpenAI\n",
+    "import os\n",
+    "\n",
+    "# Define model\n",
+    "EMBEDDING_MODEL = \"text-embedding-ada-002\"\n",
+    "\n",
+    "# Define the Client\n",
+    "openaiclient = OpenAI(\n",
+    "    # This is the default and can be omitted\n",
+    "    api_key=os.getenv(\"OPENAI_API_KEY\"),\n",
+    ")\n",
+    "\n",
+    "# Define question\n",
+    "question = 'is Pineapple a good ingredient for Pizza?'\n",
+    "\n",
+    "# Create embedding\n",
+    "question_embedding = openaiclient.embeddings.create(input=question, model=EMBEDDING_MODEL)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Run semantic search queries with OpenSearch\n",
+    "\n",
+    "With the above embedding calculated, we can now run semantic searches against the OpenSearch index. We use `knn` as the query type and search the `content_vector` field."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Id:66079\n",
+      "Score: 0.71338785\n",
+      "Title: Pizza Pizza\n",
+      "Text: Pizza Pizza Limited (PPL), doing business as Pizza Pizza (), is a franchised Canadian pizza fast foo\n",
+      "Id:15719\n",
+      "Score: 0.7115042\n",
+      "Title: Pineapple\n",
+      "Text: The pineapple is a fruit. It is native to South America, Central America and the Caribbean. The word\n",
+      "Id:13967\n",
+      "Score: 0.7106797\n",
+      "Title: Pizza\n",
+      "Text: Pizza is an Italian food that was created in Italy (The Naples area). It is made with different topp\n",
+      "Id:13968\n",
+      "Score: 0.69487476\n",
+      "Title: Pepperoni\n",
+      "Text: Pepperoni is a meat food that is sometimes sliced thin and put on pizza. It is a kind of salami, whi\n",
+      "Id:40989\n",
+      "Score: 0.6696015\n",
+      "Title: Coprophagia\n",
+      "Text: Coprophagia is the eating of faeces. Many animals eat faeces, either their own or that of other anim\n",
+      "Id:90918\n",
+      "Score: 0.66611433\n",
+      "Title: Pizza Hut\n",
+      "Text: Pizza Hut is an American pizza restaurant, or pizza parlor. Pizza Hut also serves salads, pastas and\n",
+      "Id:433\n",
+      "Score: 0.66609937\n",
+      "Title: Lanai\n",
+      "Text: Lanai (or Lānaʻi) is sixth largest of the Hawaiian Islands, in the United States. It is also known a\n",
+      "Id:45877\n",
+      "Score: 0.66580874\n",
+      "Title: Papaya\n",
+      "Text: Papaya is a tall herbaceous plant in the genus Carica; its edible fruit is also called papaya. It is\n",
+      "Id:41467\n",
+      "Score: 0.6646078\n",
+      "Title: Te Puke\n",
+      "Text: Te Puke is a small town in the Bay of Plenty in New Zealand. 6670 people live there. It is famous fo\n",
+      "Id:31270\n",
+      "Score: 0.65891963\n",
+      "Title: Afelia\n",
+      "Text: Afelia is a Greek food. It is popular in the island nation of Cyprus. Afelia is made from pork, red \n",
+      "Id:61037\n",
+      "Score: 0.6569093\n",
+      "Title: Dough\n",
+      "Text: Dough is a thick, malleable and sometimes elastic paste made out of flour by mixing it with a small \n",
+      "Id:76670\n",
+      "Score: 0.6560743\n",
+      "Title: Lycopene\n",
+      "Text: Lycopene is the pigment of tomato. Its chemical formula is (6E,8E,10E,12E,14E,16E,18E,20E,22E,24E,26\n",
+      "Id:32248\n",
+      "Score: 0.653606\n",
+      "Title: Pie\n",
+      "Text: A pie is a baked  food that is made from pastry crust with or without a pastry top. The common filli\n",
+      "Id:79026\n",
+      "Score: 0.65358526\n",
+      "Title: Pectin\n",
+      "Text: Pectin is a food supplement. It is a source of dietary fiber. It is used to make jellies and jams. U\n",
+      "Id:63962\n",
+      "Score: 0.6528203\n",
+      "Title: Sprite\n",
+      "Text: Sprite is a lemon-lime soda, similar to 7 UP and Sierra Mist. It is made by the Coca-Cola Company. I\n"
+     ]
+    }
+   ],
+   "source": [
+    "response = client.search(\n",
+    "  index = index_name,\n",
+    "  body = {\n",
+    "      \"size\": 15,\n",
+    "      \"query\" : {\n",
+    "        \"knn\" : {\n",
+    "          \"content_vector\":{\n",
+    "          \"vector\":  question_embedding.data[0].embedding,\n",
+    "          \"k\": 3\n",
+    "        }\n",
+    "      }\n",
+    "    }\n",
+    "  }\n",
+    ")\n",
+    "\n",
+    "for result in response[\"hits\"][\"hits\"]:\n",
+    "  print(\"Id:\" + str(result['_id']))\n",
+    "  print(\"Score: \" + str(result[\"_score\"]))\n",
+    "  print(\"Title: \" + str(result[\"_source\"][\"title\"]))\n",
+    "  print(\"Text: \" + result[\"_source\"][\"text\"][0:100])\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Use OpenAI Chat Completions API to generate a reply\n",
+    "\n",
+    "The step above retrieves the content that is semantically similar to the question. Now let's use the OpenAI chat `completions` API to generate a reply based on the retrieved information."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "------------------------------------------------------------\n",
+      "Pineapple is a contentious ingredient for pizza, and it is commonly used at Pizza Pizza Limited (PPL), a Canadian fast-food restaurant with locations throughout Ontario. The menu includes a variety of toppings such as pepperoni, pineapples, mushrooms, and other non-exotic produce. Pizza Pizza has been a staple in the area for over 30 years, with over 500 locations in Ontario and expanding across the nation, including recent openings in Montreal and British Columbia.\n",
+      "------------------------------------------------------------\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Retrieve the text of the first result in the search response above\n",
+    "top_hit_summary = response['hits']['hits'][0]['_source']['text']\n",
+    "\n",
+    "# Craft a reply\n",
+    "response = openaiclient.chat.completions.create(\n",
+    "    model=\"gpt-3.5-turbo\",\n",
+    "    messages=[\n",
+    "        {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
+    "        {\"role\": \"user\", \"content\": \"Answer the following question: \"\n",
+    "            + question\n",
+    "            + \" by using the following text: \"\n",
+    "            + top_hit_summary\n",
+    "        }\n",
+    "        ]\n",
+    "    )\n",
+    "\n",
+    "choices = response.choices\n",
+    "\n",
+    "for choice in choices:\n",
+    "    print(\"------------------------------------------------------------\")\n",
+    "    print(choice.message.content)\n",
+    "    print(\"------------------------------------------------------------\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Conclusion\n",
+    "\n",
+    "OpenSearch is a powerful tool providing both text and vector search capabilities. Using it alongside the OpenAI APIs lets you craft personalized AI applications that augment the context based on semantic search.\n",
+    "\n",
+    "You can try Aiven for OpenSearch, or any of the other Open Source tools, in the Aiven platform free trial by [signing up](https://go.aiven.io/openai-opensearch-signup)."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
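The bulk-indexing loop in the notebook above can be sketched without pandas or a live cluster, which makes the batching logic easy to check in isolation. This is a minimal illustration under stated assumptions, not the notebook's code: the helper names `rows_to_bulk_actions` and `in_batches` and the sample rows are hypothetical, and only two of the notebook's fields are carried over to keep the example short.

```python
import json


def rows_to_bulk_actions(rows, index_name):
    """Turn CSV-like rows (dicts) into bulk index actions (hypothetical helper)."""
    for row in rows:
        yield {
            "_index": index_name,
            "_id": row["id"],
            "_source": {
                "title": row["title"],
                # Vectors arrive as JSON-encoded strings in the CSV file.
                "content_vector": json.loads(row["content_vector"]),
            },
        }


def in_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up rows standing in for the Wikipedia dataframe.
sample_rows = [
    {"id": i, "title": f"Article {i}", "content_vector": "[0.1, 0.2]"}
    for i in range(5)
]
batches = [
    list(rows_to_bulk_actions(batch, "openai_wikipedia_index"))
    for batch in in_batches(sample_rows, 2)
]
print([len(batch) for batch in batches])  # → [2, 2, 1]
```

Each batch of actions could then be handed to `helpers.bulk(client, batch)` from `opensearchpy`, as the notebook does for batches of `200` rows.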
--- a/registry.yaml
+++ b/registry.yaml
@@ -1294,4 +1294,12 @@
    - colin-openai
  tags:
    - completions
-    - functions
+    - functions
+
+- title: OpenSearch as a vector database
+  path: examples/vector_databases/opensearch/README.md
+  date: 2024-05-10
+  authors:
+    - ftisiot
+  tags:
+    - embeddings