|
|
@ -0,0 +1,545 @@
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cells": [
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"# Semantic search using OpenSearch and OpenAI\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"This notebook guides you through an example of using [Aiven for OpenSearch](https://go.aiven.io/openai-opensearch-os) as a backend vector database for OpenAI embeddings and how to perform semantic search.\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"## Why using OpenSearch as backend vector database\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"OpenSearch is a widely adopted open source search/analytics engine. It allows to store, query and transform documents in a variety of shapes and provides fast and scalable functionalities to perform both accurate and [fuzzy text search](https://opensearch.org/docs/latest/query-dsl/term/fuzzy/). Using OpenSearch as vector database enables you to mix and match semantic and text search queries on top of a performant and scalable engine.\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"## Prerequisites\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"Before you begin, ensure to follow the prerequisites:\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"1. An [Aiven Account](https://go.aiven.io/openai-opensearch-signup). You can create an account and start a free trial with Aiven by navigating to the [signup page](https://go.aiven.io/openai-opensearch-signup) and creating a user.\n",
|
|
|
|
|
|
|
|
"2. An [Aiven for OpenSearch service](https://go.aiven.io/openai-opensearch-os). You can spin up an Aiven for OpenSearch service in minutes in the [Aiven Console](https://go.aiven.io/openai-opensearch-console) with the following steps \n",
|
|
|
|
|
|
|
|
" * Click on **Create service**\n",
|
|
|
|
|
|
|
|
" * Select **OpenSearch**\n",
|
|
|
|
|
|
|
|
" * Choose the **Cloud Provider and Region**\n",
|
|
|
|
|
|
|
|
" * Select the **Service plan** (the `hobbyist` plan is enough for the notebook)\n",
|
|
|
|
|
|
|
|
" * Provide the **Service name**\n",
|
|
|
|
|
|
|
|
" * Click on **Create service**\n",
|
|
|
|
|
|
|
|
"3. The OpenSearch **Connection String**. The connection string is visible as **Service URI** in the Aiven for OpenSearch service overview page.\n",
|
|
|
|
|
|
|
|
"4. Your [OpenAI API key](https://platform.openai.com/account/api-keys)\n",
|
|
|
|
|
|
|
|
"5. Python and `pip`.\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"## Installing dependencies\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"The notebook requires the following packages:\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"* `openai`\n",
|
|
|
|
|
|
|
|
"* `pandas`\n",
|
|
|
|
|
|
|
|
"* `wget`\n",
|
|
|
|
|
|
|
|
"* `python-dotenv`\n",
|
|
|
|
|
|
|
|
"* `opensearch-py`\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"You can install the above packages with:"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
|
|
|
"metadata": {
|
|
|
|
|
|
|
|
"vscode": {
|
|
|
|
|
|
|
|
"languageId": "shellscript"
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
"outputs": [],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"pip install openai pandas wget python-dotenv opensearch-py"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"## OpenAI key settings\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"We'll use OpenAI to create embeddings starting from a set of documents, therefore an OpenAI API key is needed. You can get one from the [OpenAI API Key page](https://platform.openai.com/account/api-keys) after logging in.\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"To avoid leaking the OpenAI key, you can store it as an environment variable named `OPENAI_API_KEY`. \n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
 For more information">
"> For more information on how to perform the same task across other operating systems, refer to [Best Practices for API Key Safety](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety).\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"To store safely the information, create a `.env` file in the same folder where the notebook is located and add the following line, replacing the `<INSERT_YOUR_API_KEY_HERE>` with your OpenAI API Key.\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"```bash\n",
|
|
|
|
|
|
|
|
"OPENAI_API_KEY=<INSERT_YOUR_API_KEY_HERE>\n",
|
|
|
|
|
|
|
|
"```\n"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"## Connect to Aiven for OpenSearch\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"Once the Aiven for OpenSearch service is in `RUNNING` state, we can retrieve the connection string from the Aiven for Opensearch service page, by copying the **Service URI** parameter. We can store it in the same `.env` file created above, after replacing the `https://USER:PASSWORD@HOST:PORT` string with the Service URI.\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"```bash\n",
|
|
|
|
|
|
|
|
"OPENSEARCH_URI=https://USER:PASSWORD@HOST:PORT\n",
|
|
|
|
|
|
|
|
"```"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
" We can now connect to Aiven for OpenSearch with:"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
|
|
|
"metadata": {
|
|
|
|
|
|
|
|
"vscode": {
|
|
|
|
|
|
|
|
"languageId": "shellscript"
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
"outputs": [],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"import os\n",
|
|
|
|
|
|
|
|
"from opensearchpy import OpenSearch\n",
|
|
|
|
|
|
|
|
"from dotenv import load_dotenv\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"# Load environment variables from .env file\n",
|
|
|
|
|
|
|
|
"load_dotenv()\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"connection_string = os.getenv(\"OPENSEARCH_URI\")\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"# Create the client with SSL/TLS enabled, but hostname verification disabled.\n",
|
|
|
|
|
|
|
|
"client = OpenSearch(connection_string, use_ssl=True, timeout=100)\n"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
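{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (not required for the rest of the notebook), we can verify the connection by asking the cluster for its basic information:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Retrieve and print basic cluster information to verify the connection\n",
"print(client.info())"
]
},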
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"## Download the dataset\n",
|
|
|
|
|
|
|
|
"To save us from having to recalculate embeddings on a huge dataset, we are using a pre-calculated OpenAI embeddings dataset covering wikipedia articles. We can get the file and unzip it with:"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"import wget\n",
|
|
|
|
|
|
|
|
"import zipfile\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n",
|
|
|
|
|
|
|
|
"wget.download(embeddings_url)\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\",\n",
|
|
|
|
|
|
|
|
"\"r\") as zip_ref:\n",
|
|
|
|
|
|
|
|
" zip_ref.extractall(\"data\")"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"Let's load the file in a dataframe and check the content with:"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"import pandas as pd\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"wikipedia_dataframe = pd.read_csv(\"data/vector_database_wikipedia_articles_embedded.csv\")\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"wikipedia_dataframe.head()"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"The file contains:\n",
|
|
|
|
|
|
|
|
"* `id` a unique Wikipedia article identifier\n",
|
|
|
|
|
|
|
|
"* `url` the Wikipedia article URL\n",
|
|
|
|
|
|
|
|
"* `title` the title of the Wikipedia page\n",
|
|
|
|
|
|
|
|
"* `text` the text of the article\n",
|
|
|
|
|
|
|
|
"* `title_vector` and `content_vector` the embedding calculated on the title and content of the wikipedia article respectively\n",
|
|
|
|
|
|
|
|
"* `vector_id` the id of the vector\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"We can create an OpenSearch mapping optimized for the storage of these information with:"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"index_settings ={\n",
|
|
|
|
|
|
|
|
" \"index\": {\n",
|
|
|
|
|
|
|
|
" \"knn\": True,\n",
|
|
|
|
|
|
|
|
" \"knn.algo_param.ef_search\": 100\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"index_mapping= {\n",
|
|
|
|
|
|
|
|
" \"properties\": {\n",
|
|
|
|
|
|
|
|
" \"title_vector\": {\n",
|
|
|
|
|
|
|
|
" \"type\": \"knn_vector\",\n",
|
|
|
|
|
|
|
|
" \"dimension\": 1536,\n",
|
|
|
|
|
|
|
|
" \"method\": {\n",
|
|
|
|
|
|
|
|
" \"name\": \"hnsw\",\n",
|
|
|
|
|
|
|
|
" \"space_type\": \"l2\",\n",
|
|
|
|
|
|
|
|
" \"engine\": \"faiss\"\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
" },\n",
|
|
|
|
|
|
|
|
" \"content_vector\": {\n",
|
|
|
|
|
|
|
|
" \"type\": \"knn_vector\",\n",
|
|
|
|
|
|
|
|
" \"dimension\": 1536,\n",
|
|
|
|
|
|
|
|
" \"method\": {\n",
|
|
|
|
|
|
|
|
" \"name\": \"hnsw\",\n",
|
|
|
|
|
|
|
|
" \"space_type\": \"l2\",\n",
|
|
|
|
|
|
|
|
" \"engine\": \"faiss\"\n",
|
|
|
|
|
|
|
|
" },\n",
|
|
|
|
|
|
|
|
" },\n",
|
|
|
|
|
|
|
|
" \"text\": {\"type\": \"text\"},\n",
|
|
|
|
|
|
|
|
" \"title\": {\"type\": \"text\"},\n",
|
|
|
|
|
|
|
|
" \"url\": { \"type\": \"keyword\"},\n",
|
|
|
|
|
|
|
|
" \"vector_id\": {\"type\": \"long\"}\n",
|
|
|
|
|
|
|
|
" \n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
"}"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"And create an index in Aiven for OpenSearch with:"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"index_name = \"openai_wikipedia_index\"\n",
|
|
|
|
|
|
|
|
"client.indices.create(index=index_name, body={\"settings\": index_settings, \"mappings\":index_mapping})"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"## Index data into OpenSearch\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"Now it's time to parse the the pandas dataframe and index the data into OpenSearch using Bulk APIs. The following function indexes a set of rows in the dataframe:"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"def dataframe_to_bulk_actions(df):\n",
|
|
|
|
|
|
|
|
" for index, row in df.iterrows():\n",
|
|
|
|
|
|
|
|
" yield {\n",
|
|
|
|
|
|
|
|
" \"_index\": index_name,\n",
|
|
|
|
|
|
|
|
" \"_id\": row['id'],\n",
|
|
|
|
|
|
|
|
" \"_source\": {\n",
|
|
|
|
|
|
|
|
" 'url' : row[\"url\"],\n",
|
|
|
|
|
|
|
|
" 'title' : row[\"title\"],\n",
|
|
|
|
|
|
|
|
" 'text' : row[\"text\"],\n",
|
|
|
|
|
|
|
|
" 'title_vector' : json.loads(row[\"title_vector\"]),\n",
|
|
|
|
|
|
|
|
" 'content_vector' : json.loads(row[\"content_vector\"]),\n",
|
|
|
|
|
|
|
|
" 'vector_id' : row[\"vector_id\"]\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
" }"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"We don't want to index all the dataset at once, since it's way too large, so we'll load it in batches of `200` rows."
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"from opensearchpy import helpers\n",
|
|
|
|
|
|
|
|
"import json\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"start = 0\n",
|
|
|
|
|
|
|
|
"end = len(wikipedia_dataframe)\n",
|
|
|
|
|
|
|
|
"batch_size = 200\n",
|
|
|
|
|
|
|
|
"for batch_start in range(start, end, batch_size):\n",
|
|
|
|
|
|
|
|
" batch_end = min(batch_start + batch_size, end)\n",
|
|
|
|
|
|
|
|
" batch_dataframe = wikipedia_dataframe.iloc[batch_start:batch_end]\n",
|
|
|
|
|
|
|
|
" actions = dataframe_to_bulk_actions(batch_dataframe)\n",
|
|
|
|
|
|
|
|
" helpers.bulk(client, actions)"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
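{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional check, we can count the documents in the index to confirm the bulk load succeeded. A refresh makes the freshly indexed documents visible to the count:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Refresh the index so all indexed documents are visible, then count them\n",
"client.indices.refresh(index=index_name)\n",
"print(client.count(index=index_name)[\"count\"])"
]
},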
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"Once all the documents are indexed, let's try a query to retrieve the documents containing `Pizza`:"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"res = client.search(index=index_name, body={\n",
|
|
|
|
|
|
|
|
" \"_source\": {\n",
|
|
|
|
|
|
|
|
" \"excludes\": [\"title_vector\", \"content_vector\"]\n",
|
|
|
|
|
|
|
|
" },\n",
|
|
|
|
|
|
|
|
" \"query\": {\n",
|
|
|
|
|
|
|
|
" \"match\": {\n",
|
|
|
|
|
|
|
|
" \"text\": {\n",
|
|
|
|
|
|
|
|
" \"query\": \"Pizza\"\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
"})\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"print(res[\"hits\"][\"hits\"][0][\"_source\"][\"text\"])"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"# Encode questions with OpenAI\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"To perform a semantic search, we need to calculate questions encodings with the same embedding model used to encode the documents at index time. In this example, we need to use the `text-embedding-3-small` model."
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"from openai import OpenAI\n",
|
|
|
|
|
|
|
|
"import os\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"# Define model\n",
|
|
|
|
|
|
|
|
"EMBEDDING_MODEL = \"text-embedding-ada-002\"\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"# Define the Client\n",
|
|
|
|
|
|
|
|
"openaiclient = OpenAI(\n",
|
|
|
|
|
|
|
|
" # This is the default and can be omitted\n",
|
|
|
|
|
|
|
|
" api_key=os.getenv(\"OPENAI_API_KEY\"),\n",
|
|
|
|
|
|
|
|
")\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"# Define question\n",
|
|
|
|
|
|
|
|
"question = 'is Pineapple a good ingredient for Pizza?'\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"# Create embedding\n",
|
|
|
|
|
|
|
|
"question_embedding = openaiclient.embeddings.create(input=question, model=EMBEDDING_MODEL)"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
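{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional check, we can confirm that the length of the returned embedding matches the `dimension` of `1536` defined in the index mapping:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The embedding length must match the knn_vector dimension in the mapping\n",
"print(len(question_embedding.data[0].embedding))"
]
},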
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"# Run semantic search queries with OpenSearch\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"With the above embedding calculated, we can now run semantic searches against the OpenSearch index. We're using `knn` as query type and scan the content of the `content_vector` field"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": 21,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"name": "stdout",
|
|
|
|
|
|
|
|
"output_type": "stream",
|
|
|
|
|
|
|
|
"text": [
|
|
|
|
|
|
|
|
"Id:66079\n",
|
|
|
|
|
|
|
|
"Score: 0.71338785\n",
|
|
|
|
|
|
|
|
"Title: Pizza Pizza\n",
|
|
|
|
|
|
|
|
"Text: Pizza Pizza Limited (PPL), doing business as Pizza Pizza (), is a franchised Canadian pizza fast foo\n",
|
|
|
|
|
|
|
|
"Id:15719\n",
|
|
|
|
|
|
|
|
"Score: 0.7115042\n",
|
|
|
|
|
|
|
|
"Title: Pineapple\n",
|
|
|
|
|
|
|
|
"Text: The pineapple is a fruit. It is native to South America, Central America and the Caribbean. The word\n",
|
|
|
|
|
|
|
|
"Id:13967\n",
|
|
|
|
|
|
|
|
"Score: 0.7106797\n",
|
|
|
|
|
|
|
|
"Title: Pizza\n",
|
|
|
|
|
|
|
|
"Text: Pizza is an Italian food that was created in Italy (The Naples area). It is made with different topp\n",
|
|
|
|
|
|
|
|
"Id:13968\n",
|
|
|
|
|
|
|
|
"Score: 0.69487476\n",
|
|
|
|
|
|
|
|
"Title: Pepperoni\n",
|
|
|
|
|
|
|
|
"Text: Pepperoni is a meat food that is sometimes sliced thin and put on pizza. It is a kind of salami, whi\n",
|
|
|
|
|
|
|
|
"Id:40989\n",
|
|
|
|
|
|
|
|
"Score: 0.6696015\n",
|
|
|
|
|
|
|
|
"Title: Coprophagia\n",
|
|
|
|
|
|
|
|
"Text: Coprophagia is the eating of faeces. Many animals eat faeces, either their own or that of other anim\n",
|
|
|
|
|
|
|
|
"Id:90918\n",
|
|
|
|
|
|
|
|
"Score: 0.66611433\n",
|
|
|
|
|
|
|
|
"Title: Pizza Hut\n",
|
|
|
|
|
|
|
|
"Text: Pizza Hut is an American pizza restaurant, or pizza parlor. Pizza Hut also serves salads, pastas and\n",
|
|
|
|
|
|
|
|
"Id:433\n",
|
|
|
|
|
|
|
|
"Score: 0.66609937\n",
|
|
|
|
|
|
|
|
"Title: Lanai\n",
|
|
|
|
|
|
|
|
"Text: Lanai (or Lānaʻi) is sixth largest of the Hawaiian Islands, in the United States. It is also known a\n",
|
|
|
|
|
|
|
|
"Id:45877\n",
|
|
|
|
|
|
|
|
"Score: 0.66580874\n",
|
|
|
|
|
|
|
|
"Title: Papaya\n",
|
|
|
|
|
|
|
|
"Text: Papaya is a tall herbaceous plant in the genus Carica; its edible fruit is also called papaya. It is\n",
|
|
|
|
|
|
|
|
"Id:41467\n",
|
|
|
|
|
|
|
|
"Score: 0.6646078\n",
|
|
|
|
|
|
|
|
"Title: Te Puke\n",
|
|
|
|
|
|
|
|
"Text: Te Puke is a small town in the Bay of Plenty in New Zealand. 6670 people live there. It is famous fo\n",
|
|
|
|
|
|
|
|
"Id:31270\n",
|
|
|
|
|
|
|
|
"Score: 0.65891963\n",
|
|
|
|
|
|
|
|
"Title: Afelia\n",
|
|
|
|
|
|
|
|
"Text: Afelia is a Greek food. It is popular in the island nation of Cyprus. Afelia is made from pork, red \n",
|
|
|
|
|
|
|
|
"Id:61037\n",
|
|
|
|
|
|
|
|
"Score: 0.6569093\n",
|
|
|
|
|
|
|
|
"Title: Dough\n",
|
|
|
|
|
|
|
|
"Text: Dough is a thick, malleable and sometimes elastic paste made out of flour by mixing it with a small \n",
|
|
|
|
|
|
|
|
"Id:76670\n",
|
|
|
|
|
|
|
|
"Score: 0.6560743\n",
|
|
|
|
|
|
|
|
"Title: Lycopene\n",
|
|
|
|
|
|
|
|
"Text: Lycopene is the pigment of tomato. Its chemical formula is (6E,8E,10E,12E,14E,16E,18E,20E,22E,24E,26\n",
|
|
|
|
|
|
|
|
"Id:32248\n",
|
|
|
|
|
|
|
|
"Score: 0.653606\n",
|
|
|
|
|
|
|
|
"Title: Pie\n",
|
|
|
|
|
|
|
|
"Text: A pie is a baked food that is made from pastry crust with or without a pastry top. The common filli\n",
|
|
|
|
|
|
|
|
"Id:79026\n",
|
|
|
|
|
|
|
|
"Score: 0.65358526\n",
|
|
|
|
|
|
|
|
"Title: Pectin\n",
|
|
|
|
|
|
|
|
"Text: Pectin is a food supplement. It is a source of dietary fiber. It is used to make jellies and jams. U\n",
|
|
|
|
|
|
|
|
"Id:63962\n",
|
|
|
|
|
|
|
|
"Score: 0.6528203\n",
|
|
|
|
|
|
|
|
"Title: Sprite\n",
|
|
|
|
|
|
|
|
"Text: Sprite is a lemon-lime soda, similar to 7 UP and Sierra Mist. It is made by the Coca-Cola Company. I\n"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"response = client.search(\n",
|
|
|
|
|
|
|
|
" index = index_name,\n",
|
|
|
|
|
|
|
|
" body = {\n",
|
|
|
|
|
|
|
|
" \"size\": 15,\n",
|
|
|
|
|
|
|
|
" \"query\" : {\n",
|
|
|
|
|
|
|
|
" \"knn\" : {\n",
|
|
|
|
|
|
|
|
" \"content_vector\":{\n",
|
|
|
|
|
|
|
|
" \"vector\": question_embedding.data[0].embedding,\n",
|
|
|
|
|
|
|
|
" \"k\": 3\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
")\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"for result in response[\"hits\"][\"hits\"]:\n",
|
|
|
|
|
|
|
|
" print(\"Id:\" + str(result['_id']))\n",
|
|
|
|
|
|
|
|
" print(\"Score: \" + str(result[\"_score\"]))\n",
|
|
|
|
|
|
|
|
" print(\"Title: \" + str(result[\"_source\"][\"title\"]))\n",
|
|
|
|
|
|
|
|
" print(\"Text: \" + result[\"_source\"][\"text\"][0:100])\n"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"## Use OpenAI Chat Completions API to generate a reply\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"The step above retrieves the content semantically similar to the question, now let's use OpenAI chat `completions` to generate a reply based on the information retrieved."
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
|
|
|
"execution_count": 22,
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"outputs": [
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"name": "stdout",
|
|
|
|
|
|
|
|
"output_type": "stream",
|
|
|
|
|
|
|
|
"text": [
|
|
|
|
|
|
|
|
"------------------------------------------------------------\n",
|
|
|
|
|
|
|
|
"Pineapple is a contentious ingredient for pizza, and it is commonly used at Pizza Pizza Limited (PPL), a Canadian fast-food restaurant with locations throughout Ontario. The menu includes a variety of toppings such as pepperoni, pineapples, mushrooms, and other non-exotic produce. Pizza Pizza has been a staple in the area for over 30 years, with over 500 locations in Ontario and expanding across the nation, including recent openings in Montreal and British Columbia.\n",
|
|
|
|
|
|
|
|
"------------------------------------------------------------\n"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
],
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"# Retrieve the text of the first result in the above dataset\n",
|
|
|
|
|
|
|
|
"top_hit_summary = response['hits']['hits'][0]['_source']['text']\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"# Craft a reply\n",
|
|
|
|
|
|
|
|
"response = openaiclient.chat.completions.create(\n",
|
|
|
|
|
|
|
|
" model=\"gpt-3.5-turbo\",\n",
|
|
|
|
|
|
|
|
" messages=[\n",
|
|
|
|
|
|
|
|
" {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
|
|
|
|
|
|
|
|
" {\"role\": \"user\", \"content\": \"Answer the following question:\" \n",
|
|
|
|
|
|
|
|
" + question \n",
|
|
|
|
|
|
|
|
" + \"by using the following text:\" \n",
|
|
|
|
|
|
|
|
" + top_hit_summary\n",
|
|
|
|
|
|
|
|
" }\n",
|
|
|
|
|
|
|
|
" ]\n",
|
|
|
|
|
|
|
|
" )\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"choices = response.choices\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"for choice in choices:\n",
|
|
|
|
|
|
|
|
" print(\"------------------------------------------------------------\")\n",
|
|
|
|
|
|
|
|
" print(choice.message.content)\n",
|
|
|
|
|
|
|
|
" print(\"------------------------------------------------------------\")\n"
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
{
|
|
|
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
|
|
|
"metadata": {},
|
|
|
|
|
|
|
|
"source": [
|
|
|
|
|
|
|
|
"## Conclusion\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"OpenSearch is a powerful tool providing both text and vector search capabilities. Used alongside OpenAI APIs allows you to craft personalized AI applications able to augment the context based on semantic search.\n",
|
|
|
|
|
|
|
|
"\n",
|
|
|
|
|
|
|
|
"You can try Aiven for OpenSearch, or any of the other Open Source tools, in the Aiven platform free trial by [signing up](https://go.aiven.io/openai-opensearch-signup)."
|
|
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
],
|
|
|
|
|
|
|
|
"metadata": {
|
|
|
|
|
|
|
|
"kernelspec": {
|
|
|
|
|
|
|
|
"display_name": "Python 3",
|
|
|
|
|
|
|
|
"language": "python",
|
|
|
|
|
|
|
|
"name": "python3"
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
"language_info": {
|
|
|
|
|
|
|
|
"codemirror_mode": {
|
|
|
|
|
|
|
|
"name": "ipython",
|
|
|
|
|
|
|
|
"version": 3
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
"file_extension": ".py",
|
|
|
|
|
|
|
|
"mimetype": "text/x-python",
|
|
|
|
|
|
|
|
"name": "python",
|
|
|
|
|
|
|
|
"nbconvert_exporter": "python",
|
|
|
|
|
|
|
|
"pygments_lexer": "ipython3",
|
|
|
|
|
|
|
|
"version": "3.12.2"
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
},
|
|
|
|
|
|
|
|
"nbformat": 4,
|
|
|
|
|
|
|
|
"nbformat_minor": 2
|
|
|
|
|
|
|
|
}
|