From c8a0c91f217ae2bbd30321a2969619e92144c582 Mon Sep 17 00:00:00 2001
From: Francesco
Date: Fri, 10 May 2024 10:22:32 +0200
Subject: [PATCH] Added OpenSearch

Added OpenSearch to vector databases example
---
 authors.yaml                                  |   5 +
 .../vector_databases/opensearch/README.md     |  20 +
 .../aiven-opensearch-vector-search.ipynb      | 545 ++++++++++++++++++
 registry.yaml                                 |  10 +-
 4 files changed, 579 insertions(+), 1 deletion(-)
 create mode 100644 examples/vector_databases/opensearch/README.md
 create mode 100644 examples/vector_databases/opensearch/aiven-opensearch-vector-search.ipynb

diff --git a/authors.yaml b/authors.yaml
index 54c948aa..a989bc16 100644
--- a/authors.yaml
+++ b/authors.yaml
@@ -93,3 +93,8 @@ royziv11:
   website: "https://www.linkedin.com/in/roy-ziv-a46001149/"
   avatar: "https://media.licdn.com/dms/image/D5603AQHkaEOOGZWtbA/profile-displayphoto-shrink_400_400/0/1699500606122?e=1716422400&v=beta&t=wKEIx-vTEqm9wnqoC7-xr1WqJjghvcjjlMt034hXY_4"
+
+ftisiot:
+  name: "Francesco Tisiot"
+  website: "https://ftisiot.net"
+  avatar: "https://ftisiot.net/images/ftisiot.png"
diff --git a/examples/vector_databases/opensearch/README.md b/examples/vector_databases/opensearch/README.md
new file mode 100644
index 00000000..3e1d1ccc
--- /dev/null
+++ b/examples/vector_databases/opensearch/README.md
@@ -0,0 +1,20 @@
+# OpenSearch
+
+OpenSearch is a popular open-source search/analytics engine and [vector database](https://opensearch.org/platform/search/vector-database.html).
+With OpenSearch you can efficiently store and query any kind of data, including your vector embeddings, at scale.
+
+[Aiven](https://go.aiven.io/openai-opensearch-aiven) provides a way to experience the best of open source data technologies, including OpenSearch, in a secure, well-integrated, scalable, and trusted data platform.
+[Aiven for OpenSearch](https://aiven.io/opensearch) allows you to experience OpenSearch in minutes, on all the major cloud vendors and regions, supported by a self-healing platform with a 99.99% SLA.
+
+For technical details, refer to the [OpenSearch documentation](https://opensearch.org/docs/latest/search-plugins/knn/index/).
+
+## OpenAI cookbook notebooks 📒
+
+Check out our notebooks in this repo for working with OpenAI, using OpenSearch as your vector database.
+
+### [Semantic search](aiven-opensearch-vector-search.ipynb)
+
+In this notebook you'll learn how to:
+
+ - Index the OpenAI Wikipedia embeddings dataset into OpenSearch
+ - Encode a question with the `text-embedding-ada-002` model
+ - Perform a semantic search
\ No newline at end of file
diff --git a/examples/vector_databases/opensearch/aiven-opensearch-vector-search.ipynb b/examples/vector_databases/opensearch/aiven-opensearch-vector-search.ipynb
new file mode 100644
index 00000000..d2f0c1e3
--- /dev/null
+++ b/examples/vector_databases/opensearch/aiven-opensearch-vector-search.ipynb
@@ -0,0 +1,545 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Semantic search using OpenSearch and OpenAI\n",
+    "\n",
+    "This notebook guides you through an example of using [Aiven for OpenSearch](https://go.aiven.io/openai-opensearch-os) as a backend vector database for OpenAI embeddings, and shows how to perform semantic search.\n",
+    "\n",
+    "\n",
+    "## Why use OpenSearch as a backend vector database\n",
+    "\n",
+    "OpenSearch is a widely adopted open source search/analytics engine. It allows you to store, query, and transform documents in a variety of shapes, and provides fast and scalable functionality to perform both accurate and [fuzzy text search](https://opensearch.org/docs/latest/query-dsl/term/fuzzy/). 
Using OpenSearch as a vector database enables you to mix and match semantic and text search queries on top of a performant and scalable engine.\n",
+    "\n",
+    "## Prerequisites\n",
+    "\n",
+    "Before you begin, make sure you have the following prerequisites:\n",
+    "\n",
+    "1. An [Aiven Account](https://go.aiven.io/openai-opensearch-signup). You can create an account and start a free trial with Aiven by navigating to the [signup page](https://go.aiven.io/openai-opensearch-signup) and creating a user.\n",
+    "2. An [Aiven for OpenSearch service](https://go.aiven.io/openai-opensearch-os). You can spin up an Aiven for OpenSearch service in minutes in the [Aiven Console](https://go.aiven.io/openai-opensearch-console) with the following steps:\n",
+    "    * Click on **Create service**\n",
+    "    * Select **OpenSearch**\n",
+    "    * Choose the **Cloud Provider and Region**\n",
+    "    * Select the **Service plan** (the `hobbyist` plan is enough for the notebook)\n",
+    "    * Provide the **Service name**\n",
+    "    * Click on **Create service**\n",
+    "3. The OpenSearch **Connection String**. The connection string is visible as **Service URI** in the Aiven for OpenSearch service overview page.\n",
+    "4. Your [OpenAI API key](https://platform.openai.com/account/api-keys)\n",
+    "5. Python and `pip`.\n",
+    "\n",
+    "## Installing dependencies\n",
+    "\n",
+    "The notebook requires the following packages:\n",
+    "\n",
+    "* `openai`\n",
+    "* `pandas`\n",
+    "* `wget`\n",
+    "* `python-dotenv`\n",
+    "* `opensearch-py`\n",
+    "\n",
+    "You can install the above packages with:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "shellscript"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "pip install openai pandas wget python-dotenv opensearch-py"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## OpenAI key settings\n",
+    "\n",
+    "We'll use OpenAI to create embeddings starting from a set of documents, so an OpenAI API key is needed. 
You can get one from the [OpenAI API Key page](https://platform.openai.com/account/api-keys) after logging in.\n",
+    "\n",
+    "To avoid leaking the OpenAI key, you can store it as an environment variable named `OPENAI_API_KEY`.\n",
+    "\n",
+    "> For more information on how to perform the same task across other operating systems, refer to [Best Practices for API Key Safety](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety).\n",
+    "\n",
+    "To store the information safely, create a `.env` file in the same folder where the notebook is located and add the following line, replacing `<YOUR_OPENAI_API_KEY>` with your OpenAI API key.\n",
+    "\n",
+    "```bash\n",
+    "OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>\n",
+    "```\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Connect to Aiven for OpenSearch\n",
+    "\n",
+    "Once the Aiven for OpenSearch service is in the `RUNNING` state, we can retrieve the connection string from the Aiven for OpenSearch service page by copying the **Service URI** parameter. 
We can store it in the same `.env` file created above by adding the following line, replacing `https://USER:PASSWORD@HOST:PORT` with the Service URI.\n",
+    "\n",
+    "```bash\n",
+    "OPENSEARCH_URI=https://USER:PASSWORD@HOST:PORT\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can now connect to Aiven for OpenSearch with:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from opensearchpy import OpenSearch\n",
+    "from dotenv import load_dotenv\n",
+    "\n",
+    "# Load environment variables from .env file\n",
+    "load_dotenv()\n",
+    "\n",
+    "connection_string = os.getenv(\"OPENSEARCH_URI\")\n",
+    "\n",
+    "# Create the client with SSL/TLS enabled and a generous timeout\n",
+    "client = OpenSearch(connection_string, use_ssl=True, timeout=100)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Download the dataset\n",
+    "To save us from having to recalculate embeddings on a huge dataset, we are using a pre-calculated OpenAI embeddings dataset covering Wikipedia articles. 
We can get the file and unzip it with:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import wget\n",
+    "import zipfile\n",
+    "\n",
+    "embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n",
+    "wget.download(embeddings_url)\n",
+    "\n",
+    "with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\",\n",
+    "                     \"r\") as zip_ref:\n",
+    "    zip_ref.extractall(\"data\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's load the file into a dataframe and check the content with:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "wikipedia_dataframe = pd.read_csv(\"data/vector_database_wikipedia_articles_embedded.csv\")\n",
+    "\n",
+    "wikipedia_dataframe.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The file contains:\n",
+    "* `id` a unique Wikipedia article identifier\n",
+    "* `url` the Wikipedia article URL\n",
+    "* `title` the title of the Wikipedia page\n",
+    "* `text` the text of the article\n",
+    "* `title_vector` and `content_vector` the embeddings calculated on the title and content of the Wikipedia article, respectively\n",
+    "* `vector_id` the id of the vector\n",
+    "\n",
+    "We can create an OpenSearch mapping optimized for the storage of this information with:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "index_settings = {\n",
+    "    \"index\": {\n",
+    "        \"knn\": True,\n",
+    "        \"knn.algo_param.ef_search\": 100\n",
+    "    }\n",
+    "}\n",
+    "\n",
+    "index_mapping = {\n",
+    "    \"properties\": {\n",
+    "        \"title_vector\": {\n",
+    "            \"type\": \"knn_vector\",\n",
+    "            \"dimension\": 1536,\n",
+    "            \"method\": {\n",
+    "                \"name\": \"hnsw\",\n",
+    "                \"space_type\": \"l2\",\n",
+    "                \"engine\": \"faiss\"\n",
+    "            }\n",
+    "        
},\n",
+    "        \"content_vector\": {\n",
+    "            \"type\": \"knn_vector\",\n",
+    "            \"dimension\": 1536,\n",
+    "            \"method\": {\n",
+    "                \"name\": \"hnsw\",\n",
+    "                \"space_type\": \"l2\",\n",
+    "                \"engine\": \"faiss\"\n",
+    "            }\n",
+    "        },\n",
+    "        \"text\": {\"type\": \"text\"},\n",
+    "        \"title\": {\"type\": \"text\"},\n",
+    "        \"url\": {\"type\": \"keyword\"},\n",
+    "        \"vector_id\": {\"type\": \"long\"}\n",
+    "    }\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "And create an index in Aiven for OpenSearch with:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "index_name = \"openai_wikipedia_index\"\n",
+    "client.indices.create(index=index_name, body={\"settings\": index_settings, \"mappings\": index_mapping})"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Index data into OpenSearch\n",
+    "\n",
+    "Now it's time to parse the pandas dataframe and index the data into OpenSearch using the Bulk API. The following function indexes a set of rows in the dataframe:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# json is needed to parse the embeddings, stored as JSON strings in the CSV\n",
+    "import json\n",
+    "\n",
+    "def dataframe_to_bulk_actions(df):\n",
+    "    for index, row in df.iterrows():\n",
+    "        yield {\n",
+    "            \"_index\": index_name,\n",
+    "            \"_id\": row['id'],\n",
+    "            \"_source\": {\n",
+    "                'url' : row[\"url\"],\n",
+    "                'title' : row[\"title\"],\n",
+    "                'text' : row[\"text\"],\n",
+    "                'title_vector' : json.loads(row[\"title_vector\"]),\n",
+    "                'content_vector' : json.loads(row[\"content_vector\"]),\n",
+    "                'vector_id' : row[\"vector_id\"]\n",
+    "            }\n",
+    "        }"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We don't want to index the whole dataset at once, since it's too large, so we'll load it in batches of `200` rows."
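The `json.loads` calls in `dataframe_to_bulk_actions` above are needed because the CSV stores each embedding as a JSON-encoded string rather than a Python list. A minimal standalone sketch of that round trip (the sample values are illustrative, not taken from the dataset):

```python
import json

# In the CSV, each embedding is serialized as a JSON string such as "[0.1, -0.25, 0.33]"
raw_vector = "[0.1, -0.25, 0.33]"

# json.loads turns it back into a list of floats that OpenSearch can index as a knn_vector
vector = json.loads(raw_vector)

print(vector)  # → [0.1, -0.25, 0.33]
```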
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from opensearchpy import helpers\n",
+    "import json\n",
+    "\n",
+    "start = 0\n",
+    "end = len(wikipedia_dataframe)\n",
+    "batch_size = 200\n",
+    "for batch_start in range(start, end, batch_size):\n",
+    "    batch_end = min(batch_start + batch_size, end)\n",
+    "    batch_dataframe = wikipedia_dataframe.iloc[batch_start:batch_end]\n",
+    "    actions = dataframe_to_bulk_actions(batch_dataframe)\n",
+    "    helpers.bulk(client, actions)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Once all the documents are indexed, let's try a query to retrieve the documents containing `Pizza`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "res = client.search(index=index_name, body={\n",
+    "    \"_source\": {\n",
+    "        \"excludes\": [\"title_vector\", \"content_vector\"]\n",
+    "    },\n",
+    "    \"query\": {\n",
+    "        \"match\": {\n",
+    "            \"text\": {\n",
+    "                \"query\": \"Pizza\"\n",
+    "            }\n",
+    "        }\n",
+    "    }\n",
+    "})\n",
+    "\n",
+    "print(res[\"hits\"][\"hits\"][0][\"_source\"][\"text\"])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Encode questions with OpenAI\n",
+    "\n",
+    "To perform a semantic search, we need to encode the question with the same embedding model used to encode the documents at index time. The dataset embeddings were created with the `text-embedding-ada-002` model, so that's the model we need to use for the question too."
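The index mapping above declares `dimension: 1536`, and OpenSearch rejects query vectors of any other length, so a cheap local check before querying can fail fast. A sketch under that assumption; `check_query_vector` is an illustrative helper, not part of `opensearch-py`:

```python
EXPECTED_DIM = 1536  # must match the "dimension" declared in the knn_vector mapping

def check_query_vector(vector):
    """Fail fast if an embedding cannot be used against the 1536-dimension index."""
    if len(vector) != EXPECTED_DIM:
        raise ValueError(f"expected {EXPECTED_DIM} dimensions, got {len(vector)}")
    return vector

# A correctly sized vector passes through unchanged
checked = check_query_vector([0.0] * 1536)
```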
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from openai import OpenAI\n",
+    "import os\n",
+    "\n",
+    "# Define model\n",
+    "EMBEDDING_MODEL = \"text-embedding-ada-002\"\n",
+    "\n",
+    "# Define the Client\n",
+    "openaiclient = OpenAI(\n",
+    "    # This is the default and can be omitted\n",
+    "    api_key=os.getenv(\"OPENAI_API_KEY\"),\n",
+    ")\n",
+    "\n",
+    "# Define question\n",
+    "question = 'is Pineapple a good ingredient for Pizza?'\n",
+    "\n",
+    "# Create embedding\n",
+    "question_embedding = openaiclient.embeddings.create(input=question, model=EMBEDDING_MODEL)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Run semantic search queries with OpenSearch\n",
+    "\n",
+    "With the above embedding calculated, we can now run semantic searches against the OpenSearch index. We use `knn` as the query type and search over the `content_vector` field:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Id:66079\n",
+      "Score: 0.71338785\n",
+      "Title: Pizza Pizza\n",
+      "Text: Pizza Pizza Limited (PPL), doing business as Pizza Pizza (), is a franchised Canadian pizza fast foo\n",
+      "Id:15719\n",
+      "Score: 0.7115042\n",
+      "Title: Pineapple\n",
+      "Text: The pineapple is a fruit. It is native to South America, Central America and the Caribbean. The word\n",
+      "Id:13967\n",
+      "Score: 0.7106797\n",
+      "Title: Pizza\n",
+      "Text: Pizza is an Italian food that was created in Italy (The Naples area). It is made with different topp\n",
+      "Id:13968\n",
+      "Score: 0.69487476\n",
+      "Title: Pepperoni\n",
+      "Text: Pepperoni is a meat food that is sometimes sliced thin and put on pizza. It is a kind of salami, whi\n",
+      "Id:40989\n",
+      "Score: 0.6696015\n",
+      "Title: Coprophagia\n",
+      "Text: Coprophagia is the eating of faeces. 
Many animals eat faeces, either their own or that of other anim\n", + "Id:90918\n", + "Score: 0.66611433\n", + "Title: Pizza Hut\n", + "Text: Pizza Hut is an American pizza restaurant, or pizza parlor. Pizza Hut also serves salads, pastas and\n", + "Id:433\n", + "Score: 0.66609937\n", + "Title: Lanai\n", + "Text: Lanai (or Lānaʻi) is sixth largest of the Hawaiian Islands, in the United States. It is also known a\n", + "Id:45877\n", + "Score: 0.66580874\n", + "Title: Papaya\n", + "Text: Papaya is a tall herbaceous plant in the genus Carica; its edible fruit is also called papaya. It is\n", + "Id:41467\n", + "Score: 0.6646078\n", + "Title: Te Puke\n", + "Text: Te Puke is a small town in the Bay of Plenty in New Zealand. 6670 people live there. It is famous fo\n", + "Id:31270\n", + "Score: 0.65891963\n", + "Title: Afelia\n", + "Text: Afelia is a Greek food. It is popular in the island nation of Cyprus. Afelia is made from pork, red \n", + "Id:61037\n", + "Score: 0.6569093\n", + "Title: Dough\n", + "Text: Dough is a thick, malleable and sometimes elastic paste made out of flour by mixing it with a small \n", + "Id:76670\n", + "Score: 0.6560743\n", + "Title: Lycopene\n", + "Text: Lycopene is the pigment of tomato. Its chemical formula is (6E,8E,10E,12E,14E,16E,18E,20E,22E,24E,26\n", + "Id:32248\n", + "Score: 0.653606\n", + "Title: Pie\n", + "Text: A pie is a baked food that is made from pastry crust with or without a pastry top. The common filli\n", + "Id:79026\n", + "Score: 0.65358526\n", + "Title: Pectin\n", + "Text: Pectin is a food supplement. It is a source of dietary fiber. It is used to make jellies and jams. U\n", + "Id:63962\n", + "Score: 0.6528203\n", + "Title: Sprite\n", + "Text: Sprite is a lemon-lime soda, similar to 7 UP and Sierra Mist. It is made by the Coca-Cola Company. 
I\n"
+     ]
+    }
+   ],
+   "source": [
+    "response = client.search(\n",
+    "    index = index_name,\n",
+    "    body = {\n",
+    "        \"size\": 15,\n",
+    "        \"query\": {\n",
+    "            \"knn\": {\n",
+    "                \"content_vector\": {\n",
+    "                    \"vector\": question_embedding.data[0].embedding,\n",
+    "                    \"k\": 3\n",
+    "                }\n",
+    "            }\n",
+    "        }\n",
+    "    }\n",
+    ")\n",
+    "\n",
+    "for result in response[\"hits\"][\"hits\"]:\n",
+    "    print(\"Id:\" + str(result['_id']))\n",
+    "    print(\"Score: \" + str(result[\"_score\"]))\n",
+    "    print(\"Title: \" + str(result[\"_source\"][\"title\"]))\n",
+    "    print(\"Text: \" + result[\"_source\"][\"text\"][0:100])\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Use OpenAI Chat Completions API to generate a reply\n",
+    "\n",
+    "The step above retrieves the content semantically similar to the question. Now let's use OpenAI chat `completions` to generate a reply based on the retrieved information."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "------------------------------------------------------------\n",
+      "Pineapple is a contentious ingredient for pizza, and it is commonly used at Pizza Pizza Limited (PPL), a Canadian fast-food restaurant with locations throughout Ontario. The menu includes a variety of toppings such as pepperoni, pineapples, mushrooms, and other non-exotic produce. 
Pizza Pizza has been a staple in the area for over 30 years, with over 500 locations in Ontario and expanding across the nation, including recent openings in Montreal and British Columbia.\n",
+      "------------------------------------------------------------\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Retrieve the text of the first result in the above dataset\n",
+    "top_hit_summary = response['hits']['hits'][0]['_source']['text']\n",
+    "\n",
+    "# Craft a reply\n",
+    "response = openaiclient.chat.completions.create(\n",
+    "    model=\"gpt-3.5-turbo\",\n",
+    "    messages=[\n",
+    "        {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
+    "        {\"role\": \"user\", \"content\": \"Answer the following question: \"\n",
+    "            + question\n",
+    "            + \" by using the following text: \"\n",
+    "            + top_hit_summary\n",
+    "        }\n",
+    "    ]\n",
+    ")\n",
+    "\n",
+    "choices = response.choices\n",
+    "\n",
+    "for choice in choices:\n",
+    "    print(\"------------------------------------------------------------\")\n",
+    "    print(choice.message.content)\n",
+    "    print(\"------------------------------------------------------------\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Conclusion\n",
+    "\n",
+    "OpenSearch is a powerful tool providing both text and vector search capabilities. Used alongside the OpenAI API, it allows you to craft personalized AI applications that can augment the model's context with semantically relevant content.\n",
+    "\n",
+    "You can try Aiven for OpenSearch, or any of the other open-source tools on the Aiven platform, in a free trial by [signing up](https://go.aiven.io/openai-opensearch-signup)."
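As a recap, the retrieval step used throughout this notebook boils down to assembling one k-NN request body. A small sketch of that pattern (`build_knn_query` is an illustrative helper, not an `opensearch-py` API):

```python
def build_knn_query(embedding, field="content_vector", k=3, size=15):
    """Assemble the OpenSearch k-NN search body used in the semantic search step."""
    return {
        "size": size,
        "query": {"knn": {field: {"vector": embedding, "k": k}}},
    }

# The result can be passed to client.search(index=index_name, body=...)
body = build_knn_query([0.1, 0.2, 0.3])
```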
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/registry.yaml b/registry.yaml index 017b3979..b93f685b 100644 --- a/registry.yaml +++ b/registry.yaml @@ -1294,4 +1294,12 @@ - colin-openai tags: - completions - - functions \ No newline at end of file + - functions + +- title: OpenSearch as a vector database + path: examples/vector_databases/opensearch/README.md + date: 2024-05-10 + authors: + - ftisiot + tags: + - embeddings \ No newline at end of file