From 476dd97517c2aefb0a9c07f6d7c5a38af7145cf2 Mon Sep 17 00:00:00 2001
From: "xuqi.wxq"
Date: Fri, 7 Apr 2023 12:49:39 +0800
Subject: [PATCH] Add getting started with AnalyticDB distributed vector
 database and OpenAI example.

---
 ...g_started_with_AnalyticDB_and_OpenAI.ipynb | 589 ++++++++++++++++++
 1 file changed, 589 insertions(+)
 create mode 100644 examples/vector_databases/analyticdb/Getting_started_with_AnalyticDB_and_OpenAI.ipynb

diff --git a/examples/vector_databases/analyticdb/Getting_started_with_AnalyticDB_and_OpenAI.ipynb b/examples/vector_databases/analyticdb/Getting_started_with_AnalyticDB_and_OpenAI.ipynb
new file mode 100644
index 00000000..e6081efb
--- /dev/null
+++ b/examples/vector_databases/analyticdb/Getting_started_with_AnalyticDB_and_OpenAI.ipynb
@@ -0,0 +1,589 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Using AnalyticDB as a vector database for OpenAI embeddings\n",
+    "\n",
+    "This notebook guides you step by step through using AnalyticDB as a vector database for OpenAI embeddings.\n",
+    "\n",
+    "This notebook presents an end-to-end process of:\n",
+    "1. Using precomputed embeddings created by the OpenAI API.\n",
+    "2. Storing the embeddings in a cloud instance of AnalyticDB.\n",
+    "3. Converting a raw text query to an embedding with the OpenAI API.\n",
+    "4. Using AnalyticDB to perform the nearest neighbour search in the created collection.\n",
+    "\n",
+    "### What is AnalyticDB\n",
+    "\n",
+    "[AnalyticDB](https://www.alibabacloud.com/help/en/analyticdb-for-postgresql/latest/product-introduction-overview) is a high-performance distributed vector database. It is fully compatible with PostgreSQL syntax, so it is effortless to pick up. AnalyticDB is a cloud-native database managed by Alibaba Cloud with a high-performance vector compute engine. It works out of the box and scales to billions of vectors, offering rich features including multiple indexing algorithms, support for structured and unstructured data, real-time updates, multiple distance metrics, scalar filtering, and time-travel search. It also provides full OLAP database functionality and an SLA commitment for production use.\n",
+    "\n",
+    "### Deployment options\n",
+    "\n",
+    "- Using [AnalyticDB Cloud Vector Database](https://www.alibabacloud.com/help/zh/analyticdb-for-postgresql/latest/overview-2). [Click here](https://www.alibabacloud.com/product/hybriddb-postgresql) to deploy it quickly.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "For the purposes of this exercise we need to prepare a couple of things:\n",
+    "\n",
+    "1. An AnalyticDB cloud server instance.\n",
+    "2. The `psycopg2` library to interact with the vector database. Any other PostgreSQL client library will also work.\n",
+    "3. An [OpenAI API key](https://beta.openai.com/account/api-keys).\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Install requirements\n",
+    "\n",
+    "This notebook requires the `openai` and `psycopg2` packages, along with a few additional libraries we will use. The following command installs them all:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:05:05.718972Z",
+     "start_time": "2023-02-16T12:04:30.434820Z"
+    },
+    "pycharm": {
+     "is_executing": true
+    }
+   },
+   "outputs": [],
+   "source": [
+    "! pip install openai psycopg2 pandas wget"
+   ]
+  },
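+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If the `psycopg2` install fails because it tries to build from source (it needs the PostgreSQL client headers), installing the prebuilt `psycopg2-binary` wheel instead is a common workaround. This fallback is optional and not part of the original setup:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional fallback: prebuilt wheel that avoids compiling psycopg2 from source\n",
+    "# ! pip install psycopg2-binary"
+   ]
+  },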
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Prepare your OpenAI API key\n",
+    "\n",
+    "The OpenAI API key is used for vectorization of the documents and queries.\n",
+    "\n",
+    "If you don't have an OpenAI API key, you can get one from [https://beta.openai.com/account/api-keys](https://beta.openai.com/account/api-keys).\n",
+    "\n",
+    "Once you get your key, please add it to your environment variables as `OPENAI_API_KEY`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:05:05.730338Z",
+     "start_time": "2023-02-16T12:05:05.723351Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "OPENAI_API_KEY is ready\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Test that your OpenAI API key is correctly set as an environment variable\n",
+    "# Note. if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.\n",
+    "import os\n",
+    "\n",
+    "# Note. alternatively you can set a temporary env variable like this:\n",
+    "# os.environ[\"OPENAI_API_KEY\"] = \"sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\"\n",
+    "\n",
+    "if os.getenv(\"OPENAI_API_KEY\") is not None:\n",
+    "    print(\"OPENAI_API_KEY is ready\")\n",
+    "else:\n",
+    "    print(\"OPENAI_API_KEY environment variable not found\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Connect to AnalyticDB\n",
+    "First add the connection parameters to your environment variables, or change the `psycopg2.connect` parameters below.\n",
+    "\n",
+    "Connecting to a running instance of AnalyticDB server is easy with the official Python library:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:05:06.827143Z",
+     "start_time": "2023-02-16T12:05:05.733771Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import psycopg2\n",
+    "\n",
+    "# Note. alternatively you can set temporary env variables like this:\n",
+    "# os.environ[\"PGHOST\"] = \"your_host\"\n",
+    "# os.environ[\"PGPORT\"] = \"5432\"\n",
+    "# os.environ[\"PGDATABASE\"] = \"postgres\"\n",
+    "# os.environ[\"PGUSER\"] = \"user\"\n",
+    "# os.environ[\"PGPASSWORD\"] = \"password\"\n",
+    "\n",
+    "connection = psycopg2.connect(\n",
+    "    host=os.environ.get(\"PGHOST\", \"localhost\"),\n",
+    "    port=os.environ.get(\"PGPORT\", \"5432\"),\n",
+    "    database=os.environ.get(\"PGDATABASE\", \"postgres\"),\n",
+    "    user=os.environ.get(\"PGUSER\", \"user\"),\n",
+    "    password=os.environ.get(\"PGPASSWORD\", \"password\")\n",
+    ")\n",
+    "\n",
+    "# Create a new cursor object\n",
+    "cursor = connection.cursor()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can test the connection by running a simple query:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:05:06.848488Z",
+     "start_time": "2023-02-16T12:05:06.832612Z"
+    },
+    "pycharm": {
+     "is_executing": true
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Connection successful!\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Execute a simple query to test the connection\n",
+    "cursor.execute(\"SELECT 1;\")\n",
+    "result = cursor.fetchone()\n",
+    "\n",
+    "# Check the query result\n",
+    "if result == (1,):\n",
+    "    print(\"Connection successful!\")\n",
+    "else:\n",
+    "    print(\"Connection failed.\")"
+   ]
+  },
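+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As an extra, optional sanity check (our addition, not part of the original flow), we can ask the server for its version string. Since AnalyticDB is PostgreSQL-compatible, this standard PostgreSQL query should work:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional: print the server version string\n",
+    "cursor.execute(\"SELECT version();\")\n",
+    "print(cursor.fetchone()[0])"
+   ]
+  },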
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next we download a dataset of precomputed embeddings of Wikipedia articles that we will load into AnalyticDB:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:05:37.371951Z",
+     "start_time": "2023-02-16T12:05:06.851634Z"
+    },
+    "pycharm": {
+     "is_executing": true
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "100% [......................................................................] 698933052 / 698933052"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'vector_database_wikipedia_articles_embedded.zip'"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import wget\n",
+    "\n",
+    "embeddings_url = \"https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip\"\n",
+    "\n",
+    "# The file is ~700 MB, so the download will take some time\n",
+    "wget.download(embeddings_url)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The downloaded file then has to be extracted:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:06:01.538851Z",
+     "start_time": "2023-02-16T12:05:37.376042Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The file vector_database_wikipedia_articles_embedded.csv exists in the data directory.\n"
+     ]
+    }
+   ],
+   "source": [
+    "import zipfile\n",
+    "import os\n",
+    "\n",
+    "current_directory = os.getcwd()\n",
+    "zip_file_path = os.path.join(current_directory, \"vector_database_wikipedia_articles_embedded.zip\")\n",
+    "output_directory = os.path.join(current_directory, \"../../data\")\n",
+    "\n",
+    "with zipfile.ZipFile(zip_file_path, \"r\") as zip_ref:\n",
+    "    zip_ref.extractall(output_directory)\n",
+    "\n",
+    "\n",
+    "# Check that the extracted CSV file exists\n",
+    "file_name = \"vector_database_wikipedia_articles_embedded.csv\"\n",
+    "data_directory = os.path.join(current_directory, \"../../data\")\n",
+    "file_path = os.path.join(data_directory, file_name)\n",
+    "\n",
+    "\n",
+    "if os.path.exists(file_path):\n",
+    "    print(f\"The file {file_name} exists in the data directory.\")\n",
+    "else:\n",
+    "    print(f\"The file {file_name} does not exist in the data directory.\")\n"
+   ]
+  },
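+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Optionally, we can peek at the first row with `pandas` (installed earlier) to confirm that the CSV has the expected columns. This quick sanity check is our addition and can be skipped:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "# Optional: read just the first row and list the column names\n",
+    "df = pd.read_csv(file_path, nrows=1)\n",
+    "print(df.columns.tolist())"
+   ]
+  },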
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Index data\n",
+    "\n",
+    "AnalyticDB stores data in __relations__, where each object is described by at least one vector. Our relation will be called **articles**, and each object will be described by both a **title** and a **content** vector.\n",
+    "\n",
+    "We will start by creating the relation and a vector index on both **title** and **content**, and then we will fill it with our precomputed embeddings."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:17:36.366066Z",
+     "start_time": "2023-02-16T12:17:35.486872Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# SQL statement for creating the table\n",
+    "create_table_sql = '''\n",
+    "CREATE TABLE IF NOT EXISTS public.articles (\n",
+    "    id INTEGER NOT NULL,\n",
+    "    url TEXT,\n",
+    "    title TEXT,\n",
+    "    content TEXT,\n",
+    "    title_vector REAL[],\n",
+    "    content_vector REAL[],\n",
+    "    vector_id INTEGER\n",
+    ");\n",
+    "\n",
+    "ALTER TABLE public.articles ADD PRIMARY KEY (id);\n",
+    "'''\n",
+    "\n",
+    "# SQL statements for creating the ANN vector indexes\n",
+    "create_indexes_sql = '''\n",
+    "CREATE INDEX ON public.articles USING ann (content_vector) WITH (distancemeasure = l2, dim = '1536', pq_segments = '64', hnsw_m = '100', pq_centers = '2048');\n",
+    "\n",
+    "CREATE INDEX ON public.articles USING ann (title_vector) WITH (distancemeasure = l2, dim = '1536', pq_segments = '64', hnsw_m = '100', pq_centers = '2048');\n",
+    "'''\n",
+    "\n",
+    "# Execute the SQL statements\n",
+    "cursor.execute(create_table_sql)\n",
+    "cursor.execute(create_indexes_sql)\n",
+    "\n",
+    "# Commit the changes\n",
+    "connection.commit()"
+   ]
+  },
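+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To double-check that both ANN indexes were created, we can query the standard `pg_indexes` catalog view. This optional check is our addition and assumes the standard PostgreSQL catalog views are available (the index names are auto-generated):\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional: list the indexes that now exist on the articles table\n",
+    "cursor.execute(\"SELECT indexname FROM pg_indexes WHERE tablename = 'articles';\")\n",
+    "for row in cursor.fetchall():\n",
+    "    print(row[0])"
+   ]
+  },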
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Load data\n",
+    "\n",
+    "In this section we are going to load the data prepared earlier, so you don't have to recompute the embeddings of the Wikipedia articles with your own credits."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:30:37.518210Z",
+     "start_time": "2023-02-16T12:17:36.368564Z"
+    },
+    "pycharm": {
+     "is_executing": true
+    },
+    "scrolled": false
+   },
+   "outputs": [],
+   "source": [
+    "import io\n",
+    "\n",
+    "# Path to your local CSV file\n",
+    "csv_file_path = '../../data/vector_database_wikipedia_articles_embedded.csv'\n",
+    "\n",
+    "# Define a generator function to process the file line by line\n",
+    "def process_file(file_path):\n",
+    "    with open(file_path, 'r') as file:\n",
+    "        for line in file:\n",
+    "            # Replace '[' with '{' and ']' with '}' to convert the JSON-style\n",
+    "            # arrays into PostgreSQL array literals\n",
+    "            modified_line = line.replace('[', '{').replace(']', '}')\n",
+    "            yield modified_line\n",
+    "\n",
+    "# Create a StringIO object to store the modified lines\n",
+    "modified_lines = io.StringIO(''.join(list(process_file(csv_file_path))))\n",
+    "\n",
+    "# Create the COPY command for the copy_expert method\n",
+    "copy_command = '''\n",
+    "COPY public.articles (id, url, title, content, title_vector, content_vector, vector_id)\n",
+    "FROM STDIN WITH (FORMAT CSV, HEADER true, DELIMITER ',');\n",
+    "'''\n",
+    "\n",
+    "# Execute the COPY command using the copy_expert method\n",
+    "cursor.copy_expert(copy_command, modified_lines)\n",
+    "\n",
+    "# Commit the changes\n",
+    "connection.commit()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:30:40.675202Z",
+     "start_time": "2023-02-16T12:30:40.655654Z"
+    },
+    "pycharm": {
+     "is_executing": true
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Count:25000\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Check the table size to make sure all the rows have been stored\n",
+    "count_sql = \"\"\"select count(*) from public.articles;\"\"\"\n",
+    "cursor.execute(count_sql)\n",
+    "result = cursor.fetchone()\n",
+    "print(f\"Count:{result[0]}\")\n"
+   ]
+  },
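+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Beyond the row count, we can optionally fetch a single row to inspect what was stored. This quick check is our addition, not part of the original flow:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional: fetch one row to inspect the stored data (vector columns omitted for brevity)\n",
+    "cursor.execute(\"SELECT id, url, title, vector_id FROM public.articles LIMIT 1;\")\n",
+    "print(cursor.fetchone())"
+   ]
+  },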
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Search data\n",
+    "\n",
+    "Once the data is stored in AnalyticDB we will start querying the table for the closest vectors. We may provide an additional parameter `vector_name` to switch from title-based to content-based search. Since the precomputed embeddings were created with the `text-embedding-ada-002` OpenAI model, we also have to use it during search.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:30:38.024370Z",
+     "start_time": "2023-02-16T12:30:37.712816Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import openai\n",
+    "\n",
+    "def query_analyticdb(query, collection_name, vector_name=\"title_vector\", top_k=20):\n",
+    "\n",
+    "    # Create an embedding vector from the user query\n",
+    "    embedded_query = openai.Embedding.create(\n",
+    "        input=query,\n",
+    "        model=\"text-embedding-ada-002\",\n",
+    "    )[\"data\"][0][\"embedding\"]\n",
+    "\n",
+    "    # Convert the embedded_query to PostgreSQL compatible format\n",
+    "    embedded_query_pg = \"{\" + \",\".join(map(str, embedded_query)) + \"}\"\n",
+    "\n",
+    "    # Create the SQL query\n",
+    "    query_sql = f\"\"\"\n",
+    "    SELECT id, url, title, l2_distance({vector_name},'{embedded_query_pg}'::real[]) AS similarity\n",
+    "    FROM {collection_name}\n",
+    "    ORDER BY {vector_name} <-> '{embedded_query_pg}'::real[]\n",
+    "    LIMIT {top_k};\n",
+    "    \"\"\"\n",
+    "    # Execute the query\n",
+    "    cursor.execute(query_sql)\n",
+    "    results = cursor.fetchall()\n",
+    "\n",
+    "    return results"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:30:39.379566Z",
+     "start_time": "2023-02-16T12:30:38.031041Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1. Museum of Modern Art (Score: 0.75)\n",
+      "2. Western Europe (Score: 0.735)\n",
+      "3. Renaissance art (Score: 0.728)\n",
+      "4. Pop art (Score: 0.721)\n",
+      "5. Northern Europe (Score: 0.71)\n",
+      "6. Hellenistic art (Score: 0.706)\n",
+      "7. Modernist literature (Score: 0.694)\n",
+      "8. Art film (Score: 0.687)\n",
+      "9. Central Europe (Score: 0.685)\n",
+      "10. European (Score: 0.683)\n",
+      "11. Art (Score: 0.683)\n",
+      "12. Byzantine art (Score: 0.682)\n",
+      "13. Postmodernism (Score: 0.68)\n",
+      "14. Eastern Europe (Score: 0.679)\n",
+      "15. Europe (Score: 0.678)\n",
+      "16. Cubism (Score: 0.678)\n",
+      "17. Impressionism (Score: 0.677)\n",
+      "18. Bauhaus (Score: 0.676)\n",
+      "19. Surrealism (Score: 0.674)\n",
+      "20. Expressionism (Score: 0.674)\n"
+     ]
+    }
+   ],
+   "source": [
+    "query_results = query_analyticdb(\"modern art in Europe\", \"Articles\")\n",
+    "for i, result in enumerate(query_results):\n",
+    "    print(f\"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:30:40.652676Z",
+     "start_time": "2023-02-16T12:30:39.382555Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1. Battle of Bannockburn (Score: 0.739)\n",
+      "2. Wars of Scottish Independence (Score: 0.723)\n",
+      "3. 1651 (Score: 0.705)\n",
+      "4. First War of Scottish Independence (Score: 0.699)\n",
+      "5. Robert I of Scotland (Score: 0.692)\n",
+      "6. 841 (Score: 0.688)\n",
+      "7. 1716 (Score: 0.688)\n",
+      "8. 1314 (Score: 0.674)\n",
+      "9. 1263 (Score: 0.673)\n",
+      "10. William Wallace (Score: 0.671)\n",
+      "11. Stirling (Score: 0.663)\n",
+      "12. 1306 (Score: 0.662)\n",
+      "13. 1746 (Score: 0.661)\n",
+      "14. 1040s (Score: 0.656)\n",
+      "15. 1106 (Score: 0.654)\n",
+      "16. 1304 (Score: 0.653)\n",
+      "17. David II of Scotland (Score: 0.65)\n",
+      "18. Braveheart (Score: 0.649)\n",
+      "19. 1124 (Score: 0.648)\n",
+      "20. July 27 (Score: 0.646)\n"
+     ]
+    }
+   ],
+   "source": [
+    "# This time we'll query using the content vector\n",
+    "query_results = query_analyticdb(\"Famous battles in Scottish history\", \"Articles\", \"content_vector\")\n",
+    "for i, result in enumerate(query_results):\n",
+    "    print(f\"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})\")"
+   ]
+  },
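+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "When you are done experimenting, it is good practice to release the database resources. This cleanup step is our addition; `psycopg2` connections are not closed automatically while the kernel keeps running:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Close the cursor and the connection when finished\n",
+    "cursor.close()\n",
+    "connection.close()"
+   ]
+  }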
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}