From 476dd97517c2aefb0a9c07f6d7c5a38af7145cf2 Mon Sep 17 00:00:00 2001
From: "xuqi.wxq"
Date: Fri, 7 Apr 2023 12:49:39 +0800
Subject: [PATCH] Add getting started with AnalyticDB distributed vector
 database and OpenAI example.

---
 ...g_started_with_AnalyticDB_and_OpenAI.ipynb | 589 ++++++++++++++++++
 1 file changed, 589 insertions(+)
 create mode 100644 examples/vector_databases/analyticdb/Getting_started_with_AnalyticDB_and_OpenAI.ipynb

diff --git a/examples/vector_databases/analyticdb/Getting_started_with_AnalyticDB_and_OpenAI.ipynb b/examples/vector_databases/analyticdb/Getting_started_with_AnalyticDB_and_OpenAI.ipynb
new file mode 100644
index 00000000..e6081efb
--- /dev/null
+++ b/examples/vector_databases/analyticdb/Getting_started_with_AnalyticDB_and_OpenAI.ipynb
@@ -0,0 +1,589 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Using AnalyticDB as a vector database for OpenAI embeddings\n",
+    "\n",
+    "This notebook guides you step by step through using AnalyticDB as a vector database for OpenAI embeddings.\n",
+    "\n",
+    "This notebook presents an end-to-end process of:\n",
+    "1. Using precomputed embeddings created by the OpenAI API.\n",
+    "2. Storing the embeddings in a cloud instance of AnalyticDB.\n",
+    "3. Converting a raw text query to an embedding with the OpenAI API.\n",
+    "4. Using AnalyticDB to perform the nearest neighbour search in the created collection.\n",
+    "\n",
+    "### What is AnalyticDB\n",
+    "\n",
+    "[AnalyticDB](https://www.alibabacloud.com/help/en/analyticdb-for-postgresql/latest/product-introduction-overview) is a high-performance distributed vector database. It is fully compatible with PostgreSQL syntax, so it is effortless to pick up. AnalyticDB is a cloud-native database managed by Alibaba Cloud with a high-performance vector compute engine. It works out of the box and scales to billions of vectors, offering rich features including multiple indexing algorithms, support for structured and unstructured data, real-time updates, multiple distance metrics, scalar filtering, and time-travel search. It also provides full OLAP database functionality and an SLA commitment for production use.\n",
+    "\n",
+    "### Deployment options\n",
+    "\n",
+    "- Using [AnalyticDB Cloud Vector Database](https://www.alibabacloud.com/help/zh/analyticdb-for-postgresql/latest/overview-2). [Click here](https://www.alibabacloud.com/product/hybriddb-postgresql) to deploy it quickly.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "For the purposes of this exercise we need to prepare a couple of things:\n",
+    "\n",
+    "1. An AnalyticDB cloud server instance.\n",
+    "2. The `psycopg2` library to interact with the vector database. Any other PostgreSQL client library will also work.\n",
+    "3. An [OpenAI API key](https://beta.openai.com/account/api-keys).\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Install requirements\n",
+    "\n",
+    "This notebook requires the `openai` and `psycopg2` packages, along with a few additional libraries we will use. The following command installs them all:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:05:05.718972Z",
+     "start_time": "2023-02-16T12:04:30.434820Z"
+    },
+    "pycharm": {
+     "is_executing": true
+    }
+   },
+   "outputs": [],
+   "source": [
+    "! pip install openai psycopg2 pandas wget"
+   ]
+  },
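+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If the `psycopg2` install fails because it tries to build from source (it needs the PostgreSQL client headers), installing the prebuilt `psycopg2-binary` wheel instead is a common workaround. This fallback is optional and not part of the original setup:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional fallback: prebuilt wheel that avoids compiling psycopg2 from source\n",
+    "# ! pip install psycopg2-binary"
+   ]
+  },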
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Prepare your OpenAI API key\n",
+    "\n",
+    "The OpenAI API key is used for vectorization of the documents and queries.\n",
+    "\n",
+    "If you don't have an OpenAI API key, you can get one from [https://beta.openai.com/account/api-keys](https://beta.openai.com/account/api-keys).\n",
+    "\n",
+    "Once you get your key, please add it to your environment variables as `OPENAI_API_KEY`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:05:05.730338Z",
+     "start_time": "2023-02-16T12:05:05.723351Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "OPENAI_API_KEY is ready\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Test that your OpenAI API key is correctly set as an environment variable\n",
+    "# Note. if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.\n",
+    "import os\n",
+    "\n",
+    "# Note. alternatively you can set a temporary env variable like this:\n",
+    "# os.environ[\"OPENAI_API_KEY\"] = \"sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\"\n",
+    "\n",
+    "if os.getenv(\"OPENAI_API_KEY\") is not None:\n",
+    "    print(\"OPENAI_API_KEY is ready\")\n",
+    "else:\n",
+    "    print(\"OPENAI_API_KEY environment variable not found\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Connect to AnalyticDB\n",
+    "First add the connection parameters to your environment variables, or change the `psycopg2.connect` parameters below.\n",
+    "\n",
+    "Connecting to a running instance of AnalyticDB server is easy with the official Python library:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:05:06.827143Z",
+     "start_time": "2023-02-16T12:05:05.733771Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import psycopg2\n",
+    "\n",
+    "# Note. alternatively you can set temporary env variables like this:\n",
+    "# os.environ[\"PGHOST\"] = \"your_host\"\n",
+    "# os.environ[\"PGPORT\"] = \"5432\"\n",
+    "# os.environ[\"PGDATABASE\"] = \"postgres\"\n",
+    "# os.environ[\"PGUSER\"] = \"user\"\n",
+    "# os.environ[\"PGPASSWORD\"] = \"password\"\n",
+    "\n",
+    "connection = psycopg2.connect(\n",
+    "    host=os.environ.get(\"PGHOST\", \"localhost\"),\n",
+    "    port=os.environ.get(\"PGPORT\", \"5432\"),\n",
+    "    database=os.environ.get(\"PGDATABASE\", \"postgres\"),\n",
+    "    user=os.environ.get(\"PGUSER\", \"user\"),\n",
+    "    password=os.environ.get(\"PGPASSWORD\", \"password\")\n",
+    ")\n",
+    "\n",
+    "# Create a new cursor object\n",
+    "cursor = connection.cursor()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can test the connection by running a simple query:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:05:06.848488Z",
+     "start_time": "2023-02-16T12:05:06.832612Z"
+    },
+    "pycharm": {
+     "is_executing": true
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Connection successful!\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Execute a simple query to test the connection\n",
+    "cursor.execute(\"SELECT 1;\")\n",
+    "result = cursor.fetchone()\n",
+    "\n",
+    "# Check the query result\n",
+    "if result == (1,):\n",
+    "    print(\"Connection successful!\")\n",
+    "else:\n",
+    "    print(\"Connection failed.\")"
+   ]
+  },
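+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As an extra, optional sanity check (our addition, not part of the original flow), we can ask the server for its version string. Since AnalyticDB is PostgreSQL-compatible, this standard PostgreSQL query should work:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional: print the server version string\n",
+    "cursor.execute(\"SELECT version();\")\n",
+    "print(cursor.fetchone()[0])"
+   ]
+  },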
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next we download a dataset of precomputed embeddings of Wikipedia articles that we will load into AnalyticDB:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:05:37.371951Z",
+     "start_time": "2023-02-16T12:05:06.851634Z"
+    },
+    "pycharm": {
+     "is_executing": true
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "100% [......................................................................] 698933052 / 698933052"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'vector_database_wikipedia_articles_embedded.zip'"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import wget\n",
+    "\n",
+    "embeddings_url = \"https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip\"\n",
+    "\n",
+    "# The file is ~700 MB, so the download will take some time\n",
+    "wget.download(embeddings_url)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The downloaded file then has to be extracted:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:06:01.538851Z",
+     "start_time": "2023-02-16T12:05:37.376042Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The file vector_database_wikipedia_articles_embedded.csv exists in the data directory.\n"
+     ]
+    }
+   ],
+   "source": [
+    "import zipfile\n",
+    "import os\n",
+    "\n",
+    "current_directory = os.getcwd()\n",
+    "zip_file_path = os.path.join(current_directory, \"vector_database_wikipedia_articles_embedded.zip\")\n",
+    "output_directory = os.path.join(current_directory, \"../../data\")\n",
+    "\n",
+    "with zipfile.ZipFile(zip_file_path, \"r\") as zip_ref:\n",
+    "    zip_ref.extractall(output_directory)\n",
+    "\n",
+    "\n",
+    "# Check that the extracted CSV file exists\n",
+    "file_name = \"vector_database_wikipedia_articles_embedded.csv\"\n",
+    "data_directory = os.path.join(current_directory, \"../../data\")\n",
+    "file_path = os.path.join(data_directory, file_name)\n",
+    "\n",
+    "\n",
+    "if os.path.exists(file_path):\n",
+    "    print(f\"The file {file_name} exists in the data directory.\")\n",
+    "else:\n",
+    "    print(f\"The file {file_name} does not exist in the data directory.\")\n"
+   ]
+  },
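+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Optionally, we can peek at the first row with `pandas` (installed earlier) to confirm that the CSV has the expected columns. This quick sanity check is our addition and can be skipped:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "# Optional: read just the first row and list the column names\n",
+    "df = pd.read_csv(file_path, nrows=1)\n",
+    "print(df.columns.tolist())"
+   ]
+  },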
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Index data\n",
+    "\n",
+    "AnalyticDB stores data in __relations__, where each object is described by at least one vector. Our relation will be called **articles**, and each object will be described by both a **title** and a **content** vector.\n",
+    "\n",
+    "We will start by creating the relation and a vector index on both **title** and **content**, and then we will fill it with our precomputed embeddings."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:17:36.366066Z",
+     "start_time": "2023-02-16T12:17:35.486872Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# SQL statement for creating the table\n",
+    "create_table_sql = '''\n",
+    "CREATE TABLE IF NOT EXISTS public.articles (\n",
+    "    id INTEGER NOT NULL,\n",
+    "    url TEXT,\n",
+    "    title TEXT,\n",
+    "    content TEXT,\n",
+    "    title_vector REAL[],\n",
+    "    content_vector REAL[],\n",
+    "    vector_id INTEGER\n",
+    ");\n",
+    "\n",
+    "ALTER TABLE public.articles ADD PRIMARY KEY (id);\n",
+    "'''\n",
+    "\n",
+    "# SQL statements for creating the ANN vector indexes\n",
+    "create_indexes_sql = '''\n",
+    "CREATE INDEX ON public.articles USING ann (content_vector) WITH (distancemeasure = l2, dim = '1536', pq_segments = '64', hnsw_m = '100', pq_centers = '2048');\n",
+    "\n",
+    "CREATE INDEX ON public.articles USING ann (title_vector) WITH (distancemeasure = l2, dim = '1536', pq_segments = '64', hnsw_m = '100', pq_centers = '2048');\n",
+    "'''\n",
+    "\n",
+    "# Execute the SQL statements\n",
+    "cursor.execute(create_table_sql)\n",
+    "cursor.execute(create_indexes_sql)\n",
+    "\n",
+    "# Commit the changes\n",
+    "connection.commit()"
+   ]
+  },
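+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To double-check that both ANN indexes were created, we can query the standard `pg_indexes` catalog view. This optional check is our addition and assumes the standard PostgreSQL catalog views are available (the index names are auto-generated):\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional: list the indexes that now exist on the articles table\n",
+    "cursor.execute(\"SELECT indexname FROM pg_indexes WHERE tablename = 'articles';\")\n",
+    "for row in cursor.fetchall():\n",
+    "    print(row[0])"
+   ]
+  },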
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Load data\n",
+    "\n",
+    "In this section we are going to load the data prepared earlier, so you don't have to recompute the embeddings of the Wikipedia articles with your own credits."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:30:37.518210Z",
+     "start_time": "2023-02-16T12:17:36.368564Z"
+    },
+    "pycharm": {
+     "is_executing": true
+    },
+    "scrolled": false
+   },
+   "outputs": [],
+   "source": [
+    "import io\n",
+    "\n",
+    "# Path to your local CSV file\n",
+    "csv_file_path = '../../data/vector_database_wikipedia_articles_embedded.csv'\n",
+    "\n",
+    "# Define a generator function to process the file line by line\n",
+    "def process_file(file_path):\n",
+    "    with open(file_path, 'r') as file:\n",
+    "        for line in file:\n",
+    "            # Replace '[' with '{' and ']' with '}' to convert the JSON-style\n",
+    "            # arrays into PostgreSQL array literals\n",
+    "            modified_line = line.replace('[', '{').replace(']', '}')\n",
+    "            yield modified_line\n",
+    "\n",
+    "# Create a StringIO object to store the modified lines\n",
+    "modified_lines = io.StringIO(''.join(list(process_file(csv_file_path))))\n",
+    "\n",
+    "# Create the COPY command for the copy_expert method\n",
+    "copy_command = '''\n",
+    "COPY public.articles (id, url, title, content, title_vector, content_vector, vector_id)\n",
+    "FROM STDIN WITH (FORMAT CSV, HEADER true, DELIMITER ',');\n",
+    "'''\n",
+    "\n",
+    "# Execute the COPY command using the copy_expert method\n",
+    "cursor.copy_expert(copy_command, modified_lines)\n",
+    "\n",
+    "# Commit the changes\n",
+    "connection.commit()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:30:40.675202Z",
+     "start_time": "2023-02-16T12:30:40.655654Z"
+    },
+    "pycharm": {
+     "is_executing": true
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Count:25000\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Check the table size to make sure all the rows have been stored\n",
+    "count_sql = \"\"\"select count(*) from public.articles;\"\"\"\n",
+    "cursor.execute(count_sql)\n",
+    "result = cursor.fetchone()\n",
+    "print(f\"Count:{result[0]}\")\n"
+   ]
+  },
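+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Beyond the row count, we can optionally fetch a single row to inspect what was stored. This quick check is our addition, not part of the original flow:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional: fetch one row to inspect the stored data (vector columns omitted for brevity)\n",
+    "cursor.execute(\"SELECT id, url, title, vector_id FROM public.articles LIMIT 1;\")\n",
+    "print(cursor.fetchone())"
+   ]
+  },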
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Search data\n",
+    "\n",
+    "Once the data is stored in AnalyticDB we will start querying the table for the closest vectors. We may provide an additional parameter `vector_name` to switch from title-based to content-based search. Since the precomputed embeddings were created with the `text-embedding-ada-002` OpenAI model, we also have to use it during search.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:30:38.024370Z",
+     "start_time": "2023-02-16T12:30:37.712816Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import openai\n",
+    "\n",
+    "def query_analyticdb(query, collection_name, vector_name=\"title_vector\", top_k=20):\n",
+    "\n",
+    "    # Create an embedding vector from the user query\n",
+    "    embedded_query = openai.Embedding.create(\n",
+    "        input=query,\n",
+    "        model=\"text-embedding-ada-002\",\n",
+    "    )[\"data\"][0][\"embedding\"]\n",
+    "\n",
+    "    # Convert the embedded_query to PostgreSQL compatible format\n",
+    "    embedded_query_pg = \"{\" + \",\".join(map(str, embedded_query)) + \"}\"\n",
+    "\n",
+    "    # Create the SQL query\n",
+    "    query_sql = f\"\"\"\n",
+    "    SELECT id, url, title, l2_distance({vector_name},'{embedded_query_pg}'::real[]) AS similarity\n",
+    "    FROM {collection_name}\n",
+    "    ORDER BY {vector_name} <-> '{embedded_query_pg}'::real[]\n",
+    "    LIMIT {top_k};\n",
+    "    \"\"\"\n",
+    "    # Execute the query\n",
+    "    cursor.execute(query_sql)\n",
+    "    results = cursor.fetchall()\n",
+    "\n",
+    "    return results"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:30:39.379566Z",
+     "start_time": "2023-02-16T12:30:38.031041Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1. Museum of Modern Art (Score: 0.75)\n",
+      "2. Western Europe (Score: 0.735)\n",
+      "3. Renaissance art (Score: 0.728)\n",
+      "4. Pop art (Score: 0.721)\n",
+      "5. Northern Europe (Score: 0.71)\n",
+      "6. Hellenistic art (Score: 0.706)\n",
+      "7. Modernist literature (Score: 0.694)\n",
+      "8. Art film (Score: 0.687)\n",
+      "9. Central Europe (Score: 0.685)\n",
+      "10. European (Score: 0.683)\n",
+      "11. Art (Score: 0.683)\n",
+      "12. Byzantine art (Score: 0.682)\n",
+      "13. Postmodernism (Score: 0.68)\n",
+      "14. Eastern Europe (Score: 0.679)\n",
+      "15. Europe (Score: 0.678)\n",
+      "16. Cubism (Score: 0.678)\n",
+      "17. Impressionism (Score: 0.677)\n",
+      "18. Bauhaus (Score: 0.676)\n",
+      "19. Surrealism (Score: 0.674)\n",
+      "20. Expressionism (Score: 0.674)\n"
+     ]
+    }
+   ],
+   "source": [
+    "query_results = query_analyticdb(\"modern art in Europe\", \"Articles\")\n",
+    "for i, result in enumerate(query_results):\n",
+    "    print(f\"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-02-16T12:30:40.652676Z",
+     "start_time": "2023-02-16T12:30:39.382555Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1. Battle of Bannockburn (Score: 0.739)\n",
+      "2. Wars of Scottish Independence (Score: 0.723)\n",
+      "3. 1651 (Score: 0.705)\n",
+      "4. First War of Scottish Independence (Score: 0.699)\n",
+      "5. Robert I of Scotland (Score: 0.692)\n",
+      "6. 841 (Score: 0.688)\n",
+      "7. 1716 (Score: 0.688)\n",
+      "8. 1314 (Score: 0.674)\n",
+      "9. 1263 (Score: 0.673)\n",
+      "10. William Wallace (Score: 0.671)\n",
+      "11. Stirling (Score: 0.663)\n",
+      "12. 1306 (Score: 0.662)\n",
+      "13. 1746 (Score: 0.661)\n",
+      "14. 1040s (Score: 0.656)\n",
+      "15. 1106 (Score: 0.654)\n",
+      "16. 1304 (Score: 0.653)\n",
+      "17. David II of Scotland (Score: 0.65)\n",
+      "18. Braveheart (Score: 0.649)\n",
+      "19. 1124 (Score: 0.648)\n",
+      "20. July 27 (Score: 0.646)\n"
+     ]
+    }
+   ],
+   "source": [
+    "# This time we'll query using the content vector\n",
+    "query_results = query_analyticdb(\"Famous battles in Scottish history\", \"Articles\", \"content_vector\")\n",
+    "for i, result in enumerate(query_results):\n",
+    "    print(f\"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})\")"
+   ]
+  },
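+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "When you are done experimenting, it is good practice to release the database resources. This cleanup step is our addition; `psycopg2` connections are not closed automatically while the kernel keeps running:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Close the cursor and the connection when finished\n",
+    "cursor.close()\n",
+    "connection.close()"
+   ]
+  }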
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}