{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using PolarDB-PG as a vector database for OpenAI embeddings\n",
"\n",
"This notebook guides you step by step on using PolarDB-PG as a vector database for OpenAI embeddings.\n",
"\n",
"This notebook presents an end-to-end process of:\n",
"1. Using precomputed embeddings created by OpenAI API.\n",
"2. Storing the embeddings in a cloud instance of PolarDB-PG.\n",
"3. Converting raw text query to an embedding with OpenAI API.\n",
"4. Using PolarDB-PG to perform the nearest neighbour search in the created collection.\n",
"\n",
"### What is PolarDB-PG\n",
"\n",
"[PolarDB-PG](https://www.alibabacloud.com/help/en/polardb/latest/what-is-polardb-2) is a high-performance vector database that adopts a read-write separation architecture. It is a cloud-native database managed by Alibaba Cloud, 100% compatible with PostgreSQL, and highly compatible with Oracle syntax. It supports processing massive vector data storage and queries, and greatly improves the efficiency of vector calculations through optimization of underlying execution algorithms, providing users with fast, elastic, high-performance, massive storage, and secure and reliable vector database services. Additionally, PolarDB-PG also supports multi-dimensional and multi-modal spatiotemporal information engines and geographic information engines.At the same time, PolarDB-PG is equipped with complete OLAP functionality and service level agreements, which has been recognized and used by many users;\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"### Deployment options\n",
"\n",
"- Using [PolarDB-PG Cloud Vector Database](https://www.alibabacloud.com/product/polardb-for-postgresql). [Click here](https://www.alibabacloud.com/product/polardb-for-postgresql?spm=a3c0i.147400.6791778070.243.9f204881g5cjpP) to fast deploy it."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"For the purposes of this exercise we need to prepare a couple of things:\n",
"\n",
"1. PolarDB-PG cloud server instance.\n",
"2. The 'psycopg2' library to interact with the vector database. Any other postgresql client library is ok.\n",
"3. An [OpenAI API key](https://beta.openai.com/account/api-keys)."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We might validate if the server was launched successfully by running a simple curl command:"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Install requirements\n",
"\n",
"This notebook obviously requires the `openai` and `psycopg2` packages, but there are also some other additional libraries we will use. The following command installs them all:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install openai psycopg2 pandas wget"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Prepare your OpenAI API key\n",
"The OpenAI API key is used for vectorization of the documents and queries.\n",
"\n",
"If you don't have an OpenAI API key, you can get one from https://beta.openai.com/account/api-keys.\n",
"\n",
"Once you get your key, please add it to your environment variables as OPENAI_API_KEY.\n",
"\n",
"If you have any doubts about setting the API key through environment variables, please refer to [Best Practices for API Key Safety](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety)."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"OPENAI_API_KEY is ready\n"
]
}
],
"source": [
"# Test that your OpenAI API key is correctly set as an environment variable\n",
"# Note. if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.\n",
"\n",
"if os.getenv(\"OPENAI_API_KEY\") is not None:\n",
" print(\"OPENAI_API_KEY is ready\")\n",
"else:\n",
" print(\"OPENAI_API_KEY environment variable not found\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Connect to PolarDB\n",
"First add it to your environment variables. or you can just change the \"psycopg2.connect\" parameters below\n",
"\n",
"Connecting to a running instance of PolarDB server is easy with the official Python library:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import psycopg2\n",
"\n",
"# Note. alternatively you can set a temporary env variable like this:\n",
"# os.environ[\"PGHOST\"] = \"your_host\"\n",
"# os.environ[\"PGPORT\"] \"5432\"),\n",
"# os.environ[\"PGDATABASE\"] \"postgres\"),\n",
"# os.environ[\"PGUSER\"] \"user\"),\n",
"# os.environ[\"PGPASSWORD\"] \"password\"),\n",
"\n",
"connection = psycopg2.connect(\n",
" host=os.environ.get(\"PGHOST\", \"localhost\"),\n",
" port=os.environ.get(\"PGPORT\", \"5432\"),\n",
" database=os.environ.get(\"PGDATABASE\", \"postgres\"),\n",
" user=os.environ.get(\"PGUSER\", \"user\"),\n",
" password=os.environ.get(\"PGPASSWORD\", \"password\")\n",
")\n",
"\n",
"# Create a new cursor object\n",
"cursor = connection.cursor()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We can test the connection by running any available method:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Connection successful!\n"
]
}
],
"source": [
"# Execute a simple query to test the connection\n",
"cursor.execute(\"SELECT 1;\")\n",
"result = cursor.fetchone()\n",
"\n",
"# Check the query result\n",
"if result == (1,):\n",
" print(\"Connection successful!\")\n",
"else:\n",
" print(\"Connection failed.\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'vector_database_wikipedia_articles_embedded.zip'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import wget\n",
"\n",
"embeddings_url = \"https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip\"\n",
"\n",
"# The file is ~700 MB so this will take some time\n",
"wget.download(embeddings_url)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The downloaded file has to be then extracted:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The file vector_database_wikipedia_articles_embedded.csv exists in the data directory.\n"
]
}
],
"source": [
"import zipfile\n",
"import os\n",
"import re\n",
"import tempfile\n",
"\n",
"current_directory = os.getcwd()\n",
"zip_file_path = os.path.join(current_directory, \"vector_database_wikipedia_articles_embedded.zip\")\n",
"output_directory = os.path.join(current_directory, \"../../data\")\n",
"\n",
"with zipfile.ZipFile(zip_file_path, \"r\") as zip_ref:\n",
" zip_ref.extractall(output_directory)\n",
"\n",
"\n",
"# check the csv file exist\n",
"file_name = \"vector_database_wikipedia_articles_embedded.csv\"\n",
"data_directory = os.path.join(current_directory, \"../../data\")\n",
"file_path = os.path.join(data_directory, file_name)\n",
"\n",
"\n",
"if os.path.exists(file_path):\n",
" print(f\"The file {file_name} exists in the data directory.\")\n",
"else:\n",
" print(f\"The file {file_name} does not exist in the data directory.\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Index data\n",
"\n",
"PolarDB stores data in __relation__ where each object is described by at least one vector. Our relation will be called **articles** and each object will be described by both **title** and **content** vectors. \n",
"\n",
"We will start with creating a relation and create a vector index on both **title** and **content**, and then we will fill it with our precomputed embeddings."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"create_table_sql = '''\n",
"CREATE TABLE IF NOT EXISTS public.articles (\n",
" id INTEGER NOT NULL,\n",
" url TEXT,\n",
" title TEXT,\n",
" content TEXT,\n",
" title_vector vector(1536),\n",
" content_vector vector(1536),\n",
" vector_id INTEGER\n",
");\n",
"\n",
"ALTER TABLE public.articles ADD PRIMARY KEY (id);\n",
"'''\n",
"\n",
"# SQL statement for creating indexes\n",
"create_indexes_sql = '''\n",
"CREATE INDEX ON public.articles USING ivfflat (content_vector) WITH (lists = 1000);\n",
"\n",
"CREATE INDEX ON public.articles USING ivfflat (title_vector) WITH (lists = 1000);\n",
"'''\n",
"\n",
"# Execute the SQL statements\n",
"cursor.execute(create_table_sql)\n",
"cursor.execute(create_indexes_sql)\n",
"\n",
"# Commit the changes\n",
"connection.commit()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load data\n",
"\n",
"In this section we are going to load the data prepared previous to this session, so you don't have to recompute the embeddings of Wikipedia articles with your own credits."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import io\n",
"\n",
"# Path to your local CSV file\n",
"csv_file_path = '../../data/vector_database_wikipedia_articles_embedded.csv'\n",
"\n",
"# Define a generator function to process the file line by line\n",
"def process_file(file_path):\n",
" with open(file_path, 'r') as file:\n",
" for line in file:\n",
" yield line\n",
"\n",
"# Create a StringIO object to store the modified lines\n",
"modified_lines = io.StringIO(''.join(list(process_file(csv_file_path))))\n",
"\n",
"# Create the COPY command for the copy_expert method\n",
"copy_command = '''\n",
"COPY public.articles (id, url, title, content, title_vector, content_vector, vector_id)\n",
"FROM STDIN WITH (FORMAT CSV, HEADER true, DELIMITER ',');\n",
"'''\n",
"\n",
"# Execute the COPY command using the copy_expert method\n",
"cursor.copy_expert(copy_command, modified_lines)\n",
"\n",
"# Commit the changes\n",
"connection.commit()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Count:25000\n"
]
}
],
"source": [
"# Check the collection size to make sure all the points have been stored\n",
"count_sql = \"\"\"select count(*) from public.articles;\"\"\"\n",
"cursor.execute(count_sql)\n",
"result = cursor.fetchone()\n",
"print(f\"Count:{result[0]}\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Search data\n",
"\n",
"Once the data is put into Qdrant we will start querying the collection for the closest vectors. We may provide an additional parameter `vector_name` to switch from title to content based search. Since the precomputed embeddings were created with `text-embedding-3-small` OpenAI model we also have to use it during search."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"def query_polardb(query, collection_name, vector_name=\"title_vector\", top_k=20):\n",
"\n",
" # Creates embedding vector from user query\n",
" embedded_query = openai.Embedding.create(\n",
" input=query,\n",
" model=\"text-embedding-3-small\",\n",
" )[\"data\"][0][\"embedding\"]\n",
"\n",
" # Convert the embedded_query to PostgreSQL compatible format\n",
" embedded_query_pg = \"[\" + \",\".join(map(str, embedded_query)) + \"]\"\n",
"\n",
" # Create SQL query\n",
" query_sql = f\"\"\"\n",
" SELECT id, url, title, l2_distance({vector_name},'{embedded_query_pg}'::VECTOR(1536)) AS similarity\n",
" FROM {collection_name}\n",
" ORDER BY {vector_name} <-> '{embedded_query_pg}'::VECTOR(1536)\n",
" LIMIT {top_k};\n",
" \"\"\"\n",
" # Execute the query\n",
" cursor.execute(query_sql)\n",
" results = cursor.fetchall()\n",
"\n",
" return results"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1. Museum of Modern Art (Score: 0.5)\n",
"2. Western Europe (Score: 0.485)\n",
"3. Renaissance art (Score: 0.479)\n",
"4. Pop art (Score: 0.472)\n",
"5. Northern Europe (Score: 0.461)\n",
"6. Hellenistic art (Score: 0.457)\n",
"7. Modernist literature (Score: 0.447)\n",
"8. Art film (Score: 0.44)\n",
"9. Central Europe (Score: 0.439)\n",
"10. European (Score: 0.437)\n",
"11. Art (Score: 0.437)\n",
"12. Byzantine art (Score: 0.436)\n",
"13. Postmodernism (Score: 0.434)\n",
"14. Eastern Europe (Score: 0.433)\n",
"15. Europe (Score: 0.433)\n",
"16. Cubism (Score: 0.432)\n",
"17. Impressionism (Score: 0.432)\n",
"18. Bauhaus (Score: 0.431)\n",
"19. Surrealism (Score: 0.429)\n",
"20. Expressionism (Score: 0.429)\n"
]
}
],
"source": [
"import openai\n",
"\n",
"query_results = query_polardb(\"modern art in Europe\", \"Articles\")\n",
"for i, result in enumerate(query_results):\n",
" print(f\"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})\")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1. Battle of Bannockburn (Score: 0.489)\n",
"2. Wars of Scottish Independence (Score: 0.474)\n",
"3. 1651 (Score: 0.457)\n",
"4. First War of Scottish Independence (Score: 0.452)\n",
"5. Robert I of Scotland (Score: 0.445)\n",
"6. 841 (Score: 0.441)\n",
"7. 1716 (Score: 0.441)\n",
"8. 1314 (Score: 0.429)\n",
"9. 1263 (Score: 0.428)\n",
"10. William Wallace (Score: 0.426)\n",
"11. Stirling (Score: 0.419)\n",
"12. 1306 (Score: 0.419)\n",
"13. 1746 (Score: 0.418)\n",
"14. 1040s (Score: 0.414)\n",
"15. 1106 (Score: 0.412)\n",
"16. 1304 (Score: 0.411)\n",
"17. David II of Scotland (Score: 0.408)\n",
"18. Braveheart (Score: 0.407)\n",
"19. 1124 (Score: 0.406)\n",
"20. July 27 (Score: 0.405)\n"
]
}
],
"source": [
"# This time we'll query using content vector\n",
"query_results = query_polardb(\"Famous battles in Scottish history\", \"Articles\", \"content_vector\")\n",
"for i, result in enumerate(query_results):\n",
" print(f\"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.16"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}