openai-cookbook/examples/vector_databases/hologres/Getting_started_with_Hologr...

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using Hologres as a vector database for OpenAI embeddings\n",
    "\n",
    "This notebook guides you step by step on using Hologres as a vector database for OpenAI embeddings.\n",
    "\n",
    "This notebook presents an end-to-end process of:\n",
    "1. Using precomputed embeddings created by OpenAI API.\n",
    "2. Storing the embeddings in a cloud instance of Hologres.\n",
    "3. Converting raw text query to an embedding with OpenAI API.\n",
    "4. Using Hologres to perform the nearest neighbour search in the created collection.\n",
    "5. Provide large language models with the search results as context in prompt engineering\n",
    "\n",
    "### What is Hologres\n",
    "\n",
    "[Hologres](https://www.alibabacloud.com/help/en/hologres/latest/what-is-hologres) is a unified real-time data warehousing service developed by Alibaba Cloud. You can use Hologres to write, update, process, and analyze large amounts of data in real time. Hologres supports standard SQL syntax, is compatible with PostgreSQL, and supports most PostgreSQL functions. Hologres supports online analytical processing (OLAP) and ad hoc analysis for up to petabytes of data, and provides high-concurrency and low-latency online data services. Hologres supports fine-grained isolation of multiple workloads and enterprise-level security capabilities. Hologres is deeply integrated with MaxCompute, Realtime Compute for Apache Flink, and DataWorks, and provides full-stack online and offline data warehousing solutions for enterprises.\n",
    "\n",
    "Hologres provides vector database functionality by adopting [Proxima](https://www.alibabacloud.com/help/en/hologres/latest/vector-processing).\n",
    "\n",
    "Proxima is a high-performance software library developed by Alibaba DAMO Academy. It allows you to search for the nearest neighbors of vectors. Proxima provides higher stability and performance than similar open source software such as Facebook AI Similarity Search (Faiss). Proxima provides basic modules that have leading performance and effects in the industry and allows you to search for similar images, videos, or human faces. Hologres is deeply integrated with Proxima to provide a high-performance vector search service.\n",
    "\n",
    "### Deployment options\n",
    "\n",
    "- [Click here](https://www.alibabacloud.com/product/hologres) to fast deploy [Hologres data warehouse](https://www.alibabacloud.com/help/en/hologres/latest/getting-started).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "For the purposes of this exercise we need to prepare a couple of things:\n",
    "\n",
    "1. Hologres cloud server instance.\n",
    "2. The 'psycopg2-binary' library to interact with the vector database. Any other postgresql client library is ok.\n",
    "3. An [OpenAI API key](https://beta.openai.com/account/api-keys).\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We might validate if the server was launched successfully by running a simple curl command:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install requirements\n",
    "\n",
    "This notebook obviously requires the `openai` and `psycopg2-binary` packages, but there are also some other additional libraries we will use. The following command installs them all:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-02-16T12:05:05.718972Z",
     "start_time": "2023-02-16T12:04:30.434820Z"
    },
    "pycharm": {
     "is_executing": true
    }
   },
   "outputs": [],
   "source": [
    "! pip install openai psycopg2-binary pandas wget"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prepare your OpenAI API key\n",
    "\n",
    "The OpenAI API key is used for vectorization of the documents and queries.\n",
    "\n",
    "If you don't have an OpenAI API key, you can get one from [https://beta.openai.com/account/api-keys](https://beta.openai.com/account/api-keys).\n",
    "\n",
    "Once you get your key, please add it to your environment variables as `OPENAI_API_KEY`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-02-16T12:05:05.730338Z",
     "start_time": "2023-02-16T12:05:05.723351Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "OPENAI_API_KEY is ready\n"
     ]
    }
   ],
   "source": [
    "# Test that your OpenAI API key is correctly set as an environment variable\n",
    "# Note. if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.\n",
    "import os\n",
    "\n",
    "# Note. alternatively you can set a temporary env variable like this:\n",
    "# os.environ[\"OPENAI_API_KEY\"] = \"sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\"\n",
    "\n",
    "if os.getenv(\"OPENAI_API_KEY\") is not None:\n",
    "    print(\"OPENAI_API_KEY is ready\")\n",
    "else:\n",
    "    print(\"OPENAI_API_KEY environment variable not found\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Connect to Hologres\n",
    "First add it to your environment variables. or you can just change the \"psycopg2.connect\" parameters below\n",
    "\n",
    "Connecting to a running instance of Hologres server is easy with the official Python library:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-02-16T12:05:06.827143Z",
     "start_time": "2023-02-16T12:05:05.733771Z"
    }
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import psycopg2\n",
    "\n",
    "# Note. alternatively you can set a temporary env variable like this:\n",
    "# os.environ[\"PGHOST\"] = \"your_host\"\n",
    "# os.environ[\"PGPORT\"] \"5432\"),\n",
    "# os.environ[\"PGDATABASE\"] \"postgres\"),\n",
    "# os.environ[\"PGUSER\"] \"user\"),\n",
    "# os.environ[\"PGPASSWORD\"] \"password\"),\n",
    "\n",
    "connection = psycopg2.connect(\n",
    "    host=os.environ.get(\"PGHOST\", \"localhost\"),\n",
    "    port=os.environ.get(\"PGPORT\", \"5432\"),\n",
    "    database=os.environ.get(\"PGDATABASE\", \"postgres\"),\n",
    "    user=os.environ.get(\"PGUSER\", \"user\"),\n",
    "    password=os.environ.get(\"PGPASSWORD\", \"password\")\n",
    ")\n",
    "connection.set_session(autocommit=True)\n",
    "\n",
    "# Create a new cursor object\n",
    "cursor = connection.cursor()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can test the connection by running any available method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-02-16T12:05:06.848488Z",
     "start_time": "2023-02-16T12:05:06.832612Z"
    },
    "pycharm": {
     "is_executing": true
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Connection successful!\n"
     ]
    }
   ],
   "source": [
    "\n",
    "# Execute a simple query to test the connection\n",
    "cursor.execute(\"SELECT 1;\")\n",
    "result = cursor.fetchone()\n",
    "\n",
    "# Check the query result\n",
    "if result == (1,):\n",
    "    print(\"Connection successful!\")\n",
    "else:\n",
    "    print(\"Connection failed.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-02-16T12:05:37.371951Z",
     "start_time": "2023-02-16T12:05:06.851634Z"
    },
    "pycharm": {
     "is_executing": true
    }
   },
   "outputs": [],
   "source": [
    "import wget\n",
    "\n",
    "embeddings_url = \"https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip\"\n",
    "\n",
    "# The file is ~700 MB so this will take some time\n",
    "wget.download(embeddings_url)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The downloaded file has to be then extracted:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-02-16T12:06:01.538851Z",
     "start_time": "2023-02-16T12:05:37.376042Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The file vector_database_wikipedia_articles_embedded.csv exists in the data directory.\n"
     ]
    }
   ],
   "source": [
    "import zipfile\n",
    "import os\n",
    "import re\n",
    "import tempfile\n",
    "\n",
    "current_directory = os.getcwd()\n",
    "zip_file_path = os.path.join(current_directory, \"vector_database_wikipedia_articles_embedded.zip\")\n",
    "output_directory = os.path.join(current_directory, \"../../data\")\n",
    "\n",
    "with zipfile.ZipFile(zip_file_path, \"r\") as zip_ref:\n",
    "    zip_ref.extractall(output_directory)\n",
    "\n",
    "\n",
    "# check the csv file exist\n",
    "file_name = \"vector_database_wikipedia_articles_embedded.csv\"\n",
    "data_directory = os.path.join(current_directory, \"../../data\")\n",
    "file_path = os.path.join(data_directory, file_name)\n",
    "\n",
    "\n",
    "if os.path.exists(file_path):\n",
    "    print(f\"The file {file_name} exists in the data directory.\")\n",
    "else:\n",
    "    print(f\"The file {file_name} does not exist in the data directory.\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load data\n",
    "\n",
    "In this section we are going to load the data prepared previous to this session, so you don't have to recompute the embeddings of Wikipedia articles with your own credits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Archive:  vector_database_wikipedia_articles_embedded.zip\n",
      "-rw-r--r--@ 1 geng  staff   1.7G Jan 31 01:19 vector_database_wikipedia_articles_embedded.csv\n"
     ]
    }
   ],
   "source": [
    "!unzip -n vector_database_wikipedia_articles_embedded.zip\n",
    "!ls -lh vector_database_wikipedia_articles_embedded.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Take a look at the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>url</th>\n",
       "      <th>title</th>\n",
       "      <th>text</th>\n",
       "      <th>title_vector</th>\n",
       "      <th>content_vector</th>\n",
       "      <th>vector_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>https://simple.wikipedia.org/wiki/April</td>\n",
       "      <td>April</td>\n",
       "      <td>April is the fourth month of the year in the J...</td>\n",
       "      <td>[0.001009464613161981, -0.020700545981526375, ...</td>\n",
       "      <td>[-0.011253940872848034, -0.013491976074874401,...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>https://simple.wikipedia.org/wiki/August</td>\n",
       "      <td>August</td>\n",
       "      <td>August (Aug.) is the eighth month of the year ...</td>\n",
       "      <td>[0.0009286514250561595, 0.000820168002974242, ...</td>\n",
       "      <td>[0.0003609954728744924, 0.007262262050062418, ...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6</td>\n",
       "      <td>https://simple.wikipedia.org/wiki/Art</td>\n",
       "      <td>Art</td>\n",
       "      <td>Art is a creative activity that expresses imag...</td>\n",
       "      <td>[0.003393713850528002, 0.0061537534929811954, ...</td>\n",
       "      <td>[-0.004959689453244209, 0.015772193670272827, ...</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>8</td>\n",
       "      <td>https://simple.wikipedia.org/wiki/A</td>\n",
       "      <td>A</td>\n",
       "      <td>A or a is the first letter of the English alph...</td>\n",
       "      <td>[0.0153952119871974, -0.013759135268628597, 0....</td>\n",
       "      <td>[0.024894846603274345, -0.022186409682035446, ...</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>9</td>\n",
       "      <td>https://simple.wikipedia.org/wiki/Air</td>\n",
       "      <td>Air</td>\n",
       "      <td>Air refers to the Earth's atmosphere. Air is a...</td>\n",
       "      <td>[0.02224554680287838, -0.02044147066771984, -0...</td>\n",
       "      <td>[0.021524671465158463, 0.018522677943110466, -...</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24995</th>\n",
       "      <td>98295</td>\n",
       "      <td>https://simple.wikipedia.org/wiki/Geneva</td>\n",
       "      <td>Geneva</td>\n",
       "      <td>Geneva (,  ,  ,  , ) is the second biggest cit...</td>\n",
       "      <td>[-0.015773078426718712, 0.01737344264984131, 0...</td>\n",
       "      <td>[0.008000412955880165, 0.02008531428873539, 0....</td>\n",
       "      <td>24995</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24996</th>\n",
       "      <td>98316</td>\n",
       "      <td>https://simple.wikipedia.org/wiki/Concubinage</td>\n",
       "      <td>Concubinage</td>\n",
       "      <td>Concubinage is the state of a woman in a relat...</td>\n",
       "      <td>[-0.00519518880173564, 0.005898841191083193, 0...</td>\n",
       "      <td>[-0.01736736111342907, -0.002740012714639306, ...</td>\n",
       "      <td>24996</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24997</th>\n",
       "      <td>98318</td>\n",
       "      <td>https://simple.wikipedia.org/wiki/Mistress%20%...</td>\n",
       "      <td>Mistress (lover)</td>\n",
       "      <td>A mistress is a man's long term female sexual ...</td>\n",
       "      <td>[-0.023164259269833565, -0.02052430994808674, ...</td>\n",
       "      <td>[-0.017878392711281776, -0.0004517830966506153...</td>\n",
       "      <td>24997</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24998</th>\n",
       "      <td>98326</td>\n",
       "      <td>https://simple.wikipedia.org/wiki/Eastern%20Front</td>\n",
       "      <td>Eastern Front</td>\n",
       "      <td>Eastern Front can be one of the following:\\n\\n...</td>\n",
       "      <td>[-0.00681863259524107, 0.002171179046854377, 8...</td>\n",
       "      <td>[-0.0019235472427681088, -0.004023272544145584...</td>\n",
       "      <td>24998</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24999</th>\n",
       "      <td>98327</td>\n",
       "      <td>https://simple.wikipedia.org/wiki/Italian%20Ca...</td>\n",
       "      <td>Italian Campaign</td>\n",
       "      <td>Italian Campaign can mean the following:\\n\\nTh...</td>\n",
       "      <td>[-0.014151256531476974, -0.008553029969334602,...</td>\n",
       "      <td>[-0.011758845299482346, -0.01346028596162796, ...</td>\n",
       "      <td>24999</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>25000 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "          id                                                url   \n",
       "0          1            https://simple.wikipedia.org/wiki/April  \\\n",
       "1          2           https://simple.wikipedia.org/wiki/August   \n",
       "2          6              https://simple.wikipedia.org/wiki/Art   \n",
       "3          8                https://simple.wikipedia.org/wiki/A   \n",
       "4          9              https://simple.wikipedia.org/wiki/Air   \n",
       "...      ...                                                ...   \n",
       "24995  98295           https://simple.wikipedia.org/wiki/Geneva   \n",
       "24996  98316      https://simple.wikipedia.org/wiki/Concubinage   \n",
       "24997  98318  https://simple.wikipedia.org/wiki/Mistress%20%...   \n",
       "24998  98326  https://simple.wikipedia.org/wiki/Eastern%20Front   \n",
       "24999  98327  https://simple.wikipedia.org/wiki/Italian%20Ca...   \n",
       "\n",
       "                  title                                               text   \n",
       "0                 April  April is the fourth month of the year in the J...  \\\n",
       "1                August  August (Aug.) is the eighth month of the year ...   \n",
       "2                   Art  Art is a creative activity that expresses imag...   \n",
       "3                     A  A or a is the first letter of the English alph...   \n",
       "4                   Air  Air refers to the Earth's atmosphere. Air is a...   \n",
       "...                 ...                                                ...   \n",
       "24995            Geneva  Geneva (,  ,  ,  , ) is the second biggest cit...   \n",
       "24996       Concubinage  Concubinage is the state of a woman in a relat...   \n",
       "24997  Mistress (lover)  A mistress is a man's long term female sexual ...   \n",
       "24998     Eastern Front  Eastern Front can be one of the following:\\n\\n...   \n",
       "24999  Italian Campaign  Italian Campaign can mean the following:\\n\\nTh...   \n",
       "\n",
       "                                            title_vector   \n",
       "0      [0.001009464613161981, -0.020700545981526375, ...  \\\n",
       "1      [0.0009286514250561595, 0.000820168002974242, ...   \n",
       "2      [0.003393713850528002, 0.0061537534929811954, ...   \n",
       "3      [0.0153952119871974, -0.013759135268628597, 0....   \n",
       "4      [0.02224554680287838, -0.02044147066771984, -0...   \n",
       "...                                                  ...   \n",
       "24995  [-0.015773078426718712, 0.01737344264984131, 0...   \n",
       "24996  [-0.00519518880173564, 0.005898841191083193, 0...   \n",
       "24997  [-0.023164259269833565, -0.02052430994808674, ...   \n",
       "24998  [-0.00681863259524107, 0.002171179046854377, 8...   \n",
       "24999  [-0.014151256531476974, -0.008553029969334602,...   \n",
       "\n",
       "                                          content_vector  vector_id  \n",
       "0      [-0.011253940872848034, -0.013491976074874401,...          0  \n",
       "1      [0.0003609954728744924, 0.007262262050062418, ...          1  \n",
       "2      [-0.004959689453244209, 0.015772193670272827, ...          2  \n",
       "3      [0.024894846603274345, -0.022186409682035446, ...          3  \n",
       "4      [0.021524671465158463, 0.018522677943110466, -...          4  \n",
       "...                                                  ...        ...  \n",
       "24995  [0.008000412955880165, 0.02008531428873539, 0....      24995  \n",
       "24996  [-0.01736736111342907, -0.002740012714639306, ...      24996  \n",
       "24997  [-0.017878392711281776, -0.0004517830966506153...      24997  \n",
       "24998  [-0.0019235472427681088, -0.004023272544145584...      24998  \n",
       "24999  [-0.011758845299482346, -0.01346028596162796, ...      24999  \n",
       "\n",
       "[25000 rows x 7 columns]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas, json\n",
    "data = pandas.read_csv('../../data/vector_database_wikipedia_articles_embedded.csv')\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1536 1536\n"
     ]
    }
   ],
   "source": [
    "title_vector_length = len(json.loads(data['title_vector'].iloc[0]))\n",
    "content_vector_length = len(json.loads(data['content_vector'].iloc[0]))\n",
    "\n",
    "print(title_vector_length, content_vector_length)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create table and proxima vector index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hologres stores data in __tables__ where each object is described by at least one vector. Our table will be called **articles** and each object will be described by both **title** and **content** vectors.\n",
    "\n",
    "We will start with creating a table and create proxima indexes on both **title** and **content**, and then we will fill it with our precomputed embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "cursor.execute('CREATE EXTENSION IF NOT EXISTS proxima;')\n",
    "create_proxima_table_sql = '''\n",
    "BEGIN;\n",
    "DROP TABLE IF EXISTS articles;\n",
    "CREATE TABLE articles (\n",
    "    id INT PRIMARY KEY NOT NULL,\n",
    "    url TEXT,\n",
    "    title TEXT,\n",
    "    content TEXT,\n",
    "    title_vector float4[] check(\n",
    "        array_ndims(title_vector) = 1 and \n",
    "        array_length(title_vector, 1) = 1536\n",
    "    ), -- define the vectors\n",
    "    content_vector float4[] check(\n",
    "        array_ndims(content_vector) = 1 and \n",
    "        array_length(content_vector, 1) = 1536\n",
    "    ),\n",
    "    vector_id INT\n",
    ");\n",
    "\n",
    "-- Create indexes for the vector fields.\n",
    "call set_table_property(\n",
    "    'articles',\n",
    "    'proxima_vectors', \n",
    "    '{\n",
    "        \"title_vector\":{\"algorithm\":\"Graph\",\"distance_method\":\"Euclidean\",\"builder_params\":{\"min_flush_proxima_row_count\" : 10}},\n",
    "        \"content_vector\":{\"algorithm\":\"Graph\",\"distance_method\":\"Euclidean\",\"builder_params\":{\"min_flush_proxima_row_count\" : 10}}\n",
    "    }'\n",
    ");  \n",
    "\n",
    "COMMIT;\n",
    "'''\n",
    "\n",
    "# Execute the SQL statements (will autocommit)\n",
    "cursor.execute(create_proxima_table_sql)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Upload data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's upload the data to the Hologres cloud instance using [COPY statement](https://www.alibabacloud.com/help/en/hologres/latest/use-the-copy-statement-to-import-or-export-data). This might take 5-10 minutes according to the network bandwidth."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "\n",
    "# Path to the unzipped CSV file\n",
    "csv_file_path = '../../data/vector_database_wikipedia_articles_embedded.csv'\n",
    "\n",
    "# In SQL, arrays are surrounded by {}, rather than []\n",
    "def process_file(file_path):\n",
    "    with open(file_path, 'r') as file:\n",
    "        for line in file:\n",
    "            # Replace '[' with '{' and ']' with '}'\n",
    "            modified_line = line.replace('[', '{').replace(']', '}')\n",
    "            yield modified_line\n",
    "\n",
    "# Create a StringIO object to store the modified lines\n",
    "modified_lines = io.StringIO(''.join(list(process_file(csv_file_path))))\n",
    "\n",
    "# Create the COPY command for the copy_expert method\n",
    "copy_command = '''\n",
    "COPY public.articles (id, url, title, content, title_vector, content_vector, vector_id)\n",
    "FROM STDIN WITH (FORMAT CSV, HEADER true, DELIMITER ',');\n",
    "'''\n",
    "\n",
    "# Execute the COPY command using the copy_expert method\n",
    "cursor.copy_expert(copy_command, modified_lines)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The proxima index will be built in the background. We can do searching during this period but the query will be slow without the vector index. Use this command to wait for finish building the index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "cursor.execute('vacuum articles;')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-02-16T12:30:40.675202Z",
     "start_time": "2023-02-16T12:30:40.655654Z"
    },
    "pycharm": {
     "is_executing": true
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Count:25000\n"
     ]
    }
   ],
   "source": [
    "# Check the collection size to make sure all the points have been stored\n",
    "count_sql = \"select count(*) from articles;\"\n",
    "cursor.execute(count_sql)\n",
    "result = cursor.fetchone()\n",
    "print(f\"Count:{result[0]}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Search data\n",
    "\n",
    "Once the data is uploaded we will start querying the collection for the closest vectors. We may provide an additional parameter `vector_name` to switch from title to content based search. Since the precomputed embeddings were created with `text-embedding-3-small` OpenAI model we also have to use it during search.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-02-16T12:30:38.024370Z",
     "start_time": "2023-02-16T12:30:37.712816Z"
    }
   },
   "outputs": [],
   "source": [
    "import openai\n",
    "def query_knn(query, table_name, vector_name=\"title_vector\", top_k=20):\n",
    "\n",
    "    # Creates embedding vector from user query\n",
    "    embedded_query = openai.Embedding.create(\n",
    "        input=query,\n",
    "        model=\"text-embedding-3-small\",\n",
    "    )[\"data\"][0][\"embedding\"]\n",
    "\n",
    "    # Convert the embedded_query to PostgreSQL compatible format\n",
    "    embedded_query_pg = \"{\" + \",\".join(map(str, embedded_query)) + \"}\"\n",
    "\n",
    "    # Create SQL query\n",
    "    query_sql = f\"\"\"\n",
    "    SELECT id, url, title, pm_approx_euclidean_distance({vector_name},'{embedded_query_pg}'::float4[]) AS distance\n",
    "    FROM {table_name}\n",
    "    ORDER BY distance\n",
    "    LIMIT {top_k};\n",
    "    \"\"\"\n",
    "    # Execute the query\n",
    "    cursor.execute(query_sql)\n",
    "    results = cursor.fetchall()\n",
    "\n",
    "    return results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-02-16T12:30:39.379566Z",
     "start_time": "2023-02-16T12:30:38.031041Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1. Museum of Modern Art (Score: 0.501)\n",
      "2. Western Europe (Score: 0.485)\n",
      "3. Renaissance art (Score: 0.479)\n",
      "4. Pop art (Score: 0.472)\n",
      "5. Northern Europe (Score: 0.461)\n",
      "6. Hellenistic art (Score: 0.458)\n",
      "7. Modernist literature (Score: 0.447)\n",
      "8. Art film (Score: 0.44)\n",
      "9. Central Europe (Score: 0.439)\n",
      "10. Art (Score: 0.437)\n",
      "11. European (Score: 0.437)\n",
      "12. Byzantine art (Score: 0.436)\n",
      "13. Postmodernism (Score: 0.435)\n",
      "14. Eastern Europe (Score: 0.433)\n",
      "15. Cubism (Score: 0.433)\n",
      "16. Europe (Score: 0.432)\n",
      "17. Impressionism (Score: 0.432)\n",
      "18. Bauhaus (Score: 0.431)\n",
      "19. Surrealism (Score: 0.429)\n",
      "20. Expressionism (Score: 0.429)\n"
     ]
    }
   ],
   "source": [
    "query_results = query_knn(\"modern art in Europe\", \"Articles\")\n",
    "for i, result in enumerate(query_results):\n",
    "    print(f\"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-02-16T12:30:40.652676Z",
     "start_time": "2023-02-16T12:30:39.382555Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1. Battle of Bannockburn (Score: 0.489)\n",
      "2. Wars of Scottish Independence (Score: 0.474)\n",
      "3. 1651 (Score: 0.457)\n",
      "4. First War of Scottish Independence (Score: 0.452)\n",
      "5. Robert I of Scotland (Score: 0.445)\n",
      "6. 841 (Score: 0.441)\n",
      "7. 1716 (Score: 0.441)\n",
      "8. 1314 (Score: 0.429)\n",
      "9. 1263 (Score: 0.428)\n",
      "10. William Wallace (Score: 0.426)\n",
      "11. Stirling (Score: 0.419)\n",
      "12. 1306 (Score: 0.419)\n",
      "13. 1746 (Score: 0.418)\n",
      "14. 1040s (Score: 0.414)\n",
      "15. 1106 (Score: 0.412)\n",
      "16. 1304 (Score: 0.411)\n",
      "17. David II of Scotland (Score: 0.408)\n",
      "18. Braveheart (Score: 0.407)\n",
      "19. 1124 (Score: 0.406)\n",
      "20. July 27 (Score: 0.405)\n"
     ]
    }
   ],
   "source": [
    "# This time we'll query using content vector\n",
    "query_results = query_knn(\"Famous battles in Scottish history\", \"Articles\", \"content_vector\")\n",
    "for i, result in enumerate(query_results):\n",
    "    print(f\"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}