{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using Tair as a vector database for OpenAI embeddings\n",
"\n",
"This notebook guides you step by step on using Tair as a vector database for OpenAI embeddings.\n",
"\n",
"This notebook presents an end-to-end process of:\n",
"1. Using precomputed embeddings created by OpenAI API.\n",
"2. Storing the embeddings in a cloud instance of Tair.\n",
"3. Converting raw text query to an embedding with OpenAI API.\n",
"4. Using Tair to perform the nearest neighbour search in the created collection.\n",
"\n",
"### What is Tair\n",
"\n",
"[Tair](https://www.alibabacloud.com/help/en/tair/latest/what-is-tair) is a cloud native in-memory database service that is developed by Alibaba Cloud. Tair is compatible with open source Redis and provides a variety of data models and enterprise-class capabilities to support your real-time online scenarios. Tair also introduces persistent memory-optimized instances that are based on the new non-volatile memory (NVM) storage medium. These instances can reduce costs by 30%, ensure data persistence, and provide almost the same performance as in-memory databases. Tair has been widely used in areas such as government affairs, finance, manufacturing, healthcare, and pan-Internet to meet their high-speed query and computing requirements.\n",
"\n",
"[Tairvector](https://www.alibabacloud.com/help/en/tair/latest/tairvector) is an in-house data structure that provides high-performance real-time storage and retrieval of vectors. TairVector provides two indexing algorithms: Hierarchical Navigable Small World (HNSW) and Flat Search. Additionally, TairVector supports multiple distance functions, such as Euclidean distance, inner product, and Jaccard distance. Compared with traditional vector retrieval services, TairVector has the following advantages:\n",
"- Stores all data in memory and supports real-time index updates to reduce latency of read and write operations.\n",
"- Uses an optimized data structure in memory to better utilize storage capacity.\n",
"- Functions as an out-of-the-box data structure in a simple and efficient architecture without complex modules or dependencies.\n",
"\n",
"### Deployment options\n",
"\n",
"- Using [Tair Cloud Vector Database](https://www.alibabacloud.com/help/en/tair/latest/getting-started-overview). [Click here](https://www.alibabacloud.com/product/tair) to fast deploy it.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"For the purposes of this exercise we need to prepare a couple of things:\n",
"\n",
"1. Tair cloud server instance.\n",
"2. The 'tair' library to interact with the tair database.\n",
"3. An [OpenAI API key](https://beta.openai.com/account/api-keys).\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Install requirements\n",
"\n",
"This notebook obviously requires the `openai` and `tair` packages, but there are also some other additional libraries we will use. The following command installs them all:\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2023-02-16T12:05:05.718972Z",
"start_time": "2023-02-16T12:04:30.434820Z"
},
"pycharm": {
"is_executing": true
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Looking in indexes: http://sg.mirrors.cloud.aliyuncs.com/pypi/simple/\n",
"Requirement already satisfied: openai in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (0.28.0)\n",
"Requirement already satisfied: redis in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (5.0.0)\n",
"Requirement already satisfied: tair in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (1.3.6)\n",
"Requirement already satisfied: pandas in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (2.1.0)\n",
"Requirement already satisfied: wget in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (3.2)\n",
"Requirement already satisfied: requests>=2.20 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from openai) (2.31.0)\n",
"Requirement already satisfied: tqdm in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from openai) (4.66.1)\n",
"Requirement already satisfied: aiohttp in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from openai) (3.8.5)\n",
"Requirement already satisfied: async-timeout>=4.0.2 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from redis) (4.0.3)\n",
"Requirement already satisfied: numpy>=1.22.4 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pandas) (1.25.2)\n",
"Requirement already satisfied: python-dateutil>=2.8.2 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pandas) (2.8.2)\n",
"Requirement already satisfied: pytz>=2020.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pandas) (2023.3.post1)\n",
"Requirement already satisfied: tzdata>=2022.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pandas) (2023.3)\n",
"Requirement already satisfied: six>=1.5 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (3.2.0)\n",
"Requirement already satisfied: idna<4,>=2.5 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (3.4)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (2.0.4)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (2023.7.22)\n",
"Requirement already satisfied: attrs>=17.3.0 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (22.1.0)\n",
"Requirement already satisfied: multidict<7.0,>=4.5 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (6.0.4)\n",
"Requirement already satisfied: yarl<2.0,>=1.0 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (1.9.2)\n",
"Requirement already satisfied: frozenlist>=1.1.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (1.4.0)\n",
"Requirement already satisfied: aiosignal>=1.1.2 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (1.3.1)\n",
"\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n",
"\u001b[0m"
]
}
],
"source": [
"! pip install openai redis tair pandas wget"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prepare your OpenAI API key\n",
"\n",
"The OpenAI API key is used for vectorization of the documents and queries.\n",
"\n",
"If you don't have an OpenAI API key, you can get one from [https://beta.openai.com/account/api-keys](https://beta.openai.com/account/api-keys).\n",
"\n",
"Once you get your key, please add it by getpass."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2023-02-16T12:05:05.730338Z",
"start_time": "2023-02-16T12:05:05.723351Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input your OpenAI API key:········\n"
]
}
],
"source": [
"import getpass\n",
"import openai\n",
"\n",
"openai.api_key = getpass.getpass(\"Input your OpenAI API key:\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Connect to Tair\n",
"First add it to your environment variables.\n",
"\n",
"Connecting to a running instance of Tair server is easy with the official Python library."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input your tair url:········\n"
]
}
],
"source": [
"# The format of url: redis://[[username]:[password]]@localhost:6379/0\n",
"TAIR_URL = getpass.getpass(\"Input your tair url:\")"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"from tair import Tair as TairClient\n",
"\n",
"# connect to tair from url and create a client\n",
"\n",
"url = TAIR_URL\n",
"client = TairClient.from_url(url)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can test the connection by ping:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2023-02-16T12:05:06.848488Z",
"start_time": "2023-02-16T12:05:06.832612Z"
},
"pycharm": {
"is_executing": true
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"client.ping()"
]
},
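{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download data\n",
"\n",
"Next we download a dataset of precomputed Wikipedia article embeddings provided by OpenAI, so that we don't have to spend our own credits recomputing them. The archive is roughly 700 MB, so the download may take a while."
]
},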
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2023-02-16T12:05:37.371951Z",
"start_time": "2023-02-16T12:05:06.851634Z"
},
"pycharm": {
"is_executing": true
},
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"100% [......................................................................] 698933052 / 698933052"
]
},
{
"data": {
"text/plain": [
"'vector_database_wikipedia_articles_embedded (1).zip'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import wget\n",
"\n",
"embeddings_url = \"https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip\"\n",
"\n",
"# The file is ~700 MB so this will take some time\n",
"wget.download(embeddings_url)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The downloaded file has to then be extracted:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2023-02-16T12:06:01.538851Z",
"start_time": "2023-02-16T12:05:37.376042Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The file vector_database_wikipedia_articles_embedded.csv exists in the data directory.\n"
]
}
],
"source": [
"import zipfile\n",
"import os\n",
"import re\n",
"import tempfile\n",
"\n",
"current_directory = os.getcwd()\n",
"zip_file_path = os.path.join(current_directory, \"vector_database_wikipedia_articles_embedded.zip\")\n",
"output_directory = os.path.join(current_directory, \"../../data\")\n",
"\n",
"with zipfile.ZipFile(zip_file_path, \"r\") as zip_ref:\n",
" zip_ref.extractall(output_directory)\n",
"\n",
"\n",
"# check the csv file exist\n",
"file_name = \"vector_database_wikipedia_articles_embedded.csv\"\n",
"data_directory = os.path.join(current_directory, \"../../data\")\n",
"file_path = os.path.join(data_directory, file_name)\n",
"\n",
"\n",
"if os.path.exists(file_path):\n",
" print(f\"The file {file_name} exists in the data directory.\")\n",
"else:\n",
" print(f\"The file {file_name} does not exist in the data directory.\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Index\n",
"\n",
"Tair stores data in indexes where each object is described by one key. Each key contains a vector and multiple attribute_keys.\n",
"\n",
"We will start with creating two indexes, one for **title_vector** and one for **content_vector**, and then we will fill it with our precomputed embeddings."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index already exists\n",
"Index already exists\n"
]
}
],
"source": [
"# set index parameters\n",
"index = \"openai_test\"\n",
"embedding_dim = 1536\n",
"distance_type = \"L2\"\n",
"index_type = \"HNSW\"\n",
"data_type = \"FLOAT32\"\n",
"\n",
"# Create two indexes, one for title_vector and one for content_vector, skip if already exists\n",
"index_names = [index + \"_title_vector\", index+\"_content_vector\"]\n",
"for index_name in index_names:\n",
" index_connection = client.tvs_get_index(index_name)\n",
" if index_connection is not None:\n",
" print(\"Index already exists\")\n",
" else:\n",
" client.tvs_create_index(name=index_name, dim=embedding_dim, distance_type=distance_type,\n",
" index_type=index_type, data_type=data_type)"
]
},
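{
"cell_type": "markdown",
"metadata": {},
"source": [
"The indexes above use the HNSW algorithm with Euclidean (`L2`) distance over `FLOAT32` vectors of dimension 1536, which matches the dimensionality of the precomputed OpenAI embeddings."
]
},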
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load data\n",
"\n",
"In this section we are going to load the data prepared previous to this session, so you don't have to recompute the embeddings of Wikipedia articles with your own credits."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from ast import literal_eval\n",
"# Path to your local CSV file\n",
"csv_file_path = '../../data/vector_database_wikipedia_articles_embedded.csv'\n",
"article_df = pd.read_csv(csv_file_path)\n",
"\n",
"# Read vectors from strings back into a list\n",
"article_df['title_vector'] = article_df.title_vector.apply(literal_eval).values\n",
"article_df['content_vector'] = article_df.content_vector.apply(literal_eval).values\n",
"\n",
"# add/update data to indexes\n",
"for i in range(len(article_df)):\n",
" # add data to index with title_vector\n",
" client.tvs_hset(index=index_names[0], key=article_df.id[i].item(), vector=article_df.title_vector[i], is_binary=False,\n",
" **{\"url\": article_df.url[i], \"title\": article_df.title[i], \"text\": article_df.text[i]})\n",
" # add data to index with content_vector\n",
" client.tvs_hset(index=index_names[1], key=article_df.id[i].item(), vector=article_df.content_vector[i], is_binary=False,\n",
" **{\"url\": article_df.url[i], \"title\": article_df.title[i], \"text\": article_df.text[i]})"
]
},
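{
"cell_type": "markdown",
"metadata": {},
"source": [
"`tvs_hset` adds a key if it does not exist and updates it otherwise, so the cell above can be re-run safely. It writes each of the 25,000 articles to both indexes one key at a time, so loading takes a while."
]
},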
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"ExecuteTime": {
"end_time": "2023-02-16T12:30:40.675202Z",
"start_time": "2023-02-16T12:30:40.655654Z"
},
"pycharm": {
"is_executing": true
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Count in openai_test_title_vector:25000\n",
"Count in openai_test_content_vector:25000\n"
]
}
],
"source": [
"# Check the data count to make sure all the points have been stored\n",
"for index_name in index_names:\n",
" stats = client.tvs_get_index(index_name)\n",
" count = int(stats[\"current_record_count\"]) - int(stats[\"delete_record_count\"])\n",
" print(f\"Count in {index_name}:{count}\")\n"
]
},
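{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, we can read a single record's attributes back with `tvs_hmget` (the same call used in the search section below). The key chosen here, the first article id in the dataframe, is just an illustrative choice; any stored id works."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: read one record's attributes back from the title index.\n",
"# The key below (the first article id) is an arbitrary, illustrative choice.\n",
"sample_key = article_df.id[0].item()\n",
"title, url = client.tvs_hmget(index_names[0], sample_key, \"title\", \"url\")\n",
"# Attribute values come back as bytes, so decode them before printing\n",
"print(title.decode(\"utf-8\"), url.decode(\"utf-8\"))"
]
},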
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Search data\n",
"\n",
"Once the data is put into Tair we will start querying the collection for the closest vectors. We may provide an additional parameter `vector_name` to switch from title to content based search. Since the precomputed embeddings were created with `text-embedding-3-small` OpenAI model, we also have to use it during search.\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2023-02-16T12:30:38.024370Z",
"start_time": "2023-02-16T12:30:37.712816Z"
}
},
"outputs": [],
"source": [
"def query_tair(client, query, vector_name=\"title_vector\", top_k=5):\n",
"\n",
" # Creates embedding vector from user query\n",
" embedded_query = openai.Embedding.create(\n",
" input= query,\n",
" model=\"text-embedding-3-small\",\n",
" )[\"data\"][0]['embedding']\n",
" embedded_query = np.array(embedded_query)\n",
"\n",
" # search for the top k approximate nearest neighbors of vector in an index\n",
" query_result = client.tvs_knnsearch(index=index+\"_\"+vector_name, k=top_k, vector=embedded_query)\n",
"\n",
" return query_result"
]
},
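{
"cell_type": "markdown",
"metadata": {},
"source": [
"`tvs_knnsearch` returns the `top_k` results as `(key, distance)` pairs, with keys encoded as bytes. That is why the examples below decode each key before using it to look up the article title with `tvs_hmget`."
]
},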
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"ExecuteTime": {
"end_time": "2023-02-16T12:30:39.379566Z",
"start_time": "2023-02-16T12:30:38.031041Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1. Museum of Modern Art (Distance: 0.125)\n",
"2. Western Europe (Distance: 0.133)\n",
"3. Renaissance art (Distance: 0.136)\n",
"4. Pop art (Distance: 0.14)\n",
"5. Northern Europe (Distance: 0.145)\n"
]
}
],
"source": [
"import openai\n",
"import numpy as np\n",
"\n",
"query_result = query_tair(client=client, query=\"modern art in Europe\", vector_name=\"title_vector\")\n",
"for i in range(len(query_result)):\n",
" title = client.tvs_hmget(index+\"_\"+\"content_vector\", query_result[i][0].decode('utf-8'), \"title\")\n",
" print(f\"{i + 1}. {title[0].decode('utf-8')} (Distance: {round(query_result[i][1],3)})\")"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"ExecuteTime": {
"end_time": "2023-02-16T12:30:40.652676Z",
"start_time": "2023-02-16T12:30:39.382555Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1. Battle of Bannockburn (Distance: 0.131)\n",
"2. Wars of Scottish Independence (Distance: 0.139)\n",
"3. 1651 (Distance: 0.147)\n",
"4. First War of Scottish Independence (Distance: 0.15)\n",
"5. Robert I of Scotland (Distance: 0.154)\n"
]
}
],
"source": [
"# This time we'll query using content vector\n",
"query_result = query_tair(client=client, query=\"Famous battles in Scottish history\", vector_name=\"content_vector\")\n",
"for i in range(len(query_result)):\n",
" title = client.tvs_hmget(index+\"_\"+\"content_vector\", query_result[i][0].decode('utf-8'), \"title\")\n",
" print(f\"{i + 1}. {title[0].decode('utf-8')} (Distance: {round(query_result[i][1],3)})\")"
]
},
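{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clean up\n",
"\n",
"Once you are done experimenting, you can drop the two test indexes to free up memory. The minimal sketch below assumes the `tair` client exposes a `tvs_del_index` method wrapping the `TVS.DELINDEX` command; check your installed client version if the call differs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional clean-up: delete both test indexes and every vector stored in them.\n",
"# tvs_del_index is assumed to wrap TVS.DELINDEX; verify it exists in your\n",
"# installed tair client version before running.\n",
"for index_name in index_names:\n",
"    client.tvs_del_index(index_name)"
]
}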
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:notebook] *",
"language": "python",
"name": "conda-env-notebook-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 1
}