You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
openai-cookbook/examples/vector_databases/myscale/Using_MyScale_for_embedding...

532 lines
18 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"id": "cb1537e6",
"metadata": {},
"source": [
"# Using MyScale for Embeddings Search\n",
"\n",
"This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.\n",
"\n",
"### What is a Vector Database\n",
"\n",
"A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.\n",
"\n",
"### Why use a Vector Database\n",
"\n",
"Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question and answering, chatbot and recommendation services, for example), and make use of them in a secure, scalable environment. Many of our customers make embeddings solve their problems at small scale but performance and security hold them back from going into production - we see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.\n",
"\n",
"\n",
"### Demo Flow\n",
"The demo flow is:\n",
"- **Setup**: Import packages and set any required variables\n",
"- **Load data**: Load a dataset and embed it using OpenAI embeddings\n",
"- **MyScale**\n",
" - *Setup*: Set up the MyScale Python client. For more details go [here](https://docs.myscale.com/en/python-client/)\n",
" - *Index Data*: We'll create a table and index it for __content__.\n",
" - *Search Data*: Run a few example queries with various goals in mind.\n",
"\n",
"Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings."
]
},
{
"cell_type": "markdown",
"id": "e2b59250",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"Import the required libraries and set the embedding model that we'd like to use."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d8810f9",
"metadata": {},
"outputs": [],
"source": [
"# We'll need to install the MyScale client\n",
"!pip install clickhouse-connect\n",
"\n",
"#Install wget to pull zip file\n",
"!pip install wget"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "5be94df6",
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"\n",
"from typing import List, Iterator\n",
"import pandas as pd\n",
"import numpy as np\n",
"import os\n",
"import wget\n",
"from ast import literal_eval\n",
"\n",
"# MyScale's client library for Python\n",
"import clickhouse_connect\n",
"\n",
"# I've set this to our new embeddings model, this can be changed to the embedding model of your choice\n",
"EMBEDDING_MODEL = \"text-embedding-3-small\"\n",
"\n",
"# Ignore unclosed SSL socket warnings - optional in case you get these errors\n",
"import warnings\n",
"\n",
"warnings.filterwarnings(action=\"ignore\", message=\"unclosed\", category=ResourceWarning)\n",
"warnings.filterwarnings(\"ignore\", category=DeprecationWarning) "
]
},
{
"cell_type": "markdown",
"id": "e5d9d2e1",
"metadata": {},
"source": [
"## Load data\n",
"\n",
"In this section we'll load embedded data that we've prepared previous to this session."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5dff8b55",
"metadata": {},
"outputs": [],
"source": [
"embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n",
"\n",
"# The file is ~700 MB so this will take some time\n",
"wget.download(embeddings_url)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "21097972",
"metadata": {},
"outputs": [],
"source": [
"import zipfile\n",
"with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\",\"r\") as zip_ref:\n",
" zip_ref.extractall(\"../data\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "70bbd8ba",
"metadata": {},
"outputs": [],
"source": [
"article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "1721e45d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>url</th>\n",
" <th>title</th>\n",
" <th>text</th>\n",
" <th>title_vector</th>\n",
" <th>content_vector</th>\n",
" <th>vector_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>https://simple.wikipedia.org/wiki/April</td>\n",
" <td>April</td>\n",
" <td>April is the fourth month of the year in the J...</td>\n",
" <td>[0.001009464613161981, -0.020700545981526375, ...</td>\n",
" <td>[-0.011253940872848034, -0.013491976074874401,...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>https://simple.wikipedia.org/wiki/August</td>\n",
" <td>August</td>\n",
" <td>August (Aug.) is the eighth month of the year ...</td>\n",
" <td>[0.0009286514250561595, 0.000820168002974242, ...</td>\n",
" <td>[0.0003609954728744924, 0.007262262050062418, ...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>6</td>\n",
" <td>https://simple.wikipedia.org/wiki/Art</td>\n",
" <td>Art</td>\n",
" <td>Art is a creative activity that expresses imag...</td>\n",
" <td>[0.003393713850528002, 0.0061537534929811954, ...</td>\n",
" <td>[-0.004959689453244209, 0.015772193670272827, ...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>8</td>\n",
" <td>https://simple.wikipedia.org/wiki/A</td>\n",
" <td>A</td>\n",
" <td>A or a is the first letter of the English alph...</td>\n",
" <td>[0.0153952119871974, -0.013759135268628597, 0....</td>\n",
" <td>[0.024894846603274345, -0.022186409682035446, ...</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>9</td>\n",
" <td>https://simple.wikipedia.org/wiki/Air</td>\n",
" <td>Air</td>\n",
" <td>Air refers to the Earth's atmosphere. Air is a...</td>\n",
" <td>[0.02224554680287838, -0.02044147066771984, -0...</td>\n",
" <td>[0.021524671465158463, 0.018522677943110466, -...</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id url title \\\n",
"0 1 https://simple.wikipedia.org/wiki/April April \n",
"1 2 https://simple.wikipedia.org/wiki/August August \n",
"2 6 https://simple.wikipedia.org/wiki/Art Art \n",
"3 8 https://simple.wikipedia.org/wiki/A A \n",
"4 9 https://simple.wikipedia.org/wiki/Air Air \n",
"\n",
" text \\\n",
"0 April is the fourth month of the year in the J... \n",
"1 August (Aug.) is the eighth month of the year ... \n",
"2 Art is a creative activity that expresses imag... \n",
"3 A or a is the first letter of the English alph... \n",
"4 Air refers to the Earth's atmosphere. Air is a... \n",
"\n",
" title_vector \\\n",
"0 [0.001009464613161981, -0.020700545981526375, ... \n",
"1 [0.0009286514250561595, 0.000820168002974242, ... \n",
"2 [0.003393713850528002, 0.0061537534929811954, ... \n",
"3 [0.0153952119871974, -0.013759135268628597, 0.... \n",
"4 [0.02224554680287838, -0.02044147066771984, -0... \n",
"\n",
" content_vector vector_id \n",
"0 [-0.011253940872848034, -0.013491976074874401,... 0 \n",
"1 [0.0003609954728744924, 0.007262262050062418, ... 1 \n",
"2 [-0.004959689453244209, 0.015772193670272827, ... 2 \n",
"3 [0.024894846603274345, -0.022186409682035446, ... 3 \n",
"4 [0.021524671465158463, 0.018522677943110466, -... 4 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"article_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "960b82af",
"metadata": {},
"outputs": [],
"source": [
"# Read vectors from strings back into a list\n",
"article_df['title_vector'] = article_df.title_vector.apply(literal_eval)\n",
"article_df['content_vector'] = article_df.content_vector.apply(literal_eval)\n",
"\n",
"# Set vector_id to be a string\n",
"article_df['vector_id'] = article_df['vector_id'].apply(str)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "a334ab8b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 25000 entries, 0 to 24999\n",
"Data columns (total 7 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 id 25000 non-null int64 \n",
" 1 url 25000 non-null object\n",
" 2 title 25000 non-null object\n",
" 3 text 25000 non-null object\n",
" 4 title_vector 25000 non-null object\n",
" 5 content_vector 25000 non-null object\n",
" 6 vector_id 25000 non-null object\n",
"dtypes: int64(1), object(6)\n",
"memory usage: 1.3+ MB\n"
]
}
],
"source": [
"article_df.info(show_counts=True)"
]
},
{
"cell_type": "markdown",
"id": "56a02772",
"metadata": {},
"source": [
"## MyScale\n",
"The next vector database we'll consider is [MyScale](https://myscale.com).\n",
"\n",
"[MyScale](https://myscale.com) is a database built on Clickhouse that combines vector search and SQL analytics to offer a high-performance, streamlined, and fully managed experience. It's designed to facilitate joint queries and analyses on both structured and vector data, with comprehensive SQL support for all data processing.\n",
"\n",
"Deploy and execute vector search with SQL on your cluster within two minutes by using [MyScale Console](https://console.myscale.com)."
]
},
{
"cell_type": "markdown",
"id": "d3e1f96b",
"metadata": {},
"source": [
"### Connect to MyScale\n",
"\n",
"Follow the [connections details](https://docs.myscale.com/en/cluster-management/) section to retrieve the cluster host, username, and password information from the MyScale console, and use it to create a connection to your cluster as shown below:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "024243cf",
"metadata": {},
"outputs": [],
"source": [
"# initialize client\n",
"client = clickhouse_connect.get_client(host='YOUR_CLUSTER_HOST', port=8443, username='YOUR_USERNAME', password='YOUR_CLUSTER_PASSWORD')"
]
},
{
"cell_type": "markdown",
"id": "067009db",
"metadata": {},
"source": [
"### Index data\n",
"\n",
"We will create an SQL table called `articles` in MyScale to store the embeddings data. The table will include a vector index with a cosine distance metric and a constraint for the length of the embeddings. Use the following code to create and insert data into the articles table:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "685cba13",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "378809ac23104dc08c06fa3a53f83666",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/250 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# create articles table with vector index\n",
"embedding_len=len(article_df['content_vector'][0]) # 1536\n",
"\n",
"client.command(f\"\"\"\n",
"CREATE TABLE IF NOT EXISTS default.articles\n",
"(\n",
" id UInt64,\n",
" url String,\n",
" title String,\n",
" text String,\n",
" content_vector Array(Float32),\n",
" CONSTRAINT cons_vector_len CHECK length(content_vector) = {embedding_len},\n",
" VECTOR INDEX article_content_index content_vector TYPE HNSWFLAT('metric_type=Cosine')\n",
")\n",
"ENGINE = MergeTree ORDER BY id\n",
"\"\"\")\n",
"\n",
"# insert data into the table in batches\n",
"from tqdm.auto import tqdm\n",
"\n",
"batch_size = 100\n",
"total_records = len(article_df)\n",
"\n",
"# we only need subset of columns\n",
"article_df = article_df[['id', 'url', 'title', 'text', 'content_vector']]\n",
"\n",
"# upload data in batches\n",
"data = article_df.to_records(index=False).tolist()\n",
"column_names = article_df.columns.tolist()\n",
"\n",
"for i in tqdm(range(0, total_records, batch_size)):\n",
" i_end = min(i + batch_size, total_records)\n",
" client.insert(\"default.articles\", data[i:i_end], column_names=column_names)"
]
},
{
"cell_type": "markdown",
"id": "b0f0e591",
"metadata": {},
"source": [
"We need to check the build status of the vector index before proceeding with the search, as it is automatically built in the background."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "9251bdf1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"articles count: 25000\n",
"index build status: InProgress\n"
]
}
],
"source": [
"# check count of inserted data\n",
"print(f\"articles count: {client.command('SELECT count(*) FROM default.articles')}\")\n",
"\n",
"# check the status of the vector index, make sure vector index is ready with 'Built' status\n",
"get_index_status=\"SELECT status FROM system.vector_indices WHERE name='article_content_index'\"\n",
"print(f\"index build status: {client.command(get_index_status)}\")"
]
},
{
"cell_type": "markdown",
"id": "fe55234a",
"metadata": {},
"source": [
"### Search data\n",
"\n",
"Once indexed in MyScale, we can perform vector search to find similar content. First, we will use the OpenAI API to generate embeddings for our query. Then, we will perform the vector search using MyScale."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "fd5f03c6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 Battle of Bannockburn\n",
"2 Wars of Scottish Independence\n",
"3 1651\n",
"4 First War of Scottish Independence\n",
"5 Robert I of Scotland\n",
"6 841\n",
"7 1716\n",
"8 1314\n",
"9 1263\n",
"10 William Wallace\n"
]
}
],
"source": [
"query = \"Famous battles in Scottish history\"\n",
"\n",
"# creates embedding vector from user query\n",
"embed = openai.Embedding.create(\n",
" input=query,\n",
" model=\"text-embedding-3-small\",\n",
")[\"data\"][0][\"embedding\"]\n",
"\n",
"# query the database to find the top K similar content to the given query\n",
"top_k = 10\n",
"results = client.query(f\"\"\"\n",
"SELECT id, url, title, distance(content_vector, {embed}) as dist\n",
"FROM default.articles\n",
"ORDER BY dist\n",
"LIMIT {top_k}\n",
"\"\"\")\n",
"\n",
"# display results\n",
"for i, r in enumerate(results.named_results()):\n",
" print(i+1, r['title'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0119d87a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "vector_db_split",
"language": "python",
"name": "vector_db_split"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
},
"vscode": {
"interpreter": {
"hash": "fd16a328ca3d68029457069b79cb0b38eb39a0f5ccc4fe4473d3047707df8207"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}