🌟 MyScale VectorStore Support (#377)

* add myscale notebook

* add myscale to vector database notebook
qingdi 2023-05-02 07:46:51 +08:00 committed by GitHub
parent 7ee1c6c0d1
commit 7fcba408f1
2 changed files with 962 additions and 2 deletions

@ -50,7 +50,10 @@
" - *Setup*: Set up the Typesense Python client. For more details go [here](https://typesense.org/docs/0.24.0/api/)\n",
" - *Index Data*: We'll create a collection and index it for both __titles__ and __content__.\n",
" - *Search Data*: Run a few example queries with various goals in mind.\n",
"\n",
"- **MyScale**\n",
" - *Setup*: Set up the MyScale Python client. For more details go [here](https://docs.myscale.com/en/python-client/)\n",
" - *Index Data*: We'll create a table and index it for __content__.\n",
" - *Search Data*: Run a few example queries with various goals in mind.\n",
"\n",
"Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings."
]
@ -80,6 +83,7 @@
"!pip install qdrant-client\n",
"!pip install redis\n",
"!pip install typesense\n",
"!pip install clickhouse-connect\n",
"\n",
"#Install wget to pull zip file\n",
"!pip install wget"
@ -119,6 +123,9 @@
"# Typesense's client library for Python\n",
"import typesense\n",
"\n",
"# MyScale's client library for Python\n",
"import clickhouse-connect\n",
"\n",
"\n",
"# I've set this to our new embeddings model, this can be changed to the embedding model of your choice\n",
"EMBEDDING_MODEL = \"text-embedding-ada-002\"\n",
@ -2249,6 +2256,166 @@
"source": [
"Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo."
]
},
{
"cell_type": "markdown",
"id": "56a02772",
"metadata": {},
"source": [
"# MyScale\n",
"The next vector database we'll consider is [MyScale](https://myscale.com).\n",
"\n",
"[MyScale](https://myscale.com) is a database built on Clickhouse that combines vector search and SQL analytics to offer a high-performance, streamlined, and fully managed experience. It's designed to facilitate joint queries and analyses on both structured and vector data, with comprehensive SQL support for all data processing.\n",
"\n",
"Deploy and execute vector search with SQL on your cluster within two minutes by using [MyScale Console](https://console.myscale.com)."
]
},
{
"cell_type": "markdown",
"id": "d3e1f96b",
"metadata": {},
"source": [
"## Connect to MyScale\n",
"\n",
"Follow the [connections details](https://docs.myscale.com/en/cluster-management/) section to retrieve the cluster host, username, and password information from the MyScale console, and use it to create a connection to your cluster as shown below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "024243cf",
"metadata": {},
"outputs": [],
"source": [
"import clickhouse_connect\n",
"\n",
"# initialize client\n",
"client = clickhouse_connect.get_client(host='YOUR_CLUSTER_HOST', port=8443, username='YOUR_USERNAME', password='YOUR_CLUSTER_PASSWORD')"
]
},
{
"cell_type": "markdown",
"id": "067009db",
"metadata": {},
"source": [
"## Index data\n",
"\n",
"We will create an SQL table called `articles` in MyScale to store the embeddings data. The table will include a vector index with a cosine distance metric and a constraint for the length of the embeddings. Use the following code to create and insert data into the articles table:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "685cba13",
"metadata": {},
"outputs": [],
"source": [
"# create articles table with vector index\n",
"embedding_len=len(article_df['content_vector'][0]) # 1536\n",
"\n",
"client.command(f\"\"\"\n",
"CREATE TABLE IF NOT EXISTS default.articles\n",
"(\n",
" id UInt64,\n",
" url String,\n",
" title String,\n",
" text String,\n",
" content_vector Array(Float32),\n",
" CONSTRAINT cons_vector_len CHECK length(content_vector) = {embedding_len},\n",
" VECTOR INDEX article_content_index content_vector TYPE HNSWFLAT('metric_type=Cosine')\n",
")\n",
"ENGINE = MergeTree ORDER BY id\n",
"\"\"\")\n",
"\n",
"# insert data into the table in batches\n",
"from tqdm.auto import tqdm\n",
"\n",
"batch_size = 100\n",
"total_records = len(article_df)\n",
"\n",
"# we only need subset of columns\n",
"article_df = article_df[['id', 'url', 'title', 'text', 'content_vector']]\n",
"\n",
"# upload data in batches\n",
"data = article_df.to_records(index=False).tolist()\n",
"column_names = article_df.columns.tolist()\n",
"\n",
"for i in tqdm(range(0, total_records, batch_size)):\n",
" i_end = min(i + batch_size, total_records)\n",
" client.insert(\"default.articles\", data[i:i_end], column_names=column_names)"
]
},
{
"cell_type": "markdown",
"id": "b0f0e591",
"metadata": {},
"source": [
"We need to check the build status of the vector index before proceeding with the search, as it is automatically built in the background."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9251bdf1",
"metadata": {},
"outputs": [],
"source": [
"# check count of inserted data\n",
"print(f\"articles count: {client.command('SELECT count(*) FROM default.articles')}\")\n",
"\n",
"# check the status of the vector index, make sure vector index is ready with 'Built' status\n",
"get_index_status=\"SELECT status FROM system.vector_indices WHERE name='article_content_index'\"\n",
"print(f\"index build status: {client.command(get_index_status)}\")"
]
},
{
"cell_type": "markdown",
"id": "fe55234a",
"metadata": {},
"source": [
"## Search data\n",
"\n",
"Once indexed in MyScale, we can perform vector search to find similar content. First, we will use the OpenAI API to generate embeddings for our query. Then, we will perform the vector search using MyScale."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fd5f03c6",
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"\n",
"query = \"Famous battles in Scottish history\"\n",
"\n",
"# creates embedding vector from user query\n",
"embed = openai.Embedding.create(\n",
" input=query,\n",
" model=\"text-embedding-ada-002\",\n",
")[\"data\"][0][\"embedding\"]\n",
"\n",
"# query the database to find the top K similar content to the given query\n",
"top_k = 10\n",
"results = client.query(f\"\"\"\n",
"SELECT id, url, title, distance(content_vector, {embed}) as dist\n",
"FROM default.articles\n",
"ORDER BY dist\n",
"LIMIT {top_k}\n",
"\"\"\")\n",
"\n",
"# display results\n",
"for i, r in enumerate(results.named_results()):\n",
" print(i+1, r['title'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0119d87a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@ -2267,7 +2434,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
"version": "3.9.16"
},
"vscode": {
"interpreter": {

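Note on the search cell added in this file: it interpolates the Python list `embed` directly into the SQL text with an f-string. The following is a minimal sketch of that step as a standalone function, assuming the `distance(...)` call, table name, and column name used in the notebook. `build_search_sql` is a hypothetical helper name, not part of clickhouse-connect or MyScale; it only formats text and validates its inputs.

```python
import math
from typing import Sequence


def build_search_sql(embedding: Sequence[float], top_k: int = 10,
                     table: str = "default.articles",
                     column: str = "content_vector") -> str:
    """Format the notebook's top-k distance query as a SQL string.

    Hypothetical helper: it never contacts a server. `table` and `column`
    are inserted as-is, so they must be trusted identifiers.
    """
    if top_k <= 0:
        raise ValueError("top_k must be a positive integer")
    # Coerce to float so that only numbers can reach the SQL text
    values = [float(x) for x in embedding]
    if not values or not all(math.isfinite(v) for v in values):
        raise ValueError("embedding must be a non-empty list of finite floats")
    # Render the vector as an array literal, e.g. [0.1, 0.2, 0.3]
    vector_literal = "[" + ", ".join(repr(v) for v in values) + "]"
    return (
        f"SELECT id, url, title, distance({column}, {vector_literal}) AS dist\n"
        f"FROM {table}\n"
        f"ORDER BY dist\n"
        f"LIMIT {int(top_k)}"
    )
```

With the notebook's variables, the result could be passed to the existing call as `client.query(build_search_sql(embed, top_k))`. Rejecting non-finite values before the query is sent is a design choice of this sketch, not something the notebook does.
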
@ -0,0 +1,793 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using MyScale as a vector database for OpenAI embeddings\n",
"\n",
"This notebook provides a step-by-step guide on using MyScale as a vector database for OpenAI embeddings. The process includes:\n",
"\n",
"1. Utilizing precomputed embeddings generated by OpenAI API.\n",
"2. Storing these embeddings in a cloud instance of MyScale.\n",
"3. Converting raw text query to an embedding using OpenAI API.\n",
"4. Leveraging MyScale to perform nearest neighbor search within the created collection.\n",
"\n",
"### What is MyScale\n",
"\n",
"[MyScale](https://myscale.com) is a database built on Clickhouse that combines vector search and SQL analytics to offer a high-performance, streamlined, and fully managed experience. It's designed to facilitate joint queries and analyses on both structured and vector data, with comprehensive SQL support for all data processing.\n",
"\n",
"\n",
"### Deployment options\n",
"\n",
"- Deploy and execute vector search with SQL on your cluster within two minutes by using [MyScale Console](https://console.myscale.com).\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"To follow this guide, you will need to have the following:\n",
"\n",
"1. A MyScale cluster deployed by following the [quickstart guide](https://docs.myscale.com/en/quickstart/).\n",
"2. The 'clickhouse-connect' library to interact with MyScale.\n",
"3. An [OpenAI API key](https://beta.openai.com/account/api-keys) for vectorization of queries."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Install requirements\n",
"\n",
"This notebook requires the `openai`, `clickhouse-connect`, as well as some other dependencies. Use the following command to install them:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2023-02-16T12:05:05.718972Z",
"start_time": "2023-02-16T12:04:30.434820Z"
},
"pycharm": {
"is_executing": true
}
},
"outputs": [],
"source": [
"! pip install openai clickhouse-connect wget pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prepare your OpenAI API key\n",
"\n",
"To use the OpenAI API, you'll need to set up an API key. If you don't have one already, you can obtain it from [OpenAI](https://platform.openai.com/account/api-keys)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2023-02-16T12:05:05.730338Z",
"start_time": "2023-02-16T12:05:05.723351Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<OpenAIObject list at 0x118768f40> JSON: {\n",
" \"data\": [\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"babbage\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"davinci\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"text-davinci-edit-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"babbage-code-search-code\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"text-similarity-babbage-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"code-davinci-edit-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"text-davinci-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"text-davinci-003\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-internal\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"ada\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"babbage-code-search-text\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"babbage-similarity\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"code-search-babbage-text-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"text-curie-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"whisper-1\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-internal\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"code-search-babbage-code-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"text-ada-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"text-embedding-ada-002\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-internal\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"text-similarity-ada-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"curie-instruct-beta\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"ada-code-search-code\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"ada-similarity\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"code-search-ada-text-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"text-search-ada-query-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"davinci-search-document\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"gpt-3.5-turbo-0301\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"ada-code-search-text\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"text-search-ada-doc-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"davinci-instruct-beta\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"text-similarity-curie-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"code-search-ada-code-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"ada-search-query\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"text-search-davinci-query-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"curie-search-query\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"davinci-search-query\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"babbage-search-document\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"ada-search-document\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"text-search-curie-query-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"text-search-babbage-doc-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"curie-search-document\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"text-search-curie-doc-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"babbage-search-query\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"text-babbage-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"text-search-davinci-doc-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"gpt-3.5-turbo\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"text-search-babbage-query-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"curie-similarity\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"curie\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"text-similarity-davinci-001\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"text-davinci-002\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" },\n",
" {\n",
" \"created\": null,\n",
" \"id\": \"davinci-similarity\",\n",
" \"object\": \"engine\",\n",
" \"owner\": \"openai-dev\",\n",
" \"permissions\": null,\n",
" \"ready\": true\n",
" }\n",
" ],\n",
" \"object\": \"list\"\n",
"}"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import openai\n",
"\n",
"# get API key from on OpenAI website\n",
"openai.api_key = \"OPENAI_API_KEY\"\n",
"\n",
"# check we have authenticated\n",
"openai.Engine.list()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Connect to MyScale\n",
"\n",
"Follow the [connections details](https://docs.myscale.com/en/cluster-management/) section to retrieve the cluster host, username, and password information from the MyScale console, and use it to create a connection to your cluster as shown below:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2023-02-16T12:05:06.827143Z",
"start_time": "2023-02-16T12:05:05.733771Z"
}
},
"outputs": [],
"source": [
"import clickhouse_connect\n",
"\n",
"# initialize client\n",
"client = clickhouse_connect.get_client(host='YOUR_CLUSTER_HOST', port=8443, username='YOUR_USERNAME', password='YOUR_CLUSTER_PASSWORD')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to load the dataset of precomputed vector embeddings for Wikipedia articles provided by OpenAI. Use the `wget` package to download the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2023-02-16T12:05:37.371951Z",
"start_time": "2023-02-16T12:05:06.851634Z"
},
"pycharm": {
"is_executing": true
}
},
"outputs": [],
"source": [
"import wget\n",
"\n",
"embeddings_url = \"https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip\"\n",
"\n",
"# The file is ~700 MB so this will take some time\n",
"wget.download(embeddings_url)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After the download is complete, extract the file using the `zipfile` package:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2023-02-16T12:06:01.538851Z",
"start_time": "2023-02-16T12:05:37.376042Z"
}
},
"outputs": [],
"source": [
"import zipfile\n",
"\n",
"with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\", \"r\") as zip_ref:\n",
" zip_ref.extractall(\"../data\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can load the data from `vector_database_wikipedia_articles_embedded.csv` into a Pandas DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"from ast import literal_eval\n",
"\n",
"# read data from csv\n",
"article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')\n",
"article_df = article_df[['id', 'url', 'title', 'text', 'content_vector']]\n",
"\n",
"# read vectors from strings back into a list\n",
"article_df[\"content_vector\"] = article_df.content_vector.apply(literal_eval)\n",
"article_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Index data\n",
"\n",
"We will create an SQL table called `articles` in MyScale to store the embeddings data. The table will include a vector index with a cosine distance metric and a constraint for the length of the embeddings. Use the following code to create and insert data into the articles table:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2023-02-16T12:17:36.366066Z",
"start_time": "2023-02-16T12:17:35.486872Z"
}
},
"outputs": [],
"source": [
"# create articles table with vector index\n",
"embedding_len=len(article_df['content_vector'][0]) # 1536\n",
"\n",
"client.command(f\"\"\"\n",
"CREATE TABLE IF NOT EXISTS default.articles\n",
"(\n",
" id UInt64,\n",
" url String,\n",
" title String,\n",
" text String,\n",
" content_vector Array(Float32),\n",
" CONSTRAINT cons_vector_len CHECK length(content_vector) = {embedding_len},\n",
" VECTOR INDEX article_content_index content_vector TYPE HNSWFLAT('metric_type=Cosine')\n",
")\n",
"ENGINE = MergeTree ORDER BY id\n",
"\"\"\")\n",
"\n",
"# insert data into the table in batches\n",
"from tqdm.auto import tqdm\n",
"\n",
"batch_size = 100\n",
"total_records = len(article_df)\n",
"\n",
"# upload data in batches\n",
"data = article_df.to_records(index=False).tolist()\n",
"column_names = article_df.columns.tolist() \n",
"\n",
"for i in tqdm(range(0, total_records, batch_size)):\n",
" i_end = min(i + batch_size, total_records)\n",
" client.insert(\"default.articles\", data[i:i_end], column_names=column_names)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to check the build status of the vector index before proceeding with the search, as it is automatically built in the background."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"articles count: 25000\n",
"index build status: Built\n"
]
}
],
"source": [
"# check count of inserted data\n",
"print(f\"articles count: {client.command('SELECT count(*) FROM default.articles')}\")\n",
"\n",
"# check the status of the vector index, make sure vector index is ready with 'Built' status\n",
"get_index_status=\"SELECT status FROM system.vector_indices WHERE name='article_content_index'\"\n",
"print(f\"index build status: {client.command(get_index_status)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Search data\n",
"\n",
"Once indexed in MyScale, we can perform vector search to find similar content. First, we will use the OpenAI API to generate embeddings for our query. Then, we will perform the vector search using MyScale."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2023-02-16T12:30:39.379566Z",
"start_time": "2023-02-16T12:30:38.031041Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 Battle of Bannockburn\n",
"2 Wars of Scottish Independence\n",
"3 1651\n",
"4 First War of Scottish Independence\n",
"5 Robert I of Scotland\n",
"6 841\n",
"7 1716\n",
"8 1314\n",
"9 1263\n",
"10 William Wallace\n"
]
}
],
"source": [
"import openai\n",
"\n",
"query = \"Famous battles in Scottish history\"\n",
"\n",
"# creates embedding vector from user query\n",
"embed = openai.Embedding.create(\n",
" input=query,\n",
" model=\"text-embedding-ada-002\",\n",
")[\"data\"][0][\"embedding\"]\n",
"\n",
"# query the database to find the top K similar content to the given query\n",
"top_k = 10\n",
"results = client.query(f\"\"\"\n",
"SELECT id, url, title, distance(content_vector, {embed}) as dist\n",
"FROM default.articles\n",
"ORDER BY dist\n",
"LIMIT {top_k}\n",
"\"\"\")\n",
"\n",
"# display results\n",
"for i, r in enumerate(results.named_results()):\n",
" print(i+1, r['title'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
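
Note on the index status check in this notebook: the cell reads the status once, while the index is built in the background, so a reader may have to re-run it by hand. The following is a minimal sketch of a polling loop that waits for the `Built` status shown in the notebook output. `wait_for_index` and its parameters are hypothetical names, and any status other than `Built` is simply treated as "not ready yet", since the notebook does not list the other status values.

```python
import time
from typing import Callable


def wait_for_index(get_status: Callable[[], str],
                   timeout_s: float = 600.0,
                   poll_interval_s: float = 5.0,
                   sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll `get_status` until it returns 'Built' or the timeout expires.

    Hypothetical helper: `get_status` is any zero-argument callable that
    returns the current index status as a string.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "Built":
            return status
        # Give up once the deadline has passed, reporting the last status seen
        if time.monotonic() >= deadline:
            raise TimeoutError(f"vector index not ready, last status: {status!r}")
        sleep(poll_interval_s)
```

In the notebook this could be called as `wait_for_index(lambda: client.command(get_index_status))`, assuming `client.command` returns the status as a plain string, as the printed output `index build status: Built` suggests. The `sleep` parameter exists only so the loop can be exercised without real waiting.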