{
"cells": [
{
"cell_type": "markdown",
"id": "cb1537e6",
"metadata": {},
"source": [
"# Using Typesense for Embeddings Search\n",
"\n",
"This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.\n",
"\n",
"### What is a Vector Database\n",
"\n",
"A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.\n",
"\n",
"### Why use a Vector Database\n",
"\n",
"Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question and answering, chatbot and recommendation services, for example), and make use of them in a secure, scalable environment. Many of our customers make embeddings solve their problems at small scale but performance and security hold them back from going into production - we see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.\n",
"\n",
"\n",
"### Demo Flow\n",
"The demo flow is:\n",
"- **Setup**: Import packages and set any required variables\n",
"- **Load data**: Load a dataset and embed it using OpenAI embeddings\n",
"- **Typesense**\n",
" - *Setup*: Set up the Typesense Python client. For more details go [here](https://typesense.org/docs/0.24.0/api/)\n",
" - *Index Data*: We'll create a collection and index it for both __titles__ and __content__.\n",
" - *Search Data*: Run a few example queries with various goals in mind.\n",
"\n",
"Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings."
]
},
{
"cell_type": "markdown",
"id": "e2b59250",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"Import the required libraries and set the embedding model that we'd like to use."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d8810f9",
"metadata": {},
"outputs": [],
"source": [
"# We'll need to install the Typesense client\n",
"!pip install typesense\n",
"\n",
"#Install wget to pull zip file\n",
"!pip install wget"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5be94df6",
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"\n",
"from typing import List, Iterator\n",
"import pandas as pd\n",
"import numpy as np\n",
"import os\n",
"import wget\n",
"from ast import literal_eval\n",
"\n",
"# Typesense's client library for Python\n",
"import typesense\n",
"\n",
"# I've set this to our new embeddings model, this can be changed to the embedding model of your choice\n",
"EMBEDDING_MODEL = \"text-embedding-3-small\"\n",
"\n",
"# Ignore unclosed SSL socket warnings - optional in case you get these errors\n",
"import warnings\n",
"\n",
"warnings.filterwarnings(action=\"ignore\", message=\"unclosed\", category=ResourceWarning)\n",
"warnings.filterwarnings(\"ignore\", category=DeprecationWarning) "
]
},
{
"cell_type": "markdown",
"id": "e5d9d2e1",
"metadata": {},
"source": [
"## Load data\n",
"\n",
"In this section we'll load embedded data that we've prepared previous to this session."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5dff8b55",
"metadata": {},
"outputs": [],
"source": [
"embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n",
"\n",
"# The file is ~700 MB so this will take some time\n",
"wget.download(embeddings_url)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "21097972",
"metadata": {},
"outputs": [],
"source": [
"import zipfile\n",
"with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\",\"r\") as zip_ref:\n",
" zip_ref.extractall(\"../data\")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "70bbd8ba",
"metadata": {},
"outputs": [],
"source": [
"article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "1721e45d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>url</th>\n",
" <th>title</th>\n",
" <th>text</th>\n",
" <th>title_vector</th>\n",
" <th>content_vector</th>\n",
" <th>vector_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>https://simple.wikipedia.org/wiki/April</td>\n",
" <td>April</td>\n",
" <td>April is the fourth month of the year in the J...</td>\n",
" <td>[0.001009464613161981, -0.020700545981526375, ...</td>\n",
" <td>[-0.011253940872848034, -0.013491976074874401,...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>https://simple.wikipedia.org/wiki/August</td>\n",
" <td>August</td>\n",
" <td>August (Aug.) is the eighth month of the year ...</td>\n",
" <td>[0.0009286514250561595, 0.000820168002974242, ...</td>\n",
" <td>[0.0003609954728744924, 0.007262262050062418, ...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>6</td>\n",
" <td>https://simple.wikipedia.org/wiki/Art</td>\n",
" <td>Art</td>\n",
" <td>Art is a creative activity that expresses imag...</td>\n",
" <td>[0.003393713850528002, 0.0061537534929811954, ...</td>\n",
" <td>[-0.004959689453244209, 0.015772193670272827, ...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>8</td>\n",
" <td>https://simple.wikipedia.org/wiki/A</td>\n",
" <td>A</td>\n",
" <td>A or a is the first letter of the English alph...</td>\n",
" <td>[0.0153952119871974, -0.013759135268628597, 0....</td>\n",
" <td>[0.024894846603274345, -0.022186409682035446, ...</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>9</td>\n",
" <td>https://simple.wikipedia.org/wiki/Air</td>\n",
" <td>Air</td>\n",
" <td>Air refers to the Earth's atmosphere. Air is a...</td>\n",
" <td>[0.02224554680287838, -0.02044147066771984, -0...</td>\n",
" <td>[0.021524671465158463, 0.018522677943110466, -...</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id url title \\\n",
"0 1 https://simple.wikipedia.org/wiki/April April \n",
"1 2 https://simple.wikipedia.org/wiki/August August \n",
"2 6 https://simple.wikipedia.org/wiki/Art Art \n",
"3 8 https://simple.wikipedia.org/wiki/A A \n",
"4 9 https://simple.wikipedia.org/wiki/Air Air \n",
"\n",
" text \\\n",
"0 April is the fourth month of the year in the J... \n",
"1 August (Aug.) is the eighth month of the year ... \n",
"2 Art is a creative activity that expresses imag... \n",
"3 A or a is the first letter of the English alph... \n",
"4 Air refers to the Earth's atmosphere. Air is a... \n",
"\n",
" title_vector \\\n",
"0 [0.001009464613161981, -0.020700545981526375, ... \n",
"1 [0.0009286514250561595, 0.000820168002974242, ... \n",
"2 [0.003393713850528002, 0.0061537534929811954, ... \n",
"3 [0.0153952119871974, -0.013759135268628597, 0.... \n",
"4 [0.02224554680287838, -0.02044147066771984, -0... \n",
"\n",
" content_vector vector_id \n",
"0 [-0.011253940872848034, -0.013491976074874401,... 0 \n",
"1 [0.0003609954728744924, 0.007262262050062418, ... 1 \n",
"2 [-0.004959689453244209, 0.015772193670272827, ... 2 \n",
"3 [0.024894846603274345, -0.022186409682035446, ... 3 \n",
"4 [0.021524671465158463, 0.018522677943110466, -... 4 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"article_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "960b82af",
"metadata": {},
"outputs": [],
"source": [
"# Read vectors from strings back into a list\n",
"article_df['title_vector'] = article_df.title_vector.apply(literal_eval)\n",
"article_df['content_vector'] = article_df.content_vector.apply(literal_eval)\n",
"\n",
"# Set vector_id to be a string\n",
"article_df['vector_id'] = article_df['vector_id'].apply(str)"
]
},
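{
"cell_type": "markdown",
"id": "3a9c0d1e",
"metadata": {},
"source": [
"The CSV stores each embedding as the string representation of a Python list, which is why `literal_eval` is used above to parse each one back into an actual list of floats. A minimal sketch of that round trip, using a made-up 3-dimensional vector (the real ones have 1536 dimensions):\n",
"\n",
"```python\n",
"from ast import literal_eval\n",
"\n",
"# A vector serialized as a string, as it appears in the CSV\n",
"raw = '[0.001, -0.0207, 0.0153]'\n",
"\n",
"vector = literal_eval(raw)  # safely parse the string back into a list\n",
"assert vector == [0.001, -0.0207, 0.0153]\n",
"assert all(isinstance(v, float) for v in vector)\n",
"```"
]
},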
{
"cell_type": "code",
"execution_count": 5,
"id": "a334ab8b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 25000 entries, 0 to 24999\n",
"Data columns (total 7 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 id 25000 non-null int64 \n",
" 1 url 25000 non-null object\n",
" 2 title 25000 non-null object\n",
" 3 text 25000 non-null object\n",
" 4 title_vector 25000 non-null object\n",
" 5 content_vector 25000 non-null object\n",
" 6 vector_id 25000 non-null object\n",
"dtypes: int64(1), object(6)\n",
"memory usage: 1.3+ MB\n"
]
}
],
"source": [
"article_df.info(show_counts=True)"
]
},
{
"cell_type": "markdown",
"id": "bb09e0ec",
"metadata": {},
"source": [
"## Typesense\n",
"\n",
"The next vector store we'll look at is [Typesense](https://typesense.org/), which is an open source, in-memory search engine, that you can either self-host or run on [Typesense Cloud](https://cloud.typesense.org).\n",
"\n",
"Typesense focuses on performance by storing the entire index in RAM (with a backup on disk) and also focuses on providing an out-of-the-box developer experience by simplifying available options and setting good defaults. It also lets you combine attribute-based filtering together with vector queries.\n",
"\n",
"For this example, we will set up a local docker-based Typesense server, index our vectors in Typesense and then do some nearest-neighbor search queries. If you use Typesense Cloud, you can skip the docker setup part and just obtain the hostname and API keys from your cluster dashboard."
]
},
{
"cell_type": "markdown",
"id": "bd629f7d",
"metadata": {},
"source": [
"### Setup\n",
"\n",
"To run Typesense locally, you'll need [Docker](https://www.docker.com/). Following the instructions contained in the Typesense documentation [here](https://typesense.org/docs/guide/install-typesense.html#docker-compose), we created an example docker-compose.yml file in this repo saved at [./typesense/docker-compose.yml](./typesense/docker-compose.yml).\n",
"\n",
"After starting Docker, you can start Typesense locally by navigating to the `examples/vector_databases/typesense/` directory and running `docker-compose up -d`.\n",
"\n",
"The default API key is set to `xyz` in the Docker compose file, and the default Typesense port to `8108`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "2bee46b1",
"metadata": {},
"outputs": [],
"source": [
"import typesense\n",
"\n",
"typesense_client = \\\n",
" typesense.Client({\n",
" \"nodes\": [{\n",
" \"host\": \"localhost\", # For Typesense Cloud use xxx.a1.typesense.net\n",
" \"port\": \"8108\", # For Typesense Cloud use 443\n",
" \"protocol\": \"http\" # For Typesense Cloud use https\n",
" }],\n",
" \"api_key\": \"xyz\",\n",
" \"connection_timeout_seconds\": 60\n",
" })"
]
},
{
"cell_type": "markdown",
"id": "11910afb",
"metadata": {},
"source": [
"### Index data\n",
"\n",
"To index vectors in Typesense, we'll first create a Collection (which is a collection of Documents) and turn on vector indexing for a particular field. You can even store multiple vector fields in a single document."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "dd055c80",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'created_at': 1687165065, 'default_sorting_field': '', 'enable_nested_fields': False, 'fields': [{'facet': False, 'index': True, 'infix': False, 'locale': '', 'name': 'content_vector', 'num_dim': 1536, 'optional': False, 'sort': False, 'type': 'float[]'}, {'facet': False, 'index': True, 'infix': False, 'locale': '', 'name': 'title_vector', 'num_dim': 1536, 'optional': False, 'sort': False, 'type': 'float[]'}], 'name': 'wikipedia_articles', 'num_documents': 0, 'symbols_to_index': [], 'token_separators': []}\n",
"Created new collection wikipedia-articles\n"
]
}
],
"source": [
"# Delete existing collections if they already exist\n",
"try:\n",
" typesense_client.collections['wikipedia_articles'].delete()\n",
"except Exception as e:\n",
" pass\n",
"\n",
"# Create a new collection\n",
"\n",
"schema = {\n",
" \"name\": \"wikipedia_articles\",\n",
" \"fields\": [\n",
" {\n",
" \"name\": \"content_vector\",\n",
" \"type\": \"float[]\",\n",
" \"num_dim\": len(article_df['content_vector'][0])\n",
" },\n",
" {\n",
" \"name\": \"title_vector\",\n",
" \"type\": \"float[]\",\n",
" \"num_dim\": len(article_df['title_vector'][0])\n",
" }\n",
" ]\n",
"}\n",
"\n",
"create_response = typesense_client.collections.create(schema)\n",
"print(create_response)\n",
"\n",
"print(\"Created new collection wikipedia-articles\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "94bbbb11",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Indexing vectors in Typesense...\n",
"Processed 100 / 25000 \n",
"Processed 200 / 25000 \n",
"Processed 300 / 25000 \n",
"Processed 400 / 25000 \n",
"Processed 500 / 25000 \n",
"Processed 600 / 25000 \n",
"Processed 700 / 25000 \n",
"Processed 800 / 25000 \n",
"Processed 900 / 25000 \n",
"Processed 1000 / 25000 \n",
"Processed 1100 / 25000 \n",
"Processed 1200 / 25000 \n",
"Processed 1300 / 25000 \n",
"Processed 1400 / 25000 \n",
"Processed 1500 / 25000 \n",
"Processed 1600 / 25000 \n",
"Processed 1700 / 25000 \n",
"Processed 1800 / 25000 \n",
"Processed 1900 / 25000 \n",
"Processed 2000 / 25000 \n",
"Processed 2100 / 25000 \n",
"Processed 2200 / 25000 \n",
"Processed 2300 / 25000 \n",
"Processed 2400 / 25000 \n",
"Processed 2500 / 25000 \n",
"Processed 2600 / 25000 \n",
"Processed 2700 / 25000 \n",
"Processed 2800 / 25000 \n",
"Processed 2900 / 25000 \n",
"Processed 3000 / 25000 \n",
"Processed 3100 / 25000 \n",
"Processed 3200 / 25000 \n",
"Processed 3300 / 25000 \n",
"Processed 3400 / 25000 \n",
"Processed 3500 / 25000 \n",
"Processed 3600 / 25000 \n",
"Processed 3700 / 25000 \n",
"Processed 3800 / 25000 \n",
"Processed 3900 / 25000 \n",
"Processed 4000 / 25000 \n",
"Processed 4100 / 25000 \n",
"Processed 4200 / 25000 \n",
"Processed 4300 / 25000 \n",
"Processed 4400 / 25000 \n",
"Processed 4500 / 25000 \n",
"Processed 4600 / 25000 \n",
"Processed 4700 / 25000 \n",
"Processed 4800 / 25000 \n",
"Processed 4900 / 25000 \n",
"Processed 5000 / 25000 \n",
"Processed 5100 / 25000 \n",
"Processed 5200 / 25000 \n",
"Processed 5300 / 25000 \n",
"Processed 5400 / 25000 \n",
"Processed 5500 / 25000 \n",
"Processed 5600 / 25000 \n",
"Processed 5700 / 25000 \n",
"Processed 5800 / 25000 \n",
"Processed 5900 / 25000 \n",
"Processed 6000 / 25000 \n",
"Processed 6100 / 25000 \n",
"Processed 6200 / 25000 \n",
"Processed 6300 / 25000 \n",
"Processed 6400 / 25000 \n",
"Processed 6500 / 25000 \n",
"Processed 6600 / 25000 \n",
"Processed 6700 / 25000 \n",
"Processed 6800 / 25000 \n",
"Processed 6900 / 25000 \n",
"Processed 7000 / 25000 \n",
"Processed 7100 / 25000 \n",
"Processed 7200 / 25000 \n",
"Processed 7300 / 25000 \n",
"Processed 7400 / 25000 \n",
"Processed 7500 / 25000 \n",
"Processed 7600 / 25000 \n",
"Processed 7700 / 25000 \n",
"Processed 7800 / 25000 \n",
"Processed 7900 / 25000 \n",
"Processed 8000 / 25000 \n",
"Processed 8100 / 25000 \n",
"Processed 8200 / 25000 \n",
"Processed 8300 / 25000 \n",
"Processed 8400 / 25000 \n",
"Processed 8500 / 25000 \n",
"Processed 8600 / 25000 \n",
"Processed 8700 / 25000 \n",
"Processed 8800 / 25000 \n",
"Processed 8900 / 25000 \n",
"Processed 9000 / 25000 \n",
"Processed 9100 / 25000 \n",
"Processed 9200 / 25000 \n",
"Processed 9300 / 25000 \n",
"Processed 9400 / 25000 \n",
"Processed 9500 / 25000 \n",
"Processed 9600 / 25000 \n",
"Processed 9700 / 25000 \n",
"Processed 9800 / 25000 \n",
"Processed 9900 / 25000 \n",
"Processed 10000 / 25000 \n",
"Processed 10100 / 25000 \n",
"Processed 10200 / 25000 \n",
"Processed 10300 / 25000 \n",
"Processed 10400 / 25000 \n",
"Processed 10500 / 25000 \n",
"Processed 10600 / 25000 \n",
"Processed 10700 / 25000 \n",
"Processed 10800 / 25000 \n",
"Processed 10900 / 25000 \n",
"Processed 11000 / 25000 \n",
"Processed 11100 / 25000 \n",
"Processed 11200 / 25000 \n",
"Processed 11300 / 25000 \n",
"Processed 11400 / 25000 \n",
"Processed 11500 / 25000 \n",
"Processed 11600 / 25000 \n",
"Processed 11700 / 25000 \n",
"Processed 11800 / 25000 \n",
"Processed 11900 / 25000 \n",
"Processed 12000 / 25000 \n",
"Processed 12100 / 25000 \n",
"Processed 12200 / 25000 \n",
"Processed 12300 / 25000 \n",
"Processed 12400 / 25000 \n",
"Processed 12500 / 25000 \n",
"Processed 12600 / 25000 \n",
"Processed 12700 / 25000 \n",
"Processed 12800 / 25000 \n",
"Processed 12900 / 25000 \n",
"Processed 13000 / 25000 \n",
"Processed 13100 / 25000 \n",
"Processed 13200 / 25000 \n",
"Processed 13300 / 25000 \n",
"Processed 13400 / 25000 \n",
"Processed 13500 / 25000 \n",
"Processed 13600 / 25000 \n",
"Processed 13700 / 25000 \n",
"Processed 13800 / 25000 \n",
"Processed 13900 / 25000 \n",
"Processed 14000 / 25000 \n",
"Processed 14100 / 25000 \n",
"Processed 14200 / 25000 \n",
"Processed 14300 / 25000 \n",
"Processed 14400 / 25000 \n",
"Processed 14500 / 25000 \n",
"Processed 14600 / 25000 \n",
"Processed 14700 / 25000 \n",
"Processed 14800 / 25000 \n",
"Processed 14900 / 25000 \n",
"Processed 15000 / 25000 \n",
"Processed 15100 / 25000 \n",
"Processed 15200 / 25000 \n",
"Processed 15300 / 25000 \n",
"Processed 15400 / 25000 \n",
"Processed 15500 / 25000 \n",
"Processed 15600 / 25000 \n",
"Processed 15700 / 25000 \n",
"Processed 15800 / 25000 \n",
"Processed 15900 / 25000 \n",
"Processed 16000 / 25000 \n",
"Processed 16100 / 25000 \n",
"Processed 16200 / 25000 \n",
"Processed 16300 / 25000 \n",
"Processed 16400 / 25000 \n",
"Processed 16500 / 25000 \n",
"Processed 16600 / 25000 \n",
"Processed 16700 / 25000 \n",
"Processed 16800 / 25000 \n",
"Processed 16900 / 25000 \n",
"Processed 17000 / 25000 \n",
"Processed 17100 / 25000 \n",
"Processed 17200 / 25000 \n",
"Processed 17300 / 25000 \n",
"Processed 17400 / 25000 \n",
"Processed 17500 / 25000 \n",
"Processed 17600 / 25000 \n",
"Processed 17700 / 25000 \n",
"Processed 17800 / 25000 \n",
"Processed 17900 / 25000 \n",
"Processed 18000 / 25000 \n",
"Processed 18100 / 25000 \n",
"Processed 18200 / 25000 \n",
"Processed 18300 / 25000 \n",
"Processed 18400 / 25000 \n",
"Processed 18500 / 25000 \n",
"Processed 18600 / 25000 \n",
"Processed 18700 / 25000 \n",
"Processed 18800 / 25000 \n",
"Processed 18900 / 25000 \n",
"Processed 19000 / 25000 \n",
"Processed 19100 / 25000 \n",
"Processed 19200 / 25000 \n",
"Processed 19300 / 25000 \n",
"Processed 19400 / 25000 \n",
"Processed 19500 / 25000 \n",
"Processed 19600 / 25000 \n",
"Processed 19700 / 25000 \n",
"Processed 19800 / 25000 \n",
"Processed 19900 / 25000 \n",
"Processed 20000 / 25000 \n",
"Processed 20100 / 25000 \n",
"Processed 20200 / 25000 \n",
"Processed 20300 / 25000 \n",
"Processed 20400 / 25000 \n",
"Processed 20500 / 25000 \n",
"Processed 20600 / 25000 \n",
"Processed 20700 / 25000 \n",
"Processed 20800 / 25000 \n",
"Processed 20900 / 25000 \n",
"Processed 21000 / 25000 \n",
"Processed 21100 / 25000 \n",
"Processed 21200 / 25000 \n",
"Processed 21300 / 25000 \n",
"Processed 21400 / 25000 \n",
"Processed 21500 / 25000 \n",
"Processed 21600 / 25000 \n",
"Processed 21700 / 25000 \n",
"Processed 21800 / 25000 \n",
"Processed 21900 / 25000 \n",
"Processed 22000 / 25000 \n",
"Processed 22100 / 25000 \n",
"Processed 22200 / 25000 \n",
"Processed 22300 / 25000 \n",
"Processed 22400 / 25000 \n",
"Processed 22500 / 25000 \n",
"Processed 22600 / 25000 \n",
"Processed 22700 / 25000 \n",
"Processed 22800 / 25000 \n",
"Processed 22900 / 25000 \n",
"Processed 23000 / 25000 \n",
"Processed 23100 / 25000 \n",
"Processed 23200 / 25000 \n",
"Processed 23300 / 25000 \n",
"Processed 23400 / 25000 \n",
"Processed 23500 / 25000 \n",
"Processed 23600 / 25000 \n",
"Processed 23700 / 25000 \n",
"Processed 23800 / 25000 \n",
"Processed 23900 / 25000 \n",
"Processed 24000 / 25000 \n",
"Processed 24100 / 25000 \n",
"Processed 24200 / 25000 \n",
"Processed 24300 / 25000 \n",
"Processed 24400 / 25000 \n",
"Processed 24500 / 25000 \n",
"Processed 24600 / 25000 \n",
"Processed 24700 / 25000 \n",
"Processed 24800 / 25000 \n",
"Processed 24900 / 25000 \n",
"Processed 25000 / 25000 \n",
"Imported (25000) articles.\n"
]
}
],
"source": [
"# Upsert the vector data into the collection we just created\n",
"#\n",
"# Note: This can take a few minutes, especially if your on an M1 and running docker in an emulated mode\n",
"\n",
"print(\"Indexing vectors in Typesense...\")\n",
"\n",
"document_counter = 0\n",
"documents_batch = []\n",
"\n",
"for k,v in article_df.iterrows():\n",
" # Create a document with the vector data\n",
"\n",
" # Notice how you can add any fields that you haven't added to the schema to the document.\n",
" # These will be stored on disk and returned when the document is a hit.\n",
" # This is useful to store attributes required for display purposes.\n",
"\n",
" document = {\n",
" \"title_vector\": v[\"title_vector\"],\n",
" \"content_vector\": v[\"content_vector\"],\n",
" \"title\": v[\"title\"],\n",
" \"content\": v[\"text\"],\n",
" }\n",
" documents_batch.append(document)\n",
" document_counter = document_counter + 1\n",
"\n",
" # Upsert a batch of 100 documents\n",
" if document_counter % 100 == 0 or document_counter == len(article_df):\n",
" response = typesense_client.collections['wikipedia_articles'].documents.import_(documents_batch)\n",
" # print(response)\n",
"\n",
" documents_batch = []\n",
" print(f\"Processed {document_counter} / {len(article_df)} \")\n",
"\n",
"print(f\"Imported ({len(article_df)}) articles.\")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "f774ecb2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collection has 25000 documents\n"
]
}
],
"source": [
"# Check the number of documents imported\n",
"\n",
"collection = typesense_client.collections['wikipedia_articles'].retrieve()\n",
"print(f'Collection has {collection[\"num_documents\"]} documents')"
]
},
{
"cell_type": "markdown",
"id": "fbc6f5c5",
"metadata": {},
"source": [
"### Search Data\n",
"\n",
"Now that we've imported the vectors into Typesense, we can do a nearest neighbor search on the `title_vector` or `content_vector` field."
]
},
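{
"cell_type": "markdown",
"id": "6e2f8b4d",
"metadata": {},
"source": [
"Typesense takes the query vector inline, as a `vector_query` string of the form `field:([v1,v2,...], k:n)`, where `k` is the number of nearest neighbours to return. A small sketch of how that string is assembled, using a hypothetical 3-dimensional embedding for brevity:\n",
"\n",
"```python\n",
"# Hypothetical short embedding; real text-embedding-3-small vectors have 1536 dimensions\n",
"embedded_query = [0.12, -0.34, 0.56]\n",
"top_k = 10\n",
"\n",
"vector_query = f\"title_vector:([{','.join(str(v) for v in embedded_query)}], k:{top_k})\"\n",
"assert vector_query == 'title_vector:([0.12,-0.34,0.56], k:10)'\n",
"```\n",
"\n",
"The helper function below builds exactly this kind of string from a real embedding."
]
},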
{
"cell_type": "code",
"execution_count": 10,
"id": "d9a3f0dc",
"metadata": {},
"outputs": [],
"source": [
"def query_typesense(query, field='title', top_k=20):\n",
"\n",
" # Creates embedding vector from user query\n",
" openai.api_key = os.getenv(\"OPENAI_API_KEY\", \"sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\")\n",
" embedded_query = openai.Embedding.create(\n",
" input=query,\n",
" model=EMBEDDING_MODEL,\n",
" )['data'][0]['embedding']\n",
"\n",
" typesense_results = typesense_client.multi_search.perform({\n",
" \"searches\": [{\n",
" \"q\": \"*\",\n",
" \"collection\": \"wikipedia_articles\",\n",
" \"vector_query\": f\"{field}_vector:([{','.join(str(v) for v in embedded_query)}], k:{top_k})\"\n",
" }]\n",
" }, {})\n",
"\n",
" return typesense_results"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "24183c36",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1. Museum of Modern Art (Distance: 0.12482291460037231)\n",
"2. Western Europe (Distance: 0.13255876302719116)\n",
"3. Renaissance art (Distance: 0.13584274053573608)\n",
"4. Pop art (Distance: 0.1396539807319641)\n",
"5. Northern Europe (Distance: 0.14534103870391846)\n",
"6. Hellenistic art (Distance: 0.1472070813179016)\n",
"7. Modernist literature (Distance: 0.15296930074691772)\n",
"8. Art film (Distance: 0.1567266583442688)\n",
"9. Central Europe (Distance: 0.15741699934005737)\n",
"10. European (Distance: 0.1585891842842102)\n"
]
}
],
"source": [
"query_results = query_typesense('modern art in Europe', 'title')\n",
"\n",
"for i, hit in enumerate(query_results['results'][0]['hits']):\n",
" document = hit[\"document\"]\n",
" vector_distance = hit[\"vector_distance\"]\n",
" print(f'{i + 1}. {document[\"title\"]} (Distance: {vector_distance})')"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "a64e3c80",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1. Battle of Bannockburn (Distance: 0.1306111216545105)\n",
"2. Wars of Scottish Independence (Distance: 0.1384994387626648)\n",
"3. 1651 (Distance: 0.14744246006011963)\n",
"4. First War of Scottish Independence (Distance: 0.15033596754074097)\n",
"5. Robert I of Scotland (Distance: 0.15376019477844238)\n",
"6. 841 (Distance: 0.15609073638916016)\n",
"7. 1716 (Distance: 0.15615153312683105)\n",
"8. 1314 (Distance: 0.16280347108840942)\n",
"9. 1263 (Distance: 0.16361045837402344)\n",
"10. William Wallace (Distance: 0.16464537382125854)\n"
]
}
],
"source": [
"query_results = query_typesense('Famous battles in Scottish history', 'content')\n",
"\n",
"for i, hit in enumerate(query_results['results'][0]['hits']):\n",
" document = hit[\"document\"]\n",
" vector_distance = hit[\"vector_distance\"]\n",
" print(f'{i + 1}. {document[\"title\"]} (Distance: {vector_distance})')"
]
},
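{
"cell_type": "markdown",
"id": "9d4e7a21",
"metadata": {},
"source": [
"The `vector_distance` values in these results are cosine distances (1 minus cosine similarity), assuming Typesense's default distance metric, so lower values mean semantically closer matches. A quick sketch of the computation with numpy:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def cosine_distance(a, b):\n",
"    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)\n",
"    # 1 - cosine similarity: 0 for identical directions, 1 for orthogonal vectors\n",
"    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))\n",
"\n",
"assert abs(cosine_distance([1.0, 0.0], [1.0, 0.0])) < 1e-9\n",
"assert abs(cosine_distance([1.0, 0.0], [0.0, 1.0]) - 1.0) < 1e-9\n",
"```"
]
},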
{
"cell_type": "markdown",
"id": "55afccbf",
"metadata": {},
"source": [
"Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0119d87a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "vector_db_split",
"language": "python",
"name": "vector_db_split"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
},
"vscode": {
"interpreter": {
"hash": "fd16a328ca3d68029457069b79cb0b38eb39a0f5ccc4fe4473d3047707df8207"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}