{
"cells": [
{
"cell_type": "markdown",
"id": "cb1537e6",
"metadata": {},
"source": [
"# Using Qdrant for Embeddings Search\n",
"\n",
"This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.\n",
"\n",
"### What is a Vector Database\n",
"\n",
"A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.\n",
"\n",
"### Why use a Vector Database\n",
"\n",
"Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question and answering, chatbot and recommendation services, for example), and make use of them in a secure, scalable environment. Many of our customers make embeddings solve their problems at small scale but performance and security hold them back from going into production - we see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.\n",
"\n",
"\n",
"### Demo Flow\n",
"The demo flow is:\n",
"- **Setup**: Import packages and set any required variables\n",
"- **Load data**: Load a dataset and embed it using OpenAI embeddings\n",
"- **Qdrant**\n",
" - *Setup*: Here we'll set up the Python client for Qdrant. For more details go [here](https://github.com/qdrant/qdrant_client)\n",
" - *Index Data*: We'll create a collection with vectors for __titles__ and __content__\n",
" - *Search Data*: We'll run a few searches to confirm it works\n",
"\n",
"Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings."
]
},
{
"cell_type": "markdown",
"id": "e2b59250",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"Import the required libraries and set the embedding model that we'd like to use."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "8d8810f9",
"metadata": {
"ExecuteTime": {
"end_time": "2023-06-29T12:59:21.344233180Z",
"start_time": "2023-06-29T12:59:00.815088712Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting qdrant-client\r\n",
" ...\r\n",
"Successfully installed certifi-2023.5.7 grpcio-1.56.0 grpcio-tools-1.56.0 h11-0.14.0 h2-4.1.0 hpack-4.0.0 httpcore-0.17.2 httpx-0.24.1 hyperframe-6.0.1 numpy-1.25.0 portalocker-2.7.0 protobuf-4.23.3 pydantic-1.10.9 qdrant-client-1.3.1 typing-extensions-4.5.0 urllib3-1.26.16\r\n",
"Collecting wget\r\n",
" Using cached wget-3.2.zip (10 kB)\r\n",
" Preparing metadata (setup.py) ... \u001b[?25ldone\r\n",
"\u001b[?25hBuilding wheels for collected packages: wget\r\n",
" Building wheel for wget (setup.py) ... \u001b[?25ldone\r\n",
"\u001b[?25h Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9657 sha256=eb5f15f12150fc304e7b14973424f696fa8d95225772bc0cbc0b318bf92e04b9\r\n",
" Stored in directory: /home/user/.cache/pip/wheels/04/5f/3e/46cc37c5d698415694d83f607f833f83f0149e49b3af9d0f38\r\n",
"Successfully built wget\r\n",
"Installing collected packages: wget\r\n",
"Successfully installed wget-3.2\r\n"
]
}
],
"source": [
"# We'll need to install Qdrant client\n",
"!pip install qdrant-client\n",
"\n",
"#Install wget to pull zip file\n",
"!pip install wget"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "5be94df6",
"metadata": {
"ExecuteTime": {
"end_time": "2023-06-29T13:00:32.715638041Z",
"start_time": "2023-06-29T13:00:31.654032435Z"
}
},
"outputs": [],
"source": [
"import openai\n",
"\n",
"from typing import List, Iterator\n",
"import pandas as pd\n",
"import numpy as np\n",
"import os\n",
"import wget\n",
"from ast import literal_eval\n",
"\n",
"# Qdrant's client library for Python\n",
"import qdrant_client\n",
"\n",
"# I've set this to our new embeddings model, this can be changed to the embedding model of your choice\n",
"EMBEDDING_MODEL = \"text-embedding-3-small\"\n",
"\n",
"# Ignore unclosed SSL socket warnings - optional in case you get these errors\n",
"import warnings\n",
"\n",
"warnings.filterwarnings(action=\"ignore\", message=\"unclosed\", category=ResourceWarning)\n",
"warnings.filterwarnings(\"ignore\", category=DeprecationWarning) "
]
},
{
"cell_type": "markdown",
"id": "e5d9d2e1",
"metadata": {},
"source": [
"## Load data\n",
"\n",
"In this section we'll load embedded data that we've prepared previous to this session."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "5dff8b55",
"metadata": {
"ExecuteTime": {
"end_time": "2023-06-29T13:02:47.656128622Z",
"start_time": "2023-06-29T13:00:39.079229873Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'vector_database_wikipedia_articles_embedded.zip'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n",
"\n",
"# The file is ~700 MB so this will take some time\n",
"wget.download(embeddings_url)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "21097972",
"metadata": {
"ExecuteTime": {
"end_time": "2023-06-29T13:03:08.268413005Z",
"start_time": "2023-06-29T13:02:47.626254476Z"
}
},
"outputs": [],
"source": [
"import zipfile\n",
"with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\",\"r\") as zip_ref:\n",
" zip_ref.extractall(\"../data\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "70bbd8ba",
"metadata": {
"ExecuteTime": {
"end_time": "2023-06-29T13:03:28.291797292Z",
"start_time": "2023-06-29T13:03:08.269033964Z"
}
},
"outputs": [],
"source": [
"article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "1721e45d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>url</th>\n",
" <th>title</th>\n",
" <th>text</th>\n",
" <th>title_vector</th>\n",
" <th>content_vector</th>\n",
" <th>vector_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>https://simple.wikipedia.org/wiki/April</td>\n",
" <td>April</td>\n",
" <td>April is the fourth month of the year in the J...</td>\n",
" <td>[0.001009464613161981, -0.020700545981526375, ...</td>\n",
" <td>[-0.011253940872848034, -0.013491976074874401,...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>https://simple.wikipedia.org/wiki/August</td>\n",
" <td>August</td>\n",
" <td>August (Aug.) is the eighth month of the year ...</td>\n",
" <td>[0.0009286514250561595, 0.000820168002974242, ...</td>\n",
" <td>[0.0003609954728744924, 0.007262262050062418, ...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>6</td>\n",
" <td>https://simple.wikipedia.org/wiki/Art</td>\n",
" <td>Art</td>\n",
" <td>Art is a creative activity that expresses imag...</td>\n",
" <td>[0.003393713850528002, 0.0061537534929811954, ...</td>\n",
" <td>[-0.004959689453244209, 0.015772193670272827, ...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>8</td>\n",
" <td>https://simple.wikipedia.org/wiki/A</td>\n",
" <td>A</td>\n",
" <td>A or a is the first letter of the English alph...</td>\n",
" <td>[0.0153952119871974, -0.013759135268628597, 0....</td>\n",
" <td>[0.024894846603274345, -0.022186409682035446, ...</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>9</td>\n",
" <td>https://simple.wikipedia.org/wiki/Air</td>\n",
" <td>Air</td>\n",
" <td>Air refers to the Earth's atmosphere. Air is a...</td>\n",
" <td>[0.02224554680287838, -0.02044147066771984, -0...</td>\n",
" <td>[0.021524671465158463, 0.018522677943110466, -...</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id url title \\\n",
"0 1 https://simple.wikipedia.org/wiki/April April \n",
"1 2 https://simple.wikipedia.org/wiki/August August \n",
"2 6 https://simple.wikipedia.org/wiki/Art Art \n",
"3 8 https://simple.wikipedia.org/wiki/A A \n",
"4 9 https://simple.wikipedia.org/wiki/Air Air \n",
"\n",
" text \\\n",
"0 April is the fourth month of the year in the J... \n",
"1 August (Aug.) is the eighth month of the year ... \n",
"2 Art is a creative activity that expresses imag... \n",
"3 A or a is the first letter of the English alph... \n",
"4 Air refers to the Earth's atmosphere. Air is a... \n",
"\n",
" title_vector \\\n",
"0 [0.001009464613161981, -0.020700545981526375, ... \n",
"1 [0.0009286514250561595, 0.000820168002974242, ... \n",
"2 [0.003393713850528002, 0.0061537534929811954, ... \n",
"3 [0.0153952119871974, -0.013759135268628597, 0.... \n",
"4 [0.02224554680287838, -0.02044147066771984, -0... \n",
"\n",
" content_vector vector_id \n",
"0 [-0.011253940872848034, -0.013491976074874401,... 0 \n",
"1 [0.0003609954728744924, 0.007262262050062418, ... 1 \n",
"2 [-0.004959689453244209, 0.015772193670272827, ... 2 \n",
"3 [0.024894846603274345, -0.022186409682035446, ... 3 \n",
"4 [0.021524671465158463, 0.018522677943110466, -... 4 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"article_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "960b82af",
"metadata": {},
"outputs": [],
"source": [
"# Read vectors from strings back into a list\n",
"article_df['title_vector'] = article_df.title_vector.apply(literal_eval)\n",
"article_df['content_vector'] = article_df.content_vector.apply(literal_eval)\n",
"\n",
"# Set vector_id to be a string\n",
"article_df['vector_id'] = article_df['vector_id'].apply(str)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "a334ab8b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 25000 entries, 0 to 24999\n",
"Data columns (total 7 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 id 25000 non-null int64 \n",
" 1 url 25000 non-null object\n",
" 2 title 25000 non-null object\n",
" 3 text 25000 non-null object\n",
" 4 title_vector 25000 non-null object\n",
" 5 content_vector 25000 non-null object\n",
" 6 vector_id 25000 non-null object\n",
"dtypes: int64(1), object(6)\n",
"memory usage: 1.3+ MB\n"
]
}
],
"source": [
"article_df.info(show_counts=True)"
]
},
{
"cell_type": "markdown",
"id": "9cfaed9d",
"metadata": {},
"source": [
"## Qdrant\n",
"\n",
"**[Qdrant](https://qdrant.tech/)**. is a high-performant vector search database written in Rust. It offers both on-premise and cloud version, but for the purposes of that example we're going to use the local deployment mode.\n",
"\n",
"Setting everything up will require:\n",
"- Spinning up a local instance of Qdrant\n",
"- Configuring the collection and storing the data in it\n",
"- Trying out with some queries"
]
},
{
"cell_type": "markdown",
"id": "38774565",
"metadata": {},
"source": [
"### Setup\n",
"\n",
"For the local deployment, we are going to use Docker, according to the Qdrant documentation: https://qdrant.tech/documentation/quick_start/. Qdrant requires just a single container, but an example of the docker-compose.yaml file is available at `./qdrant/docker-compose.yaml` in this repo.\n",
"\n",
"You can start Qdrant instance locally by navigating to this directory and running `docker-compose up -d `"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "76d697e9",
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-18T09:28:38.928205Z",
"start_time": "2023-01-18T09:28:38.913987Z"
}
},
"outputs": [],
"source": [
"qdrant = qdrant_client.QdrantClient(host='localhost', prefer_grpc=True)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "1deeb539",
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-18T09:29:19.806639Z",
"start_time": "2023-01-18T09:29:19.727897Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"CollectionsResponse(collections=[CollectionDescription(name='Routines')])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"qdrant.get_collections()"
]
},
{
"cell_type": "markdown",
"id": "bc006b6f",
"metadata": {},
"source": [
"### Index data\n",
"\n",
"Qdrant stores data in __collections__ where each object is described by at least one vector and may contain an additional metadata called __payload__. Our collection will be called **Articles** and each object will be described by both **title** and **content** vectors.\n",
"\n",
"We'll be using an official [qdrant-client](https://github.com/qdrant/qdrant_client) package that has all the utility methods already built-in."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "1a84ee1d",
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-18T09:29:22.530121Z",
"start_time": "2023-01-18T09:29:22.524604Z"
}
},
"outputs": [],
"source": [
"from qdrant_client.http import models as rest"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "00876f92",
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-18T09:31:14.413334Z",
"start_time": "2023-01-18T09:31:13.619079Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vector_size = len(article_df['content_vector'][0])\n",
"\n",
"qdrant.recreate_collection(\n",
" collection_name='Articles',\n",
" vectors_config={\n",
" 'title': rest.VectorParams(\n",
" distance=rest.Distance.COSINE,\n",
" size=vector_size,\n",
" ),\n",
" 'content': rest.VectorParams(\n",
" distance=rest.Distance.COSINE,\n",
" size=vector_size,\n",
" ),\n",
" }\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f24e76ab",
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-18T09:36:28.597535Z",
"start_time": "2023-01-18T09:36:24.108867Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"qdrant.upsert(\n",
" collection_name='Articles',\n",
" points=[\n",
" rest.PointStruct(\n",
" id=k,\n",
" vector={\n",
" 'title': v['title_vector'],\n",
" 'content': v['content_vector'],\n",
" },\n",
" payload=v.to_dict(),\n",
" )\n",
" for k, v in article_df.iterrows()\n",
" ],\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "d1188a12",
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-18T09:58:13.825886Z",
"start_time": "2023-01-18T09:58:13.816248Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"CountResult(count=25000)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check the collection size to make sure all the points have been stored\n",
"qdrant.count(collection_name='Articles')"
]
},
{
"cell_type": "markdown",
"id": "06ed119b",
"metadata": {},
"source": [
"### Search Data\n",
"\n",
"Once the data is put into Qdrant we will start querying the collection for the closest vectors. We may provide an additional parameter `vector_name` to switch from title to content based search."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "f1bac4ef",
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-18T09:50:35.265647Z",
"start_time": "2023-01-18T09:50:35.256065Z"
}
},
"outputs": [],
"source": [
"def query_qdrant(query, collection_name, vector_name='title', top_k=20):\n",
"\n",
" # Creates embedding vector from user query\n",
" embedded_query = openai.Embedding.create(\n",
" input=query,\n",
" model=EMBEDDING_MODEL,\n",
" )['data'][0]['embedding']\n",
" \n",
" query_results = qdrant.search(\n",
" collection_name=collection_name,\n",
" query_vector=(\n",
" vector_name, embedded_query\n",
" ),\n",
" limit=top_k,\n",
" )\n",
" \n",
" return query_results"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "aa92f3d3",
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-18T09:50:46.545145Z",
"start_time": "2023-01-18T09:50:35.711020Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1. Museum of Modern Art (Score: 0.875)\n",
"2. Western Europe (Score: 0.867)\n",
"3. Renaissance art (Score: 0.864)\n",
"4. Pop art (Score: 0.86)\n",
"5. Northern Europe (Score: 0.855)\n",
"6. Hellenistic art (Score: 0.853)\n",
"7. Modernist literature (Score: 0.847)\n",
"8. Art film (Score: 0.843)\n",
"9. Central Europe (Score: 0.843)\n",
"10. European (Score: 0.841)\n",
"11. Art (Score: 0.841)\n",
"12. Byzantine art (Score: 0.841)\n",
"13. Postmodernism (Score: 0.84)\n",
"14. Eastern Europe (Score: 0.839)\n",
"15. Europe (Score: 0.839)\n",
"16. Cubism (Score: 0.839)\n",
"17. Impressionism (Score: 0.838)\n",
"18. Bauhaus (Score: 0.838)\n",
"19. Expressionism (Score: 0.837)\n",
"20. Surrealism (Score: 0.837)\n"
]
}
],
"source": [
"query_results = query_qdrant('modern art in Europe', 'Articles')\n",
"for i, article in enumerate(query_results):\n",
" print(f'{i + 1}. {article.payload[\"title\"]} (Score: {round(article.score, 3)})')"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "7ed116b8",
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-18T09:53:11.038910Z",
"start_time": "2023-01-18T09:52:55.248029Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1. Battle of Bannockburn (Score: 0.869)\n",
"2. Wars of Scottish Independence (Score: 0.861)\n",
"3. 1651 (Score: 0.853)\n",
"4. First War of Scottish Independence (Score: 0.85)\n",
"5. Robert I of Scotland (Score: 0.846)\n",
"6. 841 (Score: 0.844)\n",
"7. 1716 (Score: 0.844)\n",
"8. 1314 (Score: 0.837)\n",
"9. 1263 (Score: 0.836)\n",
"10. William Wallace (Score: 0.835)\n",
"11. Stirling (Score: 0.831)\n",
"12. 1306 (Score: 0.831)\n",
"13. 1746 (Score: 0.831)\n",
"14. 1040s (Score: 0.828)\n",
"15. 1106 (Score: 0.827)\n",
"16. 1304 (Score: 0.827)\n",
"17. David II of Scotland (Score: 0.825)\n",
"18. Braveheart (Score: 0.824)\n",
"19. 1124 (Score: 0.824)\n",
"20. July 27 (Score: 0.823)\n"
]
}
],
"source": [
"# This time we'll query using content vector\n",
"query_results = query_qdrant('Famous battles in Scottish history', 'Articles', 'content')\n",
"for i, article in enumerate(query_results):\n",
" print(f'{i + 1}. {article.payload[\"title\"]} (Score: {round(article.score, 3)})')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0119d87a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
},
"vscode": {
"interpreter": {
"hash": "fd16a328ca3d68029457069b79cb0b38eb39a0f5ccc4fe4473d3047707df8207"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}