Merge pull request #56 from openai/colin

Initial commit of vector database example with new embeddings
2024-11-04 06:00:33 +00:00 · 2023-02-06 09:20:01 -08:00 · 2023-02-06 09:20:01 -08:00 · 5ef1523014
commit 5ef1523014
parent a24f1c8b10 3ad0e718cb
3 changed files with 905 additions and 0 deletions
--- a/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb
+++ b/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb
@@ -0,0 +1,877 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "cb1537e6",
+   "metadata": {},
+   "source": [
+    "# Using Vector Databases for Embeddings Search\n",
+    "\n",
+    "This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.\n",
+    "\n",
+    "### What is a Vector Database\n",
+    "\n",
+    "A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.\n",
+    "\n",
+    "### Why use a Vector Database\n",
+    "\n",
+    "Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question answering, chatbots and recommendation services, for example) and run them in a secure, scalable environment. Many of our customers use embeddings to solve their problems at small scale, but performance and security concerns hold them back from going into production. We see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.\n",
+    "\n",
+    "\n",
+    "### Demo Flow\n",
+    "The demo flow is:\n",
+    "- **Setup**: Import packages and set any required variables\n",
+    "- **Load data**: Load a dataset that has already been embedded using OpenAI embeddings\n",
+    "- **Pinecone**\n",
+    "    - *Setup*: Here we'll set up the Python client for Pinecone. For more details go [here](https://docs.pinecone.io/docs/quickstart)\n",
+    "    - *Index Data*: We'll create an index with namespaces for __titles__ and __content__\n",
+    "    - *Search Data*: We'll test out both namespaces with search queries to confirm it works\n",
+    "- **Weaviate**\n",
+    "    - *Setup*: Here we'll set up the Python client for Weaviate. For more details go [here](https://weaviate.io/developers/weaviate/current/client-libraries/python.html)\n",
+    "    - *Index Data*: We'll create an index with __title__ search vectors in it\n",
+    "    - *Search Data*: We'll run a few searches to confirm it works\n",
+    "- **Qdrant**\n",
+    "    - *Setup*: Here we'll set up the Python client for Qdrant. For more details go [here](https://github.com/qdrant/qdrant_client)\n",
+    "    - *Index Data*: We'll create a collection with vectors for __titles__ and __content__\n",
+    "    - *Search Data*: We'll run a few searches to confirm it works\n",
+    "\n",
+    "Once you've run through this notebook you should have a basic understanding of how to set up and use vector databases, and can move on to more complex use cases making use of our embeddings."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e2b59250",
+   "metadata": {},
+   "source": [
+    "## Setup\n",
+    "\n",
+    "Import the required libraries and set the embedding model that we'd like to use."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8d8810f9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# We'll need to install the clients for all vector databases\n",
+    "!pip install pinecone-client\n",
+    "!pip install weaviate-client\n",
+    "!pip install qdrant-client\n",
+    "\n",
+    "#Install wget to pull zip file\n",
+    "!pip install wget"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5be94df6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import openai\n",
+    "\n",
+    "import tiktoken\n",
+    "from typing import List, Iterator\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import os\n",
+    "import wget\n",
+    "from ast import literal_eval\n",
+    "\n",
+    "# Pinecone's client library for Python\n",
+    "import pinecone\n",
+    "\n",
+    "# Weaviate's client library for Python\n",
+    "import weaviate\n",
+    "\n",
+    "# Qdrant's client library for Python\n",
+    "import qdrant_client\n",
+    "\n",
+    "# This is set to our new embeddings model; it can be changed to the embedding model of your choice\n",
+    "EMBEDDING_MODEL = \"text-embedding-ada-002\"\n",
+    "\n",
+    "# Ignore unclosed SSL socket warnings - optional in case you get these errors\n",
+    "import warnings\n",
+    "\n",
+    "warnings.filterwarnings(action=\"ignore\", message=\"unclosed\", category=ResourceWarning)\n",
+    "warnings.filterwarnings(\"ignore\", category=DeprecationWarning) "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e5d9d2e1",
+   "metadata": {},
+   "source": [
+    "## Load data\n",
+    "\n",
+    "In this section we'll load embedded data that we prepared prior to this session."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5dff8b55",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n",
+    "\n",
+    "# Warning, the file is pretty big so this will take some time\n",
+    "wget.download(embeddings_url)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "21097972",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import zipfile\n",
+    "with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\",\"r\") as zip_ref:\n",
+    "    zip_ref.extractall(\"../data\")\n",
+    "    \n",
+    "article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1721e45d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "article_df.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "960b82af",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Read vectors from strings back into a list\n",
+    "article_df['title_vector'] = article_df.title_vector.apply(literal_eval)\n",
+    "article_df['content_vector'] = article_df.content_vector.apply(literal_eval)\n",
+    "\n",
+    "# Set vector_id to be a string\n",
+    "article_df['vector_id'] = article_df['vector_id'].apply(str)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a334ab8b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "len(article_df['title_vector'][0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ed32fc87",
+   "metadata": {},
+   "source": [
+    "## Pinecone\n",
+    "\n",
+    "We'll index these embedded documents in a vector database and search them. The first option we'll look at is **Pinecone**, a managed vector database which offers a cloud-native option.\n",
+    "\n",
+    "Before you proceed with this step you'll need to navigate to [Pinecone](https://www.pinecone.io), sign up and then save your API key as an environment variable titled ```PINECONE_API_KEY```.\n",
+    "\n",
+    "For this section we will:\n",
+    "- Create an index with multiple namespaces for article titles and content\n",
+    "- Store our data in the index with separate searchable \"namespaces\" for article **titles** and **content**\n",
+    "- Fire some similarity search queries to verify our setup is working"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "92e6152a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "api_key = os.getenv(\"PINECONE_API_KEY\")\n",
+    "pinecone.init(api_key=api_key)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "63b28543",
+   "metadata": {},
+   "source": [
+    "### Create Index\n",
+    "\n",
+    "First we will need to create an index, which we'll call `wikipedia-articles`. Once we have an index, we can create multiple namespaces, which can make a single index searchable for various use cases. For more details, consult [Pinecone documentation](https://docs.pinecone.io/docs/namespaces#:~:text=Pinecone%20allows%20you%20to%20partition,different%20subsets%20of%20your%20index.).\n",
+    "\n",
+    "If you want to batch insert to your index in parallel to increase insertion speed then there is a great guide in the Pinecone documentation on [batch inserts in parallel](https://docs.pinecone.io/docs/insert-data#sending-upserts-in-parallel)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0a71c575",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Models a simple batch generator that makes chunks out of an input DataFrame\n",
+    "class BatchGenerator:\n",
+    "    \n",
+    "    \n",
+    "    def __init__(self, batch_size: int = 10) -> None:\n",
+    "        self.batch_size = batch_size\n",
+    "    \n",
+    "    # Makes chunks out of an input DataFrame\n",
+    "    def to_batches(self, df: pd.DataFrame) -> Iterator[pd.DataFrame]:\n",
+    "        splits = self.splits_num(df.shape[0])\n",
+    "        if splits <= 1:\n",
+    "            yield df\n",
+    "        else:\n",
+    "            for chunk in np.array_split(df, splits):\n",
+    "                yield chunk\n",
+    "\n",
+    "    # Determines how many chunks DataFrame contains\n",
+    "    def splits_num(self, elements: int) -> int:\n",
+    "        return round(elements / self.batch_size)\n",
+    "    \n",
+    "    __call__ = to_batches\n",
+    "\n",
+    "df_batcher = BatchGenerator(300)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7ea9ad46",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Pick a name for the new index\n",
+    "index_name = 'wikipedia-articles'\n",
+    "\n",
+    "# Check whether the index with the same name already exists - if so, delete it\n",
+    "if index_name in pinecone.list_indexes():\n",
+    "    pinecone.delete_index(index_name)\n",
+    "    \n",
+    "# Creates new index\n",
+    "pinecone.create_index(name=index_name, dimension=len(article_df['content_vector'][0]))\n",
+    "index = pinecone.Index(index_name=index_name)\n",
+    "\n",
+    "# Confirm our index was created\n",
+    "pinecone.list_indexes()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5daeba00",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Upsert content vectors in content namespace - this can take a few minutes\n",
+    "print(\"Uploading vectors to content namespace..\")\n",
+    "for batch_df in df_batcher(article_df):\n",
+    "    index.upsert(vectors=zip(batch_df.vector_id, batch_df.content_vector), namespace='content')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5fc1b083",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Upsert title vectors in title namespace - this can also take a few minutes\n",
+    "print(\"Uploading vectors to title namespace..\")\n",
+    "for batch_df in df_batcher(article_df):\n",
+    "    index.upsert(vectors=zip(batch_df.vector_id, batch_df.title_vector), namespace='title')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f90c7fba",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Check index size for each namespace to confirm all of our docs have loaded\n",
+    "index.describe_index_stats()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2da40a69",
+   "metadata": {},
+   "source": [
+    "### Search data\n",
+    "\n",
+    "Now we'll enter some dummy searches and check we get decent results back"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d701b3c7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# First we'll create dictionaries mapping vector IDs to their outputs so we can retrieve the text for our search results\n",
+    "titles_mapped = dict(zip(article_df.vector_id,article_df.title))\n",
+    "content_mapped = dict(zip(article_df.vector_id,article_df.text))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3c8c2aa1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def query_article(query, namespace, top_k=5):\n",
+    "    '''Embeds the query text, searches the specified\n",
+    "     namespace and prints the results.'''\n",
+    "\n",
+    "    # Create a vector embedding of the query text\n",
+    "    embedded_query = openai.Embedding.create(\n",
+    "                                            input=query,\n",
+    "                                            model=EMBEDDING_MODEL,\n",
+    "                                            )[\"data\"][0]['embedding']\n",
+    "\n",
+    "    # Query the namespace passed as a parameter using the embedded query\n",
+    "    query_result = index.query(embedded_query, \n",
+    "                                      namespace=namespace, \n",
+    "                                      top_k=top_k)\n",
+    "\n",
+    "    # Print query results \n",
+    "    print(f'\\nMost similar results to {query} in \"{namespace}\" namespace:\\n')\n",
+    "    if not query_result.matches:\n",
+    "        print('no query result')\n",
+    "    \n",
+    "    matches = query_result.matches\n",
+    "    ids = [res.id for res in matches]\n",
+    "    scores = [res.score for res in matches]\n",
+    "    df = pd.DataFrame({'id':ids, \n",
+    "                       'score':scores,\n",
+    "                       'title': [titles_mapped[_id] for _id in ids],\n",
+    "                       'content': [content_mapped[_id] for _id in ids],\n",
+    "                       })\n",
+    "    \n",
+    "    counter = 0\n",
+    "    for k,v in df.iterrows():\n",
+    "        counter += 1\n",
+    "        print(f'{v.title} (score = {v.score})')\n",
+    "    \n",
+    "    print('\\n')\n",
+    "\n",
+    "    return df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "67b3584d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query_output = query_article('modern art in Europe','title')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3e7ac79b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "content_query_output = query_article(\"Famous battles in Scottish history\",'content')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d939342f",
+   "metadata": {},
+   "source": [
+    "## Weaviate\n",
+    "\n",
+    "The next vector database option we'll explore here is **Weaviate**, which offers both a managed SaaS option, like Pinecone, and a self-hosted option. As we've already looked at a cloud vector database, we'll try the self-hosted option here.\n",
+    "\n",
+    "For this we will:\n",
+    "- Set up a local deployment of Weaviate\n",
+    "- Create indices in Weaviate\n",
+    "- Store our data there\n",
+    "- Fire some similarity search queries\n",
+    "- Try a real use case"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bfdfe260",
+   "metadata": {},
+   "source": [
+    "### Setup\n",
+    "\n",
+    "To get Weaviate running locally we will use Docker and follow the instructions contained in the Weaviate documentation here: https://weaviate.io/developers/weaviate/current/installation/docker-compose.html\n",
+    "\n",
+    "For an example docker-compose.yaml file please refer to `./weaviate/docker-compose.yaml` in this repo\n",
+    "\n",
+    "You can start Weaviate up locally by navigating to this directory and running `docker-compose up -d `"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b9ea472d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "client = weaviate.Client(\"http://localhost:8080/\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "13be220d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "client.schema.delete_all()\n",
+    "client.schema.get()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "73d33184",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "client.is_ready()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "03a926b9",
+   "metadata": {},
+   "source": [
+    "### Index data\n",
+    "\n",
+    "In Weaviate you create __schemas__ to capture each of the entities you will be searching. \n",
+    "\n",
+    "In this case we'll create a schema called **Article** with the **title** vector from above included for us to search by.\n",
+    "\n",
+    "The next few steps closely follow the documentation Weaviate provides [here](https://weaviate.io/developers/weaviate/current/tutorials/how-to-use-weaviate-without-modules.html)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e868d143",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class_obj = {\n",
+    "    \"class\": \"Article\",\n",
+    "    \"vectorizer\": \"none\", # explicitly tell Weaviate not to vectorize anything, we are providing the vectors ourselves\n",
+    "    \"properties\": [{\n",
+    "        \"name\": \"title\",\n",
+    "        \"description\": \"Title of the article\",\n",
+    "        \"dataType\": [\"text\"]\n",
+    "    },\n",
+    "        {\n",
+    "        \"name\": \"content\",\n",
+    "        \"description\": \"Contents of the article\",\n",
+    "        \"dataType\": [\"text\"]\n",
+    "    }]\n",
+    "}\n",
+    "\n",
+    "# Create the schema in Weaviate\n",
+    "client.schema.create_class(class_obj)\n",
+    "\n",
+    "# Check that we've created it as intended\n",
+    "client.schema.get()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "786d437f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Convert DF into a list of tuples\n",
+    "data_objects = []\n",
+    "for k,v in article_df.iterrows():\n",
+    "    data_objects.append((v['title'],v['text'],v['title_vector'],v['vector_id']))\n",
+    "\n",
+    "# Upsert into article schema\n",
+    "print(\"Uploading vectors to article schema..\")\n",
+    "\n",
+    "# Store a list of UUIDs in case we want to refer back to the initial dataframe\n",
+    "uuids = []\n",
+    "\n",
+    "# Reuse our batcher from the Pinecone ingestion\n",
+    "for batch_df in df_batcher(article_df):\n",
+    "    for k,v in batch_df.iterrows():\n",
+    "        #print(articles)\n",
+    "        uuid = client.data_object.create(\n",
+    "                              {\n",
+    "                                  \"title\": v['title'],\n",
+    "                                  \"content\": v['text']\n",
+    "                              },\n",
+    "                              \"Article\",\n",
+    "                              vector=v['title_vector']\n",
+    "                            )\n",
+    "        uuids.append(uuid)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3658693c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Test our insert has worked by checking one object\n",
+    "print(client.data_object.get()['objects'][0]['properties']['title'])\n",
+    "print(client.data_object.get()['objects'][0]['properties']['content'])\n",
+    "\n",
+    "# Test that all data has loaded\n",
+    "result = client.query.aggregate(\"Article\") \\\n",
+    "    .with_fields('meta { count }') \\\n",
+    "    .do()\n",
+    "result['data']"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "46050ca9",
+   "metadata": {},
+   "source": [
+    "### Search Data\n",
+    "\n",
+    "As above, we'll fire some queries at our new Index and get back results based on the closeness to our existing vectors"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5acd5437",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def query_weaviate(query, schema, top_k=20):\n",
+    "\n",
+    "    # Creates embedding vector from user query\n",
+    "    embedded_query = openai.Embedding.create(\n",
+    "                                                input=query,\n",
+    "                                                model=EMBEDDING_MODEL,\n",
+    "                                            )[\"data\"][0]['embedding']\n",
+    "    \n",
+    "    near_vector = {\"vector\": embedded_query}\n",
+    "\n",
+    "    # Queries input schema with vectorised user query\n",
+    "    query_result = client.query.get(schema,[\"title\",\"content\", \"_additional {certainty}\"]) \\\n",
+    "    .with_near_vector(near_vector) \\\n",
+    "    .with_limit(top_k) \\\n",
+    "    .do()\n",
+    "    \n",
+    "    return query_result"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "15def653",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query_result = query_weaviate('modern art in Europe','Article')\n",
+    "counter = 0\n",
+    "for article in query_result['data']['Get']['Article']:\n",
+    "    counter += 1\n",
+    "    print(f\"{counter}. { article['title']} (Score: {round(article['_additional']['certainty'],3) })\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "93c4a696",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query_result = query_weaviate('Famous battles in Scottish history','Article')\n",
+    "counter = 0\n",
+    "for article in query_result['data']['Get']['Article']:\n",
+    "    counter += 1\n",
+    "    print(f\"{counter}. {article['title']} (Score: {round(article['_additional']['certainty'],3) })\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9cfaed9d",
+   "metadata": {},
+   "source": [
+    "## Qdrant\n",
+    "\n",
+    "The last vector database we'll consider is **[Qdrant](https://qdrant.tech/)**. This is a high-performance vector search database written in Rust. It offers both on-premise and cloud versions, but for the purposes of this example we're going to use the local deployment mode.\n",
+    "\n",
+    "Setting everything up will require:\n",
+    "- Spinning up a local instance of Qdrant\n",
+    "- Configuring the collection and storing the data in it\n",
+    "- Trying out with some queries"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "38774565",
+   "metadata": {},
+   "source": [
+    "### Setup\n",
+    "\n",
+    "For the local deployment, we are going to use Docker, according to the Qdrant documentation: https://qdrant.tech/documentation/quick_start/. Qdrant requires just a single container, but an example of the docker-compose.yaml file is available at `./qdrant/docker-compose.yaml` in this repo.\n",
+    "\n",
+    "You can start a Qdrant instance locally by navigating to this directory and running `docker-compose up -d `"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "76d697e9",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-01-18T09:28:38.928205Z",
+     "start_time": "2023-01-18T09:28:38.913987Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "qdrant = qdrant_client.QdrantClient(host='localhost', prefer_grpc=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1deeb539",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-01-18T09:29:19.806639Z",
+     "start_time": "2023-01-18T09:29:19.727897Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "qdrant.get_collections()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bc006b6f",
+   "metadata": {},
+   "source": [
+    "### Index data\n",
+    "\n",
+    "Qdrant stores data in __collections__ where each object is described by at least one vector and may contain additional metadata called a __payload__. Our collection will be called **Articles** and each object will be described by both **title** and **content** vectors.\n",
+    "\n",
+    "We'll be using the official [qdrant-client](https://github.com/qdrant/qdrant_client) package, which has all the utility methods already built in."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1a84ee1d",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-01-18T09:29:22.530121Z",
+     "start_time": "2023-01-18T09:29:22.524604Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from qdrant_client.http import models as rest"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "00876f92",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-01-18T09:31:14.413334Z",
+     "start_time": "2023-01-18T09:31:13.619079Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "vector_size = len(article_df['content_vector'][0])\n",
+    "\n",
+    "qdrant.recreate_collection(\n",
+    "    collection_name='Articles',\n",
+    "    vectors_config={\n",
+    "        'title': rest.VectorParams(\n",
+    "            distance=rest.Distance.COSINE,\n",
+    "            size=vector_size,\n",
+    "        ),\n",
+    "        'content': rest.VectorParams(\n",
+    "            distance=rest.Distance.COSINE,\n",
+    "            size=vector_size,\n",
+    "        ),\n",
+    "    }\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f24e76ab",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-01-18T09:36:28.597535Z",
+     "start_time": "2023-01-18T09:36:24.108867Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "qdrant.upsert(\n",
+    "    collection_name='Articles',\n",
+    "    points=[\n",
+    "        rest.PointStruct(\n",
+    "            id=k,\n",
+    "            vector={\n",
+    "                'title': v['title_vector'],\n",
+    "                'content': v['content_vector'],\n",
+    "            },\n",
+    "            payload=v.to_dict(),\n",
+    "        )\n",
+    "        for k, v in article_df.iterrows()\n",
+    "    ],\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d1188a12",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-01-18T09:58:13.825886Z",
+     "start_time": "2023-01-18T09:58:13.816248Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# Check the collection size to make sure all the points have been stored\n",
+    "qdrant.count(collection_name='Articles')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "06ed119b",
+   "metadata": {},
+   "source": [
+    "### Search Data\n",
+    "\n",
+    "Once the data is in Qdrant we can start querying the collection for the closest vectors. We can pass an additional parameter, `vector_name`, to switch from title-based to content-based search."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f1bac4ef",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-01-18T09:50:35.265647Z",
+     "start_time": "2023-01-18T09:50:35.256065Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "def query_qdrant(query, collection_name, vector_name='title', top_k=20):\n",
+    "\n",
+    "    # Creates embedding vector from user query\n",
+    "    embedded_query = openai.Embedding.create(\n",
+    "        input=query,\n",
+    "        model=EMBEDDING_MODEL,\n",
+    "    )['data'][0]['embedding']\n",
+    "    \n",
+    "    query_results = qdrant.search(\n",
+    "        collection_name=collection_name,\n",
+    "        query_vector=(\n",
+    "            vector_name, embedded_query\n",
+    "        ),\n",
+    "        limit=top_k,\n",
+    "    )\n",
+    "    \n",
+    "    return query_results"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "aa92f3d3",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-01-18T09:50:46.545145Z",
+     "start_time": "2023-01-18T09:50:35.711020Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "query_results = query_qdrant('modern art in Europe', 'Articles')\n",
+    "for i, article in enumerate(query_results):\n",
+    "    print(f'{i + 1}. {article.payload[\"title\"]} (Score: {round(article.score, 3)})')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7ed116b8",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2023-01-18T09:53:11.038910Z",
+     "start_time": "2023-01-18T09:52:55.248029Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# This time we'll query using content vector\n",
+    "query_results = query_qdrant('Famous battles in Scottish history', 'Articles', 'content')\n",
+    "for i, article in enumerate(query_results):\n",
+    "    print(f'{i + 1}. {article.payload[\"title\"]} (Score: {round(article.score, 3)})')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "55afccbf",
+   "metadata": {},
+   "source": [
+    "Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "vectordb",
+   "language": "python",
+   "name": "vectordb"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
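
Note on batching: the notebook's `BatchGenerator` computes the number of chunks with `round(elements / self.batch_size)`, so a batch can end up larger than `batch_size` (449 rows with a batch size of 300 round to a single chunk of 449). The sketch below is not part of the commit; it shows one way to cap every batch at `batch_size` using ceiling division over plain `(vector_id, embedding)` tuples. The name `iter_batches` and the toy records are assumptions made for illustration only.

```python
import math
from typing import Iterator, List, Sequence, Tuple

# (vector_id, embedding), mirroring the tuples the notebook zips for upserts
Record = Tuple[str, List[float]]


def iter_batches(records: Sequence[Record], batch_size: int = 300) -> Iterator[List[Record]]:
    """Yield consecutive slices of `records`, none longer than `batch_size`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Ceiling division: a partial final batch still gets its own slice
    n_batches = math.ceil(len(records) / batch_size)
    for i in range(n_batches):
        yield list(records[i * batch_size:(i + 1) * batch_size])


# Toy data standing in for article_df rows (hypothetical ids and 3-d vectors)
records = [(str(i), [0.0, 0.1 * i, 1.0]) for i in range(449)]
batch_sizes = [len(batch) for batch in iter_batches(records, batch_size=300)]
print(batch_sizes)  # [300, 149]
```

Each yielded list could be passed to an upsert call in place of the zipped DataFrame columns; whether a hard cap matters depends on the request-size limits of the target database.
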
--- a/examples/vector_databases/qdrant/docker-compose.yaml
+++ b/examples/vector_databases/qdrant/docker-compose.yaml
@@ -0,0 +1,8 @@
+version: '3.4'
+services:
+  qdrant:
+    image: qdrant/qdrant:v0.11.7
+    restart: on-failure
+    ports:
+      - "6333:6333"
+      - "6334:6334"
--- a/examples/vector_databases/weaviate/docker-compose.yaml
+++ b/examples/vector_databases/weaviate/docker-compose.yaml
@@ -0,0 +1,20 @@
+version: '3.4'
+services:
+  weaviate:
+    image: semitechnologies/weaviate:1.14.0
+    restart: on-failure:0
+    ports:
+     - "8080:8080"
+    environment:
+      QUERY_DEFAULTS_LIMIT: 20
+      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
+      PERSISTENCE_DATA_PATH: "./data"
+      DEFAULT_VECTORIZER_MODULE: text2vec-transformers
+      ENABLE_MODULES: text2vec-transformers
+      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
+      CLUSTER_HOSTNAME: 'node1'
+  t2v-transformers:
+    image: semitechnologies/transformers-inference:sentence-transformers-msmarco-distilroberta-base-v2
+    environment:
+      ENABLE_CUDA: 0 # set to 1 to enable
+      # NVIDIA_VISIBLE_DEVICES: all # enable if running with CUDA
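
Note on the compose files: after running `docker-compose up -d`, it can help to confirm that the published ports (8080 for Weaviate, 6333 and 6334 for Qdrant) accept connections before creating the clients in the notebook. The following is a minimal standard-library sketch, not part of the commit. The helper name `port_is_open` and the labels in `SERVICES` are assumptions made for illustration. A successful TCP connection only shows that something is listening; the notebook's `client.is_ready()` remains the actual readiness check for Weaviate.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts
        return False


# Ports published by the two compose files; the labels are only used for printing
SERVICES = {"weaviate-http": 8080, "qdrant-http": 6333, "qdrant-grpc": 6334}

for name, port in SERVICES.items():
    status = "reachable" if port_is_open("localhost", port) else "not reachable"
    print(f"{name} (localhost:{port}): {status}")
```
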