{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "cb1537e6", "metadata": {}, "source": [ "# Using Vector Databases for Embeddings Search\n", "\n", "This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.\n", "\n", "### What is a Vector Database\n", "\n", "A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.\n", "\n", "### Why use a Vector Database\n", "\n", "Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question answering, chatbot and recommendation services, for example) and make use of them in a secure, scalable environment. Many of our customers use embeddings to solve their problems at small scale, but performance and security concerns hold them back from going into production. We see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.\n", "\n", "\n", "### Demo Flow\n", "The demo flow is:\n", "- **Setup**: Import packages and set any required variables\n", "- **Load data**: Load a dataset that has already been embedded using OpenAI embeddings\n", "- **Pinecone**\n", " - *Setup*: Here we'll set up the Python client for Pinecone. 
For more details go [here](https://docs.pinecone.io/docs/quickstart)\n", " - *Index Data*: We'll create an index with namespaces for __titles__ and __content__\n", " - *Search Data*: We'll test out both namespaces with search queries to confirm they work\n", "- **Weaviate**\n", " - *Setup*: Here we'll set up the Python client for Weaviate. For more details go [here](https://weaviate.io/developers/weaviate/current/client-libraries/python.html)\n", " - *Index Data*: We'll create an index with __title__ search vectors in it\n", " - *Search Data*: We'll run a few searches to confirm it works\n", "- **Milvus**\n", " - *Setup*: Here we'll set up the Python client for Milvus. For more details go [here](https://milvus.io/docs)\n", " - *Index Data*: We'll create a collection and index it for both __titles__ and __content__\n", " - *Search Data*: We'll test out both collections with search queries to confirm they work\n", "- **Qdrant**\n", " - *Setup*: Here we'll set up the Python client for Qdrant. For more details go [here](https://github.com/qdrant/qdrant_client)\n", " - *Index Data*: We'll create a collection with vectors for __titles__ and __content__\n", " - *Search Data*: We'll run a few searches to confirm it works\n", "- **Redis**\n", " - *Setup*: Set up the Redis-Py client. For more details go [here](https://github.com/redis/redis-py)\n", " - *Index Data*: Create the search index for vector search and hybrid search (vector + full-text search) on all available fields.\n", " - *Search Data*: Run a few example queries with various goals in mind.\n", "\n", "\n", "Once you've run through this notebook you should have a basic understanding of how to set up and use vector databases, and can move on to more complex use cases making use of our embeddings." ] }, { "cell_type": "markdown", "id": "e2b59250", "metadata": {}, "source": [ "## Setup\n", "\n", "Import the required libraries and set the embedding model that we'd like to use." 
] }, { "cell_type": "code", "execution_count": null, "id": "8d8810f9", "metadata": {}, "outputs": [], "source": [ "# We'll need to install the clients for all vector databases\n", "!pip install pinecone-client\n", "!pip install weaviate-client\n", "!pip install pymilvus\n", "!pip install qdrant-client\n", "!pip install redis\n", "\n", "# Install wget to pull the zip file\n", "!pip install wget" ] }, { "cell_type": "code", "execution_count": 2, "id": "5be94df6", "metadata": {}, "outputs": [], "source": [ "import openai\n", "\n", "import tiktoken\n", "from typing import List, Iterator\n", "import pandas as pd\n", "import numpy as np\n", "import os\n", "import wget\n", "from ast import literal_eval\n", "\n", "# Redis client library for Python\n", "import redis\n", "\n", "# Pinecone's client library for Python\n", "import pinecone\n", "\n", "# Weaviate's client library for Python\n", "import weaviate\n", "\n", "# Milvus's client library for Python\n", "import pymilvus\n", "\n", "# Qdrant's client library for Python\n", "import qdrant_client\n", "\n", "# This is set to our new embeddings model; change it to the embedding model of your choice\n", "EMBEDDING_MODEL = \"text-embedding-ada-002\"\n", "\n", "# Ignore unclosed SSL socket warnings - optional, in case you get these warnings\n", "import warnings\n", "\n", "warnings.filterwarnings(action=\"ignore\", message=\"unclosed\", category=ResourceWarning)\n", "warnings.filterwarnings(\"ignore\", category=DeprecationWarning)" ] }, { "cell_type": "markdown", "id": "e5d9d2e1", "metadata": {}, "source": [ "## Load data\n", "\n", "In this section we'll load embedded data that we've prepared ahead of this session." 
] }, { "cell_type": "code", "execution_count": 3, "id": "5dff8b55", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'vector_database_wikipedia_articles_embedded.zip'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n", "\n", "# The file is ~700 MB so this will take some time\n", "wget.download(embeddings_url)" ] }, { "cell_type": "code", "execution_count": 4, "id": "21097972", "metadata": {}, "outputs": [], "source": [ "import zipfile\n", "with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\",\"r\") as zip_ref:\n", " zip_ref.extractall(\"../data\")" ] }, { "cell_type": "code", "execution_count": null, "id": "70bbd8ba", "metadata": {}, "outputs": [], "source": [ "article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')" ] }, { "cell_type": "code", "execution_count": 6, "id": "1721e45d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | id | url | title | text | title_vector | content_vector | vector_id |\n",
"|---|---|---|---|---|---|---|---|\n",
"| 0 | 1 | https://simple.wikipedia.org/wiki/April | April | April is the fourth month of the year in the J... | [0.001009464613161981, -0.020700545981526375, ... | [-0.011253940872848034, -0.013491976074874401,... | 0 |\n",
"| 1 | 2 | https://simple.wikipedia.org/wiki/August | August | August (Aug.) is the eighth month of the year ... | [0.0009286514250561595, 0.000820168002974242, ... | [0.0003609954728744924, 0.007262262050062418, ... | 1 |\n",
"| 2 | 6 | https://simple.wikipedia.org/wiki/Art | Art | Art is a creative activity that expresses imag... | [0.003393713850528002, 0.0061537534929811954, ... | [-0.004959689453244209, 0.015772193670272827, ... | 2 |\n",
"| 3 | 8 | https://simple.wikipedia.org/wiki/A | A | A or a is the first letter of the English alph... | [0.0153952119871974, -0.013759135268628597, 0.... | [0.024894846603274345, -0.022186409682035446, ... | 3 |\n",
"| 4 | 9 | https://simple.wikipedia.org/wiki/Air | Air | Air refers to the Earth's atmosphere. Air is a... | [0.02224554680287838, -0.02044147066771984, -0... | [0.021524671465158463, 0.018522677943110466, -... | 4 |\n", "