diff --git a/examples/vector_databases/README.md b/examples/vector_databases/README.md new file mode 100644 index 00000000..f9370712 --- /dev/null +++ b/examples/vector_databases/README.md @@ -0,0 +1,21 @@ +# Vector Databases + +This section of the OpenAI Cookbook showcases many of the vector databases available to support your semantic search use cases. + +Vector databases can be a great accompaniment for knowledge retrieval applications, which reduce hallucinations by providing the LLM with the relevant context to answer questions. + +Each provider has their own named directory, with a standard notebook to introduce you to using our API with their product, and any supplementary notebooks they choose to add to showcase their functionality. + +## Guides & deep dives +- [AnalyticDB](https://www.alibabacloud.com/help/en/analyticdb-for-postgresql/latest/get-started-with-analyticdb-for-postgresql) +- [Chroma](https://docs.trychroma.com/getting-started) +- [Hologres](https://www.alibabacloud.com/help/en/hologres/latest/procedure-to-use-hologres) +- [Kusto](https://learn.microsoft.com/en-us/azure/data-explorer/web-query-data) +- [Milvus](https://milvus.io/docs/example_code.md) +- [MyScale](https://docs.myscale.com/en/quickstart/) +- [Pinecone](https://docs.pinecone.io/docs/quickstart) +- [Qdrant](https://qdrant.tech/documentation/quick-start/) +- [Redis](https://github.com/RedisVentures/simple-vecsim-intro) +- [SingleStoreDB](https://www.singlestore.com/blog/how-to-get-started-with-singlestore/) +- [Typesense](https://typesense.org/docs/guide/) +- [Weaviate](https://weaviate.io/developers/weaviate/quickstart) \ No newline at end of file diff --git a/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb b/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb deleted file mode 100644 index 6318e7ac..00000000 --- a/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb +++ /dev/null @@ -1,2488 +0,0 @@ -{ - "cells": [ - { - "attachments": {}, - "cell_type": "markdown", - "id": "cb1537e6", - "metadata": {}, - "source": [ - "# Using Vector Databases for Embeddings Search\n", - "\n", - "This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.\n", - "\n", - "### What is a Vector Database\n", - "\n", - "A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.\n", - "\n", - "### Why use a Vector Database\n", - "\n", - "Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question and answering, chatbot and recommendation services, for example), and make use of them in a secure, scalable environment. 
Many of our customers make embeddings solve their problems at small scale but performance and security hold them back from going into production - we see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.\n", - "\n", - "\n", - "### Demo Flow\n", - "The demo flow is:\n", - "- **Setup**: Import packages and set any required variables\n", - "- **Load data**: Load a dataset and embed it using OpenAI embeddings\n", - "- **Chroma**:\n", - " - *Setup*: Here we'll set up the Python client for Chroma. For more details go [here](https://docs.trychroma.com/usage-guide)\n", - " - *Index Data*: We'll create collections with vectors for __titles__ and __content__\n", - " - *Search Data*: We'll run a few searches to confirm it works\n", - "- **Pinecone**\n", - " - *Setup*: Here we'll set up the Python client for Pinecone. For more details go [here](https://docs.pinecone.io/docs/quickstart)\n", - " - *Index Data*: We'll create an index with namespaces for __titles__ and __content__\n", - " - *Search Data*: We'll test out both namespaces with search queries to confirm it works\n", - "- **Weaviate**\n", - " - *Setup*: Here we'll set up the Python client for Weaviate. For more details go [here](https://weaviate.io/developers/weaviate/current/client-libraries/python.html)\n", - " - *Index Data*: We'll create an index with __title__ search vectors in it\n", - " - *Search Data*: We'll run a few searches to confirm it works\n", - "- **Milvus**\n", - " - *Setup*: Here we'll set up the Python client for Milvus. For more details go [here](https://milvus.io/docs)\n", - " - *Index Data* We'll create a collection and index it for both __titles__ and __content__\n", - " - *Search Data*: We'll test out both collections with search queries to confirm it works\n", - "- **Qdrant**\n", - " - *Setup*: Here we'll set up the Python client for Qdrant. For more details go [here](https://github.com/qdrant/qdrant_client)\n", - " - *Index Data*: We'll create a collection with vectors for __titles__ and __content__\n", - " - *Search Data*: We'll run a few searches to confirm it works\n", - "- **Redis**\n", - " - *Setup*: Set up the Redis-Py client. For more details go [here](https://github.com/redis/redis-py)\n", - " - *Index Data*: Create the search index for vector search and hybrid search (vector + full-text search) on all available fields.\n", - " - *Search Data*: Run a few example queries with various goals in mind.\n", - "- **Typesense**\n", - " - *Setup*: Set up the Typesense Python client. For more details go [here](https://typesense.org/docs/0.24.0/api/)\n", - " - *Index Data*: We'll create a collection and index it for both __titles__ and __content__.\n", - " - *Search Data*: Run a few example queries with various goals in mind.\n", - "- **MyScale**\n", - " - *Setup*: Set up the MyScale Python client. For more details go [here](https://docs.myscale.com/en/python-client/)\n", - " - *Index Data*: We'll create a table and index it for __content__.\n", - " - *Search Data*: Run a few example queries with various goals in mind.\n", - "\n", - "Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings." 
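Before we pick any particular database, it helps to see the one operation all of them are built around: embed a query, then rank stored vectors by similarity. Below is a minimal, database-free sketch of that idea. It assumes your `OPENAI_API_KEY` is set, uses the same `text-embedding-ada-002` model as the rest of this notebook, and relies on the pre-1.0 `openai` Python client that the later cells also use.

```python
import os
import numpy as np
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")
EMBEDDING_MODEL = "text-embedding-ada-002"

def embed(text: str) -> np.ndarray:
    # Create a single embedding vector for a piece of text
    response = openai.Embedding.create(input=text, model=EMBEDDING_MODEL)
    return np.array(response["data"][0]["embedding"])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Tiny in-memory "index": a list of (text, vector) pairs
documents = ["The Eiffel Tower is in Paris.", "Python is a programming language."]
index = [(doc, embed(doc)) for doc in documents]

# Rank the stored vectors by similarity to the query vector
query_vector = embed("Where is the Eiffel Tower?")
ranked = sorted(index, key=lambda pair: cosine_similarity(query_vector, pair[1]), reverse=True)
print(ranked[0][0])
```

A vector database performs the same ranking, but with approximate nearest-neighbour indexes, metadata filtering and persistence, so it scales well beyond an in-memory list.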
- ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "e2b59250", - "metadata": {}, - "source": [ - "## Setup\n", - "\n", - "Import the required libraries and set the embedding model that we'd like to use." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8d8810f9", - "metadata": {}, - "outputs": [], - "source": [ - "# We'll need to install the clients for all vector databases\n", - "!pip install chromadb\n", - "!pip install pinecone-client\n", - "!pip install weaviate-client\n", - "!pip install pymilvus\n", - "!pip install qdrant-client\n", - "!pip install redis\n", - "!pip install typesense\n", - "!pip install clickhouse-connect\n", - "\n", - "# Install wget to pull the zip file\n", - "!pip install wget" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5be94df6", - "metadata": {}, - "outputs": [], - "source": [ - "import openai\n", - "\n", - "from typing import List, Iterator\n", - "import pandas as pd\n", - "import numpy as np\n", - "import os\n", - "import wget\n", - "from ast import literal_eval\n", - "\n", - "# Redis client library for Python\n", - "import redis\n", - "\n", - "# Chroma's client library for Python\n", - "import chromadb\n", - "\n", - "# Pinecone's client library for Python\n", - "import pinecone\n", - "\n", - "# Weaviate's client library for Python\n", - "import weaviate\n", - "\n", - "# Qdrant's client library for Python\n", - "import qdrant_client\n", - "\n", - "# Typesense's client library for Python\n", - "import typesense\n", - "\n", - "# MyScale's client library for Python (the clickhouse-connect package imports as clickhouse_connect)\n", - "import clickhouse_connect\n", - "\n", - "\n", - "# This is set to our embeddings model; it can be changed to the embedding model of your choice\n", - "EMBEDDING_MODEL = \"text-embedding-ada-002\"\n", - "\n", - "# Ignore unclosed SSL socket warnings - optional in case you get these errors\n", - "import warnings\n", - "\n", - "warnings.filterwarnings(action=\"ignore\", message=\"unclosed\", category=ResourceWarning)\n", - "warnings.filterwarnings(\"ignore\", category=DeprecationWarning) " - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "e5d9d2e1", - "metadata": {}, - "source": [ - "## Load data\n", - "\n", - "In this section we'll load embedded data that we've prepared ahead of this session."
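The archive downloaded in the next cell is roughly 700 MB, so if you re-run the notebook it is worth guarding against fetching it twice. A small optional sketch (same URL as the cell below; the `zip_path` name is just illustrative):

```python
# Optional: skip the ~700 MB download if the archive is already present locally
import os
import wget

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"
zip_path = "vector_database_wikipedia_articles_embedded.zip"

if not os.path.exists(zip_path):
    wget.download(embeddings_url, zip_path)
else:
    print(f"{zip_path} already downloaded, skipping")
```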
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5dff8b55", - "metadata": {}, - "outputs": [], - "source": [ - "embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n", - "\n", - "# The file is ~700 MB so this will take some time\n", - "wget.download(embeddings_url)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "21097972", - "metadata": {}, - "outputs": [], - "source": [ - "import zipfile\n", - "with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\",\"r\") as zip_ref:\n", - " zip_ref.extractall(\"../data\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "70bbd8ba", - "metadata": {}, - "outputs": [], - "source": [ - "article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1721e45d", - "metadata": {}, - "outputs": [], - "source": [ - "article_df.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "960b82af", - "metadata": {}, - "outputs": [], - "source": [ - "# Read vectors from strings back into a list\n", - "article_df['title_vector'] = article_df.title_vector.apply(literal_eval)\n", - "article_df['content_vector'] = article_df.content_vector.apply(literal_eval)\n", - "\n", - "# Set vector_id to be a string\n", - "article_df['vector_id'] = article_df['vector_id'].apply(str)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a334ab8b", - "metadata": {}, - "outputs": [], - "source": [ - "article_df.info(show_counts=True)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "81bf5349", - "metadata": {}, - "source": [ - "# Chroma\n", - "\n", - "We'll index these embedded documents in a vector database and search them. The first option we'll look at is **Chroma**, an easy to use open-source self-hosted in-memory vector database, designed for working with embeddings together with LLMs. \n", - "\n", - "In this section, we will:\n", - "- Instantiate the Chroma client\n", - "- Create collections for each class of embedding \n", - "- Query each collection " - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "37d1f693", - "metadata": {}, - "source": [ - "### Instantiate the Chroma client\n", - "\n", - "Create the Chroma client. By default, Chroma is ephemeral and runs in memory. \n", - "However, you can easily set up a persistent configuraiton which writes to disk." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "159d9646", - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "\n", - "chroma_client = chromadb.Client() # Ephemeral. Comment out for the persistent version.\n", - "\n", - "# Uncomment the following for the persistent version. \n", - "# import chromadb.config.Settings\n", - "# persist_directory = 'chroma_persistence' # Directory to store persisted Chroma data. \n", - "# client = chromadb.Client(\n", - "# Settings(\n", - "# persist_directory=persist_directory,\n", - "# chroma_db_impl=\"duckdb+parquet\",\n", - "# )\n", - "# )" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "5cd61943", - "metadata": {}, - "source": [ - "### Create collections\n", - "\n", - "Chroma collections allow you to store and filter with arbitrary metadata, making it easy to query subsets of the embedded data. \n", - "\n", - "Chroma is already integrated with OpenAI's embedding functions. 
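As a quick aside, the embedding function object is itself callable, which makes it easy to confirm the OpenAI integration works before wiring it into a collection. This is an optional sketch, assuming your `OPENAI_API_KEY` is set and the `EMBEDDING_MODEL` constant from the setup section; `test_embedding_function` is just an illustrative name.

```python
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# Hypothetical standalone check - not required for the flow below
test_embedding_function = OpenAIEmbeddingFunction(
    api_key=os.environ.get("OPENAI_API_KEY"),
    model_name=EMBEDDING_MODEL,
)
sample_vectors = test_embedding_function(["How do vector databases work?"])
print(f"Got {len(sample_vectors)} vector(s) of dimension {len(sample_vectors[0])}")
```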
The best way to use them is on construction of a collection, as follows.\n", - "Alternatively, you can 'bring your own embeddings'. More information can be found [here](https://docs.trychroma.com/embeddings)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ad2d1bce", - "metadata": {}, - "outputs": [], - "source": [ - "from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction\n", - "\n", - "# Test that your OpenAI API key is correctly set as an environment variable\n", - "# Note. if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.\n", - "\n", - "# Note. alternatively you can set a temporary env variable like this:\n", - "# os.environ[\"OPENAI_API_KEY\"] = 'sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'\n", - "\n", - "if os.getenv(\"OPENAI_API_KEY\") is not None:\n", - " openai.api_key = os.getenv(\"OPENAI_API_KEY\")\n", - " print (\"OPENAI_API_KEY is ready\")\n", - "else:\n", - " print (\"OPENAI_API_KEY environment variable not found\")\n", - "\n", - "\n", - "embedding_function = OpenAIEmbeddingFunction(api_key=os.environ.get('OPENAI_API_KEY'), model_name=EMBEDDING_MODEL)\n", - "\n", - "wikipedia_content_collection = chroma_client.create_collection(name='wikipedia_content', embedding_function=embedding_function)\n", - "wikipedia_title_collection = chroma_client.create_collection(name='wikipedia_titles', embedding_function=embedding_function)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "02887b52", - "metadata": {}, - "source": [ - "### Populate the collections\n", - "\n", - "Chroma collections allow you to populate, and filter on, whatever metadata you like. Chroma can also store the text alongside the vectors, and return everything in a single `query` call, when this is more convenient. \n", - "\n", - "For this use-case, we'll just store the embeddings and IDs, and use these to index the original dataframe. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "84885fec", - "metadata": {}, - "outputs": [], - "source": [ - "# Add the content vectors\n", - "wikipedia_content_collection.add(\n", - " ids=article_df.vector_id.tolist(),\n", - " embeddings=article_df.content_vector.tolist(),\n", - ")\n", - "\n", - "# Add the title vectors\n", - "wikipedia_title_collection.add(\n", - " ids=article_df.vector_id.tolist(),\n", - " embeddings=article_df.title_vector.tolist(),\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "79122c6b", - "metadata": {}, - "source": [ - "### Search the collections\n", - "\n", - "Chroma handles embedding queries for you if an embedding function is set, like in this example." 
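Chroma can also attach metadata to each embedding and filter on it with a `where` clause at query time. The sketch below is illustrative rather than part of the main flow: it assumes the `chroma_client`, `embedding_function` and `article_df` objects from above, and creates a separate, hypothetical `wikipedia_content_filtered` collection so the ID-only collections we just populated stay untouched.

```python
# Optional: store article titles as metadata so results can be filtered at query time
wikipedia_filtered_collection = chroma_client.create_collection(
    name="wikipedia_content_filtered",
    embedding_function=embedding_function,
)

wikipedia_filtered_collection.add(
    ids=article_df.vector_id.tolist(),
    embeddings=article_df.content_vector.tolist(),
    metadatas=[{"title": title} for title in article_df.title.tolist()],
)

# Only return results whose metadata matches the `where` clause
filtered_results = wikipedia_filtered_collection.query(
    query_texts=["modern art in Europe"],
    n_results=5,
    where={"title": "Museum of Modern Art"},
    include=["distances", "metadatas"],
)
print(filtered_results["ids"][0])
```

The helper in the next cell sticks to the simpler ID-only collections created earlier.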
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "273b8b4c", - "metadata": {}, - "outputs": [], - "source": [ - "def query_collection(collection, query, max_results, dataframe):\n", - " results = collection.query(query_texts=query, n_results=max_results, include=['distances']) \n", - " df = pd.DataFrame({\n", - " 'id':results['ids'][0], \n", - " 'score':results['distances'][0],\n", - " 'title': dataframe[dataframe.vector_id.isin(results['ids'][0])]['title'],\n", - " 'content': dataframe[dataframe.vector_id.isin(results['ids'][0])]['text'],\n", - " })\n", - " \n", - " return df" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e84cf47f", - "metadata": {}, - "outputs": [], - "source": [ - "title_query_result = query_collection(\n", - " collection=wikipedia_title_collection,\n", - " query=\"modern art in Europe\",\n", - " max_results=10,\n", - " dataframe=article_df\n", - ")\n", - "title_query_result.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f4db910a", - "metadata": {}, - "outputs": [], - "source": [ - "content_query_result = query_collection(\n", - " collection=wikipedia_content_collection,\n", - " query=\"Famous battles in Scottish history\",\n", - " max_results=10,\n", - " dataframe=article_df\n", - ")\n", - "content_query_result.head()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "a03e7645", - "metadata": {}, - "source": [ - "Now that you've got a basic embeddings search running, you can [hop over to the Chroma docs](https://docs.trychroma.com/usage-guide#using-where-filters) to learn more about how to add filters to your query, update/delete data in your collections, and deploy Chroma." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "ed32fc87", - "metadata": {}, - "source": [ - "## Pinecone\n", - "\n", - "The next option we'll look at is **Pinecone**, a managed vector database which offers a cloud-native option.\n", - "\n", - "Before you proceed with this step you'll need to navigate to [Pinecone](pinecone.io), sign up and then save your API key as an environment variable titled ```PINECONE_API_KEY```.\n", - "\n", - "For section we will:\n", - "- Create an index with multiple namespaces for article titles and content\n", - "- Store our data in the index with separate searchable \"namespaces\" for article **titles** and **content**\n", - "- Fire some similarity search queries to verify our setup is working" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "92e6152a", - "metadata": {}, - "outputs": [], - "source": [ - "api_key = os.getenv(\"PINECONE_API_KEY\")\n", - "pinecone.init(api_key=api_key)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "63b28543", - "metadata": {}, - "source": [ - "### Create Index\n", - "\n", - "First we will need to create an index, which we'll call `wikipedia-articles`. Once we have an index, we can create multiple namespaces, which can make a single index searchable for various use cases. For more details, consult [Pinecone documentation](https://docs.pinecone.io/docs/namespaces#:~:text=Pinecone%20allows%20you%20to%20partition,different%20subsets%20of%20your%20index.).\n", - "\n", - "If you want to batch insert to your index in parallel to increase insertion speed then there is a great guide in the Pinecone documentation on [batch inserts in parallel](https://docs.pinecone.io/docs/insert-data#sending-upserts-in-parallel)." 
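As a rough illustration of that parallel pattern (a sketch adapted from the linked guide, not something the rest of this section depends on), you can open the index with a thread pool and fire the upserts asynchronously. It assumes the `index_name` and `df_batcher` objects defined in the next cells.

```python
# Sketch of parallel upserts, following the Pinecone guide linked above.
# Assumes `index_name` and the `df_batcher` batch generator defined in the next cells.
parallel_index = pinecone.Index(index_name=index_name, pool_threads=30)

# Kick off the upserts without blocking on each request
async_results = [
    parallel_index.upsert(
        vectors=list(zip(batch_df.vector_id, batch_df.content_vector)),
        namespace="content",
        async_req=True,
    )
    for batch_df in df_batcher(article_df)
]

# Wait for all requests to complete before querying
[result.get() for result in async_results]
```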
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0a71c575", - "metadata": {}, - "outputs": [], - "source": [ - "# Models a simple batch generator that make chunks out of an input DataFrame\n", - "class BatchGenerator:\n", - " \n", - " \n", - " def __init__(self, batch_size: int = 10) -> None:\n", - " self.batch_size = batch_size\n", - " \n", - " # Makes chunks out of an input DataFrame\n", - " def to_batches(self, df: pd.DataFrame) -> Iterator[pd.DataFrame]:\n", - " splits = self.splits_num(df.shape[0])\n", - " if splits <= 1:\n", - " yield df\n", - " else:\n", - " for chunk in np.array_split(df, splits):\n", - " yield chunk\n", - "\n", - " # Determines how many chunks DataFrame contains\n", - " def splits_num(self, elements: int) -> int:\n", - " return round(elements / self.batch_size)\n", - " \n", - " __call__ = to_batches\n", - "\n", - "df_batcher = BatchGenerator(300)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7ea9ad46", - "metadata": {}, - "outputs": [], - "source": [ - "# Pick a name for the new index\n", - "index_name = 'wikipedia-articles'\n", - "\n", - "# Check whether the index with the same name already exists - if so, delete it\n", - "if index_name in pinecone.list_indexes():\n", - " pinecone.delete_index(index_name)\n", - " \n", - "# Creates new index\n", - "pinecone.create_index(name=index_name, dimension=len(article_df['content_vector'][0]))\n", - "index = pinecone.Index(index_name=index_name)\n", - "\n", - "# Confirm our index was created\n", - "pinecone.list_indexes()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5daeba00", - "metadata": {}, - "outputs": [], - "source": [ - "# Upsert content vectors in content namespace - this can take a few minutes\n", - "print(\"Uploading vectors to content namespace..\")\n", - "for batch_df in df_batcher(article_df):\n", - " index.upsert(vectors=zip(batch_df.vector_id, batch_df.content_vector), namespace='content')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5fc1b083", - "metadata": {}, - "outputs": [], - "source": [ - "# Upsert title vectors in title namespace - this can also take a few minutes\n", - "print(\"Uploading vectors to title namespace..\")\n", - "for batch_df in df_batcher(article_df):\n", - " index.upsert(vectors=zip(batch_df.vector_id, batch_df.title_vector), namespace='title')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f90c7fba", - "metadata": {}, - "outputs": [], - "source": [ - "# Check index size for each namespace to confirm all of our docs have loaded\n", - "index.describe_index_stats()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "2da40a69", - "metadata": {}, - "source": [ - "### Search data\n", - "\n", - "Now we'll enter some dummy searches and check we get decent results back" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c8280363", - "metadata": {}, - "outputs": [], - "source": [ - "# First we'll create dictionaries mapping vector IDs to their outputs so we can retrieve the text for our search results\n", - "titles_mapped = dict(zip(article_df.vector_id,article_df.title))\n", - "content_mapped = dict(zip(article_df.vector_id,article_df.text))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3c8c2aa1", - "metadata": {}, - "outputs": [], - "source": [ - "def query_article(query, namespace, top_k=5):\n", - " '''Queries an article using its title in the specified\n", - " namespace and prints results.'''\n", 
- "\n", - " # Create vector embeddings based on the title column\n", - " embedded_query = openai.Embedding.create(\n", - " input=query,\n", - " model=EMBEDDING_MODEL,\n", - " )[\"data\"][0]['embedding']\n", - "\n", - " # Query namespace passed as parameter using title vector\n", - " query_result = index.query(embedded_query, \n", - " namespace=namespace, \n", - " top_k=top_k)\n", - "\n", - " # Print query results \n", - " print(f'\\nMost similar results to {query} in \"{namespace}\" namespace:\\n')\n", - " if not query_result.matches:\n", - " print('no query result')\n", - " \n", - " matches = query_result.matches\n", - " ids = [res.id for res in matches]\n", - " scores = [res.score for res in matches]\n", - " df = pd.DataFrame({'id':ids, \n", - " 'score':scores,\n", - " 'title': [titles_mapped[_id] for _id in ids],\n", - " 'content': [content_mapped[_id] for _id in ids],\n", - " })\n", - " \n", - " counter = 0\n", - " for k,v in df.iterrows():\n", - " counter += 1\n", - " print(f'{v.title} (score = {v.score})')\n", - " \n", - " print('\\n')\n", - "\n", - " return df" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3402b1b1", - "metadata": {}, - "outputs": [], - "source": [ - "query_output = query_article('modern art in Europe','title')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "64a3f90a", - "metadata": {}, - "outputs": [], - "source": [ - "content_query_output = query_article(\"Famous battles in Scottish history\",'content')" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "d939342f", - "metadata": {}, - "source": [ - "## Weaviate\n", - "\n", - "Another vector database option we'll explore is **Weaviate**, which offers both a managed, [SaaS](https://console.weaviate.io/) option, as well as a self-hosted [open source](https://github.com/weaviate/weaviate) option. As we've already looked at a cloud vector database, we'll try the self-hosted option here.\n", - "\n", - "For this we will:\n", - "- Set up a local deployment of Weaviate\n", - "- Create indices in Weaviate\n", - "- Store our data there\n", - "- Fire some similarity search queries\n", - "- Try a real use case\n", - "\n", - "\n", - "### Bring your own vectors approach\n", - "In this cookbook, we provide the data with already generated vectors. This is a good approach for scenarios, where your data is already vectorized.\n", - "\n", - "### Automated vectorization with OpenAI module\n", - "For scenarios, where your data is not vectorized yet, you can delegate the vectorization task with OpenAI to Weaviate.\n", - "Weaviate offers a built-in module [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai), which takes care of the vectorization for you at:\n", - "* import\n", - "* for any CRUD operations\n", - "* for semantic search\n", - "\n", - "Check out the [Getting Started with Weaviate and OpenAI module cookbook](./weaviate/getting-started-with-weaviate-and-openai.ipynb) to learn step by step how to import and vectorize data in one step." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "bfdfe260", - "metadata": {}, - "source": [ - "### Setup\n", - "\n", - "To run Weaviate locally, you'll need [Docker](https://www.docker.com/). 
Following the instructions contained in the Weaviate documentation [here](https://weaviate.io/developers/weaviate/installation/docker-compose), we created an example docker-compose.yml file in this repo saved at [./weaviate/docker-compose.yml](./weaviate/docker-compose.yml).\n", - "\n", - "After starting Docker, you can start Weaviate locally by navigating to the `examples/vector_databases/weaviate/` directory and running `docker-compose up -d`.\n", - "\n", - "#### SaaS\n", - "Alternatively you can use [Weaviate Cloud Service](https://console.weaviate.io/) (WCS) to create a free Weaviate cluster.\n", - "1. create a free account and/or login to [WCS](https://console.weaviate.io/)\n", - "2. create a `Weaviate Cluster` with the following settings:\n", - " * Sandbox: `Sandbox Free`\n", - " * Weaviate Version: Use default (latest)\n", - " * OIDC Authentication: `Disabled`\n", - "3. your instance should be ready in a minute or two\n", - "4. make a note of the `Cluster Id`. The link will take you to the full path of your cluster (you will need it later to connect to it). It should be something like: `https://your-project-name-suffix.weaviate.network` " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a78f95d1", - "metadata": {}, - "outputs": [], - "source": [ - "# Option #1 - Self-hosted - Weaviate Open Source \n", - "client = weaviate.Client(\n", - " url=\"http://localhost:8080\",\n", - " additional_headers={\n", - " \"X-OpenAI-Api-Key\": os.getenv(\"OPENAI_API_KEY\")\n", - " }\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e00b7d68", - "metadata": {}, - "outputs": [], - "source": [ - "# Option #2 - SaaS - (Weaviate Cloud Service)\n", - "client = weaviate.Client(\n", - " url=\"https://your-wcs-instance-name.weaviate.network\",\n", - " additional_headers={\n", - " \"X-OpenAI-Api-Key\": os.getenv(\"OPENAI_API_KEY\")\n", - " }\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1d370afa", - "metadata": {}, - "outputs": [], - "source": [ - "client.is_ready()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "03a926b9", - "metadata": {}, - "source": [ - "### Index data\n", - "\n", - "In Weaviate you create __schemas__ to capture each of the entities you will be searching. 
\n", - "\n", - "In this case we'll create a schema called **Article** with the **title** vector from above included for us to search by.\n", - "\n", - "The next few steps closely follow the documentation Weaviate provides [here](https://weaviate.io/developers/weaviate/quickstart).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0e6175a1", - "metadata": {}, - "outputs": [], - "source": [ - "# Clear up the schema, so that we can recreate it\n", - "client.schema.delete_all()\n", - "client.schema.get()\n", - "\n", - "# Define the Schema object to use `text-embedding-ada-002` on `title` and `content`, but skip it for `url`\n", - "article_schema = {\n", - " \"class\": \"Article\",\n", - " \"description\": \"A collection of articles\",\n", - " \"vectorizer\": \"text2vec-openai\",\n", - " \"moduleConfig\": {\n", - " \"text2vec-openai\": {\n", - " \"model\": \"ada\",\n", - " \"modelVersion\": \"002\",\n", - " \"type\": \"text\"\n", - " }\n", - " },\n", - " \"properties\": [{\n", - " \"name\": \"title\",\n", - " \"description\": \"Title of the article\",\n", - " \"dataType\": [\"string\"]\n", - " },\n", - " {\n", - " \"name\": \"content\",\n", - " \"description\": \"Contents of the article\",\n", - " \"dataType\": [\"text\"],\n", - " \"moduleConfig\": { \"text2vec-openai\": { \"skip\": True } }\n", - " }]\n", - "}\n", - "\n", - "# add the Article schema\n", - "client.schema.create_class(article_schema)\n", - "\n", - "# get the schema to make sure it worked\n", - "client.schema.get()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ea838e7d", - "metadata": {}, - "outputs": [], - "source": [ - "### Step 1 - configure Weaviate Batch, which optimizes CRUD operations in bulk\n", - "# - starting batch size of 100\n", - "# - dynamically increase/decrease based on performance\n", - "# - add timeout retries if something goes wrong\n", - "\n", - "client.batch.configure(\n", - " batch_size=100,\n", - " dynamic=True,\n", - " timeout_retries=3,\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b4c967ec", - "metadata": {}, - "outputs": [], - "source": [ - "### Step 2 - import data\n", - "\n", - "print(\"Uploading data with vectors to Article schema..\")\n", - "\n", - "counter=0\n", - "\n", - "with client.batch as batch:\n", - " for k,v in article_df.iterrows():\n", - " \n", - " # print update message every 100 objects \n", - " if (counter %100 == 0):\n", - " print(f\"Import {counter} / {len(article_df)} \")\n", - " \n", - " properties = {\n", - " \"title\": v[\"title\"],\n", - " \"content\": v[\"text\"]\n", - " }\n", - " \n", - " vector = v[\"title_vector\"]\n", - " \n", - " batch.add_data_object(properties, \"Article\", None, vector)\n", - " counter = counter+1\n", - "\n", - "print(f\"Importing ({len(article_df)}) Articles complete\") " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f826e1ad", - "metadata": {}, - "outputs": [], - "source": [ - "# Test that all data has loaded – get object count\n", - "result = (\n", - " client.query.aggregate(\"Article\")\n", - " .with_fields(\"meta { count }\")\n", - " .do()\n", - ")\n", - "print(\"Object count: \", result[\"data\"][\"Aggregate\"][\"Article\"])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5c09d483", - "metadata": {}, - "outputs": [], - "source": [ - "# Test one article has worked by checking one object\n", - "test_article = (\n", - " client.query\n", - " .get(\"Article\", [\"title\", \"content\", \"_additional {id}\"])\n", - " 
.with_limit(1)\n", - " .do()\n", - ")[\"data\"][\"Get\"][\"Article\"][0]\n", - "\n", - "print(test_article[\"_additional\"][\"id\"])\n", - "print(test_article[\"title\"])\n", - "print(test_article[\"content\"])" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "46050ca9", - "metadata": {}, - "source": [ - "### Search data\n", - "\n", - "As above, we'll fire some queries at our new Index and get back results based on the closeness to our existing vectors" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "add222d7", - "metadata": {}, - "outputs": [], - "source": [ - "def query_weaviate(query, collection_name, top_k=20):\n", - "\n", - " # Creates embedding vector from user query\n", - " embedded_query = openai.Embedding.create(\n", - " input=query,\n", - " model=EMBEDDING_MODEL,\n", - " )[\"data\"][0]['embedding']\n", - " \n", - " near_vector = {\"vector\": embedded_query}\n", - "\n", - " # Queries input schema with vectorised user query\n", - " query_result = (\n", - " client.query\n", - " .get(collection_name, [\"title\", \"content\", \"_additional {certainty distance}\"])\n", - " .with_near_vector(near_vector)\n", - " .with_limit(top_k)\n", - " .do()\n", - " )\n", - " \n", - " return query_result" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c888aa4b", - "metadata": {}, - "outputs": [], - "source": [ - "query_result = query_weaviate(\"modern art in Europe\", \"Article\")\n", - "counter = 0\n", - "for article in query_result[\"data\"][\"Get\"][\"Article\"]:\n", - " counter += 1\n", - " print(f\"{counter}. { article['title']} (Certainty: {round(article['_additional']['certainty'],3) }) (Distance: {round(article['_additional']['distance'],3) })\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c54cd8e9", - "metadata": {}, - "outputs": [], - "source": [ - "query_result = query_weaviate(\"Famous battles in Scottish history\", \"Article\")\n", - "counter = 0\n", - "for article in query_result[\"data\"][\"Get\"][\"Article\"]:\n", - " counter += 1\n", - " print(f\"{counter}. {article['title']} (Score: {round(article['_additional']['certainty'],3) })\")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "220b3e11", - "metadata": {}, - "source": [ - "### Let Weaviate handle vector embeddings\n", - "\n", - "Weaviate has a [built-in module for OpenAI](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai), which takes care of the steps required to generate a vector embedding for your queries and any CRUD operations.\n", - "\n", - "This allows you to run a vector query with the `with_near_text` filter, which uses your `OPEN_API_KEY`." 
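You can also combine the module's vector search with a structured filter. The following is a hedged sketch, not part of the original flow: it assumes the `Article` class created above and uses the v3 Python client's `with_where` operator to keep only titles matching a keyword; `near_text_with_filter_weaviate` is just an illustrative name.

```python
def near_text_with_filter_weaviate(query, collection_name, keyword, top_k=10):
    # Vector search via the text2vec-openai module, restricted by a structured `where` filter
    near_text = {"concepts": [query]}
    where_filter = {
        "path": ["title"],
        "operator": "Like",
        "valueString": f"*{keyword}*",
    }

    result = (
        client.query
        .get(collection_name, ["title", "_additional {certainty distance}"])
        .with_near_text(near_text)
        .with_where(where_filter)
        .with_limit(top_k)
        .do()
    )
    return result["data"]["Get"][collection_name]

for i, article in enumerate(near_text_with_filter_weaviate("modern art in Europe", "Article", "museum")):
    print(f"{i + 1}. {article['title']} (Certainty: {round(article['_additional']['certainty'], 3)})")
```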
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9425c882", - "metadata": {}, - "outputs": [], - "source": [ - "def near_text_weaviate(query, collection_name):\n", - " \n", - " nearText = {\n", - " \"concepts\": [query],\n", - " \"distance\": 0.7,\n", - " }\n", - "\n", - " properties = [\n", - " \"title\", \"content\",\n", - " \"_additional {certainty distance}\"\n", - " ]\n", - "\n", - " query_result = (\n", - " client.query\n", - " .get(collection_name, properties)\n", - " .with_near_text(nearText)\n", - " .with_limit(20)\n", - " .do()\n", - " )[\"data\"][\"Get\"][collection_name]\n", - " \n", - " print (f\"Objects returned: {len(query_result)}\")\n", - " \n", - " return query_result" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "501a16f7", - "metadata": {}, - "outputs": [], - "source": [ - "query_result = near_text_weaviate(\"modern art in Europe\",\"Article\")\n", - "counter = 0\n", - "for article in query_result:\n", - " counter += 1\n", - " print(f\"{counter}. { article['title']} (Certainty: {round(article['_additional']['certainty'],3) }) (Distance: {round(article['_additional']['distance'],3) })\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "839b26df", - "metadata": {}, - "outputs": [], - "source": [ - "query_result = near_text_weaviate(\"Famous battles in Scottish history\",\"Article\")\n", - "counter = 0\n", - "for article in query_result:\n", - " counter += 1\n", - " print(f\"{counter}. { article['title']} (Certainty: {round(article['_additional']['certainty'],3) }) (Distance: {round(article['_additional']['distance'],3) })\")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "4dc3a0c0", - "metadata": {}, - "source": [ - "## Milvus\n", - "\n", - "The next vector database we will take a look at is **Milvus**, which also offers a SaaS option like the previous two, as well as self-hosted options using either helm or docker-compose. 
Sticking to the idea of open source, we will show our self-hosted example here.\n", - "\n", - "In this example we will:\n", - "- Set up a local docker-compose based deployment\n", - "- Create the title and content collections\n", - "- Store our data\n", - "- Test out our system with real world searches" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "fe4914e9", - "metadata": {}, - "source": [ - "### Setup\n", - "\n", - "There are many ways to run Milvus (take a look [here](https://milvus.io/docs/install_cluster-milvusoperator.md)), but for now we will stick to a simple standalone Milvus instance with docker-compose.\n", - "\n", - "A simple docker-file can be found at `./milvus/docker-compose.yaml` and can be run using `docker-compose up` if within that mentioned directory or using `docker-compose -f path/to/file up`\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8e10f2ed", - "metadata": {}, - "outputs": [], - "source": [ - "from pymilvus import connections\n", - "\n", - "connections.connect(host='localhost', port=19530) # Local instance defaults to port 19530" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "64ffed22", - "metadata": {}, - "source": [ - "### Index data\n", - "\n", - "In Milvus data is stored in the form of collections, with each collection being able to store the vectors and any attributes that come with them.\n", - "\n", - "In this case we'll create a collection called **articles** which contains the url, title, text and the content_embedding.\n", - "\n", - "In addition to this we will also create an index on the content embedding. Milvus allows for the use of many SOTA indexing methods, but in this case, we are going to use HNSW.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bfabc3db", - "metadata": {}, - "outputs": [], - "source": [ - "from pymilvus import utility, Collection, FieldSchema, CollectionSchema, DataType\n", - "\n", - "# Remove the collection if it already exists.\n", - "if utility.has_collection('articles'):\n", - " utility.drop_collection('articles')\n", - "\n", - "fields = [\n", - " FieldSchema(name='id', dtype=DataType.INT64),\n", - " FieldSchema(name='url', dtype=DataType.VARCHAR, max_length=1000), # Strings have to specify a max length [1, 65535]\n", - " FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=1000),\n", - " FieldSchema(name='text', dtype=DataType.VARCHAR, max_length=1000),\n", - " FieldSchema(name='content_vector', dtype=DataType.FLOAT_VECTOR, dim=len(article_df['content_vector'][0])),\n", - " FieldSchema(name='vector_id', dtype=DataType.INT64, is_primary=True, auto_id=False),\n", - "]\n", - "\n", - "col_schema = CollectionSchema(fields)\n", - "\n", - "col = Collection('articles', col_schema)\n", - "\n", - "# Using a basic HNSW index for this example\n", - "index = {\n", - " 'index_type': 'HNSW',\n", - " 'metric_type': 'L2',\n", - " 'params': {\n", - " 'M': 8,\n", - " 'efConstruction': 64\n", - " },\n", - "}\n", - "\n", - "col.create_index('content_vector', index)\n", - "col.load()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "c1ec4140", - "metadata": {}, - "source": [ - "### Insert the Data\n", - "With the collection setup and the index ready, we can begin pumping in our data. For this example we are cutting off our text data at 1000 characters and adding `...`. 
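Milvus can also combine vector search with scalar filtering through a boolean `expr`, which becomes useful once the attributes below are stored. The following is an optional sketch to run after the insert step: it assumes the `col` collection, `EMBEDDING_MODEL` and OpenAI key from earlier cells, and `query_article_filtered` is just an illustrative name.

```python
def query_article_filtered(query, title_prefix, top_k=5):
    # Generate the query embedding with openai
    embedded_query = openai.Embedding.create(
        input=query,
        model=EMBEDDING_MODEL,
    )["data"][0]["embedding"]

    search_param = {"metric_type": "L2", "params": {"ef": max(64, top_k)}}

    # Apply a scalar filter alongside the vector search
    res = col.search(
        [embedded_query],
        "content_vector",
        search_param,
        expr=f'title like "{title_prefix}%"',
        output_fields=["title", "url"],
        limit=top_k,
    )
    return [{"title": hit.entity.get("title"), "distance": hit.score} for hit in res[0]]

# Example (illustrative): query_article_filtered('fastest plane ever made', 'Lockheed')
```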
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f51aaed1", - "metadata": {}, - "outputs": [], - "source": [ - "# Using the above provided batching function from Pinecone\n", - "def to_batches(df: pd.DataFrame, batch_size: int) -> Iterator[pd.DataFrame]:\n", - " splits = df.shape[0] / batch_size\n", - " if splits <= 1:\n", - " yield df\n", - " else:\n", - " for chunk in np.array_split(df, splits):\n", - " yield chunk\n", - "\n", - "# Since we are storing the text within Milvus we need to clip any that are over our set limit.\n", - "# We can also set the limit to be higher, but that slows down the search requests as more info \n", - "# needs to be sent back.\n", - "def shorten_text(text):\n", - " if len(text) >= 996:\n", - " return text[:996] + '...'\n", - " else:\n", - " return text\n", - "\n", - "for batch in to_batches(article_df, 1000):\n", - " batch = batch.drop(columns = ['title_vector'])\n", - " batch['text'] = batch.text.apply(shorten_text)\n", - " # Due to the vector_id being converted to a string for compatiblity for other vector dbs,\n", - " # we want to swap it back to its original form.\n", - " batch['vector_id'] = batch.vector_id.apply(int)\n", - " col.insert(batch) \n", - "\n", - "col.flush() " - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "1f68a790", - "metadata": {}, - "source": [ - "# Search\n", - "Once the data is inserted into Milvus we can perform searches. For this example the search function takes one argument, top_k, how many closest matches to return. In this step we are also grabbing the `OPENAI_API_KEY` to use for generating embeddings." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "02f21251", - "metadata": {}, - "outputs": [], - "source": [ - "openai.api_key = os.getenv(\"OPENAI_API_KEY\", \"missing_key\")\n", - "\n", - "def query_article(query, top_k=5):\n", - " # Generate the embedding with openai\n", - " embedded_query = openai.Embedding.create(\n", - " input=query,\n", - " model=EMBEDDING_MODEL,\n", - " )[\"data\"][0]['embedding']\n", - "\n", - " # Using some basic params for HNSW\n", - " search_param = {\n", - " 'metric_type': 'L2',\n", - " 'params': {\n", - " 'ef': max(64, top_k)\n", - " }\n", - " }\n", - "\n", - " # Perform the search.\n", - " res = col.search([embedded_query], 'content_vector', search_param, output_fields = ['title', 'url'], limit = top_k)\n", - "\n", - " ret = []\n", - " for hit in res[0]:\n", - " # Get the id, distance, and title for the results\n", - " ret.append({'vector_id': hit.id, 'distance': hit.score, 'title': hit.entity.get('title'), 'url': hit.entity.get('url')})\n", - " return ret\n", - " \n", - "\n", - "for x in query_article('fastest plane ever made', 3):\n", - " print(x.items())" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "03b34ed2", - "metadata": {}, - "source": [ - "## Zilliz\n", - "\n", - "The next vector database we will take a look at is **Zilliz**, a SaaS vector database offering billion scale searches in the milliseconds. Zilliz allows you to not think about the cluster and setup, and instead jump right into searching and learning from your data. 
\n", - "\n", - "In this example we will:\n", - "- Create the title and content collections\n", - "- Store our data\n", - "- Test out our system with real world searches" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "eb3086e1", - "metadata": {}, - "source": [ - "### Setup\n", - "\n", - "Zilliz handles the setup of the service, for more information on signing up and getting started, take a look [here](https://zilliz.com/doc/get_started_overview).\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "95202a96", - "metadata": {}, - "outputs": [], - "source": [ - "from pymilvus import connections\n", - "\n", - "uri = os.getenv(\"ZILLIZ_URI\")\n", - "token = os.getenv(\"ZILLIZ_TOKEN\") # TOKEN == user:password or api_key\n", - "if connections.has_connection('default'):\n", - " connections.disconnect('default')\n", - "connections.connect(uri=uri, token=token)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "c9380e0f", - "metadata": {}, - "source": [ - "### Index data\n", - "\n", - "In Zilliz data is stored in the form of collections, with each collection being able to store the vectors and any attributes that come with them.\n", - "\n", - "In this case we'll create a collection called **articles** which contains the url, title, text and the content_embedding.\n", - "\n", - "In addition to this we will also create an index on the content embedding. Zilliz creates the best index for your use casse and automatically optimizes it as the collection grows.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f6e5cd2f", - "metadata": {}, - "outputs": [], - "source": [ - "from pymilvus import utility, Collection, FieldSchema, CollectionSchema, DataType\n", - "\n", - "# Remove the collection if it already exists.\n", - "if utility.has_collection('articles'):\n", - " utility.drop_collection('articles')\n", - "\n", - "fields = [\n", - " FieldSchema(name='id', dtype=DataType.INT64),\n", - " FieldSchema(name='url', dtype=DataType.VARCHAR, max_length=1000), # Strings have to specify a max length [1, 65535]\n", - " FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=1000),\n", - " FieldSchema(name='text', dtype=DataType.VARCHAR, max_length=1000),\n", - " FieldSchema(name='content_vector', dtype=DataType.FLOAT_VECTOR, dim=len(article_df['content_vector'][0])),\n", - " FieldSchema(name='vector_id', dtype=DataType.INT64, is_primary=True, auto_id=False),\n", - "]\n", - "\n", - "col_schema = CollectionSchema(fields)\n", - "\n", - "col = Collection('articles', col_schema)\n", - "\n", - "# Using the AUTOINDEX index for this example\n", - "index = {\n", - " 'index_type': 'AUTOINDEX',\n", - " 'metric_type': 'L2',\n", - " 'params': {},\n", - "}\n", - "\n", - "col.create_index('content_vector', index)\n", - "col.load()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "7215cef6", - "metadata": {}, - "source": [ - "### Insert the Data\n", - "With the collection setup and the index ready, we can begin pumping in our data. For this example we are cutting off our text data at 1000 characters and adding `...`. 
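Once the insert loop below has completed, it is worth confirming that every row actually landed in the collection before you start searching. A minimal check, assuming the `col` and `article_df` objects from above:

```python
# Confirm the number of stored entities after the insert below has completed
col.flush()  # make sure pending inserts are sealed
print(f"Entities in 'articles': {col.num_entities}")
assert col.num_entities == len(article_df), "Row count mismatch between Zilliz and the DataFrame"
```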
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2fee028a", - "metadata": {}, - "outputs": [], - "source": [ - "# Using the above provided batching function from Pinecone\n", - "def to_batches(df: pd.DataFrame, batch_size: int) -> Iterator[pd.DataFrame]:\n", - " splits = df.shape[0] / batch_size\n", - " if splits <= 1:\n", - " yield df\n", - " else:\n", - " for chunk in np.array_split(df, splits):\n", - " yield chunk\n", - "\n", - "# Since we are storing the text within Zilliz we need to clip any that are over our set limit.\n", - "# We can also set the limit to be higher, but that slows down the search requests as more info \n", - "# needs to be sent back.\n", - "def shorten_text(text):\n", - " if len(text) >= 996:\n", - " return text[:996] + '...'\n", - " else:\n", - " return text\n", - "\n", - "for batch in to_batches(article_df, 1000):\n", - " batch = batch.drop(columns = ['title_vector'])\n", - " batch['text'] = batch.text.apply(shorten_text)\n", - " # Due to the vector_id being converted to a string for compatiblity for other vector dbs,\n", - " # we want to swap it back to its original form.\n", - " batch['vector_id'] = batch.vector_id.apply(int)\n", - " col.insert(batch) \n", - "\n", - "col.flush() " - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "cf8625c6", - "metadata": {}, - "source": [ - "# Search\n", - "Once the data is inserted into Zilliz we can perform searches. For this example the search function takes one argument, top_k, how many closest matches to return. In this step we are also grabbing the `OPENAI_API_KEY` to use for generating embeddings." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "95c82629", - "metadata": {}, - "outputs": [], - "source": [ - "openai.api_key = os.getenv(\"OPENAI_API_KEY\", \"missing_key\")\n", - "\n", - "def query_article(query, top_k=5):\n", - " # Generate the embedding with openai\n", - " embedded_query = openai.Embedding.create(\n", - " input=query,\n", - " model=EMBEDDING_MODEL,\n", - " )[\"data\"][0]['embedding']\n", - "\n", - " # Using simplest param for AUTOINDEX\n", - " search_param = {\n", - " 'metric_type': 'L2',\n", - " 'params': {}\n", - " }\n", - "\n", - " # Perform the search.\n", - " res = col.search([embedded_query], 'content_vector', search_param, output_fields = ['title', 'url'], limit = top_k)\n", - "\n", - " ret = []\n", - " for hit in res[0]:\n", - " # Get the id, distance, and title for the results\n", - " ret.append({'vector_id': hit.id, 'distance': hit.score, 'title': hit.entity.get('title'), 'url': hit.entity.get('url')})\n", - " return ret\n", - " \n", - "\n", - "for x in query_article('fastest plane ever made', 3):\n", - " print(x.items())\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "9cfaed9d", - "metadata": {}, - "source": [ - "## Qdrant\n", - "\n", - "The last vector database we'll consider is **[Qdrant](https://qdrant.tech/)**. This is a high-performant vector search database written in Rust. 
It offers both on-premise and cloud version, but for the purposes of that example we're going to use the local deployment mode.\n", - "\n", - "Setting everything up will require:\n", - "- Spinning up a local instance of Qdrant\n", - "- Configuring the collection and storing the data in it\n", - "- Trying out with some queries" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "38774565", - "metadata": {}, - "source": [ - "### Setup\n", - "\n", - "For the local deployment, we are going to use Docker, according to the Qdrant documentation: https://qdrant.tech/documentation/quick_start/. Qdrant requires just a single container, but an example of the docker-compose.yaml file is available at `./qdrant/docker-compose.yaml` in this repo.\n", - "\n", - "You can start Qdrant instance locally by navigating to this directory and running `docker-compose up -d `" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "76d697e9", - "metadata": { - "ExecuteTime": { - "end_time": "2023-01-18T09:28:38.928205Z", - "start_time": "2023-01-18T09:28:38.913987Z" - } - }, - "outputs": [], - "source": [ - "qdrant = qdrant_client.QdrantClient(host='localhost', prefer_grpc=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1deeb539", - "metadata": { - "ExecuteTime": { - "end_time": "2023-01-18T09:29:19.806639Z", - "start_time": "2023-01-18T09:29:19.727897Z" - } - }, - "outputs": [], - "source": [ - "qdrant.get_collections()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "bc006b6f", - "metadata": {}, - "source": [ - "### Index data\n", - "\n", - "Qdrant stores data in __collections__ where each object is described by at least one vector and may contain an additional metadata called __payload__. Our collection will be called **Articles** and each object will be described by both **title** and **content** vectors.\n", - "\n", - "We'll be using an official [qdrant-client](https://github.com/qdrant/qdrant_client) package that has all the utility methods already built-in." 
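If you would rather use Qdrant Cloud than the local container, the same client connects with a cluster URL and API key. This is a sketch only; the URL and the `QDRANT_API_KEY` environment variable are placeholders for your own cluster details.

```python
# Alternative to the local instance above: connect to a Qdrant Cloud cluster.
# The URL and API key below are placeholders - substitute your own cluster details.
qdrant_cloud = qdrant_client.QdrantClient(
    url="https://your-cluster-id.eu-central-1.aws.cloud.qdrant.io:6333",
    api_key=os.getenv("QDRANT_API_KEY"),
)
qdrant_cloud.get_collections()
```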
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1a84ee1d", - "metadata": { - "ExecuteTime": { - "end_time": "2023-01-18T09:29:22.530121Z", - "start_time": "2023-01-18T09:29:22.524604Z" - } - }, - "outputs": [], - "source": [ - "from qdrant_client.http import models as rest" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "00876f92", - "metadata": { - "ExecuteTime": { - "end_time": "2023-01-18T09:31:14.413334Z", - "start_time": "2023-01-18T09:31:13.619079Z" - } - }, - "outputs": [], - "source": [ - "vector_size = len(article_df['content_vector'][0])\n", - "\n", - "qdrant.recreate_collection(\n", - " collection_name='Articles',\n", - " vectors_config={\n", - " 'title': rest.VectorParams(\n", - " distance=rest.Distance.COSINE,\n", - " size=vector_size,\n", - " ),\n", - " 'content': rest.VectorParams(\n", - " distance=rest.Distance.COSINE,\n", - " size=vector_size,\n", - " ),\n", - " }\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f24e76ab", - "metadata": { - "ExecuteTime": { - "end_time": "2023-01-18T09:36:28.597535Z", - "start_time": "2023-01-18T09:36:24.108867Z" - } - }, - "outputs": [], - "source": [ - "qdrant.upsert(\n", - " collection_name='Articles',\n", - " points=[\n", - " rest.PointStruct(\n", - " id=k,\n", - " vector={\n", - " 'title': v['title_vector'],\n", - " 'content': v['content_vector'],\n", - " },\n", - " payload=v.to_dict(),\n", - " )\n", - " for k, v in article_df.iterrows()\n", - " ],\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d1188a12", - "metadata": { - "ExecuteTime": { - "end_time": "2023-01-18T09:58:13.825886Z", - "start_time": "2023-01-18T09:58:13.816248Z" - } - }, - "outputs": [], - "source": [ - "# Check the collection size to make sure all the points have been stored\n", - "qdrant.count(collection_name='Articles')" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "06ed119b", - "metadata": {}, - "source": [ - "### Search Data\n", - "\n", - "Once the data is put into Qdrant we will start querying the collection for the closest vectors. We may provide an additional parameter `vector_name` to switch from title to content based search." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f1bac4ef", - "metadata": { - "ExecuteTime": { - "end_time": "2023-01-18T09:50:35.265647Z", - "start_time": "2023-01-18T09:50:35.256065Z" - } - }, - "outputs": [], - "source": [ - "def query_qdrant(query, collection_name, vector_name='title', top_k=20):\n", - "\n", - " # Creates embedding vector from user query\n", - " embedded_query = openai.Embedding.create(\n", - " input=query,\n", - " model=EMBEDDING_MODEL,\n", - " )['data'][0]['embedding']\n", - " \n", - " query_results = qdrant.search(\n", - " collection_name=collection_name,\n", - " query_vector=(\n", - " vector_name, embedded_query\n", - " ),\n", - " limit=top_k,\n", - " )\n", - " \n", - " return query_results" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "aa92f3d3", - "metadata": { - "ExecuteTime": { - "end_time": "2023-01-18T09:50:46.545145Z", - "start_time": "2023-01-18T09:50:35.711020Z" - } - }, - "outputs": [], - "source": [ - "query_results = query_qdrant('modern art in Europe', 'Articles')\n", - "for i, article in enumerate(query_results):\n", - " print(f'{i + 1}. 
{article.payload[\"title\"]} (Score: {round(article.score, 3)})')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7ed116b8", - "metadata": { - "ExecuteTime": { - "end_time": "2023-01-18T09:53:11.038910Z", - "start_time": "2023-01-18T09:52:55.248029Z" - } - }, - "outputs": [], - "source": [ - "# This time we'll query using content vector\n", - "query_results = query_qdrant('Famous battles in Scottish history', 'Articles', 'content')\n", - "for i, article in enumerate(query_results):\n", - " print(f'{i + 1}. {article.payload[\"title\"]} (Score: {round(article.score, 3)})')" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "43bffd04", - "metadata": {}, - "source": [ - "# Redis\n", - "\n", - "The next vector database covered in this tutorial is **[Redis](https://redis.io)**. You most likely already know Redis. What you might not be aware of is the [RediSearch module](https://github.com/RediSearch/RediSearch). Enterprises have been using Redis with the RediSearch module for years now across all major cloud providers, Redis Cloud, and on premise. Recently, the Redis team added vector storage and search capability to this module in addition to the features RediSearch already had.\n", - "\n", - "Given the large ecosystem around Redis, there are most likely client libraries in the language you need. You can use any standard Redis client library to run RediSearch commands, but it's easiest to use a library that wraps the RediSearch API. Below are a few examples, but you can find more client libraries [here](https://redis.io/resources/clients/).\n", - "\n", - "| Project | Language | License | Author | Stars |\n", - "|----------|---------|--------|---------|-------|\n", - "| [jedis][jedis-url] | Java | MIT | [Redis][redis-url] | ![Stars][jedis-stars] |\n", - "| [redis-py][redis-py-url] | Python | MIT | [Redis][redis-url] | ![Stars][redis-py-stars] |\n", - "| [node-redis][node-redis-url] | Node.js | MIT | [Redis][redis-url] | ![Stars][node-redis-stars] |\n", - "| [nredisstack][nredisstack-url] | .NET | MIT | [Redis][redis-url] | ![Stars][nredisstack-stars] |\n", - "| [redisearch-go][redisearch-go-url] | Go | BSD | [Redis][redisearch-go-author] | [![redisearch-go-stars]][redisearch-go-url] |\n", - "| [redisearch-api-rs][redisearch-api-rs-url] | Rust | BSD | [Redis][redisearch-api-rs-author] | [![redisearch-api-rs-stars]][redisearch-api-rs-url] |\n", - "\n", - "[redis-url]: https://redis.com\n", - "\n", - "[redis-py-url]: https://github.com/redis/redis-py\n", - "[redis-py-stars]: https://img.shields.io/github/stars/redis/redis-py.svg?style=social&label=Star&maxAge=2592000\n", - "[redis-py-package]: https://pypi.python.org/pypi/redis\n", - "\n", - "[jedis-url]: https://github.com/redis/jedis\n", - "[jedis-stars]: https://img.shields.io/github/stars/redis/jedis.svg?style=social&label=Star&maxAge=2592000\n", - "[Jedis-package]: https://search.maven.org/artifact/redis.clients/jedis\n", - "\n", - "[nredisstack-url]: https://github.com/redis/nredisstack\n", - "[nredisstack-stars]: https://img.shields.io/github/stars/redis/nredisstack.svg?style=social&label=Star&maxAge=2592000\n", - "[nredisstack-package]: https://www.nuget.org/packages/nredisstack/\n", - "\n", - "[node-redis-url]: https://github.com/redis/node-redis\n", - "[node-redis-stars]: https://img.shields.io/github/stars/redis/node-redis.svg?style=social&label=Star&maxAge=2592000\n", - "[node-redis-package]: https://www.npmjs.com/package/redis\n", - "\n", - "[redis-om-python-url]: 
https://github.com/redis/redis-om-python\n", - "[redis-om-python-author]: https://redis.com\n", - "[redis-om-python-stars]: https://img.shields.io/github/stars/redis/redis-om-python.svg?style=social&label=Star&maxAge=2592000\n", - "\n", - "[redisearch-go-url]: https://github.com/RediSearch/redisearch-go\n", - "[redisearch-go-author]: https://redis.com\n", - "[redisearch-go-stars]: https://img.shields.io/github/stars/RediSearch/redisearch-go.svg?style=social&label=Star&maxAge=2592000\n", - "\n", - "[redisearch-api-rs-url]: https://github.com/RediSearch/redisearch-api-rs\n", - "[redisearch-api-rs-author]: https://redis.com\n", - "[redisearch-api-rs-stars]: https://img.shields.io/github/stars/RediSearch/redisearch-api-rs.svg?style=social&label=Star&maxAge=2592000\n", - "\n", - "\n", - "In the below cells, we will walk you through using Redis as a vector database. Since many of you are likely already used to the Redis API, this should be familiar to most." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "698e24f6", - "metadata": {}, - "source": [ - "## Setup\n", - "\n", - "There are many ways to deploy Redis with RediSearch. The easiest way to get started is to use Docker, but there are are many potential options for deployment. For other deployment options, see the [redis directory](./redis) in this repo.\n", - "\n", - "For this tutorial, we will use Redis Stack on Docker.\n", - "\n", - "Start a version of Redis with RediSearch (Redis Stack) by running the following docker command\n", - "\n", - "```bash\n", - "$ cd redis\n", - "$ docker compose up -d\n", - "```\n", - "This also includes the [RedisInsight](https://redis.com/redis-enterprise/redis-insight/) GUI for managing your Redis database which you can view at [http://localhost:8001](http://localhost:8001) once you start the docker container.\n", - "\n", - "You're all set up and ready to go! Next, we import and create our client for communicating with the Redis database we just created." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d2ce669a", - "metadata": {}, - "outputs": [], - "source": [ - "import redis\n", - "from redis.commands.search.indexDefinition import (\n", - " IndexDefinition,\n", - " IndexType\n", - ")\n", - "from redis.commands.search.query import Query\n", - "from redis.commands.search.field import (\n", - " TextField,\n", - " VectorField\n", - ")\n", - "\n", - "REDIS_HOST = \"localhost\"\n", - "REDIS_PORT = 6379\n", - "REDIS_PASSWORD = \"\" # default for passwordless Redis\n", - "\n", - "# Connect to Redis\n", - "redis_client = redis.Redis(\n", - " host=REDIS_HOST,\n", - " port=REDIS_PORT,\n", - " password=REDIS_PASSWORD\n", - ")\n", - "redis_client.ping()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "3f6f0af9", - "metadata": {}, - "source": [ - "## Creating a Search Index\n", - "\n", - "The below cells will show how to specify and create a search index in Redis. We will\n", - "\n", - "1. Set some constants for defining our index like the distance metric and the index name\n", - "2. Define the index schema with RediSearch fields\n", - "3. 
Create the index\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a7c64cb9", - "metadata": {}, - "outputs": [], - "source": [ - "# Constants\n", - "VECTOR_DIM = len(article_df['title_vector'][0]) # length of the vectors\n", - "VECTOR_NUMBER = len(article_df) # initial number of vectors\n", - "INDEX_NAME = \"embeddings-index\" # name of the search index\n", - "PREFIX = \"doc\" # prefix for the document keys\n", - "DISTANCE_METRIC = \"COSINE\" # distance metric for the vectors (ex. COSINE, IP, L2)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d95fcd06", - "metadata": {}, - "outputs": [], - "source": [ - "# Define RediSearch fields for each of the columns in the dataset\n", - "title = TextField(name=\"title\")\n", - "url = TextField(name=\"url\")\n", - "text = TextField(name=\"text\")\n", - "title_embedding = VectorField(\"title_vector\",\n", - " \"FLAT\", {\n", - " \"TYPE\": \"FLOAT32\",\n", - " \"DIM\": VECTOR_DIM,\n", - " \"DISTANCE_METRIC\": DISTANCE_METRIC,\n", - " \"INITIAL_CAP\": VECTOR_NUMBER,\n", - " }\n", - ")\n", - "text_embedding = VectorField(\"content_vector\",\n", - " \"FLAT\", {\n", - " \"TYPE\": \"FLOAT32\",\n", - " \"DIM\": VECTOR_DIM,\n", - " \"DISTANCE_METRIC\": DISTANCE_METRIC,\n", - " \"INITIAL_CAP\": VECTOR_NUMBER,\n", - " }\n", - ")\n", - "fields = [title, url, text, title_embedding, text_embedding]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7418480d", - "metadata": {}, - "outputs": [], - "source": [ - "# Check if index exists\n", - "try:\n", - " redis_client.ft(INDEX_NAME).info()\n", - " print(\"Index already exists\")\n", - "except:\n", - " # Create RediSearch Index\n", - " redis_client.ft(INDEX_NAME).create_index(\n", - " fields = fields,\n", - " definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)\n", - " )" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "f3563eec", - "metadata": {}, - "source": [ - "## Load Documents into the Index\n", - "\n", - "Now that we have a search index, we can load documents into it. We will use the same documents we used in the previous examples. In Redis, either the Hash or JSON (if using RedisJSON in addition to RediSearch) data types can be used to store documents. We will use the HASH data type in this example. The below cells will show how to load documents into the index." 
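One aside on the field definitions above before loading documents: the `FLAT` algorithm performs an exact, brute-force scan, which is perfectly fine for the 25,000 vectors in this dataset but slows down at much larger scale. RediSearch also supports an approximate `HNSW` index. Below is a minimal sketch of how the same content field could be declared that way; the `M` and `EF_CONSTRUCTION` values are illustrative, not tuned recommendations.

```python
from redis.commands.search.field import VectorField

# Hypothetical HNSW variant of the content vector field (not used by the cells below).
content_embedding_hnsw = VectorField(
    "content_vector",
    "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": DISTANCE_METRIC,
        "INITIAL_CAP": VECTOR_NUMBER,
        "M": 16,                 # max graph connections per node (illustrative)
        "EF_CONSTRUCTION": 200,  # larger = better recall, slower build (illustrative)
    }
)
```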
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e98d63ad", - "metadata": {}, - "outputs": [], - "source": [ - "def index_documents(client: redis.Redis, prefix: str, documents: pd.DataFrame):\n", - " records = documents.to_dict(\"records\")\n", - " for doc in records:\n", - " key = f\"{prefix}:{str(doc['id'])}\"\n", - "\n", - " # create byte vectors for title and content\n", - " title_embedding = np.array(doc[\"title_vector\"], dtype=np.float32).tobytes()\n", - " content_embedding = np.array(doc[\"content_vector\"], dtype=np.float32).tobytes()\n", - "\n", - " # replace list of floats with byte vectors\n", - " doc[\"title_vector\"] = title_embedding\n", - " doc[\"content_vector\"] = content_embedding\n", - "\n", - " client.hset(key, mapping = doc)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "098d3c5a", - "metadata": {}, - "outputs": [], - "source": [ - "index_documents(redis_client, PREFIX, article_df)\n", - "print(f\"Loaded {redis_client.info()['db0']['keys']} documents in Redis search index with name: {INDEX_NAME}\")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "f646bff4", - "metadata": {}, - "source": [ - "## Running Search Queries\n", - "\n", - "Now that we have a search index and documents loaded into it, we can run search queries. Below we will provide a function that will run a search query and return the results. Using this function we run a few queries that will show how you can utilize Redis as a vector database. Each example will demonstrate specific features to keep in mind when developing your search application with Redis.\n", - "\n", - "1. **Return Fields**: You can specify which fields you want to return in the search results. This is useful if you only want to return a subset of the fields in your documents and doesn't require a separate call to retrieve documents. In the below example, we will only return the `title` field in the search results.\n", - "2. **Hybrid Search**: You can combine vector search with any of the other RediSearch fields for hybrid search such as full text search, tag, geo, and numeric. In the below example, we will combine vector search with full text search.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "508d1f89", - "metadata": {}, - "outputs": [], - "source": [ - "def search_redis(\n", - " redis_client: redis.Redis,\n", - " user_query: str,\n", - " index_name: str = \"embeddings-index\",\n", - " vector_field: str = \"title_vector\",\n", - " return_fields: list = [\"title\", \"url\", \"text\", \"vector_score\"],\n", - " hybrid_fields = \"*\",\n", - " k: int = 20,\n", - ") -> List[dict]:\n", - "\n", - " # Creates embedding vector from user query\n", - " embedded_query = openai.Embedding.create(input=user_query,\n", - " model=EMBEDDING_MODEL,\n", - " )[\"data\"][0]['embedding']\n", - "\n", - " # Prepare the Query\n", - " base_query = f'{hybrid_fields}=>[KNN {k} @{vector_field} $vector AS vector_score]'\n", - " query = (\n", - " Query(base_query)\n", - " .return_fields(*return_fields)\n", - " .sort_by(\"vector_score\")\n", - " .paging(0, k)\n", - " .dialect(2)\n", - " )\n", - " params_dict = {\"vector\": np.array(embedded_query).astype(dtype=np.float32).tobytes()}\n", - "\n", - " # perform vector search\n", - " results = redis_client.ft(index_name).search(query, params_dict)\n", - " for i, article in enumerate(results.docs):\n", - " score = 1 - float(article.vector_score)\n", - " print(f\"{i}. 
{article.title} (Score: {round(score ,3) })\")\n", - " return results.docs" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1f0eef07", - "metadata": {}, - "outputs": [], - "source": [ - "# For using OpenAI to generate query embedding\n", - "openai.api_key = os.getenv(\"OPENAI_API_KEY\", \"sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\")\n", - "results = search_redis(redis_client, 'modern art in Europe', k=10)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7b805a81", - "metadata": {}, - "outputs": [], - "source": [ - "results = search_redis(redis_client, 'Famous battles in Scottish history', vector_field='content_vector', k=10)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "0ed0b34e", - "metadata": {}, - "source": [ - "## Hybrid Queries with Redis\n", - "\n", - "The previous examples showed how run vector search queries with RediSearch. In this section, we will show how to combine vector search with other RediSearch fields for hybrid search. In the below example, we will combine vector search with full text search." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c94d5cce", - "metadata": {}, - "outputs": [], - "source": [ - "def create_hybrid_field(field_name: str, value: str) -> str:\n", - " return f'@{field_name}:\"{value}\"'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bfcd31c2", - "metadata": {}, - "outputs": [], - "source": [ - "# search the content vector for articles about famous battles in Scottish history and only include results with Scottish in the title\n", - "results = search_redis(redis_client,\n", - " \"Famous battles in Scottish history\",\n", - " vector_field=\"title_vector\",\n", - " k=5,\n", - " hybrid_fields=create_hybrid_field(\"title\", \"Scottish\")\n", - " )" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "28ab1e30", - "metadata": {}, - "outputs": [], - "source": [ - "# run a hybrid query for articles about Art in the title vector and only include results with the phrase \"Leonardo da Vinci\" in the text\n", - "results = search_redis(redis_client,\n", - " \"Art\",\n", - " vector_field=\"title_vector\",\n", - " k=5,\n", - " hybrid_fields=create_hybrid_field(\"text\", \"Leonardo da Vinci\")\n", - " )\n", - "\n", - "# find specific mention of Leonardo da Vinci in the text that our full-text-search query returned\n", - "mention = [sentence for sentence in results[0].text.split(\"\\n\") if \"Leonardo da Vinci\" in sentence][0]\n", - "mention" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "f94b5be2", - "metadata": {}, - "source": [ - "For more example with Redis as a vector database, see the README and examples within the ``vector_databases/redis`` directory of this repository" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "collapsed": false - }, - "source": [ - "## Typesense\n", - "\n", - "The next vector store we'll look at is [Typesense](https://typesense.org/), which is an open source, in-memory search engine, that you can either self-host or run on [Typesense Cloud](https://cloud.typesense.org).\n", - "\n", - "Typesense focuses on performance by storing the entire index in RAM (with a backup on disk) and also focuses on providing an out-of-the-box developer experience by simplifying available options and setting good defaults. 
It also lets you combine attribute-based filtering together with vector queries.\n", - "\n", - "For this example, we will set up a local docker-based Typesense server, index our vectors in Typesense and then do some nearest-neighbor search queries. If you use Typesense Cloud, you can skip the docker setup part and just obtain the hostname and API keys from your cluster dashboard." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "collapsed": false - }, - "source": [ - "### Setup\n", - "\n", - "To run Typesense locally, you'll need [Docker](https://www.docker.com/). Following the instructions contained in the Typesense documentation [here](https://typesense.org/docs/guide/install-typesense.html#docker-compose), we created an example docker-compose.yml file in this repo saved at [./typesense/docker-compose.yml](./typesense/docker-compose.yml).\n", - "\n", - "After starting Docker, you can start Typesense locally by navigating to the `examples/vector_databases/typesense/` directory and running `docker-compose up -d`.\n", - "\n", - "The default API key is set to `xyz` in the Docker compose file, and the default Typesense port to `8108`." - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "import typesense\n", - "\n", - "typesense_client = \\\n", - " typesense.Client({\n", - " \"nodes\": [{\n", - " \"host\": \"localhost\", # For Typesense Cloud use xxx.a1.typesense.net\n", - " \"port\": \"8108\", # For Typesense Cloud use 443\n", - " \"protocol\": \"http\" # For Typesense Cloud use https\n", - " }],\n", - " \"api_key\": \"xyz\",\n", - " \"connection_timeout_seconds\": 60\n", - " })" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "collapsed": false - }, - "source": [ - "### Index data\n", - "\n", - "To index vectors in Typesense, we'll first create a Collection (which is a collection of Documents) and turn on vector indexing for a particular field. You can even store multiple vector fields in a single document." 
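Before creating the collection, it can be worth a quick sanity check that the client can reach the server you just started. A minimal sketch, which simply lists whatever collections already exist:

```python
# Optional connectivity check (a sketch); this raises if the Typesense server is unreachable.
existing_collections = typesense_client.collections.retrieve()
print(f"Connected to Typesense; {len(existing_collections)} existing collection(s) found.")
```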
- ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Delete existing collections if they already exist\n", - "try:\n", - " typesense_client.collections['wikipedia_articles'].delete()\n", - "except Exception as e:\n", - " pass\n", - "\n", - "# Create a new collection\n", - "\n", - "schema = {\n", - " \"name\": \"wikipedia_articles\",\n", - " \"fields\": [\n", - " {\n", - " \"name\": \"content_vector\",\n", - " \"type\": \"float[]\",\n", - " \"num_dim\": len(article_df['content_vector'][0])\n", - " },\n", - " {\n", - " \"name\": \"title_vector\",\n", - " \"type\": \"float[]\",\n", - " \"num_dim\": len(article_df['title_vector'][0])\n", - " }\n", - " ]\n", - "}\n", - "\n", - "create_response = typesense_client.collections.create(schema)\n", - "print(create_response)\n", - "\n", - "print(\"Created new collection wikipedia-articles\")" - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Upsert the vector data into the collection we just created\n", - "#\n", - "# Note: This can take a few minutes, especially if your on an M1 and running docker in an emulated mode\n", - "\n", - "print(\"Indexing vectors in Typesense...\")\n", - "\n", - "document_counter = 0\n", - "documents_batch = []\n", - "\n", - "for k,v in article_df.iterrows():\n", - " # Create a document with the vector data\n", - "\n", - " # Notice how you can add any fields that you haven't added to the schema to the document.\n", - " # These will be stored on disk and returned when the document is a hit.\n", - " # This is useful to store attributes required for display purposes.\n", - "\n", - " document = {\n", - " \"title_vector\": v[\"title_vector\"],\n", - " \"content_vector\": v[\"content_vector\"],\n", - " \"title\": v[\"title\"],\n", - " \"content\": v[\"text\"],\n", - " }\n", - " documents_batch.append(document)\n", - " document_counter = document_counter + 1\n", - "\n", - " # Upsert a batch of 100 documents\n", - " if document_counter % 100 == 0 or document_counter == len(article_df):\n", - " response = typesense_client.collections['wikipedia_articles'].documents.import_(documents_batch)\n", - " # print(response)\n", - "\n", - " documents_batch = []\n", - " print(f\"Processed {document_counter} / {len(article_df)} \")\n", - "\n", - "print(f\"Imported ({len(article_df)}) articles.\")" - ] - }, - { - "cell_type": "code", - "execution_count": 36, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Check the number of documents imported\n", - "\n", - "collection = typesense_client.collections['wikipedia_articles'].retrieve()\n", - "print(f'Collection has {collection[\"num_documents\"]} documents')" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "collapsed": false - }, - "source": [ - "### Search Data\n", - "\n", - "Now that we've imported the vectors into Typesense, we can do a nearest neighbor search on the `title_vector` or `content_vector` field." 
- ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "def query_typesense(query, field='title', top_k=20):\n", - "\n", - " # Creates embedding vector from user query\n", - " openai.api_key = os.getenv(\"OPENAI_API_KEY\", \"sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\")\n", - " embedded_query = openai.Embedding.create(\n", - " input=query,\n", - " model=EMBEDDING_MODEL,\n", - " )['data'][0]['embedding']\n", - "\n", - " typesense_results = typesense_client.multi_search.perform({\n", - " \"searches\": [{\n", - " \"q\": \"*\",\n", - " \"collection\": \"wikipedia_articles\",\n", - " \"vector_query\": f\"{field}_vector:([{','.join(str(v) for v in embedded_query)}], k:{top_k})\"\n", - " }]\n", - " }, {})\n", - "\n", - " return typesense_results" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": { - "collapsed": false - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "1. Museum of Modern Art (Score: 0.12471389770507812)\n", - "2. Renaissance art (Score: 0.13575094938278198)\n", - "3. Pop art (Score: 0.13949453830718994)\n", - "4. Hellenistic art (Score: 0.14710968732833862)\n", - "5. Modernist literature (Score: 0.15288257598876953)\n", - "6. Art film (Score: 0.15657293796539307)\n", - "7. Art (Score: 0.15847939252853394)\n", - "8. Byzantine art (Score: 0.1591007113456726)\n", - "9. Postmodernism (Score: 0.15989065170288086)\n", - "10. Cubism (Score: 0.16093528270721436)\n" - ] - } - ], - "source": [ - "query_results = query_typesense('modern art in Europe', 'title')\n", - "\n", - "for i, hit in enumerate(query_results['results'][0]['hits']):\n", - " document = hit[\"document\"]\n", - " vector_distance = hit[\"vector_distance\"]\n", - " print(f'{i + 1}. {document[\"title\"]} (Distance: {vector_distance})')" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": { - "collapsed": false - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "1. Battle of Bannockburn (Distance: 0.1306602954864502)\n", - "2. Wars of Scottish Independence (Distance: 0.13851898908615112)\n", - "3. 1651 (Distance: 0.14746594429016113)\n", - "4. First War of Scottish Independence (Distance: 0.15035754442214966)\n", - "5. Robert I of Scotland (Distance: 0.1538146734237671)\n", - "6. 841 (Distance: 0.15609896183013916)\n", - "7. 1716 (Distance: 0.15618199110031128)\n", - "8. 1314 (Distance: 0.16281157732009888)\n", - "9. William Wallace (Distance: 0.16468697786331177)\n", - "10. Stirling (Distance: 0.16858011484146118)\n" - ] - } - ], - "source": [ - "query_results = query_typesense('Famous battles in Scottish history', 'content')\n", - "\n", - "for i, hit in enumerate(query_results['results'][0]['hits']):\n", - " document = hit[\"document\"]\n", - " vector_distance = hit[\"vector_distance\"]\n", - " print(f'{i + 1}. {document[\"title\"]} (Distance: {vector_distance})')" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "55afccbf", - "metadata": {}, - "source": [ - "Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo." 
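One note on the Typesense results above: `vector_distance` is a cosine distance, so lower is better, which is the opposite convention from the similarity-style scores printed in the Redis section. If you prefer a comparable similarity number, a small sketch of the conversion:

```python
# Convert Typesense's cosine distance into a similarity-style score (1 - distance),
# mirroring the convention used in the Redis example earlier in this notebook.
for i, hit in enumerate(query_results['results'][0]['hits']):
    similarity = 1 - hit["vector_distance"]
    print(f'{i + 1}. {hit["document"]["title"]} (Similarity: {round(similarity, 3)})')
```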
- ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "56a02772", - "metadata": {}, - "source": [ - "# MyScale\n", - "The next vector database we'll consider is [MyScale](https://myscale.com).\n", - "\n", - "[MyScale](https://myscale.com) is a database built on Clickhouse that combines vector search and SQL analytics to offer a high-performance, streamlined, and fully managed experience. It's designed to facilitate joint queries and analyses on both structured and vector data, with comprehensive SQL support for all data processing.\n", - "\n", - "Deploy and execute vector search with SQL on your cluster within two minutes by using [MyScale Console](https://console.myscale.com)." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "d3e1f96b", - "metadata": {}, - "source": [ - "## Connect to MyScale\n", - "\n", - "Follow the [connections details](https://docs.myscale.com/en/cluster-management/) section to retrieve the cluster host, username, and password information from the MyScale console, and use it to create a connection to your cluster as shown below:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "024243cf", - "metadata": {}, - "outputs": [], - "source": [ - "import clickhouse_connect\n", - "\n", - "# initialize client\n", - "client = clickhouse_connect.get_client(host='YOUR_CLUSTER_HOST', port=8443, username='YOUR_USERNAME', password='YOUR_CLUSTER_PASSWORD')" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "067009db", - "metadata": {}, - "source": [ - "## Index data\n", - "\n", - "We will create an SQL table called `articles` in MyScale to store the embeddings data. The table will include a vector index with a cosine distance metric and a constraint for the length of the embeddings. Use the following code to create and insert data into the articles table:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "685cba13", - "metadata": {}, - "outputs": [], - "source": [ - "# create articles table with vector index\n", - "embedding_len=len(article_df['content_vector'][0]) # 1536\n", - "\n", - "client.command(f\"\"\"\n", - "CREATE TABLE IF NOT EXISTS default.articles\n", - "(\n", - " id UInt64,\n", - " url String,\n", - " title String,\n", - " text String,\n", - " content_vector Array(Float32),\n", - " CONSTRAINT cons_vector_len CHECK length(content_vector) = {embedding_len},\n", - " VECTOR INDEX article_content_index content_vector TYPE HNSWFLAT('metric_type=Cosine')\n", - ")\n", - "ENGINE = MergeTree ORDER BY id\n", - "\"\"\")\n", - "\n", - "# insert data into the table in batches\n", - "from tqdm.auto import tqdm\n", - "\n", - "batch_size = 100\n", - "total_records = len(article_df)\n", - "\n", - "# we only need subset of columns\n", - "article_df = article_df[['id', 'url', 'title', 'text', 'content_vector']]\n", - "\n", - "# upload data in batches\n", - "data = article_df.to_records(index=False).tolist()\n", - "column_names = article_df.columns.tolist()\n", - "\n", - "for i in tqdm(range(0, total_records, batch_size)):\n", - " i_end = min(i + batch_size, total_records)\n", - " client.insert(\"default.articles\", data[i:i_end], column_names=column_names)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "b0f0e591", - "metadata": {}, - "source": [ - "We need to check the build status of the vector index before proceeding with the search, as it is automatically built in the background." 
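Because the index is built asynchronously, a single status check can race with the build. A minimal polling sketch that waits until the index reports `Built` (the attempt limit and sleep interval below are arbitrary choices):

```python
import time

# Poll the vector index status until it reports 'Built'; limits below are arbitrary.
get_index_status = "SELECT status FROM system.vector_indices WHERE name='article_content_index'"
for _ in range(30):
    status = client.command(get_index_status)
    print(f"index build status: {status}")
    if status == 'Built':
        break
    time.sleep(5)
```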
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9251bdf1", - "metadata": {}, - "outputs": [], - "source": [ - "# check count of inserted data\n", - "print(f\"articles count: {client.command('SELECT count(*) FROM default.articles')}\")\n", - "\n", - "# check the status of the vector index, make sure vector index is ready with 'Built' status\n", - "get_index_status=\"SELECT status FROM system.vector_indices WHERE name='article_content_index'\"\n", - "print(f\"index build status: {client.command(get_index_status)}\")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "fe55234a", - "metadata": {}, - "source": [ - "## Search data\n", - "\n", - "Once indexed in MyScale, we can perform vector search to find similar content. First, we will use the OpenAI API to generate embeddings for our query. Then, we will perform the vector search using MyScale." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fd5f03c6", - "metadata": {}, - "outputs": [], - "source": [ - "import openai\n", - "\n", - "query = \"Famous battles in Scottish history\"\n", - "\n", - "# creates embedding vector from user query\n", - "embed = openai.Embedding.create(\n", - " input=query,\n", - " model=\"text-embedding-ada-002\",\n", - ")[\"data\"][0][\"embedding\"]\n", - "\n", - "# query the database to find the top K similar content to the given query\n", - "top_k = 10\n", - "results = client.query(f\"\"\"\n", - "SELECT id, url, title, distance(content_vector, {embed}) as dist\n", - "FROM default.articles\n", - "ORDER BY dist\n", - "LIMIT {top_k}\n", - "\"\"\")\n", - "\n", - "# display results\n", - "for i, r in enumerate(results.named_results()):\n", - " print(i+1, r['title'])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0119d87a", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.16" - }, - "vscode": { - "interpreter": { - "hash": "fd16a328ca3d68029457069b79cb0b38eb39a0f5ccc4fe4473d3047707df8207" - } - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/vector_databases/chroma/Using_Chroma_for_embeddings_search.ipynb b/examples/vector_databases/chroma/Using_Chroma_for_embeddings_search.ipynb new file mode 100644 index 00000000..5eb2b424 --- /dev/null +++ b/examples/vector_databases/chroma/Using_Chroma_for_embeddings_search.ipynb @@ -0,0 +1,733 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "cb1537e6", + "metadata": {}, + "source": [ + "# Using Chroma for Embeddings Search\n", + "\n", + "This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.\n", + "\n", + "### What is a Vector Database\n", + "\n", + "A vector database is a database made to store, manage and search embedding vectors. 
The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.\n", + "\n", + "### Why use a Vector Database\n", + "\n", + "Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question and answering, chatbot and recommendation services, for example), and make use of them in a secure, scalable environment. Many of our customers make embeddings solve their problems at small scale but performance and security hold them back from going into production - we see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.\n", + "\n", + "\n", + "### Demo Flow\n", + "The demo flow is:\n", + "- **Setup**: Import packages and set any required variables\n", + "- **Load data**: Load a dataset and embed it using OpenAI embeddings\n", + "- **Chroma**:\n", + " - *Setup*: Here we'll set up the Python client for Chroma. For more details go [here](https://docs.trychroma.com/usage-guide)\n", + " - *Index Data*: We'll create collections with vectors for __titles__ and __content__\n", + " - *Search Data*: We'll run a few searches to confirm it works\n", + "\n", + "Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings." + ] + }, + { + "cell_type": "markdown", + "id": "e2b59250", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "Import the required libraries and set the embedding model that we'd like to use." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8d8810f9", + "metadata": {}, + "outputs": [], + "source": [ + "# We'll need to install the Chroma client\n", + "!pip install chromadb\n", + "\n", + "#Install wget to pull zip file\n", + "!pip install wget" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "5be94df6", + "metadata": {}, + "outputs": [], + "source": [ + "import openai\n", + "from typing import List, Iterator\n", + "import pandas as pd\n", + "import numpy as np\n", + "import os\n", + "import wget\n", + "from ast import literal_eval\n", + "\n", + "# Chroma's client library for Python\n", + "import chromadb\n", + "\n", + "# I've set this to our new embeddings model, this can be changed to the embedding model of your choice\n", + "EMBEDDING_MODEL = \"text-embedding-ada-002\"\n", + "\n", + "# Ignore unclosed SSL socket warnings - optional in case you get these errors\n", + "import warnings\n", + "\n", + "warnings.filterwarnings(action=\"ignore\", message=\"unclosed\", category=ResourceWarning)\n", + "warnings.filterwarnings(\"ignore\", category=DeprecationWarning) " + ] + }, + { + "cell_type": "markdown", + "id": "e5d9d2e1", + "metadata": {}, + "source": [ + "## Load data\n", + "\n", + "In this section we'll load embedded data that we've prepared previous to this session." 
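The download in the next cell is roughly 700 MB, so if you re-run the notebook you may want to skip it when the archive is already on disk. A small sketch of that guard:

```python
import os
import wget

embeddings_path = "vector_database_wikipedia_articles_embedded.zip"
embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# Only download the ~700 MB archive if it isn't already present locally.
if not os.path.exists(embeddings_path):
    wget.download(embeddings_url)
```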
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "5dff8b55", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "100% [......................................................................] 698933052 / 698933052" + ] + }, + { + "data": { + "text/plain": [ + "'vector_database_wikipedia_articles_embedded.zip'" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n", + "\n", + "# The file is ~700 MB so this will take some time\n", + "wget.download(embeddings_url)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "21097972", + "metadata": {}, + "outputs": [], + "source": [ + "import zipfile\n", + "with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\",\"r\") as zip_ref:\n", + " zip_ref.extractall(\"../data\")" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "70bbd8ba", + "metadata": {}, + "outputs": [], + "source": [ + "article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "1721e45d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idurltitletexttitle_vectorcontent_vectorvector_id
01https://simple.wikipedia.org/wiki/AprilAprilApril is the fourth month of the year in the J...[0.001009464613161981, -0.020700545981526375, ...[-0.011253940872848034, -0.013491976074874401,...0
12https://simple.wikipedia.org/wiki/AugustAugustAugust (Aug.) is the eighth month of the year ...[0.0009286514250561595, 0.000820168002974242, ...[0.0003609954728744924, 0.007262262050062418, ...1
26https://simple.wikipedia.org/wiki/ArtArtArt is a creative activity that expresses imag...[0.003393713850528002, 0.0061537534929811954, ...[-0.004959689453244209, 0.015772193670272827, ...2
38https://simple.wikipedia.org/wiki/AAA or a is the first letter of the English alph...[0.0153952119871974, -0.013759135268628597, 0....[0.024894846603274345, -0.022186409682035446, ...3
49https://simple.wikipedia.org/wiki/AirAirAir refers to the Earth's atmosphere. Air is a...[0.02224554680287838, -0.02044147066771984, -0...[0.021524671465158463, 0.018522677943110466, -...4
\n", + "
" + ], + "text/plain": [ + " id url title \\\n", + "0 1 https://simple.wikipedia.org/wiki/April April \n", + "1 2 https://simple.wikipedia.org/wiki/August August \n", + "2 6 https://simple.wikipedia.org/wiki/Art Art \n", + "3 8 https://simple.wikipedia.org/wiki/A A \n", + "4 9 https://simple.wikipedia.org/wiki/Air Air \n", + "\n", + " text \\\n", + "0 April is the fourth month of the year in the J... \n", + "1 August (Aug.) is the eighth month of the year ... \n", + "2 Art is a creative activity that expresses imag... \n", + "3 A or a is the first letter of the English alph... \n", + "4 Air refers to the Earth's atmosphere. Air is a... \n", + "\n", + " title_vector \\\n", + "0 [0.001009464613161981, -0.020700545981526375, ... \n", + "1 [0.0009286514250561595, 0.000820168002974242, ... \n", + "2 [0.003393713850528002, 0.0061537534929811954, ... \n", + "3 [0.0153952119871974, -0.013759135268628597, 0.... \n", + "4 [0.02224554680287838, -0.02044147066771984, -0... \n", + "\n", + " content_vector vector_id \n", + "0 [-0.011253940872848034, -0.013491976074874401,... 0 \n", + "1 [0.0003609954728744924, 0.007262262050062418, ... 1 \n", + "2 [-0.004959689453244209, 0.015772193670272827, ... 2 \n", + "3 [0.024894846603274345, -0.022186409682035446, ... 3 \n", + "4 [0.021524671465158463, 0.018522677943110466, -... 4 " + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "article_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "960b82af", + "metadata": {}, + "outputs": [], + "source": [ + "# Read vectors from strings back into a list\n", + "article_df['title_vector'] = article_df.title_vector.apply(literal_eval)\n", + "article_df['content_vector'] = article_df.content_vector.apply(literal_eval)\n", + "\n", + "# Set vector_id to be a string\n", + "article_df['vector_id'] = article_df['vector_id'].apply(str)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "a334ab8b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 25000 entries, 0 to 24999\n", + "Data columns (total 7 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 id 25000 non-null int64 \n", + " 1 url 25000 non-null object\n", + " 2 title 25000 non-null object\n", + " 3 text 25000 non-null object\n", + " 4 title_vector 25000 non-null object\n", + " 5 content_vector 25000 non-null object\n", + " 6 vector_id 25000 non-null object\n", + "dtypes: int64(1), object(6)\n", + "memory usage: 1.3+ MB\n" + ] + } + ], + "source": [ + "article_df.info(show_counts=True)" + ] + }, + { + "cell_type": "markdown", + "id": "81bf5349", + "metadata": {}, + "source": [ + "# Chroma\n", + "\n", + "We'll index these embedded documents in a vector database and search them. The first option we'll look at is **Chroma**, an easy to use open-source self-hosted in-memory vector database, designed for working with embeddings together with LLMs. \n", + "\n", + "In this section, we will:\n", + "- Instantiate the Chroma client\n", + "- Create collections for each class of embedding \n", + "- Query each collection " + ] + }, + { + "cell_type": "markdown", + "id": "37d1f693", + "metadata": {}, + "source": [ + "### Instantiate the Chroma client\n", + "\n", + "Create the Chroma client. By default, Chroma is ephemeral and runs in memory. \n", + "However, you can easily set up a persistent configuration which writes to disk." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "159d9646", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "\n", + "chroma_client = chromadb.Client() # Ephemeral. Comment out for the persistent version.\n", + "\n", + "# Uncomment the following for the persistent version. \n", + "# import chromadb.config.Settings\n", + "# persist_directory = 'chroma_persistence' # Directory to store persisted Chroma data. \n", + "# client = chromadb.Client(\n", + "# Settings(\n", + "# persist_directory=persist_directory,\n", + "# chroma_db_impl=\"duckdb+parquet\",\n", + "# )\n", + "# )" + ] + }, + { + "cell_type": "markdown", + "id": "5cd61943", + "metadata": {}, + "source": [ + "### Create collections\n", + "\n", + "Chroma collections allow you to store and filter with arbitrary metadata, making it easy to query subsets of the embedded data. \n", + "\n", + "Chroma is already integrated with OpenAI's embedding functions. The best way to use them is on construction of a collection, as follows.\n", + "Alternatively, you can 'bring your own embeddings'. More information can be found [here](https://docs.trychroma.com/embeddings)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "ad2d1bce", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "OPENAI_API_KEY is ready\n" + ] + } + ], + "source": [ + "from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction\n", + "\n", + "# Test that your OpenAI API key is correctly set as an environment variable\n", + "# Note. if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.\n", + "\n", + "# Note. alternatively you can set a temporary env variable like this:\n", + "# os.environ[\"OPENAI_API_KEY\"] = 'sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'\n", + "\n", + "if os.getenv(\"OPENAI_API_KEY\") is not None:\n", + " openai.api_key = os.getenv(\"OPENAI_API_KEY\")\n", + " print (\"OPENAI_API_KEY is ready\")\n", + "else:\n", + " print (\"OPENAI_API_KEY environment variable not found\")\n", + "\n", + "\n", + "embedding_function = OpenAIEmbeddingFunction(api_key=os.environ.get('OPENAI_API_KEY'), model_name=EMBEDDING_MODEL)\n", + "\n", + "wikipedia_content_collection = chroma_client.create_collection(name='wikipedia_content', embedding_function=embedding_function)\n", + "wikipedia_title_collection = chroma_client.create_collection(name='wikipedia_titles', embedding_function=embedding_function)" + ] + }, + { + "cell_type": "markdown", + "id": "02887b52", + "metadata": {}, + "source": [ + "### Populate the collections\n", + "\n", + "Chroma collections allow you to populate, and filter on, whatever metadata you like. Chroma can also store the text alongside the vectors, and return everything in a single `query` call, when this is more convenient. \n", + "\n", + "For this use-case, we'll just store the embeddings and IDs, and use these to index the original dataframe. 
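If you did want Chroma to return the article text and some metadata directly from `query`, rather than joining back to the dataframe, the `add` call also accepts `documents` and `metadatas`. A sketch for reference only; it is not used in the cells below, and the column names match this dataset:

```python
# For reference only (not run below): store text and metadata alongside the vectors.
wikipedia_content_collection.add(
    ids=article_df.vector_id.tolist(),
    embeddings=article_df.content_vector.tolist(),
    documents=article_df.text.tolist(),
    metadatas=[{"title": t, "url": u} for t, u in zip(article_df.title, article_df.url)],
)
```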
" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "84885fec", + "metadata": {}, + "outputs": [], + "source": [ + "# Add the content vectors\n", + "wikipedia_content_collection.add(\n", + " ids=article_df.vector_id.tolist(),\n", + " embeddings=article_df.content_vector.tolist(),\n", + ")\n", + "\n", + "# Add the title vectors\n", + "wikipedia_title_collection.add(\n", + " ids=article_df.vector_id.tolist(),\n", + " embeddings=article_df.title_vector.tolist(),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "79122c6b", + "metadata": {}, + "source": [ + "### Search the collections\n", + "\n", + "Chroma handles embedding queries for you if an embedding function is set, like in this example." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "273b8b4c", + "metadata": {}, + "outputs": [], + "source": [ + "def query_collection(collection, query, max_results, dataframe):\n", + " results = collection.query(query_texts=query, n_results=max_results, include=['distances']) \n", + " df = pd.DataFrame({\n", + " 'id':results['ids'][0], \n", + " 'score':results['distances'][0],\n", + " 'title': dataframe[dataframe.vector_id.isin(results['ids'][0])]['title'],\n", + " 'content': dataframe[dataframe.vector_id.isin(results['ids'][0])]['text'],\n", + " })\n", + " \n", + " return df" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "e84cf47f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idscoretitlecontent
116122490.265118EuropeEurope is the western part of the continent of...
1332122480.290684EuropeanEuropean may mean:\\nA person or attribute of t...
2885122250.314833ScandinaviaScandinavia is a group of countries in norther...
1221213320.317179Western civilizationWestern civilization, western culture or the ...
12216122160.321235Eastern EuropeEastern Europe is the eastern region of Europe...
\n", + "
" + ], + "text/plain": [ + " id score title \\\n", + "116 12249 0.265118 Europe \n", + "1332 12248 0.290684 European \n", + "2885 12225 0.314833 Scandinavia \n", + "12212 1332 0.317179 Western civilization \n", + "12216 12216 0.321235 Eastern Europe \n", + "\n", + " content \n", + "116 Europe is the western part of the continent of... \n", + "1332 European may mean:\\nA person or attribute of t... \n", + "2885 Scandinavia is a group of countries in norther... \n", + "12212 Western civilization, western culture or the ... \n", + "12216 Eastern Europe is the eastern region of Europe... " + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "title_query_result = query_collection(\n", + " collection=wikipedia_title_collection,\n", + " query=\"modern art in Europe\",\n", + " max_results=10,\n", + " dataframe=article_df\n", + ")\n", + "title_query_result.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "f4db910a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idscoretitlecontent
2923131350.2613281651\\n\\nEvents \\n January 1 – Charles II crowned K...
3694135710.277058StirlingStirling () is a city in the middle of Scotlan...
624829230.294823841\\n\\nEvents \\n June 25: Battle of Fontenay – Lo...
6297135680.3007561746\\n\\nEvents \\n January 8 – Bonnie Prince Charli...
11702117080.307572William WallaceWilliam Wallace was a Scottish knight who foug...
\n", + "
" + ], + "text/plain": [ + " id score title \\\n", + "2923 13135 0.261328 1651 \n", + "3694 13571 0.277058 Stirling \n", + "6248 2923 0.294823 841 \n", + "6297 13568 0.300756 1746 \n", + "11702 11708 0.307572 William Wallace \n", + "\n", + " content \n", + "2923 \\n\\nEvents \\n January 1 – Charles II crowned K... \n", + "3694 Stirling () is a city in the middle of Scotlan... \n", + "6248 \\n\\nEvents \\n June 25: Battle of Fontenay – Lo... \n", + "6297 \\n\\nEvents \\n January 8 – Bonnie Prince Charli... \n", + "11702 William Wallace was a Scottish knight who foug... " + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "content_query_result = query_collection(\n", + " collection=wikipedia_content_collection,\n", + " query=\"Famous battles in Scottish history\",\n", + " max_results=10,\n", + " dataframe=article_df\n", + ")\n", + "content_query_result.head()" + ] + }, + { + "cell_type": "markdown", + "id": "a03e7645", + "metadata": {}, + "source": [ + "Now that you've got a basic embeddings search running, you can [hop over to the Chroma docs](https://docs.trychroma.com/usage-guide#using-where-filters) to learn more about how to add filters to your query, update/delete data in your collections, and deploy Chroma." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0119d87a", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "vector_db_split", + "language": "python", + "name": "vector_db_split" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.11" + }, + "vscode": { + "interpreter": { + "hash": "fd16a328ca3d68029457069b79cb0b38eb39a0f5ccc4fe4473d3047707df8207" + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/vector_databases/myscale/Using_MyScale_for_embeddings_search.ipynb b/examples/vector_databases/myscale/Using_MyScale_for_embeddings_search.ipynb new file mode 100644 index 00000000..24d1fadf --- /dev/null +++ b/examples/vector_databases/myscale/Using_MyScale_for_embeddings_search.ipynb @@ -0,0 +1,531 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "cb1537e6", + "metadata": {}, + "source": [ + "# Using MyScale for Embeddings Search\n", + "\n", + "This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.\n", + "\n", + "### What is a Vector Database\n", + "\n", + "A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. 
Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.\n", + "\n", + "### Why use a Vector Database\n", + "\n", + "Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question and answering, chatbot and recommendation services, for example), and make use of them in a secure, scalable environment. Many of our customers make embeddings solve their problems at small scale but performance and security hold them back from going into production - we see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.\n", + "\n", + "\n", + "### Demo Flow\n", + "The demo flow is:\n", + "- **Setup**: Import packages and set any required variables\n", + "- **Load data**: Load a dataset and embed it using OpenAI embeddings\n", + "- **MyScale**\n", + " - *Setup*: Set up the MyScale Python client. For more details go [here](https://docs.myscale.com/en/python-client/)\n", + " - *Index Data*: We'll create a table and index it for __content__.\n", + " - *Search Data*: Run a few example queries with various goals in mind.\n", + "\n", + "Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings." + ] + }, + { + "cell_type": "markdown", + "id": "e2b59250", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "Import the required libraries and set the embedding model that we'd like to use." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8d8810f9", + "metadata": {}, + "outputs": [], + "source": [ + "# We'll need to install the MyScale client\n", + "!pip install clickhouse-connect\n", + "\n", + "#Install wget to pull zip file\n", + "!pip install wget" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "5be94df6", + "metadata": {}, + "outputs": [], + "source": [ + "import openai\n", + "\n", + "from typing import List, Iterator\n", + "import pandas as pd\n", + "import numpy as np\n", + "import os\n", + "import wget\n", + "from ast import literal_eval\n", + "\n", + "# MyScale's client library for Python\n", + "import clickhouse_connect\n", + "\n", + "# I've set this to our new embeddings model, this can be changed to the embedding model of your choice\n", + "EMBEDDING_MODEL = \"text-embedding-ada-002\"\n", + "\n", + "# Ignore unclosed SSL socket warnings - optional in case you get these errors\n", + "import warnings\n", + "\n", + "warnings.filterwarnings(action=\"ignore\", message=\"unclosed\", category=ResourceWarning)\n", + "warnings.filterwarnings(\"ignore\", category=DeprecationWarning) " + ] + }, + { + "cell_type": "markdown", + "id": "e5d9d2e1", + "metadata": {}, + "source": [ + "## Load data\n", + "\n", + "In this section we'll load embedded data that we've prepared previous to this session." 
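The vector columns in the CSV are stored as strings, so a later cell parses them with `literal_eval`. As a small alternative, a sketch that does the parsing while reading the file in a single pass:

```python
from ast import literal_eval
import pandas as pd

# Parse the vector columns while reading the CSV, instead of applying literal_eval afterwards.
article_df = pd.read_csv(
    '../data/vector_database_wikipedia_articles_embedded.csv',
    converters={'title_vector': literal_eval, 'content_vector': literal_eval},
)
```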
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5dff8b55", + "metadata": {}, + "outputs": [], + "source": [ + "embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n", + "\n", + "# The file is ~700 MB so this will take some time\n", + "wget.download(embeddings_url)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21097972", + "metadata": {}, + "outputs": [], + "source": [ + "import zipfile\n", + "with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\",\"r\") as zip_ref:\n", + " zip_ref.extractall(\"../data\")" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "70bbd8ba", + "metadata": {}, + "outputs": [], + "source": [ + "article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "1721e45d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idurltitletexttitle_vectorcontent_vectorvector_id
01https://simple.wikipedia.org/wiki/AprilAprilApril is the fourth month of the year in the J...[0.001009464613161981, -0.020700545981526375, ...[-0.011253940872848034, -0.013491976074874401,...0
12https://simple.wikipedia.org/wiki/AugustAugustAugust (Aug.) is the eighth month of the year ...[0.0009286514250561595, 0.000820168002974242, ...[0.0003609954728744924, 0.007262262050062418, ...1
26https://simple.wikipedia.org/wiki/ArtArtArt is a creative activity that expresses imag...[0.003393713850528002, 0.0061537534929811954, ...[-0.004959689453244209, 0.015772193670272827, ...2
38https://simple.wikipedia.org/wiki/AAA or a is the first letter of the English alph...[0.0153952119871974, -0.013759135268628597, 0....[0.024894846603274345, -0.022186409682035446, ...3
49https://simple.wikipedia.org/wiki/AirAirAir refers to the Earth's atmosphere. Air is a...[0.02224554680287838, -0.02044147066771984, -0...[0.021524671465158463, 0.018522677943110466, -...4
\n", + "
" + ], + "text/plain": [ + " id url title \\\n", + "0 1 https://simple.wikipedia.org/wiki/April April \n", + "1 2 https://simple.wikipedia.org/wiki/August August \n", + "2 6 https://simple.wikipedia.org/wiki/Art Art \n", + "3 8 https://simple.wikipedia.org/wiki/A A \n", + "4 9 https://simple.wikipedia.org/wiki/Air Air \n", + "\n", + " text \\\n", + "0 April is the fourth month of the year in the J... \n", + "1 August (Aug.) is the eighth month of the year ... \n", + "2 Art is a creative activity that expresses imag... \n", + "3 A or a is the first letter of the English alph... \n", + "4 Air refers to the Earth's atmosphere. Air is a... \n", + "\n", + " title_vector \\\n", + "0 [0.001009464613161981, -0.020700545981526375, ... \n", + "1 [0.0009286514250561595, 0.000820168002974242, ... \n", + "2 [0.003393713850528002, 0.0061537534929811954, ... \n", + "3 [0.0153952119871974, -0.013759135268628597, 0.... \n", + "4 [0.02224554680287838, -0.02044147066771984, -0... \n", + "\n", + " content_vector vector_id \n", + "0 [-0.011253940872848034, -0.013491976074874401,... 0 \n", + "1 [0.0003609954728744924, 0.007262262050062418, ... 1 \n", + "2 [-0.004959689453244209, 0.015772193670272827, ... 2 \n", + "3 [0.024894846603274345, -0.022186409682035446, ... 3 \n", + "4 [0.021524671465158463, 0.018522677943110466, -... 4 " + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "article_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "960b82af", + "metadata": {}, + "outputs": [], + "source": [ + "# Read vectors from strings back into a list\n", + "article_df['title_vector'] = article_df.title_vector.apply(literal_eval)\n", + "article_df['content_vector'] = article_df.content_vector.apply(literal_eval)\n", + "\n", + "# Set vector_id to be a string\n", + "article_df['vector_id'] = article_df['vector_id'].apply(str)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "a334ab8b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 25000 entries, 0 to 24999\n", + "Data columns (total 7 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 id 25000 non-null int64 \n", + " 1 url 25000 non-null object\n", + " 2 title 25000 non-null object\n", + " 3 text 25000 non-null object\n", + " 4 title_vector 25000 non-null object\n", + " 5 content_vector 25000 non-null object\n", + " 6 vector_id 25000 non-null object\n", + "dtypes: int64(1), object(6)\n", + "memory usage: 1.3+ MB\n" + ] + } + ], + "source": [ + "article_df.info(show_counts=True)" + ] + }, + { + "cell_type": "markdown", + "id": "56a02772", + "metadata": {}, + "source": [ + "## MyScale\n", + "The next vector database we'll consider is [MyScale](https://myscale.com).\n", + "\n", + "[MyScale](https://myscale.com) is a database built on Clickhouse that combines vector search and SQL analytics to offer a high-performance, streamlined, and fully managed experience. It's designed to facilitate joint queries and analyses on both structured and vector data, with comprehensive SQL support for all data processing.\n", + "\n", + "Deploy and execute vector search with SQL on your cluster within two minutes by using [MyScale Console](https://console.myscale.com)." 
+ ] + }, + { + "cell_type": "markdown", + "id": "d3e1f96b", + "metadata": {}, + "source": [ + "### Connect to MyScale\n", + "\n", + "Follow the [connections details](https://docs.myscale.com/en/cluster-management/) section to retrieve the cluster host, username, and password information from the MyScale console, and use it to create a connection to your cluster as shown below:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "024243cf", + "metadata": {}, + "outputs": [], + "source": [ + "# initialize client\n", + "client = clickhouse_connect.get_client(host='YOUR_CLUSTER_HOST', port=8443, username='YOUR_USERNAME', password='YOUR_CLUSTER_PASSWORD')" + ] + }, + { + "cell_type": "markdown", + "id": "067009db", + "metadata": {}, + "source": [ + "### Index data\n", + "\n", + "We will create an SQL table called `articles` in MyScale to store the embeddings data. The table will include a vector index with a cosine distance metric and a constraint for the length of the embeddings. Use the following code to create and insert data into the articles table:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "685cba13", + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "378809ac23104dc08c06fa3a53f83666", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + " 0%| | 0/250 [00:00=2.19.0 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from pinecone-client) (2.31.0)\n", + "Requirement already satisfied: pyyaml>=5.4 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from pinecone-client) (6.0)\n", + "Requirement already satisfied: loguru>=0.5.0 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from pinecone-client) (0.7.0)\n", + "Requirement already satisfied: typing-extensions>=3.7.4 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from pinecone-client) (4.5.0)\n", + "Requirement already satisfied: dnspython>=2.0.0 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from pinecone-client) (2.3.0)\n", + "Requirement already satisfied: python-dateutil>=2.5.3 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from pinecone-client) (2.8.2)\n", + "Requirement already satisfied: urllib3>=1.21.1 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from pinecone-client) (1.26.16)\n", + "Requirement already satisfied: tqdm>=4.64.1 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from pinecone-client) (4.65.0)\n", + "Requirement already satisfied: numpy>=1.22.0 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from pinecone-client) (1.25.0)\n", + "Requirement already satisfied: six>=1.5 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from python-dateutil>=2.5.3->pinecone-client) (1.16.0)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from requests>=2.19.0->pinecone-client) (3.1.0)\n", + "Requirement already satisfied: idna<4,>=2.5 in 
/Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from requests>=2.19.0->pinecone-client) (3.4)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from requests>=2.19.0->pinecone-client) (2023.5.7)\n", + "Requirement already satisfied: wget in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (3.2)\n" + ] + } + ], + "source": [ + "# We'll need to install the Pinecone client\n", + "!pip install pinecone-client\n", + "\n", + "#Install wget to pull zip file\n", + "!pip install wget" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "5be94df6", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages/pinecone/index.py:4: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)\n", + " from tqdm.autonotebook import tqdm\n" + ] + } + ], + "source": [ + "import openai\n", + "\n", + "from typing import List, Iterator\n", + "import pandas as pd\n", + "import numpy as np\n", + "import os\n", + "import wget\n", + "from ast import literal_eval\n", + "\n", + "# Pinecone's client library for Python\n", + "import pinecone\n", + "\n", + "# I've set this to our new embeddings model, this can be changed to the embedding model of your choice\n", + "EMBEDDING_MODEL = \"text-embedding-ada-002\"\n", + "\n", + "# Ignore unclosed SSL socket warnings - optional in case you get these errors\n", + "import warnings\n", + "\n", + "warnings.filterwarnings(action=\"ignore\", message=\"unclosed\", category=ResourceWarning)\n", + "warnings.filterwarnings(\"ignore\", category=DeprecationWarning) " + ] + }, + { + "cell_type": "markdown", + "id": "e5d9d2e1", + "metadata": {}, + "source": [ + "## Load data\n", + "\n", + "In this section we'll load embedded data that we've prepared previous to this session." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5dff8b55", + "metadata": {}, + "outputs": [], + "source": [ + "embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n", + "\n", + "# The file is ~700 MB so this will take some time\n", + "wget.download(embeddings_url)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21097972", + "metadata": {}, + "outputs": [], + "source": [ + "import zipfile\n", + "with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\",\"r\") as zip_ref:\n", + " zip_ref.extractall(\"../data\")" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "70bbd8ba", + "metadata": {}, + "outputs": [], + "source": [ + "article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "1721e45d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idurltitletexttitle_vectorcontent_vectorvector_id
01https://simple.wikipedia.org/wiki/AprilAprilApril is the fourth month of the year in the J...[0.001009464613161981, -0.020700545981526375, ...[-0.011253940872848034, -0.013491976074874401,...0
12https://simple.wikipedia.org/wiki/AugustAugustAugust (Aug.) is the eighth month of the year ...[0.0009286514250561595, 0.000820168002974242, ...[0.0003609954728744924, 0.007262262050062418, ...1
26https://simple.wikipedia.org/wiki/ArtArtArt is a creative activity that expresses imag...[0.003393713850528002, 0.0061537534929811954, ...[-0.004959689453244209, 0.015772193670272827, ...2
38https://simple.wikipedia.org/wiki/AAA or a is the first letter of the English alph...[0.0153952119871974, -0.013759135268628597, 0....[0.024894846603274345, -0.022186409682035446, ...3
49https://simple.wikipedia.org/wiki/AirAirAir refers to the Earth's atmosphere. Air is a...[0.02224554680287838, -0.02044147066771984, -0...[0.021524671465158463, 0.018522677943110466, -...4
\n", + "
" + ], + "text/plain": [ + " id url title \\\n", + "0 1 https://simple.wikipedia.org/wiki/April April \n", + "1 2 https://simple.wikipedia.org/wiki/August August \n", + "2 6 https://simple.wikipedia.org/wiki/Art Art \n", + "3 8 https://simple.wikipedia.org/wiki/A A \n", + "4 9 https://simple.wikipedia.org/wiki/Air Air \n", + "\n", + " text \\\n", + "0 April is the fourth month of the year in the J... \n", + "1 August (Aug.) is the eighth month of the year ... \n", + "2 Art is a creative activity that expresses imag... \n", + "3 A or a is the first letter of the English alph... \n", + "4 Air refers to the Earth's atmosphere. Air is a... \n", + "\n", + " title_vector \\\n", + "0 [0.001009464613161981, -0.020700545981526375, ... \n", + "1 [0.0009286514250561595, 0.000820168002974242, ... \n", + "2 [0.003393713850528002, 0.0061537534929811954, ... \n", + "3 [0.0153952119871974, -0.013759135268628597, 0.... \n", + "4 [0.02224554680287838, -0.02044147066771984, -0... \n", + "\n", + " content_vector vector_id \n", + "0 [-0.011253940872848034, -0.013491976074874401,... 0 \n", + "1 [0.0003609954728744924, 0.007262262050062418, ... 1 \n", + "2 [-0.004959689453244209, 0.015772193670272827, ... 2 \n", + "3 [0.024894846603274345, -0.022186409682035446, ... 3 \n", + "4 [0.021524671465158463, 0.018522677943110466, -... 4 " + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "article_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "960b82af", + "metadata": {}, + "outputs": [], + "source": [ + "# Read vectors from strings back into a list\n", + "article_df['title_vector'] = article_df.title_vector.apply(literal_eval)\n", + "article_df['content_vector'] = article_df.content_vector.apply(literal_eval)\n", + "\n", + "# Set vector_id to be a string\n", + "article_df['vector_id'] = article_df['vector_id'].apply(str)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "a334ab8b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 25000 entries, 0 to 24999\n", + "Data columns (total 7 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 id 25000 non-null int64 \n", + " 1 url 25000 non-null object\n", + " 2 title 25000 non-null object\n", + " 3 text 25000 non-null object\n", + " 4 title_vector 25000 non-null object\n", + " 5 content_vector 25000 non-null object\n", + " 6 vector_id 25000 non-null object\n", + "dtypes: int64(1), object(6)\n", + "memory usage: 1.3+ MB\n" + ] + } + ], + "source": [ + "article_df.info(show_counts=True)" + ] + }, + { + "cell_type": "markdown", + "id": "ed32fc87", + "metadata": {}, + "source": [ + "## Pinecone\n", + "\n", + "The next option we'll look at is **Pinecone**, a managed vector database which offers a cloud-native option.\n", + "\n", + "Before you proceed with this step you'll need to navigate to [Pinecone](pinecone.io), sign up and then save your API key as an environment variable titled ```PINECONE_API_KEY```.\n", + "\n", + "For section we will:\n", + "- Create an index with multiple namespaces for article titles and content\n", + "- Store our data in the index with separate searchable \"namespaces\" for article **titles** and **content**\n", + "- Fire some similarity search queries to verify our setup is working" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "92e6152a", + "metadata": {}, + "outputs": [], + "source": [ + "api_key = 
os.getenv(\"PINECONE_API_KEY\")\n", + "pinecone.init(api_key=api_key)" + ] + }, + { + "cell_type": "markdown", + "id": "63b28543", + "metadata": {}, + "source": [ + "### Create Index\n", + "\n", + "First we will need to create an index, which we'll call `wikipedia-articles`. Once we have an index, we can create multiple namespaces, which can make a single index searchable for various use cases. For more details, consult [Pinecone documentation](https://docs.pinecone.io/docs/namespaces#:~:text=Pinecone%20allows%20you%20to%20partition,different%20subsets%20of%20your%20index.).\n", + "\n", + "If you want to batch insert to your index in parallel to increase insertion speed then there is a great guide in the Pinecone documentation on [batch inserts in parallel](https://docs.pinecone.io/docs/insert-data#sending-upserts-in-parallel)." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "0a71c575", + "metadata": {}, + "outputs": [], + "source": [ + "# Models a simple batch generator that make chunks out of an input DataFrame\n", + "class BatchGenerator:\n", + " \n", + " \n", + " def __init__(self, batch_size: int = 10) -> None:\n", + " self.batch_size = batch_size\n", + " \n", + " # Makes chunks out of an input DataFrame\n", + " def to_batches(self, df: pd.DataFrame) -> Iterator[pd.DataFrame]:\n", + " splits = self.splits_num(df.shape[0])\n", + " if splits <= 1:\n", + " yield df\n", + " else:\n", + " for chunk in np.array_split(df, splits):\n", + " yield chunk\n", + "\n", + " # Determines how many chunks DataFrame contains\n", + " def splits_num(self, elements: int) -> int:\n", + " return round(elements / self.batch_size)\n", + " \n", + " __call__ = to_batches\n", + "\n", + "df_batcher = BatchGenerator(300)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "7ea9ad46", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['podcasts', 'wikipedia-articles']" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Pick a name for the new index\n", + "index_name = 'wikipedia-articles'\n", + "\n", + "# Check whether the index with the same name already exists - if so, delete it\n", + "if index_name in pinecone.list_indexes():\n", + " pinecone.delete_index(index_name)\n", + " \n", + "# Creates new index\n", + "pinecone.create_index(name=index_name, dimension=len(article_df['content_vector'][0]))\n", + "index = pinecone.Index(index_name=index_name)\n", + "\n", + "# Confirm our index was created\n", + "pinecone.list_indexes()" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "5daeba00", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Uploading vectors to content namespace..\n" + ] + } + ], + "source": [ + "# Upsert content vectors in content namespace - this can take a few minutes\n", + "print(\"Uploading vectors to content namespace..\")\n", + "for batch_df in df_batcher(article_df):\n", + " index.upsert(vectors=zip(batch_df.vector_id, batch_df.content_vector), namespace='content')" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "5fc1b083", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Uploading vectors to title namespace..\n" + ] + } + ], + "source": [ + "# Upsert title vectors in title namespace - this can also take a few minutes\n", + "print(\"Uploading vectors to title namespace..\")\n", + "for batch_df in df_batcher(article_df):\n", + " 
index.upsert(vectors=zip(batch_df.vector_id, batch_df.title_vector), namespace='title')" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "f90c7fba", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'dimension': 1536,\n", + " 'index_fullness': 0.1,\n", + " 'namespaces': {'content': {'vector_count': 25000},\n", + " 'title': {'vector_count': 25000}},\n", + " 'total_vector_count': 50000}" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Check index size for each namespace to confirm all of our docs have loaded\n", + "index.describe_index_stats()" + ] + }, + { + "cell_type": "markdown", + "id": "2da40a69", + "metadata": {}, + "source": [ + "### Search data\n", + "\n", + "Now we'll enter some dummy searches and check we get decent results back" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "c8280363", + "metadata": {}, + "outputs": [], + "source": [ + "# First we'll create dictionaries mapping vector IDs to their outputs so we can retrieve the text for our search results\n", + "titles_mapped = dict(zip(article_df.vector_id,article_df.title))\n", + "content_mapped = dict(zip(article_df.vector_id,article_df.text))" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "3c8c2aa1", + "metadata": {}, + "outputs": [], + "source": [ + "def query_article(query, namespace, top_k=5):\n", + " '''Queries an article using its title in the specified\n", + " namespace and prints results.'''\n", + "\n", + " # Create vector embeddings based on the title column\n", + " embedded_query = openai.Embedding.create(\n", + " input=query,\n", + " model=EMBEDDING_MODEL,\n", + " )[\"data\"][0]['embedding']\n", + "\n", + " # Query namespace passed as parameter using title vector\n", + " query_result = index.query(embedded_query, \n", + " namespace=namespace, \n", + " top_k=top_k)\n", + "\n", + " # Print query results \n", + " print(f'\\nMost similar results to {query} in \"{namespace}\" namespace:\\n')\n", + " if not query_result.matches:\n", + " print('no query result')\n", + " \n", + " matches = query_result.matches\n", + " ids = [res.id for res in matches]\n", + " scores = [res.score for res in matches]\n", + " df = pd.DataFrame({'id':ids, \n", + " 'score':scores,\n", + " 'title': [titles_mapped[_id] for _id in ids],\n", + " 'content': [content_mapped[_id] for _id in ids],\n", + " })\n", + " \n", + " counter = 0\n", + " for k,v in df.iterrows():\n", + " counter += 1\n", + " print(f'{v.title} (score = {v.score})')\n", + " \n", + " print('\\n')\n", + "\n", + " return df" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "3402b1b1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Most similar results to modern art in Europe in \"title\" namespace:\n", + "\n", + "Museum of Modern Art (score = 0.875177085)\n", + "Western Europe (score = 0.867441177)\n", + "Renaissance art (score = 0.864156306)\n", + "Pop art (score = 0.860346854)\n", + "Northern Europe (score = 0.854658186)\n", + "\n", + "\n" + ] + } + ], + "source": [ + "query_output = query_article('modern art in Europe','title')" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "64a3f90a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Most similar results to Famous battles in Scottish history in \"content\" namespace:\n", + "\n", + "Battle of Bannockburn (score = 
0.869336188)\n", + "Wars of Scottish Independence (score = 0.861470938)\n", + "1651 (score = 0.852588475)\n", + "First War of Scottish Independence (score = 0.84962213)\n", + "Robert I of Scotland (score = 0.846214116)\n", + "\n", + "\n" + ] + } + ], + "source": [ + "content_query_output = query_article(\"Famous battles in Scottish history\",'content')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0119d87a", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "vector_db_split", + "language": "python", + "name": "vector_db_split" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.11" + }, + "vscode": { + "interpreter": { + "hash": "fd16a328ca3d68029457069b79cb0b38eb39a0f5ccc4fe4473d3047707df8207" + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/vector_databases/qdrant/Using_Qdrant_for_embeddings_search.ipynb b/examples/vector_databases/qdrant/Using_Qdrant_for_embeddings_search.ipynb new file mode 100644 index 00000000..8193963e --- /dev/null +++ b/examples/vector_databases/qdrant/Using_Qdrant_for_embeddings_search.ipynb @@ -0,0 +1,673 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "cb1537e6", + "metadata": {}, + "source": [ + "# Using Qdrant for Embeddings Search\n", + "\n", + "This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.\n", + "\n", + "### What is a Vector Database\n", + "\n", + "A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.\n", + "\n", + "### Why use a Vector Database\n", + "\n", + "Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question and answering, chatbot and recommendation services, for example), and make use of them in a secure, scalable environment. Many of our customers make embeddings solve their problems at small scale but performance and security hold them back from going into production - we see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.\n", + "\n", + "\n", + "### Demo Flow\n", + "The demo flow is:\n", + "- **Setup**: Import packages and set any required variables\n", + "- **Load data**: Load a dataset and embed it using OpenAI embeddings\n", + "- **Qdrant**\n", + " - *Setup*: Here we'll set up the Python client for Qdrant. 
For more details go [here](https://github.com/qdrant/qdrant_client)\n", + " - *Index Data*: We'll create a collection with vectors for __titles__ and __content__\n", + " - *Search Data*: We'll run a few searches to confirm it works\n", + "\n", + "Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings." + ] + }, + { + "cell_type": "markdown", + "id": "e2b59250", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "Import the required libraries and set the embedding model that we'd like to use." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8d8810f9", + "metadata": {}, + "outputs": [], + "source": [ + "# We'll need to install Qdrant client\n", + "!pip install qdrant-client\n", + "\n", + "#Install wget to pull zip file\n", + "!pip install wget" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "5be94df6", + "metadata": {}, + "outputs": [], + "source": [ + "import openai\n", + "\n", + "from typing import List, Iterator\n", + "import pandas as pd\n", + "import numpy as np\n", + "import os\n", + "import wget\n", + "from ast import literal_eval\n", + "\n", + "# Qdrant's client library for Python\n", + "import qdrant_client\n", + "\n", + "# I've set this to our new embeddings model, this can be changed to the embedding model of your choice\n", + "EMBEDDING_MODEL = \"text-embedding-ada-002\"\n", + "\n", + "# Ignore unclosed SSL socket warnings - optional in case you get these errors\n", + "import warnings\n", + "\n", + "warnings.filterwarnings(action=\"ignore\", message=\"unclosed\", category=ResourceWarning)\n", + "warnings.filterwarnings(\"ignore\", category=DeprecationWarning) " + ] + }, + { + "cell_type": "markdown", + "id": "e5d9d2e1", + "metadata": {}, + "source": [ + "## Load data\n", + "\n", + "In this section we'll load embedded data that we've prepared previous to this session." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5dff8b55", + "metadata": {}, + "outputs": [], + "source": [ + "embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n", + "\n", + "# The file is ~700 MB so this will take some time\n", + "wget.download(embeddings_url)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21097972", + "metadata": {}, + "outputs": [], + "source": [ + "import zipfile\n", + "with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\",\"r\") as zip_ref:\n", + " zip_ref.extractall(\"../data\")" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "70bbd8ba", + "metadata": {}, + "outputs": [], + "source": [ + "article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "1721e45d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idurltitletexttitle_vectorcontent_vectorvector_id
01https://simple.wikipedia.org/wiki/AprilAprilApril is the fourth month of the year in the J...[0.001009464613161981, -0.020700545981526375, ...[-0.011253940872848034, -0.013491976074874401,...0
12https://simple.wikipedia.org/wiki/AugustAugustAugust (Aug.) is the eighth month of the year ...[0.0009286514250561595, 0.000820168002974242, ...[0.0003609954728744924, 0.007262262050062418, ...1
26https://simple.wikipedia.org/wiki/ArtArtArt is a creative activity that expresses imag...[0.003393713850528002, 0.0061537534929811954, ...[-0.004959689453244209, 0.015772193670272827, ...2
38https://simple.wikipedia.org/wiki/AAA or a is the first letter of the English alph...[0.0153952119871974, -0.013759135268628597, 0....[0.024894846603274345, -0.022186409682035446, ...3
49https://simple.wikipedia.org/wiki/AirAirAir refers to the Earth's atmosphere. Air is a...[0.02224554680287838, -0.02044147066771984, -0...[0.021524671465158463, 0.018522677943110466, -...4
\n", + "
" + ], + "text/plain": [ + " id url title \\\n", + "0 1 https://simple.wikipedia.org/wiki/April April \n", + "1 2 https://simple.wikipedia.org/wiki/August August \n", + "2 6 https://simple.wikipedia.org/wiki/Art Art \n", + "3 8 https://simple.wikipedia.org/wiki/A A \n", + "4 9 https://simple.wikipedia.org/wiki/Air Air \n", + "\n", + " text \\\n", + "0 April is the fourth month of the year in the J... \n", + "1 August (Aug.) is the eighth month of the year ... \n", + "2 Art is a creative activity that expresses imag... \n", + "3 A or a is the first letter of the English alph... \n", + "4 Air refers to the Earth's atmosphere. Air is a... \n", + "\n", + " title_vector \\\n", + "0 [0.001009464613161981, -0.020700545981526375, ... \n", + "1 [0.0009286514250561595, 0.000820168002974242, ... \n", + "2 [0.003393713850528002, 0.0061537534929811954, ... \n", + "3 [0.0153952119871974, -0.013759135268628597, 0.... \n", + "4 [0.02224554680287838, -0.02044147066771984, -0... \n", + "\n", + " content_vector vector_id \n", + "0 [-0.011253940872848034, -0.013491976074874401,... 0 \n", + "1 [0.0003609954728744924, 0.007262262050062418, ... 1 \n", + "2 [-0.004959689453244209, 0.015772193670272827, ... 2 \n", + "3 [0.024894846603274345, -0.022186409682035446, ... 3 \n", + "4 [0.021524671465158463, 0.018522677943110466, -... 4 " + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "article_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "960b82af", + "metadata": {}, + "outputs": [], + "source": [ + "# Read vectors from strings back into a list\n", + "article_df['title_vector'] = article_df.title_vector.apply(literal_eval)\n", + "article_df['content_vector'] = article_df.content_vector.apply(literal_eval)\n", + "\n", + "# Set vector_id to be a string\n", + "article_df['vector_id'] = article_df['vector_id'].apply(str)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "a334ab8b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 25000 entries, 0 to 24999\n", + "Data columns (total 7 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 id 25000 non-null int64 \n", + " 1 url 25000 non-null object\n", + " 2 title 25000 non-null object\n", + " 3 text 25000 non-null object\n", + " 4 title_vector 25000 non-null object\n", + " 5 content_vector 25000 non-null object\n", + " 6 vector_id 25000 non-null object\n", + "dtypes: int64(1), object(6)\n", + "memory usage: 1.3+ MB\n" + ] + } + ], + "source": [ + "article_df.info(show_counts=True)" + ] + }, + { + "cell_type": "markdown", + "id": "9cfaed9d", + "metadata": {}, + "source": [ + "## Qdrant\n", + "\n", + "The last vector database we'll consider is **[Qdrant](https://qdrant.tech/)**. This is a high-performant vector search database written in Rust. It offers both on-premise and cloud version, but for the purposes of that example we're going to use the local deployment mode.\n", + "\n", + "Setting everything up will require:\n", + "- Spinning up a local instance of Qdrant\n", + "- Configuring the collection and storing the data in it\n", + "- Trying out with some queries" + ] + }, + { + "cell_type": "markdown", + "id": "38774565", + "metadata": {}, + "source": [ + "### Setup\n", + "\n", + "For the local deployment, we are going to use Docker, according to the Qdrant documentation: https://qdrant.tech/documentation/quick_start/. 
Qdrant requires just a single container, but an example of the docker-compose.yaml file is available at `./qdrant/docker-compose.yaml` in this repo.\n", + "\n", + "You can start a Qdrant instance locally by navigating to this directory and running `docker-compose up -d`" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "76d697e9", + "metadata": { + "ExecuteTime": { + "end_time": "2023-01-18T09:28:38.928205Z", + "start_time": "2023-01-18T09:28:38.913987Z" + } + }, + "outputs": [], + "source": [ + "qdrant = qdrant_client.QdrantClient(host='localhost', prefer_grpc=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "1deeb539", + "metadata": { + "ExecuteTime": { + "end_time": "2023-01-18T09:29:19.806639Z", + "start_time": "2023-01-18T09:29:19.727897Z" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "CollectionsResponse(collections=[CollectionDescription(name='Routines')])" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "qdrant.get_collections()" + ] + }, + { + "cell_type": "markdown", + "id": "bc006b6f", + "metadata": {}, + "source": [ + "### Index data\n", + "\n", + "Qdrant stores data in __collections__, where each object is described by at least one vector and may contain additional metadata called a __payload__. Our collection will be called **Articles** and each object will be described by both **title** and **content** vectors.\n", + "\n", + "We'll be using the official [qdrant-client](https://github.com/qdrant/qdrant_client) package, which has all the utility methods already built in." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "1a84ee1d", + "metadata": { + "ExecuteTime": { + "end_time": "2023-01-18T09:29:22.530121Z", + "start_time": "2023-01-18T09:29:22.524604Z" + } + }, + "outputs": [], + "source": [ + "from qdrant_client.http import models as rest" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "00876f92", + "metadata": { + "ExecuteTime": { + "end_time": "2023-01-18T09:31:14.413334Z", + "start_time": "2023-01-18T09:31:13.619079Z" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "vector_size = len(article_df['content_vector'][0])\n", + "\n", + "qdrant.recreate_collection(\n", + " collection_name='Articles',\n", + " vectors_config={\n", + " 'title': rest.VectorParams(\n", + " distance=rest.Distance.COSINE,\n", + " size=vector_size,\n", + " ),\n", + " 'content': rest.VectorParams(\n", + " distance=rest.Distance.COSINE,\n", + " size=vector_size,\n", + " ),\n", + " }\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "f24e76ab", + "metadata": { + "ExecuteTime": { + "end_time": "2023-01-18T09:36:28.597535Z", + "start_time": "2023-01-18T09:36:24.108867Z" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "UpdateResult(operation_id=0, status=)" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "qdrant.upsert(\n", + " collection_name='Articles',\n", + " points=[\n", + " rest.PointStruct(\n", + " id=k,\n", + " vector={\n", + " 'title': v['title_vector'],\n", + " 'content': v['content_vector'],\n", + " },\n", + " payload=v.to_dict(),\n", + " )\n", + " for k, v in article_df.iterrows()\n", + " ],\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "d1188a12", + "metadata": { + 
"ExecuteTime": { + "end_time": "2023-01-18T09:58:13.825886Z", + "start_time": "2023-01-18T09:58:13.816248Z" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "CountResult(count=25000)" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Check the collection size to make sure all the points have been stored\n", + "qdrant.count(collection_name='Articles')" + ] + }, + { + "cell_type": "markdown", + "id": "06ed119b", + "metadata": {}, + "source": [ + "### Search Data\n", + "\n", + "Once the data is put into Qdrant we will start querying the collection for the closest vectors. We may provide an additional parameter `vector_name` to switch from title to content based search." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "f1bac4ef", + "metadata": { + "ExecuteTime": { + "end_time": "2023-01-18T09:50:35.265647Z", + "start_time": "2023-01-18T09:50:35.256065Z" + } + }, + "outputs": [], + "source": [ + "def query_qdrant(query, collection_name, vector_name='title', top_k=20):\n", + "\n", + " # Creates embedding vector from user query\n", + " embedded_query = openai.Embedding.create(\n", + " input=query,\n", + " model=EMBEDDING_MODEL,\n", + " )['data'][0]['embedding']\n", + " \n", + " query_results = qdrant.search(\n", + " collection_name=collection_name,\n", + " query_vector=(\n", + " vector_name, embedded_query\n", + " ),\n", + " limit=top_k,\n", + " )\n", + " \n", + " return query_results" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "aa92f3d3", + "metadata": { + "ExecuteTime": { + "end_time": "2023-01-18T09:50:46.545145Z", + "start_time": "2023-01-18T09:50:35.711020Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1. Museum of Modern Art (Score: 0.875)\n", + "2. Western Europe (Score: 0.867)\n", + "3. Renaissance art (Score: 0.864)\n", + "4. Pop art (Score: 0.86)\n", + "5. Northern Europe (Score: 0.855)\n", + "6. Hellenistic art (Score: 0.853)\n", + "7. Modernist literature (Score: 0.847)\n", + "8. Art film (Score: 0.843)\n", + "9. Central Europe (Score: 0.843)\n", + "10. European (Score: 0.841)\n", + "11. Art (Score: 0.841)\n", + "12. Byzantine art (Score: 0.841)\n", + "13. Postmodernism (Score: 0.84)\n", + "14. Eastern Europe (Score: 0.839)\n", + "15. Europe (Score: 0.839)\n", + "16. Cubism (Score: 0.839)\n", + "17. Impressionism (Score: 0.838)\n", + "18. Bauhaus (Score: 0.838)\n", + "19. Expressionism (Score: 0.837)\n", + "20. Surrealism (Score: 0.837)\n" + ] + } + ], + "source": [ + "query_results = query_qdrant('modern art in Europe', 'Articles')\n", + "for i, article in enumerate(query_results):\n", + " print(f'{i + 1}. {article.payload[\"title\"]} (Score: {round(article.score, 3)})')" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "7ed116b8", + "metadata": { + "ExecuteTime": { + "end_time": "2023-01-18T09:53:11.038910Z", + "start_time": "2023-01-18T09:52:55.248029Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1. Battle of Bannockburn (Score: 0.869)\n", + "2. Wars of Scottish Independence (Score: 0.861)\n", + "3. 1651 (Score: 0.853)\n", + "4. First War of Scottish Independence (Score: 0.85)\n", + "5. Robert I of Scotland (Score: 0.846)\n", + "6. 841 (Score: 0.844)\n", + "7. 1716 (Score: 0.844)\n", + "8. 1314 (Score: 0.837)\n", + "9. 1263 (Score: 0.836)\n", + "10. William Wallace (Score: 0.835)\n", + "11. Stirling (Score: 0.831)\n", + "12. 
1306 (Score: 0.831)\n", + "13. 1746 (Score: 0.831)\n", + "14. 1040s (Score: 0.828)\n", + "15. 1106 (Score: 0.827)\n", + "16. 1304 (Score: 0.827)\n", + "17. David II of Scotland (Score: 0.825)\n", + "18. Braveheart (Score: 0.824)\n", + "19. 1124 (Score: 0.824)\n", + "20. July 27 (Score: 0.823)\n" + ] + } + ], + "source": [ + "# This time we'll query using content vector\n", + "query_results = query_qdrant('Famous battles in Scottish history', 'Articles', 'content')\n", + "for i, article in enumerate(query_results):\n", + " print(f'{i + 1}. {article.payload[\"title\"]} (Score: {round(article.score, 3)})')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0119d87a", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "vector_db_split", + "language": "python", + "name": "vector_db_split" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.11" + }, + "vscode": { + "interpreter": { + "hash": "fd16a328ca3d68029457069b79cb0b38eb39a0f5ccc4fe4473d3047707df8207" + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/vector_databases/redis/Using_Redis_for_embeddings_search.ipynb b/examples/vector_databases/redis/Using_Redis_for_embeddings_search.ipynb new file mode 100644 index 00000000..6b18a75d --- /dev/null +++ b/examples/vector_databases/redis/Using_Redis_for_embeddings_search.ipynb @@ -0,0 +1,799 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "cb1537e6", + "metadata": {}, + "source": [ + "# Using Redis for Embeddings Search\n", + "\n", + "This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.\n", + "\n", + "### What is a Vector Database\n", + "\n", + "A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.\n", + "\n", + "### Why use a Vector Database\n", + "\n", + "Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question and answering, chatbot and recommendation services, for example), and make use of them in a secure, scalable environment. 
Many of our customers make embeddings solve their problems at small scale but performance and security hold them back from going into production - we see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.\n", + "\n", + "\n", + "### Demo Flow\n", + "The demo flow is:\n", + "- **Setup**: Import packages and set any required variables\n", + "- **Load data**: Load a dataset and embed it using OpenAI embeddings\n", + "- **Redis**\n", + " - *Setup*: Set up the Redis-Py client. For more details go [here](https://github.com/redis/redis-py)\n", + " - *Index Data*: Create the search index for vector search and hybrid search (vector + full-text search) on all available fields.\n", + " - *Search Data*: Run a few example queries with various goals in mind.\n", + "\n", + "Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings." + ] + }, + { + "cell_type": "markdown", + "id": "e2b59250", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "Import the required libraries and set the embedding model that we'd like to use." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8d8810f9", + "metadata": {}, + "outputs": [], + "source": [ + "# We'll need to install the Redis client\n", + "!pip install redis\n", + "\n", + "#Install wget to pull zip file\n", + "!pip install wget" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "5be94df6", + "metadata": {}, + "outputs": [], + "source": [ + "import openai\n", + "\n", + "from typing import List, Iterator\n", + "import pandas as pd\n", + "import numpy as np\n", + "import os\n", + "import wget\n", + "from ast import literal_eval\n", + "\n", + "# Redis client library for Python\n", + "import redis\n", + "\n", + "# I've set this to our new embeddings model, this can be changed to the embedding model of your choice\n", + "EMBEDDING_MODEL = \"text-embedding-ada-002\"\n", + "\n", + "# Ignore unclosed SSL socket warnings - optional in case you get these errors\n", + "import warnings\n", + "\n", + "warnings.filterwarnings(action=\"ignore\", message=\"unclosed\", category=ResourceWarning)\n", + "warnings.filterwarnings(\"ignore\", category=DeprecationWarning) " + ] + }, + { + "cell_type": "markdown", + "id": "e5d9d2e1", + "metadata": {}, + "source": [ + "## Load data\n", + "\n", + "In this section we'll load embedded data that we've prepared previous to this session." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5dff8b55", + "metadata": {}, + "outputs": [], + "source": [ + "embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n", + "\n", + "# The file is ~700 MB so this will take some time\n", + "wget.download(embeddings_url)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21097972", + "metadata": {}, + "outputs": [], + "source": [ + "import zipfile\n", + "with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\",\"r\") as zip_ref:\n", + " zip_ref.extractall(\"../data\")" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "70bbd8ba", + "metadata": {}, + "outputs": [], + "source": [ + "article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "1721e45d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idurltitletexttitle_vectorcontent_vectorvector_id
01https://simple.wikipedia.org/wiki/AprilAprilApril is the fourth month of the year in the J...[0.001009464613161981, -0.020700545981526375, ...[-0.011253940872848034, -0.013491976074874401,...0
12https://simple.wikipedia.org/wiki/AugustAugustAugust (Aug.) is the eighth month of the year ...[0.0009286514250561595, 0.000820168002974242, ...[0.0003609954728744924, 0.007262262050062418, ...1
26https://simple.wikipedia.org/wiki/ArtArtArt is a creative activity that expresses imag...[0.003393713850528002, 0.0061537534929811954, ...[-0.004959689453244209, 0.015772193670272827, ...2
38https://simple.wikipedia.org/wiki/AAA or a is the first letter of the English alph...[0.0153952119871974, -0.013759135268628597, 0....[0.024894846603274345, -0.022186409682035446, ...3
49https://simple.wikipedia.org/wiki/AirAirAir refers to the Earth's atmosphere. Air is a...[0.02224554680287838, -0.02044147066771984, -0...[0.021524671465158463, 0.018522677943110466, -...4
\n", + "
" + ], + "text/plain": [ + " id url title \\\n", + "0 1 https://simple.wikipedia.org/wiki/April April \n", + "1 2 https://simple.wikipedia.org/wiki/August August \n", + "2 6 https://simple.wikipedia.org/wiki/Art Art \n", + "3 8 https://simple.wikipedia.org/wiki/A A \n", + "4 9 https://simple.wikipedia.org/wiki/Air Air \n", + "\n", + " text \\\n", + "0 April is the fourth month of the year in the J... \n", + "1 August (Aug.) is the eighth month of the year ... \n", + "2 Art is a creative activity that expresses imag... \n", + "3 A or a is the first letter of the English alph... \n", + "4 Air refers to the Earth's atmosphere. Air is a... \n", + "\n", + " title_vector \\\n", + "0 [0.001009464613161981, -0.020700545981526375, ... \n", + "1 [0.0009286514250561595, 0.000820168002974242, ... \n", + "2 [0.003393713850528002, 0.0061537534929811954, ... \n", + "3 [0.0153952119871974, -0.013759135268628597, 0.... \n", + "4 [0.02224554680287838, -0.02044147066771984, -0... \n", + "\n", + " content_vector vector_id \n", + "0 [-0.011253940872848034, -0.013491976074874401,... 0 \n", + "1 [0.0003609954728744924, 0.007262262050062418, ... 1 \n", + "2 [-0.004959689453244209, 0.015772193670272827, ... 2 \n", + "3 [0.024894846603274345, -0.022186409682035446, ... 3 \n", + "4 [0.021524671465158463, 0.018522677943110466, -... 4 " + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "article_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "960b82af", + "metadata": {}, + "outputs": [], + "source": [ + "# Read vectors from strings back into a list\n", + "article_df['title_vector'] = article_df.title_vector.apply(literal_eval)\n", + "article_df['content_vector'] = article_df.content_vector.apply(literal_eval)\n", + "\n", + "# Set vector_id to be a string\n", + "article_df['vector_id'] = article_df['vector_id'].apply(str)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "a334ab8b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 25000 entries, 0 to 24999\n", + "Data columns (total 7 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 id 25000 non-null int64 \n", + " 1 url 25000 non-null object\n", + " 2 title 25000 non-null object\n", + " 3 text 25000 non-null object\n", + " 4 title_vector 25000 non-null object\n", + " 5 content_vector 25000 non-null object\n", + " 6 vector_id 25000 non-null object\n", + "dtypes: int64(1), object(6)\n", + "memory usage: 1.3+ MB\n" + ] + } + ], + "source": [ + "article_df.info(show_counts=True)" + ] + }, + { + "cell_type": "markdown", + "id": "43bffd04", + "metadata": {}, + "source": [ + "# Redis\n", + "\n", + "The next vector database covered in this tutorial is **[Redis](https://redis.io)**. You most likely already know Redis. What you might not be aware of is the [RediSearch module](https://github.com/RediSearch/RediSearch). Enterprises have been using Redis with the RediSearch module for years now across all major cloud providers, Redis Cloud, and on premise. Recently, the Redis team added vector storage and search capability to this module in addition to the features RediSearch already had.\n", + "\n", + "Given the large ecosystem around Redis, there are most likely client libraries in the language you need. You can use any standard Redis client library to run RediSearch commands, but it's easiest to use a library that wraps the RediSearch API. 
Below are a few examples, but you can find more client libraries [here](https://redis.io/resources/clients/).\n", + "\n", + "| Project | Language | License | Author | Stars |\n", + "|----------|---------|--------|---------|-------|\n", + "| [jedis][jedis-url] | Java | MIT | [Redis][redis-url] | ![Stars][jedis-stars] |\n", + "| [redis-py][redis-py-url] | Python | MIT | [Redis][redis-url] | ![Stars][redis-py-stars] |\n", + "| [node-redis][node-redis-url] | Node.js | MIT | [Redis][redis-url] | ![Stars][node-redis-stars] |\n", + "| [nredisstack][nredisstack-url] | .NET | MIT | [Redis][redis-url] | ![Stars][nredisstack-stars] |\n", + "| [redisearch-go][redisearch-go-url] | Go | BSD | [Redis][redisearch-go-author] | [![redisearch-go-stars]][redisearch-go-url] |\n", + "| [redisearch-api-rs][redisearch-api-rs-url] | Rust | BSD | [Redis][redisearch-api-rs-author] | [![redisearch-api-rs-stars]][redisearch-api-rs-url] |\n", + "\n", + "[redis-url]: https://redis.com\n", + "\n", + "[redis-py-url]: https://github.com/redis/redis-py\n", + "[redis-py-stars]: https://img.shields.io/github/stars/redis/redis-py.svg?style=social&label=Star&maxAge=2592000\n", + "[redis-py-package]: https://pypi.python.org/pypi/redis\n", + "\n", + "[jedis-url]: https://github.com/redis/jedis\n", + "[jedis-stars]: https://img.shields.io/github/stars/redis/jedis.svg?style=social&label=Star&maxAge=2592000\n", + "[Jedis-package]: https://search.maven.org/artifact/redis.clients/jedis\n", + "\n", + "[nredisstack-url]: https://github.com/redis/nredisstack\n", + "[nredisstack-stars]: https://img.shields.io/github/stars/redis/nredisstack.svg?style=social&label=Star&maxAge=2592000\n", + "[nredisstack-package]: https://www.nuget.org/packages/nredisstack/\n", + "\n", + "[node-redis-url]: https://github.com/redis/node-redis\n", + "[node-redis-stars]: https://img.shields.io/github/stars/redis/node-redis.svg?style=social&label=Star&maxAge=2592000\n", + "[node-redis-package]: https://www.npmjs.com/package/redis\n", + "\n", + "[redis-om-python-url]: https://github.com/redis/redis-om-python\n", + "[redis-om-python-author]: https://redis.com\n", + "[redis-om-python-stars]: https://img.shields.io/github/stars/redis/redis-om-python.svg?style=social&label=Star&maxAge=2592000\n", + "\n", + "[redisearch-go-url]: https://github.com/RediSearch/redisearch-go\n", + "[redisearch-go-author]: https://redis.com\n", + "[redisearch-go-stars]: https://img.shields.io/github/stars/RediSearch/redisearch-go.svg?style=social&label=Star&maxAge=2592000\n", + "\n", + "[redisearch-api-rs-url]: https://github.com/RediSearch/redisearch-api-rs\n", + "[redisearch-api-rs-author]: https://redis.com\n", + "[redisearch-api-rs-stars]: https://img.shields.io/github/stars/RediSearch/redisearch-api-rs.svg?style=social&label=Star&maxAge=2592000\n", + "\n", + "\n", + "In the below cells, we will walk you through using Redis as a vector database. Since many of you are likely already used to the Redis API, this should be familiar to most." + ] + }, + { + "cell_type": "markdown", + "id": "698e24f6", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "There are many ways to deploy Redis with RediSearch. The easiest way to get started is to use Docker, but there are are many potential options for deployment. 
For other deployment options, see the [redis directory](./redis) in this repo.\n", + "\n", + "For this tutorial, we will use Redis Stack on Docker.\n", + "\n", + "Start a version of Redis with RediSearch (Redis Stack) by running the following docker command\n", + "\n", + "```bash\n", + "$ cd redis\n", + "$ docker compose up -d\n", + "```\n", + "This also includes the [RedisInsight](https://redis.com/redis-enterprise/redis-insight/) GUI for managing your Redis database which you can view at [http://localhost:8001](http://localhost:8001) once you start the docker container.\n", + "\n", + "You're all set up and ready to go! Next, we import and create our client for communicating with the Redis database we just created." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "d2ce669a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import redis\n", + "from redis.commands.search.indexDefinition import (\n", + " IndexDefinition,\n", + " IndexType\n", + ")\n", + "from redis.commands.search.query import Query\n", + "from redis.commands.search.field import (\n", + " TextField,\n", + " VectorField\n", + ")\n", + "\n", + "REDIS_HOST = \"localhost\"\n", + "REDIS_PORT = 6379\n", + "REDIS_PASSWORD = \"\" # default for passwordless Redis\n", + "\n", + "# Connect to Redis\n", + "redis_client = redis.Redis(\n", + " host=REDIS_HOST,\n", + " port=REDIS_PORT,\n", + " password=REDIS_PASSWORD\n", + ")\n", + "redis_client.ping()" + ] + }, + { + "cell_type": "markdown", + "id": "3f6f0af9", + "metadata": {}, + "source": [ + "## Creating a Search Index\n", + "\n", + "The below cells will show how to specify and create a search index in Redis. We will\n", + "\n", + "1. Set some constants for defining our index like the distance metric and the index name\n", + "2. Define the index schema with RediSearch fields\n", + "3. Create the index\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "a7c64cb9", + "metadata": {}, + "outputs": [], + "source": [ + "# Constants\n", + "VECTOR_DIM = len(article_df['title_vector'][0]) # length of the vectors\n", + "VECTOR_NUMBER = len(article_df) # initial number of vectors\n", + "INDEX_NAME = \"embeddings-index\" # name of the search index\n", + "PREFIX = \"doc\" # prefix for the document keys\n", + "DISTANCE_METRIC = \"COSINE\" # distance metric for the vectors (ex. 
COSINE, IP, L2)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "d95fcd06", + "metadata": {}, + "outputs": [], + "source": [ + "# Define RediSearch fields for each of the columns in the dataset\n", + "title = TextField(name=\"title\")\n", + "url = TextField(name=\"url\")\n", + "text = TextField(name=\"text\")\n", + "title_embedding = VectorField(\"title_vector\",\n", + " \"FLAT\", {\n", + " \"TYPE\": \"FLOAT32\",\n", + " \"DIM\": VECTOR_DIM,\n", + " \"DISTANCE_METRIC\": DISTANCE_METRIC,\n", + " \"INITIAL_CAP\": VECTOR_NUMBER,\n", + " }\n", + ")\n", + "text_embedding = VectorField(\"content_vector\",\n", + " \"FLAT\", {\n", + " \"TYPE\": \"FLOAT32\",\n", + " \"DIM\": VECTOR_DIM,\n", + " \"DISTANCE_METRIC\": DISTANCE_METRIC,\n", + " \"INITIAL_CAP\": VECTOR_NUMBER,\n", + " }\n", + ")\n", + "fields = [title, url, text, title_embedding, text_embedding]" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "7418480d", + "metadata": {}, + "outputs": [], + "source": [ + "# Check if index exists\n", + "try:\n", + " redis_client.ft(INDEX_NAME).info()\n", + " print(\"Index already exists\")\n", + "except:\n", + " # Create RediSearch Index\n", + " redis_client.ft(INDEX_NAME).create_index(\n", + " fields = fields,\n", + " definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "f3563eec", + "metadata": {}, + "source": [ + "## Load Documents into the Index\n", + "\n", + "Now that we have a search index, we can load documents into it. We will use the same documents we used in the previous examples. In Redis, either the Hash or JSON (if using RedisJSON in addition to RediSearch) data types can be used to store documents. We will use the HASH data type in this example. The below cells will show how to load documents into the index." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "e98d63ad", + "metadata": {}, + "outputs": [], + "source": [ + "def index_documents(client: redis.Redis, prefix: str, documents: pd.DataFrame):\n", + " records = documents.to_dict(\"records\")\n", + " for doc in records:\n", + " key = f\"{prefix}:{str(doc['id'])}\"\n", + "\n", + " # create byte vectors for title and content\n", + " title_embedding = np.array(doc[\"title_vector\"], dtype=np.float32).tobytes()\n", + " content_embedding = np.array(doc[\"content_vector\"], dtype=np.float32).tobytes()\n", + "\n", + " # replace list of floats with byte vectors\n", + " doc[\"title_vector\"] = title_embedding\n", + " doc[\"content_vector\"] = content_embedding\n", + "\n", + " client.hset(key, mapping = doc)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "098d3c5a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded 25000 documents in Redis search index with name: embeddings-index\n" + ] + } + ], + "source": [ + "index_documents(redis_client, PREFIX, article_df)\n", + "print(f\"Loaded {redis_client.info()['db0']['keys']} documents in Redis search index with name: {INDEX_NAME}\")" + ] + }, + { + "cell_type": "markdown", + "id": "f646bff4", + "metadata": {}, + "source": [ + "## Running Search Queries\n", + "\n", + "Now that we have a search index and documents loaded into it, we can run search queries. Below we will provide a function that will run a search query and return the results. Using this function we run a few queries that will show how you can utilize Redis as a vector database. 
Each example will demonstrate specific features to keep in mind when developing your search application with Redis.\n", + "\n", + "1. **Return Fields**: You can specify which fields you want to return in the search results. This is useful if you only want to return a subset of the fields in your documents and doesn't require a separate call to retrieve documents. In the below example, we will only return the `title` field in the search results.\n", + "2. **Hybrid Search**: You can combine vector search with any of the other RediSearch fields for hybrid search such as full text search, tag, geo, and numeric. In the below example, we will combine vector search with full text search.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "508d1f89", + "metadata": {}, + "outputs": [], + "source": [ + "def search_redis(\n", + " redis_client: redis.Redis,\n", + " user_query: str,\n", + " index_name: str = \"embeddings-index\",\n", + " vector_field: str = \"title_vector\",\n", + " return_fields: list = [\"title\", \"url\", \"text\", \"vector_score\"],\n", + " hybrid_fields = \"*\",\n", + " k: int = 20,\n", + ") -> List[dict]:\n", + "\n", + " # Creates embedding vector from user query\n", + " embedded_query = openai.Embedding.create(input=user_query,\n", + " model=EMBEDDING_MODEL,\n", + " )[\"data\"][0]['embedding']\n", + "\n", + " # Prepare the Query\n", + " base_query = f'{hybrid_fields}=>[KNN {k} @{vector_field} $vector AS vector_score]'\n", + " query = (\n", + " Query(base_query)\n", + " .return_fields(*return_fields)\n", + " .sort_by(\"vector_score\")\n", + " .paging(0, k)\n", + " .dialect(2)\n", + " )\n", + " params_dict = {\"vector\": np.array(embedded_query).astype(dtype=np.float32).tobytes()}\n", + "\n", + " # perform vector search\n", + " results = redis_client.ft(index_name).search(query, params_dict)\n", + " for i, article in enumerate(results.docs):\n", + " score = 1 - float(article.vector_score)\n", + " print(f\"{i}. {article.title} (Score: {round(score ,3) })\")\n", + " return results.docs" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "1f0eef07", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0. Museum of Modern Art (Score: 0.875)\n", + "1. Western Europe (Score: 0.867)\n", + "2. Renaissance art (Score: 0.864)\n", + "3. Pop art (Score: 0.86)\n", + "4. Northern Europe (Score: 0.855)\n", + "5. Hellenistic art (Score: 0.853)\n", + "6. Modernist literature (Score: 0.847)\n", + "7. Art film (Score: 0.843)\n", + "8. Central Europe (Score: 0.843)\n", + "9. European (Score: 0.841)\n" + ] + } + ], + "source": [ + "# For using OpenAI to generate query embedding\n", + "openai.api_key = os.getenv(\"OPENAI_API_KEY\", \"sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\")\n", + "results = search_redis(redis_client, 'modern art in Europe', k=10)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "7b805a81", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0. Battle of Bannockburn (Score: 0.869)\n", + "1. Wars of Scottish Independence (Score: 0.861)\n", + "2. 1651 (Score: 0.853)\n", + "3. First War of Scottish Independence (Score: 0.85)\n", + "4. Robert I of Scotland (Score: 0.846)\n", + "5. 841 (Score: 0.844)\n", + "6. 1716 (Score: 0.844)\n", + "7. 1314 (Score: 0.837)\n", + "8. 1263 (Score: 0.836)\n", + "9. 
William Wallace (Score: 0.835)\n" + ] + } + ], + "source": [ + "results = search_redis(redis_client, 'Famous battles in Scottish history', vector_field='content_vector', k=10)" + ] + }, + { + "cell_type": "markdown", + "id": "0ed0b34e", + "metadata": {}, + "source": [ + "## Hybrid Queries with Redis\n", + "\n", + "The previous examples showed how run vector search queries with RediSearch. In this section, we will show how to combine vector search with other RediSearch fields for hybrid search. In the below example, we will combine vector search with full text search." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "c94d5cce", + "metadata": {}, + "outputs": [], + "source": [ + "def create_hybrid_field(field_name: str, value: str) -> str:\n", + " return f'@{field_name}:\"{value}\"'" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "bfcd31c2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0. First War of Scottish Independence (Score: 0.892)\n", + "1. Wars of Scottish Independence (Score: 0.889)\n", + "2. Second War of Scottish Independence (Score: 0.879)\n", + "3. List of Scottish monarchs (Score: 0.873)\n", + "4. Scottish Borders (Score: 0.863)\n" + ] + } + ], + "source": [ + "# search the content vector for articles about famous battles in Scottish history and only include results with Scottish in the title\n", + "results = search_redis(redis_client,\n", + " \"Famous battles in Scottish history\",\n", + " vector_field=\"title_vector\",\n", + " k=5,\n", + " hybrid_fields=create_hybrid_field(\"title\", \"Scottish\")\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "28ab1e30", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0. Art (Score: 1.0)\n", + "1. Paint (Score: 0.896)\n", + "2. Renaissance art (Score: 0.88)\n", + "3. Painting (Score: 0.874)\n", + "4. Renaissance (Score: 0.846)\n" + ] + }, + { + "data": { + "text/plain": [ + "'In Europe, after the Middle Ages, there was a \"Renaissance\" which means \"rebirth\". People rediscovered science and artists were allowed to paint subjects other than religious subjects. People like Michelangelo and Leonardo da Vinci still painted religious pictures, but they also now could paint mythological pictures too. These artists also invented perspective where things in the distance look smaller in the picture. This was new because in the Middle Ages people would paint all the figures close up and just overlapping each other. 
These artists used nudity regularly in their art.'" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# run a hybrid query for articles about Art in the title vector and only include results with the phrase \"Leonardo da Vinci\" in the text\n", + "results = search_redis(redis_client,\n", + " \"Art\",\n", + " vector_field=\"title_vector\",\n", + " k=5,\n", + " hybrid_fields=create_hybrid_field(\"text\", \"Leonardo da Vinci\")\n", + " )\n", + "\n", + "# find specific mention of Leonardo da Vinci in the text that our full-text-search query returned\n", + "mention = [sentence for sentence in results[0].text.split(\"\\n\") if \"Leonardo da Vinci\" in sentence][0]\n", + "mention" + ] + }, + { + "cell_type": "markdown", + "id": "f94b5be2", + "metadata": {}, + "source": [ + "For more example with Redis as a vector database, see the README and examples within the ``vector_databases/redis`` directory of this repository" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0119d87a", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "vector_db_split", + "language": "python", + "name": "vector_db_split" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.11" + }, + "vscode": { + "interpreter": { + "hash": "fd16a328ca3d68029457069b79cb0b38eb39a0f5ccc4fe4473d3047707df8207" + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/vector_databases/typesense/Using_Typesense_for_embeddings_search.ipynb b/examples/vector_databases/typesense/Using_Typesense_for_embeddings_search.ipynb new file mode 100644 index 00000000..04b94de6 --- /dev/null +++ b/examples/vector_databases/typesense/Using_Typesense_for_embeddings_search.ipynb @@ -0,0 +1,879 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "cb1537e6", + "metadata": {}, + "source": [ + "# Using Typesense for Embeddings Search\n", + "\n", + "This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.\n", + "\n", + "### What is a Vector Database\n", + "\n", + "A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.\n", + "\n", + "### Why use a Vector Database\n", + "\n", + "Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question and answering, chatbot and recommendation services, for example), and make use of them in a secure, scalable environment. 
Many of our customers make embeddings solve their problems at small scale but performance and security hold them back from going into production - we see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.\n", + "\n", + "\n", + "### Demo Flow\n", + "The demo flow is:\n", + "- **Setup**: Import packages and set any required variables\n", + "- **Load data**: Load a dataset and embed it using OpenAI embeddings\n", + "- **Typesense**\n", + " - *Setup*: Set up the Typesense Python client. For more details go [here](https://typesense.org/docs/0.24.0/api/)\n", + " - *Index Data*: We'll create a collection and index it for both __titles__ and __content__.\n", + " - *Search Data*: Run a few example queries with various goals in mind.\n", + "\n", + "Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings." + ] + }, + { + "cell_type": "markdown", + "id": "e2b59250", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "Import the required libraries and set the embedding model that we'd like to use." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8d8810f9", + "metadata": {}, + "outputs": [], + "source": [ + "# We'll need to install the Typesense client\n", + "!pip install typesense\n", + "\n", + "#Install wget to pull zip file\n", + "!pip install wget" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "5be94df6", + "metadata": {}, + "outputs": [], + "source": [ + "import openai\n", + "\n", + "from typing import List, Iterator\n", + "import pandas as pd\n", + "import numpy as np\n", + "import os\n", + "import wget\n", + "from ast import literal_eval\n", + "\n", + "# Typesense's client library for Python\n", + "import typesense\n", + "\n", + "# I've set this to our new embeddings model, this can be changed to the embedding model of your choice\n", + "EMBEDDING_MODEL = \"text-embedding-ada-002\"\n", + "\n", + "# Ignore unclosed SSL socket warnings - optional in case you get these errors\n", + "import warnings\n", + "\n", + "warnings.filterwarnings(action=\"ignore\", message=\"unclosed\", category=ResourceWarning)\n", + "warnings.filterwarnings(\"ignore\", category=DeprecationWarning) " + ] + }, + { + "cell_type": "markdown", + "id": "e5d9d2e1", + "metadata": {}, + "source": [ + "## Load data\n", + "\n", + "In this section we'll load embedded data that we've prepared previous to this session." 
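If you would rather build this kind of dataset from your own text instead of using the pre-embedded file, the sketch below shows one way the vectors could be generated. It is an illustrative addition, not part of the original notebook: it assumes the `openai` 0.x client used elsewhere in this notebook, the `EMBEDDING_MODEL` defined in the setup cell, and an `OPENAI_API_KEY` in your environment, and it mirrors the column layout of the dataset we load next.

```python
# Minimal sketch (assumes the openai 0.x SDK and OPENAI_API_KEY in the environment).
# Shows how title/content vectors like those in the pre-embedded CSV could be produced.
import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")
EMBEDDING_MODEL = "text-embedding-ada-002"

def embed_texts(texts: list[str]) -> list[list[float]]:
    # openai.Embedding.create accepts a list of inputs and returns one
    # 1536-dimensional vector per input (for text-embedding-ada-002), in order.
    response = openai.Embedding.create(input=texts, model=EMBEDDING_MODEL)
    return [record["embedding"] for record in response["data"]]

# Example usage, matching the dataset's columns:
# article = {"title": "Art", "text": "Art is a creative activity..."}
# article["title_vector"] = embed_texts([article["title"]])[0]
# article["content_vector"] = embed_texts([article["text"]])[0]
```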
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5dff8b55", + "metadata": {}, + "outputs": [], + "source": [ + "embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n", + "\n", + "# The file is ~700 MB so this will take some time\n", + "wget.download(embeddings_url)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21097972", + "metadata": {}, + "outputs": [], + "source": [ + "import zipfile\n", + "with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\",\"r\") as zip_ref:\n", + " zip_ref.extractall(\"../data\")" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "70bbd8ba", + "metadata": {}, + "outputs": [], + "source": [ + "article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "1721e45d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
(HTML table preview of article_df.head() omitted; the same rows appear in the text/plain output below.)
" + ], + "text/plain": [ + " id url title \\\n", + "0 1 https://simple.wikipedia.org/wiki/April April \n", + "1 2 https://simple.wikipedia.org/wiki/August August \n", + "2 6 https://simple.wikipedia.org/wiki/Art Art \n", + "3 8 https://simple.wikipedia.org/wiki/A A \n", + "4 9 https://simple.wikipedia.org/wiki/Air Air \n", + "\n", + " text \\\n", + "0 April is the fourth month of the year in the J... \n", + "1 August (Aug.) is the eighth month of the year ... \n", + "2 Art is a creative activity that expresses imag... \n", + "3 A or a is the first letter of the English alph... \n", + "4 Air refers to the Earth's atmosphere. Air is a... \n", + "\n", + " title_vector \\\n", + "0 [0.001009464613161981, -0.020700545981526375, ... \n", + "1 [0.0009286514250561595, 0.000820168002974242, ... \n", + "2 [0.003393713850528002, 0.0061537534929811954, ... \n", + "3 [0.0153952119871974, -0.013759135268628597, 0.... \n", + "4 [0.02224554680287838, -0.02044147066771984, -0... \n", + "\n", + " content_vector vector_id \n", + "0 [-0.011253940872848034, -0.013491976074874401,... 0 \n", + "1 [0.0003609954728744924, 0.007262262050062418, ... 1 \n", + "2 [-0.004959689453244209, 0.015772193670272827, ... 2 \n", + "3 [0.024894846603274345, -0.022186409682035446, ... 3 \n", + "4 [0.021524671465158463, 0.018522677943110466, -... 4 " + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "article_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "960b82af", + "metadata": {}, + "outputs": [], + "source": [ + "# Read vectors from strings back into a list\n", + "article_df['title_vector'] = article_df.title_vector.apply(literal_eval)\n", + "article_df['content_vector'] = article_df.content_vector.apply(literal_eval)\n", + "\n", + "# Set vector_id to be a string\n", + "article_df['vector_id'] = article_df['vector_id'].apply(str)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "a334ab8b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 25000 entries, 0 to 24999\n", + "Data columns (total 7 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 id 25000 non-null int64 \n", + " 1 url 25000 non-null object\n", + " 2 title 25000 non-null object\n", + " 3 text 25000 non-null object\n", + " 4 title_vector 25000 non-null object\n", + " 5 content_vector 25000 non-null object\n", + " 6 vector_id 25000 non-null object\n", + "dtypes: int64(1), object(6)\n", + "memory usage: 1.3+ MB\n" + ] + } + ], + "source": [ + "article_df.info(show_counts=True)" + ] + }, + { + "cell_type": "markdown", + "id": "bb09e0ec", + "metadata": {}, + "source": [ + "## Typesense\n", + "\n", + "The next vector store we'll look at is [Typesense](https://typesense.org/), which is an open source, in-memory search engine, that you can either self-host or run on [Typesense Cloud](https://cloud.typesense.org).\n", + "\n", + "Typesense focuses on performance by storing the entire index in RAM (with a backup on disk) and also focuses on providing an out-of-the-box developer experience by simplifying available options and setting good defaults. It also lets you combine attribute-based filtering together with vector queries.\n", + "\n", + "For this example, we will set up a local docker-based Typesense server, index our vectors in Typesense and then do some nearest-neighbor search queries. 
If you use Typesense Cloud, you can skip the docker setup part and just obtain the hostname and API keys from your cluster dashboard." + ] + }, + { + "cell_type": "markdown", + "id": "bd629f7d", + "metadata": {}, + "source": [ + "### Setup\n", + "\n", + "To run Typesense locally, you'll need [Docker](https://www.docker.com/). Following the instructions contained in the Typesense documentation [here](https://typesense.org/docs/guide/install-typesense.html#docker-compose), we created an example docker-compose.yml file in this repo saved at [./typesense/docker-compose.yml](./typesense/docker-compose.yml).\n", + "\n", + "After starting Docker, you can start Typesense locally by navigating to the `examples/vector_databases/typesense/` directory and running `docker-compose up -d`.\n", + "\n", + "The default API key is set to `xyz` in the Docker compose file, and the default Typesense port to `8108`." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "2bee46b1", + "metadata": {}, + "outputs": [], + "source": [ + "import typesense\n", + "\n", + "typesense_client = \\\n", + " typesense.Client({\n", + " \"nodes\": [{\n", + " \"host\": \"localhost\", # For Typesense Cloud use xxx.a1.typesense.net\n", + " \"port\": \"8108\", # For Typesense Cloud use 443\n", + " \"protocol\": \"http\" # For Typesense Cloud use https\n", + " }],\n", + " \"api_key\": \"xyz\",\n", + " \"connection_timeout_seconds\": 60\n", + " })" + ] + }, + { + "cell_type": "markdown", + "id": "11910afb", + "metadata": {}, + "source": [ + "### Index data\n", + "\n", + "To index vectors in Typesense, we'll first create a Collection (which is a collection of Documents) and turn on vector indexing for a particular field. You can even store multiple vector fields in a single document." 
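Before running the import cells that follow, it can help to see the shape of a single document. The sketch below is an illustrative addition rather than part of the original flow: it assumes the `wikipedia_articles` collection created in the next cell already exists and that `article_df` is loaded, and it uses the Typesense client's single-document `documents.upsert` call to show that each document carries two vector fields plus display fields (`title`, `content`) that are not declared in the schema but are still stored and returned with hits.

```python
# Illustrative sketch only (assumes the wikipedia_articles collection from the
# next cell exists and article_df is already loaded). Upserts one document so
# you can see the shape the batched import below uses.
row = article_df.iloc[0]

single_doc = {
    "title_vector": row["title_vector"],      # indexed vector field (float[])
    "content_vector": row["content_vector"],  # a second vector field in the same document
    "title": row["title"],                    # not in the schema: stored and returned with hits
    "content": row["text"],
}

typesense_client.collections["wikipedia_articles"].documents.upsert(single_doc)
```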
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "dd055c80", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': 1687165065, 'default_sorting_field': '', 'enable_nested_fields': False, 'fields': [{'facet': False, 'index': True, 'infix': False, 'locale': '', 'name': 'content_vector', 'num_dim': 1536, 'optional': False, 'sort': False, 'type': 'float[]'}, {'facet': False, 'index': True, 'infix': False, 'locale': '', 'name': 'title_vector', 'num_dim': 1536, 'optional': False, 'sort': False, 'type': 'float[]'}], 'name': 'wikipedia_articles', 'num_documents': 0, 'symbols_to_index': [], 'token_separators': []}\n", + "Created new collection wikipedia-articles\n" + ] + } + ], + "source": [ + "# Delete existing collections if they already exist\n", + "try:\n", + " typesense_client.collections['wikipedia_articles'].delete()\n", + "except Exception as e:\n", + " pass\n", + "\n", + "# Create a new collection\n", + "\n", + "schema = {\n", + " \"name\": \"wikipedia_articles\",\n", + " \"fields\": [\n", + " {\n", + " \"name\": \"content_vector\",\n", + " \"type\": \"float[]\",\n", + " \"num_dim\": len(article_df['content_vector'][0])\n", + " },\n", + " {\n", + " \"name\": \"title_vector\",\n", + " \"type\": \"float[]\",\n", + " \"num_dim\": len(article_df['title_vector'][0])\n", + " }\n", + " ]\n", + "}\n", + "\n", + "create_response = typesense_client.collections.create(schema)\n", + "print(create_response)\n", + "\n", + "print(\"Created new collection wikipedia-articles\")" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "94bbbb11", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Indexing vectors in Typesense...\n", + "Processed 100 / 25000 \n", + "Processed 200 / 25000 \n", + "Processed 300 / 25000 \n", + "Processed 400 / 25000 \n", + "Processed 500 / 25000 \n", + "Processed 600 / 25000 \n", + "Processed 700 / 25000 \n", + "Processed 800 / 25000 \n", + "Processed 900 / 25000 \n", + "Processed 1000 / 25000 \n", + "Processed 1100 / 25000 \n", + "Processed 1200 / 25000 \n", + "Processed 1300 / 25000 \n", + "Processed 1400 / 25000 \n", + "Processed 1500 / 25000 \n", + "Processed 1600 / 25000 \n", + "Processed 1700 / 25000 \n", + "Processed 1800 / 25000 \n", + "Processed 1900 / 25000 \n", + "Processed 2000 / 25000 \n", + "Processed 2100 / 25000 \n", + "Processed 2200 / 25000 \n", + "Processed 2300 / 25000 \n", + "Processed 2400 / 25000 \n", + "Processed 2500 / 25000 \n", + "Processed 2600 / 25000 \n", + "Processed 2700 / 25000 \n", + "Processed 2800 / 25000 \n", + "Processed 2900 / 25000 \n", + "Processed 3000 / 25000 \n", + "Processed 3100 / 25000 \n", + "Processed 3200 / 25000 \n", + "Processed 3300 / 25000 \n", + "Processed 3400 / 25000 \n", + "Processed 3500 / 25000 \n", + "Processed 3600 / 25000 \n", + "Processed 3700 / 25000 \n", + "Processed 3800 / 25000 \n", + "Processed 3900 / 25000 \n", + "Processed 4000 / 25000 \n", + "Processed 4100 / 25000 \n", + "Processed 4200 / 25000 \n", + "Processed 4300 / 25000 \n", + "Processed 4400 / 25000 \n", + "Processed 4500 / 25000 \n", + "Processed 4600 / 25000 \n", + "Processed 4700 / 25000 \n", + "Processed 4800 / 25000 \n", + "Processed 4900 / 25000 \n", + "Processed 5000 / 25000 \n", + "Processed 5100 / 25000 \n", + "Processed 5200 / 25000 \n", + "Processed 5300 / 25000 \n", + "Processed 5400 / 25000 \n", + "Processed 5500 / 25000 \n", + "Processed 5600 / 25000 \n", + "Processed 5700 / 25000 \n", + "Processed 
5800 / 25000 \n", + "Processed 5900 / 25000 \n", + "Processed 6000 / 25000 \n", + "Processed 6100 / 25000 \n", + "Processed 6200 / 25000 \n", + "Processed 6300 / 25000 \n", + "Processed 6400 / 25000 \n", + "Processed 6500 / 25000 \n", + "Processed 6600 / 25000 \n", + "Processed 6700 / 25000 \n", + "Processed 6800 / 25000 \n", + "Processed 6900 / 25000 \n", + "Processed 7000 / 25000 \n", + "Processed 7100 / 25000 \n", + "Processed 7200 / 25000 \n", + "Processed 7300 / 25000 \n", + "Processed 7400 / 25000 \n", + "Processed 7500 / 25000 \n", + "Processed 7600 / 25000 \n", + "Processed 7700 / 25000 \n", + "Processed 7800 / 25000 \n", + "Processed 7900 / 25000 \n", + "Processed 8000 / 25000 \n", + "Processed 8100 / 25000 \n", + "Processed 8200 / 25000 \n", + "Processed 8300 / 25000 \n", + "Processed 8400 / 25000 \n", + "Processed 8500 / 25000 \n", + "Processed 8600 / 25000 \n", + "Processed 8700 / 25000 \n", + "Processed 8800 / 25000 \n", + "Processed 8900 / 25000 \n", + "Processed 9000 / 25000 \n", + "Processed 9100 / 25000 \n", + "Processed 9200 / 25000 \n", + "Processed 9300 / 25000 \n", + "Processed 9400 / 25000 \n", + "Processed 9500 / 25000 \n", + "Processed 9600 / 25000 \n", + "Processed 9700 / 25000 \n", + "Processed 9800 / 25000 \n", + "Processed 9900 / 25000 \n", + "Processed 10000 / 25000 \n", + "Processed 10100 / 25000 \n", + "Processed 10200 / 25000 \n", + "Processed 10300 / 25000 \n", + "Processed 10400 / 25000 \n", + "Processed 10500 / 25000 \n", + "Processed 10600 / 25000 \n", + "Processed 10700 / 25000 \n", + "Processed 10800 / 25000 \n", + "Processed 10900 / 25000 \n", + "Processed 11000 / 25000 \n", + "Processed 11100 / 25000 \n", + "Processed 11200 / 25000 \n", + "Processed 11300 / 25000 \n", + "Processed 11400 / 25000 \n", + "Processed 11500 / 25000 \n", + "Processed 11600 / 25000 \n", + "Processed 11700 / 25000 \n", + "Processed 11800 / 25000 \n", + "Processed 11900 / 25000 \n", + "Processed 12000 / 25000 \n", + "Processed 12100 / 25000 \n", + "Processed 12200 / 25000 \n", + "Processed 12300 / 25000 \n", + "Processed 12400 / 25000 \n", + "Processed 12500 / 25000 \n", + "Processed 12600 / 25000 \n", + "Processed 12700 / 25000 \n", + "Processed 12800 / 25000 \n", + "Processed 12900 / 25000 \n", + "Processed 13000 / 25000 \n", + "Processed 13100 / 25000 \n", + "Processed 13200 / 25000 \n", + "Processed 13300 / 25000 \n", + "Processed 13400 / 25000 \n", + "Processed 13500 / 25000 \n", + "Processed 13600 / 25000 \n", + "Processed 13700 / 25000 \n", + "Processed 13800 / 25000 \n", + "Processed 13900 / 25000 \n", + "Processed 14000 / 25000 \n", + "Processed 14100 / 25000 \n", + "Processed 14200 / 25000 \n", + "Processed 14300 / 25000 \n", + "Processed 14400 / 25000 \n", + "Processed 14500 / 25000 \n", + "Processed 14600 / 25000 \n", + "Processed 14700 / 25000 \n", + "Processed 14800 / 25000 \n", + "Processed 14900 / 25000 \n", + "Processed 15000 / 25000 \n", + "Processed 15100 / 25000 \n", + "Processed 15200 / 25000 \n", + "Processed 15300 / 25000 \n", + "Processed 15400 / 25000 \n", + "Processed 15500 / 25000 \n", + "Processed 15600 / 25000 \n", + "Processed 15700 / 25000 \n", + "Processed 15800 / 25000 \n", + "Processed 15900 / 25000 \n", + "Processed 16000 / 25000 \n", + "Processed 16100 / 25000 \n", + "Processed 16200 / 25000 \n", + "Processed 16300 / 25000 \n", + "Processed 16400 / 25000 \n", + "Processed 16500 / 25000 \n", + "Processed 16600 / 25000 \n", + "Processed 16700 / 25000 \n", + "Processed 16800 / 25000 \n", + "Processed 16900 / 25000 \n", + "Processed 17000 / 
25000 \n", + "Processed 17100 / 25000 \n", + "Processed 17200 / 25000 \n", + "Processed 17300 / 25000 \n", + "Processed 17400 / 25000 \n", + "Processed 17500 / 25000 \n", + "Processed 17600 / 25000 \n", + "Processed 17700 / 25000 \n", + "Processed 17800 / 25000 \n", + "Processed 17900 / 25000 \n", + "Processed 18000 / 25000 \n", + "Processed 18100 / 25000 \n", + "Processed 18200 / 25000 \n", + "Processed 18300 / 25000 \n", + "Processed 18400 / 25000 \n", + "Processed 18500 / 25000 \n", + "Processed 18600 / 25000 \n", + "Processed 18700 / 25000 \n", + "Processed 18800 / 25000 \n", + "Processed 18900 / 25000 \n", + "Processed 19000 / 25000 \n", + "Processed 19100 / 25000 \n", + "Processed 19200 / 25000 \n", + "Processed 19300 / 25000 \n", + "Processed 19400 / 25000 \n", + "Processed 19500 / 25000 \n", + "Processed 19600 / 25000 \n", + "Processed 19700 / 25000 \n", + "Processed 19800 / 25000 \n", + "Processed 19900 / 25000 \n", + "Processed 20000 / 25000 \n", + "Processed 20100 / 25000 \n", + "Processed 20200 / 25000 \n", + "Processed 20300 / 25000 \n", + "Processed 20400 / 25000 \n", + "Processed 20500 / 25000 \n", + "Processed 20600 / 25000 \n", + "Processed 20700 / 25000 \n", + "Processed 20800 / 25000 \n", + "Processed 20900 / 25000 \n", + "Processed 21000 / 25000 \n", + "Processed 21100 / 25000 \n", + "Processed 21200 / 25000 \n", + "Processed 21300 / 25000 \n", + "Processed 21400 / 25000 \n", + "Processed 21500 / 25000 \n", + "Processed 21600 / 25000 \n", + "Processed 21700 / 25000 \n", + "Processed 21800 / 25000 \n", + "Processed 21900 / 25000 \n", + "Processed 22000 / 25000 \n", + "Processed 22100 / 25000 \n", + "Processed 22200 / 25000 \n", + "Processed 22300 / 25000 \n", + "Processed 22400 / 25000 \n", + "Processed 22500 / 25000 \n", + "Processed 22600 / 25000 \n", + "Processed 22700 / 25000 \n", + "Processed 22800 / 25000 \n", + "Processed 22900 / 25000 \n", + "Processed 23000 / 25000 \n", + "Processed 23100 / 25000 \n", + "Processed 23200 / 25000 \n", + "Processed 23300 / 25000 \n", + "Processed 23400 / 25000 \n", + "Processed 23500 / 25000 \n", + "Processed 23600 / 25000 \n", + "Processed 23700 / 25000 \n", + "Processed 23800 / 25000 \n", + "Processed 23900 / 25000 \n", + "Processed 24000 / 25000 \n", + "Processed 24100 / 25000 \n", + "Processed 24200 / 25000 \n", + "Processed 24300 / 25000 \n", + "Processed 24400 / 25000 \n", + "Processed 24500 / 25000 \n", + "Processed 24600 / 25000 \n", + "Processed 24700 / 25000 \n", + "Processed 24800 / 25000 \n", + "Processed 24900 / 25000 \n", + "Processed 25000 / 25000 \n", + "Imported (25000) articles.\n" + ] + } + ], + "source": [ + "# Upsert the vector data into the collection we just created\n", + "#\n", + "# Note: This can take a few minutes, especially if your on an M1 and running docker in an emulated mode\n", + "\n", + "print(\"Indexing vectors in Typesense...\")\n", + "\n", + "document_counter = 0\n", + "documents_batch = []\n", + "\n", + "for k,v in article_df.iterrows():\n", + " # Create a document with the vector data\n", + "\n", + " # Notice how you can add any fields that you haven't added to the schema to the document.\n", + " # These will be stored on disk and returned when the document is a hit.\n", + " # This is useful to store attributes required for display purposes.\n", + "\n", + " document = {\n", + " \"title_vector\": v[\"title_vector\"],\n", + " \"content_vector\": v[\"content_vector\"],\n", + " \"title\": v[\"title\"],\n", + " \"content\": v[\"text\"],\n", + " }\n", + " documents_batch.append(document)\n", + " 
document_counter = document_counter + 1\n", + "\n", + " # Upsert a batch of 100 documents\n", + " if document_counter % 100 == 0 or document_counter == len(article_df):\n", + " response = typesense_client.collections['wikipedia_articles'].documents.import_(documents_batch)\n", + " # print(response)\n", + "\n", + " documents_batch = []\n", + " print(f\"Processed {document_counter} / {len(article_df)} \")\n", + "\n", + "print(f\"Imported ({len(article_df)}) articles.\")" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "f774ecb2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collection has 25000 documents\n" + ] + } + ], + "source": [ + "# Check the number of documents imported\n", + "\n", + "collection = typesense_client.collections['wikipedia_articles'].retrieve()\n", + "print(f'Collection has {collection[\"num_documents\"]} documents')" + ] + }, + { + "cell_type": "markdown", + "id": "fbc6f5c5", + "metadata": {}, + "source": [ + "### Search Data\n", + "\n", + "Now that we've imported the vectors into Typesense, we can do a nearest neighbor search on the `title_vector` or `content_vector` field." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "d9a3f0dc", + "metadata": {}, + "outputs": [], + "source": [ + "def query_typesense(query, field='title', top_k=20):\n", + "\n", + " # Creates embedding vector from user query\n", + " openai.api_key = os.getenv(\"OPENAI_API_KEY\", \"sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\")\n", + " embedded_query = openai.Embedding.create(\n", + " input=query,\n", + " model=EMBEDDING_MODEL,\n", + " )['data'][0]['embedding']\n", + "\n", + " typesense_results = typesense_client.multi_search.perform({\n", + " \"searches\": [{\n", + " \"q\": \"*\",\n", + " \"collection\": \"wikipedia_articles\",\n", + " \"vector_query\": f\"{field}_vector:([{','.join(str(v) for v in embedded_query)}], k:{top_k})\"\n", + " }]\n", + " }, {})\n", + "\n", + " return typesense_results" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "24183c36", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1. Museum of Modern Art (Distance: 0.12482291460037231)\n", + "2. Western Europe (Distance: 0.13255876302719116)\n", + "3. Renaissance art (Distance: 0.13584274053573608)\n", + "4. Pop art (Distance: 0.1396539807319641)\n", + "5. Northern Europe (Distance: 0.14534103870391846)\n", + "6. Hellenistic art (Distance: 0.1472070813179016)\n", + "7. Modernist literature (Distance: 0.15296930074691772)\n", + "8. Art film (Distance: 0.1567266583442688)\n", + "9. Central Europe (Distance: 0.15741699934005737)\n", + "10. European (Distance: 0.1585891842842102)\n" + ] + } + ], + "source": [ + "query_results = query_typesense('modern art in Europe', 'title')\n", + "\n", + "for i, hit in enumerate(query_results['results'][0]['hits']):\n", + " document = hit[\"document\"]\n", + " vector_distance = hit[\"vector_distance\"]\n", + " print(f'{i + 1}. {document[\"title\"]} (Distance: {vector_distance})')" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "a64e3c80", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1. Battle of Bannockburn (Distance: 0.1306111216545105)\n", + "2. Wars of Scottish Independence (Distance: 0.1384994387626648)\n", + "3. 1651 (Distance: 0.14744246006011963)\n", + "4. First War of Scottish Independence (Distance: 0.15033596754074097)\n", + "5. 
Robert I of Scotland (Distance: 0.15376019477844238)\n", + "6. 841 (Distance: 0.15609073638916016)\n", + "7. 1716 (Distance: 0.15615153312683105)\n", + "8. 1314 (Distance: 0.16280347108840942)\n", + "9. 1263 (Distance: 0.16361045837402344)\n", + "10. William Wallace (Distance: 0.16464537382125854)\n" + ] + } + ], + "source": [ + "query_results = query_typesense('Famous battles in Scottish history', 'content')\n", + "\n", + "for i, hit in enumerate(query_results['results'][0]['hits']):\n", + " document = hit[\"document\"]\n", + " vector_distance = hit[\"vector_distance\"]\n", + " print(f'{i + 1}. {document[\"title\"]} (Distance: {vector_distance})')" + ] + }, + { + "cell_type": "markdown", + "id": "55afccbf", + "metadata": {}, + "source": [ + "Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0119d87a", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "vector_db_split", + "language": "python", + "name": "vector_db_split" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.11" + }, + "vscode": { + "interpreter": { + "hash": "fd16a328ca3d68029457069b79cb0b38eb39a0f5ccc4fe4473d3047707df8207" + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/vector_databases/weaviate/Using_Weaviate_for_embeddings_search.ipynb b/examples/vector_databases/weaviate/Using_Weaviate_for_embeddings_search.ipynb new file mode 100644 index 00000000..4332f717 --- /dev/null +++ b/examples/vector_databases/weaviate/Using_Weaviate_for_embeddings_search.ipynb @@ -0,0 +1,1173 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "cb1537e6", + "metadata": {}, + "source": [ + "# Using Weaviate for Embeddings Search\n", + "\n", + "This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.\n", + "\n", + "### What is a Vector Database\n", + "\n", + "A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.\n", + "\n", + "### Why use a Vector Database\n", + "\n", + "Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question and answering, chatbot and recommendation services, for example), and make use of them in a secure, scalable environment. 
Many of our customers make embeddings solve their problems at small scale but performance and security hold them back from going into production - we see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.\n", + "\n", + "\n", + "### Demo Flow\n", + "The demo flow is:\n", + "- **Setup**: Import packages and set any required variables\n", + "- **Load data**: Load a dataset and embed it using OpenAI embeddings\n", + "- **Weaviate**\n", + " - *Setup*: Here we'll set up the Python client for Weaviate. For more details go [here](https://weaviate.io/developers/weaviate/current/client-libraries/python.html)\n", + " - *Index Data*: We'll create an index with __title__ search vectors in it\n", + " - *Search Data*: We'll run a few searches to confirm it works\n", + "\n", + "Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings." + ] + }, + { + "cell_type": "markdown", + "id": "e2b59250", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "Import the required libraries and set the embedding model that we'd like to use." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8d8810f9", + "metadata": {}, + "outputs": [], + "source": [ + "# We'll need to install the Weaviate client\n", + "!pip install weaviate-client\n", + "\n", + "#Install wget to pull zip file\n", + "!pip install wget" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "5be94df6", + "metadata": {}, + "outputs": [], + "source": [ + "import openai\n", + "\n", + "from typing import List, Iterator\n", + "import pandas as pd\n", + "import numpy as np\n", + "import os\n", + "import wget\n", + "from ast import literal_eval\n", + "\n", + "# Weaviate's client library for Python\n", + "import weaviate\n", + "\n", + "# I've set this to our new embeddings model, this can be changed to the embedding model of your choice\n", + "EMBEDDING_MODEL = \"text-embedding-ada-002\"\n", + "\n", + "# Ignore unclosed SSL socket warnings - optional in case you get these errors\n", + "import warnings\n", + "\n", + "warnings.filterwarnings(action=\"ignore\", message=\"unclosed\", category=ResourceWarning)\n", + "warnings.filterwarnings(\"ignore\", category=DeprecationWarning) " + ] + }, + { + "cell_type": "markdown", + "id": "e5d9d2e1", + "metadata": {}, + "source": [ + "## Load data\n", + "\n", + "In this section we'll load embedded data that we've prepared previous to this session." 
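The next cell downloads the ~700 MB archive every time it runs. As a small optional convenience (an addition to the original flow, reusing the same URL and filename), you can skip the download when the file is already on disk:

```python
# Optional: avoid re-downloading the ~700 MB archive if it is already present locally.
import os
import wget

embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'
zip_path = "vector_database_wikipedia_articles_embedded.zip"

if not os.path.exists(zip_path):
    wget.download(embeddings_url, zip_path)  # wget.download(url, out) writes to the given path
```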
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5dff8b55", + "metadata": {}, + "outputs": [], + "source": [ + "embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n", + "\n", + "# The file is ~700 MB so this will take some time\n", + "wget.download(embeddings_url)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21097972", + "metadata": {}, + "outputs": [], + "source": [ + "import zipfile\n", + "with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\",\"r\") as zip_ref:\n", + " zip_ref.extractall(\"../data\")" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "70bbd8ba", + "metadata": {}, + "outputs": [], + "source": [ + "article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "1721e45d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
(HTML table preview of article_df.head() omitted; the same rows appear in the text/plain output below.)
" + ], + "text/plain": [ + " id url title \\\n", + "0 1 https://simple.wikipedia.org/wiki/April April \n", + "1 2 https://simple.wikipedia.org/wiki/August August \n", + "2 6 https://simple.wikipedia.org/wiki/Art Art \n", + "3 8 https://simple.wikipedia.org/wiki/A A \n", + "4 9 https://simple.wikipedia.org/wiki/Air Air \n", + "\n", + " text \\\n", + "0 April is the fourth month of the year in the J... \n", + "1 August (Aug.) is the eighth month of the year ... \n", + "2 Art is a creative activity that expresses imag... \n", + "3 A or a is the first letter of the English alph... \n", + "4 Air refers to the Earth's atmosphere. Air is a... \n", + "\n", + " title_vector \\\n", + "0 [0.001009464613161981, -0.020700545981526375, ... \n", + "1 [0.0009286514250561595, 0.000820168002974242, ... \n", + "2 [0.003393713850528002, 0.0061537534929811954, ... \n", + "3 [0.0153952119871974, -0.013759135268628597, 0.... \n", + "4 [0.02224554680287838, -0.02044147066771984, -0... \n", + "\n", + " content_vector vector_id \n", + "0 [-0.011253940872848034, -0.013491976074874401,... 0 \n", + "1 [0.0003609954728744924, 0.007262262050062418, ... 1 \n", + "2 [-0.004959689453244209, 0.015772193670272827, ... 2 \n", + "3 [0.024894846603274345, -0.022186409682035446, ... 3 \n", + "4 [0.021524671465158463, 0.018522677943110466, -... 4 " + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "article_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "960b82af", + "metadata": {}, + "outputs": [], + "source": [ + "# Read vectors from strings back into a list\n", + "article_df['title_vector'] = article_df.title_vector.apply(literal_eval)\n", + "article_df['content_vector'] = article_df.content_vector.apply(literal_eval)\n", + "\n", + "# Set vector_id to be a string\n", + "article_df['vector_id'] = article_df['vector_id'].apply(str)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "a334ab8b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 25000 entries, 0 to 24999\n", + "Data columns (total 7 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 id 25000 non-null int64 \n", + " 1 url 25000 non-null object\n", + " 2 title 25000 non-null object\n", + " 3 text 25000 non-null object\n", + " 4 title_vector 25000 non-null object\n", + " 5 content_vector 25000 non-null object\n", + " 6 vector_id 25000 non-null object\n", + "dtypes: int64(1), object(6)\n", + "memory usage: 1.3+ MB\n" + ] + } + ], + "source": [ + "article_df.info(show_counts=True)" + ] + }, + { + "cell_type": "markdown", + "id": "d939342f", + "metadata": {}, + "source": [ + "## Weaviate\n", + "\n", + "Another vector database option we'll explore is **Weaviate**, which offers both a managed, [SaaS](https://console.weaviate.io/) option, as well as a self-hosted [open source](https://github.com/weaviate/weaviate) option. As we've already looked at a cloud vector database, we'll try the self-hosted option here.\n", + "\n", + "For this we will:\n", + "- Set up a local deployment of Weaviate\n", + "- Create indices in Weaviate\n", + "- Store our data there\n", + "- Fire some similarity search queries\n", + "- Try a real use case\n", + "\n", + "\n", + "### Bring your own vectors approach\n", + "In this cookbook, we provide the data with already generated vectors. 
This is a good approach for scenarios, where your data is already vectorized.\n", + "\n", + "### Automated vectorization with OpenAI module\n", + "For scenarios, where your data is not vectorized yet, you can delegate the vectorization task with OpenAI to Weaviate.\n", + "Weaviate offers a built-in module [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai), which takes care of the vectorization for you at:\n", + "* import\n", + "* for any CRUD operations\n", + "* for semantic search\n", + "\n", + "Check out the [Getting Started with Weaviate and OpenAI module cookbook](./weaviate/getting-started-with-weaviate-and-openai.ipynb) to learn step by step how to import and vectorize data in one step." + ] + }, + { + "cell_type": "markdown", + "id": "bfdfe260", + "metadata": {}, + "source": [ + "### Setup\n", + "\n", + "To run Weaviate locally, you'll need [Docker](https://www.docker.com/). Following the instructions contained in the Weaviate documentation [here](https://weaviate.io/developers/weaviate/installation/docker-compose), we created an example docker-compose.yml file in this repo saved at [./weaviate/docker-compose.yml](./weaviate/docker-compose.yml).\n", + "\n", + "After starting Docker, you can start Weaviate locally by navigating to the `examples/vector_databases/weaviate/` directory and running `docker-compose up -d`.\n", + "\n", + "#### SaaS\n", + "Alternatively you can use [Weaviate Cloud Service](https://console.weaviate.io/) (WCS) to create a free Weaviate cluster.\n", + "1. create a free account and/or login to [WCS](https://console.weaviate.io/)\n", + "2. create a `Weaviate Cluster` with the following settings:\n", + " * Sandbox: `Sandbox Free`\n", + " * Weaviate Version: Use default (latest)\n", + " * OIDC Authentication: `Disabled`\n", + "3. your instance should be ready in a minute or two\n", + "4. make a note of the `Cluster Id`. The link will take you to the full path of your cluster (you will need it later to connect to it). It should be something like: `https://your-project-name-suffix.weaviate.network` " + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "a78f95d1", + "metadata": {}, + "outputs": [], + "source": [ + "# Option #1 - Self-hosted - Weaviate Open Source \n", + "client = weaviate.Client(\n", + " url=\"http://localhost:8080\",\n", + " additional_headers={\n", + " \"X-OpenAI-Api-Key\": os.getenv(\"OPENAI_API_KEY\")\n", + " }\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e00b7d68", + "metadata": {}, + "outputs": [], + "source": [ + "# Option #2 - SaaS - (Weaviate Cloud Service)\n", + "client = weaviate.Client(\n", + " url=\"https://your-wcs-instance-name.weaviate.network\",\n", + " additional_headers={\n", + " \"X-OpenAI-Api-Key\": os.getenv(\"OPENAI_API_KEY\")\n", + " }\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1d370afa", + "metadata": {}, + "outputs": [], + "source": [ + "client.is_ready()" + ] + }, + { + "cell_type": "markdown", + "id": "03a926b9", + "metadata": {}, + "source": [ + "### Index data\n", + "\n", + "In Weaviate you create __schemas__ to capture each of the entities you will be searching. 
\n", + "\n", + "In this case we'll create a schema called **Article** with the **title** vector from above included for us to search by.\n", + "\n", + "The next few steps closely follow the documentation Weaviate provides [here](https://weaviate.io/developers/weaviate/quickstart).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "0e6175a1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'classes': [{'class': 'Article',\n", + " 'description': 'A collection of articles',\n", + " 'invertedIndexConfig': {'bm25': {'b': 0.75, 'k1': 1.2},\n", + " 'cleanupIntervalSeconds': 60,\n", + " 'stopwords': {'additions': None, 'preset': 'en', 'removals': None}},\n", + " 'moduleConfig': {'text2vec-openai': {'model': 'ada',\n", + " 'modelVersion': '002',\n", + " 'type': 'text',\n", + " 'vectorizeClassName': True}},\n", + " 'properties': [{'dataType': ['string'],\n", + " 'description': 'Title of the article',\n", + " 'moduleConfig': {'text2vec-openai': {'skip': False,\n", + " 'vectorizePropertyName': False}},\n", + " 'name': 'title',\n", + " 'tokenization': 'word'},\n", + " {'dataType': ['text'],\n", + " 'description': 'Contents of the article',\n", + " 'moduleConfig': {'text2vec-openai': {'skip': True,\n", + " 'vectorizePropertyName': False}},\n", + " 'name': 'content',\n", + " 'tokenization': 'word'}],\n", + " 'replicationConfig': {'factor': 1},\n", + " 'shardingConfig': {'virtualPerPhysical': 128,\n", + " 'desiredCount': 1,\n", + " 'actualCount': 1,\n", + " 'desiredVirtualCount': 128,\n", + " 'actualVirtualCount': 128,\n", + " 'key': '_id',\n", + " 'strategy': 'hash',\n", + " 'function': 'murmur3'},\n", + " 'vectorIndexConfig': {'skip': False,\n", + " 'cleanupIntervalSeconds': 300,\n", + " 'maxConnections': 64,\n", + " 'efConstruction': 128,\n", + " 'ef': -1,\n", + " 'dynamicEfMin': 100,\n", + " 'dynamicEfMax': 500,\n", + " 'dynamicEfFactor': 8,\n", + " 'vectorCacheMaxObjects': 1000000000000,\n", + " 'flatSearchCutoff': 40000,\n", + " 'distance': 'cosine'},\n", + " 'vectorIndexType': 'hnsw',\n", + " 'vectorizer': 'text2vec-openai'}]}" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Clear up the schema, so that we can recreate it\n", + "client.schema.delete_all()\n", + "client.schema.get()\n", + "\n", + "# Define the Schema object to use `text-embedding-ada-002` on `title` and `content`, but skip it for `url`\n", + "article_schema = {\n", + " \"class\": \"Article\",\n", + " \"description\": \"A collection of articles\",\n", + " \"vectorizer\": \"text2vec-openai\",\n", + " \"moduleConfig\": {\n", + " \"text2vec-openai\": {\n", + " \"model\": \"ada\",\n", + " \"modelVersion\": \"002\",\n", + " \"type\": \"text\"\n", + " }\n", + " },\n", + " \"properties\": [{\n", + " \"name\": \"title\",\n", + " \"description\": \"Title of the article\",\n", + " \"dataType\": [\"string\"]\n", + " },\n", + " {\n", + " \"name\": \"content\",\n", + " \"description\": \"Contents of the article\",\n", + " \"dataType\": [\"text\"],\n", + " \"moduleConfig\": { \"text2vec-openai\": { \"skip\": True } }\n", + " }]\n", + "}\n", + "\n", + "# add the Article schema\n", + "client.schema.create_class(article_schema)\n", + "\n", + "# get the schema to make sure it worked\n", + "client.schema.get()" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "ea838e7d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + 
} + ], + "source": [ + "### Step 1 - configure Weaviate Batch, which optimizes CRUD operations in bulk\n", + "# - starting batch size of 100\n", + "# - dynamically increase/decrease based on performance\n", + "# - add timeout retries if something goes wrong\n", + "\n", + "client.batch.configure(\n", + " batch_size=100,\n", + " dynamic=True,\n", + " timeout_retries=3,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "b4c967ec", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Uploading data with vectors to Article schema..\n", + "Import 0 / 25000 \n", + "Import 100 / 25000 \n", + "Import 200 / 25000 \n", + "Import 300 / 25000 \n", + "Import 400 / 25000 \n", + "Import 500 / 25000 \n", + "Import 600 / 25000 \n", + "Import 700 / 25000 \n", + "Import 800 / 25000 \n", + "Import 900 / 25000 \n", + "Import 1000 / 25000 \n", + "Import 1100 / 25000 \n", + "Import 1200 / 25000 \n", + "Import 1300 / 25000 \n", + "Import 1400 / 25000 \n", + "Import 1500 / 25000 \n", + "Import 1600 / 25000 \n", + "Import 1700 / 25000 \n", + "Import 1800 / 25000 \n", + "Import 1900 / 25000 \n", + "Import 2000 / 25000 \n", + "Import 2100 / 25000 \n", + "Import 2200 / 25000 \n", + "Import 2300 / 25000 \n", + "Import 2400 / 25000 \n", + "Import 2500 / 25000 \n", + "Import 2600 / 25000 \n", + "Import 2700 / 25000 \n", + "Import 2800 / 25000 \n", + "Import 2900 / 25000 \n", + "Import 3000 / 25000 \n", + "Import 3100 / 25000 \n", + "Import 3200 / 25000 \n", + "Import 3300 / 25000 \n", + "Import 3400 / 25000 \n", + "Import 3500 / 25000 \n", + "Import 3600 / 25000 \n", + "Import 3700 / 25000 \n", + "Import 3800 / 25000 \n", + "Import 3900 / 25000 \n", + "Import 4000 / 25000 \n", + "Import 4100 / 25000 \n", + "Import 4200 / 25000 \n", + "Import 4300 / 25000 \n", + "Import 4400 / 25000 \n", + "Import 4500 / 25000 \n", + "Import 4600 / 25000 \n", + "Import 4700 / 25000 \n", + "Import 4800 / 25000 \n", + "Import 4900 / 25000 \n", + "Import 5000 / 25000 \n", + "Import 5100 / 25000 \n", + "Import 5200 / 25000 \n", + "Import 5300 / 25000 \n", + "Import 5400 / 25000 \n", + "Import 5500 / 25000 \n", + "Import 5600 / 25000 \n", + "Import 5700 / 25000 \n", + "Import 5800 / 25000 \n", + "Import 5900 / 25000 \n", + "Import 6000 / 25000 \n", + "Import 6100 / 25000 \n", + "Import 6200 / 25000 \n", + "Import 6300 / 25000 \n", + "Import 6400 / 25000 \n", + "Import 6500 / 25000 \n", + "Import 6600 / 25000 \n", + "Import 6700 / 25000 \n", + "Import 6800 / 25000 \n", + "Import 6900 / 25000 \n", + "Import 7000 / 25000 \n", + "Import 7100 / 25000 \n", + "Import 7200 / 25000 \n", + "Import 7300 / 25000 \n", + "Import 7400 / 25000 \n", + "Import 7500 / 25000 \n", + "Import 7600 / 25000 \n", + "Import 7700 / 25000 \n", + "Import 7800 / 25000 \n", + "Import 7900 / 25000 \n", + "Import 8000 / 25000 \n", + "Import 8100 / 25000 \n", + "Import 8200 / 25000 \n", + "Import 8300 / 25000 \n", + "Import 8400 / 25000 \n", + "Import 8500 / 25000 \n", + "Import 8600 / 25000 \n", + "Import 8700 / 25000 \n", + "Import 8800 / 25000 \n", + "Import 8900 / 25000 \n", + "Import 9000 / 25000 \n", + "Import 9100 / 25000 \n", + "Import 9200 / 25000 \n", + "Import 9300 / 25000 \n", + "Import 9400 / 25000 \n", + "Import 9500 / 25000 \n", + "Import 9600 / 25000 \n", + "Import 9700 / 25000 \n", + "Import 9800 / 25000 \n", + "Import 9900 / 25000 \n", + "Import 10000 / 25000 \n", + "Import 10100 / 25000 \n", + "Import 10200 / 25000 \n", + "Import 10300 / 25000 \n", + "Import 10400 / 25000 \n", + "Import 
10500 / 25000 \n", + "Import 10600 / 25000 \n", + "Import 10700 / 25000 \n", + "Import 10800 / 25000 \n", + "Import 10900 / 25000 \n", + "Import 11000 / 25000 \n", + "Import 11100 / 25000 \n", + "Import 11200 / 25000 \n", + "Import 11300 / 25000 \n", + "Import 11400 / 25000 \n", + "Import 11500 / 25000 \n", + "Import 11600 / 25000 \n", + "Import 11700 / 25000 \n", + "Import 11800 / 25000 \n", + "Import 11900 / 25000 \n", + "Import 12000 / 25000 \n", + "Import 12100 / 25000 \n", + "Import 12200 / 25000 \n", + "Import 12300 / 25000 \n", + "Import 12400 / 25000 \n", + "Import 12500 / 25000 \n", + "Import 12600 / 25000 \n", + "Import 12700 / 25000 \n", + "Import 12800 / 25000 \n", + "Import 12900 / 25000 \n", + "Import 13000 / 25000 \n", + "Import 13100 / 25000 \n", + "Import 13200 / 25000 \n", + "Import 13300 / 25000 \n", + "Import 13400 / 25000 \n", + "Import 13500 / 25000 \n", + "Import 13600 / 25000 \n", + "Import 13700 / 25000 \n", + "Import 13800 / 25000 \n", + "Import 13900 / 25000 \n", + "Import 14000 / 25000 \n", + "Import 14100 / 25000 \n", + "Import 14200 / 25000 \n", + "Import 14300 / 25000 \n", + "Import 14400 / 25000 \n", + "Import 14500 / 25000 \n", + "Import 14600 / 25000 \n", + "Import 14700 / 25000 \n", + "Import 14800 / 25000 \n", + "Import 14900 / 25000 \n", + "Import 15000 / 25000 \n", + "Import 15100 / 25000 \n", + "Import 15200 / 25000 \n", + "Import 15300 / 25000 \n", + "Import 15400 / 25000 \n", + "Import 15500 / 25000 \n", + "Import 15600 / 25000 \n", + "Import 15700 / 25000 \n", + "Import 15800 / 25000 \n", + "Import 15900 / 25000 \n", + "Import 16000 / 25000 \n", + "Import 16100 / 25000 \n", + "Import 16200 / 25000 \n", + "Import 16300 / 25000 \n", + "Import 16400 / 25000 \n", + "Import 16500 / 25000 \n", + "Import 16600 / 25000 \n", + "Import 16700 / 25000 \n", + "Import 16800 / 25000 \n", + "Import 16900 / 25000 \n", + "Import 17000 / 25000 \n", + "Import 17100 / 25000 \n", + "Import 17200 / 25000 \n", + "Import 17300 / 25000 \n", + "Import 17400 / 25000 \n", + "Import 17500 / 25000 \n", + "Import 17600 / 25000 \n", + "Import 17700 / 25000 \n", + "Import 17800 / 25000 \n", + "Import 17900 / 25000 \n", + "Import 18000 / 25000 \n", + "Import 18100 / 25000 \n", + "Import 18200 / 25000 \n", + "Import 18300 / 25000 \n", + "Import 18400 / 25000 \n", + "Import 18500 / 25000 \n", + "Import 18600 / 25000 \n", + "Import 18700 / 25000 \n", + "Import 18800 / 25000 \n", + "Import 18900 / 25000 \n", + "Import 19000 / 25000 \n", + "Import 19100 / 25000 \n", + "Import 19200 / 25000 \n", + "Import 19300 / 25000 \n", + "Import 19400 / 25000 \n", + "Import 19500 / 25000 \n", + "Import 19600 / 25000 \n", + "Import 19700 / 25000 \n", + "Import 19800 / 25000 \n", + "Import 19900 / 25000 \n", + "Import 20000 / 25000 \n", + "Import 20100 / 25000 \n", + "Import 20200 / 25000 \n", + "Import 20300 / 25000 \n", + "Import 20400 / 25000 \n", + "Import 20500 / 25000 \n", + "Import 20600 / 25000 \n", + "Import 20700 / 25000 \n", + "Import 20800 / 25000 \n", + "Import 20900 / 25000 \n", + "Import 21000 / 25000 \n", + "Import 21100 / 25000 \n", + "Import 21200 / 25000 \n", + "Import 21300 / 25000 \n", + "Import 21400 / 25000 \n", + "Import 21500 / 25000 \n", + "Import 21600 / 25000 \n", + "Import 21700 / 25000 \n", + "Import 21800 / 25000 \n", + "Import 21900 / 25000 \n", + "Import 22000 / 25000 \n", + "Import 22100 / 25000 \n", + "Import 22200 / 25000 \n", + "Import 22300 / 25000 \n", + "Import 22400 / 25000 \n", + "Import 22500 / 25000 \n", + "Import 22600 / 25000 \n", + "Import 22700 / 25000 
\n", + "Import 22800 / 25000 \n", + "Import 22900 / 25000 \n", + "Import 23000 / 25000 \n", + "Import 23100 / 25000 \n", + "Import 23200 / 25000 \n", + "Import 23300 / 25000 \n", + "Import 23400 / 25000 \n", + "Import 23500 / 25000 \n", + "Import 23600 / 25000 \n", + "Import 23700 / 25000 \n", + "Import 23800 / 25000 \n", + "Import 23900 / 25000 \n", + "Import 24000 / 25000 \n", + "Import 24100 / 25000 \n", + "Import 24200 / 25000 \n", + "Import 24300 / 25000 \n", + "Import 24400 / 25000 \n", + "Import 24500 / 25000 \n", + "Import 24600 / 25000 \n", + "Import 24700 / 25000 \n", + "Import 24800 / 25000 \n", + "Import 24900 / 25000 \n", + "Importing (25000) Articles complete\n" + ] + } + ], + "source": [ + "### Step 2 - import data\n", + "\n", + "print(\"Uploading data with vectors to Article schema..\")\n", + "\n", + "counter=0\n", + "\n", + "with client.batch as batch:\n", + " for k,v in article_df.iterrows():\n", + " \n", + " # print update message every 100 objects \n", + " if (counter %100 == 0):\n", + " print(f\"Import {counter} / {len(article_df)} \")\n", + " \n", + " properties = {\n", + " \"title\": v[\"title\"],\n", + " \"content\": v[\"text\"]\n", + " }\n", + " \n", + " vector = v[\"title_vector\"]\n", + " \n", + " batch.add_data_object(properties, \"Article\", None, vector)\n", + " counter = counter+1\n", + "\n", + "print(f\"Importing ({len(article_df)}) Articles complete\") " + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "f826e1ad", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Object count: [{'meta': {'count': 25000}}]\n" + ] + } + ], + "source": [ + "# Test that all data has loaded – get object count\n", + "result = (\n", + " client.query.aggregate(\"Article\")\n", + " .with_fields(\"meta { count }\")\n", + " .do()\n", + ")\n", + "print(\"Object count: \", result[\"data\"][\"Aggregate\"][\"Article\"])" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "5c09d483", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "000393f2-1182-4e3d-abcf-4217eda64be0\n", + "Lago d'Origlio\n", + "Lago d'Origlio is a lake in the municipality of Origlio, in Ticino, Switzerland.\n", + "\n", + "Lakes of Ticino\n" + ] + } + ], + "source": [ + "# Test one article has worked by checking one object\n", + "test_article = (\n", + " client.query\n", + " .get(\"Article\", [\"title\", \"content\", \"_additional {id}\"])\n", + " .with_limit(1)\n", + " .do()\n", + ")[\"data\"][\"Get\"][\"Article\"][0]\n", + "\n", + "print(test_article[\"_additional\"][\"id\"])\n", + "print(test_article[\"title\"])\n", + "print(test_article[\"content\"])" + ] + }, + { + "cell_type": "markdown", + "id": "46050ca9", + "metadata": {}, + "source": [ + "### Search data\n", + "\n", + "As above, we'll fire some queries at our new Index and get back results based on the closeness to our existing vectors" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "add222d7", + "metadata": {}, + "outputs": [], + "source": [ + "def query_weaviate(query, collection_name, top_k=20):\n", + "\n", + " # Creates embedding vector from user query\n", + " embedded_query = openai.Embedding.create(\n", + " input=query,\n", + " model=EMBEDDING_MODEL,\n", + " )[\"data\"][0]['embedding']\n", + " \n", + " near_vector = {\"vector\": embedded_query}\n", + "\n", + " # Queries input schema with vectorised user query\n", + " query_result = (\n", + " client.query\n", + " .get(collection_name, [\"title\", 
\"content\", \"_additional {certainty distance}\"])\n", + " .with_near_vector(near_vector)\n", + " .with_limit(top_k)\n", + " .do()\n", + " )\n", + " \n", + " return query_result" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "c888aa4b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1. Museum of Modern Art (Certainty: 0.938) (Distance: 0.125)\n", + "2. Western Europe (Certainty: 0.934) (Distance: 0.133)\n", + "3. Renaissance art (Certainty: 0.932) (Distance: 0.136)\n", + "4. Pop art (Certainty: 0.93) (Distance: 0.14)\n", + "5. Northern Europe (Certainty: 0.927) (Distance: 0.145)\n", + "6. Hellenistic art (Certainty: 0.926) (Distance: 0.147)\n", + "7. Modernist literature (Certainty: 0.924) (Distance: 0.153)\n", + "8. Art film (Certainty: 0.922) (Distance: 0.157)\n", + "9. Central Europe (Certainty: 0.921) (Distance: 0.157)\n", + "10. European (Certainty: 0.921) (Distance: 0.159)\n", + "11. Art (Certainty: 0.921) (Distance: 0.159)\n", + "12. Byzantine art (Certainty: 0.92) (Distance: 0.159)\n", + "13. Postmodernism (Certainty: 0.92) (Distance: 0.16)\n", + "14. Eastern Europe (Certainty: 0.92) (Distance: 0.161)\n", + "15. Europe (Certainty: 0.919) (Distance: 0.161)\n", + "16. Cubism (Certainty: 0.919) (Distance: 0.161)\n", + "17. Impressionism (Certainty: 0.919) (Distance: 0.162)\n", + "18. Bauhaus (Certainty: 0.919) (Distance: 0.162)\n", + "19. Expressionism (Certainty: 0.918) (Distance: 0.163)\n", + "20. Surrealism (Certainty: 0.918) (Distance: 0.163)\n" + ] + } + ], + "source": [ + "query_result = query_weaviate(\"modern art in Europe\", \"Article\")\n", + "counter = 0\n", + "for article in query_result[\"data\"][\"Get\"][\"Article\"]:\n", + " counter += 1\n", + " print(f\"{counter}. { article['title']} (Certainty: {round(article['_additional']['certainty'],3) }) (Distance: {round(article['_additional']['distance'],3) })\")" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "c54cd8e9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1. Historic Scotland (Score: 0.946)\n", + "2. First War of Scottish Independence (Score: 0.946)\n", + "3. Battle of Bannockburn (Score: 0.946)\n", + "4. Wars of Scottish Independence (Score: 0.944)\n", + "5. Second War of Scottish Independence (Score: 0.94)\n", + "6. List of Scottish monarchs (Score: 0.937)\n", + "7. Scottish Borders (Score: 0.932)\n", + "8. Braveheart (Score: 0.929)\n", + "9. John of Scotland (Score: 0.929)\n", + "10. Guardians of Scotland (Score: 0.926)\n", + "11. Holyrood Abbey (Score: 0.925)\n", + "12. Scottish (Score: 0.925)\n", + "13. Scots (Score: 0.925)\n", + "14. Robert I of Scotland (Score: 0.924)\n", + "15. Scottish people (Score: 0.924)\n", + "16. Edinburgh Castle (Score: 0.924)\n", + "17. Alexander I of Scotland (Score: 0.924)\n", + "18. Robert Burns (Score: 0.924)\n", + "19. Battle of Bosworth Field (Score: 0.922)\n", + "20. David II of Scotland (Score: 0.922)\n" + ] + } + ], + "source": [ + "query_result = query_weaviate(\"Famous battles in Scottish history\", \"Article\")\n", + "counter = 0\n", + "for article in query_result[\"data\"][\"Get\"][\"Article\"]:\n", + " counter += 1\n", + " print(f\"{counter}. 
+ { + "cell_type": "markdown", + "id": "220b3e11", + "metadata": {}, + "source": [ + "### Let Weaviate handle vector embeddings\n", + "\n", + "Weaviate has a [built-in module for OpenAI](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai), which takes care of the steps required to generate a vector embedding for your queries and any CRUD operations.\n", + "\n", + "This allows you to run a vector query with the `with_near_text` filter, which uses your `OPENAI_API_KEY`. A sketch of the client configuration this relies on follows the queries below." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "9425c882", + "metadata": {}, + "outputs": [], + "source": [ + "def near_text_weaviate(query, collection_name):\n", + " \n", + " nearText = {\n", + " \"concepts\": [query],\n", + " \"distance\": 0.7,\n", + " }\n", + "\n", + " properties = [\n", + " \"title\", \"content\",\n", + " \"_additional {certainty distance}\"\n", + " ]\n", + "\n", + " query_result = (\n", + " client.query\n", + " .get(collection_name, properties)\n", + " .with_near_text(nearText)\n", + " .with_limit(20)\n", + " .do()\n", + " )[\"data\"][\"Get\"][collection_name]\n", + " \n", + " print(f\"Objects returned: {len(query_result)}\")\n", + " \n", + " return query_result" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "501a16f7", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Objects returned: 20\n", + "1. Museum of Modern Art (Certainty: 0.938) (Distance: 0.125)\n", + "2. Western Europe (Certainty: 0.934) (Distance: 0.133)\n", + "3. Renaissance art (Certainty: 0.932) (Distance: 0.136)\n", + "4. Pop art (Certainty: 0.93) (Distance: 0.14)\n", + "5. Northern Europe (Certainty: 0.927) (Distance: 0.145)\n", + "6. Hellenistic art (Certainty: 0.926) (Distance: 0.147)\n", + "7. Modernist literature (Certainty: 0.923) (Distance: 0.153)\n", + "8. Art film (Certainty: 0.922) (Distance: 0.157)\n", + "9. Central Europe (Certainty: 0.921) (Distance: 0.157)\n", + "10. European (Certainty: 0.921) (Distance: 0.159)\n", + "11. Art (Certainty: 0.921) (Distance: 0.159)\n", + "12. Byzantine art (Certainty: 0.92) (Distance: 0.159)\n", + "13. Postmodernism (Certainty: 0.92) (Distance: 0.16)\n", + "14. Eastern Europe (Certainty: 0.92) (Distance: 0.161)\n", + "15. Europe (Certainty: 0.919) (Distance: 0.161)\n", + "16. Cubism (Certainty: 0.919) (Distance: 0.161)\n", + "17. Impressionism (Certainty: 0.919) (Distance: 0.162)\n", + "18. Bauhaus (Certainty: 0.919) (Distance: 0.162)\n", + "19. Surrealism (Certainty: 0.918) (Distance: 0.163)\n", + "20. Expressionism (Certainty: 0.918) (Distance: 0.163)\n" + ] + } + ], + "source": [ + "query_result = near_text_weaviate(\"modern art in Europe\",\"Article\")\n", + "counter = 0\n", + "for article in query_result:\n", + " counter += 1\n", + " print(f\"{counter}. {article['title']} (Certainty: {round(article['_additional']['certainty'],3)}) (Distance: {round(article['_additional']['distance'],3)})\")" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "839b26df", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Objects returned: 20\n", + "1. Historic Scotland (Certainty: 0.946) (Distance: 0.107)\n", + "2. First War of Scottish Independence (Certainty: 0.946) (Distance: 0.108)\n", + "3. Battle of Bannockburn (Certainty: 0.946) (Distance: 0.109)\n",
+ "4. Wars of Scottish Independence (Certainty: 0.944) (Distance: 0.111)\n", + "5. Second War of Scottish Independence (Certainty: 0.94) (Distance: 0.121)\n", + "6. List of Scottish monarchs (Certainty: 0.937) (Distance: 0.127)\n", + "7. Scottish Borders (Certainty: 0.932) (Distance: 0.137)\n", + "8. Braveheart (Certainty: 0.929) (Distance: 0.141)\n", + "9. John of Scotland (Certainty: 0.929) (Distance: 0.142)\n", + "10. Guardians of Scotland (Certainty: 0.926) (Distance: 0.148)\n", + "11. Holyrood Abbey (Certainty: 0.925) (Distance: 0.15)\n", + "12. Scottish (Certainty: 0.925) (Distance: 0.15)\n", + "13. Scots (Certainty: 0.925) (Distance: 0.15)\n", + "14. Robert I of Scotland (Certainty: 0.924) (Distance: 0.151)\n", + "15. Scottish people (Certainty: 0.924) (Distance: 0.152)\n", + "16. Edinburgh Castle (Certainty: 0.924) (Distance: 0.153)\n", + "17. Alexander I of Scotland (Certainty: 0.924) (Distance: 0.153)\n", + "18. Robert Burns (Certainty: 0.924) (Distance: 0.153)\n", + "19. Battle of Bosworth Field (Certainty: 0.922) (Distance: 0.155)\n", + "20. David II of Scotland (Certainty: 0.922) (Distance: 0.157)\n" + ] + } + ], + "source": [ + "query_result = near_text_weaviate(\"Famous battles in Scottish history\",\"Article\")\n", + "counter = 0\n", + "for article in query_result:\n", + " counter += 1\n", + " print(f\"{counter}. {article['title']} (Certainty: {round(article['_additional']['certainty'],3)}) (Distance: {round(article['_additional']['distance'],3)})\")" + ] + },
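+ { + "cell_type": "markdown", + "id": "near-text-client-config-note", + "metadata": {}, + "source": [ + "For `with_near_text` to work, your Weaviate instance needs the `text2vec-openai` module enabled and the client needs access to your OpenAI API key. The cell below is a minimal sketch of how such a client could be constructed; the URL is a placeholder, and this notebook's client was already created in the setup section above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "near-text-client-config-sketch", + "metadata": {}, + "outputs": [], + "source": [ + "# Minimal sketch: pass your OpenAI key to Weaviate so the text2vec-openai\n", + "# module can embed near_text queries. The URL below is a placeholder.\n", + "import os\n", + "\n", + "import weaviate\n", + "\n", + "sketch_client = weaviate.Client(\n", + " url=\"http://localhost:8080\", # replace with your own Weaviate endpoint\n", + " additional_headers={\n", + " \"X-OpenAI-Api-Key\": os.getenv(\"OPENAI_API_KEY\"),\n", + " },\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0119d87a", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "vector_db_split", + "language": "python", + "name": "vector_db_split" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.11" + }, + "vscode": { + "interpreter": { + "hash": "fd16a328ca3d68029457069b79cb0b38eb39a0f5ccc4fe4473d3047707df8207" + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}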