From b2d5c21969a90ff13cd818921863e0f6d796aa91 Mon Sep 17 00:00:00 2001 From: Filip Haltmayer Date: Wed, 15 Feb 2023 11:49:27 -0800 Subject: [PATCH] Add Milvus vector db Signed-off-by: Filip Haltmayer --- ...ctor_databases_for_embeddings_search.ipynb | 186 ++++++++++++++++++ .../milvus/docker-compose.yaml | 52 +++++ 2 files changed, 238 insertions(+) create mode 100644 examples/vector_databases/milvus/docker-compose.yaml diff --git a/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb b/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb index 28fc3a37..da201f20 100644 --- a/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb +++ b/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb @@ -31,6 +31,10 @@ " - *Setup*: Here we'll set up the Python client for Weaviate. For more details go [here](https://weaviate.io/developers/weaviate/current/client-libraries/python.html)\n", " - *Index Data*: We'll create an index with __title__ search vectors in it\n", " - *Search Data*: We'll run a few searches to confirm it works\n", + "- **Milvus**\n", + " - *Setup*: Here we'll set up the Python client for Milvus. For more details go [here](https://milvus.io/docs)\n", + " - *Index Data* We'll create a collection and index it for both __titles__ and __content__\n", + " - *Search Data*: We'll test out both collections with search queries to confirm it works\n", "- **Qdrant**\n", " - *Setup*: Here we'll set up the Python client for Qdrant. For more details go [here](https://github.com/qdrant/qdrant_client)\n", " - *Index Data*: We'll create a collection with vectors for __titles__ and __content__\n", @@ -64,6 +68,7 @@ "# We'll need to install the clients for all vector databases\n", "!pip install pinecone-client\n", "!pip install weaviate-client\n", + "!pip install pymilvus\n", "!pip install qdrant-client\n", "!pip install redis\n", "\n", @@ -97,6 +102,9 @@ "# Weaviate's client library for Python\n", "import weaviate\n", "\n", + "# Milvus's client library for Python\n", + "import pymilvus\n", + "\n", "# Qdrant's client library for Python\n", "import qdrant_client\n", "\n", @@ -942,6 +950,184 @@ " print(f\"{counter}. { article['title']} (Certainty: {round(article['_additional']['certainty'],3) }) (Distance: {round(article['_additional']['distance'],3) })\")" ] }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "4dc3a0c0", + "metadata": {}, + "source": [ + "## Milvus\n", + "\n", + "The next vector database we will take a look at is **Milvus**, which also offers a SaaS option like the previous two, as well as self-hosted options using either helm or docker-compose. 
Sticking to the idea of open source, we will show our self-hosted example here.\n", + "\n", + "In this example we will:\n", + "- Set up a local docker-compose based deployment\n", + "- Create the title and content collections\n", + "- Store our data\n", + "- Test out our system with real world searches" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "fe4914e9", + "metadata": {}, + "source": [ + "### Setup\n", + "\n", + "There are many ways to run Milvus (take a look [here](https://milvus.io/docs/install_cluster-milvusoperator.md)), but for now we will stick to a simple standalone Milvus instance with docker-compose.\n", + "\n", + "A simple docker-file can be found at `./milvus/docker-compose.yaml` and can be run using `docker-compose up` if within that mentioned directory or using `docker-compose -f path/to/file up`\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8e10f2ed", + "metadata": {}, + "outputs": [], + "source": [ + "from pymilvus import connections\n", + "\n", + "connections.connect(host='localhost', port=19530) # Local instance defaults to port 19530" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "64ffed22", + "metadata": {}, + "source": [ + "### Index data\n", + "\n", + "In Milvus data is stored in the form of collections, with each collection being able to store the vectors and any attributes that come with them.\n", + "\n", + "In this case we'll create a collection called **articles** which contains the url, title, text and the content_embedding.\n", + "\n", + "In addition to this we will also create an index on the content embedding. Milvus allows for the use of many SOTA indexing methods, but in this case, we are going to use HNSW.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bfabc3db", + "metadata": {}, + "outputs": [], + "source": [ + "from pymilvus import utility, Collection, FieldSchema, CollectionSchema, DataType\n", + "\n", + "# Remove the collection if it already exists.\n", + "if utility.has_collection('articles'):\n", + " utility.drop_collection('articles')\n", + "\n", + "fields = [\n", + " FieldSchema(name='id', dtype=DataType.INT64),\n", + " FieldSchema(name='url', dtype=DataType.VARCHAR, max_length=1000), # Strings have to specify a max length [1, 65535]\n", + " FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=1000),\n", + " FieldSchema(name='text', dtype=DataType.VARCHAR, max_length=1000),\n", + " FieldSchema(name='content_vector', dtype=DataType.FLOAT_VECTOR, dim=len(article_df['content_vector'][0])),\n", + " FieldSchema(name='vector_id', dtype=DataType.INT64, is_primary=True, auto_id=False),\n", + "]\n", + "\n", + "col_schema = CollectionSchema(fields)\n", + "\n", + "col = Collection('articles', col_schema)\n", + "\n", + "# Using a basic HNSW index for this example\n", + "index = {\n", + " 'index_type': 'HNSW',\n", + " 'metric_type': 'L2',\n", + " 'params': {\n", + " 'M': 8,\n", + " 'efConstruction': 64\n", + " },\n", + "}\n", + "\n", + "col.create_index('content_vector', index)\n", + "col.load()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f51aaed1", + "metadata": {}, + "outputs": [], + "source": [ + "# Using the above provided batching function from Pinecone\n", + "def to_batches(df: pd.DataFrame, batch_size: int) -> Iterator[pd.DataFrame]:\n", + " splits = df.shape[0] / batch_size\n", + " if splits <= 1:\n", + " yield df\n", + " else:\n", + " for chunk in np.array_split(df, splits):\n", + " yield 
chunk\n", + "\n", + "# Since we are storing the text within Milvus we need to clip any that are over our set limit.\n", + "# We can also set the limit to be higher, but that slows down the search requests as more info \n", + "# needs to be sent back.\n", + "def shorten_text(text):\n", + " if len(text) >= 996:\n", + " return text[:996] + '...'\n", + " else:\n", + " return text\n", + "\n", + "for batch in to_batches(article_df, 1000):\n", + " batch = batch.drop(columns = ['title_vector'])\n", + " batch['text'] = batch.text.apply(shorten_text)\n", + " # Due to the vector_id being converted to a string for compatiblity for other vector dbs,\n", + " # we want to swap it back to its original form.\n", + " batch['vector_id'] = batch.vector_id.apply(int)\n", + " col.insert(batch) \n", + "\n", + "col.flush() " + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "1f68a790", + "metadata": {}, + "source": [ + "# Search\n", + "Once the data is inserted into Milvus we can perform searches. For this example the search function takes one argument, top_k, how many closest matches to return. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "02f21251", + "metadata": {}, + "outputs": [], + "source": [ + "def query_article(query, top_k=5):\n", + " # Generate the embedding with openai\n", + " embedded_query = openai.Embedding.create(\n", + " input=query,\n", + " model=EMBEDDING_MODEL,\n", + " )[\"data\"][0]['embedding']\n", + "\n", + " # Using some basic params for HNSW\n", + " search_param = {\n", + " 'ef': max(64, top_k)\n", + " }\n", + "\n", + " # Perform the search.\n", + " res = col.search([embedded_query], 'content_vector', search_param, output_fields = ['title', 'url'], limit = top_k)\n", + "\n", + " ret = []\n", + " for hit in res[0]:\n", + " # Get the id, distance, and title for the results\n", + " ret.append({'vector_id': hit.id, 'distance': hit.score, 'title': hit.entity.get('title'), 'url': hit.entity.get('url')})\n", + " return ret\n", + " \n", + "\n", + "for x in query_article('fastest plane ever made', 3):\n", + " print(x.items())\n" + ] + }, { "cell_type": "markdown", "id": "9cfaed9d", diff --git a/examples/vector_databases/milvus/docker-compose.yaml b/examples/vector_databases/milvus/docker-compose.yaml new file mode 100644 index 00000000..6ffc9dcc --- /dev/null +++ b/examples/vector_databases/milvus/docker-compose.yaml @@ -0,0 +1,52 @@ +version: '3.5' + +services: + etcd: + container_name: milvus-etcd + image: quay.io/coreos/etcd:v3.5.5 + environment: + - ETCD_AUTO_COMPACTION_MODE=revision + - ETCD_AUTO_COMPACTION_RETENTION=1000 + - ETCD_QUOTA_BACKEND_BYTES=4294967296 + - ETCD_SNAPSHOT_COUNT=50000 + volumes: + - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd + command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd + + minio: + container_name: milvus-minio + image: minio/minio:RELEASE.2022-03-17T06-34-49Z + environment: + MINIO_ACCESS_KEY: minioadmin + MINIO_SECRET_KEY: minioadmin + ports: + - "9001:9001" + - "9000:9000" + volumes: + - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data + command: minio server /minio_data --console-address ":9001" + healthcheck: + test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"] + interval: 30s + + retries: 3 + + standalone: + container_name: milvus-standalone + image: milvusdb/milvus:latest + command: ["milvus", "run", "standalone"] + environment: + ETCD_ENDPOINTS: etcd:2379 + MINIO_ADDRESS: minio:9000 + volumes: + - 
${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus + ports: + - "19530:19530" + - "9091:9091" + depends_on: + - "etcd" + - "minio" + +networks: + default: + name: milvus
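Once the stack defined above is running (for example `docker-compose up -d` from examples/vector_databases/milvus/), a quick connectivity check can be run before executing the notebook cells. This is a minimal sketch, not part of the patch, assuming pymilvus is installed and the default localhost:19530 mapping exposed above:

from pymilvus import connections, utility

# Connect to the standalone Milvus started by the compose file
# (assumes the default host/port exposed above: localhost:19530).
connections.connect(alias='default', host='localhost', port=19530)

# A successful connection lets us list what the instance currently holds;
# 'articles' will appear here once the notebook's indexing cells have run.
print(utility.list_collections())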
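The final Milvus cell in the notebook prints the raw result dictionaries. If output in the same ranked style as the Weaviate cell earlier in the notebook is preferred, a small formatting loop along these lines works with the query_article function defined above (illustrative only; the query string is the same example used in the patch):

# Print the top matches in a ranked, human-readable form.
for counter, article in enumerate(query_article('fastest plane ever made', top_k=3), start=1):
    print(f"{counter}. {article['title']} (Distance: {round(article['distance'], 3)})")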
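When finished experimenting, the collection can be released from memory and, if desired, dropped entirely before stopping the containers with `docker-compose down`. A possible cleanup sequence, again only a sketch using the objects created in the notebook:

# Release the loaded collection and remove it from the instance.
col.release()
utility.drop_collection('articles')

# Close the client connection opened earlier.
connections.disconnect('default')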