Merge pull request #81 from kacperlukawski/qdrant-example

Add Qdrant as another example of vector database
2 years ago · 2fed004763
parent 225b9177c8 5ee5fecb76
commit 2fed004763
2 changed files with 318 additions and 1 deletions
--- a/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb
+++ b/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb
@@ -69,6 +69,9 @@
    "# Weaviate's client library for Python\n",
    "import weaviate\n",
    "\n",
    "# Qdrant's client library for Python\n",
    "import qdrant_client\n",
    "\n",
    "# I've set this to our new embeddings model, this can be changed to the embedding model of your choice\n",
    "EMBEDDING_MODEL = \"text-embedding-ada-002\"\n",
    "\n",
@@ -1048,7 +1051,313 @@
  },
  {
   "cell_type": "markdown",
-   "id": "ad74202e",
+   "metadata": {},
   "source": [
    "## Qdrant\n",
    "\n",
    "The last vector database we'll consider is **[Qdrant](https://qdrant.tech/)**. This is a high-performance vector search database written in Rust. It offers both on-premise and cloud versions, but for the purposes of this example we're going to use the local deployment mode.\n",
    "\n",
    "Setting everything up will require:\n",
    "- Spinning up a local instance of Qdrant\n",
    "- Configuring the collection and storing the data in it\n",
    "- Trying it out with some queries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setup\n",
    "\n",
    "For the local deployment, we are going to use Docker, following the Qdrant documentation: https://qdrant.tech/documentation/quick_start/. Qdrant requires just a single container, but an example docker-compose.yaml file is available at `./qdrant/docker-compose.yaml` in this repo.\n",
    "\n",
    "You can start a Qdrant instance locally by navigating to that directory and running `docker-compose up -d`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-01-18T09:28:38.928205Z",
     "start_time": "2023-01-18T09:28:38.913987Z"
    }
   },
   "outputs": [],
   "source": [
    "qdrant = qdrant_client.QdrantClient(host='localhost', prefer_grpc=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-01-18T09:29:19.806639Z",
     "start_time": "2023-01-18T09:29:19.727897Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "CollectionsResponse(collections=[])"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "qdrant.get_collections()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Index data\n",
    "\n",
    "Qdrant stores data in __collections__, where each object is described by at least one vector and may contain additional metadata called the __payload__. Our collection will be called **Articles**, and each object will be described by both **title** and **content** vectors.\n",
    "\n",
    "We're going to use the official [qdrant-client](https://github.com/qdrant/qdrant_client) package, which has all the utility methods already built in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-01-18T09:29:22.530121Z",
     "start_time": "2023-01-18T09:29:22.524604Z"
    }
   },
   "outputs": [],
   "source": [
    "from qdrant_client.http import models as rest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-01-18T09:31:14.413334Z",
     "start_time": "2023-01-18T09:31:13.619079Z"
    }
   },
   "outputs": [],
   "source": [
    "vector_size = len(article_df['content_vector'][0])\n",
    "\n",
    "qdrant.recreate_collection(\n",
    "    collection_name='Articles',\n",
    "    vectors_config={\n",
    "        'title': rest.VectorParams(\n",
    "            distance=rest.Distance.COSINE,\n",
    "            size=vector_size,\n",
    "        ),\n",
    "        'content': rest.VectorParams(\n",
    "            distance=rest.Distance.COSINE,\n",
    "            size=vector_size,\n",
    "        ),\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-01-18T09:36:28.597535Z",
     "start_time": "2023-01-18T09:36:24.108867Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "qdrant.upsert(\n",
    "    collection_name='Articles',\n",
    "    points=[\n",
    "        rest.PointStruct(\n",
    "            id=k,\n",
    "            vector={\n",
    "                'title': v['title_vector'],\n",
    "                'content': v['content_vector'],\n",
    "            },\n",
    "            payload=v.to_dict(),\n",
    "        )\n",
    "        for k, v in article_df.iterrows()\n",
    "    ],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-01-18T09:58:13.825886Z",
     "start_time": "2023-01-18T09:58:13.816248Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "CountResult(count=250)"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check the collection size to make sure all the points have been stored\n",
    "qdrant.count(collection_name='Articles')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Search Data\n",
    "\n",
    "Once the data is in Qdrant, we can start querying the collection for the closest vectors. We may provide an additional parameter `vector_name` to switch from title-based to content-based search."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-01-18T09:50:35.265647Z",
     "start_time": "2023-01-18T09:50:35.256065Z"
    }
   },
   "outputs": [],
   "source": [
    "def query_qdrant(query, collection_name, vector_name='title', top_k=20):\n",
    "\n",
    "    # Creates embedding vector from user query\n",
    "    embedded_query = openai.Embedding.create(\n",
    "        input=query,\n",
    "        model=EMBEDDING_MODEL,\n",
    "    )['data'][0]['embedding']\n",
    "    \n",
    "    query_results = qdrant.search(\n",
    "        collection_name=collection_name,\n",
    "        query_vector=(\n",
    "            vector_name, embedded_query\n",
    "        ),\n",
    "        limit=top_k,\n",
    "    )\n",
    "    \n",
    "    return query_results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-01-18T09:50:46.545145Z",
     "start_time": "2023-01-18T09:50:35.711020Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1. Art (Score: 0.841)\n",
      "2. Europe (Score: 0.839)\n",
      "3. Italy (Score: 0.816)\n",
      "4. Architecture (Score: 0.815)\n",
      "5. Madrid (Score: 0.815)\n",
      "6. France (Score: 0.812)\n",
      "7. Belgium (Score: 0.808)\n",
      "8. Austria (Score: 0.802)\n",
      "9. London (Score: 0.799)\n",
      "10. History (Score: 0.797)\n",
      "11. Creativity (Score: 0.796)\n",
      "12. Archaeology (Score: 0.795)\n",
      "13. Cartography (Score: 0.794)\n",
      "14. Denmark (Score: 0.793)\n",
      "15. Finland (Score: 0.79)\n",
      "16. English (Score: 0.789)\n",
      "17. Catharism (Score: 0.788)\n",
      "18. Dublin (Score: 0.787)\n",
      "19. Ireland (Score: 0.787)\n",
      "20. Japan (Score: 0.787)\n"
     ]
    }
   ],
   "source": [
    "query_results = query_qdrant('modern art in Europe', 'Articles')\n",
    "for i, article in enumerate(query_results):\n",
    "    print(f'{i + 1}. {article.payload[\"title\"]} (Score: {round(article.score, 3)})')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-01-18T09:53:11.038910Z",
     "start_time": "2023-01-18T09:52:55.248029Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1. History (Score: 0.797)\n",
      "2. Dublin (Score: 0.787)\n",
      "3. Ireland (Score: 0.786)\n",
      "4. History of Australia (Score: 0.782)\n",
      "5. Historian (Score: 0.778)\n",
      "6. Belgium (Score: 0.776)\n",
      "7. Black pudding (Score: 0.773)\n",
      "8. London (Score: 0.769)\n",
      "9. History of Spain (Score: 0.768)\n",
      "10. Cartography (Score: 0.763)\n",
      "11. March (Score: 0.762)\n",
      "12. France (Score: 0.761)\n",
      "13. Bubonic plague (Score: 0.76)\n",
      "14. Great Lakes (Score: 0.759)\n",
      "15. Inch (Score: 0.758)\n",
      "16. Dissolution of the monasteries (Score: 0.758)\n",
      "17. Austria (Score: 0.757)\n",
      "18. English (Score: 0.757)\n",
      "19. British English (Score: 0.757)\n",
      "20. Armenia (Score: 0.756)\n"
     ]
    }
   ],
   "source": [
    "# This time we're going to query using the content vector\n",
    "query_results = query_qdrant('Famous battles in Scottish history', 'Articles', 'content')\n",
    "for i, article in enumerate(query_results):\n",
    "    print(f'{i + 1}. {article.payload[\"title\"]} (Score: {round(article.score, 3)})')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Thanks for following along. You're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things. Enjoy! For more complex use cases, please continue to work through the other cookbook examples in this repo."
--- a/examples/vector_databases/qdrant/docker-compose.yaml
+++ b/examples/vector_databases/qdrant/docker-compose.yaml
@@ -0,0 +1,8 @@
 version: '3.4'
 services:
  qdrant:
    image: qdrant/qdrant:v0.11.7
    restart: on-failure
    ports:
      - "6333:6333"
      - "6334:6334"
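
Reviewer note: to make the notebook's named-vector search easier to reason about, here is a minimal, dependency-free sketch of what a cosine-distance query over the `title` or `content` vector computes conceptually. It is not Qdrant's implementation (Qdrant serves queries from an approximate HNSW index rather than an exhaustive scan), and `cosine_similarity`, `brute_force_search`, and `toy_points` are hypothetical names with made-up two-dimensional vectors used only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(points, vector_name, query_vector, top_k=20):
    """Score every point against the query using the chosen named vector
    and return the top_k (score, title) pairs, best match first."""
    scored = [
        (cosine_similarity(p["vector"][vector_name], query_vector),
         p["payload"]["title"])
        for p in points
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical toy data: each point carries two named vectors and a payload,
# mirroring the shape of the points upserted in the notebook.
toy_points = [
    {"vector": {"title": [1.0, 0.0], "content": [0.0, 1.0]},
     "payload": {"title": "Art"}},
    {"vector": {"title": [0.0, 1.0], "content": [1.0, 0.0]},
     "payload": {"title": "History"}},
]

# Rank from 1, as the notebook's print loop does with `i + 1`.
results = brute_force_search(toy_points, "title", [0.9, 0.1])
for rank, (score, title) in enumerate(results, start=1):
    print(f"{rank}. {title} (Score: {round(score, 3)})")
```

Switching the second argument from `"title"` to `"content"` reverses the ranking for this toy data, which is the same effect the notebook's `vector_name` parameter has on a real collection.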