diff --git a/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb b/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb index f048d978..55b8cebf 100644 --- a/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb +++ b/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb @@ -69,6 +69,9 @@ "# Weaviate's client library for Python\n", "import weaviate\n", "\n", + "# Qdrant's client library for Python\n", + "import qdrant_client\n", + "\n", "# I've set this to our new embeddings model, this can be changed to the embedding model of your choice\n", "EMBEDDING_MODEL = \"text-embedding-ada-002\"\n", "\n", @@ -1048,7 +1051,313 @@ }, { "cell_type": "markdown", - "id": "ad74202e", + "metadata": {}, + "source": [ + "## Qdrant\n", + "\n", + "The last vector database we'll consider is **[Qdrant](https://qdrant.tech/)**. This is a high-performance vector search database written in Rust. It offers both on-premise and cloud versions, but for the purposes of this example we're going to use the local deployment mode.\n", + "\n", + "Setting everything up will require:\n", + "- Spinning up a local instance of Qdrant\n", + "- Configuring the collection and storing the data in it\n", + "- Trying it out with some queries" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setup\n", + "\n", + "For the local deployment, we are going to use Docker, according to the Qdrant documentation: https://qdrant.tech/documentation/quick_start/. 
Qdrant requires just a single container, but an example of the docker-compose.yaml file is available at `./qdrant/docker-compose.yaml` in this repo.\n", + "\n", + "You can start a Qdrant instance locally by navigating to this directory and running `docker-compose up -d`" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "ExecuteTime": { + "end_time": "2023-01-18T09:28:38.928205Z", + "start_time": "2023-01-18T09:28:38.913987Z" + } + }, + "outputs": [], + "source": [ + "qdrant = qdrant_client.QdrantClient(host='localhost', prefer_grpc=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": { + "ExecuteTime": { + "end_time": "2023-01-18T09:29:19.806639Z", + "start_time": "2023-01-18T09:29:19.727897Z" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "CollectionsResponse(collections=[])" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "qdrant.get_collections()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Index data\n", + "\n", + "Qdrant stores data in __collections__ where each object is described by at least one vector and may contain additional metadata called a __payload__. Our collection will be called **Articles** and each object will be described by both **title** and **content** vectors.\n", + "\n", + "We're going to use the official [qdrant-client](https://github.com/qdrant/qdrant_client) package, which has all the utility methods already built in." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "ExecuteTime": { + "end_time": "2023-01-18T09:29:22.530121Z", + "start_time": "2023-01-18T09:29:22.524604Z" + } + }, + "outputs": [], + "source": [ + "from qdrant_client.http import models as rest" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": { + "ExecuteTime": { + "end_time": "2023-01-18T09:31:14.413334Z", + "start_time": "2023-01-18T09:31:13.619079Z" + } + }, + "outputs": [], + "source": [ + "vector_size = len(article_df['content_vector'][0])\n", + "\n", + "qdrant.recreate_collection(\n", + " collection_name='Articles',\n", + " vectors_config={\n", + " 'title': rest.VectorParams(\n", + " distance=rest.Distance.COSINE,\n", + " size=vector_size,\n", + " ),\n", + " 'content': rest.VectorParams(\n", + " distance=rest.Distance.COSINE,\n", + " size=vector_size,\n", + " ),\n", + " }\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": { + "ExecuteTime": { + "end_time": "2023-01-18T09:36:28.597535Z", + "start_time": "2023-01-18T09:36:24.108867Z" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "UpdateResult(operation_id=0, status=)" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "qdrant.upsert(\n", + " collection_name='Articles',\n", + " points=[\n", + " rest.PointStruct(\n", + " id=k,\n", + " vector={\n", + " 'title': v['title_vector'],\n", + " 'content': v['content_vector'],\n", + " },\n", + " payload=v.to_dict(),\n", + " )\n", + " for k, v in article_df.iterrows()\n", + " ],\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": { + "ExecuteTime": { + "end_time": "2023-01-18T09:58:13.825886Z", + "start_time": "2023-01-18T09:58:13.816248Z" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "CountResult(count=250)" + ] + }, + "execution_count": 52, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": 
[ + "# Check the collection size to make sure all the points have been stored\n", + "qdrant.count(collection_name='Articles')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Search Data\n", + "\n", + "Once the data is in Qdrant, we can start querying the collection for the closest vectors. The optional `vector_name` parameter switches between title-based and content-based search." + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": { + "ExecuteTime": { + "end_time": "2023-01-18T09:50:35.265647Z", + "start_time": "2023-01-18T09:50:35.256065Z" + } + }, + "outputs": [], + "source": [ + "def query_qdrant(query, collection_name, vector_name='title', top_k=20):\n", + "\n", + " # Create an embedding vector from the user query\n", + " embedded_query = openai.Embedding.create(\n", + " input=query,\n", + " model=EMBEDDING_MODEL,\n", + " )['data'][0]['embedding']\n", + " \n", + " query_results = qdrant.search(\n", + " collection_name=collection_name,\n", + " query_vector=(\n", + " vector_name, embedded_query\n", + " ),\n", + " limit=top_k,\n", + " )\n", + " \n", + " return query_results" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": { + "ExecuteTime": { + "end_time": "2023-01-18T09:50:46.545145Z", + "start_time": "2023-01-18T09:50:35.711020Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1. Art (Score: 0.841)\n", + "2. Europe (Score: 0.839)\n", + "3. Italy (Score: 0.816)\n", + "4. Architecture (Score: 0.815)\n", + "5. Madrid (Score: 0.815)\n", + "6. France (Score: 0.812)\n", + "7. Belgium (Score: 0.808)\n", + "8. Austria (Score: 0.802)\n", + "9. London (Score: 0.799)\n", + "10. History (Score: 0.797)\n", + "11. Creativity (Score: 0.796)\n", + "12. Archaeology (Score: 0.795)\n", + "13. Cartography (Score: 0.794)\n", + "14. Denmark (Score: 0.793)\n", + "15. Finland (Score: 0.79)\n", + "16. English (Score: 0.789)\n", + "17. 
Catharism (Score: 0.788)\n", + "18. Dublin (Score: 0.787)\n", + "19. Ireland (Score: 0.787)\n", + "20. Japan (Score: 0.787)\n" + ] + } + ], + "source": [ + "query_results = query_qdrant('modern art in Europe', 'Articles')\n", + "for i, article in enumerate(query_results):\n", + " print(f'{i + 1}. {article.payload[\"title\"]} (Score: {round(article.score, 3)})')" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": { + "ExecuteTime": { + "end_time": "2023-01-18T09:53:11.038910Z", + "start_time": "2023-01-18T09:52:55.248029Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1. History (Score: 0.797)\n", + "2. Dublin (Score: 0.787)\n", + "3. Ireland (Score: 0.786)\n", + "4. History of Australia (Score: 0.782)\n", + "5. Historian (Score: 0.778)\n", + "6. Belgium (Score: 0.776)\n", + "7. Black pudding (Score: 0.773)\n", + "8. London (Score: 0.769)\n", + "9. History of Spain (Score: 0.768)\n", + "10. Cartography (Score: 0.763)\n", + "11. March (Score: 0.762)\n", + "12. France (Score: 0.761)\n", + "13. Bubonic plague (Score: 0.76)\n", + "14. Great Lakes (Score: 0.759)\n", + "15. Inch (Score: 0.758)\n", + "16. Dissolution of the monasteries (Score: 0.758)\n", + "17. Austria (Score: 0.757)\n", + "18. English (Score: 0.757)\n", + "19. British English (Score: 0.757)\n", + "20. Armenia (Score: 0.756)\n" + ] + } + ], + "source": [ + "# This time we're going to query using the content vector\n", + "query_results = query_qdrant('Famous battles in Scottish history', 'Articles', 'content')\n", + "for i, article in enumerate(query_results):\n", + " print(f'{i + 1}. {article.payload[\"title\"]} (Score: {round(article.score, 3)})')" + ] + }, + { + "cell_type": "markdown", "metadata": {}, "source": [ "Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! 
For more complex use cases please continue to work through other cookbook examples in this repo." diff --git a/examples/vector_databases/qdrant/docker-compose.yaml b/examples/vector_databases/qdrant/docker-compose.yaml new file mode 100644 index 00000000..d924affd --- /dev/null +++ b/examples/vector_databases/qdrant/docker-compose.yaml @@ -0,0 +1,8 @@ +version: '3.4' +services: + qdrant: + image: qdrant/qdrant:v0.11.7 + restart: on-failure + ports: + - "6333:6333" + - "6334:6334" \ No newline at end of file