Merge pull request #81 from kacperlukawski/qdrant-example

Add Qdrant as another example of vector database
colin-openai 2023-01-19 09:37:37 -08:00 committed by GitHub
commit 2fed004763
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 318 additions and 1 deletions

@ -69,6 +69,9 @@
"# Weaviate's client library for Python\n",
"import weaviate\n",
"\n",
"# Qdrant's client library for Python\n",
"import qdrant_client\n",
"\n",
"# I've set this to our new embeddings model, this can be changed to the embedding model of your choice\n",
"EMBEDDING_MODEL = \"text-embedding-ada-002\"\n",
"\n",
@ -1048,7 +1051,313 @@
},
{
"cell_type": "markdown",
"id": "ad74202e", "metadata": {},
"source": [
"## Qdrant\n",
"\n",
"The last vector database we'll consider is **[Qdrant](https://qdrant.tech/)**. It is a high-performance vector search database written in Rust. It offers both on-premise and cloud versions, but for the purposes of this example we're going to use the local deployment mode.\n",
"\n",
"Setting everything up will require:\n",
"- Spinning up a local instance of Qdrant\n",
"- Configuring the collection and storing the data in it\n",
"- Trying it out with some queries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setup\n",
"\n",
"For the local deployment, we'll run Qdrant in Docker, following the Qdrant documentation: https://qdrant.tech/documentation/quick_start/. Qdrant needs only a single container, and an example docker-compose.yaml file is available at `./qdrant/docker-compose.yaml` in this repo.\n",
"\n",
"You can start a Qdrant instance locally by navigating to this directory and running `docker-compose up -d`"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-18T09:28:38.928205Z",
"start_time": "2023-01-18T09:28:38.913987Z"
}
},
"outputs": [],
"source": [
"qdrant = qdrant_client.QdrantClient(host='localhost', prefer_grpc=True)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-18T09:29:19.806639Z",
"start_time": "2023-01-18T09:29:19.727897Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"CollectionsResponse(collections=[])"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"qdrant.get_collections()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Index data\n",
"\n",
"Qdrant stores data in __collections__, where each object is described by at least one vector and may carry additional metadata called a __payload__. Our collection will be called **Articles**, and each object will be described by both **title** and **content** vectors.\n",
"\n",
"We'll use the official [qdrant-client](https://github.com/qdrant/qdrant_client) package, which has all the utility methods built in."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-18T09:29:22.530121Z",
"start_time": "2023-01-18T09:29:22.524604Z"
}
},
"outputs": [],
"source": [
"from qdrant_client.http import models as rest"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-18T09:31:14.413334Z",
"start_time": "2023-01-18T09:31:13.619079Z"
}
},
"outputs": [],
"source": [
"vector_size = len(article_df['content_vector'][0])\n",
"\n",
"qdrant.recreate_collection(\n",
"    collection_name='Articles',\n",
"    vectors_config={\n",
"        'title': rest.VectorParams(\n",
"            distance=rest.Distance.COSINE,\n",
"            size=vector_size,\n",
"        ),\n",
"        'content': rest.VectorParams(\n",
"            distance=rest.Distance.COSINE,\n",
"            size=vector_size,\n",
"        ),\n",
"    }\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-18T09:36:28.597535Z",
"start_time": "2023-01-18T09:36:24.108867Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"qdrant.upsert(\n",
"    collection_name='Articles',\n",
"    points=[\n",
"        rest.PointStruct(\n",
"            id=k,\n",
"            vector={\n",
"                'title': v['title_vector'],\n",
"                'content': v['content_vector'],\n",
"            },\n",
"            payload=v.to_dict(),\n",
"        )\n",
"        for k, v in article_df.iterrows()\n",
"    ],\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-18T09:58:13.825886Z",
"start_time": "2023-01-18T09:58:13.816248Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"CountResult(count=250)"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check the collection size to make sure all the points have been stored\n",
"qdrant.count(collection_name='Articles')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Search Data\n",
"\n",
"Once the data is in Qdrant, we can start querying the collection for the closest vectors. The optional `vector_name` parameter switches between title-based and content-based search."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-18T09:50:35.265647Z",
"start_time": "2023-01-18T09:50:35.256065Z"
}
},
"outputs": [],
"source": [
"def query_qdrant(query, collection_name, vector_name='title', top_k=20):\n",
"\n",
"    # Creates embedding vector from user query\n",
"    embedded_query = openai.Embedding.create(\n",
"        input=query,\n",
"        model=EMBEDDING_MODEL,\n",
"    )['data'][0]['embedding']\n",
"\n",
"    query_results = qdrant.search(\n",
"        collection_name=collection_name,\n",
"        query_vector=(\n",
"            vector_name, embedded_query\n",
"        ),\n",
"        limit=top_k,\n",
"    )\n",
"\n",
"    return query_results"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-18T09:50:46.545145Z",
"start_time": "2023-01-18T09:50:35.711020Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1. Art (Score: 0.841)\n",
"2. Europe (Score: 0.839)\n",
"3. Italy (Score: 0.816)\n",
"4. Architecture (Score: 0.815)\n",
"5. Madrid (Score: 0.815)\n",
"6. France (Score: 0.812)\n",
"7. Belgium (Score: 0.808)\n",
"8. Austria (Score: 0.802)\n",
"9. London (Score: 0.799)\n",
"10. History (Score: 0.797)\n",
"11. Creativity (Score: 0.796)\n",
"12. Archaeology (Score: 0.795)\n",
"13. Cartography (Score: 0.794)\n",
"14. Denmark (Score: 0.793)\n",
"15. Finland (Score: 0.79)\n",
"16. English (Score: 0.789)\n",
"17. Catharism (Score: 0.788)\n",
"18. Dublin (Score: 0.787)\n",
"19. Ireland (Score: 0.787)\n",
"20. Japan (Score: 0.787)\n"
]
}
],
"source": [
"query_results = query_qdrant('modern art in Europe', 'Articles')\n",
"for i, article in enumerate(query_results):\n",
"    print(f'{i + 1}. {article.payload[\"title\"]} (Score: {round(article.score, 3)})')"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"ExecuteTime": {
"end_time": "2023-01-18T09:53:11.038910Z",
"start_time": "2023-01-18T09:52:55.248029Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1. History (Score: 0.797)\n",
"2. Dublin (Score: 0.787)\n",
"3. Ireland (Score: 0.786)\n",
"4. History of Australia (Score: 0.782)\n",
"5. Historian (Score: 0.778)\n",
"6. Belgium (Score: 0.776)\n",
"7. Black pudding (Score: 0.773)\n",
"8. London (Score: 0.769)\n",
"9. History of Spain (Score: 0.768)\n",
"10. Cartography (Score: 0.763)\n",
"11. March (Score: 0.762)\n",
"12. France (Score: 0.761)\n",
"13. Bubonic plague (Score: 0.76)\n",
"14. Great Lakes (Score: 0.759)\n",
"15. Inch (Score: 0.758)\n",
"16. Dissolution of the monasteries (Score: 0.758)\n",
"17. Austria (Score: 0.757)\n",
"18. English (Score: 0.757)\n",
"19. British English (Score: 0.757)\n",
"20. Armenia (Score: 0.756)\n"
]
}
],
"source": [
"# This time we're going to query using the content vector\n",
"query_results = query_qdrant('Famous battles in Scottish history', 'Articles', 'content')\n",
"for i, article in enumerate(query_results):\n",
"    print(f'{i + 1}. {article.payload[\"title\"]} (Score: {round(article.score, 3)})')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo."

@ -0,0 +1,8 @@
version: '3.4'
services:
  qdrant:
    image: qdrant/qdrant:v0.11.7
    restart: on-failure
    ports:
      - "6333:6333"
      - "6334:6334"
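
As a quick sanity check before creating the client, the two ports published by the compose file can be probed from Python. The sketch below is not part of this PR; the helper names `port_is_open` and `qdrant_ports_status` are hypothetical, and it assumes Qdrant's default assignment of 6333 to the HTTP/REST API and 6334 to gRPC (the port the notebook's `prefer_grpc=True` client would use). It only tests TCP reachability, not that the service behind the port is actually Qdrant.

```python
import socket

# Ports published by the compose file. Assumption: Qdrant's defaults, where
# 6333 serves the HTTP/REST API and 6334 serves gRPC.
QDRANT_PORTS = {"http": 6333, "grpc": 6334}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def qdrant_ports_status(host: str = "localhost", ports=None) -> dict:
    """Map each port name to whether it currently accepts TCP connections."""
    ports = QDRANT_PORTS if ports is None else ports
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

Calling `qdrant_ports_status()` after `docker-compose up -d` should report both ports as reachable once the container has started. If only the HTTP port responds, one option is to create the client without `prefer_grpc=True` so that it talks to the REST API instead.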