You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
openai-cookbook/examples/vector_databases/zilliz/Getting_started_with_Zilliz...

423 lines
29 KiB
Plaintext

{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting Started with Zilliz and OpenAI\n",
"### Finding your next book\n",
"\n",
"In this notebook we will be going over generating embeddings of book descriptions with OpenAI and using those embeddings within Zilliz to find relevant books. The dataset in this example is sourced from HuggingFace datasets, and contains a little over 1 million title-description pairs.\n",
"\n",
"Lets begin by first downloading the required libraries for this notebook:\n",
"- `openai` is used for communicating with the OpenAI embedding service\n",
"- `pymilvus` is used for communicating with the Zilliz instance\n",
"- `datasets` is used for downloading the dataset\n",
"- `tqdm` is used for the progress bars\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com\n",
"Requirement already satisfied: openai in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (0.27.2)\n",
"Requirement already satisfied: pymilvus in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (2.2.2)\n",
"Requirement already satisfied: datasets in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (2.10.1)\n",
"Requirement already satisfied: tqdm in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (4.64.1)\n",
"Requirement already satisfied: requests>=2.20 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from openai) (2.28.2)\n",
"Requirement already satisfied: aiohttp in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from openai) (3.8.4)\n",
"Requirement already satisfied: ujson<=5.4.0,>=2.0.0 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from pymilvus) (5.1.0)\n",
"Requirement already satisfied: grpcio-tools<=1.48.0,>=1.47.0 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from pymilvus) (1.47.2)\n",
"Requirement already satisfied: grpcio<=1.48.0,>=1.47.0 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from pymilvus) (1.47.2)\n",
"Requirement already satisfied: mmh3<=3.0.0,>=2.0 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from pymilvus) (3.0.0)\n",
"Requirement already satisfied: pandas>=1.2.4 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from pymilvus) (1.5.3)\n",
"Requirement already satisfied: numpy>=1.17 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from datasets) (1.23.5)\n",
"Requirement already satisfied: xxhash in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from datasets) (3.2.0)\n",
"Requirement already satisfied: responses<0.19 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from datasets) (0.18.0)\n",
"Requirement already satisfied: dill<0.3.7,>=0.3.0 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from datasets) (0.3.6)\n",
"Requirement already satisfied: huggingface-hub<1.0.0,>=0.2.0 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from datasets) (0.12.1)\n",
"Requirement already satisfied: pyarrow>=6.0.0 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from datasets) (10.0.1)\n",
"Requirement already satisfied: multiprocess in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from datasets) (0.70.14)\n",
"Requirement already satisfied: pyyaml>=5.1 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from datasets) (5.4.1)\n",
"Requirement already satisfied: fsspec[http]>=2021.11.1 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from datasets) (2023.1.0)\n",
"Requirement already satisfied: packaging in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from datasets) (23.0)\n",
"Requirement already satisfied: frozenlist>=1.1.1 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from aiohttp->openai) (1.3.3)\n",
"Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from aiohttp->openai) (4.0.2)\n",
"Requirement already satisfied: aiosignal>=1.1.2 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from aiohttp->openai) (1.3.1)\n",
"Requirement already satisfied: attrs>=17.3.0 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from aiohttp->openai) (22.2.0)\n",
"Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from aiohttp->openai) (3.0.1)\n",
"Requirement already satisfied: yarl<2.0,>=1.0 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from aiohttp->openai) (1.8.2)\n",
"Requirement already satisfied: multidict<7.0,>=4.5 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from aiohttp->openai) (6.0.4)\n",
"Requirement already satisfied: six>=1.5.2 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from grpcio<=1.48.0,>=1.47.0->pymilvus) (1.16.0)\n",
"Requirement already satisfied: protobuf<4.0dev,>=3.12.0 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from grpcio-tools<=1.48.0,>=1.47.0->pymilvus) (3.20.1)\n",
"Requirement already satisfied: setuptools in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from grpcio-tools<=1.48.0,>=1.47.0->pymilvus) (65.6.3)\n",
"Requirement already satisfied: typing-extensions>=3.7.4.3 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from huggingface-hub<1.0.0,>=0.2.0->datasets) (4.5.0)\n",
"Requirement already satisfied: filelock in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from huggingface-hub<1.0.0,>=0.2.0->datasets) (3.9.0)\n",
"Requirement already satisfied: python-dateutil>=2.8.1 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from pandas>=1.2.4->pymilvus) (2.8.2)\n",
"Requirement already satisfied: pytz>=2020.1 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from pandas>=1.2.4->pymilvus) (2022.7.1)\n",
"Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from requests>=2.20->openai) (1.26.14)\n",
"Requirement already satisfied: idna<4,>=2.5 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from requests>=2.20->openai) (3.4)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages (from requests>=2.20->openai) (2022.12.7)\n"
]
}
],
"source": [
"! pip install openai pymilvus datasets tqdm"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To get Zilliz up and running take a look [here](https://zilliz.com/doc/quick_start). With your account and database set up, proceed to set the following values:\n",
"- URI: The URI your database is running on\n",
"- USER: Your database username\n",
"- PASSWORD: Your database password\n",
"- COLLECTION_NAME: What to name the collection within Zilliz\n",
"- DIMENSION: The dimension of the embeddings\n",
"- OPENAI_ENGINE: Which embedding model to use\n",
"- openai.api_key: Your OpenAI account key\n",
"- INDEX_PARAM: The index settings to use for the collection\n",
"- QUERY_PARAM: The search parameters to use\n",
"- BATCH_SIZE: How many texts to embed and insert at once"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"\n",
"URI = 'your_uri'\n",
"TOKEN = 'your_token' # TOKEN == user:password or api_key\n",
"COLLECTION_NAME = 'book_search'\n",
"DIMENSION = 1536\n",
"OPENAI_ENGINE = 'text-embedding-3-small'\n",
"openai.api_key = 'sk-your-key'\n",
"\n",
"INDEX_PARAM = {\n",
" 'metric_type':'L2',\n",
" 'index_type':\"AUTOINDEX\",\n",
" 'params':{}\n",
"}\n",
"\n",
"QUERY_PARAM = {\n",
" \"metric_type\": \"L2\",\n",
" \"params\": {},\n",
"}\n",
"\n",
"BATCH_SIZE = 1000"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Zilliz\n",
"This segment deals with Zilliz and setting up the database for this use case. Within Zilliz we need to setup a collection and index it."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from pymilvus import connections, utility, FieldSchema, Collection, CollectionSchema, DataType\n",
"\n",
"# Connect to Zilliz Database\n",
"connections.connect(uri=URI, token=TOKEN)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Remove collection if it already exists\n",
"if utility.has_collection(COLLECTION_NAME):\n",
" utility.drop_collection(COLLECTION_NAME)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Create collection which includes the id, title, and embedding.\n",
"fields = [\n",
" FieldSchema(name='id', dtype=DataType.INT64, is_primary=True, auto_id=True),\n",
" FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=64000),\n",
" FieldSchema(name='description', dtype=DataType.VARCHAR, max_length=64000),\n",
" FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)\n",
"]\n",
"schema = CollectionSchema(fields=fields)\n",
"collection = Collection(name=COLLECTION_NAME, schema=schema)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Create the index on the collection and load it.\n",
"collection.create_index(field_name=\"embedding\", index_params=INDEX_PARAM)\n",
"collection.load()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dataset\n",
"With Zilliz up and running we can begin grabbing our data. `Hugging Face Datasets` is a hub that holds many different user datasets, and for this example we are using Skelebor's book dataset. This dataset contains title-description pairs for over 1 million books. We are going to embed each description and store it within Zilliz along with its title."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/filiphaltmayer/miniconda3/envs/haystack/lib/python3.9/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n",
"Found cached dataset parquet (/Users/filiphaltmayer/.cache/huggingface/datasets/Skelebor___parquet/Skelebor--book_titles_and_descriptions_en_clean-3596935b1d8a7747/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)\n"
]
}
],
"source": [
"import datasets\n",
"\n",
"# Download the dataset and only use the `train` portion (file is around 800Mb)\n",
"dataset = datasets.load_dataset('Skelebor/book_titles_and_descriptions_en_clean', split='train')"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Insert the Data\n",
"Now that we have our data on our machine we can begin embedding it and inserting it into Zilliz. The embedding function takes in text and returns the embeddings in a list format."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Simple function that converts the texts to embeddings\n",
"def embed(texts):\n",
" embeddings = openai.Embedding.create(\n",
" input=texts,\n",
" engine=OPENAI_ENGINE\n",
" )\n",
" return [x['embedding'] for x in embeddings['data']]\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"This next step does the actual inserting. Due to having so many datapoints, if you want to immediately test it out you can stop the inserting cell block early and move along. Doing this will probably decrease the accuracy of the results due to less datapoints, but it should still be good enough."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" 0%| | 2999/1032335 [00:19<1:49:30, 156.66it/s]\n"
]
},
{
"ename": "KeyboardInterrupt",
"evalue": "",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[10], line 14\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mlen\u001b[39m(data[\u001b[39m0\u001b[39m]) \u001b[39m%\u001b[39m BATCH_SIZE \u001b[39m==\u001b[39m \u001b[39m0\u001b[39m:\n\u001b[1;32m 13\u001b[0m data\u001b[39m.\u001b[39mappend(embed(data[\u001b[39m1\u001b[39m]))\n\u001b[0;32m---> 14\u001b[0m collection\u001b[39m.\u001b[39;49minsert(data)\n\u001b[1;32m 15\u001b[0m data \u001b[39m=\u001b[39m [[],[]]\n\u001b[1;32m 17\u001b[0m \u001b[39m# Embed and insert the remainder \u001b[39;00m\n",
"File \u001b[0;32m~/miniconda3/envs/haystack/lib/python3.9/site-packages/pymilvus/orm/collection.py:430\u001b[0m, in \u001b[0;36mCollection.insert\u001b[0;34m(self, data, partition_name, timeout, **kwargs)\u001b[0m\n\u001b[1;32m 427\u001b[0m entities \u001b[39m=\u001b[39m Prepare\u001b[39m.\u001b[39mprepare_insert_data(data, \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_schema)\n\u001b[1;32m 429\u001b[0m conn \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_get_connection()\n\u001b[0;32m--> 430\u001b[0m res \u001b[39m=\u001b[39m conn\u001b[39m.\u001b[39;49mbatch_insert(\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_name, entities, partition_name,\n\u001b[1;32m 431\u001b[0m timeout\u001b[39m=\u001b[39;49mtimeout, schema\u001b[39m=\u001b[39;49m\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_schema_dict, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n\u001b[1;32m 433\u001b[0m \u001b[39mif\u001b[39;00m kwargs\u001b[39m.\u001b[39mget(\u001b[39m\"\u001b[39m\u001b[39m_async\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39mFalse\u001b[39;00m):\n\u001b[1;32m 434\u001b[0m \u001b[39mreturn\u001b[39;00m MutationFuture(res)\n",
"File \u001b[0;32m~/miniconda3/envs/haystack/lib/python3.9/site-packages/pymilvus/decorators.py:105\u001b[0m, in \u001b[0;36merror_handler.<locals>.wrapper.<locals>.handler\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 103\u001b[0m \u001b[39mtry\u001b[39;00m:\n\u001b[1;32m 104\u001b[0m record_dict[\u001b[39m\"\u001b[39m\u001b[39mRPC start\u001b[39m\u001b[39m\"\u001b[39m] \u001b[39m=\u001b[39m \u001b[39mstr\u001b[39m(datetime\u001b[39m.\u001b[39mdatetime\u001b[39m.\u001b[39mnow())\n\u001b[0;32m--> 105\u001b[0m \u001b[39mreturn\u001b[39;00m func(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n\u001b[1;32m 106\u001b[0m \u001b[39mexcept\u001b[39;00m MilvusException \u001b[39mas\u001b[39;00m e:\n\u001b[1;32m 107\u001b[0m record_dict[\u001b[39m\"\u001b[39m\u001b[39mRPC error\u001b[39m\u001b[39m\"\u001b[39m] \u001b[39m=\u001b[39m \u001b[39mstr\u001b[39m(datetime\u001b[39m.\u001b[39mdatetime\u001b[39m.\u001b[39mnow())\n",
"File \u001b[0;32m~/miniconda3/envs/haystack/lib/python3.9/site-packages/pymilvus/decorators.py:136\u001b[0m, in \u001b[0;36mtracing_request.<locals>.wrapper.<locals>.handler\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 134\u001b[0m \u001b[39mif\u001b[39;00m req_id:\n\u001b[1;32m 135\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mset_onetime_request_id(req_id)\n\u001b[0;32m--> 136\u001b[0m ret \u001b[39m=\u001b[39m func(\u001b[39mself\u001b[39;49m, \u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n\u001b[1;32m 137\u001b[0m \u001b[39mreturn\u001b[39;00m ret\n",
"File \u001b[0;32m~/miniconda3/envs/haystack/lib/python3.9/site-packages/pymilvus/decorators.py:50\u001b[0m, in \u001b[0;36mretry_on_rpc_failure.<locals>.wrapper.<locals>.handler\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 48\u001b[0m \u001b[39mwhile\u001b[39;00m \u001b[39mTrue\u001b[39;00m:\n\u001b[1;32m 49\u001b[0m \u001b[39mtry\u001b[39;00m:\n\u001b[0;32m---> 50\u001b[0m \u001b[39mreturn\u001b[39;00m func(\u001b[39mself\u001b[39;49m, \u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n\u001b[1;32m 51\u001b[0m \u001b[39mexcept\u001b[39;00m grpc\u001b[39m.\u001b[39mRpcError \u001b[39mas\u001b[39;00m e:\n\u001b[1;32m 52\u001b[0m \u001b[39m# DEADLINE_EXCEEDED means that the task wat not completed\u001b[39;00m\n\u001b[1;32m 53\u001b[0m \u001b[39m# UNAVAILABLE means that the service is not reachable currently\u001b[39;00m\n\u001b[1;32m 54\u001b[0m \u001b[39m# Reference: https://grpc.github.io/grpc/python/grpc.html#grpc-status-code\u001b[39;00m\n\u001b[1;32m 55\u001b[0m \u001b[39mif\u001b[39;00m e\u001b[39m.\u001b[39mcode() \u001b[39m!=\u001b[39m grpc\u001b[39m.\u001b[39mStatusCode\u001b[39m.\u001b[39mDEADLINE_EXCEEDED \u001b[39mand\u001b[39;00m e\u001b[39m.\u001b[39mcode() \u001b[39m!=\u001b[39m grpc\u001b[39m.\u001b[39mStatusCode\u001b[39m.\u001b[39mUNAVAILABLE:\n",
"File \u001b[0;32m~/miniconda3/envs/haystack/lib/python3.9/site-packages/pymilvus/client/grpc_handler.py:378\u001b[0m, in \u001b[0;36mGrpcHandler.batch_insert\u001b[0;34m(self, collection_name, entities, partition_name, timeout, **kwargs)\u001b[0m\n\u001b[1;32m 375\u001b[0m f\u001b[39m.\u001b[39madd_callback(ts_utils\u001b[39m.\u001b[39mupdate_ts_on_mutation(collection_name))\n\u001b[1;32m 376\u001b[0m \u001b[39mreturn\u001b[39;00m f\n\u001b[0;32m--> 378\u001b[0m response \u001b[39m=\u001b[39m rf\u001b[39m.\u001b[39;49mresult()\n\u001b[1;32m 379\u001b[0m \u001b[39mif\u001b[39;00m response\u001b[39m.\u001b[39mstatus\u001b[39m.\u001b[39merror_code \u001b[39m==\u001b[39m \u001b[39m0\u001b[39m:\n\u001b[1;32m 380\u001b[0m m \u001b[39m=\u001b[39m MutationResult(response)\n",
"File \u001b[0;32m~/miniconda3/envs/haystack/lib/python3.9/site-packages/grpc/_channel.py:733\u001b[0m, in \u001b[0;36m_MultiThreadedRendezvous.result\u001b[0;34m(self, timeout)\u001b[0m\n\u001b[1;32m 728\u001b[0m \u001b[39m\u001b[39m\u001b[39m\"\"\"Returns the result of the computation or raises its exception.\u001b[39;00m\n\u001b[1;32m 729\u001b[0m \n\u001b[1;32m 730\u001b[0m \u001b[39mSee grpc.Future.result for the full API contract.\u001b[39;00m\n\u001b[1;32m 731\u001b[0m \u001b[39m\"\"\"\u001b[39;00m\n\u001b[1;32m 732\u001b[0m \u001b[39mwith\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_state\u001b[39m.\u001b[39mcondition:\n\u001b[0;32m--> 733\u001b[0m timed_out \u001b[39m=\u001b[39m _common\u001b[39m.\u001b[39;49mwait(\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_state\u001b[39m.\u001b[39;49mcondition\u001b[39m.\u001b[39;49mwait,\n\u001b[1;32m 734\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_is_complete,\n\u001b[1;32m 735\u001b[0m timeout\u001b[39m=\u001b[39;49mtimeout)\n\u001b[1;32m 736\u001b[0m \u001b[39mif\u001b[39;00m timed_out:\n\u001b[1;32m 737\u001b[0m \u001b[39mraise\u001b[39;00m grpc\u001b[39m.\u001b[39mFutureTimeoutError()\n",
"File \u001b[0;32m~/miniconda3/envs/haystack/lib/python3.9/site-packages/grpc/_common.py:141\u001b[0m, in \u001b[0;36mwait\u001b[0;34m(wait_fn, wait_complete_fn, timeout, spin_cb)\u001b[0m\n\u001b[1;32m 139\u001b[0m \u001b[39mif\u001b[39;00m timeout \u001b[39mis\u001b[39;00m \u001b[39mNone\u001b[39;00m:\n\u001b[1;32m 140\u001b[0m \u001b[39mwhile\u001b[39;00m \u001b[39mnot\u001b[39;00m wait_complete_fn():\n\u001b[0;32m--> 141\u001b[0m _wait_once(wait_fn, MAXIMUM_WAIT_TIMEOUT, spin_cb)\n\u001b[1;32m 142\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m 143\u001b[0m end \u001b[39m=\u001b[39m time\u001b[39m.\u001b[39mtime() \u001b[39m+\u001b[39m timeout\n",
"File \u001b[0;32m~/miniconda3/envs/haystack/lib/python3.9/site-packages/grpc/_common.py:106\u001b[0m, in \u001b[0;36m_wait_once\u001b[0;34m(wait_fn, timeout, spin_cb)\u001b[0m\n\u001b[1;32m 105\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39m_wait_once\u001b[39m(wait_fn, timeout, spin_cb):\n\u001b[0;32m--> 106\u001b[0m wait_fn(timeout\u001b[39m=\u001b[39;49mtimeout)\n\u001b[1;32m 107\u001b[0m \u001b[39mif\u001b[39;00m spin_cb \u001b[39mis\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mNone\u001b[39;00m:\n\u001b[1;32m 108\u001b[0m spin_cb()\n",
"File \u001b[0;32m~/miniconda3/envs/haystack/lib/python3.9/threading.py:316\u001b[0m, in \u001b[0;36mCondition.wait\u001b[0;34m(self, timeout)\u001b[0m\n\u001b[1;32m 314\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m 315\u001b[0m \u001b[39mif\u001b[39;00m timeout \u001b[39m>\u001b[39m \u001b[39m0\u001b[39m:\n\u001b[0;32m--> 316\u001b[0m gotit \u001b[39m=\u001b[39m waiter\u001b[39m.\u001b[39;49macquire(\u001b[39mTrue\u001b[39;49;00m, timeout)\n\u001b[1;32m 317\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m 318\u001b[0m gotit \u001b[39m=\u001b[39m waiter\u001b[39m.\u001b[39macquire(\u001b[39mFalse\u001b[39;00m)\n",
"\u001b[0;31mKeyboardInterrupt\u001b[0m: "
]
}
],
"source": [
"from tqdm import tqdm\n",
"\n",
"data = [\n",
" [], # title\n",
" [], # description\n",
"]\n",
"\n",
"# Embed and insert in batches\n",
"for i in tqdm(range(0, len(dataset))):\n",
" data[0].append(dataset[i]['title'])\n",
" data[1].append(dataset[i]['description'])\n",
" if len(data[0]) % BATCH_SIZE == 0:\n",
" data.append(embed(data[1]))\n",
" collection.insert(data)\n",
" data = [[],[]]\n",
"\n",
"# Embed and insert the remainder \n",
"if len(data[0]) != 0:\n",
" data.append(embed(data[1]))\n",
" collection.insert(data)\n",
" data = [[],[]]\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Query the Database\n",
"With our data safely inserted in Zilliz, we can now perform a query. The query takes in a string or a list of strings and searches them. The results print out your provided description and the results that include the result score, the result title, and the result book description."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"import textwrap\n",
"\n",
"def query(queries, top_k = 5):\n",
" if type(queries) != list:\n",
" queries = [queries]\n",
" res = collection.search(embed(queries), anns_field='embedding', param=QUERY_PARAM, limit = top_k, output_fields=['title', 'description'])\n",
" for i, hit in enumerate(res):\n",
" print('Description:', queries[i])\n",
" print('Results:')\n",
" for ii, hits in enumerate(hit):\n",
" print('\\t' + 'Rank:', ii + 1, 'Score:', hits.score, 'Title:', hits.entity.get('title'))\n",
" print(textwrap.fill(hits.entity.get('description'), 88))\n",
" print()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Description: Book about a k-9 from europe\n",
"Results:\n",
"\tRank: 1 Score: 0.3047754764556885 Title: Bark M For Murder\n",
"Who let the dogs out? Evildoers beware! Four of mystery fiction's top storytellers are\n",
"setting the hounds on your trail -- in an incomparable quartet of crime stories with a\n",
"canine edge. Man's (and woman's) best friends take the lead in this phenomenal\n",
"collection of tales tense and surprising, humorous and thrilling: New York\n",
"Timesbestselling author J.A. Jance's spellbinding saga of a scam-busting septuagenarian\n",
"and her two golden retrievers; Anthony Award winner Virginia Lanier's pureblood thriller\n",
"featuring bloodhounds and bloody murder; Chassie West's suspenseful stunner about a\n",
"life-saving German shepherd and a ghastly forgotten crime; rising star Lee Charles\n",
"Kelley's edge-of-your-seat yarn that pits an ex-cop/kennel owner and a yappy toy poodle\n",
"against a craven killer.\n",
"\n",
"\tRank: 2 Score: 0.3283390402793884 Title: Texas K-9 Unit Christmas: Holiday Hero\\Rescuing Christmas\n",
"CHRISTMAS COMES WRAPPED IN DANGER Holiday Hero by Shirlee McCoy Emma Fairchild never\n",
"expected to find trouble in sleepy Sagebrush, Texas. But when she's attacked and left\n",
"for dead in her own diner, her childhood friend turned K-9 cop Lucas Harwood offers a\n",
"chance at justice--and love. Rescuing Christmas by Terri Reed She escaped a kidnapper,\n",
"but now a killer has set his sights on K-9 dog trainer Lily Anderson. When fellow\n",
"officer Jarrod Evans appoints himself her bodyguard, Lily knows more than her life is at\n",
"risk--so is her heart. Texas K-9 Unit: These lawmen solve the toughest cases with the\n",
"help of their brave canine partners\n",
"\n",
"\tRank: 3 Score: 0.33899369835853577 Title: Dogs on Duty: Soldiers' Best Friends on the Battlefield and Beyond\n",
"When the news of the raid on Osama Bin Laden's compound broke, the SEAL team member that\n",
"stole the show was a highly trained canine companion. Throughout history, dogs have been\n",
"key contributors to military units. Dorothy Hinshaw Patent follows man's best friend\n",
"onto the battlefield, showing readers why dogs are uniquely qualified for the job at\n",
"hand, how they are trained, how they contribute to missions, and what happens when they\n",
"retire. With full-color photographs throughout and sidebars featuring heroic canines\n",
"throughout history, Dogs on Duty provides a fascinating look at these exceptional\n",
"soldiers and companions.\n",
"\n",
"\tRank: 4 Score: 0.34207457304000854 Title: Toute Allure: Falling in Love in Rural France\n",
"After saying goodbye to life as a successful fashion editor in London, Karen Wheeler is\n",
"now happy in her small village house in rural France. Her idyll is complete when she\n",
"meets the love of her life - he has shaggy hair, four paws and a wet nose!\n",
"\n",
"\tRank: 5 Score: 0.343595951795578 Title: Otherwise Alone (Evan Arden, #1)\n",
"Librarian's note: This is an alternate cover edition for ASIN: B00AP5NNWC. Lieutenant\n",
"Evan Arden sits in a shack in the middle of nowhere, waiting for orders that will send\n",
"him back home - if he ever gets them. Other than his loyal Great Pyrenees, there's no\n",
"one around to break up the monotony. The tedium is excruciating, but it is suddenly\n",
"interrupted when a young woman stumbles up his path. \"It's only 50-something pages, but\n",
"in that short amount of time, the author's awesome writing packs in a whole lotta\n",
"character detail. And sets the stage for the series, perfectly.\" -Maryse.net, 4.5 Stars\n",
"He has two choices - pick her off from a distance with his trusty sniper-rifle, or dare\n",
"let her approach his cabin and enter his life. Why not? It's been ages, and he is\n",
"otherwise alone...\n",
"\n"
]
}
],
"source": [
"query('Book about a k-9 from europe')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "haystack",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}