Add Typesense example to vector databases (#277)

* Add Typesense example to vector databases
* Fix typo
* Add to intro section
2 years ago · 637cc3f87e
parent 3905d2fea0
commit 637cc3f87e
3 changed files with 285 additions and 1 deletions
--- a/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb
+++ b/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb
@@ -43,6 +43,10 @@
    "    - *Setup*: Set up the Redis-Py client. For more details go [here](https://github.com/redis/redis-py)\n",
    "    - *Index Data*: Create the search index for vector search and hybrid search (vector + full-text search) on all available fields.\n",
    "    - *Search Data*: Run a few example queries with various goals in mind.\n",
+    "- **Typesense**\n",
+    "    - *Setup*: Set up the Typesense Python client. For more details go [here](https://typesense.org/docs/0.24.0/api/)\n",
+    "    - *Index Data*: Create a collection and index both the __title__ and __content__ vectors.\n",
+    "    - *Search Data*: Run a few example queries with various goals in mind.\n",
    "\n",
    "\n",
    "Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings."
@@ -71,6 +75,7 @@
    "!pip install pymilvus\n",
    "!pip install qdrant-client\n",
    "!pip install redis\n",
+    "!pip install typesense\n",
    "\n",
    "#Install wget to pull zip file\n",
    "!pip install wget"
@@ -108,6 +113,10 @@
    "# Qdrant's client library for Python\n",
    "import qdrant_client\n",
    "\n",
+    "# Typesense's client library for Python\n",
+    "import typesense\n",
+    "\n",
+    "\n",
    "# I've set this to our new embeddings model, this can be changed to the embedding model of your choice\n",
    "EMBEDDING_MODEL = \"text-embedding-ada-002\"\n",
    "\n",
@@ -1421,7 +1430,7 @@
   "source": [
    "# Redis\n",
    "\n",
-    "The last vector database covered in this tutorial is **[Redis](https://redis.io)**. You most likely already know Redis. What you might not be aware of is the [RediSearch module](https://github.com/RediSearch/RediSearch). Enterprises have been using Redis with the RediSearch module for years now across all major cloud providers, Redis Cloud, and on premise. Recently, the Redis team added vector storage and search capability to this module in addition to the features RediSearch already had.\n",
+    "The next vector database covered in this tutorial is **[Redis](https://redis.io)**. You most likely already know Redis. What you might not be aware of is the [RediSearch module](https://github.com/RediSearch/RediSearch). Enterprises have been using Redis with the RediSearch module for years now across all major cloud providers, Redis Cloud, and on premise. Recently, the Redis team added vector storage and search capability to this module in addition to the features RediSearch already had.\n",
    "\n",
    "Given the large ecosystem around Redis, there are most likely client libraries in the language you need. You can use any standard Redis client library to run RediSearch commands, but it's easiest to use a library that wraps the RediSearch API. Below are a few examples, but you can find more client libraries [here](https://redis.io/resources/clients/).\n",
    "\n",
@@ -1790,6 +1799,270 @@
    "For more example with Redis as a vector database, see the README and examples within the ``vector_databases/redis`` directory of this repository"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Typesense\n",
+    "\n",
+    "The next vector store we'll look at is [Typesense](https://typesense.org/), an open source, in-memory search engine that you can either self-host or run on [Typesense Cloud](https://cloud.typesense.org).\n",
+    "\n",
+    "Typesense focuses on performance by storing the entire index in RAM (with a backup on disk), and on an out-of-the-box developer experience by simplifying the available options and setting good defaults. It also lets you combine attribute-based filtering with vector queries.\n",
+    "\n",
+    "For this example, we will set up a local Docker-based Typesense server, index our vectors in Typesense, and then run some nearest-neighbor search queries. If you use Typesense Cloud, you can skip the Docker setup and obtain the hostname and API keys from your cluster dashboard."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "### Setup\n",
+    "\n",
+    "To run Typesense locally, you'll need [Docker](https://www.docker.com/). Following the [Typesense documentation](https://typesense.org/docs/guide/install-typesense.html#docker-compose), we created an example `docker-compose.yml` file in this repo at [./typesense/docker-compose.yml](./typesense/docker-compose.yml).\n",
+    "\n",
+    "After starting Docker, you can start Typesense locally by navigating to the `examples/vector_databases/typesense/` directory and running `docker-compose up -d`.\n",
+    "\n",
+    "The default API key is set to `xyz` in the Docker compose file, and the default Typesense port to `8108`."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "outputs": [],
+   "source": [
+    "import typesense\n",
+    "\n",
+    "typesense_client = \\\n",
+    "    typesense.Client({\n",
+    "        \"nodes\": [{\n",
+    "            \"host\": \"localhost\",  # For Typesense Cloud use xxx.a1.typesense.net\n",
+    "            \"port\": \"8108\",       # For Typesense Cloud use 443\n",
+    "            \"protocol\": \"http\"    # For Typesense Cloud use https\n",
+    "        }],\n",
+    "        \"api_key\": \"xyz\",\n",
+    "        \"connection_timeout_seconds\": 60\n",
+    "    })"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "### Index data\n",
+    "\n",
+    "To index vectors in Typesense, we'll first create a Collection (which is a collection of Documents) and turn on vector indexing for a particular field. You can even store multiple vector fields in a single document."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "outputs": [],
+   "source": [
+    "# Delete the existing collection if it already exists\n",
+    "try:\n",
+    "    typesense_client.collections['wikipedia_articles'].delete()\n",
+    "except Exception:\n",
+    "    pass\n",
+    "\n",
+    "# Create a new collection\n",
+    "\n",
+    "schema = {\n",
+    "    \"name\": \"wikipedia_articles\",\n",
+    "    \"fields\": [\n",
+    "        {\n",
+    "            \"name\": \"content_vector\",\n",
+    "            \"type\": \"float[]\",\n",
+    "            \"num_dim\": len(article_df['content_vector'][0])\n",
+    "        },\n",
+    "        {\n",
+    "            \"name\": \"title_vector\",\n",
+    "            \"type\": \"float[]\",\n",
+    "            \"num_dim\": len(article_df['title_vector'][0])\n",
+    "        }\n",
+    "    ]\n",
+    "}\n",
+    "\n",
+    "create_response = typesense_client.collections.create(schema)\n",
+    "print(create_response)\n",
+    "\n",
+    "print(\"Created new collection wikipedia_articles\")"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "outputs": [],
+   "source": [
+    "# Upsert the vector data into the collection we just created\n",
+    "#\n",
+    "# Note: This can take a few minutes, especially if you're on an M1 and running Docker in emulated mode\n",
+    "\n",
+    "print(\"Indexing vectors in Typesense...\")\n",
+    "\n",
+    "document_counter = 0\n",
+    "documents_batch = []\n",
+    "\n",
+    "for k,v in article_df.iterrows():\n",
+    "    # Create a document with the vector data\n",
+    "\n",
+    "    # A document can also carry fields that are not declared in the schema.\n",
+    "    # These are stored on disk and returned when the document is a hit,\n",
+    "    # which is useful for attributes needed only for display.\n",
+    "\n",
+    "    document = {\n",
+    "        \"title_vector\": v[\"title_vector\"],\n",
+    "        \"content_vector\": v[\"content_vector\"],\n",
+    "        \"title\": v[\"title\"],\n",
+    "        \"content\": v[\"text\"],\n",
+    "    }\n",
+    "    documents_batch.append(document)\n",
+    "    document_counter += 1\n",
+    "\n",
+    "    # Upsert a batch of 100 documents\n",
+    "    if document_counter % 100 == 0 or document_counter == len(article_df):\n",
+    "        response = typesense_client.collections['wikipedia_articles'].documents.import_(documents_batch)\n",
+    "        # print(response)\n",
+    "\n",
+    "        documents_batch = []\n",
+    "        print(f\"Processed {document_counter} / {len(article_df)}\")\n",
+    "\n",
+    "print(f\"Imported {len(article_df)} articles.\")"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 36,
+   "outputs": [],
+   "source": [
+    "# Check the number of documents imported\n",
+    "\n",
+    "collection = typesense_client.collections['wikipedia_articles'].retrieve()\n",
+    "print(f'Collection has {collection[\"num_documents\"]} documents')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "### Search Data\n",
+    "\n",
+    "Now that we've imported the vectors into Typesense, we can do a nearest neighbor search on the `title_vector` or `content_vector` field."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "outputs": [],
+   "source": [
+    "def query_typesense(query, field='title', top_k=20):\n",
+    "\n",
+    "    # Creates embedding vector from user query\n",
+    "    openai.api_key = os.getenv(\"OPENAI_API_KEY\", \"sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\")\n",
+    "    embedded_query = openai.Embedding.create(\n",
+    "        input=query,\n",
+    "        model=EMBEDDING_MODEL,\n",
+    "    )['data'][0]['embedding']\n",
+    "\n",
+    "    typesense_results = typesense_client.multi_search.perform({\n",
+    "        \"searches\": [{\n",
+    "            \"q\": \"*\",\n",
+    "            \"collection\": \"wikipedia_articles\",\n",
+    "            \"vector_query\": f\"{field}_vector:([{','.join(str(v) for v in embedded_query)}], k:{top_k})\"\n",
+    "        }]\n",
+    "    }, {})\n",
+    "\n",
+    "    return typesense_results"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1. Museum of Modern Art (Distance: 0.12471389770507812)\n",
+      "2. Renaissance art (Distance: 0.13575094938278198)\n",
+      "3. Pop art (Distance: 0.13949453830718994)\n",
+      "4. Hellenistic art (Distance: 0.14710968732833862)\n",
+      "5. Modernist literature (Distance: 0.15288257598876953)\n",
+      "6. Art film (Distance: 0.15657293796539307)\n",
+      "7. Art (Distance: 0.15847939252853394)\n",
+      "8. Byzantine art (Distance: 0.1591007113456726)\n",
+      "9. Postmodernism (Distance: 0.15989065170288086)\n",
+      "10. Cubism (Distance: 0.16093528270721436)\n"
+     ]
+    }
+   ],
+   "source": [
+    "query_results = query_typesense('modern art in Europe', 'title')\n",
+    "\n",
+    "for i, hit in enumerate(query_results['results'][0]['hits']):\n",
+    "    document = hit[\"document\"]\n",
+    "    vector_distance = hit[\"vector_distance\"]\n",
+    "    print(f'{i + 1}. {document[\"title\"]} (Distance: {vector_distance})')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1. Battle of Bannockburn (Distance: 0.1306602954864502)\n",
+      "2. Wars of Scottish Independence (Distance: 0.13851898908615112)\n",
+      "3. 1651 (Distance: 0.14746594429016113)\n",
+      "4. First War of Scottish Independence (Distance: 0.15035754442214966)\n",
+      "5. Robert I of Scotland (Distance: 0.1538146734237671)\n",
+      "6. 841 (Distance: 0.15609896183013916)\n",
+      "7. 1716 (Distance: 0.15618199110031128)\n",
+      "8. 1314 (Distance: 0.16281157732009888)\n",
+      "9. William Wallace (Distance: 0.16468697786331177)\n",
+      "10. Stirling (Distance: 0.16858011484146118)\n"
+     ]
+    }
+   ],
+   "source": [
+    "query_results = query_typesense('Famous battles in Scottish history', 'content')\n",
+    "\n",
+    "for i, hit in enumerate(query_results['results'][0]['hits']):\n",
+    "    document = hit[\"document\"]\n",
+    "    vector_distance = hit[\"vector_distance\"]\n",
+    "    print(f'{i + 1}. {document[\"title\"]} (Distance: {vector_distance})')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
  {
   "cell_type": "markdown",
   "id": "55afccbf",
--- a/examples/vector_databases/typesense/.gitignore
+++ b/examples/vector_databases/typesense/.gitignore
@@ -0,0 +1 @@
+typesense-data
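
Note on the notebook's `query_typesense` cell: the `vector_query` string it assembles inline, and the hit parsing in the cells that follow it, can be factored into small helpers that are checkable without a running server. The sketch below is one possible shape, not part of this commit. `build_vector_query` and `summarize_hits` are hypothetical names (they are not Typesense client functions), and the response layout is assumed to be the `results[0]["hits"][i]["vector_distance"]` structure the notebook cells already rely on.

```python
def build_vector_query(field, embedding, top_k=20):
    """Build a vector_query string in the same format the notebook uses."""
    values = ",".join(str(v) for v in embedding)
    return f"{field}_vector:([{values}], k:{top_k})"


def summarize_hits(multi_search_response, limit=10):
    """Return (title, vector_distance) pairs from the first search in the response."""
    hits = multi_search_response["results"][0]["hits"]
    return [(hit["document"]["title"], hit["vector_distance"]) for hit in hits[:limit]]


# Illustration only: a made-up two-dimensional embedding and a hand-written response.
print(build_vector_query("title", [0.1, 0.2], top_k=5))
fake_response = {
    "results": [{"hits": [{"document": {"title": "Cubism"}, "vector_distance": 0.16}]}]
}
print(summarize_hits(fake_response))
```

Keeping the string construction in one place makes it easier to switch between `title_vector` and `content_vector` or to change `k` without touching the search call.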
--- a/examples/vector_databases/typesense/docker-compose.yml
+++ b/examples/vector_databases/typesense/docker-compose.yml
@@ -0,0 +1,10 @@
+version: '3.4'
+services:
+  typesense:
+    image: typesense/typesense:0.24.0
+    restart: on-failure
+    ports:
+      - "8108:8108"
+    volumes:
+      - ./typesense-data:/data
+    command: '--data-dir /data --api-key=xyz --enable-cors'
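
Note on using this compose file: after `docker-compose up -d` the container may need a moment before it accepts requests, so it can help to wait for the server before running the notebook's indexing cells. The following is a minimal sketch, not part of this commit. It assumes the server exposes a `GET /health` endpoint on port `8108` that returns a JSON body containing `"ok": true`; `fetch_health` and `wait_for_typesense` are hypothetical helper names.

```python
import json
import time
import urllib.request


def fetch_health(url):
    """GET the health endpoint and return the decoded JSON body."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_typesense(url="http://localhost:8108/health", attempts=30,
                       delay=1.0, fetch=fetch_health):
    """Poll the (assumed) health endpoint until it reports ok or attempts run out."""
    for _ in range(attempts):
        try:
            if fetch(url).get("ok") is True:
                return True
        except (OSError, ValueError):
            # Connection refused or timed out, or the body was not valid JSON
            pass
        time.sleep(delay)
    return False
```

Calling `wait_for_typesense()` before creating the collection returns `True` once the server answers and `False` if it never does. The `fetch` parameter exists only so the polling logic can be exercised without a live server.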