{ "cells": [ { "cell_type": "markdown", "id": "9787b308", "metadata": {}, "source": [ "# Rockset\n", "\n", ">[Rockset](https://rockset.com/) is a real-time search and analytics database built for the cloud. Rockset uses a [Converged Index™](https://rockset.com/blog/converged-indexing-the-secret-sauce-behind-rocksets-fast-queries/) with an efficient store for vector embeddings to serve low latency, high concurrency search queries at scale. Rockset has full support for metadata filtering and handles real-time ingestion for constantly updating, streaming data.\n", "\n", "This notebook demonstrates how to use `Rockset` as a vector store in LangChain. Before getting started, make sure you have access to a `Rockset` account and an API key. [Start your free trial today.](https://rockset.com/create/)\n" ] }, { "cell_type": "markdown", "id": "b823d64a", "metadata": {}, "source": [ "## Setting Up Your Environment\n", "\n", "1. Use the `Rockset` console to create a [collection](https://rockset.com/docs/collections/) with the Write API as your source. In this walkthrough, we create a collection named `langchain_demo`.\n", "\n", "   Configure the following [ingest transformation](https://rockset.com/docs/ingest-transformation/) to mark your embeddings field and take advantage of performance and storage optimizations. Replace the `#length_of_vector_embedding` placeholder with the dimension of the embedding vectors you plan to store:" ] }, { "cell_type": "code", "execution_count": null, "id": "aac58387", "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [], "source": [ "SELECT _input.* EXCEPT(_meta), \n", "VECTOR_ENFORCE(_input.description_embedding, #length_of_vector_embedding, 'float') as description_embedding \n", "FROM _input" ] }, { "cell_type": "markdown", "id": "df380e1c", "metadata": {}, "source": [ "2. After creating your collection, use the console to retrieve an [API key](https://rockset.com/docs/iam/#users-api-keys-and-roles). 
For the purpose of this notebook, we assume you are using the `Oregon (us-west-2)` region.\n", "\n", "3. Install the [rockset-python-client](https://github.com/rockset/rockset-python-client) to enable LangChain to communicate directly with `Rockset`." ] }, { "cell_type": "code", "execution_count": null, "id": "00d16b83", "metadata": {}, "outputs": [], "source": [ "pip install rockset" ] }, { "cell_type": "markdown", "id": "e79550eb", "metadata": {}, "source": [ "## LangChain Tutorial\n", "\n", "Follow along in your own Python notebook to generate and store vector embeddings in Rockset.\n", "Start using Rockset to search for documents similar to your search queries.\n", "\n", "### 1. Define Key Variables" ] },
{ "cell_type": "code", "execution_count": null, "id": "29505c1e", "metadata": {}, "outputs": [], "source": [ "import os\n", "import rockset\n", "\n", "# Make sure the ROCKSET_API_KEY environment variable is set before running this cell\n", "ROCKSET_API_KEY = os.environ.get(\"ROCKSET_API_KEY\")\n", "# Change the region if your collection is not in Oregon (us-west-2)\n", "ROCKSET_API_SERVER = rockset.Regions.usw2a1\n", "rockset_client = rockset.RocksetClient(ROCKSET_API_SERVER, ROCKSET_API_KEY)\n", "\n", "COLLECTION_NAME = 'langchain_demo'\n", "TEXT_KEY = 'description'\n", "EMBEDDING_KEY = 'description_embedding'" ] },
{ "cell_type": "markdown", "id": "07625be2", "metadata": {}, "source": [ "### 2. Prepare Documents" ] }, { "cell_type": "code", "execution_count": null, "id": "9740d8c4", "metadata": {}, "outputs": [], "source": [ "from langchain.embeddings.openai import OpenAIEmbeddings\n", "from langchain.text_splitter import CharacterTextSplitter\n", "from langchain.document_loaders import TextLoader\n", "from langchain.vectorstores import Rockset\n", "\n", "loader = TextLoader('../../../state_of_the_union.txt')\n", "documents = loader.load()\n", "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n", "docs = text_splitter.split_documents(documents)" ] },
{ "cell_type": "markdown", "id": "a068be18", "metadata": {}, "source": [ "### 3. Insert Documents" ] }, { "cell_type": "code", "execution_count": null, "id": "85b6a6c5", "metadata": {}, "outputs": [], "source": [ "embeddings = OpenAIEmbeddings()  # Requires the OPENAI_API_KEY environment variable\n", "\n", "docsearch = Rockset(\n", "    client=rockset_client,\n", "    embeddings=embeddings,\n", "    collection_name=COLLECTION_NAME,\n", "    text_key=TEXT_KEY,\n", "    embedding_key=EMBEDDING_KEY,\n", ")\n", "\n", "ids = docsearch.add_texts(\n", "    texts=[d.page_content for d in docs],\n", "    metadatas=[d.metadata for d in docs],\n", ")" ] },
{ "cell_type": "markdown", "id": "56eef48d", "metadata": {}, "source": [ "### 4. Search for Similar Documents" ] }, { "cell_type": "code", "execution_count": null, "id": "0bbf3df0", "metadata": {}, "outputs": [], "source": [ "query = \"What did the president say about Ketanji Brown Jackson?\"\n", "output = docsearch.similarity_search_with_relevance_scores(\n", "    query, 4, 
Rockset.DistanceFunction.COSINE_SIM\n", ")\n", "print(\"output length:\", len(output))\n", "for d, score in output:\n", "    print(score, d.metadata, d.page_content[:20] + '...')\n", "\n", "##\n", "# output length: 4\n", "# 0.764990692109871 {'source': '../../../state_of_the_union.txt'} Madam Speaker, Madam...\n", "# 0.7485416901622112 {'source': '../../../state_of_the_union.txt'} And I’m taking robus...\n", "# 0.7468678973398306 {'source': '../../../state_of_the_union.txt'} And so many families...\n", "# 0.7436231261419488 {'source': '../../../state_of_the_union.txt'} Groups of citizens b..." ] }, { "cell_type": "markdown", "id": "7037a22f", "metadata": {}, "source": [ "### 5. Search for Similar Documents with Filtering" ] }, { "cell_type": "code", "execution_count": null, "id": "b64a290f", "metadata": {}, "outputs": [], "source": [ "output = docsearch.similarity_search_with_relevance_scores(\n", "    query,\n", "    4,\n", "    Rockset.DistanceFunction.COSINE_SIM,\n", "    where_str=\"{} NOT LIKE '%citizens%'\".format(TEXT_KEY),\n", ")\n", "print(\"output length:\", len(output))\n", "for d, score in output:\n", "    print(score, d.metadata, d.page_content[:20] + '...')\n", "\n", "##\n", "# output length: 4\n", "# 0.7651359650263554 {'source': '../../../state_of_the_union.txt'} Madam Speaker, Madam...\n", "# 0.7486265516824893 {'source': '../../../state_of_the_union.txt'} And I’m taking robus...\n", "# 0.7469625542348115 {'source': '../../../state_of_the_union.txt'} And so many families...\n", "# 0.7344177777547739 {'source': '../../../state_of_the_union.txt'} We see the unity amo..." ] }, { "attachments": {}, "cell_type": "markdown", "id": "13a52b38", "metadata": {}, "source": [ "### 6. [Optional] Delete Inserted Documents\n", "\n", "To delete documents from your collection, you need the unique ID associated with each one.\n", "You can define IDs when inserting documents with `Rockset.add_texts()`; otherwise, Rockset generates a unique ID for each document. 
Regardless, `Rockset.add_texts()` returns the IDs of inserted documents.\n", "\n", "To delete these documents, use the `Rockset.delete_texts()` function." ] }, { "cell_type": "code", "execution_count": null, "id": "1f755924", "metadata": {}, "outputs": [], "source": [ "docsearch.delete_texts(ids)" ] }, { "cell_type": "markdown", "id": "d468f431", "metadata": {}, "source": [ "## Summary\n", "\n", "In this tutorial, we created a `Rockset` collection, inserted documents with OpenAI embeddings, and searched for similar documents with and without metadata filters.\n", "\n", "Keep an eye on https://rockset.com/ for future updates in this space." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 5 }