Minor improvements to rockset vectorstore (#8416)

This PR makes minor improvements to our python notebook, and adds
support for `Rockset` workspaces in our vectorstore client.

@rlancemartin, @eyurtsev

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
pull/8525/head
Anubhav Bindlish 1 year ago committed by GitHub
parent 893f3014af
commit 913a156cff
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -2,131 +2,141 @@
"cells": [
{
"cell_type": "markdown",
"id": "20b588b4",
"id": "9787b308",
"metadata": {},
"source": [
"# Rockset\n",
"\n",
">[Rockset](https://rockset.com/product/) is a real-time analytics database service for serving low latency, high concurrency analytical queries at scale. It builds a Converged Index™ on structured and semi-structured data with an efficient store for vector embeddings. Its support for running SQL on schemaless data makes it a perfect choice for running vector search with metadata filters. \n",
">[Rockset](https://rockset.com/) is a real-time search and analytics database built for the cloud. Rockset uses a [Converged Index™](https://rockset.com/blog/converged-indexing-the-secret-sauce-behind-rocksets-fast-queries/) with an efficient store for vector embeddings to serve low latency, high concurrency search queries at scale. Rockset has full support for metadata filtering and handles real-time ingestion for constantly updating, streaming data.\n",
"\n",
"This notebook demonstrates how to use `Rockset` as a vectorstore in langchain. To get started, make sure you have a `Rockset` account and an API key available."
"This notebook demonstrates how to use `Rockset` as a vector store in LangChain. Before getting started, make sure you have access to a `Rockset` account and an API key available. [Start your free trial today.](https://rockset.com/create/)\n"
]
},
{
"cell_type": "markdown",
"id": "e290ddc0",
"id": "b823d64a",
"metadata": {},
"source": [
"## Setting up environment\n",
"## Setting Up Your Environment[](https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/rockset#setting-up-environment)\n",
"\n",
"1. Make sure you have Rockset account and go to the web console to get the API key. Details can be found on [the website](https://rockset.com/docs/rest-api/). For the purpose of this notebook, we will assume you're using Rockset from `Oregon(us-west-2)`."
]
},
{
"cell_type": "markdown",
"id": "7d77bbbe",
"metadata": {},
"source": [
"2. Now you will need to create a Rockset collection to write to, use the Rockset web console to do this. For the purpose of this exercise, we will create a collection called `langchain_demo`. Since Rockset supports schemaless ingest, you don't need to inform us of the shape of metadata for your texts. However, you do need to decide on two columns upfront:\n",
"- Where to store the text. We will use the column `description` for this.\n",
"- Where to store the vector-embedding for the text. We will use the column `description_embedding` for this.\n",
"\n",
"Also you will need to inform Rockset that `description_embedding` is a vector-embedding, so that we can optimize its format. You can do this using a **Rockset ingest transformation** while creating your collection:"
"1. Leverage the `Rockset` console to create a [collection](https://rockset.com/docs/collections/) with the Write API as your source. In this walkthrough, we create a collection named `langchain_demo`. \n",
" \n",
" Configure the following [ingest transformation](https://rockset.com/docs/ingest-transformation/) to mark your embeddings field and take advantage of performance and storage optimizations:"
]
},
{
"cell_type": "raw",
"id": "3daa76ba",
"metadata": {},
"cell_type": "code",
"execution_count": null,
"id": "aac58387",
"metadata": {
"vscode": {
"languageId": "sql"
}
},
"outputs": [],
"source": [
"SELECT\n",
" _input.* EXCEPT(_meta),\n",
" VECTOR_ENFORCE(_input.description_embedding, #length_of_vector_embedding, 'float') as description_embedding\n",
"FROM\n",
" _input\n",
" \n",
"// We used OpenAI `text-embedding-ada-002` for this examples, where #length_of_vector_embedding = 1536"
"SELECT _input.* EXCEPT(_meta), \n",
"VECTOR_ENFORCE(_input.description_embedding, #length_of_vector_embedding, 'float') as description_embedding \n",
"FROM _input"
]
},
{
"cell_type": "markdown",
"id": "7951c9cd",
"id": "df380e1c",
"metadata": {},
"source": [
"3. Now let's install the [rockset-python-client](https://github.com/rockset/rockset-python-client). This is used by langchain to talk to the Rockset database."
"2. After creating your collection, use the console to retrieve an [API key](https://rockset.com/docs/iam/#users-api-keys-and-roles). For the purpose of this notebook, we assume you are using the `Oregon(us-west-2)` region.\n",
"\n",
"3. Install the [rockset-python-client](https://github.com/rockset/rockset-python-client) to enable LangChain to communicate directly with `Rockset`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2aac7ae6",
"id": "00d16b83",
"metadata": {},
"outputs": [],
"source": [
"!pip install rockset"
]
},
{
"cell_type": "markdown",
"id": "8600900d",
"metadata": {},
"source": [
"This is it! Now you're ready to start writing some python code to store vector embeddings in Rockset, and querying the database to find texts similar to your query! We support 3 distance functions: `COSINE_SIM`, `EUCLIDEAN_DIST` and `DOT_PRODUCT`."
"pip install rockset"
]
},
{
"cell_type": "markdown",
"id": "3bf2f818",
"id": "e79550eb",
"metadata": {},
"source": [
"## Example"
"## LangChain Tutorial\n",
"\n",
"Follow along in your own Python notebook to generate and store vector embeddings in Rockset.\n",
"Start using Rockset to search for documents similar to your search queries.\n",
"\n",
"### 1. Define Key Variables"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7b39626",
"execution_count": 5,
"id": "29505c1e",
"metadata": {},
"outputs": [],
"outputs": [
{
"ename": "InitializationException",
"evalue": "The rockset client was initialized incorrectly: An api key must be provided as a parameter to the RocksetClient or the Configuration object.",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mInitializationException\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[5], line 6\u001b[0m\n\u001b[1;32m 4\u001b[0m ROCKSET_API_KEY \u001b[39m=\u001b[39m os\u001b[39m.\u001b[39menviron\u001b[39m.\u001b[39mget(\u001b[39m\"\u001b[39m\u001b[39mROCKSET_API_KEY\u001b[39m\u001b[39m\"\u001b[39m) \u001b[39m# Verify ROCKSET_API_KEY environment variable\u001b[39;00m\n\u001b[1;32m 5\u001b[0m ROCKSET_API_SERVER \u001b[39m=\u001b[39m rockset\u001b[39m.\u001b[39mRegions\u001b[39m.\u001b[39musw2a1 \u001b[39m# Verify Rockset region\u001b[39;00m\n\u001b[0;32m----> 6\u001b[0m rockset_client \u001b[39m=\u001b[39m rockset\u001b[39m.\u001b[39;49mRocksetClient(ROCKSET_API_SERVER, ROCKSET_API_KEY)\n\u001b[1;32m 8\u001b[0m COLLECTION_NAME\u001b[39m=\u001b[39m\u001b[39m'\u001b[39m\u001b[39mlangchain_demo\u001b[39m\u001b[39m'\u001b[39m\n\u001b[1;32m 9\u001b[0m TEXT_KEY\u001b[39m=\u001b[39m\u001b[39m'\u001b[39m\u001b[39mdescription\u001b[39m\u001b[39m'\u001b[39m\n",
"File \u001b[0;32m~/Library/Python/3.9/lib/python/site-packages/rockset/rockset_client.py:242\u001b[0m, in \u001b[0;36mRocksetClient.__init__\u001b[0;34m(self, host, api_key, max_workers, config)\u001b[0m\n\u001b[1;32m 239\u001b[0m config\u001b[39m.\u001b[39mhost \u001b[39m=\u001b[39m host\n\u001b[1;32m 241\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mnot\u001b[39;00m config\u001b[39m.\u001b[39mapi_key:\n\u001b[0;32m--> 242\u001b[0m \u001b[39mraise\u001b[39;00m InitializationException(\n\u001b[1;32m 243\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mAn api key must be provided as a parameter to the RocksetClient or the Configuration object.\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 244\u001b[0m )\n\u001b[1;32m 246\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mapi_client \u001b[39m=\u001b[39m ApiClient(config, max_workers\u001b[39m=\u001b[39mmax_workers)\n\u001b[1;32m 248\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mAliases \u001b[39m=\u001b[39m AliasesApiWrapper(\u001b[39mself\u001b[39m\u001b[39m.\u001b[39mapi_client)\n",
"\u001b[0;31mInitializationException\u001b[0m: The rockset client was initialized incorrectly: An api key must be provided as a parameter to the RocksetClient or the Configuration object."
]
}
],
"source": [
"import os\n",
"import rockset\n",
"\n",
"# Make sure env variable ROCKSET_API_KEY is set\n",
"ROCKSET_API_KEY = os.environ.get(\"ROCKSET_API_KEY\")\n",
"ROCKSET_API_SERVER = (\n",
" rockset.Regions.usw2a1\n",
") # Make sure this points to the correct Rockset region\n",
"ROCKSET_API_KEY = os.environ.get(\"ROCKSET_API_KEY\") # Verify ROCKSET_API_KEY environment variable\n",
"ROCKSET_API_SERVER = rockset.Regions.usw2a1 # Verify Rockset region\n",
"rockset_client = rockset.RocksetClient(ROCKSET_API_SERVER, ROCKSET_API_KEY)\n",
"\n",
"COLLECTION_NAME = \"langchain_demo\"\n",
"TEXT_KEY = \"description\"\n",
"EMBEDDING_KEY = \"description_embedding\""
"COLLECTION_NAME='langchain_demo'\n",
"TEXT_KEY='description'\n",
"EMBEDDING_KEY='description_embedding'"
]
},
{
"cell_type": "markdown",
"id": "474636a2",
"id": "07625be2",
"metadata": {},
"source": [
"Now let's use this client to create a Rockset Langchain Vectorstore!\n",
"\n",
"### 1. Inserting texts"
"### 2. Prepare Documents"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d73c5bb",
"id": "9740d8c4",
"metadata": {},
"outputs": [],
"outputs": [
{
"ename": "",
"evalue": "",
"output_type": "error",
"traceback": [
"\u001b[1;31mRunning cells with '/opt/local/bin/python3.11' requires the ipykernel package.\n",
"\u001b[1;31mRun the following command to install 'ipykernel' into the Python environment. \n",
"\u001b[1;31mCommand: '/opt/local/bin/python3.11 -m pip install ipykernel -U --user --force-reinstall'"
]
}
],
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.document_loaders import TextLoader\n",
"from langchain.vectorstores import Rockset\n",
"\n",
"loader = TextLoader(\"../../../state_of_the_union.txt\")\n",
"loader = TextLoader('../../../state_of_the_union.txt')\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)"
@ -134,21 +144,31 @@
},
{
"cell_type": "markdown",
"id": "1404cada",
"id": "a068be18",
"metadata": {},
"source": [
"Now we have the documents we want to insert. Let's create a Rockset vectorstore and insert these docs into the Rockset collection. We will use `OpenAIEmbeddings` to create embeddings for the texts, but you're free to use whatever you want."
"### 3. Insert Documents"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "63c98bac",
"id": "85b6a6c5",
"metadata": {},
"outputs": [],
"outputs": [
{
"ename": "",
"evalue": "",
"output_type": "error",
"traceback": [
"\u001b[1;31mRunning cells with '/opt/local/bin/python3.11' requires the ipykernel package.\n",
"\u001b[1;31mRun the following command to install 'ipykernel' into the Python environment. \n",
"\u001b[1;31mCommand: '/opt/local/bin/python3.11 -m pip install ipykernel -U --user --force-reinstall'"
]
}
],
"source": [
"# Make sure the environment variable OPENAI_API_KEY is set up\n",
"embeddings = OpenAIEmbeddings()\n",
"embeddings = OpenAIEmbeddings() # Verify OPENAI_KEY environment variable\n",
"\n",
"docsearch = Rockset(\n",
" client=rockset_client,\n",
@ -158,30 +178,38 @@
" embedding_key=EMBEDDING_KEY,\n",
")\n",
"\n",
"ids = docsearch.add_texts(\n",
"ids=docsearch.add_texts(\n",
" texts=[d.page_content for d in docs],\n",
" metadatas=[d.metadata for d in docs],\n",
")\n",
"\n",
"## If you go to the Rockset console now, you should be able to see this docs along with the metadata `source`"
")"
]
},
{
"cell_type": "markdown",
"id": "f1290844",
"id": "56eef48d",
"metadata": {},
"source": [
"### 2. Searching similar texts\n",
"\n",
"Now let's try to search Rockset to find strings similar to our query string!"
"### 4. Search for Similar Documents"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "96e73ac1",
"execution_count": 1,
"id": "0bbf3df0",
"metadata": {},
"outputs": [],
"outputs": [
{
"ename": "NameError",
"evalue": "name 'docsearch' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[1], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m query \u001b[39m=\u001b[39m \u001b[39m\"\u001b[39m\u001b[39mWhat did the president say about Ketanji Brown Jackson?\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[0;32m----> 2\u001b[0m output \u001b[39m=\u001b[39m docsearch\u001b[39m.\u001b[39msimilarity_search_with_relevance_scores(query, \u001b[39m4\u001b[39m, Rockset\u001b[39m.\u001b[39mDistanceFunction\u001b[39m.\u001b[39mCOSINE_SIM)\n\u001b[1;32m 4\u001b[0m \u001b[39mprint\u001b[39m(\u001b[39m\"\u001b[39m\u001b[39moutput length:\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39mlen\u001b[39m(output))\n\u001b[1;32m 5\u001b[0m \u001b[39mfor\u001b[39;00m d, dist \u001b[39min\u001b[39;00m output:\n",
"\u001b[0;31mNameError\u001b[0m: name 'docsearch' is not defined"
]
}
],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"output = docsearch.similarity_search_with_relevance_scores(\n",
@ -189,7 +217,7 @@
")\n",
"print(\"output length:\", len(output))\n",
"for d, dist in output:\n",
" print(dist, d.metadata, d.page_content[:20] + \"...\")\n",
" print(dist, d.metadata, d.page_content[:20] + '...')\n",
"\n",
"##\n",
"# output length: 4\n",
@ -201,20 +229,16 @@
},
{
"cell_type": "markdown",
"id": "5e15d630",
"id": "7037a22f",
"metadata": {},
"source": [
"You can also use a where filter to prune your search space. You can add filters on text key, or any of the metadata fields. \n",
"\n",
"> **Note**: Since Rockset stores each metadata field as a separate column internally, these filters are much faster than other vector databases which store all metadata as a single JSON.\n",
"\n",
"For eg, to find all texts NOT containing the substring \"and\", you can use the following code:"
"### 5. Search for Similar Documents with Filtering"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c1c44d41",
"id": "b64a290f",
"metadata": {},
"outputs": [],
"source": [
@ -226,7 +250,7 @@
")\n",
"print(\"output length:\", len(output))\n",
"for d, dist in output:\n",
" print(dist, d.metadata, d.page_content[:20] + \"...\")\n",
" print(dist, d.metadata, d.page_content[:20] + '...')\n",
"\n",
"##\n",
"# output length: 4\n",
@ -239,12 +263,13 @@
{
"attachments": {},
"cell_type": "markdown",
"id": "0765b822",
"id": "13a52b38",
"metadata": {},
"source": [
"### 3. [Optional] Drop all inserted documents\n",
"### 6. [Optional] Delete Inserted Documents\n",
"\n",
"In order to delete texts from the Rockset collection, you need to know the unique ID associated with each document inside Rockset. These ids can either be supplied directly by the user while inserting the texts (in the `Rockset.add_texts()` function), else Rockset will generate a unique ID or each document. Either way, `Rockset.add_texts()` returns the ids for the inserted documents.\n",
"You must have the unique ID associated with each document to delete them from your collection.\n",
"Define IDs when inserting documents with `Rockset.add_texts()`. Rockset will otherwise generate a unique ID for each document. Regardless, `Rockset.add_texts()` returns the IDs of inserted documents.\n",
"\n",
"To delete these docs, simply use the `Rockset.delete_texts()` function."
]
@ -252,7 +277,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "31738966",
"id": "1f755924",
"metadata": {},
"outputs": [],
"source": [
@ -261,23 +286,15 @@
},
{
"cell_type": "markdown",
"id": "03fa12a9",
"id": "d468f431",
"metadata": {},
"source": [
"## Congratulations!\n",
"## Summary\n",
"\n",
"Voila! In this example you successfuly created a Rockset collection, inserted documents along with their OpenAI vector embeddings, and searched for similar docs both with and without any metadata filters.\n",
"In this tutorial, we successfully created a `Rockset` collection, `inserted` documents with OpenAI embeddings, and searched for similar documents with and without metadata filters.\n",
"\n",
"Keep an eye on https://rockset.com/blog/introducing-vector-search-on-rockset/ for future updates in this space!"
"Keep an eye on https://rockset.com/ for future updates in this space."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2763dddb-e87d-4d3b-b0bf-c246b0573d87",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@ -296,7 +313,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
"version": "3.9.6"
}
},
"nbformat": 4,

@ -23,7 +23,6 @@ class Rockset(VectorStore):
See: https://rockset.com/blog/introducing-vector-search-on-rockset/ for more details
Everything below assumes `commons` Rockset workspace.
TODO: Add support for workspace args.
Example:
.. code-block:: python
@ -50,6 +49,7 @@ class Rockset(VectorStore):
collection_name: str,
text_key: str,
embedding_key: str,
workspace: str = "commons",
):
"""Initialize with Rockset client.
Args:
@ -82,6 +82,7 @@ class Rockset(VectorStore):
self._embeddings = embeddings
self._text_key = text_key
self._embedding_key = embedding_key
self._workspace = workspace
@property
def embeddings(self) -> Embeddings:
@ -303,7 +304,7 @@ class Rockset(VectorStore):
where_str = f"WHERE {where_str}\n" if where_str else ""
return f"""\
SELECT * EXCEPT({self._embedding_key}), {distance_str}
FROM {self._collection_name}
FROM {self._workspace}.{self._collection_name}
{where_str}\
ORDER BY dist {distance_func.order_by()}
LIMIT {str(k)}
@ -311,7 +312,7 @@ LIMIT {str(k)}
def _write_documents_to_rockset(self, batch: List[dict]) -> List[str]:
add_doc_res = self._client.Documents.add_documents(
collection=self._collection_name, data=batch
collection=self._collection_name, data=batch, workspace=self._workspace
)
return [doc_status._id for doc_status in add_doc_res.data]
@ -328,4 +329,5 @@ LIMIT {str(k)}
self._client.Documents.delete_documents(
collection=self._collection_name,
data=[DeleteDocumentsRequestData(id=i) for i in ids],
workspace=self._workspace,
)

@ -34,12 +34,12 @@ def test_sql_query() -> None:
client = rockset.RocksetClient(host, api_key)
col_1 = "Rockset is a real-time analytics database which enables queries on massive, semi-structured data without operational burden. Rockset is serverless and fully managed. It offloads the work of managing configuration, cluster provisioning, denormalization, and shard / index management. Rockset is also SOC 2 Type II compliant and offers encryption at rest and in flight, securing and protecting any sensitive data. Most teams can ingest data into Rockset and start executing queries in less than 15 minutes." # noqa: E501
col_1 = "Rockset is a real-time analytics database"
col_2 = 2
col_3 = "e903e069-b0b5-4b80-95e2-86471b41f55f"
id = 7320132
"""Run a simple SQL query query"""
"""Run a simple SQL query"""
loader = RocksetLoader(
client,
rockset.models.QueryRequestSql(

@ -33,6 +33,7 @@ logger = logging.getLogger(__name__)
#
# See https://rockset.com/blog/introducing-vector-search-on-rockset/ for more details.
workspace = "langchain_tests"
collection_name = "langchain_demo"
text_key = "description"
embedding_key = "description_embedding"
@ -71,10 +72,9 @@ class TestRockset:
"Deleting all existing documents from the Rockset collection %s",
collection_name,
)
query = f"select _id from {workspace}.{collection_name}"
query_response = client.Queries.query(
sql={"query": "select _id from {}".format(collection_name)}
)
query_response = client.Queries.query(sql={"query": query})
ids = [
str(r["_id"])
for r in getattr(
@ -85,12 +85,13 @@ class TestRockset:
client.Documents.delete_documents(
collection=collection_name,
data=[rockset.models.DeleteDocumentsRequestData(id=i) for i in ids],
workspace=workspace,
)
embeddings = ConsistentFakeEmbeddings()
embeddings.embed_documents(fake_texts)
cls.rockset_vectorstore = Rockset(
client, embeddings, collection_name, text_key, embedding_key
client, embeddings, collection_name, text_key, embedding_key, workspace
)
def test_rockset_insert_and_search(self) -> None:
@ -127,9 +128,9 @@ class TestRockset:
)
vector_str = ",".join(map(str, vector))
expected = f"""\
SELECT * EXCEPT(description_embedding), \
COSINE_SIM(description_embedding, [{vector_str}]) as dist
FROM langchain_demo
SELECT * EXCEPT({embedding_key}), \
COSINE_SIM({embedding_key}, [{vector_str}]) as dist
FROM {workspace}.{collection_name}
ORDER BY dist DESC
LIMIT 4
"""
@ -145,9 +146,9 @@ LIMIT 4
)
vector_str = ",".join(map(str, vector))
expected = f"""\
SELECT * EXCEPT(description_embedding), \
COSINE_SIM(description_embedding, [{vector_str}]) as dist
FROM langchain_demo
SELECT * EXCEPT({embedding_key}), \
COSINE_SIM({embedding_key}, [{vector_str}]) as dist
FROM {workspace}.{collection_name}
WHERE age >= 10
ORDER BY dist DESC
LIMIT 4

Loading…
Cancel
Save