langchain/docs/extras/integrations/vectorstores/rockset.ipynb

304 lines
10 KiB
Plaintext
Raw Normal View History

{
"cells": [
{
"cell_type": "markdown",
"id": "20b588b4",
"metadata": {},
"source": [
"# Rockset\n",
"\n",
">[Rockset](https://rockset.com/product/) is a real-time analytics database service for serving low latency, high concurrency analytical queries at scale. It builds a Converged Index™ on structured and semi-structured data with an efficient store for vector embeddings. Its support for running SQL on schemaless data makes it a perfect choice for running vector search with metadata filters. \n",
"\n",
"This notebook demonstrates how to use `Rockset` as a vectorstore in langchain. To get started, make sure you have a `Rockset` account and an API key available."
]
},
{
"cell_type": "markdown",
"id": "e290ddc0",
"metadata": {},
"source": [
"## Setting up environment\n",
"\n",
"1. Make sure you have Rockset account and go to the web console to get the API key. Details can be found on [the website](https://rockset.com/docs/rest-api/). For the purpose of this notebook, we will assume you're using Rockset from `Oregon(us-west-2)`."
]
},
{
"cell_type": "markdown",
"id": "7d77bbbe",
"metadata": {},
"source": [
"2. Now you will need to create a Rockset collection to write to, use the Rockset web console to do this. For the purpose of this exercise, we will create a collection called `langchain_demo`. Since Rockset supports schemaless ingest, you don't need to inform us of the shape of metadata for your texts. However, you do need to decide on two columns upfront:\n",
"- Where to store the text. We will use the column `description` for this.\n",
"- Where to store the vector-embedding for the text. We will use the column `description_embedding` for this.\n",
"\n",
"Also you will need to inform Rockset that `description_embedding` is a vector-embedding, so that we can optimize its format. You can do this using a **Rockset ingest transformation** while creating your collection:"
]
},
{
"cell_type": "raw",
"id": "3daa76ba",
"metadata": {},
"source": [
"SELECT\n",
" _input.* EXCEPT(_meta),\n",
" VECTOR_ENFORCE(_input.description_embedding, #length_of_vector_embedding, 'float') as description_embedding\n",
"FROM\n",
" _input\n",
" \n",
"// We used OpenAI `text-embedding-ada-002` for this examples, where #length_of_vector_embedding = 1536"
]
},
{
"cell_type": "markdown",
"id": "7951c9cd",
"metadata": {},
"source": [
"3. Now let's install the [rockset-python-client](https://github.com/rockset/rockset-python-client). This is used by langchain to talk to the Rockset database."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2aac7ae6",
"metadata": {},
"outputs": [],
"source": [
"!pip install rockset"
]
},
{
"cell_type": "markdown",
"id": "8600900d",
"metadata": {},
"source": [
"This is it! Now you're ready to start writing some python code to store vector embeddings in Rockset, and querying the database to find texts similar to your query! We support 3 distance functions: `COSINE_SIM`, `EUCLIDEAN_DIST` and `DOT_PRODUCT`."
]
},
{
"cell_type": "markdown",
"id": "3bf2f818",
"metadata": {},
"source": [
"## Example"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7b39626",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import rockset\n",
"\n",
"# Make sure env variable ROCKSET_API_KEY is set\n",
"ROCKSET_API_KEY = os.environ.get(\"ROCKSET_API_KEY\")\n",
Fix `make docs_build` and related scripts (#7276) **Description: a description of the change** Fixed `make docs_build` and related scripts which caused errors. There are several changes. First, I made the build of the documentation and the API Reference into two separate commands. This is because it takes less time to build. The commands for documents are `make docs_build`, `make docs_clean`, and `make docs_linkcheck`. The commands for API Reference are `make api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`. It looked like `docs/.local_build.sh` could be used to build the documentation, so I used that. Since `.local_build.sh` was also building API Rerefence internally, I removed that process. `.local_build.sh` also added some Bash options to stop in error or so. Futher more added `cd "${SCRIPT_DIR}"` at the beginning so that the script will work no matter which directory it is executed in. `docs/api_reference/api_reference.rst` is removed, because which is generated by `docs/api_reference/create_api_rst.py`, and added it to .gitignore. Finally, the description of CONTRIBUTING.md was modified. **Issue: the issue # it fixes (if applicable)** https://github.com/hwchase17/langchain/issues/6413 **Dependencies: any dependencies required for this change** `nbdoc` was missing in group docs so it was added. I installed it with the `poetry add --group docs nbdoc` command. I am concerned if any modifications are needed to poetry.lock. I would greatly appreciate it if you could pay close attention to this file during the review. **Tag maintainer** - General / Misc / if you don't know who to tag: @baskaryan If this PR needs any additional changes, I'll be happy to make them! --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-12 02:05:14 +00:00
"ROCKSET_API_SERVER = (\n",
" rockset.Regions.usw2a1\n",
") # Make sure this points to the correct Rockset region\n",
"rockset_client = rockset.RocksetClient(ROCKSET_API_SERVER, ROCKSET_API_KEY)\n",
"\n",
Fix `make docs_build` and related scripts (#7276) **Description: a description of the change** Fixed `make docs_build` and related scripts which caused errors. There are several changes. First, I made the build of the documentation and the API Reference into two separate commands. This is because it takes less time to build. The commands for documents are `make docs_build`, `make docs_clean`, and `make docs_linkcheck`. The commands for API Reference are `make api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`. It looked like `docs/.local_build.sh` could be used to build the documentation, so I used that. Since `.local_build.sh` was also building API Rerefence internally, I removed that process. `.local_build.sh` also added some Bash options to stop in error or so. Futher more added `cd "${SCRIPT_DIR}"` at the beginning so that the script will work no matter which directory it is executed in. `docs/api_reference/api_reference.rst` is removed, because which is generated by `docs/api_reference/create_api_rst.py`, and added it to .gitignore. Finally, the description of CONTRIBUTING.md was modified. **Issue: the issue # it fixes (if applicable)** https://github.com/hwchase17/langchain/issues/6413 **Dependencies: any dependencies required for this change** `nbdoc` was missing in group docs so it was added. I installed it with the `poetry add --group docs nbdoc` command. I am concerned if any modifications are needed to poetry.lock. I would greatly appreciate it if you could pay close attention to this file during the review. **Tag maintainer** - General / Misc / if you don't know who to tag: @baskaryan If this PR needs any additional changes, I'll be happy to make them! --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-12 02:05:14 +00:00
"COLLECTION_NAME = \"langchain_demo\"\n",
"TEXT_KEY = \"description\"\n",
"EMBEDDING_KEY = \"description_embedding\""
]
},
{
"cell_type": "markdown",
"id": "474636a2",
"metadata": {},
"source": [
"Now let's use this client to create a Rockset Langchain Vectorstore!\n",
"\n",
"### 1. Inserting texts"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d73c5bb",
"metadata": {},
"outputs": [],
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.document_loaders import TextLoader\n",
"from langchain.vectorstores.rocksetdb import RocksetDB\n",
"\n",
Fix `make docs_build` and related scripts (#7276) **Description: a description of the change** Fixed `make docs_build` and related scripts which caused errors. There are several changes. First, I made the build of the documentation and the API Reference into two separate commands. This is because it takes less time to build. The commands for documents are `make docs_build`, `make docs_clean`, and `make docs_linkcheck`. The commands for API Reference are `make api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`. It looked like `docs/.local_build.sh` could be used to build the documentation, so I used that. Since `.local_build.sh` was also building API Rerefence internally, I removed that process. `.local_build.sh` also added some Bash options to stop in error or so. Futher more added `cd "${SCRIPT_DIR}"` at the beginning so that the script will work no matter which directory it is executed in. `docs/api_reference/api_reference.rst` is removed, because which is generated by `docs/api_reference/create_api_rst.py`, and added it to .gitignore. Finally, the description of CONTRIBUTING.md was modified. **Issue: the issue # it fixes (if applicable)** https://github.com/hwchase17/langchain/issues/6413 **Dependencies: any dependencies required for this change** `nbdoc` was missing in group docs so it was added. I installed it with the `poetry add --group docs nbdoc` command. I am concerned if any modifications are needed to poetry.lock. I would greatly appreciate it if you could pay close attention to this file during the review. **Tag maintainer** - General / Misc / if you don't know who to tag: @baskaryan If this PR needs any additional changes, I'll be happy to make them! --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-12 02:05:14 +00:00
"loader = TextLoader(\"../../../state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)"
]
},
{
"cell_type": "markdown",
"id": "1404cada",
"metadata": {},
"source": [
"Now we have the documents we want to insert. Let's create a Rockset vectorstore and insert these docs into the Rockset collection. We will use `OpenAIEmbeddings` to create embeddings for the texts, but you're free to use whatever you want."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "63c98bac",
"metadata": {},
"outputs": [],
"source": [
"# Make sure the environment variable OPENAI_API_KEY is set up\n",
"embeddings = OpenAIEmbeddings()\n",
"\n",
Fix `make docs_build` and related scripts (#7276) **Description: a description of the change** Fixed `make docs_build` and related scripts which caused errors. There are several changes. First, I made the build of the documentation and the API Reference into two separate commands. This is because it takes less time to build. The commands for documents are `make docs_build`, `make docs_clean`, and `make docs_linkcheck`. The commands for API Reference are `make api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`. It looked like `docs/.local_build.sh` could be used to build the documentation, so I used that. Since `.local_build.sh` was also building API Rerefence internally, I removed that process. `.local_build.sh` also added some Bash options to stop in error or so. Futher more added `cd "${SCRIPT_DIR}"` at the beginning so that the script will work no matter which directory it is executed in. `docs/api_reference/api_reference.rst` is removed, because which is generated by `docs/api_reference/create_api_rst.py`, and added it to .gitignore. Finally, the description of CONTRIBUTING.md was modified. **Issue: the issue # it fixes (if applicable)** https://github.com/hwchase17/langchain/issues/6413 **Dependencies: any dependencies required for this change** `nbdoc` was missing in group docs so it was added. I installed it with the `poetry add --group docs nbdoc` command. I am concerned if any modifications are needed to poetry.lock. I would greatly appreciate it if you could pay close attention to this file during the review. **Tag maintainer** - General / Misc / if you don't know who to tag: @baskaryan If this PR needs any additional changes, I'll be happy to make them! --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-12 02:05:14 +00:00
"docsearch = RocksetDB(\n",
" client=rockset_client,\n",
" embeddings=embeddings,\n",
" collection_name=COLLECTION_NAME,\n",
" text_key=TEXT_KEY,\n",
" embedding_key=EMBEDDING_KEY,\n",
")\n",
"\n",
Fix `make docs_build` and related scripts (#7276) **Description: a description of the change** Fixed `make docs_build` and related scripts which caused errors. There are several changes. First, I made the build of the documentation and the API Reference into two separate commands. This is because it takes less time to build. The commands for documents are `make docs_build`, `make docs_clean`, and `make docs_linkcheck`. The commands for API Reference are `make api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`. It looked like `docs/.local_build.sh` could be used to build the documentation, so I used that. Since `.local_build.sh` was also building API Rerefence internally, I removed that process. `.local_build.sh` also added some Bash options to stop in error or so. Futher more added `cd "${SCRIPT_DIR}"` at the beginning so that the script will work no matter which directory it is executed in. `docs/api_reference/api_reference.rst` is removed, because which is generated by `docs/api_reference/create_api_rst.py`, and added it to .gitignore. Finally, the description of CONTRIBUTING.md was modified. **Issue: the issue # it fixes (if applicable)** https://github.com/hwchase17/langchain/issues/6413 **Dependencies: any dependencies required for this change** `nbdoc` was missing in group docs so it was added. I installed it with the `poetry add --group docs nbdoc` command. I am concerned if any modifications are needed to poetry.lock. I would greatly appreciate it if you could pay close attention to this file during the review. **Tag maintainer** - General / Misc / if you don't know who to tag: @baskaryan If this PR needs any additional changes, I'll be happy to make them! --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-12 02:05:14 +00:00
"ids = docsearch.add_texts(\n",
" texts=[d.page_content for d in docs],\n",
" metadatas=[d.metadata for d in docs],\n",
")\n",
"\n",
"## If you go to the Rockset console now, you should be able to see this docs along with the metadata `source`"
]
},
{
"cell_type": "markdown",
"id": "f1290844",
"metadata": {},
"source": [
"### 2. Searching similar texts\n",
"\n",
"Now let's try to search Rockset to find strings similar to our query string!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "96e73ac1",
"metadata": {},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
Fix `make docs_build` and related scripts (#7276) **Description: a description of the change** Fixed `make docs_build` and related scripts which caused errors. There are several changes. First, I made the build of the documentation and the API Reference into two separate commands. This is because it takes less time to build. The commands for documents are `make docs_build`, `make docs_clean`, and `make docs_linkcheck`. The commands for API Reference are `make api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`. It looked like `docs/.local_build.sh` could be used to build the documentation, so I used that. Since `.local_build.sh` was also building API Rerefence internally, I removed that process. `.local_build.sh` also added some Bash options to stop in error or so. Futher more added `cd "${SCRIPT_DIR}"` at the beginning so that the script will work no matter which directory it is executed in. `docs/api_reference/api_reference.rst` is removed, because which is generated by `docs/api_reference/create_api_rst.py`, and added it to .gitignore. Finally, the description of CONTRIBUTING.md was modified. **Issue: the issue # it fixes (if applicable)** https://github.com/hwchase17/langchain/issues/6413 **Dependencies: any dependencies required for this change** `nbdoc` was missing in group docs so it was added. I installed it with the `poetry add --group docs nbdoc` command. I am concerned if any modifications are needed to poetry.lock. I would greatly appreciate it if you could pay close attention to this file during the review. **Tag maintainer** - General / Misc / if you don't know who to tag: @baskaryan If this PR needs any additional changes, I'll be happy to make them! --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-12 02:05:14 +00:00
"output = docsearch.similarity_search_with_relevance_scores(\n",
" query, 4, RocksetDB.DistanceFunction.COSINE_SIM\n",
")\n",
"print(\"output length:\", len(output))\n",
"for d, dist in output:\n",
Fix `make docs_build` and related scripts (#7276) **Description: a description of the change** Fixed `make docs_build` and related scripts which caused errors. There are several changes. First, I made the build of the documentation and the API Reference into two separate commands. This is because it takes less time to build. The commands for documents are `make docs_build`, `make docs_clean`, and `make docs_linkcheck`. The commands for API Reference are `make api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`. It looked like `docs/.local_build.sh` could be used to build the documentation, so I used that. Since `.local_build.sh` was also building API Rerefence internally, I removed that process. `.local_build.sh` also added some Bash options to stop in error or so. Futher more added `cd "${SCRIPT_DIR}"` at the beginning so that the script will work no matter which directory it is executed in. `docs/api_reference/api_reference.rst` is removed, because which is generated by `docs/api_reference/create_api_rst.py`, and added it to .gitignore. Finally, the description of CONTRIBUTING.md was modified. **Issue: the issue # it fixes (if applicable)** https://github.com/hwchase17/langchain/issues/6413 **Dependencies: any dependencies required for this change** `nbdoc` was missing in group docs so it was added. I installed it with the `poetry add --group docs nbdoc` command. I am concerned if any modifications are needed to poetry.lock. I would greatly appreciate it if you could pay close attention to this file during the review. **Tag maintainer** - General / Misc / if you don't know who to tag: @baskaryan If this PR needs any additional changes, I'll be happy to make them! --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-12 02:05:14 +00:00
" print(dist, d.metadata, d.page_content[:20] + \"...\")\n",
"\n",
"##\n",
"# output length: 4\n",
"# 0.764990692109871 {'source': '../../../state_of_the_union.txt'} Madam Speaker, Madam...\n",
"# 0.7485416901622112 {'source': '../../../state_of_the_union.txt'} And Im taking robus...\n",
"# 0.7468678973398306 {'source': '../../../state_of_the_union.txt'} And so many families...\n",
"# 0.7436231261419488 {'source': '../../../state_of_the_union.txt'} Groups of citizens b..."
]
},
{
"cell_type": "markdown",
"id": "5e15d630",
"metadata": {},
"source": [
"You can also use a where filter to prune your search space. You can add filters on text key, or any of the metadata fields. \n",
"\n",
"> **Note**: Since Rockset stores each metadata field as a separate column internally, these filters are much faster than other vector databases which store all metadata as a single JSON.\n",
"\n",
"For eg, to find all texts NOT containing the substring \"and\", you can use the following code:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c1c44d41",
"metadata": {},
"outputs": [],
"source": [
"output = docsearch.similarity_search_with_relevance_scores(\n",
" query,\n",
" 4,\n",
" RocksetDB.DistanceFunction.COSINE_SIM,\n",
Fix `make docs_build` and related scripts (#7276) **Description: a description of the change** Fixed `make docs_build` and related scripts which caused errors. There are several changes. First, I made the build of the documentation and the API Reference into two separate commands. This is because it takes less time to build. The commands for documents are `make docs_build`, `make docs_clean`, and `make docs_linkcheck`. The commands for API Reference are `make api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`. It looked like `docs/.local_build.sh` could be used to build the documentation, so I used that. Since `.local_build.sh` was also building API Rerefence internally, I removed that process. `.local_build.sh` also added some Bash options to stop in error or so. Futher more added `cd "${SCRIPT_DIR}"` at the beginning so that the script will work no matter which directory it is executed in. `docs/api_reference/api_reference.rst` is removed, because which is generated by `docs/api_reference/create_api_rst.py`, and added it to .gitignore. Finally, the description of CONTRIBUTING.md was modified. **Issue: the issue # it fixes (if applicable)** https://github.com/hwchase17/langchain/issues/6413 **Dependencies: any dependencies required for this change** `nbdoc` was missing in group docs so it was added. I installed it with the `poetry add --group docs nbdoc` command. I am concerned if any modifications are needed to poetry.lock. I would greatly appreciate it if you could pay close attention to this file during the review. **Tag maintainer** - General / Misc / if you don't know who to tag: @baskaryan If this PR needs any additional changes, I'll be happy to make them! --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-12 02:05:14 +00:00
" where_str=\"{} NOT LIKE '%citizens%'\".format(TEXT_KEY),\n",
")\n",
"print(\"output length:\", len(output))\n",
"for d, dist in output:\n",
Fix `make docs_build` and related scripts (#7276) **Description: a description of the change** Fixed `make docs_build` and related scripts which caused errors. There are several changes. First, I made the build of the documentation and the API Reference into two separate commands. This is because it takes less time to build. The commands for documents are `make docs_build`, `make docs_clean`, and `make docs_linkcheck`. The commands for API Reference are `make api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`. It looked like `docs/.local_build.sh` could be used to build the documentation, so I used that. Since `.local_build.sh` was also building API Rerefence internally, I removed that process. `.local_build.sh` also added some Bash options to stop in error or so. Futher more added `cd "${SCRIPT_DIR}"` at the beginning so that the script will work no matter which directory it is executed in. `docs/api_reference/api_reference.rst` is removed, because which is generated by `docs/api_reference/create_api_rst.py`, and added it to .gitignore. Finally, the description of CONTRIBUTING.md was modified. **Issue: the issue # it fixes (if applicable)** https://github.com/hwchase17/langchain/issues/6413 **Dependencies: any dependencies required for this change** `nbdoc` was missing in group docs so it was added. I installed it with the `poetry add --group docs nbdoc` command. I am concerned if any modifications are needed to poetry.lock. I would greatly appreciate it if you could pay close attention to this file during the review. **Tag maintainer** - General / Misc / if you don't know who to tag: @baskaryan If this PR needs any additional changes, I'll be happy to make them! --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-07-12 02:05:14 +00:00
" print(dist, d.metadata, d.page_content[:20] + \"...\")\n",
"\n",
"##\n",
"# output length: 4\n",
"# 0.7651359650263554 {'source': '../../../state_of_the_union.txt'} Madam Speaker, Madam...\n",
"# 0.7486265516824893 {'source': '../../../state_of_the_union.txt'} And Im taking robus...\n",
"# 0.7469625542348115 {'source': '../../../state_of_the_union.txt'} And so many families...\n",
"# 0.7344177777547739 {'source': '../../../state_of_the_union.txt'} We see the unity amo..."
]
},
{
"cell_type": "markdown",
"id": "0765b822",
"metadata": {},
"source": [
"### 3. [Optional] Drop all inserted documents\n",
"\n",
"In order to delete texts from the Rockset collection, you need to know the unique ID associated with each document inside Rockset. These ids can either be supplied directly by the user while inserting the texts (in the `RocksetDB.add_texts()` function), else Rockset will generate a unique ID or each document. Either way, `Rockset.add_texts()` returns the ids for the inserted documents.\n",
"\n",
"To delete these docs, simply use the `RocksetDB.delete_texts()` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "31738966",
"metadata": {},
"outputs": [],
"source": [
"docsearch.delete_texts(ids)"
]
},
{
"cell_type": "markdown",
"id": "03fa12a9",
"metadata": {},
"source": [
"## Congratulations!\n",
"\n",
"Voila! In this example you successfuly created a Rockset collection, inserted documents along with their OpenAI vector embeddings, and searched for similar docs both with and without any metadata filters.\n",
"\n",
"Keep an eye on https://rockset.com/blog/introducing-vector-search-on-rockset/ for future updates in this space!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2763dddb-e87d-4d3b-b0bf-c246b0573d87",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}