mirror of
https://github.com/hwchase17/langchain
synced 2024-11-10 01:10:59 +00:00
913a156cff
This PR makes minor improvements to our python notebook, and adds support for `Rockset` workspaces in our vectorstore client. @rlancemartin, @eyurtsev --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
322 lines
13 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9787b308",
   "metadata": {},
   "source": [
    "# Rockset\n",
    "\n",
    ">[Rockset](https://rockset.com/) is a real-time search and analytics database built for the cloud. Rockset uses a [Converged Index™](https://rockset.com/blog/converged-indexing-the-secret-sauce-behind-rocksets-fast-queries/) with an efficient store for vector embeddings to serve low-latency, high-concurrency search queries at scale. Rockset has full support for metadata filtering and handles real-time ingestion for constantly updating, streaming data.\n",
    "\n",
    "This notebook demonstrates how to use `Rockset` as a vector store in LangChain. Before getting started, make sure you have a `Rockset` account and an API key. [Start your free trial today.](https://rockset.com/create/)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b823d64a",
   "metadata": {},
   "source": [
    "## Setting Up Your Environment\n",
    "\n",
    "1. Use the `Rockset` console to create a [collection](https://rockset.com/docs/collections/) with the Write API as your source. In this walkthrough, we create a collection named `langchain_demo`.\n",
    "    \n",
    "    Configure the following [ingest transformation](https://rockset.com/docs/ingest-transformation/) to mark your embeddings field and take advantage of performance and storage optimizations, replacing `#length_of_vector_embedding` with the dimension of your embedding vectors (for example, 1536 for OpenAI's `text-embedding-ada-002` model):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aac58387",
   "metadata": {
    "vscode": {
     "languageId": "sql"
    }
   },
   "outputs": [],
   "source": [
    "SELECT _input.* EXCEPT(_meta), \n",
    "VECTOR_ENFORCE(_input.description_embedding, #length_of_vector_embedding, 'float') as description_embedding \n",
    "FROM _input"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df380e1c",
   "metadata": {},
   "source": [
    "2. After creating your collection, use the console to retrieve an [API key](https://rockset.com/docs/iam/#users-api-keys-and-roles). For the purpose of this notebook, we assume you are using the `Oregon(us-west-2)` region.\n",
    "\n",
    "3. Install the [rockset-python-client](https://github.com/rockset/rockset-python-client) to enable LangChain to communicate directly with `Rockset`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "00d16b83",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install rockset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e79550eb",
   "metadata": {},
   "source": [
    "## LangChain Tutorial\n",
    "\n",
    "Follow along in your own Python notebook to generate and store vector embeddings in Rockset.\n",
    "Start using Rockset to search for documents similar to your search queries.\n",
    "\n",
    "### 1. Define Key Variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "29505c1e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import rockset\n",
    "\n",
    "# Make sure the ROCKSET_API_KEY environment variable is set\n",
    "ROCKSET_API_KEY = os.environ.get(\"ROCKSET_API_KEY\")\n",
    "# Change the region if your account is not in Oregon (us-west-2)\n",
    "ROCKSET_API_SERVER = rockset.Regions.usw2a1\n",
    "rockset_client = rockset.RocksetClient(ROCKSET_API_SERVER, ROCKSET_API_KEY)\n",
    "\n",
    "COLLECTION_NAME = 'langchain_demo'\n",
    "TEXT_KEY = 'description'\n",
    "EMBEDDING_KEY = 'description_embedding'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07625be2",
   "metadata": {},
   "source": [
    "### 2. Prepare Documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9740d8c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "from langchain.document_loaders import TextLoader\n",
    "from langchain.vectorstores import Rockset\n",
    "\n",
    "loader = TextLoader('../../../state_of_the_union.txt')\n",
    "documents = loader.load()\n",
    "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
    "docs = text_splitter.split_documents(documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a068be18",
   "metadata": {},
   "source": [
    "### 3. Insert Documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85b6a6c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "embeddings = OpenAIEmbeddings()  # Make sure the OPENAI_API_KEY environment variable is set\n",
    "\n",
    "docsearch = Rockset(\n",
    "    client=rockset_client,\n",
    "    embeddings=embeddings,\n",
    "    collection_name=COLLECTION_NAME,\n",
    "    text_key=TEXT_KEY,\n",
    "    embedding_key=EMBEDDING_KEY,\n",
    ")\n",
    "\n",
    "ids = docsearch.add_texts(\n",
    "    texts=[d.page_content for d in docs],\n",
    "    metadatas=[d.metadata for d in docs],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56eef48d",
   "metadata": {},
   "source": [
    "### 4. Search for Similar Documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "0bbf3df0",
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"What did the president say about Ketanji Brown Jackson?\"\n",
    "output = docsearch.similarity_search_with_relevance_scores(\n",
    "    query, 4, Rockset.DistanceFunction.COSINE_SIM\n",
    ")\n",
    "print(\"output length:\", len(output))\n",
    "for d, dist in output:\n",
    "    print(dist, d.metadata, d.page_content[:20] + '...')\n",
    "\n",
    "##\n",
    "# output length: 4\n",
    "# 0.764990692109871 {'source': '../../../state_of_the_union.txt'} Madam Speaker, Madam...\n",
    "# 0.7485416901622112 {'source': '../../../state_of_the_union.txt'} And I’m taking robus...\n",
    "# 0.7468678973398306 {'source': '../../../state_of_the_union.txt'} And so many families...\n",
    "# 0.7436231261419488 {'source': '../../../state_of_the_union.txt'} Groups of citizens b..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7037a22f",
   "metadata": {},
   "source": [
    "### 5. Search for Similar Documents with Filtering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b64a290f",
   "metadata": {},
   "outputs": [],
   "source": [
    "output = docsearch.similarity_search_with_relevance_scores(\n",
    "    query,\n",
    "    4,\n",
    "    Rockset.DistanceFunction.COSINE_SIM,\n",
    "    where_str=\"{} NOT LIKE '%citizens%'\".format(TEXT_KEY),\n",
    ")\n",
    "print(\"output length:\", len(output))\n",
    "for d, dist in output:\n",
    "    print(dist, d.metadata, d.page_content[:20] + '...')\n",
    "\n",
    "##\n",
    "# output length: 4\n",
    "# 0.7651359650263554 {'source': '../../../state_of_the_union.txt'} Madam Speaker, Madam...\n",
    "# 0.7486265516824893 {'source': '../../../state_of_the_union.txt'} And I’m taking robus...\n",
    "# 0.7469625542348115 {'source': '../../../state_of_the_union.txt'} And so many families...\n",
    "# 0.7344177777547739 {'source': '../../../state_of_the_union.txt'} We see the unity amo..."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "13a52b38",
   "metadata": {},
   "source": [
    "### 6. [Optional] Delete Inserted Documents\n",
    "\n",
    "To delete documents from your collection, you need the unique ID of each one.\n",
    "You can define IDs when inserting documents with `Rockset.add_texts()`; otherwise, Rockset generates a unique ID for each document. Either way, `Rockset.add_texts()` returns the IDs of the inserted documents.\n",
    "\n",
    "To delete these documents, use the `Rockset.delete_texts()` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f755924",
   "metadata": {},
   "outputs": [],
   "source": [
    "docsearch.delete_texts(ids)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d468f431",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "In this tutorial, we created a `Rockset` collection, inserted documents with OpenAI embeddings, and searched for similar documents with and without metadata filters.\n",
    "\n",
    "Keep an eye on https://rockset.com/ for future updates in this space."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}