{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9787b308",
   "metadata": {},
   "source": [
    "# Rockset\n",
    "\n",
    ">[Rockset](https://rockset.com/) is a real-time search and analytics database built for the cloud. Rockset uses a [Converged Index™](https://rockset.com/blog/converged-indexing-the-secret-sauce-behind-rocksets-fast-queries/) with an efficient store for vector embeddings to serve low latency, high concurrency search queries at scale. Rockset has full support for metadata filtering and handles real-time ingestion for constantly updating, streaming data.\n",
    "\n",
    "This notebook demonstrates how to use `Rockset` as a vector store in LangChain. Before getting started, make sure you have access to a `Rockset` account and an API key available. [Start your free trial today.](https://rockset.com/create/)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b823d64a",
   "metadata": {},
   "source": [
    "## Setting Up Your Environment\n",
    "\n",
    "1. Leverage the `Rockset` console to create a [collection](https://rockset.com/docs/collections/) with the Write API as your source. In this walkthrough, we create a collection named `langchain_demo`.\n",
    "\n",
    "    Configure the following [ingest transformation](https://rockset.com/docs/ingest-transformation/) to mark your embeddings field and take advantage of performance and storage optimizations:\n",
    "\n",
    "    (We used OpenAI `text-embedding-ada-002` for this example, where #length_of_vector_embedding = 1536)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aac58387",
   "metadata": {
    "vscode": {
     "languageId": "sql"
    }
   },
   "outputs": [],
   "source": [
    "SELECT _input.* EXCEPT(_meta),\n",
    "VECTOR_ENFORCE(_input.description_embedding, #length_of_vector_embedding, 'float') as description_embedding\n",
    "FROM _input"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df380e1c",
   "metadata": {},
   "source": [
    "2. After creating your collection, use the console to retrieve an [API key](https://rockset.com/docs/iam/#users-api-keys-and-roles). For the purpose of this notebook, we assume you are using the `Oregon(us-west-2)` region.\n",
    "\n",
    "3. Install the [rockset-python-client](https://github.com/rockset/rockset-python-client) to enable LangChain to communicate directly with `Rockset`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "00d16b83",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install rockset"
   ]
  },
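  {
   "cell_type": "markdown",
   "id": "5e7a9c21",
   "metadata": {},
   "source": [
    "The cell below is a minimal, optional sketch for setting the `ROCKSET_API_KEY` (and, for the embeddings used later, `OPENAI_API_KEY`) environment variables from inside the notebook if they are not already exported in your shell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5e7a9c22",
   "metadata": {},
   "outputs": [],
   "source": [
    "import getpass\n",
    "import os\n",
    "\n",
    "# Optional: prompt for the API keys if they are not already set.\n",
    "# These are the environment variable names this notebook reads.\n",
    "if \"ROCKSET_API_KEY\" not in os.environ:\n",
    "    os.environ[\"ROCKSET_API_KEY\"] = getpass.getpass(\"Rockset API key: \")\n",
    "if \"OPENAI_API_KEY\" not in os.environ:\n",
    "    os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API key: \")"
   ]
  },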
  {
   "cell_type": "markdown",
   "id": "e79550eb",
   "metadata": {},
   "source": [
    "## LangChain Tutorial\n",
    "\n",
    "Follow along in your own Python notebook to generate and store vector embeddings in Rockset.\n",
    "Start using Rockset to search for documents similar to your search queries.\n",
    "\n",
    "### 1. Define Key Variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29505c1e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import rockset\n",
    "\n",
    "ROCKSET_API_KEY = os.environ.get(\"ROCKSET_API_KEY\")  # Verify ROCKSET_API_KEY environment variable\n",
    "ROCKSET_API_SERVER = rockset.Regions.usw2a1  # Verify Rockset region\n",
    "rockset_client = rockset.RocksetClient(ROCKSET_API_SERVER, ROCKSET_API_KEY)\n",
    "\n",
    "COLLECTION_NAME = 'langchain_demo'\n",
    "TEXT_KEY = 'description'\n",
    "EMBEDDING_KEY = 'description_embedding'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07625be2",
   "metadata": {},
   "source": [
    "### 2. Prepare Documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9740d8c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "from langchain.document_loaders import TextLoader\n",
    "from langchain.vectorstores import Rockset\n",
    "\n",
    "loader = TextLoader('../../../state_of_the_union.txt')\n",
    "documents = loader.load()\n",
    "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
    "docs = text_splitter.split_documents(documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a068be18",
   "metadata": {},
   "source": [
    "### 3. Insert Documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85b6a6c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "embeddings = OpenAIEmbeddings()  # Verify OPENAI_API_KEY environment variable\n",
    "\n",
    "docsearch = Rockset(\n",
    "    client=rockset_client,\n",
    "    embeddings=embeddings,\n",
    "    collection_name=COLLECTION_NAME,\n",
    "    text_key=TEXT_KEY,\n",
    "    embedding_key=EMBEDDING_KEY,\n",
    ")\n",
    "\n",
    "ids = docsearch.add_texts(\n",
    "    texts=[d.page_content for d in docs],\n",
    "    metadatas=[d.metadata for d in docs],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56eef48d",
   "metadata": {},
   "source": [
    "### 4. Search for Similar Documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0bbf3df0",
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "output = docsearch.similarity_search_with_relevance_scores(\n",
    "    query, 4, Rockset.DistanceFunction.COSINE_SIM\n",
    ")\n",
    "print(\"output length:\", len(output))\n",
    "for d, dist in output:\n",
    "    print(dist, d.metadata, d.page_content[:20] + '...')\n",
    "\n",
    "##\n",
    "# output length: 4\n",
    "# 0.764990692109871 {'source': '../../../state_of_the_union.txt'} Madam Speaker, Madam...\n",
    "# 0.7485416901622112 {'source': '../../../state_of_the_union.txt'} And I’m taking robus...\n",
    "# 0.7468678973398306 {'source': '../../../state_of_the_union.txt'} And so many families...\n",
    "# 0.7436231261419488 {'source': '../../../state_of_the_union.txt'} Groups of citizens b..."
   ]
  },
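  {
   "cell_type": "markdown",
   "id": "0f2d1c3a",
   "metadata": {},
   "source": [
    "If you only need the matching documents and not their relevance scores, you can call the base `similarity_search` method instead. The cell below is a minimal, optional sketch that assumes the default `COSINE_SIM` distance function and reuses the `query` defined above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f2d1c3b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# A sketch: similarity_search returns just the Documents, without scores.\n",
    "docs_only = docsearch.similarity_search(query, 4)\n",
    "for d in docs_only:\n",
    "    print(d.metadata, d.page_content[:20] + '...')"
   ]
  },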
  {
   "cell_type": "markdown",
   "id": "7037a22f",
   "metadata": {},
   "source": [
    "### 5. Search for Similar Documents with Filtering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b64a290f",
   "metadata": {},
   "outputs": [],
   "source": [
    "output = docsearch.similarity_search_with_relevance_scores(\n",
    "    query,\n",
    "    4,\n",
    "    Rockset.DistanceFunction.COSINE_SIM,\n",
    "    where_str=\"{} NOT LIKE '%citizens%'\".format(TEXT_KEY),\n",
    ")\n",
    "print(\"output length:\", len(output))\n",
    "for d, dist in output:\n",
    "    print(dist, d.metadata, d.page_content[:20] + '...')\n",
    "\n",
    "##\n",
    "# output length: 4\n",
    "# 0.7651359650263554 {'source': '../../../state_of_the_union.txt'} Madam Speaker, Madam...\n",
    "# 0.7486265516824893 {'source': '../../../state_of_the_union.txt'} And I’m taking robus...\n",
    "# 0.7469625542348115 {'source': '../../../state_of_the_union.txt'} And so many families...\n",
    "# 0.7344177777547739 {'source': '../../../state_of_the_union.txt'} We see the unity amo..."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "13a52b38",
   "metadata": {},
   "source": [
    "### 6. [Optional] Delete Inserted Documents\n",
    "\n",
    "To delete documents from your collection, you must have the unique ID associated with each one.\n",
    "You can define IDs when inserting documents with `Rockset.add_texts()` (see the optional sketch below); otherwise, Rockset generates a unique ID for each document. Either way, `Rockset.add_texts()` returns the IDs of the inserted documents.\n",
    "\n",
    "To delete these docs, simply use the `Rockset.delete_texts()` function."
   ]
  },
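  {
   "cell_type": "markdown",
   "id": "3c9d2e1f",
   "metadata": {},
   "source": [
    "The cell below is a minimal, optional sketch of supplying your own IDs to `add_texts()` instead of letting Rockset generate them; the example text and the `example-doc-1` ID are hypothetical, and the sketch deletes its own document so the walkthrough above is unaffected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c9d2e2a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sketch: pass explicit IDs to add_texts() instead of letting\n",
    "# Rockset generate them. The text and ID below are hypothetical.\n",
    "custom_ids = docsearch.add_texts(\n",
    "    texts=[\"An example document.\"],\n",
    "    metadatas=[{\"source\": \"example\"}],\n",
    "    ids=[\"example-doc-1\"],\n",
    ")\n",
    "docsearch.delete_texts(custom_ids)  # clean up the example document"
   ]
  },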
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f755924",
   "metadata": {},
   "outputs": [],
   "source": [
    "docsearch.delete_texts(ids)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d468f431",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "In this tutorial, we successfully created a `Rockset` collection, inserted documents with OpenAI embeddings, and searched for similar documents with and without metadata filters.\n",
    "\n",
    "Keep an eye on https://rockset.com/ for future updates in this space."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}