Add indexing support (#9614)

This PR introduces a persistence layer to help with indexing workflows into
vectorstores.

The indexing code helps users to:

1. Avoid writing duplicated content into the vectorstore
2. Avoid over-writing content if it's unchanged

Importantly, this keeps working even if the content being written is derived
via a set of transformations from some source content (e.g., indexing child
documents that were derived from parent documents by chunking).

The two main components are:

1. A persistence layer that keeps track of which keys were updated and when.
   Keeping track of the timestamp of updates allows old content to be cleaned
   up safely and with minimal complexity.
2. A HashedDocument, which is used to hash the contents (including metadata) of
   the documents. We rely on the hashes for identifying duplicates.


The indexing code works with **ANY** document loader. To add transformations
to the documents, users can for now add a custom document loader that composes
an existing loader together with document transformers.
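A minimal sketch of the intended usage, mirroring the quickstart notebook added in this PR (the Elasticsearch URL, index name, and SQLite path are the notebook's placeholder values):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.indexes import SQLRecordManager, index
from langchain.schema import Document
from langchain.vectorstores import ElasticsearchStore

# Vector store to index into (localhost URL is a placeholder).
vectorstore = ElasticsearchStore(
    es_url="http://localhost:9200", index_name="test_index", embedding=OpenAIEmbeddings()
)

# Persistence layer that records which keys were written and when.
record_manager = SQLRecordManager(
    "elasticsearch/test_index", db_url="sqlite:///record_manager_cache.sql"
)
record_manager.create_schema()

docs = [
    Document(page_content="kitty", metadata={"source": "kitty.txt"}),
    Document(page_content="doggy", metadata={"source": "doggy.txt"}),
]

# Duplicated or unchanged documents are skipped; in "incremental" mode, stale
# versions sharing a source with the newly indexed documents are deleted.
index(
    docs,
    record_manager,
    vectorstore,
    delete_mode="incremental",
    source_id_key="source",
)
```

The call returns an `IndexingResult` dict reporting `num_added`, `num_updated`, `num_skipped`, and `num_deleted`.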

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
Eugene Yurtsev 11 months ago committed by GitHub
parent c215481531
commit b88dfcb42a

@ -0,0 +1,916 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "0fe57ac5-31c5-4dbb-b96c-78dead32e1bd",
"metadata": {},
"source": [
"# Indexing\n",
"\n",
"Here, we will look at a basic indexing workflow using the LangChain indexing API. \n",
"\n",
"The indexing API lets you load and keep in sync documents from any source into a vector store. Specifically, it helps:\n",
"\n",
"* Avoid writing duplicated content into the vector store\n",
"* Avoid re-writing unchanged content\n",
"* Avoid re-computing embeddings over unchanged content\n",
"\n",
"All of which should save you time and money, as well as improve your vector search results.\n",
"\n",
"Crucially, the indexing API will work even with documents that have gone through several \n",
"transformation steps (e.g., via text chunking) with respect to the original source documents.\n",
"\n",
"## How it works\n",
"\n",
"LangChain indexing makes use of a record manager (`RecordManager`) that keeps track of document writes into the vector store.\n",
"\n",
"When indexing content, hashes are computed for each document, and the following information is stored in the record manager: \n",
"\n",
"- the document hash (hash of both page content and metadata)\n",
"- write time\n",
"- the source id -- each document should include information in its metadata to allow us to determine the ultimate source of this document\n",
"\n",
"## Deletion modes\n",
"\n",
"When indexing documents into a vector store, it's possible that some existing documents in the vector store should be deleted. In certain situations you may want to remove any existing documents that are derived from the same sources as the new documents being indexed. In others you may want to delete all existing documents wholesale. The indexing API deletion modes let you pick the behavior you want:\n",
"\n",
"| Delete Mode | De-Duplicates Content | Parallelizable | Cleans Up Deleted Source Docs | Cleans Up Mutations of Source Docs and/or Derived Docs | Clean Up Timing |\n",
"|-------------|-----------------------|---------------|----------------------------------|----------------------------------------------------|---------------------|\n",
"| None | ✅ | ✅ | ❌ | ❌ | - |\n",
"| Incremental | ✅ | ✅ | ❌ | ✅ | Continuously |\n",
"| Full | ✅ | ❌ | ✅ | ✅ | At end of indexing |\n",
"\n",
"\n",
"`None` does not do any automatic clean up, allowing the user to manually do clean up of old content. \n",
"\n",
"`incremental` and `full` offer the following automated clean up:\n",
"\n",
"* If the content of source document or derived documents has **changed**, both `incremental` or `full` modes will clean up (delete) previous versions of the content.\n",
"* If the source document has been **deleted** (meaning it is not included in the documents currently being indexed), the `full` delete mode will delete it from the vector store correctly, but the `incremental` mode will not.\n",
"\n",
"When content is mutated (e.g., the source PDF file was revised) there will be a period of time during indexing when both the new and old versions may be returned to the user. This happens after the new content was written, but before the old version was deleted.\n",
"\n",
"* `incremental` indexing minimizes this period of time as it is able to do clean up continuously, as it writes.\n",
"* `full` mode does the clean up after all batches have been written.\n",
"\n",
"## Requirements\n",
"\n",
"1. Do not use with a store that has been pre-populated with content independently of the indexing API, as the record manager will not know that records have been inserted previously.\n",
"2. Only works with LangChain ``VectorStore``'s that support:\n",
" * document addition by id (`add_documents` method with `ids` argument)\n",
" * delete by id (`delete` method with)\n",
" \n",
"## Caution\n",
"\n",
"The record manager relies on a time-based mechanism to determine what content can be cleaned up (when using `full` or `incremental` delete modes).\n",
"\n",
"If two tasks run back to back, and the first task finishes before the the clock time changes, then the second task may not be able to clean up content.\n",
"\n",
"This is unlikely to be an issue in actual settings for the following reasons:\n",
"\n",
"1. The RecordManager uses higher resolutino timestamps.\n",
"2. The data would need to change between the first and the second tasks runs, which becomes unlikely if the time interval between the tasks is small.\n",
"3. Indexing tasks typically take more than a few ms."
]
},
{
"cell_type": "markdown",
"id": "ec2109b4-cbcc-44eb-9dac-3f7345f971dc",
"metadata": {},
"source": [
"## Quickstart"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "15f7263e-c82e-4914-874f-9699ea4de93e",
"metadata": {},
"outputs": [],
"source": [
"from langchain.embeddings import OpenAIEmbeddings\n",
"from langchain.indexes import SQLRecordManager, index\n",
"from langchain.schema import Document\n",
"from langchain.vectorstores import ElasticsearchStore"
]
},
{
"cell_type": "markdown",
"id": "f81201ab-d997-433c-9f18-ceea70e61cbd",
"metadata": {},
"source": [
"Initialize a vector store and set up the embeddings"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4ffc9659-91c0-41e0-ae4b-f7ff0d97292d",
"metadata": {},
"outputs": [],
"source": [
"collection_name = \"test_index\"\n",
"\n",
"embedding = OpenAIEmbeddings()\n",
"\n",
"vectorstore = ElasticsearchStore(\n",
" es_url=\"http://localhost:9200\", index_name=\"test_index\", embedding=embedding\n",
")"
]
},
{
"cell_type": "markdown",
"id": "b9b7564f-2334-428b-b513-13045a08b56c",
"metadata": {},
"source": [
"Initialize a record manager with an appropriate namespace.\n",
"\n",
"**Suggestion** Use a namespace that takes into account both the vectostore and the collection name in the vectorstore; e.g., 'redis/my_docs', 'chromadb/my_docs' or 'postgres/my_docs'"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "498cc80e-c339-49ee-893b-b18d06346ef8",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"namespace = f\"elasticsearch/{collection_name}\"\n",
"record_manager = SQLRecordManager(\n",
" namespace, db_url=\"sqlite:///record_manager_cache.sql\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "835c2c19-68ec-4086-9066-f7ba40877fd5",
"metadata": {},
"source": [
"Create a schema before using the record manager"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a4be2da3-3a5c-468a-a824-560157290f7f",
"metadata": {},
"outputs": [],
"source": [
"record_manager.create_schema()"
]
},
{
"cell_type": "markdown",
"id": "7f07c6bd-6ada-4b17-a8c5-fe5e4a5278fd",
"metadata": {},
"source": [
"Let's index some test documents"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "bbfdf314-14f9-4799-8fb6-d42de4d51287",
"metadata": {},
"outputs": [],
"source": [
"doc1 = Document(page_content=\"kitty\", metadata={\"source\": \"kitty.txt\"})\n",
"doc2 = Document(page_content=\"doggy\", metadata={\"source\": \"doggy.txt\"})"
]
},
{
"cell_type": "markdown",
"id": "c7d572be-a913-4511-ab64-2864a252458a",
"metadata": {},
"source": [
"Indexing into an empty vectorstore"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "67d2a5c8-f2bd-489a-b58e-2c7ba7fefe6f",
"metadata": {},
"outputs": [],
"source": [
"def _clear():\n",
" \"\"\"Hacky helper method to clear content. See the `full` mode section to to understand why it works.\"\"\"\n",
" index([], record_manager, vectorstore, delete_mode=\"full\", source_id_key=\"source\")"
]
},
{
"cell_type": "markdown",
"id": "e5e92e76-f23f-4a61-8a2d-f16baf288700",
"metadata": {},
"source": [
"### ``None`` deletion mode\n",
"\n",
"This mode does not do automatic clean up of old versions of content; however, it still takes care of content de-duplication."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "e2288cee-1738-4054-af72-23b5c5be8840",
"metadata": {},
"outputs": [],
"source": [
"_clear()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "b253483b-5be0-4151-b732-ca93db4457b1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'num_added': 1, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"index(\n",
" [doc1, doc1, doc1, doc1, doc1],\n",
" record_manager,\n",
" vectorstore,\n",
" delete_mode=None,\n",
" source_id_key=\"source\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "7abaf351-bf5a-4d9e-95cd-4e3ecbfc1a84",
"metadata": {},
"outputs": [],
"source": [
"_clear()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "55b6873c-5907-4fa6-84ca-df6cdf1810f0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"index(\n",
" [doc1, doc2], record_manager, vectorstore, delete_mode=None, source_id_key=\"source\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "7be3e55a-5fe9-4f40-beff-577c2aa5e76a",
"metadata": {},
"source": [
"Second time around all content will be skipped"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "59d74ca1-2e3d-4b4c-ad88-a4907aa20081",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'num_added': 0, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"index(\n",
" [doc1, doc2], record_manager, vectorstore, delete_mode=None, source_id_key=\"source\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "237a809e-575d-4f02-870e-5906a3643f30",
"metadata": {},
"source": [
"### ``\"incremental\"`` deletion mode"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "6bc91073-0ab4-465a-9302-e7f4bbd2285c",
"metadata": {},
"outputs": [],
"source": [
"_clear()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "4a551091-6d46-4cdd-9af9-8672e5866a0a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"index(\n",
" [doc1, doc2],\n",
" record_manager,\n",
" vectorstore,\n",
" delete_mode=\"incremental\",\n",
" source_id_key=\"source\",\n",
")"
]
},
{
"cell_type": "markdown",
"id": "d0604ab8-318c-4706-959b-3907af438630",
"metadata": {},
"source": [
"Indexing again should result in both documents getting **skipped** -- also skipping the embedding operation!"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "81785863-391b-4578-a6f6-63b3e5285488",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'num_added': 0, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"index(\n",
" [doc1, doc2],\n",
" record_manager,\n",
" vectorstore,\n",
" delete_mode=\"incremental\",\n",
" source_id_key=\"source\",\n",
")"
]
},
{
"cell_type": "markdown",
"id": "b205c1ba-f069-4a4e-af93-dc98afd5c9e6",
"metadata": {},
"source": [
"If we provide no documents with incremental indexing mode, nothing will change"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "1f73ca85-7478-48ab-976c-17b00beec7bd",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'num_added': 0, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"index(\n",
" [], record_manager, vectorstore, delete_mode=\"incremental\", source_id_key=\"source\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "b8c4ac96-8d60-4ade-8a94-e76ccb536442",
"metadata": {},
"source": [
"If we mutate a document, the new version will be written and all old versions sharing the same source will be deleted."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "27d05bcb-d96d-42eb-88a8-54b33d6cfcdc",
"metadata": {},
"outputs": [],
"source": [
"changed_doc_2 = Document(page_content=\"puppy\", metadata={\"source\": \"doggy.txt\"})"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "3809e379-5962-4267-add9-b10f43e24c66",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'num_added': 1, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 1}"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"index(\n",
" [changed_doc_2],\n",
" record_manager,\n",
" vectorstore,\n",
" delete_mode=\"incremental\",\n",
" source_id_key=\"source\",\n",
")"
]
},
{
"cell_type": "markdown",
"id": "8bc75b9c-784a-4eb6-b5d6-688e3fbd4658",
"metadata": {},
"source": [
"### ``\"full\"`` deletion mode\n",
"\n",
"In `full` mode the user should pass the `full` universe of content that should be indexed into the indexing function.\n",
"\n",
"Any documents that are not passed into the indexing functino and are present in the vectorstore will be deleted!\n",
"\n",
"This behavior is useful to handle deletions of source documents."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "38a14a3d-11c7-43e2-b7f1-08e487961bb5",
"metadata": {},
"outputs": [],
"source": [
"_clear()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "46b5d7b6-ce91-47d2-a9d0-f390e77d847f",
"metadata": {},
"outputs": [],
"source": [
"all_docs = [doc1, doc2]"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "06954765-6155-40a0-b95e-33ef87754c8d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"index(all_docs, record_manager, vectorstore, delete_mode=\"full\", source_id_key=\"source\")"
]
},
{
"cell_type": "markdown",
"id": "887c45c6-4363-4389-ac56-9cdad682b4c8",
"metadata": {},
"source": [
"Say someone deleted the first doc"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "35270e4e-9b03-4486-95de-e819ca5e469f",
"metadata": {},
"outputs": [],
"source": [
"del all_docs[0]"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "7d835a6a-f468-4d79-9a3d-47db187edbb8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='doggy', metadata={'source': 'doggy.txt'})]"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_docs"
]
},
{
"cell_type": "markdown",
"id": "d940bcb4-cf6d-4c21-a565-e7f53f6dacf1",
"metadata": {},
"source": [
"Using full mode will clean up the deleted content as well"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "1b660eae-3bed-434d-a6f5-2aec96e5f0d6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'num_added': 0, 'num_updated': 0, 'num_skipped': 1, 'num_deleted': 1}"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"index(all_docs, record_manager, vectorstore, delete_mode=\"full\", source_id_key=\"source\")"
]
},
{
"cell_type": "markdown",
"id": "1a7ecdc9-df3c-4601-b2f3-50fdffc6e5f9",
"metadata": {},
"source": [
"## Source "
]
},
{
"cell_type": "markdown",
"id": "4002a4ac-02dd-4599-9b23-9b59f54237c8",
"metadata": {},
"source": [
"The metadata attribute contains a filed called `source`. This source should be pointing at the *ultimate* provenance associated with the given document.\n",
"\n",
"For example, if these documents are representing chunks of some parent document, the `source` for both documents should be the same and reference the parent document.\n",
"\n",
"In general, `source` should always be specified. Only use a `None`, if you **never** intend to use `incremental` mode, and for some reason can't specify the `source` field correctly."
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "184d3051-7fd1-4db2-a1d5-218ac0e1e641",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import CharacterTextSplitter"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "11318248-ad2a-4ef0-bd9b-9d4dab97caba",
"metadata": {},
"outputs": [],
"source": [
"doc1 = Document(\n",
" page_content=\"kitty kitty kitty kitty kitty\", metadata={\"source\": \"kitty.txt\"}\n",
")\n",
"doc2 = Document(page_content=\"doggy doggy the doggy\", metadata={\"source\": \"doggy.txt\"})"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "2cbf0902-d17b-44c9-8983-e8d0e831f909",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='kitty kit', metadata={'source': 'kitty.txt'}),\n",
" Document(page_content='tty kitty ki', metadata={'source': 'kitty.txt'}),\n",
" Document(page_content='tty kitty', metadata={'source': 'kitty.txt'}),\n",
" Document(page_content='doggy doggy', metadata={'source': 'doggy.txt'}),\n",
" Document(page_content='the doggy', metadata={'source': 'doggy.txt'})]"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_docs = CharacterTextSplitter(\n",
" separator=\"t\", keep_separator=True, chunk_size=12, chunk_overlap=2\n",
").split_documents([doc1, doc2])\n",
"new_docs"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "0f9d9bc2-ea85-48ab-b4a2-351c8708b1d4",
"metadata": {},
"outputs": [],
"source": [
"_clear()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "58781d81-f273-4aeb-8df6-540236826d00",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'num_added': 5, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"index(\n",
" new_docs,\n",
" record_manager,\n",
" vectorstore,\n",
" delete_mode=\"incremental\",\n",
" source_id_key=\"source\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "11b81cb6-5f04-499b-b125-1abb22d353bf",
"metadata": {},
"outputs": [],
"source": [
"changed_doggy_docs = [\n",
" Document(page_content=\"woof woof\", metadata={\"source\": \"doggy.txt\"}),\n",
" Document(page_content=\"woof woof woof\", metadata={\"source\": \"doggy.txt\"}),\n",
"]"
]
},
{
"cell_type": "markdown",
"id": "ab1c0915-3f9e-42ac-bdb5-3017935c6e7f",
"metadata": {},
"source": [
"This should delete the old versions of documents associated with `doggy.txt` source and replace them with the new versions"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "fec71cb5-6757-4b92-a306-62509f6e867d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'num_added': 0, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 2}"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"index(\n",
" changed_doggy_docs,\n",
" record_manager,\n",
" vectorstore,\n",
" delete_mode=\"incremental\",\n",
" source_id_key=\"source\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "876f5ab6-4b25-423e-8cff-f5a7a014395b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='tty kitty', metadata={'source': 'kitty.txt'}),\n",
" Document(page_content='tty kitty ki', metadata={'source': 'kitty.txt'}),\n",
" Document(page_content='kitty kit', metadata={'source': 'kitty.txt'})]"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectorstore.similarity_search(\"dog\", k=30)"
]
},
{
"cell_type": "markdown",
"id": "c0af4d24-d735-4e5d-ad9b-a2e8b281f9f1",
"metadata": {},
"source": [
"## Using with Loaders\n",
"\n",
"Indexing can accept either an iterable of documents or else any loader.\n",
"\n",
"**Attention** The loader **MUST** set source keys correctly."
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "08b68357-27c0-4f07-a51d-61c986aeb359",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders.base import BaseLoader\n",
"\n",
"\n",
"class MyCustomLoader(BaseLoader):\n",
" def lazy_load(self):\n",
" text_splitter = CharacterTextSplitter(\n",
" separator=\"t\", keep_separator=True, chunk_size=12, chunk_overlap=2\n",
" )\n",
" docs = [\n",
" Document(page_content=\"woof woof\", metadata={\"source\": \"doggy.txt\"}),\n",
" Document(page_content=\"woof woof woof\", metadata={\"source\": \"doggy.txt\"}),\n",
" ]\n",
" yield from text_splitter.split_documents(docs)\n",
"\n",
" def load(self):\n",
" return list(self.lazy_load())"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "5dae8e11-c0d6-4fc6-aa0e-68f8d92b5087",
"metadata": {},
"outputs": [],
"source": [
"_clear()"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "d8d72f76-6d6e-4a7c-8fea-9bdec05af05b",
"metadata": {},
"outputs": [],
"source": [
"loader = MyCustomLoader()"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "945c45cc-5a8d-4bd7-9f36-4ebd4a50e08b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='woof woof', metadata={'source': 'doggy.txt'}),\n",
" Document(page_content='woof woof woof', metadata={'source': 'doggy.txt'})]"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "dcb1ba71-db49-4140-ab4a-c5d64fc2578a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"index(loader, record_manager, vectorstore, delete_mode=\"full\", source_id_key=\"source\")"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "441159c1-dd84-48d7-8599-37a65c9fb589",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='woof woof', metadata={'source': 'doggy.txt'}),\n",
" Document(page_content='woof woof woof', metadata={'source': 'doggy.txt'})]"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectorstore.similarity_search(\"dog\", k=30)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -1,5 +1,28 @@
"""**Index** utilities."""
"""Code to support various indexing workflows.
Provides code to:
* Create knowledge graphs from data.
* Support indexing workflows from LangChain data loaders to vectorstores.
For indexing workflows, this code is used to avoid writing duplicated content
into the vectorstore and to avoid over-writing content if it's unchanged.
Importantly, this keeps on working even if the content being written is derived
via a set of transformations from some source content (e.g., indexing children
documents that were derived from parent documents by chunking.)
"""
from langchain.indexes._api import IndexingResult, index
from langchain.indexes._sql_record_manager import SQLRecordManager
from langchain.indexes.graph import GraphIndexCreator
from langchain.indexes.vectorstore import VectorstoreIndexCreator
__all__ = ["GraphIndexCreator", "VectorstoreIndexCreator"]
__all__ = [
# Keep sorted
"GraphIndexCreator",
"index",
"IndexingResult",
"SQLRecordManager",
"VectorstoreIndexCreator",
]

@ -0,0 +1,346 @@
"""Module contains logic for indexing documents into vector stores."""
from __future__ import annotations
import hashlib
import json
import uuid
from itertools import islice
from typing import (
Any,
Callable,
Dict,
Iterable,
Iterator,
List,
Literal,
Optional,
Sequence,
TypedDict,
TypeVar,
Union,
cast,
)
from langchain.document_loaders.base import BaseLoader
from langchain.indexes.base import NAMESPACE_UUID, RecordManager
from langchain.pydantic_v1 import root_validator
from langchain.schema import Document
from langchain.vectorstores.base import VectorStore
T = TypeVar("T")
def _hash_string_to_uuid(input_string: str) -> uuid.UUID:
"""Hashes a string and returns the corresponding UUID."""
hash_value = hashlib.sha1(input_string.encode("utf-8")).hexdigest()
return uuid.uuid5(NAMESPACE_UUID, hash_value)
def _hash_nested_dict_to_uuid(data: dict) -> uuid.UUID:
"""Hashes a nested dictionary and returns the corresponding UUID."""
serialized_data = json.dumps(data, sort_keys=True)
hash_value = hashlib.sha1(serialized_data.encode("utf-8")).hexdigest()
return uuid.uuid5(NAMESPACE_UUID, hash_value)
class _HashedDocument(Document):
"""A hashed document with a unique ID."""
uid: str
hash_: str
"""The hash of the document including content and metadata."""
content_hash: str
"""The hash of the document content."""
metadata_hash: str
"""The hash of the document metadata."""
@root_validator(pre=True)
def calculate_hashes(cls, values: Dict[str, Any]) -> Dict[str, Any]:
"""Root validator to calculate content and metadata hash."""
content = values.get("page_content", "")
metadata = values.get("metadata", {})
forbidden_keys = ("hash_", "content_hash", "metadata_hash")
for key in forbidden_keys:
if key in metadata:
raise ValueError(
f"Metadata cannot contain key {key} as it "
f"is reserved for internal use."
)
content_hash = str(_hash_string_to_uuid(content))
try:
metadata_hash = str(_hash_nested_dict_to_uuid(metadata))
except Exception as e:
raise ValueError(
f"Failed to hash metadata: {e}. "
f"Please use a dict that can be serialized using json."
)
values["content_hash"] = content_hash
values["metadata_hash"] = metadata_hash
values["hash_"] = str(_hash_string_to_uuid(content_hash + metadata_hash))
_uid = values.get("uid", None)
if _uid is None:
values["uid"] = values["hash_"]
return values
def to_document(self) -> Document:
"""Return a Document object."""
return Document(
page_content=self.page_content,
metadata=self.metadata,
)
@classmethod
def from_document(
cls, document: Document, *, uid: Optional[str] = None
) -> _HashedDocument:
"""Create a HashedDocument from a Document."""
return cls(
uid=uid,
page_content=document.page_content,
metadata=document.metadata,
)
def _batch(size: int, iterable: Iterable[T]) -> Iterator[List[T]]:
"""Utility batching function."""
it = iter(iterable)
while True:
chunk = list(islice(it, size))
if not chunk:
return
yield chunk
def _get_source_id_assigner(
source_id_key: Union[str, Callable[[Document], str], None],
) -> Callable[[Document], Union[str, None]]:
"""Get the source id from the document."""
if source_id_key is None:
return lambda doc: None
elif isinstance(source_id_key, str):
return lambda doc: doc.metadata[source_id_key]
elif callable(source_id_key):
return source_id_key
else:
raise ValueError(
f"source_id_key should be either None, a string or a callable. "
f"Got {source_id_key} of type {type(source_id_key)}."
)
def _deduplicate_in_order(
hashed_documents: Iterable[_HashedDocument],
) -> Iterator[_HashedDocument]:
"""Deduplicate a list of hashed documents while preserving order."""
seen = set()
for hashed_doc in hashed_documents:
if hashed_doc.hash_ not in seen:
seen.add(hashed_doc.hash_)
yield hashed_doc
# PUBLIC API
class IndexingResult(TypedDict):
"""Return a detailed a breakdown of the result of the indexing operation."""
num_added: int
"""Number of added documents."""
num_updated: int
"""Number of updated documents because they were not up to date."""
num_deleted: int
"""Number of deleted documents."""
num_skipped: int
"""Number of skipped documents because they were already up to date."""
def index(
docs_source: Union[BaseLoader, Iterable[Document]],
record_manager: RecordManager,
vector_store: VectorStore,
*,
batch_size: int = 100,
delete_mode: Literal["incremental", "full", None] = None,
source_id_key: Union[str, Callable[[Document], str], None] = None,
) -> IndexingResult:
"""Index data from the loader into the vector store.
Indexing functionality uses a manager to keep track of which documents
are in the vector store.
This allows us to keep track of which documents were updated, which
documents were deleted, and which documents should be skipped.
For the time being, documents are indexed using their hashes, and users
are not able to specify the uid of the document.
IMPORTANT:
if delete_mode is set to 'full', the loader should be returning
the entire dataset, and not just a subset of the dataset.
Otherwise, the cleanup will remove documents that it is not
supposed to.
Args:
docs_source: Data loader or iterable of documents to index.
record_manager: Timestamped set to keep track of which documents were
updated.
vector_store: Vector store to index the documents into.
batch_size: Batch size to use when indexing.
delete_mode: How to handle clean up of documents.
- Incremental: Cleans up all documents that haven't been updated AND
that are associated with source ids that were seen
during indexing.
Clean up is done continuously during indexing, helping
to minimize the probability of users seeing duplicated
content.
- Full: Delete all documents that have not been returned by the loader.
Clean up runs after all documents have been indexed.
This means that users may see duplicated content during indexing.
- None: Do not delete any documents.
source_id_key: Optional key that helps identify the original source
of the document.
Returns:
Indexing result which contains information about how many documents
were added, updated, deleted, or skipped.
"""
if delete_mode not in {"incremental", "full", None}:
raise ValueError(
f"delete_mode should be one of 'incremental', 'full' or None. "
f"Got {delete_mode}."
)
if delete_mode == "incremental" and source_id_key is None:
raise ValueError("Source id key is required when delete mode is incremental.")
# Check that the Vectorstore has required methods implemented
methods = ["delete", "add_documents"]
for method in methods:
if not hasattr(vector_store, method):
raise ValueError(
f"Vectorstore {vector_store} does not have required method {method}"
)
if type(vector_store).delete == VectorStore.delete:
# Checking if the vectorstore has overridden the default delete method
# implementation which just raises a NotImplementedError
raise ValueError("Vectorstore has not implemented the delete method")
if isinstance(docs_source, BaseLoader):
try:
doc_iterator = docs_source.lazy_load()
except NotImplementedError:
doc_iterator = iter(docs_source.load())
else:
doc_iterator = iter(docs_source)
source_id_assigner = _get_source_id_assigner(source_id_key)
# Mark when the update started.
index_start_dt = record_manager.get_time()
num_added = 0
num_skipped = 0
num_updated = 0
num_deleted = 0
for doc_batch in _batch(batch_size, doc_iterator):
hashed_docs = list(
_deduplicate_in_order(
[_HashedDocument.from_document(doc) for doc in doc_batch]
)
)
source_ids: Sequence[Optional[str]] = [
source_id_assigner(doc) for doc in hashed_docs
]
if delete_mode == "incremental":
# If the delete mode is incremental, source ids are required.
for source_id, hashed_doc in zip(source_ids, hashed_docs):
if source_id is None:
raise ValueError(
"Source ids are required when delete mode is incremental. "
f"Document that starts with "
f"content: {hashed_doc.page_content[:100]} was not assigned "
f"as source id."
)
# source ids cannot be None after for loop above.
source_ids = cast(Sequence[str], source_ids) # type: ignore[assignment]
exists_batch = record_manager.exists([doc.uid for doc in hashed_docs])
# Filter out documents that already exist in the record store.
uids = []
docs_to_index = []
for doc, hashed_doc, doc_exists in zip(doc_batch, hashed_docs, exists_batch):
if doc_exists:
# Must be updated to refresh timestamp.
record_manager.update([hashed_doc.uid], time_at_least=index_start_dt)
num_skipped += 1
continue
uids.append(hashed_doc.uid)
docs_to_index.append(doc)
# Be pessimistic and assume that any vector store write may fail.
# First write to vector store
if docs_to_index:
vector_store.add_documents(docs_to_index, ids=uids)
num_added += len(docs_to_index)
# And only then update the record store.
# Update ALL records, even if they already exist since we want to refresh
# their timestamp.
record_manager.update(
[doc.uid for doc in hashed_docs],
group_ids=source_ids,
time_at_least=index_start_dt,
)
# If source IDs are provided, we can do the deletion incrementally!
if delete_mode == "incremental":
# Get the uids of the documents that were not returned by the loader.
# mypy isn't good enough to determine that source ids cannot be None
# here due to a check that's happening above, so we check again.
for source_id in source_ids:
if source_id is None:
raise AssertionError("Source ids cannot be None here.")
_source_ids = cast(Sequence[str], source_ids)
uids_to_delete = record_manager.list_keys(
group_ids=_source_ids, before=index_start_dt
)
if uids_to_delete:
# First delete from vector store.
vector_store.delete(uids_to_delete)
# Then delete from record store.
record_manager.delete_keys(uids_to_delete)
num_deleted += len(uids_to_delete)
if delete_mode == "full":
uids_to_delete = record_manager.list_keys(before=index_start_dt)
if uids_to_delete:
# First delete from vector store.
vector_store.delete(uids_to_delete)
# Then delete from record store.
record_manager.delete_keys(uids_to_delete)
num_deleted = len(uids_to_delete)
return {
"num_added": num_added,
"num_updated": num_updated,
"num_skipped": num_skipped,
"num_deleted": num_deleted,
}

@ -0,0 +1,265 @@
"""Implementation of a record management layer in SQLAlchemy.
The management layer uses SQLAlchemy to track upserted records.
Currently, this layer only works with SQLite; however, it should be adaptable
to other SQL implementations with minimal effort.
The implementation uses SQLAlchemy, which should allow it to work with a
variety of SQL backends.
* Each key is associated with an updated_at field.
* This field is updated whenever the key is updated.
* Keys can be listed based on the updated_at field.
* Keys can be deleted.
"""
import contextlib
import uuid
from typing import Any, Dict, Generator, List, Optional, Sequence
from sqlalchemy import (
Column,
Engine,
Float,
Index,
String,
UniqueConstraint,
and_,
create_engine,
text,
)
from sqlalchemy.dialects.sqlite import insert
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import Session, sessionmaker
from langchain.indexes.base import RecordManager
Base = declarative_base()
class UpsertionRecord(Base): # type: ignore[valid-type,misc]
"""Table used to keep track of when a key was last updated."""
# ATTENTION:
# Prior to modifying this table, please determine whether
# we should create migrations for this table to make sure
# users do not experience data loss.
__tablename__ = "upsertion_record"
uuid = Column(
String,
index=True,
default=lambda: str(uuid.uuid4()),
primary_key=True,
nullable=False,
)
key = Column(String, index=True)
# Using a non-normalized representation to handle `namespace` attribute.
# If the need arises, this attribute can be pulled into a separate Collection
# table at some time later.
namespace = Column(String, index=True, nullable=False)
group_id = Column(String, index=True, nullable=True)
# The timestamp associated with the last record upsertion.
updated_at = Column(Float, index=True)
__table_args__ = (
UniqueConstraint("key", "namespace", name="uix_key_namespace"),
Index("ix_key_namespace", "key", "namespace"),
)
class SQLRecordManager(RecordManager):
"""A SQL Alchemy based implementation of the record manager."""
def __init__(
self,
namespace: str,
*,
engine: Optional[Engine] = None,
db_url: Optional[str] = None,
engine_kwargs: Optional[Dict[str, Any]] = None,
) -> None:
"""Initialize the SQLRecordManager.
This class serves as a manager persistence layer that uses an SQL
backend to track upserted records. You should specify either a db_url
to create an engine or provide an existing engine.
Args:
namespace: The namespace associated with this record manager.
engine: An already existing SQL Alchemy engine.
Default is None.
db_url: A database connection string used to create
an SQL Alchemy engine. Default is None.
engine_kwargs: Additional keyword arguments
to be passed when creating the engine. Default is an empty dictionary.
Raises:
ValueError: If both db_url and engine are provided or neither.
AssertionError: If something unexpected happens during engine configuration.
"""
super().__init__(namespace=namespace)
if db_url is None and engine is None:
raise ValueError("Must specify either db_url or engine")
if db_url is not None and engine is not None:
raise ValueError("Must specify either db_url or engine, not both")
if db_url:
_kwargs = engine_kwargs or {}
_engine = create_engine(db_url, **_kwargs)
elif engine:
_engine = engine
else:
raise AssertionError("Something went wrong with configuration of engine.")
self.engine = _engine
self.session_factory = sessionmaker(bind=self.engine)
def create_schema(self) -> None:
"""Create the database schema."""
Base.metadata.create_all(self.engine)
@contextlib.contextmanager
def _make_session(self) -> Generator[Session, None, None]:
"""Create a session and close it after use."""
session = self.session_factory()
try:
yield session
finally:
session.close()
def get_time(self) -> float:
"""Get the current server time as a timestamp.
Please note it's critical that time is obtained from the server since
we want a monotonic clock.
"""
with self._make_session() as session:
# * SQLite specific implementation, can be changed based on dialect.
# * For SQLite, unlike unixepoch it will work with older versions of SQLite.
# ----
# julianday('now'): Julian day number for the current date and time.
# The Julian day is a continuous count of days, starting from a
# reference date (Julian day number 0).
# 2440587.5 - constant represents the Julian day number for January 1, 1970
# 86400.0 - constant represents the number of seconds
# in a day (24 hours * 60 minutes * 60 seconds)
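# Example (for illustration): at the Unix epoch (1970-01-01 00:00:00 UTC),
# julianday('now') would be 2440587.5, so the expression evaluates to 0.0.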
query = text("SELECT (julianday('now') - 2440587.5) * 86400.0;")
dt = session.execute(query).scalar()
if not isinstance(dt, float):
raise AssertionError(f"Unexpected type for datetime: {type(dt)}")
return dt
def update(
self,
keys: Sequence[str],
*,
group_ids: Optional[Sequence[Optional[str]]] = None,
time_at_least: Optional[float] = None,
) -> None:
"""Upsert records into the SQLite database."""
if group_ids is None:
group_ids = [None] * len(keys)
if len(keys) != len(group_ids):
raise ValueError(
f"Number of keys ({len(keys)}) does not match number of "
f"group_ids ({len(group_ids)})"
)
# Get the current time from the server.
# This makes an extra round trip to the server, should not be a big deal
# if the batch size is large enough.
# Getting the time here helps us compare it against the time_at_least
# and raise an error if there is a time sync issue.
# Here, we're just being extra careful to minimize the chance of
# data loss due to incorrectly deleting records.
update_time = self.get_time()
if time_at_least and update_time < time_at_least:
# Safeguard against time sync issues
raise AssertionError(f"Time sync issue: {update_time} < {time_at_least}")
records_to_upsert = [
{
"key": key,
"namespace": self.namespace,
"updated_at": update_time,
"group_id": group_id,
}
for key, group_id in zip(keys, group_ids)
]
with self._make_session() as session:
# Note: uses SQLite insert to make on_conflict_do_update work.
# This code needs to be generalized a bit to work with more dialects.
insert_stmt = insert(UpsertionRecord).values(records_to_upsert)
stmt = insert_stmt.on_conflict_do_update( # type: ignore[attr-defined]
[UpsertionRecord.key, UpsertionRecord.namespace],
set_=dict(
# attr-defined type ignore
updated_at=insert_stmt.excluded.updated_at, # type: ignore
group_id=insert_stmt.excluded.group_id, # type: ignore
),
)
session.execute(stmt)
session.commit()
def exists(self, keys: Sequence[str]) -> List[bool]:
"""Check if the given keys exist in the SQLite database."""
with self._make_session() as session:
records = (
# mypy does not recognize .all()
session.query(UpsertionRecord.key) # type: ignore[attr-defined]
.filter(
and_(
UpsertionRecord.key.in_(keys),
UpsertionRecord.namespace == self.namespace,
)
)
.all()
)
found_keys = set(r.key for r in records)
return [k in found_keys for k in keys]
def list_keys(
self,
*,
before: Optional[float] = None,
after: Optional[float] = None,
group_ids: Optional[Sequence[str]] = None,
) -> List[str]:
"""List records in the SQLite database based on the provided date range."""
with self._make_session() as session:
query = session.query(UpsertionRecord).filter(
UpsertionRecord.namespace == self.namespace
)
# mypy does not recognize .all() or .filter()
if after:
query = query.filter( # type: ignore[attr-defined]
UpsertionRecord.updated_at > after
)
if before:
query = query.filter( # type: ignore[attr-defined]
UpsertionRecord.updated_at < before
)
if group_ids:
query = query.filter( # type: ignore[attr-defined]
UpsertionRecord.group_id.in_(group_ids)
)
records = query.all() # type: ignore[attr-defined]
return [r.key for r in records]
def delete_keys(self, keys: Sequence[str]) -> None:
"""Delete records from the SQLite database."""
with self._make_session() as session:
# mypy does not recognize .delete()
session.query(UpsertionRecord).filter(
and_(
UpsertionRecord.key.in_(keys),
UpsertionRecord.namespace == self.namespace,
)
).delete() # type: ignore[attr-defined]
session.commit()

@ -0,0 +1,95 @@
from __future__ import annotations
import uuid
from abc import ABC, abstractmethod
from typing import List, Optional, Sequence
NAMESPACE_UUID = uuid.UUID(int=1984)
class RecordManager(ABC):
"""An abstract base class representing the interface for a record manager."""
def __init__(
self,
namespace: str,
) -> None:
"""Initialize the record manager.
Args:
namespace (str): The namespace for the record manager.
"""
self.namespace = namespace
@abstractmethod
def create_schema(self) -> None:
"""Create the database schema for the record manager."""
@abstractmethod
def get_time(self) -> float:
"""Get the current server time as a high resolution timestamp!
It's important to get this from the server to ensure a monotonic clock,
otherwise there may be data loss when cleaning up old documents!
Returns:
The current server time as a float timestamp.
"""
@abstractmethod
def update(
self,
keys: Sequence[str],
*,
group_ids: Optional[Sequence[Optional[str]]] = None,
time_at_least: Optional[float] = None,
) -> None:
"""Upsert records into the database.
Args:
keys: A list of record keys to upsert.
group_ids: A list of group IDs corresponding to the keys.
time_at_least: if provided, updates should only happen if the
updated_at field is at least this time.
Raises:
ValueError: If the length of keys doesn't match the length of group_ids.
"""
@abstractmethod
def exists(self, keys: Sequence[str]) -> List[bool]:
"""Check if the provided keys exist in the database.
Args:
keys: A list of keys to check.
Returns:
A list of boolean values indicating the existence of each key.
"""
@abstractmethod
def list_keys(
self,
*,
before: Optional[float] = None,
after: Optional[float] = None,
group_ids: Optional[Sequence[str]] = None,
) -> List[str]:
"""List records in the database based on the provided filters.
Args:
before: Filter to list records updated before this time.
after: Filter to list records updated after this time.
group_ids: Filter to list records with specific group IDs.
Returns:
A list of keys for the matching records.
"""
@abstractmethod
def delete_keys(self, keys: Sequence[str]) -> None:
"""Delete specified records from the database.
Args:
keys: A list of keys to delete.
"""

@ -0,0 +1,13 @@
from langchain.indexes import __all__
def test_all() -> None:
"""Use to catch obvious breaking changes."""
assert __all__ == sorted(__all__, key=str.lower)
assert __all__ == [
"GraphIndexCreator",
"index",
"IndexingResult",
"SQLRecordManager",
"VectorstoreIndexCreator",
]

@ -0,0 +1,50 @@
import pytest
from langchain.indexes._api import _HashedDocument
from langchain.schema import Document
def test_hashed_document_hashing() -> None:
hashed_document = _HashedDocument(
uid="123", page_content="Lorem ipsum dolor sit amet", metadata={"key": "value"}
)
assert isinstance(hashed_document.hash_, str)
def test_hashing_with_missing_content() -> None:
"""Check that ValueError is raised if page_content is missing."""
with pytest.raises(ValueError):
_HashedDocument(
metadata={"key": "value"},
)
def test_uid_auto_assigned_to_hash() -> None:
"""Test uid is auto-assigned to the hashed_document hash."""
hashed_document = _HashedDocument(
page_content="Lorem ipsum dolor sit amet", metadata={"key": "value"}
)
assert hashed_document.uid == hashed_document.hash_
def test_to_document() -> None:
"""Test to_document method."""
hashed_document = _HashedDocument(
page_content="Lorem ipsum dolor sit amet", metadata={"key": "value"}
)
doc = hashed_document.to_document()
assert isinstance(doc, Document)
assert doc.page_content == "Lorem ipsum dolor sit amet"
assert doc.metadata == {"key": "value"}
def test_from_document() -> None:
"""Test from document class method."""
document = Document(
page_content="Lorem ipsum dolor sit amet", metadata={"key": "value"}
)
hashed_document = _HashedDocument.from_document(document)
# hash should be deterministic
assert hashed_document.hash_ == "fd1dc827-051b-537d-a1fe-1fa043e8b276"
assert hashed_document.uid == hashed_document.hash_

@ -0,0 +1,474 @@
from datetime import datetime
from typing import Any, Dict, Iterable, Iterator, List, Optional, Sequence, Type
from unittest.mock import patch
import pytest
from langchain.document_loaders.base import BaseLoader
from langchain.embeddings.base import Embeddings
from langchain.indexes import index
from langchain.indexes._sql_record_manager import SQLRecordManager
from langchain.schema import Document
from langchain.vectorstores.base import VST, VectorStore
class ToyLoader(BaseLoader):
"""Toy loader that always returns the same documents."""
def __init__(self, documents: Sequence[Document]) -> None:
"""Initialize with the documents to return."""
self.documents = documents
def lazy_load(
self,
) -> Iterator[Document]:
yield from self.documents
def load(self) -> List[Document]:
"""Load the documents from the source."""
return list(self.lazy_load())
class InMemoryVectorStore(VectorStore):
"""In-memory implementation of VectorStore using a dictionary."""
def __init__(self) -> None:
"""Vector store interface for testing things in memory."""
self.store: Dict[str, Document] = {}
def delete(self, ids: Optional[Sequence[str]] = None, **kwargs: Any) -> None:
"""Delete the given documents from the store using their IDs."""
if ids:
for _id in ids:
self.store.pop(_id, None)
def add_documents( # type: ignore
self,
documents: Sequence[Document],
*,
ids: Optional[Sequence[str]] = None,
**kwargs: Any,
) -> None:
"""Add the given documents to the store (insert behavior)."""
if ids and len(ids) != len(documents):
raise ValueError(
f"Expected {len(ids)} ids, got {len(documents)} documents."
)
if not ids:
raise NotImplementedError("This is not implemented yet.")
for _id, document in zip(ids, documents):
if _id in self.store:
raise ValueError(
f"Document with uid {_id} already exists in the store."
)
self.store[_id] = document
def add_texts(
self,
texts: Iterable[str],
metadatas: Optional[List[dict]] = None,
**kwargs: Any,
) -> List[str]:
"""Add the given texts to the store (insert behavior)."""
raise NotImplementedError()
@classmethod
def from_texts(
cls: Type[VST],
texts: List[str],
embedding: Embeddings,
metadatas: Optional[List[dict]] = None,
**kwargs: Any,
) -> VST:
"""Create a vector store from a list of texts."""
raise NotImplementedError()
def similarity_search(
self, query: str, k: int = 4, **kwargs: Any
) -> List[Document]:
"""Find the most similar documents to the given query."""
raise NotImplementedError()
@pytest.fixture
def record_manager() -> SQLRecordManager:
"""Timestamped set fixture."""
record_manager = SQLRecordManager("kittens", db_url="sqlite:///:memory:")
record_manager.create_schema()
return record_manager
@pytest.fixture
def vector_store() -> InMemoryVectorStore:
"""Vector store fixture."""
return InMemoryVectorStore()
def test_indexing_same_content(
record_manager: SQLRecordManager, vector_store: InMemoryVectorStore
) -> None:
"""Indexing some content to confirm it gets added only once."""
loader = ToyLoader(
documents=[
Document(
page_content="This is a test document.",
),
Document(
page_content="This is another document.",
),
]
)
assert index(loader, record_manager, vector_store) == {
"num_added": 2,
"num_deleted": 0,
"num_skipped": 0,
"num_updated": 0,
}
assert len(list(vector_store.store)) == 2
for _ in range(2):
# Run the indexing again
assert index(loader, record_manager, vector_store) == {
"num_added": 0,
"num_deleted": 0,
"num_skipped": 2,
"num_updated": 0,
}
def test_index_simple_delete_full(
record_manager: SQLRecordManager, vector_store: InMemoryVectorStore
) -> None:
"""Indexing some content to confirm it gets added only once."""
loader = ToyLoader(
documents=[
Document(
page_content="This is a test document.",
),
Document(
page_content="This is another document.",
),
]
)
with patch.object(
record_manager, "get_time", return_value=datetime(2021, 1, 1).timestamp()
):
assert index(loader, record_manager, vector_store, delete_mode="full") == {
"num_added": 2,
"num_deleted": 0,
"num_skipped": 0,
"num_updated": 0,
}
with patch.object(
record_manager, "get_time", return_value=datetime(2021, 1, 1).timestamp()
):
assert index(loader, record_manager, vector_store, delete_mode="full") == {
"num_added": 0,
"num_deleted": 0,
"num_skipped": 2,
"num_updated": 0,
}
loader = ToyLoader(
documents=[
Document(
page_content="mutated document 1",
),
Document(
page_content="This is another document.", # <-- Same as original
),
]
)
with patch.object(
record_manager, "get_time", return_value=datetime(2021, 1, 2).timestamp()
):
assert index(loader, record_manager, vector_store, delete_mode="full") == {
"num_added": 1,
"num_deleted": 1,
"num_skipped": 1,
"num_updated": 0,
}
doc_texts = set(
# Ignoring type since doc should be in the store and not a None
vector_store.store.get(uid).page_content # type: ignore
for uid in vector_store.store
)
assert doc_texts == {"mutated document 1", "This is another document."}
# Attempt to index again and verify that nothing changes
with patch.object(
record_manager, "get_time", return_value=datetime(2021, 1, 2).timestamp()
):
assert index(loader, record_manager, vector_store, delete_mode="full") == {
"num_added": 0,
"num_deleted": 0,
"num_skipped": 2,
"num_updated": 0,
}
def test_incremental_fails_with_bad_source_ids(
record_manager: SQLRecordManager, vector_store: InMemoryVectorStore
) -> None:
"""Test indexing with incremental deletion strategy."""
loader = ToyLoader(
documents=[
Document(
page_content="This is a test document.",
metadata={"source": "1"},
),
Document(
page_content="This is another document.",
metadata={"source": "2"},
),
Document(
page_content="This is yet another document.",
metadata={"source": None},
),
]
)
with pytest.raises(ValueError):
# Should raise an error because no source id function was specified
index(loader, record_manager, vector_store, delete_mode="incremental")
with pytest.raises(ValueError):
# Should raise an error because a document has a None source id
index(
loader,
record_manager,
vector_store,
delete_mode="incremental",
source_id_key="source",
)
def test_no_delete(
record_manager: SQLRecordManager, vector_store: InMemoryVectorStore
) -> None:
"""Test indexing without a deletion strategy."""
loader = ToyLoader(
documents=[
Document(
page_content="This is a test document.",
metadata={"source": "1"},
),
Document(
page_content="This is another document.",
metadata={"source": "2"},
),
]
)
with patch.object(
record_manager, "get_time", return_value=datetime(2021, 1, 2).timestamp()
):
assert index(
loader,
record_manager,
vector_store,
delete_mode=None,
source_id_key="source",
) == {
"num_added": 2,
"num_deleted": 0,
"num_skipped": 0,
"num_updated": 0,
}
# If we add the same content twice it should be skipped
with patch.object(
record_manager, "get_time", return_value=datetime(2021, 1, 2).timestamp()
):
assert index(
loader,
record_manager,
vector_store,
delete_mode=None,
source_id_key="source",
) == {
"num_added": 0,
"num_deleted": 0,
"num_skipped": 2,
"num_updated": 0,
}
loader = ToyLoader(
documents=[
Document(
page_content="mutated content",
metadata={"source": "1"},
),
Document(
page_content="This is another document.",
metadata={"source": "2"},
),
]
)
# Should result in no updates or deletions!
with patch.object(
record_manager, "get_time", return_value=datetime(2021, 1, 2).timestamp()
):
assert index(
loader,
record_manager,
vector_store,
delete_mode=None,
source_id_key="source",
) == {
"num_added": 1,
"num_deleted": 0,
"num_skipped": 1,
"num_updated": 0,
}
def test_incremental_delete(
record_manager: SQLRecordManager, vector_store: InMemoryVectorStore
) -> None:
"""Test indexing with incremental deletion strategy."""
loader = ToyLoader(
documents=[
Document(
page_content="This is a test document.",
metadata={"source": "1"},
),
Document(
page_content="This is another document.",
metadata={"source": "2"},
),
]
)
with patch.object(
record_manager, "get_time", return_value=datetime(2021, 1, 2).timestamp()
):
assert index(
loader,
record_manager,
vector_store,
delete_mode="incremental",
source_id_key="source",
) == {
"num_added": 2,
"num_deleted": 0,
"num_skipped": 0,
"num_updated": 0,
}
doc_texts = set(
# Ignoring type since doc should be in the store and not a None
vector_store.store.get(uid).page_content # type: ignore
for uid in vector_store.store
)
assert doc_texts == {"This is another document.", "This is a test document."}
# Attempt to index again and verify that nothing changes
with patch.object(
record_manager, "get_time", return_value=datetime(2021, 1, 2).timestamp()
):
assert index(
loader,
record_manager,
vector_store,
delete_mode="incremental",
source_id_key="source",
) == {
"num_added": 0,
"num_deleted": 0,
"num_skipped": 2,
"num_updated": 0,
}
# Create 2 documents from the same source all with mutated content
loader = ToyLoader(
documents=[
Document(
page_content="mutated document 1",
metadata={"source": "1"},
),
Document(
page_content="mutated document 2",
metadata={"source": "1"},
),
Document(
page_content="This is another document.", # <-- Same as original
metadata={"source": "2"},
),
]
)
# Index the mutated documents and verify that old versions are cleaned up
with patch.object(
record_manager, "get_time", return_value=datetime(2021, 1, 3).timestamp()
):
assert index(
loader,
record_manager,
vector_store,
delete_mode="incremental",
source_id_key="source",
) == {
"num_added": 2,
"num_deleted": 1,
"num_skipped": 1,
"num_updated": 0,
}
doc_texts = set(
# Ignoring type since doc should be in the store and not a None
vector_store.store.get(uid).page_content # type: ignore
for uid in vector_store.store
)
assert doc_texts == {
"mutated document 1",
"mutated document 2",
"This is another document.",
}
def test_indexing_with_no_docs(
record_manager: SQLRecordManager, vector_store: VectorStore
) -> None:
"""Check edge case when loader returns no new docs."""
loader = ToyLoader(documents=[])
assert index(loader, record_manager, vector_store, delete_mode="full") == {
"num_added": 0,
"num_deleted": 0,
"num_skipped": 0,
"num_updated": 0,
}
def test_deduplication(
record_manager: SQLRecordManager, vector_store: VectorStore
) -> None:
"""Check edge case when loader returns no new docs."""
docs = [
Document(
page_content="This is a test document.",
metadata={"source": "1"},
),
Document(
page_content="This is a test document.",
metadata={"source": "1"},
),
]
# Should result in only a single document being added
assert index(docs, record_manager, vector_store, delete_mode="full") == {
"num_added": 1,
"num_deleted": 0,
"num_skipped": 0,
"num_updated": 0,
}

@ -0,0 +1,276 @@
from datetime import datetime
from unittest.mock import patch
import pytest
from langchain.indexes._sql_record_manager import SQLRecordManager, UpsertionRecord
@pytest.fixture()
def manager() -> SQLRecordManager:
"""Initialize the test database and yield the TimestampedSet instance."""
# Initialize and yield the TimestampedSet instance
record_manager = SQLRecordManager("kittens", db_url="sqlite:///:memory:")
record_manager.create_schema()
return record_manager
def test_update(manager: SQLRecordManager) -> None:
"""Test updating records in the database."""
# no keys should be present in the set
read_keys = manager.list_keys()
assert read_keys == []
# Insert records
keys = ["key1", "key2", "key3"]
manager.update(keys)
# Retrieve the records
read_keys = manager.list_keys()
assert read_keys == ["key1", "key2", "key3"]
def test_update_timestamp(manager: SQLRecordManager) -> None:
"""Test updating records in the database."""
# no keys should be present in the set
with patch.object(
manager, "get_time", return_value=datetime(2021, 1, 2).timestamp()
):
manager.update(["key1"])
with manager._make_session() as session:
records = (
session.query(UpsertionRecord)
.filter(UpsertionRecord.namespace == manager.namespace)
.all() # type: ignore[attr-defined]
)
assert [
{
"key": record.key,
"namespace": record.namespace,
"updated_at": record.updated_at,
"group_id": record.group_id,
}
for record in records
] == [
{
"group_id": None,
"key": "key1",
"namespace": "kittens",
"updated_at": datetime(2021, 1, 2, 0, 0).timestamp(),
}
]
with patch.object(
manager, "get_time", return_value=datetime(2023, 1, 2).timestamp()
):
manager.update(["key1"])
with manager._make_session() as session:
records = (
session.query(UpsertionRecord)
.filter(UpsertionRecord.namespace == manager.namespace)
.all() # type: ignore[attr-defined]
)
assert [
{
"key": record.key,
"namespace": record.namespace,
"updated_at": record.updated_at,
"group_id": record.group_id,
}
for record in records
] == [
{
"group_id": None,
"key": "key1",
"namespace": "kittens",
"updated_at": datetime(2023, 1, 2, 0, 0).timestamp(),
}
]
with patch.object(
manager, "get_time", return_value=datetime(2023, 2, 2).timestamp()
):
manager.update(["key1"], group_ids=["group1"])
with manager._make_session() as session:
records = (
session.query(UpsertionRecord)
.filter(UpsertionRecord.namespace == manager.namespace)
.all() # type: ignore[attr-defined]
)
assert [
{
"key": record.key,
"namespace": record.namespace,
"updated_at": record.updated_at,
"group_id": record.group_id,
}
for record in records
] == [
{
"group_id": "group1",
"key": "key1",
"namespace": "kittens",
"updated_at": datetime(2023, 2, 2, 0, 0).timestamp(),
}
]
def test_update_with_group_ids(manager: SQLRecordManager) -> None:
"""Test updating records in the database."""
# no keys should be present in the set
read_keys = manager.list_keys()
assert read_keys == []
# Insert records
keys = ["key1", "key2", "key3"]
manager.update(keys)
# Retrieve the records
read_keys = manager.list_keys()
assert read_keys == ["key1", "key2", "key3"]
def test_exists(manager: SQLRecordManager) -> None:
"""Test checking if keys exist in the database."""
# Insert records
keys = ["key1", "key2", "key3"]
manager.update(keys)
# Check if the keys exist in the database
exists = manager.exists(keys)
assert len(exists) == len(keys)
assert exists == [True, True, True]
exists = manager.exists(["key1", "key4"])
assert len(exists) == 2
assert exists == [True, False]
def test_list_keys(manager: SQLRecordManager) -> None:
"""Test listing keys based on the provided date range."""
# Insert records
assert manager.list_keys() == []
with manager._make_session() as session:
# Add some keys with explicit updated_ats
session.add(
UpsertionRecord(
key="key1",
updated_at=datetime(2021, 1, 1).timestamp(),
namespace="kittens",
)
)
session.add(
UpsertionRecord(
key="key2",
updated_at=datetime(2022, 1, 1).timestamp(),
namespace="kittens",
)
)
session.add(
UpsertionRecord(
key="key3",
updated_at=datetime(2023, 1, 1).timestamp(),
namespace="kittens",
)
)
session.add(
UpsertionRecord(
key="key4",
group_id="group1",
updated_at=datetime(2024, 1, 1).timestamp(),
namespace="kittens",
)
)
# Insert keys from a different namespace, these should not be visible!
session.add(
UpsertionRecord(
key="key1",
updated_at=datetime(2021, 1, 1).timestamp(),
namespace="puppies",
)
)
session.add(
UpsertionRecord(
key="key5",
updated_at=datetime(2021, 1, 1).timestamp(),
namespace="puppies",
)
)
session.commit()
# Retrieve all keys
assert manager.list_keys() == ["key1", "key2", "key3", "key4"]
# Retrieve keys updated after a certain date
assert manager.list_keys(after=datetime(2022, 2, 1).timestamp()) == ["key3", "key4"]
# Retrieve keys updated before a certain date
assert manager.list_keys(before=datetime(2022, 2, 1).timestamp()) == [
"key1",
"key2",
]
# Retrieve keys updated before a certain date (none in this range)
assert manager.list_keys(before=datetime(2019, 2, 1).timestamp()) == []
# Retrieve keys in a time range
assert manager.list_keys(
before=datetime(2022, 2, 1).timestamp(),
after=datetime(2021, 11, 1).timestamp(),
) == ["key2"]
assert manager.list_keys(group_ids=["group1", "group2"]) == ["key4"]
# Test multiple filters
assert (
manager.list_keys(
group_ids=["group1", "group2"], before=datetime(2019, 1, 1).timestamp()
)
== []
)
assert manager.list_keys(
group_ids=["group1", "group2"], after=datetime(2019, 1, 1).timestamp()
) == ["key4"]
def test_namespace_is_used(manager: SQLRecordManager) -> None:
"""Verify that namespace is taken into account for all operations."""
assert manager.namespace == "kittens"
with manager._make_session() as session:
# Add some keys with explicit updated_ats
session.add(UpsertionRecord(key="key1", namespace="kittens"))
session.add(UpsertionRecord(key="key2", namespace="kittens"))
session.add(UpsertionRecord(key="key1", namespace="puppies"))
session.add(UpsertionRecord(key="key3", namespace="puppies"))
session.commit()
assert manager.list_keys() == ["key1", "key2"]
manager.delete_keys(["key1"])
assert manager.list_keys() == ["key2"]
manager.update(["key3"], group_ids=["group3"])
with manager._make_session() as session:
results = session.query(UpsertionRecord).all()
assert sorted([(r.namespace, r.key, r.group_id) for r in results]) == [
("kittens", "key2", None),
("kittens", "key3", "group3"),
("puppies", "key1", None),
("puppies", "key3", None),
]
def test_delete_keys(manager: SQLRecordManager) -> None:
"""Test deleting keys from the database."""
# Insert records
keys = ["key1", "key2", "key3"]
manager.update(keys)
# Delete some keys
keys_to_delete = ["key1", "key2"]
manager.delete_keys(keys_to_delete)
# Check if the deleted keys are no longer in the database
remaining_keys = manager.list_keys()
assert remaining_keys == ["key3"]