langchain/docs/extras/modules/data_connection/indexing.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0fe57ac5-31c5-4dbb-b96c-78dead32e1bd",
   "metadata": {},
   "source": [
    "# Indexing\n",
    "\n",
    "Here, we will look at a basic indexing workflow using the LangChain indexing API. \n",
    "\n",
    "The indexing API lets you load and keep in sync documents from any source into a vector store. Specifically, it helps:\n",
    "\n",
    "* Avoid writing duplicated content into the vector store\n",
    "* Avoid re-writing unchanged content\n",
    "* Avoid re-computing embeddings over unchanged content\n",
    "\n",
    "All of which should save you time and money, as well as improve your vector search results.\n",
    "\n",
    "Crucially, the indexing API will work even with documents that have gone through several \n",
    "transformation steps (e.g., via text chunking) with respect to the original source documents.\n",
    "\n",
    "## How it works\n",
    "\n",
    "LangChain indexing makes use of a record manager (`RecordManager`) that keeps track of document writes into the vector store.\n",
    "\n",
    "When indexing content, hashes are computed for each document, and the following information is stored in the record manager: \n",
    "\n",
    "- the document hash (hash of both page content and metadata)\n",
    "- write time\n",
    "- the source id -- each document should include information in its metadata to allow us to determine the ultimate source of this document\n",
    "\n",
    "## Deletion modes\n",
    "\n",
    "When indexing documents into a vector store, it's possible that some existing documents in the vector store should be deleted. In certain situations you may want to remove any existing documents that are derived from the same sources as the new documents being indexed. In others you may want to delete all existing documents wholesale. The indexing API deletion modes let you pick the behavior you want:\n",
    "\n",
    "| Cleanup Mode | De-Duplicates Content | Parallelizable | Cleans Up Deleted Source Docs   | Cleans Up Mutations of Source Docs and/or Derived Docs | Clean Up Timing   |\n",
    "|-------------|-----------------------|---------------|----------------------------------|----------------------------------------------------|---------------------|\n",
    "| None        | ✅                    | ✅            | ❌                               | ❌                                                 | -                  |\n",
    "| Incremental | ✅                    | ✅            | ❌                               | ✅                                                 | Continuously       |\n",
    "| Full        | ✅                    | ❌            | ✅                               | ✅                                                 | At end of indexing |\n",
    "\n",
    "\n",
    "`None` does not do any automatic clean up, allowing the user to manually do clean up of old content. \n",
    "\n",
    "`incremental` and `full` offer the following automated clean up:\n",
    "\n",
    "* If the content of source document or derived documents has **changed**, both `incremental` or `full` modes will clean up (delete) previous versions of the content.\n",
    "* If the source document has been **deleted** (meaning it is not included in the documents currently being indexed), the `full` cleanup mode will delete it from the vector store correctly, but the `incremental` mode will not.\n",
    "\n",
    "When content is mutated (e.g., the source PDF file was revised) there will be a period of time during indexing when both the new and old versions may be returned to the user. This happens after the new content was written, but before the old version was deleted.\n",
    "\n",
    "* `incremental` indexing minimizes this period of time as it is able to do clean up continuously, as it writes.\n",
    "* `full` mode does the clean up after all batches have been written.\n",
    "\n",
    "## Requirements\n",
    "\n",
    "1. Do not use with a store that has been pre-populated with content independently of the indexing API, as the record manager will not know that records have been inserted previously.\n",
    "2. Only works with LangChain ``VectorStore``'s that support:\n",
    "   * document addition by id (`add_documents` method with `ids` argument)\n",
    "   * delete by id (`delete` method with)\n",
    "  \n",
    "## Caution\n",
    "\n",
    "The record manager relies on a time-based mechanism to determine what content can be cleaned up (when using `full` or `incremental` cleanup modes).\n",
    "\n",
    "If two tasks run back to back, and the first task finishes before the the clock time changes, then the second task may not be able to clean up content.\n",
    "\n",
    "This is unlikely to be an issue in actual settings for the following reasons:\n",
    "\n",
    "1. The RecordManager uses higher resolutino timestamps.\n",
    "2. The data would need to change between the first and the second tasks runs, which becomes unlikely if the time interval between the tasks is small.\n",
    "3. Indexing tasks typically take more than a few ms."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec2109b4-cbcc-44eb-9dac-3f7345f971dc",
   "metadata": {},
   "source": [
    "## Quickstart"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "15f7263e-c82e-4914-874f-9699ea4de93e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import OpenAIEmbeddings\n",
    "from langchain.indexes import SQLRecordManager, index\n",
    "from langchain.schema import Document\n",
    "from langchain.vectorstores import ElasticsearchStore"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f81201ab-d997-433c-9f18-ceea70e61cbd",
   "metadata": {},
   "source": [
    "Initialize a vector store and set up the embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "4ffc9659-91c0-41e0-ae4b-f7ff0d97292d",
   "metadata": {},
   "outputs": [],
   "source": [
    "collection_name = \"test_index\"\n",
    "\n",
    "embedding = OpenAIEmbeddings()\n",
    "\n",
    "vectorstore = ElasticsearchStore(\n",
    "    es_url=\"http://localhost:9200\", index_name=\"test_index\", embedding=embedding\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b9b7564f-2334-428b-b513-13045a08b56c",
   "metadata": {},
   "source": [
    "Initialize a record manager with an appropriate namespace.\n",
    "\n",
    "**Suggestion** Use a namespace that takes into account both the vectostore and the collection name in the vectorstore; e.g., 'redis/my_docs', 'chromadb/my_docs' or 'postgres/my_docs'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "498cc80e-c339-49ee-893b-b18d06346ef8",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "namespace = f\"elasticsearch/{collection_name}\"\n",
    "record_manager = SQLRecordManager(\n",
    "    namespace, db_url=\"sqlite:///record_manager_cache.sql\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "835c2c19-68ec-4086-9066-f7ba40877fd5",
   "metadata": {},
   "source": [
    "Create a schema before using the record manager"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "a4be2da3-3a5c-468a-a824-560157290f7f",
   "metadata": {},
   "outputs": [],
   "source": [
    "record_manager.create_schema()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f07c6bd-6ada-4b17-a8c5-fe5e4a5278fd",
   "metadata": {},
   "source": [
    "Let's index some test documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "bbfdf314-14f9-4799-8fb6-d42de4d51287",
   "metadata": {},
   "outputs": [],
   "source": [
    "doc1 = Document(page_content=\"kitty\", metadata={\"source\": \"kitty.txt\"})\n",
    "doc2 = Document(page_content=\"doggy\", metadata={\"source\": \"doggy.txt\"})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7d572be-a913-4511-ab64-2864a252458a",
   "metadata": {},
   "source": [
    "Indexing into an empty vectorstore"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "67d2a5c8-f2bd-489a-b58e-2c7ba7fefe6f",
   "metadata": {},
   "outputs": [],
   "source": [
    "def _clear():\n",
    "    \"\"\"Hacky helper method to clear content. See the `full` mode section to to understand why it works.\"\"\"\n",
    "    index([], record_manager, vectorstore, cleanup=\"full\", source_id_key=\"source\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5e92e76-f23f-4a61-8a2d-f16baf288700",
   "metadata": {},
   "source": [
    "### ``None`` deletion mode\n",
    "\n",
    "This mode does not do automatic clean up of old versions of content; however, it still takes care of content de-duplication."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "e2288cee-1738-4054-af72-23b5c5be8840",
   "metadata": {},
   "outputs": [],
   "source": [
    "_clear()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "b253483b-5be0-4151-b732-ca93db4457b1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'num_added': 1, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "index(\n",
    "    [doc1, doc1, doc1, doc1, doc1],\n",
    "    record_manager,\n",
    "    vectorstore,\n",
    "    cleanup=None,\n",
    "    source_id_key=\"source\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "7abaf351-bf5a-4d9e-95cd-4e3ecbfc1a84",
   "metadata": {},
   "outputs": [],
   "source": [
    "_clear()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "55b6873c-5907-4fa6-84ca-df6cdf1810f0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "index(\n",
    "    [doc1, doc2], record_manager, vectorstore, cleanup=None, source_id_key=\"source\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7be3e55a-5fe9-4f40-beff-577c2aa5e76a",
   "metadata": {},
   "source": [
    "Second time around all content will be skipped"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "59d74ca1-2e3d-4b4c-ad88-a4907aa20081",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'num_added': 0, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "index(\n",
    "    [doc1, doc2], record_manager, vectorstore, cleanup=None, source_id_key=\"source\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "237a809e-575d-4f02-870e-5906a3643f30",
   "metadata": {},
   "source": [
    "### ``\"incremental\"`` deletion mode"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "6bc91073-0ab4-465a-9302-e7f4bbd2285c",
   "metadata": {},
   "outputs": [],
   "source": [
    "_clear()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "4a551091-6d46-4cdd-9af9-8672e5866a0a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "index(\n",
    "    [doc1, doc2],\n",
    "    record_manager,\n",
    "    vectorstore,\n",
    "    cleanup=\"incremental\",\n",
    "    source_id_key=\"source\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0604ab8-318c-4706-959b-3907af438630",
   "metadata": {},
   "source": [
    "Indexing again should result in both documents getting **skipped** -- also skipping the embedding operation!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "81785863-391b-4578-a6f6-63b3e5285488",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'num_added': 0, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "index(\n",
    "    [doc1, doc2],\n",
    "    record_manager,\n",
    "    vectorstore,\n",
    "    cleanup=\"incremental\",\n",
    "    source_id_key=\"source\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b205c1ba-f069-4a4e-af93-dc98afd5c9e6",
   "metadata": {},
   "source": [
    "If we provide no documents with incremental indexing mode, nothing will change"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "1f73ca85-7478-48ab-976c-17b00beec7bd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'num_added': 0, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "index(\n",
    "    [], record_manager, vectorstore, cleanup=\"incremental\", source_id_key=\"source\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8c4ac96-8d60-4ade-8a94-e76ccb536442",
   "metadata": {},
   "source": [
    "If we mutate a document, the new version will be written and all old versions sharing the same source will be deleted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "27d05bcb-d96d-42eb-88a8-54b33d6cfcdc",
   "metadata": {},
   "outputs": [],
   "source": [
    "changed_doc_2 = Document(page_content=\"puppy\", metadata={\"source\": \"doggy.txt\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "3809e379-5962-4267-add9-b10f43e24c66",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'num_added': 1, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 1}"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "index(\n",
    "    [changed_doc_2],\n",
    "    record_manager,\n",
    "    vectorstore,\n",
    "    cleanup=\"incremental\",\n",
    "    source_id_key=\"source\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bc75b9c-784a-4eb6-b5d6-688e3fbd4658",
   "metadata": {},
   "source": [
    "### ``\"full\"`` deletion mode\n",
    "\n",
    "In `full` mode the user should pass the `full` universe of content that should be indexed into the indexing function.\n",
    "\n",
    "Any documents that are not passed into the indexing functino and are present in the vectorstore will be deleted!\n",
    "\n",
    "This behavior is useful to handle deletions of source documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "38a14a3d-11c7-43e2-b7f1-08e487961bb5",
   "metadata": {},
   "outputs": [],
   "source": [
    "_clear()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "46b5d7b6-ce91-47d2-a9d0-f390e77d847f",
   "metadata": {},
   "outputs": [],
   "source": [
    "all_docs = [doc1, doc2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "06954765-6155-40a0-b95e-33ef87754c8d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "index(all_docs, record_manager, vectorstore, cleanup=\"full\", source_id_key=\"source\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "887c45c6-4363-4389-ac56-9cdad682b4c8",
   "metadata": {},
   "source": [
    "Say someone deleted the first doc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "35270e4e-9b03-4486-95de-e819ca5e469f",
   "metadata": {},
   "outputs": [],
   "source": [
    "del all_docs[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "7d835a6a-f468-4d79-9a3d-47db187edbb8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='doggy', metadata={'source': 'doggy.txt'})]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_docs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d940bcb4-cf6d-4c21-a565-e7f53f6dacf1",
   "metadata": {},
   "source": [
    "Using full mode will clean up the deleted content as well"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "1b660eae-3bed-434d-a6f5-2aec96e5f0d6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'num_added': 0, 'num_updated': 0, 'num_skipped': 1, 'num_deleted': 1}"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "index(all_docs, record_manager, vectorstore, cleanup=\"full\", source_id_key=\"source\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a7ecdc9-df3c-4601-b2f3-50fdffc6e5f9",
   "metadata": {},
   "source": [
    "## Source "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4002a4ac-02dd-4599-9b23-9b59f54237c8",
   "metadata": {},
   "source": [
    "The metadata attribute contains a filed called `source`. This source should be pointing at the *ultimate* provenance associated with the given document.\n",
    "\n",
    "For example, if these documents are representing chunks of some parent document, the `source` for both documents should be the same and reference the parent document.\n",
    "\n",
    "In general, `source` should always be specified. Only use a `None`, if you **never** intend to use `incremental` mode, and for some reason can't specify the `source` field correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "184d3051-7fd1-4db2-a1d5-218ac0e1e641",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import CharacterTextSplitter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "11318248-ad2a-4ef0-bd9b-9d4dab97caba",
   "metadata": {},
   "outputs": [],
   "source": [
    "doc1 = Document(\n",
    "    page_content=\"kitty kitty kitty kitty kitty\", metadata={\"source\": \"kitty.txt\"}\n",
    ")\n",
    "doc2 = Document(page_content=\"doggy doggy the doggy\", metadata={\"source\": \"doggy.txt\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "2cbf0902-d17b-44c9-8983-e8d0e831f909",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='kitty kit', metadata={'source': 'kitty.txt'}),\n",
       " Document(page_content='tty kitty ki', metadata={'source': 'kitty.txt'}),\n",
       " Document(page_content='tty kitty', metadata={'source': 'kitty.txt'}),\n",
       " Document(page_content='doggy doggy', metadata={'source': 'doggy.txt'}),\n",
       " Document(page_content='the doggy', metadata={'source': 'doggy.txt'})]"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_docs = CharacterTextSplitter(\n",
    "    separator=\"t\", keep_separator=True, chunk_size=12, chunk_overlap=2\n",
    ").split_documents([doc1, doc2])\n",
    "new_docs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "0f9d9bc2-ea85-48ab-b4a2-351c8708b1d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "_clear()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "58781d81-f273-4aeb-8df6-540236826d00",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'num_added': 5, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "index(\n",
    "    new_docs,\n",
    "    record_manager,\n",
    "    vectorstore,\n",
    "    cleanup=\"incremental\",\n",
    "    source_id_key=\"source\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "11b81cb6-5f04-499b-b125-1abb22d353bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "changed_doggy_docs = [\n",
    "    Document(page_content=\"woof woof\", metadata={\"source\": \"doggy.txt\"}),\n",
    "    Document(page_content=\"woof woof woof\", metadata={\"source\": \"doggy.txt\"}),\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab1c0915-3f9e-42ac-bdb5-3017935c6e7f",
   "metadata": {},
   "source": [
    "This should delete the old versions of documents associated with `doggy.txt` source and replace them with the new versions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "fec71cb5-6757-4b92-a306-62509f6e867d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'num_added': 0, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 2}"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "index(\n",
    "    changed_doggy_docs,\n",
    "    record_manager,\n",
    "    vectorstore,\n",
    "    cleanup=\"incremental\",\n",
    "    source_id_key=\"source\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "876f5ab6-4b25-423e-8cff-f5a7a014395b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='tty kitty', metadata={'source': 'kitty.txt'}),\n",
       " Document(page_content='tty kitty ki', metadata={'source': 'kitty.txt'}),\n",
       " Document(page_content='kitty kit', metadata={'source': 'kitty.txt'})]"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vectorstore.similarity_search(\"dog\", k=30)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0af4d24-d735-4e5d-ad9b-a2e8b281f9f1",
   "metadata": {},
   "source": [
    "## Using with Loaders\n",
    "\n",
    "Indexing can accept either an iterable of documents or else any loader.\n",
    "\n",
    "**Attention** The loader **MUST** set source keys correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "08b68357-27c0-4f07-a51d-61c986aeb359",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders.base import BaseLoader\n",
    "\n",
    "\n",
    "class MyCustomLoader(BaseLoader):\n",
    "    def lazy_load(self):\n",
    "        text_splitter = CharacterTextSplitter(\n",
    "            separator=\"t\", keep_separator=True, chunk_size=12, chunk_overlap=2\n",
    "        )\n",
    "        docs = [\n",
    "            Document(page_content=\"woof woof\", metadata={\"source\": \"doggy.txt\"}),\n",
    "            Document(page_content=\"woof woof woof\", metadata={\"source\": \"doggy.txt\"}),\n",
    "        ]\n",
    "        yield from text_splitter.split_documents(docs)\n",
    "\n",
    "    def load(self):\n",
    "        return list(self.lazy_load())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "5dae8e11-c0d6-4fc6-aa0e-68f8d92b5087",
   "metadata": {},
   "outputs": [],
   "source": [
    "_clear()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "d8d72f76-6d6e-4a7c-8fea-9bdec05af05b",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = MyCustomLoader()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "945c45cc-5a8d-4bd7-9f36-4ebd4a50e08b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='woof woof', metadata={'source': 'doggy.txt'}),\n",
       " Document(page_content='woof woof woof', metadata={'source': 'doggy.txt'})]"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "dcb1ba71-db49-4140-ab4a-c5d64fc2578a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "index(loader, record_manager, vectorstore, cleanup=\"full\", source_id_key=\"source\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "441159c1-dd84-48d7-8599-37a65c9fb589",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='woof woof', metadata={'source': 'doggy.txt'}),\n",
       " Document(page_content='woof woof woof', metadata={'source': 'doggy.txt'})]"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vectorstore.similarity_search(\"dog\", k=30)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}