Add self query retriever example with MD header splitting (#6359)

Flesh out the notebook example for `MarkdownHeaderTextSplitter`
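
The flow the notebook walks through (group text by header, chunk within each group, carry the header along as metadata, then filter on it) can be sketched without any dependencies. This is a simplified stand-in for the idea, not the implementation of `MarkdownHeaderTextSplitter` or `SelfQueryRetriever`; the helpers `split_by_headers` and `chunk_with_metadata` and the sample markdown are hypothetical, and the sketch does not handle `#` lines inside fenced code blocks.

```python
from typing import Dict, List, Tuple


def split_by_headers(
    markdown: str, headers_to_split_on: List[Tuple[str, str]]
) -> List[Dict]:
    """Group markdown text under the headers that are active at that point.

    Hypothetical helper illustrating the idea; not LangChain's code.
    """
    groups: List[Dict] = []
    active: Dict[str, Tuple[int, str]] = {}  # name -> (depth, header text)
    buffer: List[str] = []

    def flush() -> None:
        # Emit the buffered lines as one group, tagged with a copy of the headers.
        text = "\n".join(buffer).strip()
        if text:
            metadata = {name: title for name, (_, title) in active.items()}
            groups.append({"content": text, "metadata": metadata})
        buffer.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        for prefix, name in headers_to_split_on:
            if stripped.startswith(prefix + " "):
                flush()
                depth = len(prefix)
                # A new header closes every header at the same or a deeper level.
                for other in [n for n, (d, _) in active.items() if d >= depth]:
                    del active[other]
                active[name] = (depth, stripped[len(prefix):].strip())
                break
        else:
            buffer.append(line)
    flush()
    return groups


def chunk_with_metadata(
    groups: List[Dict], chunk_size: int, chunk_overlap: int
) -> Tuple[List[str], List[Dict]]:
    """Cut each group into fixed-size character chunks that never cross groups."""
    assert 0 <= chunk_overlap < chunk_size
    all_splits: List[str] = []
    all_metadatas: List[Dict] = []
    for group in groups:
        text = group["content"]
        start = 0
        while True:
            all_splits.append(text[start:start + chunk_size])
            all_metadatas.append(group["metadata"])
            if start + chunk_size >= len(text):
                break
            start += chunk_size - chunk_overlap
    return all_splits, all_metadatas


md = (
    "# Intro\n\n"
    "## History\nMarkdown is a lightweight markup language.\n\n"
    "## Rise\nMarkdown gained traction.\n\n"
    "# Usage\nWrite plain text."
)
groups = split_by_headers(md, [("#", "Header 1"), ("##", "Header 2")])
all_splits, all_metadatas = chunk_with_metadata(groups, chunk_size=20, chunk_overlap=5)

# Metadata filtering: keep only the chunks that came from the "History" section.
history_chunks = [
    chunk
    for chunk, meta in zip(all_splits, all_metadatas)
    if meta.get("Header 2") == "History"
]
```

In the notebook itself the same roles are played by `MarkdownHeaderTextSplitter` (grouping), `RecursiveCharacterTextSplitter` (chunking) and `SelfQueryRetriever` (inferring the metadata filter from the query).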
2024-11-06 03:20:49 +00:00 · 2023-06-17 21:40:20 -07:00 · 2023-06-17 21:40:20 -07:00 · 370becdfc2
commit 370becdfc2
parent 2c97fbabbd
1 changed files with 328 additions and 17 deletions
--- a/docs/extras/modules/data_connection/document_transformers/text_splitters/markdown_header_metadata.ipynb
+++ b/docs/extras/modules/data_connection/document_transformers/text_splitters/markdown_header_metadata.ipynb
@ -7,23 +7,27 @@
   "source": [
    "# MarkdownHeaderTextSplitter\n",
    "\n",
+    "### Motivation\n",
+    "\n",
    "Many chat or Q+A applications involve chunking input documents prior to embedding and vector storage.\n",
    "\n",
    "[These notes](https://www.pinecone.io/learn/chunking-strategies/) from Pinecone provide some useful tips:\n",
    "\n",
    "```\n",
-    "When a full paragraph or document is embedded, the embedding process considers both the overall context and the relationships between the sentences and phrases within the text. This can result in a more comprehensive vector representation that captures the broader meaning and themes of the text. Larger input text sizes, on the other hand, may introduce noise or dilute the significance of individual sentences or phrases, making finding precise matches when querying the index more difficult.\n",
+    "When a full paragraph or document is embedded, the embedding process considers both the overall context and the relationships between the sentences and phrases within the text. This can result in a more comprehensive vector representation that captures the broader meaning and themes of the text.\n",
    "```\n",
    " \n",
-    "As mentioned, chunking usually uses delimiters or length to keep text with a common context together.\n",
+    "As mentioned, chunking often aims to keep text with common context together.\n",
    "\n",
-    "But, in some cases we might want to honor the structure of the document itself.\n",
+    "With this in mind, we might want to specifically honor the structure of the document itself.\n",
    "\n",
-    "For example, a markdown file is organized by headers and isolating chunks within header groups is an intuitive idea.\n",
+    "For example, a markdown file is organized by headers.\n",
    "\n",
-    "If we mix chunks across header groups, then we may degrade the retrieval quality.\n",
+    "Creating chunks within specific header groups is an intuitive idea.\n",
    "\n",
-    "To address this challenge, we can use `MarkdownHeaderTextSplitter` to split a markdown file by a specified set of headers. \n",
+    "To do this, we can use `MarkdownHeaderTextSplitter`.\n",
+    "\n",
+    "This will split a markdown file by a specified set of headers. \n",
    "\n",
    "For example, if we want to split this markdown:\n",
    "```\n",
@ -46,7 +50,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": 4,
   "id": "19c044f0",
   "metadata": {},
   "outputs": [],
@ -56,7 +60,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 5,
   "id": "2ae3649b",
   "metadata": {},
   "outputs": [
@ -90,14 +94,12 @@
   "id": "9bd8977a",
   "metadata": {},
   "source": [
-    "Within each markdown group we can then apply any splitter we want. \n",
-    "\n",
-    "Now, we can ensure that the splits are constrained to common header groups and we can keep the headers in the metadata!"
+    "Within each markdown group we can then apply any text splitter we want. "
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 6,
   "id": "480e0e3a",
   "metadata": {},
   "outputs": [],
@ -131,7 +133,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 7,
   "id": "3f5d775e",
   "metadata": {},
   "outputs": [
@ -141,7 +143,7 @@
       "'Markdown[9'"
      ]
     },
-     "execution_count": 6,
+     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -152,7 +154,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 8,
   "id": "33ab0d5c",
   "metadata": {},
   "outputs": [
@ -162,7 +164,7 @@
       "{'Header 1': 'Intro', 'Header 2': 'History'}"
      ]
     },
-     "execution_count": 7,
+     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -170,6 +172,315 @@
   "source": [
    "all_metadatas[0]"
   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dcf70760",
+   "metadata": {},
+   "source": [
+    "### Use case\n",
+    "\n",
+    "Let's apply `MarkdownHeaderTextSplitter` to a Notion page [here](https://rlancemartin.notion.site/Auto-Evaluation-of-Metadata-Filtering-18502448c85240828f33716740f9574b?pvs=4) as a test.\n",
+    "\n",
+    "The page is downloaded as markdown and stored locally as shown [here](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/notion)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "73313d6c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load Notion database as a markdown file\n",
+    "from langchain.document_loaders import NotionDirectoryLoader\n",
+    "loader = NotionDirectoryLoader(\"../Notion_DB_Metadata\")\n",
+    "docs = loader.load()\n",
+    "md_file=docs[0].page_content"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "6fa341d7",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'content': 'We previously introduced [auto-evaluator](https://blog.langchain.dev/auto-evaluator-opportunities/), an open-source tool for grading LLM question-answer chains. Here, we extend auto-evaluator with a [lightweight Streamlit app](https://github.com/langchain-ai/auto-evaluator/tree/main/streamlit) that can connect to any existing Pinecone index. We add the ability to test metadata filtering using `SelfQueryRetriever` as well as some other approaches that we’ve found to be useful, as discussed below.  \\n[ret_trim.mov](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/ret_trim.mov)',\n",
+       " 'metadata': {'Section': 'Evaluation'}}"
+      ]
+     },
+     "execution_count": 12,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Let's create groups based on the section headers\n",
+    "headers_to_split_on = [\n",
+    "    (\"###\", \"Section\"),\n",
+    "]\n",
+    "markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
+    "md_header_splits = markdown_splitter.split_text(md_file)\n",
+    "md_header_splits[3]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "42d8bb9b",
+   "metadata": {},
+   "source": [
+    "Now, we split the text in each group and keep the group as metadata."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 42,
+   "id": "a9831de2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Define our text splitter\n",
+    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+    "chunk_size = 500\n",
+    "chunk_overlap = 50\n",
+    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)\n",
+    " \n",
+    "# Create splits within each header group\n",
+    "all_splits=[]\n",
+    "all_metadatas=[]\n",
+    "for header_group in md_header_splits:\n",
+    "    _splits = text_splitter.split_text(header_group['content'])\n",
+    "    _metadatas = [header_group['metadata'] for _ in _splits]\n",
+    "    all_splits += _splits\n",
+    "    all_metadatas += _metadatas"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 43,
+   "id": "b5691ee5",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'In these cases, semantic search will look for the concept `episode 53` in the chunks, but instead we simply want to filter the chunks for `episode 53` and then perform semantic search to extract those that best summarize the episode. Metadata filtering does this, so long as we 1) we have a metadata filter for episode number and 2) we can extract the value from the query (e.g., `54` or `252`) that we want to extract. The LangChain `SelfQueryRetriever` does the latter (see'"
+      ]
+     },
+     "execution_count": 43,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "all_splits[6]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 44,
+   "id": "e1dfb405",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'Section': 'Motivation'}"
+      ]
+     },
+     "execution_count": 44,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "all_metadatas[6]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "79868606",
+   "metadata": {},
+   "source": [
+    "This sets us up well to perform metadata filtering based on the document structure.\n",
+    "\n",
+    "Let's bring this all together by building a vectorstore first."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "143d7347",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! pip install chromadb"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 45,
+   "id": "cbcb917a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Build vectorstore\n",
+    "from langchain.vectorstores import Chroma\n",
+    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
+    "embeddings = OpenAIEmbeddings()\n",
+    "vectorstore = Chroma.from_texts(texts=all_splits, metadatas=all_metadatas, embedding=embeddings)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3f6031fc",
+   "metadata": {},
+   "source": [
+    "Let's create a `SelfQueryRetriever` that can filter based on the metadata we defined."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 47,
+   "id": "5b1b6a75",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create retriever \n",
+    "from langchain.llms import OpenAI\n",
+    "from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
+    "from langchain.chains.query_constructor.base import AttributeInfo\n",
+    "\n",
+    "# Define our metadata\n",
+    "metadata_field_info = [\n",
+    "    AttributeInfo(\n",
+    "        name=\"Section\",\n",
+    "        description=\"Headers of the markdown document that organize the ideas\",\n",
+    "        type=\"string or list[string]\",\n",
+    "    ),\n",
+    "]\n",
+    "document_content_description = \"Headers of the markdown document\"\n",
+    "\n",
+    "# Define self query retriever\n",
+    "llm = OpenAI(temperature=0)\n",
+    "sq_retriever = SelfQueryRetriever.from_llm(llm, vectorstore, document_content_description, metadata_field_info, verbose=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9d0dbed8",
+   "metadata": {},
+   "source": [
+    "Now we can fetch chunks specifically from any section of the doc!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 48,
+   "id": "6c37fe1b",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "query='Introduction' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Introduction') limit=None\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "[Document(page_content='![Untitled](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/Untitled.png)', metadata={'Section': 'Introduction'}),\n",
+       " Document(page_content='Q+A systems often use a two-step approach: retrieve relevant text chunks and then synthesize them into an answer. There many ways to approach this. For example, we recently [discussed](https://blog.langchain.dev/auto-evaluation-of-anthropic-100k-context-window/) the Retriever-Less option (at bottom in the below diagram), highlighting the Anthropic 100k context window model. Metadata filtering is an alternative approach that pre-filters chunks based on a user-defined criteria in a VectorDB using', metadata={'Section': 'Introduction'}),\n",
+       " Document(page_content='on a user-defined criteria in a VectorDB using metadata tags prior to semantic search.', metadata={'Section': 'Introduction'})]"
+      ]
+     },
+     "execution_count": 48,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Test\n",
+    "question=\"Summarize the Introduction section of the document\"\n",
+    "sq_retriever.get_relevant_documents(question)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bb0efebd",
+   "metadata": {},
+   "source": [
+    "Now, we can create chat or Q+A apps that are aware of the explicit document structure. \n",
+    "\n",
+    "Of course, semantic search without specific metadata filtering would probably work reasonably well for this simple document.\n",
+    "\n",
+    "But, the ability to retain document structure for metadata filtering can be helpful for more complicated or longer documents."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 49,
+   "id": "3b40e24e",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "query='Introduction' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Introduction') limit=None\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'The document discusses different approaches to retrieve relevant text chunks and synthesize them into an answer in Q+A systems. One of the approaches is metadata filtering, which pre-filters chunks based on user-defined criteria in a VectorDB using metadata tags prior to semantic search. The Retriever-Less option, which uses the Anthropic 100k context window model, is also mentioned as an alternative approach.'"
+      ]
+     },
+     "execution_count": 49,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from langchain.chains import RetrievalQA\n",
+    "from langchain.chat_models import ChatOpenAI\n",
+    "llm = ChatOpenAI(model_name=\"gpt-3.5-turbo\", temperature=0)\n",
+    "qa_chain = RetrievalQA.from_chain_type(llm,retriever=sq_retriever)\n",
+    "qa_chain.run(question)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 50,
+   "id": "dfeeb327",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "query='Testing' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Testing') limit=None\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'The Testing section of the document describes how the performance of the SelfQueryRetriever was evaluated using various test cases. The tests were designed to evaluate the ability of the SelfQueryRetriever to correctly infer metadata filters from the query using metadata_field_info. The results of the tests showed that the SelfQueryRetriever performed well in some cases, but failed in others. The document also provides a link to the code for the auto-evaluator and instructions on how to use it. Additionally, the document mentions the use of the Kor library for structured data extraction to explicitly specify transformations that the auto-evaluator can use.'"
+      ]
+     },
+     "execution_count": 50,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "question=\"Summarize the Testing section of the document\"\n",
+    "qa_chain.run(question)"
+   ]
  }
 ],
 "metadata": {