Update MD header text splitter notebook (#6339)

Highlight use case for maintaining header groups when splitting.
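
The idea of keeping header groups intact can be sketched in plain Python. This is a minimal illustration, not the `MarkdownHeaderTextSplitter` implementation: `split_by_headers` and `chunk` are hypothetical helpers written for this note, and the fixed-width chunker is a stand-in for whatever splitter is applied inside each group.

```python
def split_by_headers(text, headers_to_split_on):
    """Group lines under their enclosing headers.

    Hypothetical helper for illustration only; it is not LangChain's code.
    """
    # Check longer markers first so "##" is never mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    groups = []         # each item: {"content": str, "metadata": dict}
    current_meta = {}   # header name -> header title for the active headers
    current_lines = []  # content lines collected for the current group

    def flush():
        # Close the current group, if it has any content.
        if current_lines:
            groups.append(
                {"content": "\n".join(current_lines), "metadata": dict(current_meta)}
            )
            current_lines.clear()

    for raw_line in text.split("\n"):
        line = raw_line.strip()
        if not line:
            continue
        for marker, name in markers:
            if line.startswith(marker + " "):
                flush()
                # A new header replaces headers at the same or a deeper level.
                for other_marker, other_name in headers_to_split_on:
                    if len(other_marker) >= len(marker):
                        current_meta.pop(other_name, None)
                current_meta[name] = line[len(marker):].strip()
                break
        else:
            current_lines.append(line)
    flush()
    return groups


def chunk(text, size):
    """Fixed-width chunking, a stand-in for any splitter used within a group."""
    return [text[i:i + size] for i in range(0, len(text), size)]


md = "# Foo\n\n ## Bar\n\nHi this is Jim  \nHi this is Joe\n\n ## Baz\n\n Hi this is Molly"
groups = split_by_headers(md, [("#", "Header 1"), ("##", "Header 2")])

# Chunk inside each header group so no chunk spans two groups.
chunks = [
    {"content": piece, "metadata": group["metadata"]}
    for group in groups
    for piece in chunk(group["content"], 10)
]
```

Because chunking happens per group, every chunk inherits exactly one set of header metadata, which is the property the notebook update highlights.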
11 months ago · 2c97fbabbd
parent a2bbe3dda4
commit 2c97fbabbd
1 changed file with 90 additions and 41 deletions
--- a/docs/extras/modules/data_connection/document_transformers/text_splitters/markdown_header_metadata.ipynb
+++ b/docs/extras/modules/data_connection/document_transformers/text_splitters/markdown_header_metadata.ipynb
@@ -7,30 +7,46 @@
   "source": [
    "# MarkdownHeaderTextSplitter\n",
    "\n",
-    "This splits a markdown file by a specified set of headers. For example, if we want to split this markdown:\n",
+    "Many chat or Q+A applications involve chunking input documents prior to embedding and vector storage.\n",
+    "\n",
+    "[These notes](https://www.pinecone.io/learn/chunking-strategies/) from Pinecone provide some useful tips:\n",
+    "\n",
    "```\n",
-    "md = '# Foo\\n\\n ## Bar\\n\\nHi this is Jim  \\nHi this is Joe\\n\\n ## Baz\\n\\n Hi this is Molly' \n",
+    "When a full paragraph or document is embedded, the embedding process considers both the overall context and the relationships between the sentences and phrases within the text. This can result in a more comprehensive vector representation that captures the broader meaning and themes of the text. Larger input text sizes, on the other hand, may introduce noise or dilute the significance of individual sentences or phrases, making finding precise matches when querying the index more difficult.\n",
    "```\n",
+    " \n",
+    "Chunking usually relies on delimiters or length to keep text with a common context together.\n",
+    "\n",
+    "But in some cases, we might want to honor the structure of the document itself.\n",
+    "\n",
+    "For example, a markdown file is organized by headers, so creating chunks within specific header groups is an intuitive idea.\n",
+    "\n",
+    "If chunks mix text from different header groups, retrieval quality may degrade.\n",
+    " \n",
+    "To address this challenge, we can use `MarkdownHeaderTextSplitter` to split a markdown file by a specified set of headers. \n",
    "\n",
-    "Headers to split on:\n",
+    "For example, if we want to split this markdown:\n",
+    "```\n",
+    "md = '# Foo\\n\\n ## Bar\\n\\nHi this is Jim  \\nHi this is Joe\\n\\n ## Baz\\n\\n Hi this is Molly' \n",
+    "```\n",
+    " \n",
+    "We can specify the headers to split on:\n",
    "```\n",
    "[(\"#\", \"Header 1\"),(\"##\", \"Header 2\")]\n",
    "```\n",
    "\n",
-    "Expected output:\n",
+    "The content is then grouped by common headers:\n",
    "```\n",
    "{'content': 'Hi this is Jim  \\nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n",
    "{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n",
    "```\n",
    "\n",
-    "Optionally, this also includes `return_each_line` in case a user want to perform other types of aggregation. \n",
-    "\n",
-    "If `return_each_line=True`, each line and associated header metadata are simply returned. "
+    "Let's have a look at some examples below."
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 1,
   "id": "19c044f0",
   "metadata": {},
   "outputs": [],
@@ -40,7 +56,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 2,
   "id": "2ae3649b",
   "metadata": {},
   "outputs": [
@@ -64,63 +80,96 @@
    "]\n",
    "\n",
    "markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
-    "splits = markdown_splitter.split_text(markdown_document)\n",
-    "for split in splits:\n",
+    "md_header_splits = markdown_splitter.split_text(markdown_document)\n",
+    "for split in md_header_splits:\n",
    "    print(split)"
   ]
  },
  {
   "cell_type": "markdown",
-   "id": "2a32026a",
+   "id": "9bd8977a",
   "metadata": {},
   "source": [
-    "Here's an example on a larger file with `return_each_line=True` passed, allowing each line to be examined."
+    "Within each markdown group we can then apply any splitter we want. \n",
+    "\n",
+    "This ensures that splits stay within a common header group and that the headers are kept in the metadata."
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 10,
-   "id": "8af8f9a2",
+   "execution_count": 5,
+   "id": "480e0e3a",
   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "{'content': 'Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]', 'metadata': {'Header 1': 'Intro', 'Header 2': 'History'}}\n",
-      "{'content': 'Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.', 'metadata': {'Header 1': 'Intro', 'Header 2': 'History'}}\n",
-      "{'content': 'As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for', 'metadata': {'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}}\n",
-      "{'content': 'additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.', 'metadata': {'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}}\n",
-      "{'content': 'From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.', 'metadata': {'Header 1': 'Intro', 'Header 2': 'Rise and divergence', 'Header 4': 'Standardization'}}\n",
-      "{'content': 'Implementations of Markdown are available for over a dozen programming languages.', 'metadata': {'Header 1': 'Intro', 'Header 2': 'Implementations'}}\n"
-     ]
-    }
-   ],
+   "outputs": [],
   "source": [
    "markdown_document = \"# Intro \\n\\n    ## History \\n\\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \\n\\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \\n\\n ## Rise and divergence \\n\\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \\n\\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \\n\\n #### Standardization \\n\\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \\n\\n ## Implementations \\n\\n Implementations of Markdown are available for over a dozen programming languages.\"\n",
    "\n",
    "headers_to_split_on = [\n",
    "    (\"#\", \"Header 1\"),\n",
    "    (\"##\", \"Header 2\"),\n",
-    "    (\"###\", \"Header 3\"),\n",
-    "    (\"####\", \"Header 4\"),\n",
    "]\n",
    "\n",
-    "markdown_splitter = MarkdownHeaderTextSplitter(\n",
-    "    headers_to_split_on=headers_to_split_on, return_each_line=True\n",
-    ")\n",
-    "splits = markdown_splitter.split_text(markdown_document)\n",
-    "for line in splits:\n",
-    "    print(line)"
+    "# MD splits\n",
+    "markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
+    "md_header_splits = markdown_splitter.split_text(markdown_document)\n",
+    "\n",
+    "# Char-level splits\n",
+    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+    "chunk_size = 10\n",
+    "chunk_overlap = 0\n",
+    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)\n",
+    "\n",
+    "# Split within each header group\n",
+    "all_splits = []\n",
+    "all_metadatas = []\n",
+    "for header_group in md_header_splits:\n",
+    "    _splits = text_splitter.split_text(header_group['content'])\n",
+    "    _metadatas = [header_group['metadata'] for _ in _splits]\n",
+    "    all_splits += _splits\n",
+    "    all_metadatas += _metadatas"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "id": "987183f2",
+   "execution_count": 6,
+   "id": "3f5d775e",
   "metadata": {},
-   "outputs": [],
-   "source": []
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'Markdown[9'"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "all_splits[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "33ab0d5c",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'Header 1': 'Intro', 'Header 2': 'History'}"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "all_metadatas[0]"
+   ]
  }
 ],
 "metadata": {