langchain/docs/extras/modules/data_connection/document_transformers/text_splitters/markdown_header_metadata.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "70e9b619",
   "metadata": {},
   "source": [
    "# MarkdownHeaderTextSplitter\n",
    "\n",
    "Many chat or Q+A applications involve chunking input documents prior to embedding and vector storage.\n",
    "\n",
    "[These notes](https://www.pinecone.io/learn/chunking-strategies/) from Pinecone provide some useful tips:\n",
    "\n",
    "```\n",
    "When a full paragraph or document is embedded, the embedding process considers both the overall context and the relationships between the sentences and phrases within the text. This can result in a more comprehensive vector representation that captures the broader meaning and themes of the text. Larger input text sizes, on the other hand, may introduce noise or dilute the significance of individual sentences or phrases, making finding precise matches when querying the index more difficult.\n",
    "```\n",
    " \n",
    "As mentioned, chunking usually uses delimiters or length to keep text with a common context together.\n",
    "\n",
    "But, in some cases we might want to honor the structure of the document itself.\n",
    "\n",
    "For example, a markdown file is organized by headers and isolating chunks within header groups is an intuitive idea.\n",
    "\n",
    "If we mix chunks across header groups, then we may degrade the retrieval quality.\n",
    " \n",
    "To address this challenge, we can use `MarkdownHeaderTextSplitter` to split a markdown file by a specified set of headers. \n",
    "\n",
    "For example, if we want to split this markdown:\n",
    "```\n",
    "md = '# Foo\\n\\n ## Bar\\n\\nHi this is Jim  \\nHi this is Joe\\n\\n ## Baz\\n\\n Hi this is Molly' \n",
    "```\n",
    " \n",
    "We can specify the headers to split on:\n",
    "```\n",
    "[(\"#\", \"Header 1\"),(\"##\", \"Header 2\")]\n",
    "```\n",
    "\n",
    "And content is grouped or split by common headers:\n",
    "```\n",
    "{'content': 'Hi this is Jim  \\nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n",
    "{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n",
    "```\n",
    "\n",
    "Let's have a look at some examples below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "19c044f0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import MarkdownHeaderTextSplitter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "2ae3649b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'content': 'Hi this is Jim  \\nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n",
      "{'content': 'Hi this is Lance', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}}\n",
      "{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n"
     ]
    }
   ],
   "source": [
    "markdown_document = \"# Foo\\n\\n    ## Bar\\n\\nHi this is Jim\\n\\nHi this is Joe\\n\\n ### Boo \\n\\n Hi this is Lance \\n\\n ## Baz\\n\\n Hi this is Molly\"\n",
    "\n",
    "headers_to_split_on = [\n",
    "    (\"#\", \"Header 1\"),\n",
    "    (\"##\", \"Header 2\"),\n",
    "    (\"###\", \"Header 3\"),\n",
    "]\n",
    "\n",
    "markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
    "md_header_splits = markdown_splitter.split_text(markdown_document)\n",
    "for split in md_header_splits:\n",
    "    print(split)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9bd8977a",
   "metadata": {},
   "source": [
    "Within each markdown group we can then apply any splitter we want. \n",
    "\n",
    "Now, we can ensure that the splits are constrained to common header groups and we can keep the headers in the metadata!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "480e0e3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "markdown_document = \"# Intro \\n\\n    ## History \\n\\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \\n\\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \\n\\n ## Rise and divergence \\n\\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \\n\\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \\n\\n #### Standardization \\n\\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \\n\\n ## Implementations \\n\\n Implementations of Markdown are available for over a dozen programming languages.\"\n",
    "\n",
    "headers_to_split_on = [\n",
    "    (\"#\", \"Header 1\"),\n",
    "    (\"##\", \"Header 2\"),\n",
    "]\n",
    "\n",
    "# MD splits\n",
    "markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
    "md_header_splits = markdown_splitter.split_text(markdown_document)\n",
    "\n",
    "# Char-level splits\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "chunk_size = 10\n",
    "chunk_overlap = 0\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)\n",
    "\n",
    "# Split within each header group\n",
    "all_splits=[]\n",
    "all_metadatas=[]    \n",
    "for header_group in md_header_splits:\n",
    "    _splits = text_splitter.split_text(header_group['content'])\n",
    "    _metadatas = [header_group['metadata'] for _ in _splits]\n",
    "    all_splits += _splits\n",
    "    all_metadatas += _metadatas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "3f5d775e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Markdown[9'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_splits[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "33ab0d5c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Header 1': 'Intro', 'Header 2': 'History'}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_metadatas[0]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}