langchain/docs/modules/indexes/examples/textsplitter.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b118c9dc",
   "metadata": {},
   "source": [
    "# Text Splitter\n",
    "\n",
    "When you want to deal with long pieces of text, it is necessary to split up that text into chunks.\n",
    "As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What \"semantically related\" means could depend on the type of text.\n",
    "This notebook showcases several ways to do that.\n",
    "\n",
    "At a high level, text splitters work as following:\n",
    "\n",
    "1. Split the text up into small, semantically meaningful chunks (often sentences).\n",
    "2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).\n",
    "3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).\n",
    "\n",
    "That means there two different axes along which you can customize your text splitter:\n",
    "\n",
    "1. How the text is split\n",
    "2. How the chunk size is measured\n",
    "\n",
    "For all the examples below, we will highlight both of these attributes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "e82c4685",
   "metadata": {},
   "outputs": [],
   "source": [
    "# This is a long document we can split up.\n",
    "with open('../../state_of_the_union.txt') as f:\n",
    "    state_of_the_union = f.read()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "072eee66",
   "metadata": {},
   "source": [
    "## Generic Recursive Text Splitting\n",
    "This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is `[\"\\n\\n\", \"\\n\", \" \", \"\"]`. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.\n",
    "\n",
    "\n",
    "1. How the text is split: by list of characters\n",
    "2. How the chunk size is measured: by length function passed in (defaults to number of characters)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "14662639",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import RecursiveCharacterTextSplitter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "fc6e42c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    # Set a really small chunk size, just to show.\n",
    "    chunk_size = 100,\n",
    "    chunk_overlap  = 20,\n",
    "    length_function = len,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "bd1a0a15",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet.' lookup_str='' metadata={} lookup_index=0\n",
      "page_content='and the Cabinet. Justices of the Supreme Court. My fellow Americans.' lookup_str='' metadata={} lookup_index=0\n"
     ]
    }
   ],
   "source": [
    "texts = text_splitter.create_documents([state_of_the_union])\n",
    "print(texts[0])\n",
    "print(texts[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80f6cd99",
   "metadata": {},
   "source": [
    "## Markdown Text Splitter\n",
    "\n",
    "MarkdownTextSplitter splits text along Markdown headings, code blocks, or horizontal rules. It's implemented as a simple subclass of RecursiveCharacterSplitter with Markdown-specific separators. See the source code to see the Markdown syntax expected by default.\n",
    "\n",
    "1. How the text is split: by list of markdown specific characters\n",
    "2. How the chunk size is measured: by length function passed in (defaults to number of characters)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "96d64839",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import MarkdownTextSplitter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "cfb0da17",
   "metadata": {},
   "outputs": [],
   "source": [
    "markdown_text = \"\"\"\n",
    "# 🦜️🔗 LangChain\n",
    "\n",
    "⚡ Building applications with LLMs through composability ⚡\n",
    "\n",
    "## Quick Install\n",
    "\n",
    "```bash\n",
    "# Hopefully this code block isn't split\n",
    "pip install langchain\n",
    "```\n",
    "\n",
    "As an open source project in a rapidly developing field, we are extremely open to contributions.\n",
    "\"\"\"\n",
    "markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "d59a4fe8",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = markdown_splitter.create_documents([markdown_text])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "cbb2e100",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='# 🦜️🔗 LangChain\\n\\n⚡ Building applications with LLMs through composability ⚡', lookup_str='', metadata={}, lookup_index=0),\n",
       " Document(page_content=\"Quick Install\\n\\n```bash\\n# Hopefully this code block isn't split\\npip install langchain\", lookup_str='', metadata={}, lookup_index=0),\n",
       " Document(page_content='As an open source project in a rapidly developing field, we are extremely open to contributions.', lookup_str='', metadata={}, lookup_index=0)]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c350765d",
   "metadata": {},
   "source": [
    "## Python Code Text Splitter\n",
    "\n",
    "PythonCodeTextSplitter splits text along python class and method definitions. It's implemented as a simple subclass of RecursiveCharacterSplitter with Python-specific separators. See the source code to see the Python syntax expected by default.\n",
    "\n",
    "1. How the text is split: by list of python specific characters\n",
    "2. How the chunk size is measured: by length function passed in (defaults to number of characters)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "1703463f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import PythonCodeTextSplitter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "f17a1854",
   "metadata": {},
   "outputs": [],
   "source": [
    "python_text = \"\"\"\n",
    "class Foo:\n",
    "\n",
    "    def bar():\n",
    "    \n",
    "    \n",
    "def foo():\n",
    "\n",
    "def testing_func():\n",
    "\n",
    "def bar():\n",
    "\"\"\"\n",
    "python_splitter = PythonCodeTextSplitter(chunk_size=30, chunk_overlap=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "6cdc55f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = python_splitter.create_documents([python_text])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "8cc33770",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='Foo:\\n\\n    def bar():', lookup_str='', metadata={}, lookup_index=0),\n",
       " Document(page_content='foo():\\n\\ndef testing_func():', lookup_str='', metadata={}, lookup_index=0),\n",
       " Document(page_content='bar():', lookup_str='', metadata={}, lookup_index=0)]"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c461b26",
   "metadata": {},
   "source": [
    "## Character Text Splitting\n",
    "\n",
    "This is a more simple method. This splits based on characters (by default \"\\n\\n\") and measure chunk length by number of characters.\n",
    "\n",
    "1. How the text is split: by single character\n",
    "2. How the chunk size is measured: by length function passed in (defaults to number of characters)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "79ff6737",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "text_splitter = CharacterTextSplitter(        \n",
    "    separator = \"\\n\\n\",\n",
    "    chunk_size = 1000,\n",
    "    chunk_overlap  = 200,\n",
    "    length_function = len,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "38547666",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \\n\\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' lookup_str='' metadata={} lookup_index=0\n"
     ]
    }
   ],
   "source": [
    "texts = text_splitter.create_documents([state_of_the_union])\n",
    "print(texts[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2cede1b1",
   "metadata": {},
   "source": [
    "Here's an example of passing metadata along with the documents, notice that it is split along with the documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "4a47515a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \\n\\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' lookup_str='' metadata={'document': 1} lookup_index=0\n"
     ]
    }
   ],
   "source": [
    "metadatas = [{\"document\": 1}, {\"document\": 2}]\n",
    "documents = text_splitter.create_documents([state_of_the_union, state_of_the_union], metadatas=metadatas)\n",
    "print(documents[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13dc0983",
   "metadata": {},
   "source": [
    "## HuggingFace Length Function\n",
    "Most LLMs are constrained by the number of tokens that you can pass in, which is not the same as the number of characters. In order to get a more accurate estimate, we can use HuggingFace tokenizers to count the text length.\n",
    "\n",
    "1. How the text is split: by character passed in\n",
    "2. How the chunk size is measured: by Hugging Face tokenizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "a8ce51d5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "365a203647c94effb38c2058a6c88577",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "230ce4d026714d508e3388bdcbfc58e7",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2f6fbb29210547f584a74eebcd01d442",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "0933ae25626e433ea0dc7595e68de0ed",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from transformers import GPT2TokenizerFast\n",
    "\n",
    "tokenizer = GPT2TokenizerFast.from_pretrained(\"gpt2\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "ca5e72c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=100, chunk_overlap=0)\n",
    "texts = text_splitter.split_text(state_of_the_union)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "37cdfbeb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n",
      "\n",
      "Last year COVID-19 kept us apart. This year we are finally together again. \n",
      "\n",
      "Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n",
      "\n",
      "With a duty to one another to the American people to the Constitution.\n"
     ]
    }
   ],
   "source": [
    "print(texts[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7683b36a",
   "metadata": {},
   "source": [
    "## tiktoken (OpenAI) Length Function\n",
    "You can also use tiktoken, a open source tokenizer package from OpenAI to estimate tokens used. Will probably be more accurate for their models.\n",
    "\n",
    "1. How the text is split: by character passed in\n",
    "2. How the chunk size is measured: by `tiktoken` tokenizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "825f7c0a",
   "metadata": {},
   "outputs": [],
   "source": [
    "text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=100, chunk_overlap=0)\n",
    "texts = text_splitter.split_text(state_of_the_union)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "ae35d165",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n",
      "\n",
      "Last year COVID-19 kept us apart. This year we are finally together again. \n",
      "\n",
      "Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n",
      "\n",
      "With a duty to one another to the American people to the Constitution.\n"
     ]
    }
   ],
   "source": [
    "print(texts[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea2973ac",
   "metadata": {},
   "source": [
    "## NLTK Text Splitter\n",
    "Rather than just splitting on \"\\n\\n\", we can use NLTK to split based on tokenizers.\n",
    "\n",
    "1. How the text is split: by NLTK\n",
    "2. How the chunk size is measured: by length function passed in (defaults to number of characters)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "20fa9c23",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import NLTKTextSplitter\n",
    "text_splitter = NLTKTextSplitter(chunk_size=1000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "5ea10835",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.\n",
      "\n",
      "Members of Congress and the Cabinet.\n",
      "\n",
      "Justices of the Supreme Court.\n",
      "\n",
      "My fellow Americans.\n",
      "\n",
      "Last year COVID-19 kept us apart.\n",
      "\n",
      "This year we are finally together again.\n",
      "\n",
      "Tonight, we meet as Democrats Republicans and Independents.\n",
      "\n",
      "But most importantly as Americans.\n",
      "\n",
      "With a duty to one another to the American people to the Constitution.\n",
      "\n",
      "And with an unwavering resolve that freedom will always triumph over tyranny.\n",
      "\n",
      "Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.\n",
      "\n",
      "But he badly miscalculated.\n",
      "\n",
      "He thought he could roll into Ukraine and the world would roll over.\n",
      "\n",
      "Instead he met a wall of strength he never imagined.\n",
      "\n",
      "He met the Ukrainian people.\n",
      "\n",
      "From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.\n",
      "\n",
      "Groups of citizens blocking tanks with their bodies.\n"
     ]
    }
   ],
   "source": [
    "texts = text_splitter.split_text(state_of_the_union)\n",
    "print(texts[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dab86b60",
   "metadata": {},
   "source": [
    "## Spacy Text Splitter\n",
    "Another alternative to NLTK is to use Spacy.\n",
    "\n",
    "1. How the text is split: by Spacy\n",
    "2. How the chunk size is measured: by length function passed in (defaults to number of characters)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f4ec9b90",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import SpacyTextSplitter\n",
    "text_splitter = SpacyTextSplitter(chunk_size=1000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "cef2b29e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.\n",
      "\n",
      "Members of Congress and the Cabinet.\n",
      "\n",
      "Justices of the Supreme Court.\n",
      "\n",
      "My fellow Americans.  \n",
      "\n",
      "\n",
      "\n",
      "Last year COVID-19 kept us apart.\n",
      "\n",
      "This year we are finally together again.\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "Tonight, we meet as Democrats Republicans and Independents.\n",
      "\n",
      "But most importantly as Americans.\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "With a duty to one another to the American people to the Constitution. \n",
      "\n",
      "\n",
      "\n",
      "And with an unwavering resolve that freedom will always triumph over tyranny.\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.\n",
      "\n",
      "But he badly miscalculated.\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "He thought he could roll into Ukraine and the world would roll over.\n",
      "\n",
      "Instead he met a wall of strength he never imagined.\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "He met the Ukrainian people.\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "Groups of citizens blocking tanks with their bodies.\n"
     ]
    }
   ],
   "source": [
    "texts = text_splitter.split_text(state_of_the_union)\n",
    "print(texts[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53049ff5",
   "metadata": {},
   "source": [
    "## Token Text Splitter\n",
    "\n",
    "1. How the text is split: by `tiktoken` tokens\n",
    "2. How the chunk size is measured: by `tiktoken` tokens"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "a1a118b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import TokenTextSplitter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "ef37c5d3",
   "metadata": {},
   "outputs": [],
   "source": [
    "text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "5750228a",
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Madam Speaker, Madam Vice President, our\n"
     ]
    }
   ],
   "source": [
    "texts = text_splitter.split_text(state_of_the_union)\n",
    "print(texts[0])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  },
  "vscode": {
   "interpreter": {
    "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}