{
"cells": [
{
"cell_type": "markdown",
"id": "b118c9dc",
"metadata": {},
"source": [
"# Text Splitter\n",
"\n",
"When you want to deal with long pieces of text, it is necessary to split up that text into chunks.\n",
"This notebook showcases several ways to do that.\n",
"\n",
"At a high level, text splitters work as following:\n",
"\n",
"1. Split the text up into small, semantically meaningful chunks (often sentences).\n",
"2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).\n",
"3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks)."
]
},
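{
"cell_type": "markdown",
"id": "3c2f1a9d",
"metadata": {},
"source": [
"To make steps 2 and 3 concrete, here is a minimal, framework-free sketch of the combine-with-overlap idea. It is purely illustrative: the real text splitter classes below handle separators, length functions and edge cases properly, and the helper name `merge_with_overlap` is made up for this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e8d7c6b",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: a simplified sketch of the combine-then-overlap logic described above\n",
"# (separator lengths are ignored for simplicity).\n",
"def merge_with_overlap(pieces, chunk_size=100, chunk_overlap=20, length_function=len):\n",
"    chunks, current, total = [], [], 0\n",
"    for piece in pieces:\n",
"        piece_len = length_function(piece)\n",
"        # 2. If adding this piece would push the chunk over the size limit, emit the chunk...\n",
"        if current and total + piece_len > chunk_size:\n",
"            chunks.append(\" \".join(current))\n",
"            # 3. ...then keep only a small tail of it as overlap for the next chunk.\n",
"            while current and total > chunk_overlap:\n",
"                total -= length_function(current[0])\n",
"                current.pop(0)\n",
"        current.append(piece)\n",
"        total += piece_len\n",
"    if current:\n",
"        chunks.append(\" \".join(current))\n",
"    return chunks\n",
"\n",
"# 1. Split into small, semantically meaningful pieces (here, naively on sentences).\n",
"sentences = \"One. Two. Three. Four. Five.\".split(\". \")\n",
"merge_with_overlap(sentences, chunk_size=12, chunk_overlap=5)"
]
},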
{
"cell_type": "code",
"execution_count": 1,
"id": "e82c4685",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import CharacterTextSplitter, NLTKTextSplitter, SpacyTextSplitter\n",
"# This is a long document we can split up.\n",
"with open('../../state_of_the_union.txt') as f:\n",
" state_of_the_union = f.read()"
]
},
{
"cell_type": "markdown",
"id": "5c461b26",
"metadata": {},
"source": [
"## Character Text Splitting\n",
"\n",
"Let's start with the most simple method: let's split based on characters (by default \"\\n\\n\") and measure chunk length by number of characters."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "79ff6737",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = CharacterTextSplitter( \n",
" separator = \"\\n\\n\",\n",
" chunk_size = 1000,\n",
" chunk_overlap = 200,\n",
" length_function = len,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "38547666",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n",
"\n",
"Last year COVID-19 kept us apart. This year we are finally together again. \n",
"\n",
"Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n",
"\n",
"With a duty to one another to the American people to the Constitution. \n",
"\n",
"And with an unwavering resolve that freedom will always triumph over tyranny. \n",
"\n",
"Six days ago, Russias Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n",
"\n",
"He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n",
"\n",
"He met the Ukrainian people. \n",
"\n",
"From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. \n"
]
}
],
"source": [
"texts = text_splitter.split_text(state_of_the_union)\n",
"print(texts[0])"
]
},
{
"cell_type": "markdown",
"id": "1be00b73",
"metadata": {},
"source": [
"## Recursive Character Text Splitting\n",
"Sometimes, it's not enough to split on just one character. This text splitter uses a whole list of characters and recursive splits them down until they are under the limit."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "1ac6376d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import RecursiveCharacterTextSplitter"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6787b13b",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = RecursiveCharacterTextSplitter(\n",
" # Set a really small chunk size, just to show.\n",
" chunk_size = 100,\n",
" chunk_overlap = 20,\n",
" length_function = len,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4f0e7d9b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet.\n",
"and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n"
]
}
],
"source": [
"texts = text_splitter.split_text(state_of_the_union)\n",
"print(texts[0])\n",
"print(texts[1])"
]
},
{
"cell_type": "markdown",
"id": "87a71115",
"metadata": {},
"source": [
"## Document creation\n",
"We can also use the text splitter to create \"Documents\" directly. Documents are a way of bundling pieces of text with associated metadata so that chains can interact with them. We can also create documents with empty metadata though!\n",
"\n",
"In the below example, we pass two pieces of text to get split up (we pass two just to show off the interface of splitting multiple pieces of text)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4cd16222",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \\n\\nSix days ago, Russias Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. ' lookup_str='' metadata={} lookup_index=0\n"
]
}
],
"source": [
"documents = text_splitter.create_documents([state_of_the_union, state_of_the_union])\n",
"print(documents[0])"
]
},
{
"cell_type": "markdown",
"id": "2cede1b1",
"metadata": {},
"source": [
"Here's an example of passing metadata along with the documents, notice that it is split along with the documents."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "4a47515a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \\n\\nSix days ago, Russias Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. ' lookup_str='' metadata={'document': 1} lookup_index=0\n"
]
}
],
"source": [
"metadatas = [{\"document\": 1}, {\"document\": 2}]\n",
"documents = text_splitter.create_documents([state_of_the_union, state_of_the_union], metadatas=metadatas)\n",
"print(documents[0])"
]
},
{
"cell_type": "markdown",
"id": "13dc0983",
"metadata": {},
"source": [
"## HuggingFace Length Function\n",
"Most LLMs are constrained by the number of tokens that you can pass in, which is not the same as the number of characters. In order to get a more accurate estimate, we can use HuggingFace tokenizers to count the text length."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "a8ce51d5",
"metadata": {},
"outputs": [],
"source": [
"from transformers import GPT2TokenizerFast\n",
"\n",
"tokenizer = GPT2TokenizerFast.from_pretrained(\"gpt2\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "ca5e72c0",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=100, chunk_overlap=0)\n",
"texts = text_splitter.split_text(state_of_the_union)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "37cdfbeb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n",
"\n",
"Last year COVID-19 kept us apart. This year we are finally together again. \n",
"\n",
"Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n",
"\n",
"With a duty to one another to the American people to the Constitution. \n"
]
}
],
"source": [
"print(texts[0])"
]
},
{
"cell_type": "markdown",
"id": "7683b36a",
"metadata": {},
"source": [
"## tiktoken (OpenAI) Length Function\n",
"You can also use tiktoken, a open source tokenizer package from OpenAI to estimate tokens used. Will probably be more accurate for their models."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "825f7c0a",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=100, chunk_overlap=0)\n",
"texts = text_splitter.split_text(state_of_the_union)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "ae35d165",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n",
"\n",
"Last year COVID-19 kept us apart. This year we are finally together again. \n",
"\n",
"Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n",
"\n",
"With a duty to one another to the American people to the Constitution. \n"
]
}
],
"source": [
"print(texts[0])"
]
},
{
"cell_type": "markdown",
"id": "ea2973ac",
"metadata": {},
"source": [
"## NLTK Text Splitter\n",
"Rather than just splitting on \"\\n\\n\", we can use NLTK to split based on tokenizers."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "20fa9c23",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = NLTKTextSplitter(chunk_size=1000)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "5ea10835",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.\n",
"\n",
"Members of Congress and the Cabinet.\n",
"\n",
"Justices of the Supreme Court.\n",
"\n",
"My fellow Americans.\n",
"\n",
"Last year COVID-19 kept us apart.\n",
"\n",
"This year we are finally together again.\n",
"\n",
"Tonight, we meet as Democrats Republicans and Independents.\n",
"\n",
"But most importantly as Americans.\n",
"\n",
"With a duty to one another to the American people to the Constitution.\n",
"\n",
"And with an unwavering resolve that freedom will always triumph over tyranny.\n",
"\n",
"Six days ago, Russias Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.\n",
"\n",
"But he badly miscalculated.\n",
"\n",
"He thought he could roll into Ukraine and the world would roll over.\n",
"\n",
"Instead he met a wall of strength he never imagined.\n",
"\n",
"He met the Ukrainian people.\n",
"\n",
"From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.\n",
"\n",
"Groups of citizens blocking tanks with their bodies.\n"
]
}
],
"source": [
"texts = text_splitter.split_text(state_of_the_union)\n",
"print(texts[0])"
]
},
{
"cell_type": "markdown",
"id": "dab86b60",
"metadata": {},
"source": [
"## Spacy Text Splitter\n",
"Another alternative to NLTK is to use Spacy."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f9cc9dfc",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = SpacyTextSplitter(chunk_size=1000)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "cef2b29e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.\n",
"\n",
"Members of Congress and the Cabinet.\n",
"\n",
"Justices of the Supreme Court.\n",
"\n",
"My fellow Americans. \n",
"\n",
"\n",
"\n",
"Last year COVID-19 kept us apart.\n",
"\n",
"This year we are finally together again.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"Tonight, we meet as Democrats Republicans and Independents.\n",
"\n",
"But most importantly as Americans.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"With a duty to one another to the American people to the Constitution. \n",
"\n",
"\n",
"\n",
"And with an unwavering resolve that freedom will always triumph over tyranny.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"Six days ago, Russias Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.\n",
"\n",
"But he badly miscalculated.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"He thought he could roll into Ukraine and the world would roll over.\n",
"\n",
"Instead he met a wall of strength he never imagined.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"He met the Ukrainian people.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"Groups of citizens blocking tanks with their bodies.\n"
]
}
],
"source": [
"texts = text_splitter.split_text(state_of_the_union)\n",
"print(texts[0])"
]
},
{
"cell_type": "markdown",
"id": "53049ff5",
"metadata": {},
"source": [
"## Token Text Splitter"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a1a118b1",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import TokenTextSplitter"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "ef37c5d3",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "5750228a",
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Madam Speaker, Madam Vice President, our\n"
]
}
],
"source": [
"texts = text_splitter.split_text(state_of_the_union)\n",
"print(texts[0])"
]
},
{
"cell_type": "markdown",
"id": "c24dbbb7",
"metadata": {},
"source": [
"# Markdown Text Splitter\n",
"\n",
"MarkdownTextSplitter splits text along Markdown headings, code blocks, or horizontal rules. It's implemented as a simple subclass of RecursiveCharacterSplitter with Markdown-specific separators. See the source code to see the Markdown syntax expected by default."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "593e490c",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import MarkdownTextSplitter"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "89a9a3ea",
"metadata": {},
"outputs": [],
"source": [
"markdown_text = \"\"\"\n",
"# 🦜️🔗 LangChain\n",
"\n",
"⚡ Building applications with LLMs through composability ⚡\n",
"\n",
"## Quick Install\n",
"\n",
"```bash\n",
"# Hopefully this code block isn't split\n",
"pip install langchain\n",
"```\n",
"\n",
"As an open source project in a rapidly developing field, we are extremely open to contributions.\n",
"\"\"\"\n",
"markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "241f0719",
"metadata": {},
"outputs": [],
"source": [
"docs = markdown_splitter.create_documents([markdown_text])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "7789e643",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='# 🦜️🔗 LangChain\\n\\n⚡ Building applications with LLMs through composability ⚡', lookup_str='', metadata={}, lookup_index=0),\n",
" Document(page_content=\"Quick Install\\n\\n```bash\\n# Hopefully this code block isn't split\\npip install langchain\", lookup_str='', metadata={}, lookup_index=0),\n",
" Document(page_content='As an open source project in a rapidly developing field, we are extremely open to contributions.', lookup_str='', metadata={}, lookup_index=0)]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs"
]
},
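{
"cell_type": "markdown",
"id": "5a4b3c2d",
"metadata": {},
"source": [
"Since MarkdownTextSplitter is just a RecursiveCharacterTextSplitter with Markdown-specific separators, you can approximate or customize its behavior by passing your own separator list. The separators in the sketch below are an illustrative guess rather than the exact defaults; check the source code for those."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f6e5d4c",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"\n",
"# A rough, hand-rolled approximation of Markdown-aware splitting.\n",
"# These separators are an illustrative guess; the defaults inside MarkdownTextSplitter may differ.\n",
"custom_md_splitter = RecursiveCharacterTextSplitter(\n",
"    separators=[\"\\n## \", \"\\n### \", \"\\n```\\n\", \"\\n\\n\", \"\\n\", \" \"],\n",
"    chunk_size=100,\n",
"    chunk_overlap=0,\n",
")\n",
"custom_md_splitter.create_documents([markdown_text])"
]
},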
{
"cell_type": "markdown",
"id": "04a6392a",
"metadata": {},
"source": [
"# Python Code Text Splitter\n",
"\n",
"PythonCodeTextSplitter splits text along python class and method definitions. It's implemented as a simple subclass of RecursiveCharacterSplitter with Python-specific separators. See the source code to see the Python syntax expected by default."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "8fb36bc7",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import PythonCodeTextSplitter"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "d359f3dc",
"metadata": {},
"outputs": [],
"source": [
"python_text = \"\"\"\n",
"class Foo:\n",
"\n",
" def bar():\n",
" \n",
" \n",
"def foo():\n",
"\n",
"def testing_func():\n",
"\n",
"def bar():\n",
"\"\"\"\n",
"python_splitter = PythonCodeTextSplitter(chunk_size=30, chunk_overlap=0)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "26b79cd9",
"metadata": {},
"outputs": [],
"source": [
"docs = python_splitter.create_documents([python_text])"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "b1749579",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Foo:\\n\\n def bar():', lookup_str='', metadata={}, lookup_index=0),\n",
" Document(page_content='foo():\\n\\ndef testing_func():', lookup_str='', metadata={}, lookup_index=0),\n",
" Document(page_content='bar():', lookup_str='', metadata={}, lookup_index=0)]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e6c8cc7",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
},
"vscode": {
"interpreter": {
"hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}