From 91446a5e9ba9b3282d0d48f7e28307847fbbe07c Mon Sep 17 00:00:00 2001
From: Harrison Chase
Date: Mon, 20 Feb 2023 11:24:31 -0800
Subject: [PATCH] clean up text splitting docs (#1184)

---
 .../indexes/examples/textsplitter.ipynb | 531 ++++++++++--------
 1 file changed, 286 insertions(+), 245 deletions(-)

diff --git a/docs/modules/indexes/examples/textsplitter.ipynb b/docs/modules/indexes/examples/textsplitter.ipynb
index c6a86ffc..3ad6b585 100644
--- a/docs/modules/indexes/examples/textsplitter.ipynb
+++ b/docs/modules/indexes/examples/textsplitter.ipynb
@@ -8,23 +8,30 @@
 "# Text Splitter\n",
 "\n",
 "When you want to deal with long pieces of text, it is necessary to split up that text into chunks.\n",
+ "As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What \"semantically related\" means could depend on the type of text.\n",
 "This notebook showcases several ways to do that.\n",
 "\n",
 "At a high level, text splitters work as following:\n",
 "\n",
 "1. Split the text up into small, semantically meaningful chunks (often sentences).\n",
 "2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).\n",
- "3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks)."
+ "3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).\n",
+ "\n",
+ "That means there are two different axes along which you can customize your text splitter:\n",
+ "\n",
+ "1. How the text is split\n",
+ "2. How the chunk size is measured\n",
+ "\n",
+ "For all the examples below, we will highlight both of these attributes."
 ]
 },
 {
 "cell_type": "code",
- "execution_count": 1,
+ "execution_count": 7,
 "id": "e82c4685",
 "metadata": {},
 "outputs": [],
 "source": [
- "from langchain.text_splitter import CharacterTextSplitter, NLTKTextSplitter, SpacyTextSplitter\n",
 "# This is a long document we can split up.\n",
 "with open('../../state_of_the_union.txt') as f:\n",
 "    state_of_the_union = f.read()"
@@ -32,147 +39,267 @@
 ]
 },
 {
 "cell_type": "markdown",
- "id": "5c461b26",
+ "id": "1c8e504a",
 "metadata": {},
 "source": [
- "## Character Text Splitting\n",
+ "## Generic Recursive Text Splitting\n",
+ "This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is `[\"\\n\\n\", \"\\n\", \" \", \"\"]`. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.\n",
 "\n",
- "Let's start with the most simple method: let's split based on characters (by default \"\\n\\n\") and measure chunk length by number of characters."
+ "\n",
+ "1. How the text is split: by list of characters\n",
+ "2. 
How the chunk size is measured: by length function passed in (defaults to number of characters)" ] }, { "cell_type": "code", - "execution_count": 2, - "id": "79ff6737", + "execution_count": 8, + "id": "1fedab44", "metadata": {}, "outputs": [], "source": [ - "text_splitter = CharacterTextSplitter( \n", - " separator = \"\\n\\n\",\n", - " chunk_size = 1000,\n", - " chunk_overlap = 200,\n", + "from langchain.text_splitter import RecursiveCharacterTextSplitter" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "edd10895", + "metadata": {}, + "outputs": [], + "source": [ + "text_splitter = RecursiveCharacterTextSplitter(\n", + " # Set a really small chunk size, just to show.\n", + " chunk_size = 100,\n", + " chunk_overlap = 20,\n", " length_function = len,\n", ")" ] }, { "cell_type": "code", - "execution_count": 3, - "id": "38547666", + "execution_count": 12, + "id": "7c7badcd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n", - "\n", - "Last year COVID-19 kept us apart. This year we are finally together again. \n", - "\n", - "Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n", - "\n", - "With a duty to one another to the American people to the Constitution. \n", - "\n", - "And with an unwavering resolve that freedom will always triumph over tyranny. \n", - "\n", - "Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n", - "\n", - "He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n", - "\n", - "He met the Ukrainian people. \n", - "\n", - "From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. \n" + "page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet.' lookup_str='' metadata={} lookup_index=0\n", + "page_content='and the Cabinet. Justices of the Supreme Court. My fellow Americans.' lookup_str='' metadata={} lookup_index=0\n" ] } ], "source": [ - "texts = text_splitter.split_text(state_of_the_union)\n", - "print(texts[0])" + "texts = text_splitter.create_documents([state_of_the_union])\n", + "print(texts[0])\n", + "print(texts[1])" ] }, { "cell_type": "markdown", - "id": "1be00b73", + "id": "71e12a11", "metadata": {}, "source": [ - "## Recursive Character Text Splitting\n", - "Sometimes, it's not enough to split on just one character. This text splitter uses a whole list of characters and recursive splits them down until they are under the limit." + "## Markdown Text Splitter\n", + "\n", + "MarkdownTextSplitter splits text along Markdown headings, code blocks, or horizontal rules. It's implemented as a simple subclass of RecursiveCharacterSplitter with Markdown-specific separators. See the source code to see the Markdown syntax expected by default.\n", + "\n", + "1. How the text is split: by list of markdown specific characters\n", + "2. 
How the chunk size is measured: by length function passed in (defaults to number of characters)" ] }, { "cell_type": "code", - "execution_count": 2, - "id": "1ac6376d", + "execution_count": 13, + "id": "b5a64592", "metadata": {}, "outputs": [], "source": [ - "from langchain.text_splitter import RecursiveCharacterTextSplitter" + "from langchain.text_splitter import MarkdownTextSplitter" ] }, { "cell_type": "code", - "execution_count": 3, - "id": "6787b13b", + "execution_count": 14, + "id": "06beaf9b", "metadata": {}, "outputs": [], "source": [ - "text_splitter = RecursiveCharacterTextSplitter(\n", - " # Set a really small chunk size, just to show.\n", - " chunk_size = 100,\n", - " chunk_overlap = 20,\n", - " length_function = len,\n", - ")" + "markdown_text = \"\"\"\n", + "# 🦜️🔗 LangChain\n", + "\n", + "⚡ Building applications with LLMs through composability ⚡\n", + "\n", + "## Quick Install\n", + "\n", + "```bash\n", + "# Hopefully this code block isn't split\n", + "pip install langchain\n", + "```\n", + "\n", + "As an open source project in a rapidly developing field, we are extremely open to contributions.\n", + "\"\"\"\n", + "markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "38bd78a2", + "metadata": {}, + "outputs": [], + "source": [ + "docs = markdown_splitter.create_documents([markdown_text])" ] }, { "cell_type": "code", - "execution_count": 4, - "id": "4f0e7d9b", + "execution_count": 16, + "id": "681d5d19", "metadata": {}, "outputs": [ { - "name": "stdout", - "output_type": "stream", - "text": [ - "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet.\n", - "and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n" - ] + "data": { + "text/plain": [ + "[Document(page_content='# 🦜️🔗 LangChain\\n\\n⚡ Building applications with LLMs through composability ⚡', lookup_str='', metadata={}, lookup_index=0),\n", + " Document(page_content=\"Quick Install\\n\\n```bash\\n# Hopefully this code block isn't split\\npip install langchain\", lookup_str='', metadata={}, lookup_index=0),\n", + " Document(page_content='As an open source project in a rapidly developing field, we are extremely open to contributions.', lookup_str='', metadata={}, lookup_index=0)]" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" } ], "source": [ - "texts = text_splitter.split_text(state_of_the_union)\n", - "print(texts[0])\n", - "print(texts[1])" + "docs" + ] + }, + { + "cell_type": "markdown", + "id": "400d82d8", + "metadata": {}, + "source": [ + "# Python Code Text Splitter\n", + "\n", + "PythonCodeTextSplitter splits text along python class and method definitions. It's implemented as a simple subclass of RecursiveCharacterSplitter with Python-specific separators. See the source code to see the Python syntax expected by default.\n", + "\n", + "1. How the text is split: by list of python specific characters\n", + "2. 
How the chunk size is measured: by length function passed in (defaults to number of characters)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "29fd2bcd", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.text_splitter import PythonCodeTextSplitter" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "c5197adc", + "metadata": {}, + "outputs": [], + "source": [ + "python_text = \"\"\"\n", + "class Foo:\n", + "\n", + " def bar():\n", + " \n", + " \n", + "def foo():\n", + "\n", + "def testing_func():\n", + "\n", + "def bar():\n", + "\"\"\"\n", + "python_splitter = PythonCodeTextSplitter(chunk_size=30, chunk_overlap=0)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "6a11320d", + "metadata": {}, + "outputs": [], + "source": [ + "docs = python_splitter.create_documents([python_text])" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "8ccd4be3", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(page_content='Foo:\\n\\n def bar():', lookup_str='', metadata={}, lookup_index=0),\n", + " Document(page_content='foo():\\n\\ndef testing_func():', lookup_str='', metadata={}, lookup_index=0),\n", + " Document(page_content='bar():', lookup_str='', metadata={}, lookup_index=0)]" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "docs" ] }, { "cell_type": "markdown", - "id": "87a71115", + "id": "5c461b26", "metadata": {}, "source": [ - "## Document creation\n", - "We can also use the text splitter to create \"Documents\" directly. Documents are a way of bundling pieces of text with associated metadata so that chains can interact with them. We can also create documents with empty metadata though!\n", + "## Character Text Splitting\n", "\n", - "In the below example, we pass two pieces of text to get split up (we pass two just to show off the interface of splitting multiple pieces of text)." + "This is a more simple method. This splits based on characters (by default \"\\n\\n\") and measure chunk length by number of characters.\n", + "\n", + "1. How the text is split: by single character\n", + "2. How the chunk size is measured: by length function passed in (defaults to number of characters)" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "79ff6737", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.text_splitter import CharacterTextSplitter\n", + "text_splitter = CharacterTextSplitter( \n", + " separator = \"\\n\\n\",\n", + " chunk_size = 1000,\n", + " chunk_overlap = 200,\n", + " length_function = len,\n", + ")" ] }, { "cell_type": "code", - "execution_count": 4, - "id": "4cd16222", + "execution_count": 23, + "id": "38547666", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \\n\\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. 
But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. ' lookup_str='' metadata={} lookup_index=0\n" + "page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \\n\\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' lookup_str='' metadata={} lookup_index=0\n" ] } ], "source": [ - "documents = text_splitter.create_documents([state_of_the_union, state_of_the_union])\n", - "print(documents[0])" + "texts = text_splitter.create_documents([state_of_the_union])\n", + "print(texts[0])" ] }, { @@ -185,7 +312,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 24, "id": "4a47515a", "metadata": {}, "outputs": [ @@ -193,7 +320,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \\n\\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. ' lookup_str='' metadata={'document': 1} lookup_index=0\n" + "page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. 
\\n\\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' lookup_str='' metadata={'document': 1} lookup_index=0\n" ] } ], @@ -209,15 +336,75 @@ "metadata": {}, "source": [ "## HuggingFace Length Function\n", - "Most LLMs are constrained by the number of tokens that you can pass in, which is not the same as the number of characters. In order to get a more accurate estimate, we can use HuggingFace tokenizers to count the text length." + "Most LLMs are constrained by the number of tokens that you can pass in, which is not the same as the number of characters. In order to get a more accurate estimate, we can use HuggingFace tokenizers to count the text length.\n", + "\n", + "1. How the text is split: by character passed in\n", + "2. How the chunk size is measured: by Hugging Face tokenizer" ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 25, "id": "a8ce51d5", "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "365a203647c94effb38c2058a6c88577", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Downloading (…)olve/main/vocab.json: 0%| | 0.00/1.04M [00:00