clean up text splitting docs (#1184)

Harrison Chase 1 year ago committed by GitHub
parent a5a14405ad
commit 91446a5e9b

@@ -8,23 +8,30 @@
"# Text Splitter\n",
"\n",
"When you want to deal with long pieces of text, it is necessary to split up that text into chunks.\n",
"As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What \"semantically related\" means could depend on the type of text.\n",
"This notebook showcases several ways to do that.\n",
"\n",
"At a high level, text splitters work as following:\n",
"\n",
"1. Split the text up into small, semantically meaningful chunks (often sentences).\n",
"2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).\n",
"3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks)."
"3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).\n",
"\n",
"That means there two different axes along which you can customize your text splitter:\n",
"\n",
"1. How the text is split\n",
"2. How the chunk size is measured\n",
"\n",
"For all the examples below, we will highlight both of these attributes"
]
},
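{
"cell_type": "markdown",
"id": "0f1a2b3c",
"metadata": {},
"source": [
"As a minimal sketch of steps 2 and 3 above, the merging logic can be written in a few lines of plain Python. This is an illustration of the combine-and-overlap idea only, not LangChain's actual implementation; the helper name `merge_splits` is made up for this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d5e6f70",
"metadata": {},
"outputs": [],
"source": [
"# Simplified sketch (not LangChain's implementation): greedily merge small\n",
"# splits into chunks of at most chunk_size, carrying chunk_overlap trailing\n",
"# characters into each new chunk to preserve context.\n",
"def merge_splits(splits, chunk_size=100, chunk_overlap=20, length_function=len):\n",
"    chunks, current = [], \"\"\n",
"    for split in splits:\n",
"        if current and length_function(current) + length_function(split) > chunk_size:\n",
"            chunks.append(current)\n",
"            current = current[-chunk_overlap:]  # keep some trailing context\n",
"        current += split\n",
"    if current:\n",
"        chunks.append(current)\n",
"    return chunks\n",
"\n",
"merge_splits([\"foo \", \"bar \", \"baz \"], chunk_size=8, chunk_overlap=4)"
]
},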
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 7,
"id": "e82c4685",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import CharacterTextSplitter, NLTKTextSplitter, SpacyTextSplitter\n",
"# This is a long document we can split up.\n",
"with open('../../state_of_the_union.txt') as f:\n",
" state_of_the_union = f.read()"
@@ -32,147 +39,267 @@
},
{
"cell_type": "markdown",
"id": "5c461b26",
"id": "1c8e504a",
"metadata": {},
"source": [
"## Character Text Splitting\n",
"## Generic Recursive Text Splitting\n",
"This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is `[\"\\n\\n\", \"\\n\", \" \", \"\"]`. This has the affect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.\n",
"\n",
"Let's start with the most simple method: let's split based on characters (by default \"\\n\\n\") and measure chunk length by number of characters."
"\n",
"1. How the text is split: by list of characters\n",
"2. How the chunk size is measured: by length function passed in (defaults to number of characters)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "79ff6737",
"execution_count": 8,
"id": "1fedab44",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = CharacterTextSplitter( \n",
" separator = \"\\n\\n\",\n",
" chunk_size = 1000,\n",
" chunk_overlap = 200,\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "edd10895",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = RecursiveCharacterTextSplitter(\n",
" # Set a really small chunk size, just to show.\n",
" chunk_size = 100,\n",
" chunk_overlap = 20,\n",
" length_function = len,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "38547666",
"execution_count": 12,
"id": "7c7badcd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n",
"\n",
"Last year COVID-19 kept us apart. This year we are finally together again. \n",
"\n",
"Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n",
"\n",
"With a duty to one another to the American people to the Constitution. \n",
"\n",
"And with an unwavering resolve that freedom will always triumph over tyranny. \n",
"\n",
"Six days ago, Russias Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n",
"\n",
"He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n",
"\n",
"He met the Ukrainian people. \n",
"\n",
"From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. \n"
"page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet.' lookup_str='' metadata={} lookup_index=0\n",
"page_content='and the Cabinet. Justices of the Supreme Court. My fellow Americans.' lookup_str='' metadata={} lookup_index=0\n"
]
}
],
"source": [
"texts = text_splitter.split_text(state_of_the_union)\n",
"print(texts[0])"
"texts = text_splitter.create_documents([state_of_the_union])\n",
"print(texts[0])\n",
"print(texts[1])"
]
},
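{
"cell_type": "markdown",
"id": "7a8b9c0d",
"metadata": {},
"source": [
"`create_documents` wraps each chunk in a `Document`. If you only need the raw strings, the same splitter also exposes `split_text`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1e2f3a4b",
"metadata": {},
"outputs": [],
"source": [
"# split_text returns plain strings instead of Document objects.\n",
"texts = text_splitter.split_text(state_of_the_union)\n",
"texts[0]"
]
},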
{
"cell_type": "markdown",
"id": "1be00b73",
"id": "71e12a11",
"metadata": {},
"source": [
"## Recursive Character Text Splitting\n",
"Sometimes, it's not enough to split on just one character. This text splitter uses a whole list of characters and recursive splits them down until they are under the limit."
"## Markdown Text Splitter\n",
"\n",
"MarkdownTextSplitter splits text along Markdown headings, code blocks, or horizontal rules. It's implemented as a simple subclass of RecursiveCharacterSplitter with Markdown-specific separators. See the source code to see the Markdown syntax expected by default.\n",
"\n",
"1. How the text is split: by list of markdown specific characters\n",
"2. How the chunk size is measured: by length function passed in (defaults to number of characters)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "1ac6376d",
"execution_count": 13,
"id": "b5a64592",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import RecursiveCharacterTextSplitter"
"from langchain.text_splitter import MarkdownTextSplitter"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6787b13b",
"execution_count": 14,
"id": "06beaf9b",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = RecursiveCharacterTextSplitter(\n",
" # Set a really small chunk size, just to show.\n",
" chunk_size = 100,\n",
" chunk_overlap = 20,\n",
" length_function = len,\n",
")"
"markdown_text = \"\"\"\n",
"# 🦜️🔗 LangChain\n",
"\n",
"⚡ Building applications with LLMs through composability ⚡\n",
"\n",
"## Quick Install\n",
"\n",
"```bash\n",
"# Hopefully this code block isn't split\n",
"pip install langchain\n",
"```\n",
"\n",
"As an open source project in a rapidly developing field, we are extremely open to contributions.\n",
"\"\"\"\n",
"markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "38bd78a2",
"metadata": {},
"outputs": [],
"source": [
"docs = markdown_splitter.create_documents([markdown_text])"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4f0e7d9b",
"execution_count": 16,
"id": "681d5d19",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet.\n",
"and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n"
]
"data": {
"text/plain": [
"[Document(page_content='# 🦜️🔗 LangChain\\n\\n⚡ Building applications with LLMs through composability ⚡', lookup_str='', metadata={}, lookup_index=0),\n",
" Document(page_content=\"Quick Install\\n\\n```bash\\n# Hopefully this code block isn't split\\npip install langchain\", lookup_str='', metadata={}, lookup_index=0),\n",
" Document(page_content='As an open source project in a rapidly developing field, we are extremely open to contributions.', lookup_str='', metadata={}, lookup_index=0)]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"texts = text_splitter.split_text(state_of_the_union)\n",
"print(texts[0])\n",
"print(texts[1])"
"docs"
]
},
{
"cell_type": "markdown",
"id": "400d82d8",
"metadata": {},
"source": [
"# Python Code Text Splitter\n",
"\n",
"PythonCodeTextSplitter splits text along python class and method definitions. It's implemented as a simple subclass of RecursiveCharacterSplitter with Python-specific separators. See the source code to see the Python syntax expected by default.\n",
"\n",
"1. How the text is split: by list of python specific characters\n",
"2. How the chunk size is measured: by length function passed in (defaults to number of characters)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "29fd2bcd",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import PythonCodeTextSplitter"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "c5197adc",
"metadata": {},
"outputs": [],
"source": [
"python_text = \"\"\"\n",
"class Foo:\n",
"\n",
" def bar():\n",
" \n",
" \n",
"def foo():\n",
"\n",
"def testing_func():\n",
"\n",
"def bar():\n",
"\"\"\"\n",
"python_splitter = PythonCodeTextSplitter(chunk_size=30, chunk_overlap=0)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "6a11320d",
"metadata": {},
"outputs": [],
"source": [
"docs = python_splitter.create_documents([python_text])"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "8ccd4be3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Foo:\\n\\n def bar():', lookup_str='', metadata={}, lookup_index=0),\n",
" Document(page_content='foo():\\n\\ndef testing_func():', lookup_str='', metadata={}, lookup_index=0),\n",
" Document(page_content='bar():', lookup_str='', metadata={}, lookup_index=0)]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs"
]
},
{
"cell_type": "markdown",
"id": "87a71115",
"id": "5c461b26",
"metadata": {},
"source": [
"## Document creation\n",
"We can also use the text splitter to create \"Documents\" directly. Documents are a way of bundling pieces of text with associated metadata so that chains can interact with them. We can also create documents with empty metadata though!\n",
"## Character Text Splitting\n",
"\n",
"In the below example, we pass two pieces of text to get split up (we pass two just to show off the interface of splitting multiple pieces of text)."
"This is a more simple method. This splits based on characters (by default \"\\n\\n\") and measure chunk length by number of characters.\n",
"\n",
"1. How the text is split: by single character\n",
"2. How the chunk size is measured: by length function passed in (defaults to number of characters)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "79ff6737",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import CharacterTextSplitter\n",
"text_splitter = CharacterTextSplitter( \n",
" separator = \"\\n\\n\",\n",
" chunk_size = 1000,\n",
" chunk_overlap = 200,\n",
" length_function = len,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4cd16222",
"execution_count": 23,
"id": "38547666",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \\n\\nSix days ago, Russias Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. ' lookup_str='' metadata={} lookup_index=0\n"
"page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \\n\\nSix days ago, Russias Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' lookup_str='' metadata={} lookup_index=0\n"
]
}
],
"source": [
"documents = text_splitter.create_documents([state_of_the_union, state_of_the_union])\n",
"print(documents[0])"
"texts = text_splitter.create_documents([state_of_the_union])\n",
"print(texts[0])"
]
},
{
@@ -185,7 +312,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 24,
"id": "4a47515a",
"metadata": {},
"outputs": [
@@ -193,7 +320,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \\n\\nSix days ago, Russias Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. ' lookup_str='' metadata={'document': 1} lookup_index=0\n"
"page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \\n\\nSix days ago, Russias Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' lookup_str='' metadata={'document': 1} lookup_index=0\n"
]
}
],
@@ -209,15 +336,75 @@
"metadata": {},
"source": [
"## HuggingFace Length Function\n",
"Most LLMs are constrained by the number of tokens that you can pass in, which is not the same as the number of characters. In order to get a more accurate estimate, we can use HuggingFace tokenizers to count the text length."
"Most LLMs are constrained by the number of tokens that you can pass in, which is not the same as the number of characters. In order to get a more accurate estimate, we can use HuggingFace tokenizers to count the text length.\n",
"\n",
"1. How the text is split: by character passed in\n",
"2. How the chunk size is measured: by Hugging Face tokenizer"
]
},
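{
"cell_type": "markdown",
"id": "5c6d7e8f",
"metadata": {},
"source": [
"As a sketch of the idea, a tokenizer can be wrapped in an ordinary `length_function` and passed to a splitter. The wrapper below (`token_len`) is an assumption for illustration; the notebook's own cells below may use a built-in helper instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9a0b1c2d",
"metadata": {},
"outputs": [],
"source": [
"from transformers import GPT2TokenizerFast\n",
"\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"\n",
"gpt2_tokenizer = GPT2TokenizerFast.from_pretrained(\"gpt2\")\n",
"\n",
"# Hypothetical wrapper: measure chunk length in GPT-2 tokens, not characters.\n",
"def token_len(text: str) -> int:\n",
"    return len(gpt2_tokenizer.encode(text))\n",
"\n",
"text_splitter = CharacterTextSplitter(\n",
"    separator=\"\\n\\n\",\n",
"    chunk_size=100,\n",
"    chunk_overlap=0,\n",
"    length_function=token_len,\n",
")"
]
},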
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 25,
"id": "a8ce51d5",
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "365a203647c94effb38c2058a6c88577",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading (…)olve/main/vocab.json: 0%| | 0.00/1.04M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "230ce4d026714d508e3388bdcbfc58e7",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading (…)olve/main/merges.txt: 0%| | 0.00/456k [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "2f6fbb29210547f584a74eebcd01d442",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading (…)/main/tokenizer.json: 0%| | 0.00/1.36M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "0933ae25626e433ea0dc7595e68de0ed",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading (…)lve/main/config.json: 0%| | 0.00/665 [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from transformers import GPT2TokenizerFast\n",
"\n",
@@ -226,7 +413,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 26,
"id": "ca5e72c0",
"metadata": {},
"outputs": [],
@@ -237,7 +424,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 27,
"id": "37cdfbeb",
"metadata": {},
"outputs": [
@@ -251,7 +438,7 @@
"\n",
"Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n",
"\n",
"With a duty to one another to the American people to the Constitution. \n"
"With a duty to one another to the American people to the Constitution.\n"
]
}
],
@@ -265,12 +452,15 @@
"metadata": {},
"source": [
"## tiktoken (OpenAI) Length Function\n",
"You can also use tiktoken, a open source tokenizer package from OpenAI to estimate tokens used. Will probably be more accurate for their models."
"You can also use tiktoken, a open source tokenizer package from OpenAI to estimate tokens used. Will probably be more accurate for their models.\n",
"\n",
"1. How the text is split: by character passed in\n",
"2. How the chunk size is measured: by `tiktoken` tokenizer"
]
},
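{
"cell_type": "markdown",
"id": "3e4f5a6b",
"metadata": {},
"source": [
"The same pattern works with `tiktoken`: count tokens with an encoding and pass that count as the length function. This wrapper (`tiktoken_len`) is an illustrative assumption; the cells below may use a built-in constructor instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7c8d9e0f",
"metadata": {},
"outputs": [],
"source": [
"import tiktoken\n",
"\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"\n",
"enc = tiktoken.get_encoding(\"gpt2\")\n",
"\n",
"# Hypothetical wrapper: measure chunk length in tiktoken tokens.\n",
"def tiktoken_len(text: str) -> int:\n",
"    return len(enc.encode(text))\n",
"\n",
"text_splitter = CharacterTextSplitter(\n",
"    separator=\"\\n\\n\",\n",
"    chunk_size=100,\n",
"    chunk_overlap=0,\n",
"    length_function=tiktoken_len,\n",
")"
]
},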
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 28,
"id": "825f7c0a",
"metadata": {},
"outputs": [],
@@ -281,7 +471,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 29,
"id": "ae35d165",
"metadata": {},
"outputs": [
@@ -295,7 +485,7 @@
"\n",
"Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n",
"\n",
"With a duty to one another to the American people to the Constitution. \n"
"With a duty to one another to the American people to the Constitution.\n"
]
}
],
@@ -309,16 +499,20 @@
"metadata": {},
"source": [
"## NLTK Text Splitter\n",
"Rather than just splitting on \"\\n\\n\", we can use NLTK to split based on tokenizers."
"Rather than just splitting on \"\\n\\n\", we can use NLTK to split based on tokenizers.\n",
"\n",
"1. How the text is split: by NLTK\n",
"2. How the chunk size is measured: by length function passed in (defaults to number of characters)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 30,
"id": "20fa9c23",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import NLTKTextSplitter\n",
"text_splitter = NLTKTextSplitter(chunk_size=1000)"
]
},
@@ -379,16 +573,20 @@
"metadata": {},
"source": [
"## Spacy Text Splitter\n",
"Another alternative to NLTK is to use Spacy."
"Another alternative to NLTK is to use Spacy.\n",
"\n",
"1. How the text is split: by Spacy\n",
"2. How the chunk size is measured: by length function passed in (defaults to number of characters)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f9cc9dfc",
"execution_count": null,
"id": "e974286e",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import SpacyTextSplitter\n",
"text_splitter = SpacyTextSplitter(chunk_size=1000)"
]
},
@@ -480,7 +678,10 @@
"id": "53049ff5",
"metadata": {},
"source": [
"## Token Text Splitter"
"## Token Text Splitter\n",
"\n",
"1. How the text is split: by `tiktoken` tokens\n",
"2. How the chunk size is measured: by `tiktoken` tokens"
]
},
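{
"cell_type": "markdown",
"id": "2a3b4c5d",
"metadata": {},
"source": [
"A minimal usage sketch, assuming `TokenTextSplitter` is importable from `langchain.text_splitter`; the parameter values here are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e7f8a9b",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import TokenTextSplitter\n",
"\n",
"# Illustrative parameters; sizes are measured in tokens, not characters.\n",
"text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)"
]
},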
{
@@ -523,166 +724,6 @@
"texts = text_splitter.split_text(state_of_the_union)\n",
"print(texts[0])"
]
},
],
"metadata": {
