forked from Archives/langchain
6b60c509ac
Co-authored-by: cameronccohen <cameron.c.cohen@gmail.com> Co-authored-by: Cameron Cohen <cameron.cohen@quantco.com>
369 lines
15 KiB
Plaintext
369 lines
15 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "b118c9dc",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Text Splitter\n",
|
||
"\n",
|
||
"When you want to deal wit long pieces of text, it is necessary to split up that text into chunks.\n",
|
||
"This notebook showcases several ways to do that.\n",
|
||
"\n",
|
||
"At a high level, text splitters work as following:\n",
|
||
"\n",
|
||
"1. Split the text up into small, semantically meaningful chunks (often sentences).\n",
|
||
"2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).\n",
|
||
"3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"id": "e82c4685",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"from langchain.text_splitter import CharacterTextSplitter, NLTKTextSplitter, SpacyTextSplitter\n",
|
||
"# This is a long document we can split up.\n",
|
||
"with open('../state_of_the_union.txt') as f:\n",
|
||
" state_of_the_union = f.read()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "5c461b26",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Character Text Splitting\n",
|
||
"\n",
|
||
"Let's start with the most simple method: let's split based on characters (by default \"\\n\\n\") and measure chunk length by number of characters."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"id": "79ff6737",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"text_splitter = CharacterTextSplitter( \n",
|
||
" separator = \"\\n\\n\",\n",
|
||
" chunk_size = 1000,\n",
|
||
" chunk_overlap = 200,\n",
|
||
" length_function = len,\n",
|
||
")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"id": "38547666",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \\n\\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. \\n\\nGroups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. '"
|
||
]
|
||
},
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"texts = text_splitter.split_text(state_of_the_union)\n",
|
||
"texts[0]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "87a71115",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Document creation\n",
|
||
"We can also use the text splitter to create \"Documents\" directly. Documents a way of bundling pieces of text with associated metadata so that chains can interact with them. We can also create documents with empty metadata though!\n",
|
||
"\n",
|
||
"In the below example, we pass two pieces of text to get split up (we pass two just to show off the interface of splitting multiple pieces of text)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"id": "4cd16222",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \\n\\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. ', lookup_str='', metadata={}, lookup_index=0)"
|
||
]
|
||
},
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"documents = text_splitter.create_documents([state_of_the_union, state_of_the_union])\n",
|
||
"documents[0]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "2cede1b1",
|
||
"metadata": {},
|
||
"source": [
|
||
"Here's an example of passing metadata along with the documents, notice that it is split along with the documents."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"id": "4a47515a",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \\n\\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. ', lookup_str='', metadata={'document': 1}, lookup_index=0)"
|
||
]
|
||
},
|
||
"execution_count": 6,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"metadatas = [{\"document\": 1}, {\"document\": 2}]\n",
|
||
"documents = text_splitter.create_documents([state_of_the_union, state_of_the_union], metadatas=metadatas)\n",
|
||
"documents[0]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "13dc0983",
|
||
"metadata": {},
|
||
"source": [
|
||
"## HuggingFace Length Function\n",
|
||
"Most LLMs are constrained by the number of tokens that you can pass in, which is not the same as the number of characters. In order to get a more accurate estimate, we can use HuggingFace tokenizers to count the text length."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"id": "a8ce51d5",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"from transformers import GPT2TokenizerFast\n",
|
||
"\n",
|
||
"tokenizer = GPT2TokenizerFast.from_pretrained(\"gpt2\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"id": "ca5e72c0",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=100, chunk_overlap=0)\n",
|
||
"texts = text_splitter.split_text(state_of_the_union)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"id": "37cdfbeb",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n",
|
||
"\n",
|
||
"Last year COVID-19 kept us apart. This year we are finally together again. \n",
|
||
"\n",
|
||
"Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n",
|
||
"\n",
|
||
"With a duty to one another to the American people to the Constitution. \n",
|
||
"\n",
|
||
"And with an unwavering resolve that freedom will always triumph over tyranny. \n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(texts[0])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "7683b36a",
|
||
"metadata": {},
|
||
"source": [
|
||
"## tiktoken (OpenAI) Length Function\n",
|
||
"You can also use tiktoken, a open source tokenizer package from OpenAI to estimate tokens used. Will probably be more accurate for their models."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"id": "825f7c0a",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=100, chunk_overlap=0)\n",
|
||
"texts = text_splitter.split_text(state_of_the_union)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"id": "ae35d165",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n",
|
||
"\n",
|
||
"Last year COVID-19 kept us apart. This year we are finally together again. \n",
|
||
"\n",
|
||
"Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n",
|
||
"\n",
|
||
"With a duty to one another to the American people to the Constitution. \n",
|
||
"\n",
|
||
"And with an unwavering resolve that freedom will always triumph over tyranny. \n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(texts[0])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "ea2973ac",
|
||
"metadata": {},
|
||
"source": [
|
||
"## NLTK Text Splitter\n",
|
||
"Rather than just splitting on \"\\n\\n\", we can use NLTK to split based on tokenizers."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 14,
|
||
"id": "20fa9c23",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"text_splitter = NLTKTextSplitter(chunk_size=1000)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 16,
|
||
"id": "5ea10835",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.\\n\\nMembers of Congress and the Cabinet.\\n\\nJustices of the Supreme Court.\\n\\nMy fellow Americans.\\n\\nLast year COVID-19 kept us apart.\\n\\nThis year we are finally together again.\\n\\nTonight, we meet as Democrats Republicans and Independents.\\n\\nBut most importantly as Americans.\\n\\nWith a duty to one another to the American people to the Constitution.\\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny.\\n\\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.\\n\\nBut he badly miscalculated.\\n\\nHe thought he could roll into Ukraine and the world would roll over.\\n\\nInstead he met a wall of strength he never imagined.\\n\\nHe met the Ukrainian people.\\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.\\n\\nGroups of citizens blocking tanks with their bodies.\\n\\nEveryone from students to retirees teachers turned soldiers defending their homeland.'"
|
||
]
|
||
},
|
||
"execution_count": 16,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"texts = text_splitter.split_text(state_of_the_union)\n",
|
||
"texts[0]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "dab86b60",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Spacy Text Splitter\n",
|
||
"Another alternative to NLTK is to use Spacy."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 17,
|
||
"id": "f9cc9dfc",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"text_splitter = SpacyTextSplitter(chunk_size=1000)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 19,
|
||
"id": "cef2b29e",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.\\n\\nMembers of Congress and the Cabinet.\\n\\nJustices of the Supreme Court.\\n\\nMy fellow Americans. \\n\\n\\n\\nLast year COVID-19 kept us apart.\\n\\nThis year we are finally together again.\\n\\n\\n\\n\\n\\nTonight, we meet as Democrats Republicans and Independents.\\n\\nBut most importantly as Americans.\\n\\n\\n\\n\\n\\nWith a duty to one another to the American people to the Constitution. \\n\\n\\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny.\\n\\n\\n\\n\\n\\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.\\n\\nBut he badly miscalculated.\\n\\n\\n\\n\\n\\nHe thought he could roll into Ukraine and the world would roll over.\\n\\nInstead he met a wall of strength he never imagined.\\n\\n\\n\\n\\n\\nHe met the Ukrainian people.\\n\\n\\n\\n\\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.\\n\\n\\n\\n\\n\\nGroups of citizens blocking tanks with their bodies.\\n\\nEveryone from students to retirees teachers turned soldiers defending their homeland.'"
|
||
]
|
||
},
|
||
"execution_count": 19,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"texts = text_splitter.split_text(state_of_the_union)\n",
|
||
"texts[0]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "a1a118b1",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3 (ipykernel)",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.10.8"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 5
|
||
}
|