langchain/docs/modules/indexes/text_splitters/examples/nltk.ipynb

{
"cells": [
{
"cell_type": "markdown",
"id": "ea2973ac",
"metadata": {},
"source": [
"# NLTK\n",
"\n",
">[The Natural Language Toolkit](https://en.wikipedia.org/wiki/Natural_Language_Toolkit), or more commonly [NLTK](https://www.nltk.org/), is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language.\n",
"\n",
"Rather than just splitting on \"\\n\\n\", we can use `NLTK` to split based on [NLTK tokenizers](https://www.nltk.org/api/nltk.tokenize.html).\n",
"\n",
"1. How the text is split: by the `NLTK` tokenizer.\n",
"2. How the chunk size is measured: by number of characters."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b6af9886-7d53-4aab-84f6-303c4cce7882",
"metadata": {},
"outputs": [],
"source": [
"# pip install nltk"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "aed17ddf",
"metadata": {},
"outputs": [],
"source": [
"# This is a long document we can split up.\n",
"with open('../../../state_of_the_union.txt') as f:\n",
"    state_of_the_union = f.read()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "20fa9c23",
"metadata": {},
"outputs": [],
"source": [
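"from langchain.text_splitter import NLTKTextSplitter\n",
"\n",
"# Split on NLTK sentence boundaries, then merge sentences into chunks of up to about 1000 characters.\n",
"text_splitter = NLTKTextSplitter(chunk_size=1000)"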
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "5ea10835",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.\n",
"\n",
"Members of Congress and the Cabinet.\n",
"\n",
"Justices of the Supreme Court.\n",
"\n",
"My fellow Americans.\n",
"\n",
"Last year COVID-19 kept us apart.\n",
"\n",
"This year we are finally together again.\n",
"\n",
"Tonight, we meet as Democrats Republicans and Independents.\n",
"\n",
"But most importantly as Americans.\n",
"\n",
"With a duty to one another to the American people to the Constitution.\n",
"\n",
"And with an unwavering resolve that freedom will always triumph over tyranny.\n",
"\n",
"Six days ago, Russia's Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.\n",
"\n",
"But he badly miscalculated.\n",
"\n",
"He thought he could roll into Ukraine and the world would roll over.\n",
"\n",
"Instead he met a wall of strength he never imagined.\n",
"\n",
"He met the Ukrainian people.\n",
"\n",
"From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.\n",
"\n",
"Groups of citizens blocking tanks with their bodies.\n"
]
}
],
"source": [
"texts = text_splitter.split_text(state_of_the_union)\n",
"print(texts[0])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
},
"vscode": {
"interpreter": {
"hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}