langchain/docs/modules/indexes/text_splitters/examples/spacy.ipynb
Leonid Ganeline c998569c8f
docs: text splitters improvements (#4490)

Changes are only in the Jupyter notebooks.
- added links to the source packages and a short description of these
packages
- removed " Text Splitters" suffixes from the TOC elements (they made
the list of the text splitters messy)
- moved the text splitters that are based on the length function into a
separate list. They can be combined with any class from "Text Splitters",
so this is a different classification.

## Who can review?
        @hwchase17 - project lead
        @eyurtsev
        @vowelparrot

NOTE: please check the results of the `Python code` text splitter
example (text_splitters/examples/python.ipynb). They look suboptimal.
2023-05-17 21:33:34 -07:00


{
"cells": [
{
"cell_type": "markdown",
"id": "dab86b60",
"metadata": {},
"source": [
"# spaCy\n",
"\n",
">[spaCy](https://spacy.io/) is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.\n",
"\n",
"An alternative to `NLTK` is the [spaCy tokenizer](https://spacy.io/api/tokenizer).\n",
"\n",
"1. How the text is split: by the `spaCy` tokenizer\n",
"2. How the chunk size is measured: by the number of characters"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d0b9242f-690c-4819-b35a-bb68187281ed",
"metadata": {},
"outputs": [],
"source": [
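"#!pip install spacy\n",
"# A spaCy pipeline model is also required (en_core_web_sm is assumed here):\n",
"#!python -m spacy download en_core_web_sm"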
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "f1de7767",
"metadata": {},
"outputs": [],
"source": [
"# This is a long document we can split up.\n",
"with open('../../../state_of_the_union.txt') as f:\n",
"    state_of_the_union = f.read()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "f4ec9b90",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import SpacyTextSplitter\n",
"text_splitter = SpacyTextSplitter(chunk_size=1000)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "cef2b29e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.\n",
"\n",
"Members of Congress and the Cabinet.\n",
"\n",
"Justices of the Supreme Court.\n",
"\n",
"My fellow Americans. \n",
"\n",
"\n",
"\n",
"Last year COVID-19 kept us apart.\n",
"\n",
"This year we are finally together again. \n",
"\n",
"\n",
"\n",
"Tonight, we meet as Democrats Republicans and Independents.\n",
"\n",
"But most importantly as Americans. \n",
"\n",
"\n",
"\n",
"With a duty to one another to the American people to the Constitution. \n",
"\n",
"\n",
"\n",
"And with an unwavering resolve that freedom will always triumph over tyranny. \n",
"\n",
"\n",
"\n",
"Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.\n",
"\n",
"But he badly miscalculated. \n",
"\n",
"\n",
"\n",
"He thought he could roll into Ukraine and the world would roll over.\n",
"\n",
"Instead he met a wall of strength he never imagined. \n",
"\n",
"\n",
"\n",
"He met the Ukrainian people. \n",
"\n",
"\n",
"\n",
"From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.\n"
]
}
],
"source": [
"texts = text_splitter.split_text(state_of_the_union)\n",
"print(texts[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ff3064a7",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
},
"vscode": {
"interpreter": {
"hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}
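
The notebook's first cell says that the chunk size is measured by the number of characters. The sketch below illustrates that idea without requiring spaCy: it splits text into sentences and packs them greedily into chunks whose `len()` stays within `chunk_size`. This is not LangChain's or spaCy's implementation. The helper names `split_sentences` and `merge_by_chars` are hypothetical, the regex is only a naive stand-in for spaCy's sentence segmentation, and the `"\n\n"` separator is an assumption chosen to match the blank lines in the notebook's printed output.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (assumption: a stand-in for spaCy, not its algorithm)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def merge_by_chars(sentences, chunk_size, separator="\n\n"):
    """Greedily pack sentences into chunks of at most chunk_size characters.

    A single sentence longer than chunk_size is kept whole as its own chunk.
    """
    chunks = []
    current = []
    current_len = 0
    for sentence in sentences:
        # Characters this sentence would add, including the joining separator.
        extra = len(sentence) + (len(separator) if current else 0)
        if current and current_len + extra > chunk_size:
            # The chunk is full: emit it and start a new one.
            chunks.append(separator.join(current))
            current = []
            current_len = 0
            extra = len(sentence)
        current.append(sentence)
        current_len += extra
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = (
    "Last year COVID-19 kept us apart. "
    "This year we are finally together again. "
    "But he badly miscalculated."
)
chunks = merge_by_chars(split_sentences(sample), chunk_size=80)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

With `chunk_size=80`, the first two sentences fit in one chunk and the third starts a new one, because adding it would push the character count past the limit.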