langchain/docs/modules/indexes/text_splitters/examples/latex.ipynb
Leonid Ganeline c998569c8f
docs: text splitters improvements (#4490)
#docs: text splitters improvements

Changes are only in the Jupyter notebooks.
- added links to the source packages and a short description of these
packages
- removed " Text Splitters" suffixes from the TOC elements (they made
the list of the text splitters messy)
- moved text splitters, based on the length function into a separate
list. They can be mixed with any classes from the "Text Splitters", so
it is a different classification.

## Who can review?
        @hwchase17 - project lead
        @eyurtsev
        @vowelparrot

NOTE: please, check out the results of the `Python code` text splitter
example (text_splitters/examples/python.ipynb). It looks suboptimal.
2023-05-17 21:33:34 -07:00

156 lines
6.4 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"id": "3a2f572e",
"metadata": {},
"source": [
"# LaTeX\n",
"\n",
">[LaTeX](https://en.wikipedia.org/wiki/LaTeX) is widely used in academia for the communication and publication of scientific documents in many fields, including mathematics, computer science, engineering, physics, chemistry, economics, linguistics, quantitative psychology, philosophy, and political science.\n",
"\n",
"`LatexTextSplitter` splits text along `LaTeX` headings, headlines, enumerations and more. It's implemented as a subclass of `RecursiveCharacterSplitter` with LaTeX-specific separators. See the source code for more details.\n",
"\n",
"1. How the text is split: by list of `LaTeX` specific tags\n",
"2. How the chunk size is measured: by number of characters"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c2503917",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.text_splitter import LatexTextSplitter"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "e46b753b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"latex_text = \"\"\"\n",
"\\documentclass{article}\n",
"\n",
"\\begin{document}\n",
"\n",
"\\maketitle\n",
"\n",
"\\section{Introduction}\n",
"Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.\n",
"\n",
"\\subsection{History of LLMs}\n",
"The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.\n",
"\n",
"\\subsection{Applications of LLMs}\n",
"LLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.\n",
"\n",
"\\end{document}\n",
"\"\"\"\n",
"latex_splitter = LatexTextSplitter(chunk_size=400, chunk_overlap=0)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "73b5bd33",
"metadata": {},
"outputs": [],
"source": [
"docs = latex_splitter.create_documents([latex_text])"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e1c7fbd5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='\\\\documentclass{article}\\n\\n\\x08egin{document}\\n\\n\\\\maketitle', lookup_str='', metadata={}, lookup_index=0),\n",
" Document(page_content='Introduction}\\nLarge language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.', lookup_str='', metadata={}, lookup_index=0),\n",
" Document(page_content='History of LLMs}\\nThe earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.', lookup_str='', metadata={}, lookup_index=0),\n",
" Document(page_content='Applications of LLMs}\\nLLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.\\n\\n\\\\end{document}', lookup_str='', metadata={}, lookup_index=0)]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "40e62829-9485-414e-9ea1-e1a8fc7c88cb",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"['\\\\documentclass{article}\\n\\n\\x08egin{document}\\n\\n\\\\maketitle',\n",
" 'Introduction}\\nLarge language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.',\n",
" 'History of LLMs}\\nThe earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.',\n",
" 'Applications of LLMs}\\nLLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.\\n\\n\\\\end{document}']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"latex_splitter.split_text(latex_text)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7deb8f25-a062-4956-9f90-513802069667",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
},
"vscode": {
"interpreter": {
"hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}