You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
langchain/libs/text-splitters
Ryan Elston 86ee4f0daa
text-splitters: Introduce Experimental Markdown Syntax Splitter (#22257)
#### Description
This MR defines a `ExperimentalMarkdownSyntaxTextSplitter` class. The
main goal is to replicate the functionality of the original
`MarkdownHeaderTextSplitter` which extracts the header stack as metadata
but with one critical difference: it keeps the whitespace of the
original text intact.

This draft reimplements the `MarkdownHeaderTextSplitter` with a very
different algorithmic approach. Instead of marking up each line of the
text individually and aggregating them back together into chunks, this
method builds each chunk sequentially and applies the metadata to each
chunk. This makes the implementation simpler. However, since it's
designed to keep white space intact its not a full drop in replacement
for the original. Since it is a radical implementation change to the
original code and I would like to get feedback to see if this is a
worthwhile replacement, should be it's own class, or is not a good idea
at all.

Note: I implemented the `return_each_line` parameter but I don't think
it's a necessary feature. I'd prefer to remove it.

This implementation also adds the following additional features:
- Splits out code blocks and includes the language in the `"Code"`
metadata key
- Splits text on the horizontal rule `---` as well
- The `headers_to_split_on` parameter is now optional - with sensible
defaults that can be overridden.

#### Issue
Keeping the whitespace keeps the paragraphs structure and the formatting
of the code blocks intact which allows the caller much more flexibility
in how they want to further split the individuals sections of the
resulting documents. This addresses the issues brought up by the
community in the following issues:
- https://github.com/langchain-ai/langchain/issues/20823
- https://github.com/langchain-ai/langchain/issues/19436
- https://github.com/langchain-ai/langchain/issues/22256

#### Dependencies
N/A

#### Twitter handle
@RyanElston

---------

Co-authored-by: isaac hershenson <ihershenson@hmc.edu>
3 months ago
..
langchain_text_splitters text-splitters: Introduce Experimental Markdown Syntax Splitter (#22257) 3 months ago
scripts text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346) 7 months ago
tests text-splitters: Introduce Experimental Markdown Syntax Splitter (#22257) 3 months ago
Makefile text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346) 7 months ago
README.md docs: text splitters readme (#18359) 7 months ago
extended_testing_deps.txt multiple: get rid of pyproject extras (#22581) 3 months ago
poetry.lock core[patch]: Pin pydantic in py3.12.4 (#23130) 3 months ago
pyproject.toml multiple: get rid of pyproject extras (#22581) 3 months ago

README.md

🦜✂️ LangChain Text Splitters

Downloads License: MIT

Quick Install

pip install langchain-text-splitters

What is it?

LangChain Text Splitters contains utilities for splitting into chunks a wide variety of text documents.

For full documentation see the API reference and the Text Splitters module in the main docs.

📕 Releases & Versioning

langchain-text-splitters is currently on version 0.0.x.

Minor version increases will occur for:

  • Breaking changes for any public interfaces NOT marked beta

Patch version increases will occur for:

  • Bug fixes
  • New features
  • Any changes to private interfaces
  • Any changes to beta features

💁 Contributing

As an open-source project in a rapidly developing field, we are extremely open to contributions, whether it be in the form of a new feature, improved infrastructure, or better documentation.

For detailed information on how to contribute, see the Contributing Guide.