langchain/libs/text-splitters
Lei Zhang 2cd907ad7e
text-splitters[patch]: fix MarkdownHeaderTextSplitter fails to parse headers with non-printable characters (#20645)
Description: MarkdownHeaderTextSplitter Fails to Parse Headers with
non-printable characters. more #20643

The following is the official test case. Just replacing `# Foo\n\n` with
`\ufeff# Foo\n\n` will cause the test case to fail.

chunk metadata is empty

```python
def test_md_header_text_splitter_1() -> None:
    """Test markdown splitter by header: Case 1."""

    markdown_document = (
        "\ufeff# Foo\n\n"
        "    ## Bar\n\n"
        "Hi this is Jim\n\n"
        "Hi this is Joe\n\n"
        " ## Baz\n\n"
        " Hi this is Molly"
    )
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
    ]
    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
    )
    output = markdown_splitter.split_text(markdown_document)
    expected_output = [
        Document(
            page_content="Hi this is Jim  \nHi this is Joe",
            metadata={"Header 1": "Foo", "Header 2": "Bar"},
        ),
        Document(
            page_content="Hi this is Molly",
            metadata={"Header 1": "Foo", "Header 2": "Baz"},
        ),
    ]
    assert output == expected_output
```

twitter: @coolbeevip

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
2024-04-25 00:07:42 +00:00
..
langchain_text_splitters text-splitters[patch]: fix MarkdownHeaderTextSplitter fails to parse headers with non-printable characters (#20645) 2024-04-25 00:07:42 +00:00
scripts
tests text-splitters[patch]: fix MarkdownHeaderTextSplitter fails to parse headers with non-printable characters (#20645) 2024-04-25 00:07:42 +00:00
Makefile
poetry.lock text-splitters[minor]: Adding a new section aware splitter to langchain (#16526) 2024-04-01 20:32:26 +00:00
pyproject.toml text-splitters[minor]: Adding a new section aware splitter to langchain (#16526) 2024-04-01 20:32:26 +00:00
README.md

🦜✂️ LangChain Text Splitters

Downloads License: MIT

Quick Install

pip install langchain-text-splitters

What is it?

LangChain Text Splitters contains utilities for splitting into chunks a wide variety of text documents.

For full documentation see the API reference and the Text Splitters module in the main docs.

📕 Releases & Versioning

langchain-text-splitters is currently on version 0.0.x.

Minor version increases will occur for:

  • Breaking changes for any public interfaces NOT marked beta

Patch version increases will occur for:

  • Bug fixes
  • New features
  • Any changes to private interfaces
  • Any changes to beta features

💁 Contributing

As an open-source project in a rapidly developing field, we are extremely open to contributions, whether it be in the form of a new feature, improved infrastructure, or better documentation.

For detailed information on how to contribute, see the Contributing Guide.