You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
langchain/libs
Lei Zhang 2cd907ad7e
text-splitters[patch]: fix MarkdownHeaderTextSplitter fails to parse headers with non-printable characters (#20645)
Description: MarkdownHeaderTextSplitter Fails to Parse Headers with
non-printable characters. more #20643

The following is the official test case. Just replacing `# Foo\n\n` with
`\ufeff# Foo\n\n` will cause the test case to fail.

chunk metadata is empty

```python
def test_md_header_text_splitter_1() -> None:
    """Test markdown splitter by header: Case 1."""

    markdown_document = (
        "\ufeff# Foo\n\n"
        "    ## Bar\n\n"
        "Hi this is Jim\n\n"
        "Hi this is Joe\n\n"
        " ## Baz\n\n"
        " Hi this is Molly"
    )
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
    ]
    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
    )
    output = markdown_splitter.split_text(markdown_document)
    expected_output = [
        Document(
            page_content="Hi this is Jim  \nHi this is Joe",
            metadata={"Header 1": "Foo", "Header 2": "Bar"},
        ),
        Document(
            page_content="Hi this is Molly",
            metadata={"Header 1": "Foo", "Header 2": "Baz"},
        ),
    ]
    assert output == expected_output
```

twitter: @coolbeevip

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
2 months ago
..
cli templates: readme langsmith not private beta (#20173) 2 months ago
community patch: remove usage of llm, chat model __call__ (#20788) 2 months ago
core support messages in messages out (#20862) 2 months ago
experimental experimental[minor]: upgrade the prompt injection model (#20783) 2 months ago
langchain patch: remove usage of llm, chat model __call__ (#20788) 2 months ago
partners patch: remove usage of llm, chat model __call__ (#20788) 2 months ago
standard-tests standard-tests: split tool calling test (#20803) 2 months ago
text-splitters text-splitters[patch]: fix MarkdownHeaderTextSplitter fails to parse headers with non-printable characters (#20645) 2 months ago