langchain

mirror of https://github.com/hwchase17/langchain synced 2024-11-02 09:40:22 +00:00

History

Lei Zhang 2cd907ad7e text-splitters[patch]: fix MarkdownHeaderTextSplitter fails to parse headers with non-printable characters (#20645 ) Description: MarkdownHeaderTextSplitter Fails to Parse Headers with non-printable characters. more #20643 The following is the official test case. Just replacing `# Foo\n\n` with `\ufeff# Foo\n\n` will cause the test case to fail. chunk metadata is empty ```python def test_md_header_text_splitter_1() -> None: """Test markdown splitter by header: Case 1.""" markdown_document = ( "\ufeff# Foo\n\n" " ## Bar\n\n" "Hi this is Jim\n\n" "Hi this is Joe\n\n" " ## Baz\n\n" " Hi this is Molly" ) headers_to_split_on = [ ("#", "Header 1"), ("##", "Header 2"), ] markdown_splitter = MarkdownHeaderTextSplitter( headers_to_split_on=headers_to_split_on, ) output = markdown_splitter.split_text(markdown_document) expected_output = [ Document( page_content="Hi this is Jim \nHi this is Joe", metadata={"Header 1": "Foo", "Header 2": "Bar"}, ), Document( page_content="Hi this is Molly", metadata={"Header 1": "Foo", "Header 2": "Baz"}, ), ] assert output == expected_output ``` twitter: @coolbeevip Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>		2024-04-25 00:07:42 +00:00
..
xsl	text-splitters[minor]: Adding a new section aware splitter to langchain (#16526 )	2024-04-01 20:32:26 +00:00
__init__.py	text-splitters[minor]: Adding a new section aware splitter to langchain (#16526 )	2024-04-01 20:32:26 +00:00
base.py	text-splitters[minor]: Added Haskell support in langchain.text_splitter module (#16191 )	2024-03-29 20:17:50 +00:00
character.py	text-splitters[minor]: Add lua code splitting (#20421 )	2024-04-13 22:42:51 +00:00
html.py	text-splitters[minor]: Adding a new section aware splitter to langchain (#16526 )	2024-04-01 20:32:26 +00:00
json.py	splitters: Add ensure_ascii parameter (#18485 )	2024-03-19 12:51:16 -07:00
konlpy.py	text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346 )	2024-02-29 18:33:21 -08:00
latex.py	text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346 )	2024-02-29 18:33:21 -08:00
markdown.py	text-splitters[patch]: fix MarkdownHeaderTextSplitter fails to parse headers with non-printable characters (#20645 )	2024-04-25 00:07:42 +00:00
nltk.py	text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346 )	2024-02-29 18:33:21 -08:00
py.typed	text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346 )	2024-02-29 18:33:21 -08:00
python.py	text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346 )	2024-02-29 18:33:21 -08:00
sentence_transformers.py	text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346 )	2024-02-29 18:33:21 -08:00
spacy.py	text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346 )	2024-02-29 18:33:21 -08:00