You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
langchain/libs/text-splitters/langchain_text_splitters
Jiejun Tan c8c67dde6f
text-splitters[patch]: Fix HTMLSectionSplitter (#22812)
Update former pull request:
https://github.com/langchain-ai/langchain/pull/22654.

Modified `langchain_text_splitters.HTMLSectionSplitter`, where in the
latest version `dict` data structure is used to store sections from a
html document, in function `split_html_by_headers`. The header/section
element names serve as dict keys. This can be a problem when duplicate
header/section element names are present in a single html document.
Latter ones can replace former ones with the same name. Therefore some
contents can be miss after html text splitting is conducted.

Using a list to store sections can hopefully solve the problem. A Unit
test considering duplicate header names has been added.

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
3 months ago
..
xsl text-splitters[minor]: Adding a new section aware splitter to langchain (#16526) 6 months ago
__init__.py text-splitters[minor]: Adding a new section aware splitter to langchain (#16526) 6 months ago
base.py Community[minor]: Add language parser for Elixir (#22742) 3 months ago
character.py Community[minor]: Add language parser for Elixir (#22742) 3 months ago
html.py text-splitters[patch]: Fix HTMLSectionSplitter (#22812) 3 months ago
json.py splitters: Add ensure_ascii parameter (#18485) 6 months ago
konlpy.py text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346) 7 months ago
latex.py text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346) 7 months ago
markdown.py text-splitters[patch]: fix MarkdownHeaderTextSplitter fails to parse headers with non-printable characters (#20645) 5 months ago
nltk.py text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346) 7 months ago
py.typed text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346) 7 months ago
python.py text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346) 7 months ago
sentence_transformers.py text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346) 7 months ago
spacy.py text-splitters[minor], langchain[minor], community[patch], templates, docs: langchain-text-splitters 0.0.1 (#18346) 7 months ago