langchain

mirror of https://github.com/hwchase17/langchain synced 2024-11-10 01:10:59 +00:00

Jiejun Tan c8c67dde6f text-splitters[patch]: Fix HTMLSectionSplitter (#22812) Updates the earlier pull request: https://github.com/langchain-ai/langchain/pull/22654. Modifies `langchain_text_splitters.HTMLSectionSplitter`. In the latest version, the function `split_html_by_headers` stores the sections of an HTML document in a `dict`, with the header/section element names as keys. This is a problem when a single HTML document contains duplicate header/section element names: a later section replaces an earlier one with the same name, so some content can be lost when the HTML text is split. Storing the sections in a list instead should solve the problem. A unit test covering duplicate header names has been added. --------- Co-authored-by: Bagatur <baskaryan@gmail.com>		2024-06-14 22:40:39 +00:00
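
The failure mode described in the commit message can be illustrated with a minimal sketch. This is not the actual `split_html_by_headers` code: the `pairs` data and the helpers `collect_as_dict` and `collect_as_list` are hypothetical names used only to show why keying sections by header name drops duplicates, while a list keeps them.

```python
from typing import Dict, List, Tuple

# Hypothetical (header, content) pairs, as if already extracted from an HTML
# document in which the header text "Notes" appears twice.
pairs: List[Tuple[str, str]] = [
    ("Intro", "Welcome."),
    ("Notes", "First notes section."),
    ("Notes", "Second notes section."),
]


def collect_as_dict(items: List[Tuple[str, str]]) -> Dict[str, str]:
    """Store sections keyed by header name (the problematic approach)."""
    sections: Dict[str, str] = {}
    for header, content in items:
        # A repeated header overwrites the earlier entry with the same key.
        sections[header] = content
    return sections


def collect_as_list(items: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Store sections in document order, one entry per section."""
    return [{"header": header, "content": content} for header, content in items]


if __name__ == "__main__":
    print(len(collect_as_dict(pairs)))  # 2: the first "Notes" section is gone
    print(len(collect_as_list(pairs)))  # 3: every section is kept
```

With the dict, only the second "Notes" section survives; with the list, both are kept in their original order.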
..
xsl	text-splitters[minor]: Adding a new section aware splitter to langchain (#16526)	2024-04-01 20:32:26 +00:00
__init__.py	text-splitters[minor]: Adding a new section aware splitter to langchain (#16526)	2024-04-01 20:32:26 +00:00
base.py	Community[minor]: Add language parser for Elixir (#22742)	2024-06-10 15:56:57 +00:00
character.py	Community[minor]: Add language parser for Elixir (#22742)	2024-06-10 15:56:57 +00:00
html.py	text-splitters[patch]: Fix HTMLSectionSplitter (#22812)	2024-06-14 22:40:39 +00:00
json.py	splitters: Add ensure_ascii parameter (#18485)	2024-03-19 12:51:16 -07:00
konlpy.py
latex.py
markdown.py	text-splitters[patch]: fix MarkdownHeaderTextSplitter fails to parse headers with non-printable characters (#20645)	2024-04-25 00:07:42 +00:00
nltk.py
py.typed
python.py
sentence_transformers.py
spacy.py