mirror of
https://github.com/hwchase17/langchain
synced 2024-11-10 01:10:59 +00:00
c8c67dde6f
Update of a former pull request: https://github.com/langchain-ai/langchain/pull/22654. Modifies `langchain_text_splitters.HTMLSectionSplitter`. In the latest version, the function `split_html_by_headers` uses a `dict` to store the sections of an HTML document, with the header/section element names as the dict keys. This is a problem when a single HTML document contains duplicate header/section element names: later sections replace earlier ones with the same name, so some content can be lost when the HTML text is split. Storing the sections in a list should avoid the problem. A unit test covering duplicate header names has been added. --------- Co-authored-by: Bagatur <baskaryan@gmail.com> |
||
---|---|---|
.. | ||
xsl | ||
__init__.py | ||
base.py | ||
character.py | ||
html.py | ||
json.py | ||
konlpy.py | ||
latex.py | ||
markdown.py | ||
nltk.py | ||
py.typed | ||
python.py | ||
sentence_transformers.py | ||
spacy.py |
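
The problem described in the commit message above can be reproduced without the library. The following is a minimal sketch, not the actual `split_html_by_headers` implementation: the sample data and the helper names `collect_as_dict` and `collect_as_list` are hypothetical and exist only to show why a `dict` keyed by header name drops sections when names repeat, while a list keeps them all.

```python
from typing import Dict, List, Tuple

# Hypothetical (header, content) pairs, as if extracted from an HTML
# document in which the header "Notes" appears twice.
extracted: List[Tuple[str, str]] = [
    ("Intro", "Welcome."),
    ("Notes", "First set of notes."),
    ("Notes", "Second set of notes."),
]


def collect_as_dict(pairs: List[Tuple[str, str]]) -> Dict[str, str]:
    """Store sections in a dict keyed by header name (illustrative only)."""
    sections: Dict[str, str] = {}
    for header, content in pairs:
        # A repeated header overwrites the entry stored earlier.
        sections[header] = content
    return sections


def collect_as_list(pairs: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Store sections in a list, one entry per section (illustrative only)."""
    return [{"header": header, "content": content} for header, content in pairs]


if __name__ == "__main__":
    as_dict = collect_as_dict(extracted)
    as_list = collect_as_list(extracted)
    print(len(as_dict))  # 2 sections: the first "Notes" section was replaced
    print(len(as_list))  # 3 sections: both "Notes" sections are kept
```

With the dict, only the second "Notes" section survives. The list preserves every section in document order, at the cost of no longer offering direct lookup by header name.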