mirror of
https://github.com/hwchase17/langchain
synced 2024-11-10 01:10:59 +00:00
c8c67dde6f
Updates a former pull request: https://github.com/langchain-ai/langchain/pull/22654. Modifies `langchain_text_splitters.HTMLSectionSplitter`. In the latest version, the function `split_html_by_headers` uses a `dict` to store the sections of an HTML document, with the header/section element names as keys. This is a problem when a single HTML document contains duplicate header/section element names: later sections replace earlier ones with the same name, so some content can go missing after the HTML text is split. Storing the sections in a list should solve the problem. A unit test covering duplicate header names has been added. --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
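The duplicate-header problem described in the commit note above can be illustrated with plain Python. This is only a sketch of the data-structure issue, not the code of `split_html_by_headers`; the section tuples and variable names are hypothetical.

```python
# Hypothetical (header, content) pairs, as they might be extracted from an
# HTML document in which the header text "Overview" appears twice.
sections = [
    ("Overview", "First overview body."),
    ("Details", "Details body."),
    ("Overview", "Second overview body."),
]

# Keyed by header name: the later "Overview" silently replaces the earlier one.
as_dict = {}
for header, content in sections:
    as_dict[header] = content

# Stored in a list: every section survives, in document order.
as_list = []
for header, content in sections:
    as_list.append({"header": header, "content": content})

print(len(as_dict))  # 2
print(len(as_list))  # 3
```

With the dict, the text "First overview body." is no longer reachable after the loop, which is the kind of content loss the commit note describes.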
🦜✂️ LangChain Text Splitters
Quick Install
pip install langchain-text-splitters
What is it?
LangChain Text Splitters contains utilities for splitting a wide variety of text documents into chunks.
For full documentation see the API reference and the Text Splitters module in the main docs.
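To give a rough idea of what "splitting into chunks" means, here is a minimal standalone sketch that greedily packs whitespace-separated words into chunks of at most `max_chars` characters. It is not the package's implementation or API; the function name `split_into_chunks` and the parameter `max_chars` are made up for illustration.

```python
def split_into_chunks(text: str, max_chars: int) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most
    max_chars characters. Illustrative only, not the library's algorithm."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        # A single word longer than max_chars is kept whole as its own
        # (oversized) chunk rather than being cut in the middle.
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("the quick brown fox jumps", 10))
# ['the quick', 'brown fox', 'jumps']
```

The splitters in the package offer more than this sketch covers; see the API reference for what they actually provide.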
📕 Releases & Versioning
`langchain-text-splitters` is currently on version `0.0.x`.
Minor version increases will occur for:
- Breaking changes for any public interfaces NOT marked `beta`
Patch version increases will occur for:
- Bug fixes
- New features
- Any changes to private interfaces
- Any changes to `beta` features
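The versioning rules above can be sketched as a small lookup. The change-kind labels and the `bump` function are hypothetical, and resetting the patch number to 0 on a minor bump is an assumption borrowed from common semantic-versioning practice, not something this README states.

```python
# Hypothetical labels for the kinds of change listed above.
MINOR_CHANGES = {"breaking_public"}  # breaking change to a non-beta public interface
PATCH_CHANGES = {"bug_fix", "new_feature", "private_change", "beta_change"}


def bump(version: str, change: str) -> str:
    """Return the next version string for a given kind of change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change in MINOR_CHANGES:
        # Assumption: the patch number resets to 0 on a minor bump.
        return f"{major}.{minor + 1}.0"
    if change in PATCH_CHANGES:
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change kind: {change!r}")


print(bump("0.0.4", "new_feature"))      # 0.0.5
print(bump("0.0.4", "breaking_public"))  # 0.1.0
```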
💁 Contributing
As an open-source project in a rapidly developing field, we are extremely open to contributions, whether it be in the form of a new feature, improved infrastructure, or better documentation.
For detailed information on how to contribute, see the Contributing Guide.