langchain/docs/extras/modules/data_connection
mziru 9e3c1d4463
add HTMLHeaderTextSplitter (#11039)
Description: Similar in concept to the `MarkdownHeaderTextSplitter`, the
`HTMLHeaderTextSplitter` is a "structure-aware" chunker that splits text
at the element level and attaches metadata for every header that applies
to a given chunk. It can return chunks element by element or combine
elements that share the same metadata. The goals are to (a) keep related
text grouped, at least roughly, by meaning and (b) preserve the
context-rich information encoded in the document structure. It can be
combined with other text splitters as part of a chunking pipeline.
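
The behavior described above can be illustrated with a minimal sketch. This
is not the code added in this commit: it uses the standard-library
`html.parser` instead of `lxml`, and the names `HEADERS`, `HeaderChunker`,
and `split_html` are hypothetical. The `return_each_element` flag here only
mirrors the two modes the description mentions (element by element, or
combined by shared metadata).

```python
from html.parser import HTMLParser

# Hypothetical mapping from header tag to metadata key (illustrative only).
HEADERS = {"h1": "Header 1", "h2": "Header 2"}


class HeaderChunker(HTMLParser):
    """Collects one {"content", "metadata"} record per piece of body text."""

    def __init__(self):
        super().__init__()
        self.active = {}       # header tag -> header text currently in scope
        self.in_header = None  # header tag being read right now, if any
        self.elements = []     # collected records, in document order

    def handle_starttag(self, tag, attrs):
        if tag in HEADERS:
            self.in_header = tag

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.in_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_header:
            # A new header replaces its own level and drops deeper levels.
            # Tag names "h1".."h6" sort correctly as plain strings.
            level = self.in_header
            self.active = {t: v for t, v in self.active.items() if t < level}
            self.active[level] = text
        else:
            metadata = {HEADERS[t]: v for t, v in self.active.items()}
            self.elements.append({"content": text, "metadata": metadata})


def split_html(html, return_each_element=False):
    """Split HTML into chunks tagged with the headers in scope."""
    parser = HeaderChunker()
    parser.feed(html)
    parser.close()
    if return_each_element:
        return parser.elements
    chunks = []
    for el in parser.elements:
        # Merge neighbouring elements whose header metadata is identical.
        if chunks and chunks[-1]["metadata"] == el["metadata"]:
            chunks[-1]["content"] += "\n" + el["content"]
        else:
            chunks.append({"content": el["content"],
                           "metadata": dict(el["metadata"])})
    return chunks
```

Under these assumptions, two paragraphs under the same `<h1>` end up in one
chunk, while a paragraph under a nested `<h2>` becomes its own chunk carrying
both header values. The sketch ignores cases a real splitter has to handle,
such as headers containing nested inline tags.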

Dependency: `lxml` Python package

Maintainer: @hwchase17

Twitter handle: @MartinZirulnik

---------

Co-authored-by: PresidioVantage <github@presidiovantage.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-10-04 09:24:25 -04:00
document_transformers add HTMLHeaderTextSplitter (#11039) 2023-10-04 09:24:25 -04:00
retrievers [OpenSearch] Add Self Query Retriever Support to OpenSearch (#11184) 2023-09-28 12:36:52 -07:00
text_embedding docs: misc retrievers fixes (#9791) 2023-09-03 20:26:49 -07:00
indexing.ipynb Fixing some spelling mistakes (#10881) 2023-09-27 10:56:51 -07:00