mirror of
https://github.com/hwchase17/langchain
synced 2024-11-10 01:10:59 +00:00
86ee4f0daa
#### Description This MR defines a `ExperimentalMarkdownSyntaxTextSplitter` class. The main goal is to replicate the functionality of the original `MarkdownHeaderTextSplitter` which extracts the header stack as metadata but with one critical difference: it keeps the whitespace of the original text intact. This draft reimplements the `MarkdownHeaderTextSplitter` with a very different algorithmic approach. Instead of marking up each line of the text individually and aggregating them back together into chunks, this method builds each chunk sequentially and applies the metadata to each chunk. This makes the implementation simpler. However, since it's designed to keep white space intact its not a full drop in replacement for the original. Since it is a radical implementation change to the original code and I would like to get feedback to see if this is a worthwhile replacement, should be it's own class, or is not a good idea at all. Note: I implemented the `return_each_line` parameter but I don't think it's a necessary feature. I'd prefer to remove it. This implementation also adds the following additional features: - Splits out code blocks and includes the language in the `"Code"` metadata key - Splits text on the horizontal rule `---` as well - The `headers_to_split_on` parameter is now optional - with sensible defaults that can be overridden. #### Issue Keeping the whitespace keeps the paragraphs structure and the formatting of the code blocks intact which allows the caller much more flexibility in how they want to further split the individuals sections of the resulting documents. This addresses the issues brought up by the community in the following issues: - https://github.com/langchain-ai/langchain/issues/20823 - https://github.com/langchain-ai/langchain/issues/19436 - https://github.com/langchain-ai/langchain/issues/22256 #### Dependencies N/A #### Twitter handle @RyanElston --------- Co-authored-by: isaac hershenson <ihershenson@hmc.edu> |
||
---|---|---|
.. | ||
xsl | ||
__init__.py | ||
base.py | ||
character.py | ||
html.py | ||
json.py | ||
konlpy.py | ||
latex.py | ||
markdown.py | ||
nltk.py | ||
py.typed | ||
python.py | ||
sentence_transformers.py | ||
spacy.py |