langchain/libs/community/langchain_community/document_transformers
Sokolov Fedor f4ddf64faa
community: Add MarkdownifyTransformer to langchain_community.document_transformers (#21247)
- Added a new document transformer, MarkdownifyTransformer, which uses the
`markdownify` package with customizable options to convert HTML to
Markdown. It is similar to Html2TextTransformer but offers more flexible
options. I've also noticed that MarkdownifyTransformer sometimes produces
cleaner output than the html2text-based one, which is why I use markdownify
in my project.
- Added docs and tests

- Usage:
```python
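from langchain_core.documents import Document
from langchain_community.document_transformers import MarkdownifyTransformer

# Any list of Documents with HTML in page_content works; this one is illustrative.
docs = [Document(page_content="<p>Hello, <b>world</b>!</p>")]

markdownify = MarkdownifyTransformer()
docs_transform = markdownify.transform_documents(docs)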
```

- Example of cleaner output on a simple input that I've noticed:
```
<html>
<head><title>Reports on product movement</title></head>
<body>
<p data-block-key="2wst7">The reports on product movement will be useful for forming supplier orders and controlling outcomes.</p>
</body>
```
**Html2TextTransformer**: 
```python
[Document(page_content='The reports on product movement will be useful for forming supplier orders and\ncontrolling outcomes.\n\n')]
# Note the extra '\n' inside 'and\ncontrolling', which breaks the sentence mid-line
```
**MarkdownifyTransformer**:
```python
[Document(page_content='Reports on product movement\n\nThe reports on product movement will be useful for forming supplier orders and controlling outcomes.')]
```
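
A likely explanation for the extra `'\n'` in the Html2TextTransformer output is hard line wrapping rather than anything in the HTML itself. The sketch below uses only the standard library and does not call html2text or either transformer; it assumes a wrap width of 78 columns, which is my understanding of html2text's default and is not stated in this commit. The name `ASSUMED_WRAP_WIDTH` is hypothetical.

```python
import textwrap

# Paragraph text taken from the sample HTML above.
paragraph = (
    "The reports on product movement will be useful for forming "
    "supplier orders and controlling outcomes."
)

# ASSUMPTION: 78 columns mirrors html2text's default wrap width;
# this value does not come from the commit message.
ASSUMED_WRAP_WIDTH = 78

# Hard-wrap the paragraph the way a fixed-width text renderer would.
wrapped = textwrap.fill(paragraph, width=ASSUMED_WRAP_WIDTH)

# The break lands between "and" and "controlling", as in the output above.
print(repr(wrapped))
```

The first line, ending in "orders and", is exactly 78 characters long, so the next word is pushed onto a new line. If this is indeed the cause, the newline is a formatting choice of the converter, and a converter that does not wrap (as the MarkdownifyTransformer output suggests) keeps the sentence on one line.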

---------

Co-authored-by: Sokolov Fedor <f.sokolov@sokolov-macbook.bbrouter>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Co-authored-by: Sokolov Fedor <f.sokolov@sokolov-macbook.local>
Co-authored-by: Sokolov Fedor <f.sokolov@192.168.1.6>
4 months ago
| Name | Last commit | Updated |
| --- | --- | --- |
| `xsl` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 9 months ago |
| `__init__.py` | community: Add MarkdownifyTransformer to langchain_community.document_transformers (#21247) | 4 months ago |
| `beautiful_soup_transformer.py` | community[patch]: add BeautifulSoupTransformer remove_unwanted_classnames method (#20467) | 5 months ago |
| `doctran_text_extract.py` | community[minor]: Adding asynchronous function implementation for Doctran (#15941) | 8 months ago |
| `doctran_text_qa.py` | community: Make doctran synchronous (#15264) | 9 months ago |
| `doctran_text_translate.py` | community: Make doctran synchronous (#15264) | 9 months ago |
| `embeddings_redundant_filter.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 9 months ago |
| `google_translate.py` | (all): update removal in deprecation warnings from 0.2 to 0.3 (#21265) | 4 months ago |
| `html2text.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 9 months ago |
| `long_context_reorder.py` | community[patch]: docstrings update (#20301) | 5 months ago |
| `markdownify.py` | community: Add MarkdownifyTransformer to langchain_community.document_transformers (#21247) | 4 months ago |
| `nuclia_text_transform.py` | community[patch]: docstrings update (#20301) | 5 months ago |
| `openai_functions.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 9 months ago |