langchain

mirror of https://github.com/hwchase17/langchain synced 2024-11-06 03:20:49 +00:00

Sokolov Fedor f4ddf64faa community: Add MarkdownifyTransformer to langchain_community.document_transformers (#21247)	2024-05-08 14:45:13 -07:00

- Added a new document transformer, MarkdownifyTransformer, which uses the `markdownify` package with customizable options to convert HTML to Markdown. It is similar to Html2TextTransformer but has more flexible options. I've also noticed that MarkdownifyTransformer sometimes produces better output than the html2text one, which is why I use markdownify in my project.
- Added docs and tests.
- Usage:

```python
from langchain_community.document_transformers import MarkdownifyTransformer

markdownify = MarkdownifyTransformer()
docs_transform = markdownify.transform_documents(docs)
```

- Example of better output on a simple task:

```
<html>
<head><title>Reports on product movement</title></head>
<body>
<p data-block-key="2wst7">The reports on product movement will be useful for forming supplier orders and controlling outcomes.</p>
</body>
```

Html2TextTransformer:

```python
[Document(page_content='The reports on product movement will be useful for forming supplier orders and\ncontrolling outcomes.\n\n')]
# Note 'and\ncontrolling': an extra '\n' was inserted mid-sentence
```

MarkdownifyTransformer:

```python
[Document(page_content='Reports on product movement\n\nThe reports on product movement will be useful for forming supplier orders and controlling outcomes.')]
```

---------

Co-authored-by: Sokolov Fedor <f.sokolov@sokolov-macbook.bbrouter>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Co-authored-by: Sokolov Fedor <f.sokolov@sokolov-macbook.local>
Co-authored-by: Sokolov Fedor <f.sokolov@192.168.1.6>
..
xsl	community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463)	2023-12-11 13:53:30 -08:00
__init__.py	community: Add MarkdownifyTransformer to langchain_community.document_transformers (#21247)	2024-05-08 14:45:13 -07:00
beautiful_soup_transformer.py	community[patch]: add BeautifulSoupTransformer remove_unwanted_classnames method (#20467)	2024-04-25 17:04:04 +00:00
doctran_text_extract.py	community[minor]: Adding asynchronous function implementation for Doctran (#15941)	2024-01-15 10:39:25 -08:00
doctran_text_qa.py	community: Make doctran synchronous (#15264)	2023-12-28 08:05:24 -08:00
doctran_text_translate.py	community: Make doctran synchronous (#15264)	2023-12-28 08:05:24 -08:00
embeddings_redundant_filter.py	community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463)	2023-12-11 13:53:30 -08:00
google_translate.py	(all): update removal in deprecation warnings from 0.2 to 0.3 (#21265)	2024-05-03 14:29:36 -04:00
html2text.py	community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463)	2023-12-11 13:53:30 -08:00
long_context_reorder.py	community[patch]: docstrings update (#20301)	2024-04-11 16:23:27 -04:00
markdownify.py	community: Add MarkdownifyTransformer to langchain_community.document_transformers (#21247)	2024-05-08 14:45:13 -07:00
nuclia_text_transform.py	community[patch]: docstrings update (#20301)	2024-04-11 16:23:27 -04:00
openai_functions.py	community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463)	2023-12-11 13:53:30 -08:00
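
A note on the extra '\n' reported in the commit message for #21247: the break after "and" falls exactly at column 78, which is consistent with hard line wrapping rather than a parsing fault. The sketch below reproduces that effect with the standard library only. It does not call html2text or markdownify; `wrap_paragraph` and `unwrap_paragraph` are hypothetical helper names, and the width of 78 is an assumption inferred from where the reported output breaks.

```python
import textwrap

# Paragraph text taken from the HTML sample in the commit message.
TEXT = (
    "The reports on product movement will be useful for forming "
    "supplier orders and controlling outcomes."
)


def wrap_paragraph(text: str, width: int) -> str:
    """Hard-wrap a paragraph at `width` columns (hypothetical helper)."""
    return textwrap.fill(text, width=width)


def unwrap_paragraph(text: str) -> str:
    """Join hard-wrapped lines back into one line (hypothetical helper)."""
    return " ".join(text.split())


# Assumed wrap width of 78 columns; the first line fills it exactly.
wrapped = wrap_paragraph(TEXT, 78)
first_line = wrapped.split("\n")[0]

print(repr(wrapped))
print(len(first_line))
print(unwrap_paragraph(wrapped) == TEXT)
```

If wrapping is indeed the cause, the newline is recoverable by re-joining lines as `unwrap_paragraph` does, though that would also flatten intentional line breaks, so a converter that does not wrap prose may be the simpler choice for downstream text splitting.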