langchain

mirror of https://github.com/hwchase17/langchain synced 2024-11-11 19:11:02 +00:00

Alexander Golodkov 2a70a07aad community[minor]: added new document loaders based on dedoc library (#24303)	2024-07-23 02:04:53 +00:00

### Description

This pull request added new document loaders to load documents of various formats using [Dedoc](https://github.com/ispras/dedoc):

- `DedocFileLoader` (determines file types automatically and parses them)
- `DedocPDFLoader` (for parsing `PDF` files and images)
- `DedocAPIFileLoader` (determines file types automatically and parses them using the Dedoc API, without installing the library)

[Dedoc](https://dedoc.readthedocs.io) is an open-source library/service that extracts texts, tables, attached files and document structure (e.g., titles, list items, etc.) from files of various formats. The library is actively developed and maintained by a group of developers.

`Dedoc` supports `DOCX`, `XLSX`, `PPTX`, `EML`, `HTML`, `PDF`, images and more. The full list of supported formats can be found [here](https://dedoc.readthedocs.io/en/latest/#id1). For `PDF` documents, `Dedoc` can determine whether the textual layer is correct and split the document into paragraphs.

### Issue

This pull request extends the variety of document loaders supported by `langchain_community`, allowing users to choose the most suitable option for parsing raw documents.

### Dependencies

The PR added a new (optional) dependency `dedoc>=2.2.5` ([library documentation](https://dedoc.readthedocs.io)) to the `extended_testing_deps.txt`

### Twitter handle

None

### Add tests and docs

1. Test for the integration: `libs/community/tests/integration_tests/document_loaders/test_dedoc.py`
2. Example notebook: `docs/docs/integrations/document_loaders/dedoc.ipynb`
3. Information about the library: `docs/docs/integrations/providers/dedoc.mdx`

### Lint and test

Done locally:

- `make format`
- `make lint`
- `make integration_tests`
- `make docs_build` (from the project root)

---------

Co-authored-by: Nasty <bogatenkova.anastasiya@mail.ru>
cli	all: add release notes to pypi (#24519)	2024-07-22 13:59:13 -07:00
community	community[minor]: added new document loaders based on dedoc library (#24303)	2024-07-23 02:04:53 +00:00
core	community[minor]: add document transformer for extracting links (#24186)	2024-07-22 22:01:21 -04:00
experimental	all: add release notes to pypi (#24519)	2024-07-22 13:59:13 -07:00
langchain	all: add release notes to pypi (#24519)	2024-07-22 13:59:13 -07:00
partners	standard-tests: add override check (#24407)	2024-07-22 23:38:01 +00:00
standard-tests	standard-tests: add override check (#24407)	2024-07-22 23:38:01 +00:00
text-splitters	all: add release notes to pypi (#24519)	2024-07-22 13:59:13 -07:00