langchain

mirror of https://github.com/hwchase17/langchain synced 2024-11-04 06:00:26 +00:00

Eugene Yurtsev 2ceb807da2 Add PDF parser implementations (#4356)		2023-05-09 10:24:17 -04:00

# Add PDF parser implementations

This PR separates data loading from parsing for a number of existing PDF loaders.

Parser tests have been designed to encourage developers to create a consistent interface for parsing PDFs. This interface can be made more consistent in the future by adding information to the initializer about the desired behavior, such as whether to split by page.

This code is expected to be backwards compatible, with one exception: a bug fix in the pymupdf parser, which was returning `bytes` in the page content rather than strings.

The PR also changes the lazy parser method of the document loader to return an Iterator rather than an Iterable over documents.
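
The split between loading and parsing described in this commit message can be illustrated with a minimal sketch. Every name below (`Blob`, `Document`, `PageTextParser`, `load_blob`) is a hypothetical stand-in written for this example, not the classes added by the PR, and the form-feed page split is an assumption that stands in for real PDF parsing.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Document:
    """Hypothetical stand-in for a parsed document (not the library's class)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Blob:
    """Hypothetical container for raw bytes and where they came from."""

    data: bytes
    source: str


def load_blob(path: str) -> Blob:
    """Loading step: read raw bytes from disk; no parsing happens here."""
    with open(path, "rb") as f:
        return Blob(f.read(), path)


class PageTextParser:
    """Toy parser: treats form feeds as page breaks in UTF-8 text.

    A real PDF parser would call a PDF library here instead.
    """

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        # Decode up front so page_content is always str, never bytes.
        text = blob.data.decode("utf-8")
        for page_number, page in enumerate(text.split("\f")):
            yield Document(page, {"source": blob.source, "page": page_number})

    def parse(self, blob: Blob) -> List[Document]:
        # Eager convenience wrapper around the lazy iterator.
        return list(self.lazy_parse(blob))
```

Because `lazy_parse` is a generator function, it returns an iterator that can be consumed once with `next()` or a `for` loop, which is the stricter contract the commit message mentions. The parser never touches the filesystem, so the same parser can be tested against in-memory bytes.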
..
parsers	Add PDF parser implementations (#4356)	2023-05-09 10:24:17 -04:00
__init__.py
test_arxiv.py	`Arxiv` document loader (#3627)	2023-04-26 21:04:56 -07:00
test_bigquery.py	Harrison/big query (#2100)	2023-03-28 08:17:22 -07:00
test_bilibili.py	Added bilibili loader (#2673) (#2724)	2023-04-11 10:40:32 -07:00
test_blockchain.py	Enhancement: option to Get All Tokens with a single Blockchain Document Loader call (#3797)	2023-05-03 15:46:44 -07:00
test_bshtml.py	Add get_text_separator parameter to BSHTMLLoader (#3551)	2023-04-26 16:10:16 -07:00
test_confluence.py	Several confluence loader improvements (#3300)	2023-04-23 15:06:10 -07:00
test_dataframe.py	rm pandas dependency (#2102)	2023-03-28 08:38:19 -07:00
test_duckdb.py	Harrison/duckdb (#2064)	2023-03-27 19:51:34 -07:00
test_email.py	Harrison/msg files (#2375)	2023-04-04 06:48:34 -07:00
test_facebook_chat.py	Refactor TelegramChatLoader and FacebookChatLoader classes and add tests (#3863)	2023-05-03 15:59:19 -07:00
test_figma.py	Harrison/figma doc loader (#1908)	2023-03-22 19:57:46 -07:00
test_gitbook.py	Harrison/gitbook (#2044)	2023-03-28 15:28:33 -07:00
test_ifixit.py
test_json_loader.py	JSON loader (#4067)	2023-05-05 14:48:13 -07:00
test_modern_treasury.py	Dev2049/add modern treasury (#3924)	2023-05-01 20:28:02 -07:00
test_pdf.py	Dev2049/pypdfium2 (#4209)	2023-05-05 17:55:31 -07:00
test_python.py	Add PythonLoader which auto-detects encoding of Python files (#3311)	2023-04-21 10:47:57 -07:00
test_sitemap.py	Harrison/blockwise sitemap (#3940)	2023-05-01 21:34:07 -07:00
test_slack.py	Add Slack Directory Loader (#2841)	2023-04-13 21:31:59 -07:00
test_spreedly.py	Harrison/spreedly (#3937)	2023-05-01 20:56:56 -07:00
test_stripe.py	Dev2049/add modern treasury (#3924)	2023-05-01 20:28:02 -07:00
test_telegram.py	Refactor TelegramChatLoader and FacebookChatLoader classes and add tests (#3863)	2023-05-03 15:59:19 -07:00
test_url_playwright.py	Harrison/playwright selector (#3185)	2023-04-19 16:54:15 -07:00
test_url.py	add continue to fix 'continue_on_failure' parameter for URL doc loader (#2735)	2023-04-11 21:12:39 -07:00
test_whatsapp_chat.py	Update WhatsAppChatLoader regex to handle multiple date-time formats (#4186)	2023-05-05 13:13:05 -07:00