mirror of
https://github.com/hwchase17/langchain
synced 2024-10-31 15:20:26 +00:00
35297ca0d3
**Description**
This change addresses #10423: it adds the ability to extract images from PDFs and recognize the text on them. I have implemented it for `PyPDFLoader`, `PyPDFium2Loader`, `PyPDFDirectoryLoader`, `PyMuPDFLoader`, `PDFMinerLoader`, and `PDFPlumberLoader`. [RapidOCR](https://github.com/RapidAI/RapidOCR.git) is used to recognize text in the extracted images. Because OCR is time-consuming, a boolean parameter `extract_images` controls whether images are extracted and recognized.

I measured the time taken by each parser with unit tests on my own laptop (ThinkBook 14+ with an AMD R7-6800H). The results are:

| extract_images | PyPDFParser | PDFMinerParser | PyMuPDFParser | PyPDFium2Parser | PDFPlumberParser |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| False | 0.27s | 0.39s | 0.06s | 0.08s | 1.01s |
| True | 17.01s | 20.67s | 20.32s | 19.75s | 20.55s |

**Issue** #10423

**Dependencies** `rapidocr_onnxruntime` from [RapidOCR](https://github.com/RapidAI/RapidOCR/tree/main)

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>

||
---|---|---|
csv.mdx | ||
file_directory.mdx | ||
html.mdx | ||
json.mdx | ||
markdown.mdx | ||
pdf.mdx |
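
To make the cost of `extract_images` in the commit description easier to compare across parsers, the timings from its table can be turned into an absolute overhead and a slowdown factor. The following is a minimal sketch, not part of the commit: the `timings` dictionary and the `ocr_cost` helper are hypothetical names introduced for illustration, and the PyPDFium2Parser value printed as `19,75s` is assumed to mean 19.75 s.

```python
# Timings in seconds, copied from the table in the commit description:
# (extract_images=False, extract_images=True).
# Assumption: the source prints "19,75s" for PyPDFium2Parser; it is read as 19.75.
timings = {
    "PyPDFParser": (0.27, 17.01),
    "PDFMinerParser": (0.39, 20.67),
    "PyMuPDFParser": (0.06, 20.32),
    "PyPDFium2Parser": (0.08, 19.75),
    "PDFPlumberParser": (1.01, 20.55),
}


def ocr_cost(without_ocr: float, with_ocr: float) -> tuple[float, float]:
    """Return (extra seconds spent, slowdown factor) when OCR is enabled."""
    return with_ocr - without_ocr, with_ocr / without_ocr


# Hypothetical helper structure: parser name -> (overhead in s, slowdown factor).
report = {name: ocr_cost(*pair) for name, pair in timings.items()}

# Print parsers from largest to smallest relative slowdown.
for name, (overhead, factor) in sorted(
    report.items(), key=lambda item: item[1][1], reverse=True
):
    print(f"{name:<18} +{overhead:6.2f}s  x{factor:7.1f}")
```

On these figures the absolute overhead is similar for every parser (roughly 17 to 20 seconds), so the relative slowdown mostly reflects how fast each parser is without OCR. The numbers come from a single machine and test file, so they are indicative only.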