langchain/docs/snippets/modules/data_connection/document_loaders/how_to
Theron Tau 35297ca0d3
Add feature for extracting images from pdf and recognizing text from images. (#10653)
**Description**

This addresses #10423, which asks for the ability to extract images
from PDFs and recognize the text in them. The feature is implemented for
`PyPDFLoader`, `PyPDFium2Loader`, `PyPDFDirectoryLoader`,
`PyMuPDFLoader`, `PDFMinerLoader`, and `PDFPlumberLoader`.
[RapidOCR](https://github.com/RapidAI/RapidOCR.git) is used to recognize
text in the extracted images. Because OCR is time-consuming, a boolean
parameter `extract_images` controls whether images are extracted and
recognized. I measured the time taken by each parser with the unit tests
on my own laptop (ThinkBook 14+ with an AMD R7-6800H); the results are:

| extract_images | PyPDFParser | PDFMinerParser | PyMuPDFParser | PyPDFium2Parser | PDFPlumberParser |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| False | 0.27s | 0.39s | 0.06s | 0.08s | 1.01s |
| True  | 17.01s | 20.67s | 20.32s | 19.75s | 20.55s |
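
As a rough illustration of how the `extract_images` flag might be used and timed, here is a minimal sketch. It assumes the `langchain.document_loaders` import path and that `PyPDFLoader` accepts `extract_images` as a keyword argument, as described above. The helpers `time_call`, `load_pdf`, and `compare` are hypothetical names for this example and are not part of the change. It is not the unit test that produced the timings in the table.

```python
import time
from typing import Callable, Dict, List


def time_call(fn: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by a single call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def load_pdf(path: str, extract_images: bool = False) -> List:
    """Load a PDF with PyPDFLoader, optionally running OCR on embedded images."""
    # Imported lazily so this module can be imported without langchain installed.
    # Assumption: this import path and keyword match the installed version.
    from langchain.document_loaders import PyPDFLoader

    loader = PyPDFLoader(path, extract_images=extract_images)
    return loader.load()


def compare(path: str) -> Dict[bool, float]:
    """Time one load of the same file with and without image extraction."""
    timings: Dict[bool, float] = {}
    for flag in (False, True):
        timings[flag] = time_call(lambda: load_pdf(path, extract_images=flag))
    return timings
```

Calling `compare("example.pdf")` (a placeholder path) would need `pypdf` and `rapidocr_onnxruntime` installed. Leaving `extract_images` at its default of `False` skips the OCR cost entirely.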

**Issue**

#10423 

**Dependencies**

`rapidocr_onnxruntime`, from
[RapidOCR](https://github.com/RapidAI/RapidOCR/tree/main)

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-10-05 18:51:59 -07:00
csv.mdx docs: misc retrievers fixes (#9791) 2023-09-03 20:26:49 -07:00
file_directory.mdx docs: misc retrievers fixes (#9791) 2023-09-03 20:26:49 -07:00
html.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
json.mdx JSONLoader Documentation Fix (#10505) 2023-09-21 11:37:40 -07:00
markdown.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
pdf.mdx Add feature for extracting images from pdf and recognizing text from images. (#10653) 2023-10-05 18:51:59 -07:00