langchain

mirror of https://github.com/hwchase17/langchain synced 2024-11-08 07:10:35 +00:00


Theron Tau 35297ca0d3 Add feature for extracting images from pdf and recognizing text from images. (#10653)	2023-10-05 18:51:59 -07:00

Description

This addresses #10423: it would be a useful feature to extract images from PDFs and recognize the text on them. I have implemented it for `PyPDFLoader`, `PyPDFium2Loader`, `PyPDFDirectoryLoader`, `PyMuPDFLoader`, `PDFMinerLoader`, and `PDFPlumberLoader`. [RapidOCR](https://github.com/RapidAI/RapidOCR.git) is used to recognize text in the extracted images. Because OCR is time-consuming, a boolean parameter `extract_images` controls whether images are extracted and recognized.

I measured the time taken by each parser with unit tests on my own laptop (ThinkBook 14+ with an AMD R7-6800H). The results are:

| extract_images | PyPDFParser | PDFMinerParser | PyMuPDFParser | PyPDFium2Parser | PDFPlumberParser |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| False | 0.27s | 0.39s | 0.06s | 0.08s | 1.01s |
| True | 17.01s | 20.67s | 20.32s | 19.75s | 20.55s |

Issue

#10423

Dependencies

rapidocr_onnxruntime in [RapidOCR](https://github.com/RapidAI/RapidOCR/tree/main)

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
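
The commit message above describes an `extract_images` flag on the PDF loaders and compares load times with the flag off and on. A minimal sketch of how that flag might be used and timed follows. It assumes `langchain`, `pypdf`, and `rapidocr_onnxruntime` are installed. The helper names `load_pdf` and `time_extract_images` and the path `example.pdf` are hypothetical and not part of the commit.

```python
import time
from typing import Callable, Dict


def load_pdf(pdf_path: str, extract_images: bool = False):
    """Load a PDF with PyPDFLoader, optionally running OCR on embedded images.

    Assumption: langchain, pypdf and rapidocr_onnxruntime are installed.
    The import is done lazily so the timing helper below works without them.
    """
    from langchain.document_loaders import PyPDFLoader

    loader = PyPDFLoader(pdf_path, extract_images=extract_images)
    return loader.load()


def time_extract_images(make_loader: Callable[..., object], pdf_path: str) -> Dict[bool, float]:
    """Time loader.load() once with extract_images=False and once with True.

    `make_loader` is any callable accepting (path, extract_images=...) and
    returning an object with a .load() method, e.g. PyPDFLoader.
    """
    timings: Dict[bool, float] = {}
    for flag in (False, True):
        loader = make_loader(pdf_path, extract_images=flag)
        start = time.perf_counter()
        loader.load()
        timings[flag] = time.perf_counter() - start
    return timings
```

For example, `time_extract_images(PyPDFLoader, "example.pdf")` would return a dictionary of seconds keyed by the flag value. Actual numbers depend on the machine and the document, so they will not necessarily match the table in the commit message.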
..
agents	Use term keyword according to the official python doc glossary (#11338)	2023-10-03 12:56:08 -07:00
callbacks	docs: agents & callbacks fixes (#10066)	2023-09-01 13:28:55 -07:00
chains	Fix documents for RetrievalQAWithSourcesChain (#11292)	2023-10-03 17:36:16 -07:00
data_connection	Add feature for extracting images from pdf and recognizing text from images. (#10653)	2023-10-05 18:51:59 -07:00
memory	Fixed typo in get_started.mdx (#10163)	2023-09-04 00:09:50 -07:00
model_io	Docs: improve similarity search examples (#11298)	2023-10-03 21:47:08 -04:00