langchain

mirror of https://github.com/hwchase17/langchain synced 2024-10-29 17:07:25 +00:00

Commit Graph

Author	SHA1	Message	Date
Chetanya Rastogi	50c511d75f	Add new loader to load pdf as html content (#2607) Adds a new pdf loader using the existing dependency on PDFMiner. The new loader can be helpful for chunking texts semantically into sections as the output html content can be parsed via `BeautifulSoup` to get more structured and rich information about font size, page numbers, pdf headers/footers, etc. which may not be available otherwise with other pdf loaders	2023-04-09 17:57:25 -07:00
Harrison Chase	357d808484	Harrison/remote paths pdf (#1544) Co-authored-by: Tim Asp <707699+timothyasp@users.noreply.github.com>	2023-03-08 20:53:37 -08:00
Tim Asp	23231d65a9	Add PyMuPDF PDF loader (#1426) Different PDF libraries have different strengths and weaknesses. PyMuPDF does a good job at extracting the most amount of content from the doc, regardless of the source quality, extremely fast (especially compared to Unstructured). https://pymupdf.readthedocs.io/en/latest/index.html	2023-03-03 20:59:28 -08:00


3 Commits
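
The message for commit 50c511d75f suggests that HTML output can be split into sections using font-size information. A minimal sketch of that idea follows. It is not the loader's actual code: it uses Python's standard `html.parser` instead of `BeautifulSoup` so that it runs without extra dependencies, and the sample markup, the `FontSizeCollector` class, the `split_into_sections` function, and the `heading_size` threshold are all hypothetical. It assumes each text run sits in a `<span>` with an inline `font-size:Npx` style.

```python
import re
from html.parser import HTMLParser

# Matches an inline style such as "font-family: Times; font-size:12px".
FONT_SIZE = re.compile(r"font-size:\s*(\d+)px")


class FontSizeCollector(HTMLParser):
    """Collect (font_size, text) pairs from <span style="...font-size:Npx">."""

    def __init__(self):
        super().__init__()
        self.size_stack = []  # font sizes of currently open spans (None if unknown)
        self.snippets = []    # (font_size, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = dict(attrs).get("style") or ""
            match = FONT_SIZE.search(style)
            self.size_stack.append(int(match.group(1)) if match else None)

    def handle_endtag(self, tag):
        if tag == "span" and self.size_stack:
            self.size_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Keep only text that sits inside a span with a known font size.
        if text and self.size_stack and self.size_stack[-1] is not None:
            self.snippets.append((self.size_stack[-1], text))


def split_into_sections(html, heading_size):
    """Start a new section at every text run whose font size >= heading_size.

    Text that appears before the first heading is dropped in this sketch.
    """
    collector = FontSizeCollector()
    collector.feed(html)
    collector.close()
    sections = []
    for size, text in collector.snippets:
        if size >= heading_size:
            sections.append({"heading": text, "body": []})
        elif sections:
            sections[-1]["body"].append(text)
    return sections


# Hypothetical markup standing in for a PDF-to-HTML conversion result.
SAMPLE_HTML = """
<div><span style="font-family: Times; font-size:20px">Introduction</span></div>
<div><span style="font-family: Times; font-size:11px">First paragraph.</span></div>
<div><span style="font-family: Times; font-size:11px">Second paragraph.</span></div>
<div><span style="font-family: Times; font-size:20px">Methods</span></div>
<div><span style="font-family: Times; font-size:11px">Details here.</span></div>
"""

sections = split_into_sections(SAMPLE_HTML, heading_size=16)
for section in sections:
    print(section["heading"], "->", " ".join(section["body"]))
```

A fixed pixel threshold is the simplest possible heuristic. Real documents would likely need the threshold derived from the distribution of font sizes actually present in the file.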