9 Commits (06cde06a200ffaf7ddbb5553c2c5af9ac552cf31)

Author SHA1 Message Date
Brice Fotzo 034a8c7c1b
community: support advanced text extraction options for pdf documents (#20265)
**Description:** 
- Updated constructors in PyPDFParser and PyPDFLoader to handle
`extraction_mode` and additional kwargs, aligning with the capabilities
of `PageObject.extract_text()` from pypdf.

- Added `test_pypdf_loader_with_layout` along with a corresponding
example text file to validate layout extraction from PDFs.

**Issue:** fixes #19735 

**Dependencies:** This change requires updating the pypdf dependency
from version 3.4.0 to at least 4.0.0.

Additional changes include the addition of a new test
test_pypdf_loader_with_layout and an example text file to ensure the
functionality of layout extraction from PDFs aligns with the new
capabilities.
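
The option forwarding described above can be illustrated with a minimal, self-contained sketch. It is not the code from the commit: `FakePage` and `SketchParser` are hypothetical stand-ins for a pypdf page and the parser, and the parameter name `extraction_kwargs` and the option `layout_mode_space_vertically` are assumptions used only for illustration. The sketch shows one plausible shape: the constructor stores the mode and extra options, and each page's `extract_text()` receives them.

```python
from typing import Any, Dict, Iterable, List, Optional


class FakePage:
    """Hypothetical stand-in for a pypdf page; it only echoes the options it receives."""

    def __init__(self, text: str) -> None:
        self.text = text

    def extract_text(self, extraction_mode: str = "plain", **kwargs: Any) -> str:
        # A real page would lay out text differently per mode; here we just tag the output.
        options = ",".join(f"{k}={v}" for k, v in sorted(kwargs.items()))
        return f"[{extraction_mode}|{options}] {self.text}"


class SketchParser:
    """Hypothetical parser that stores extraction options and forwards them to each page."""

    def __init__(
        self,
        extraction_mode: str = "plain",
        extraction_kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.extraction_mode = extraction_mode
        # Assumed name: extra keyword arguments passed through unchanged.
        self.extraction_kwargs = extraction_kwargs or {}

    def parse(self, pages: Iterable[FakePage]) -> List[str]:
        # Forward the stored mode and options to every page's extract_text().
        return [
            page.extract_text(
                extraction_mode=self.extraction_mode, **self.extraction_kwargs
            )
            for page in pages
        ]


def demo() -> List[str]:
    parser = SketchParser(
        extraction_mode="layout",
        extraction_kwargs={"layout_mode_space_vertically": False},
    )
    return parser.parse([FakePage("page one"), FakePage("page two")])
```

Calling `demo()` runs both fake pages through the parser in layout mode. With a real loader, the same options would be supplied once at construction time rather than per page.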

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
2 months ago
Erick Friis 74b2c0aa01
templates, cli: more security deps (#19006) 6 months ago
Erick Friis 1317578ad1
templates: use langchain-text-splitters (#18360)
- deps
- import
- import
7 months ago
Erick Friis 64785822dc
templates: bump (#17074) 8 months ago
Erick Friis 08be477c24
templates: 0.1 bump (#15648) 9 months ago
Erick Friis 69a8a26683
templates: fix deps (#15439) 9 months ago
Erick Friis 78da34153e
TEMPLATES Metadata (#13691)
Co-authored-by: Lance Martin <lance@langchain.dev>
10 months ago
Erick Friis 9dfad613c2
IMPROVEMENT Allow openai v1 in all templates that require it (#13489)
- pyproject change
- lockfiles
10 months ago
Harrison Chase 60d025b83b
mongo parent document retrieval (#12887) 11 months ago