langchain/tests/integration_tests
Eugene Yurtsev 2ceb807da2
Add PDF parser implementations (#4356)
# Add PDF parser implementations

This PR separates data loading from parsing for several existing PDF
loaders.
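
As a rough illustration of that separation, here is a minimal sketch. The names `Blob`, `FormFeedParser`, and `Loader` are hypothetical stand-ins chosen for this example, not the classes added by the PR, and the toy parser splits on form feeds instead of reading a real PDF:

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Protocol


@dataclass
class Document:
    """Parsed text plus metadata (simplified stand-in for a document type)."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


@dataclass
class Blob:
    """Hypothetical container: raw bytes and a label for where they came from."""
    data: bytes
    source: str


class Parser(Protocol):
    """Anything that can turn a blob into documents, one at a time."""
    def lazy_parse(self, blob: Blob) -> Iterator[Document]: ...


class FormFeedParser:
    """Toy parser that treats form feeds as page breaks (assumption for the demo)."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        text = blob.data.decode("utf-8")  # page content is emitted as str, not bytes
        for page_number, page in enumerate(text.split("\f")):
            yield Document(
                page_content=page,
                metadata={"source": blob.source, "page": page_number},
            )


class Loader:
    """Loading step: holds the raw data and delegates parsing to any Parser."""

    def __init__(self, blob: Blob, parser: Parser) -> None:
        self.blob = blob
        self.parser = parser

    def load(self) -> List[Document]:
        return list(self.parser.lazy_parse(self.blob))


blob = Blob(data=b"first page\fsecond page", source="memory://example")
docs = Loader(blob, FormFeedParser()).load()
```

Because the loader only depends on the `lazy_parse` method, a different parser can be swapped in without touching the loading code.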

Parser tests are designed to encourage developers to build a
consistent interface for parsing PDFs.

This interface can be made more consistent in the future by adding
options to the initializer that specify the desired behavior, such as
whether to split by page.

This code is expected to be backwards compatible, with one exception:
a bug fix in the pymupdf parser, which was returning `bytes` rather than
strings as the page content.

This PR also changes the lazy parser method of the document loader to
return an Iterator rather than an Iterable over documents.
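
To make the Iterator versus Iterable distinction concrete, here is a small standalone sketch in plain Python. The function name `lazy_pages` is illustrative only and does not come from the PR:

```python
import collections.abc as abc
from typing import Iterator


def lazy_pages() -> Iterator[str]:
    """A generator function returns an iterator, so pages are produced on demand."""
    yield "page 1"
    yield "page 2"


pages = lazy_pages()
first = next(pages)      # an Iterator supports next() directly
remaining = list(pages)  # consumes whatever has not been read yet

as_list = ["page 1", "page 2"]
list_is_iterable = isinstance(as_list, abc.Iterable)  # a list can be looped over
list_is_iterator = isinstance(as_list, abc.Iterator)  # but next(as_list) would fail
```

Annotating the return type as an Iterator tells callers they can call `next()` on the result, which the weaker Iterable type does not promise.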

1 year ago
| Name | Last commit | Updated |
| --- | --- | --- |
| `agent` | [test] Add integration_test for PandasAgent (#4056) | 1 year ago |
| `cache` | Harrison/redis cache (#3766) | 1 year ago |
| `callbacks` | Callbacks Refactor [base] (#3256) | 1 year ago |
| `chains` | Callbacks Refactor [base] (#3256) | 1 year ago |
| `chat_models` | Check OpenAI model kwargs (#4366) | 1 year ago |
| `document_loaders` | Add PDF parser implementations (#4356) | 1 year ago |
| `embeddings` | Dev2049/hf emb encode kwargs (#3925) | 1 year ago |
| `examples` | JSON loader (#4067) | 1 year ago |
| `llms` | Check OpenAI model kwargs (#4366) | 1 year ago |
| `memory` | mongodb support for chat history (#4266) | 1 year ago |
| `prompts` | Cleanup integration test dir (#3308) | 1 year ago |
| `retrievers` | Update Cohere Reranker (#4180) | 1 year ago |
| `utilities` | added `Wikipedia` document loader (#4141) | 1 year ago |
| `vectorstores` | OpenSearch: Add Similarity Search with Score (#4089) | 1 year ago |
| `.env.example` | Change in method name for creating an issue on JIRA (#3307) | 1 year ago |
| `__init__.py` | initial commit | 2 years ago |
| `conftest.py` | feat: improve pinecone tests (#2806) | 1 year ago |
| `test_document_transformers.py` | Contextual compression retriever (#2915) | 1 year ago |
| `test_nlp_text_splitters.py` | OptimizedPrompt -- k-shot example choice backed by semantic search (#91) | 2 years ago |
| `test_pdf_pagesplitter.py` | cleanup: unify 3 different pdf loaders, rename PagedPDFSplitter (#1615) | 2 years ago |
| `test_schema.py` | Callbacks Refactor [base] (#3256) | 1 year ago |
| `test_text_splitter.py` | Fix TextSplitter.from_tiktoken(#4361) | 1 year ago |