langchain/langchain
Eugene Yurtsev 2ceb807da2
Add PDF parser implementations (#4356)
# Add PDF parser implementations

This PR separates data loading from parsing for a number of
existing PDF loaders.
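
As a rough illustration of that separation, here is a minimal sketch. The
names `Blob`, `Document`, `FormFeedParser` and `InMemoryLoader` are
hypothetical stand-ins chosen for this example and are not taken from the PR;
the "parser" just splits text on form-feed characters instead of reading a
real PDF.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Blob:
    """Raw bytes plus the place they came from (hypothetical type)."""

    data: bytes
    source: str


@dataclass
class Document:
    """Parsed text plus metadata (hypothetical type)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class FormFeedParser:
    """Toy parser: treats form-feed characters as page breaks."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        text = blob.data.decode("utf-8")
        for page_number, page in enumerate(text.split("\f")):
            yield Document(page, {"source": blob.source, "page": page_number})


class InMemoryLoader:
    """Toy loader: knows where the bytes come from, not how to parse them."""

    def __init__(self, blobs: List[Blob], parser: FormFeedParser) -> None:
        self.blobs = blobs
        self.parser = parser

    def lazy_load(self) -> Iterator[Document]:
        # Loading and parsing are separate steps, so the parser can be swapped.
        for blob in self.blobs:
            yield from self.parser.lazy_parse(blob)


loader = InMemoryLoader(
    [Blob(b"first page\fsecond page", "example.pdf")], FormFeedParser()
)
docs = list(loader.lazy_load())
```

With this shape, the same loader could be paired with a different parser
without touching the code that fetches the bytes.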

The parser tests are designed to encourage developers to keep a
consistent interface for parsing PDFs.

This interface can be made more consistent in the future by adding
initializer options that specify the desired behavior, such as whether
to split by page.

This code is expected to be backwards compatible, with the exception
of a bug fix in the pymupdf parser, which was returning `bytes` rather
than strings in the page content.
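
The kind of check that catches a `bytes`-versus-string mismatch can be
sketched as follows. This is an assumption-laden illustration, not the fix
made in the PR: `find_non_string_pages` and `to_text` are hypothetical
helpers, and UTF-8 is assumed as the encoding.

```python
from typing import Any, Iterable, List


def find_non_string_pages(pages: Iterable[Any]) -> List[int]:
    """Return the indexes of pages whose content is not a `str`."""
    return [i for i, content in enumerate(pages) if not isinstance(content, str)]


def to_text(content: Any, encoding: str = "utf-8") -> str:
    """Decode `bytes` page content to `str`; pass strings through unchanged."""
    if isinstance(content, bytes):
        return content.decode(encoding)
    return content


raw_pages = ["page one", b"page two"]  # second page mimics the bytes bug
bad_indexes = find_non_string_pages(raw_pages)
fixed_pages = [to_text(page) for page in raw_pages]
```

A shared parser test could run a check like this against every parser, so
that all of them hand back strings.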

This PR also changes the lazy parser method of the document loader to
return an Iterator rather than an Iterable over documents.
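
The difference between the two return types can be shown with the standard
library alone. In this small sketch `lazy_numbers` is a hypothetical stand-in
for a lazy method: a list is an Iterable but not an Iterator, while a
generator is an Iterator and can be advanced directly with `next()`.

```python
from collections.abc import Iterable, Iterator


def lazy_numbers() -> Iterator[int]:
    """A generator function; calling it returns an iterator."""
    yield 1
    yield 2


as_list = [1, 2]
as_iterator = lazy_numbers()

list_is_iterable = isinstance(as_list, Iterable)
list_is_iterator = isinstance(as_list, Iterator)
generator_is_iterator = isinstance(as_iterator, Iterator)

first = next(as_iterator)      # works because the generator is an Iterator
remaining = list(as_iterator)  # consumes what is left
```

Annotating the return type as an Iterator tells callers they can call
`next()` on the result, and that the result is consumed as it is read.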

1 year ago
| Name | Last commit | Updated |
|---|---|---|
| agents | Pass parsed inputs through to tool _run (#4309) | 1 year ago |
| callbacks | Update V2 Tracer (#4193) | 1 year ago |
| chains | fix json saving, update docs to reference anthropic chat model (#4364) | 1 year ago |
| chat_models | Check OpenAI model kwargs (#4366) | 1 year ago |
| client | Add LCP Client (#4198) | 1 year ago |
| docstore | Add `DocstoreFn` - lookup doc via arbitrary function (#3760) | 1 year ago |
| document_loaders | Add PDF parser implementations (#4356) | 1 year ago |
| embeddings | Fix typo in huggingface.py (#4277) | 1 year ago |
| evaluation | Replace remaining usage of basellm with baselangmodel (#3981) | 1 year ago |
| experimental | Add Example Notebook for LCP Client (#4207) | 1 year ago |
| graphs | Minor: Remove duplicated word in error message (#2706) | 1 year ago |
| indexes | Replace remaining usage of basellm with baselangmodel (#3981) | 1 year ago |
| llms | Update writer integration (#4363) | 1 year ago |
| memory | fix for cosmos not loading old messages (#4094) | 1 year ago |
| output_parsers | fix: invalid escape sequence error in regex pattern (#3902) | 1 year ago |
| prompts | Validate `input_variables` when using `jinja2` templates (#3140) | 1 year ago |
| retrievers | Update Cohere Reranker (#4180) | 1 year ago |
| tools | fix json saving, update docs to reference anthropic chat model (#4364) | 1 year ago |
| utilities | added `Wikipedia` document loader (#4141) | 1 year ago |
| vectorstores | fix: vectorstore pgvector ensure compatibility #3884 (#4248) | 1 year ago |
| __init__.py | Callbacks Refactor [base] (#3256) | 1 year ago |
| base_language.py | issue#4082 base_language had wrong code comment that it was using gpt… (#4084) | 1 year ago |
| cache.py | Harrison/one drive loader (#4081) | 1 year ago |
| docker-compose.yaml | Update docker-compose.yaml (#3582) | 1 year ago |
| document_transformers.py | Contextual compression retriever (#2915) | 1 year ago |
| example_generator.py | Replace remaining usage of basellm with baselangmodel (#3981) | 1 year ago |
| formatting.py | Validate `input_variables` when using `jinja2` templates (#3140) | 1 year ago |
| input.py | Add asyncio support for LLM (OpenAI), Chain (LLMChain, LLMMathChain), and Agent (#841) | 2 years ago |
| math_utils.py | fix #3884 (#3475) | 1 year ago |
| model_laboratory.py | Harrison/improve cache (#368) | 2 years ago |
| py.typed | Add py.typed marker to package (#121) | 2 years ago |
| python.py | Move PythonRepl -> langchain.utilities (#2917) | 1 year ago |
| requests.py | fixed aiohttp.client_exceptions.ClientConnectionError: Connection closed (#2718) | 1 year ago |
| schema.py | Add ChatModel, LLM, and Embeddings for Google's PaLM APIs (#3575) | 1 year ago |
| serpapi.py | move serpapi wrapper (#1199) | 2 years ago |
| server.py | Fix missing docker-compose (#2899) | 1 year ago |
| sql_database.py | Vwp/sqlalchemy (#4145) | 1 year ago |
| text_splitter.py | Fix TextSplitter.from_tiktoken(#4361) | 1 year ago |
| utils.py | Update V2 Tracer (#4193) | 1 year ago |