langchain/docs
Martin Triska 7a9149f5dd
community: ZeroxPDFLoader (#27800)
# OCR-based PDF loader

This implements a PDF document loader based on
[Zerox](https://github.com/getomni-ai/zerox).
Zerox takes a simple but very powerful approach to parsing PDF documents,
at the price of being slower and more costly: it converts the PDF to a
series of images and passes them to a vision model, requesting the
contents in markdown.

It is especially suitable for complex PDFs that are not parsed well by
other alternatives.
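
The page-by-page approach described above can be sketched as follows. This is a minimal illustration, not Zerox's actual code: `pages_to_markdown`, `ask_vision_model`, and `fake_vision_model` are hypothetical names, and the vision model is replaced by a stand-in so the snippet runs without an API key.

```python
from typing import Callable, List, Sequence


def pages_to_markdown(
    page_images: Sequence[bytes],
    ask_vision_model: Callable[[bytes, str], str],
    prompt: str = "Convert this page to markdown.",
) -> List[str]:
    """Return one markdown string per page image, in page order."""
    markdown_pages = []
    for image in page_images:
        # One model call per page: this is why the approach is slower
        # and more costly than plain text extraction.
        markdown_pages.append(ask_vision_model(image, prompt).strip())
    return markdown_pages


def fake_vision_model(image: bytes, prompt: str) -> str:
    # Hypothetical stand-in for a real vision-model call;
    # it only reports the size of the image it was given.
    return f"# Page with {len(image)} bytes\n"


pages = pages_to_markdown([b"abc", b"defgh"], fake_vision_model)
print(pages)
```

Passing the model call in as a function keeps the loop independent of any particular provider, which is one plausible way to support several models behind the same interface.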

## Example use:
```python
import os

from langchain_community.document_loaders.pdf import ZeroxPDFLoader

# Set your OpenAI API key before loading
os.environ["OPENAI_API_KEY"] = ""  # your-api-key

model = "gpt-4o-mini"  # OpenAI model
pdf_url = "https://assets.ctfassets.net/f1df9zr7wr1a/soP1fjvG1Wu66HJhu3FBS/034d6ca48edb119ae77dec5ce01a8612/OpenAI_Sacra_Teardown.pdf"

# Convert the PDF to page images and ask the model for markdown
loader = ZeroxPDFLoader(file_path=pdf_url, model=model)
docs = loader.load()
```
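
Once loaded, the documents can be combined into a single markdown string. The sketch below assumes only that each item exposes `page_content` and `metadata`, as LangChain documents do; `PageDoc` and `join_pages` are hypothetical stand-ins so the snippet runs without calling a model, and the `page` metadata key is illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PageDoc:
    # Hypothetical stand-in for a LangChain document object.
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def join_pages(docs: List[PageDoc], separator: str = "\n\n") -> str:
    """Concatenate the non-empty page contents into one markdown string."""
    cleaned = (doc.page_content.strip() for doc in docs)
    return separator.join(text for text in cleaned if text)


sample_docs = [
    PageDoc("# Title\n", {"page": 1}),
    PageDoc("   "),  # a blank page is skipped
    PageDoc("Body text", {"page": 2}),
]
combined = join_pages(sample_docs)
print(combined)
```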

The Zerox library supports a wide range of providers/models. See the Zerox
documentation for details.

- **Dependencies:** `zerox`
- **Twitter handle:** @martintriska1

---------

Co-authored-by: Erick Friis <erickfriis@gmail.com>
2024-11-07 03:14:57 +00:00
| Name | Last commit | Date |
| --- | --- | --- |
| api_reference | infra: remove some special cases (#27839) | 2024-11-01 21:13:43 +00:00 |
| cassettes | docs: run how-to guides in CI (#27615) | 2024-10-30 12:35:38 -04:00 |
| data | docs: 👥 Update LangChain people data (#27022) | 2024-10-08 17:09:07 +00:00 |
| docs | community: ZeroxPDFLoader (#27800) | 2024-11-07 03:14:57 +00:00 |
| scripts | infra: remove some special cases (#27839) | 2024-11-01 21:13:43 +00:00 |
| src | Add nvidia as provider for embedding, llm (#27810) | 2024-11-04 19:45:51 +00:00 |
| static | update llm graph transformer documentation (#27905) | 2024-11-05 11:54:26 -05:00 |
| .gitignore | infra: cleanup docs build (#21134) | 2024-05-01 17:34:05 -07:00 |
| .yarnrc.yml | docs[minor]: Add thumbs up/down to all docs pages (#18526) | 2024-03-04 15:14:28 -08:00 |
| babel.config.js | Restructure docs (#11620) | 2023-10-10 12:55:19 -07:00 |
| docusaurus.config.js | docs, core: error messaging [wip] (#27397) | 2024-10-17 03:39:36 +00:00 |
| ignore-step.sh | multiple: pydantic 2 compatibility, v0.3 (#26443) | 2024-09-13 14:38:45 -07:00 |
| Makefile | docs: platforms -> providers (#27285) | 2024-10-16 18:27:07 +00:00 |
| package.json | docs: add discussions with giscus (#27172) | 2024-10-11 15:14:45 -07:00 |
| README.md | docs: reorganize contributing docs (#27649) | 2024-10-25 22:41:54 +00:00 |
| sidebars.js | docs: sidebar capitalization (#27894) | 2024-11-04 22:09:32 +00:00 |
| vercel_requirements.txt | docs: add api referencs to langgraph (#26877) | 2024-09-26 15:21:10 -04:00 |
| vercel.json | docs: INVALID_CHAT_HISTORY redirect (#27845) | 2024-11-01 21:35:11 +00:00 |
| yarn.lock | docs: add discussions with giscus (#27172) | 2024-10-11 15:14:45 -07:00 |

LangChain Documentation

For more information on contributing to our documentation, see the Documentation Contributing Guide.