langchain/docs/ecosystem/unstructured.md
Matt Robinson c51dec5101
feat: add Unstructured API loaders (#3906)
### Summary

Adds `UnstructuredAPIFileLoader` and `UnstructuredAPIFileIOLoader`
that partition documents through the Unstructured API. Defaults to the
URL for the hosted Unstructured API, but can switch to a self-hosted or
locally running API using the `url` kwarg. Currently, the Unstructured
API is open and does not require an API key, but it will soon. A note was
added about that to the Unstructured ecosystem page.

### Testing


```python
from langchain.document_loaders import UnstructuredAPIFileIOLoader

filename = "fake-email.eml"

with open(filename, "rb") as f:
    loader = UnstructuredAPIFileIOLoader(file=f, file_filename=filename)
    docs = loader.load()

docs[0]
```

```python
from langchain.document_loaders import UnstructuredAPIFileLoader

filename = "fake-email.eml"
loader = UnstructuredAPIFileLoader(file_path=filename, mode="elements")
docs = loader.load()

docs[0]
```
2023-05-01 20:37:35 -07:00

Unstructured

This page covers how to use the unstructured ecosystem within LangChain. The unstructured package from Unstructured.IO extracts clean text from raw source documents like PDFs and Word documents.

This page is broken into two parts: installation and setup, and then references to specific unstructured wrappers.

Installation and Setup

If you are using a loader that runs locally, follow these steps to get unstructured and its dependencies running on your machine.

  • Install the Python SDK with pip install "unstructured[local-inference]"
  • Install the following system dependencies if they are not already available on your system. Depending on what document types you're parsing, you may not need all of these.
    • libmagic-dev (filetype detection)
    • poppler-utils (images and PDFs)
    • tesseract-ocr (images and PDFs)
    • libreoffice (MS Office docs)
    • pandoc (EPUBs)
  • If you are parsing PDFs using the "hi_res" strategy, run the following to install the detectron2 model, which unstructured uses for layout detection:
    • pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@e2ce8dc#egg=detectron2"
    • If detectron2 is not installed, unstructured will fall back to processing PDFs using the "fast" strategy, which uses pdfminer directly and doesn't require detectron2.
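
Before running a local loader, it can help to check which of the dependencies listed above are actually present. The following is a minimal sketch using only the Python standard library. The executable names in SYSTEM_TOOLS are assumptions about what each package typically installs, and the helper names are hypothetical; neither is part of unstructured or LangChain.

```python
import ctypes.util
import importlib.util
import shutil

# Assumed executable names for each system package; adjust for your platform.
SYSTEM_TOOLS = {
    "poppler-utils": "pdftotext",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def check_local_dependencies():
    """Return {package name: True/False} for the optional system dependencies."""
    found = {pkg: shutil.which(exe) is not None for pkg, exe in SYSTEM_TOOLS.items()}
    # libmagic is a shared library rather than a command, so look it up by name.
    found["libmagic-dev"] = ctypes.util.find_library("magic") is not None
    return found


def pdf_strategy():
    """Mirror the fallback described above: "hi_res" needs detectron2, else "fast"."""
    has_detectron2 = importlib.util.find_spec("detectron2") is not None
    return "hi_res" if has_detectron2 else "fast"


if __name__ == "__main__":
    for package, present in check_local_dependencies().items():
        print(f"{package}: {'found' if present else 'missing'}")
    print("PDF strategy:", pdf_strategy())
```

A missing entry is only a problem if you parse the matching document type, so treat the report as a hint rather than a hard requirement.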

If you want to get up and running with less setup, you can simply run pip install unstructured and use UnstructuredAPIFileLoader or UnstructuredAPIFileIOLoader. That will process your document using the hosted Unstructured API. Note that currently (as of 1 May 2023) the Unstructured API is open, but it will soon require an API key. The Unstructured documentation page will have instructions on how to generate an API key once they're available. If you'd like to self-host the Unstructured API or run it locally, check out the Unstructured API instructions and point the loader at your instance with the url kwarg.
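
Switching between the hosted API and a self-hosted one comes down to whether the url kwarg is passed. Below is a minimal sketch of that choice. LOCAL_API_URL is a hypothetical address for a locally running instance, and api_loader_kwargs and load_via_api are illustrative helpers rather than LangChain functions.

```python
# Hypothetical address of a locally running Unstructured API; adjust to your deployment.
LOCAL_API_URL = "http://localhost:8000/general/v0/general"


def api_loader_kwargs(file_path, url=None, mode="elements"):
    """Build keyword arguments for UnstructuredAPIFileLoader.

    Leaving url as None keeps the loader's default, the hosted Unstructured API.
    """
    kwargs = {"file_path": file_path, "mode": mode}
    if url is not None:
        kwargs["url"] = url
    return kwargs


def load_via_api(file_path, url=None):
    """Partition file_path through the Unstructured API and return the documents."""
    # Imported lazily so the helper above can be used without langchain installed.
    from langchain.document_loaders import UnstructuredAPIFileLoader

    loader = UnstructuredAPIFileLoader(**api_loader_kwargs(file_path, url=url))
    return loader.load()
```

Calling load_via_api("fake-email.eml") would target the hosted API, while load_via_api("fake-email.eml", url=LOCAL_API_URL) would target the local instance, assuming one is listening at that address.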

Wrappers

Data Loaders

The primary unstructured wrappers within LangChain are data loaders. The following shows how to use the most basic unstructured data loader. Other file-specific data loaders are available in the langchain.document_loaders module.

from langchain.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("state_of_the_union.txt")
loader.load()

If you instantiate the loader in "elements" mode, for example UnstructuredFileLoader("state_of_the_union.txt", mode="elements"), the loader will track additional metadata such as the page number and text type (e.g. title, narrative text) when that information is available.
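
One way to use that extra metadata is to group the loaded documents by text type. The sketch below assumes each document exposes page_content and a metadata dict, and that the text type is stored under a "category" key. That key name is an assumption, so inspect docs[0].metadata to confirm it for your version. The helper group_by_category is hypothetical.

```python
from collections import defaultdict


def group_by_category(docs):
    """Group document text by the "category" metadata entry (assumed key name)."""
    grouped = defaultdict(list)
    for doc in docs:
        # Documents without the key are collected under "unknown".
        category = doc.metadata.get("category", "unknown")
        grouped[category].append(doc.page_content)
    return dict(grouped)
```

Passing the result of loader.load() from an "elements"-mode loader to group_by_category would then give, for instance, all titles in one list and all narrative text in another.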