### Summary

Adds `UnstructuredAPIFileLoader` and `UnstructuredAPIFileIOLoader`, which partition documents through the Unstructured API. Both default to the URL of the hosted Unstructured API, but can be pointed at a self-hosted or locally running API using the `url` kwarg. Currently, the Unstructured API is open and does not require an API key, but it will soon. A note about that was added to the Unstructured ecosystem page.

### Testing

```python
from langchain.document_loaders import UnstructuredAPIFileIOLoader

filename = "fake-email.eml"

# Keep the file open while the loader reads it.
with open(filename, "rb") as f:
    loader = UnstructuredAPIFileIOLoader(file=f, file_filename=filename)
    docs = loader.load()

docs[0]
```

```python
from langchain.document_loaders import UnstructuredAPIFileLoader

filename = "fake-email.eml"

loader = UnstructuredAPIFileLoader(file_path=filename, mode="elements")
docs = loader.load()
docs[0]
```
# Unstructured
This page covers how to use the `unstructured` ecosystem within LangChain. The `unstructured` package from Unstructured.IO extracts clean text from raw source documents like PDFs and Word documents.

This page is broken into two parts: installation and setup, and then references to specific `unstructured` wrappers.
## Installation and Setup
If you are using a loader that runs locally, use the following steps to get `unstructured` and its dependencies running locally.
- Install the Python SDK with `pip install "unstructured[local-inference]"`
- Install the following system dependencies if they are not already available on your system. Depending on what document types you're parsing, you may not need all of these.
  - `libmagic-dev` (filetype detection)
  - `poppler-utils` (images and PDFs)
  - `tesseract-ocr` (images and PDFs)
  - `libreoffice` (MS Office docs)
  - `pandoc` (EPUBs)
- If you are parsing PDFs using the `"hi_res"` strategy, run the following to install the `detectron2` model, which `unstructured` uses for layout detection: `pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@e2ce8dc#egg=detectron2"`
  - If `detectron2` is not installed, `unstructured` will fall back to processing PDFs using the `"fast"` strategy, which uses `pdfminer` directly and doesn't require `detectron2`.
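
To see which of the steps above still apply on a given machine, a small check like the following may help. It is only a sketch: the executable names mapped to each system package are assumptions based on typical Debian/Ubuntu installs, `libmagic-dev` is omitted because it is a library rather than an executable, and the strategy function only reports what the fallback rule described above implies.

```python
import importlib.util
import shutil

# Assumed executable names shipped by each system package on Debian/Ubuntu;
# other platforms may use different names.
SYSTEM_TOOLS = {
    "poppler-utils": "pdftoppm",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def missing_system_tools():
    """Return the package names whose executable is not found on PATH."""
    return [pkg for pkg, exe in SYSTEM_TOOLS.items() if shutil.which(exe) is None]


def expected_pdf_strategy():
    """Return "hi_res" if detectron2 is importable, otherwise "fast"."""
    has_detectron2 = importlib.util.find_spec("detectron2") is not None
    return "hi_res" if has_detectron2 else "fast"


print("missing system tools:", missing_system_tools())
print("expected PDF strategy:", expected_pdf_strategy())
```

An empty list of missing tools does not guarantee that every document type will parse; it only confirms that the executables are discoverable.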
If you want to get up and running with less setup, you can simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or `UnstructuredAPIFileIOLoader`. That will process your document using the hosted Unstructured API. Note that currently (as of 1 May 2023) the Unstructured API is open, but it will soon require an API key. The Unstructured documentation page will have instructions on how to generate an API key once they're available. Check out the instructions here if you'd like to self-host the Unstructured API or run it locally.
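
As a rough sketch of switching between the hosted API and a self-hosted one, the helper below only passes `url` to the loader when an override is configured, so the loader's default applies otherwise. The environment variable name `UNSTRUCTURED_API_URL` and both helper functions are hypothetical and exist only for this illustration; the `url` kwarg is the one described in the summary.

```python
import os

# Hypothetical environment variable name, used only for this sketch.
URL_ENV_VAR = "UNSTRUCTURED_API_URL"


def api_loader_kwargs(file_path, mode="single"):
    """Build keyword arguments for UnstructuredAPIFileLoader.

    The "url" key is set only when the environment variable is present, so
    the loader falls back to its default (the hosted API) otherwise.
    """
    kwargs = {"file_path": file_path, "mode": mode}
    url = os.environ.get(URL_ENV_VAR)
    if url:
        kwargs["url"] = url
    return kwargs


def load_via_api(file_path, mode="single"):
    """Load a file through the Unstructured API and return the documents."""
    # Imported lazily so api_loader_kwargs works without langchain installed.
    from langchain.document_loaders import UnstructuredAPIFileLoader

    loader = UnstructuredAPIFileLoader(**api_loader_kwargs(file_path, mode))
    return loader.load()
```

With the variable unset, `load_via_api("fake-email.eml")` would go to the hosted API; setting it to the address of a locally running instance would redirect the same call there.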
## Wrappers

### Data Loaders
The primary `unstructured` wrappers within `langchain` are data loaders. The following shows how to use the most basic unstructured data loader. There are other file-specific data loaders available in the `langchain.document_loaders` module.
```python
from langchain.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("state_of_the_union.txt")
loader.load()
```
If you instantiate the loader with `UnstructuredFileLoader(mode="elements")`, the loader will track additional metadata like the page number and text type (e.g. title, narrative text) when that information is available.
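
To get a feel for what `mode="elements"` returns, one option is to tally the loaded documents by text type. This is a minimal sketch: the metadata key `"category"` is an assumption about where the text type is stored and may differ between versions, and both helper functions are hypothetical.

```python
from collections import Counter


def count_by_category(docs):
    """Count documents per text type.

    Assumes the text type is stored under the metadata key "category";
    documents without that key are counted as "unknown".
    """
    return Counter(doc.metadata.get("category", "unknown") for doc in docs)


def load_elements(path):
    """Load a file in "elements" mode, yielding one document per element."""
    # Imported lazily so count_by_category works without langchain installed.
    from langchain.document_loaders import UnstructuredFileLoader

    return UnstructuredFileLoader(path, mode="elements").load()
```

For example, `count_by_category(load_elements("state_of_the_union.txt"))` would show how many elements of each type were detected in the file.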