### Summary
Corrects the install instruction for local inference to `pip install "unstructured[local-inference]"`.
Unstructured
This page covers how to use the `unstructured` ecosystem within LangChain. The `unstructured` package from Unstructured.IO extracts clean text from raw source documents like PDFs and Word documents.
This page is broken into two parts: installation and setup, and then references to specific `unstructured` wrappers.
Installation and Setup
- Install the Python SDK with `pip install "unstructured[local-inference]"`
- Install the following system dependencies if they are not already available on your system. Depending on what document types you're parsing, you may not need all of these.
  - libmagic-dev
  - poppler-utils
  - tesseract-ocr
  - libreoffice
- Run the following to install NLTK dependencies. `unstructured` will handle this automatically soon.
  - `python -c "import nltk; nltk.download('punkt')"`
  - `python -c "import nltk; nltk.download('averaged_perceptron_tagger')"`
- If you are parsing PDFs, run the following to install the `detectron2` model, which `unstructured` uses for layout detection:
  - `pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"`
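After working through the steps above, it can help to check which pieces are actually present before parsing anything. The following is a minimal sketch, not part of `unstructured` or LangChain: the helper `missing_dependencies` is hypothetical, and the executable names (`pdftoppm`, `tesseract`, `soffice`) are assumptions about what each system package installs. `libmagic-dev` is a library rather than a command, so it is not checked here.

```python
import importlib.util
import shutil

# Assumed executable installed by each system package (may differ per distro).
SYSTEM_TOOLS = {
    "poppler-utils": "pdftoppm",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
}

# Python packages mentioned in the installation steps.
PYTHON_MODULES = ["unstructured", "nltk", "detectron2"]


def missing_dependencies(which=shutil.which, find_spec=importlib.util.find_spec):
    """Return the names of dependencies that could not be located."""
    missing = []
    for package, executable in SYSTEM_TOOLS.items():
        # shutil.which returns None when the executable is not on PATH.
        if which(executable) is None:
            missing.append(package)
    for module in PYTHON_MODULES:
        # find_spec returns None when a top-level module is not importable.
        if find_spec(module) is None:
            missing.append(module)
    return missing


if __name__ == "__main__":
    print(missing_dependencies() or "all dependencies found")
```

Since not every document type needs every dependency, a non-empty result is only a hint about what to install, not necessarily an error.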
Wrappers
Data Loaders
The primary `unstructured` wrappers within `langchain` are data loaders. The following shows how to use the most basic unstructured data loader. There are other file-specific data loaders available in the `langchain.document_loaders` module.
from langchain.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("state_of_the_union.txt")
docs = loader.load()  # returns a list of Document objects
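For the file-specific loaders in `langchain.document_loaders`, one possible pattern is to pick a loader from the file extension and fall back to `UnstructuredFileLoader`. This is a sketch under assumptions: the helpers `loader_name_for` and `make_loader` are hypothetical, and the class names in the mapping are assumed to be exported by the installed `langchain` version, which may not hold for every release.

```python
from pathlib import Path

# Assumed loader class names in langchain.document_loaders;
# availability depends on the installed langchain version.
LOADER_BY_SUFFIX = {
    ".pdf": "UnstructuredPDFLoader",
    ".docx": "UnstructuredWordDocumentLoader",
    ".pptx": "UnstructuredPowerPointLoader",
    ".html": "UnstructuredHTMLLoader",
}


def loader_name_for(path):
    """Return the loader class name for a path, defaulting to the generic loader."""
    return LOADER_BY_SUFFIX.get(Path(path).suffix.lower(), "UnstructuredFileLoader")


def make_loader(path, **kwargs):
    """Instantiate the loader chosen by loader_name_for."""
    # Imported lazily so loader_name_for works without langchain installed.
    import langchain.document_loaders as document_loaders

    loader_cls = getattr(document_loaders, loader_name_for(path))
    return loader_cls(path, **kwargs)
```

With this in place, `make_loader("state_of_the_union.txt").load()` would behave like the basic example, since `.txt` is not in the mapping.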
If you instantiate the loader with `UnstructuredFileLoader("state_of_the_union.txt", mode="elements")`, the loader will track additional metadata like the page number and text type (e.g. title, narrative text) when that information is available.
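To make the extra metadata concrete, here is a small sketch that loads a file in elements mode and groups the extracted text by element type. The helpers `load_elements` and `group_by_category` are hypothetical, and the `category` metadata key is an assumption about how the loader labels the text type; inspect `doc.metadata` on your own output to confirm which keys are present.

```python
from collections import defaultdict


def load_elements(path):
    """Load a file with one Document per extracted element."""
    # Imported lazily so group_by_category can be used on its own.
    from langchain.document_loaders import UnstructuredFileLoader

    loader = UnstructuredFileLoader(path, mode="elements")
    return loader.load()


def group_by_category(docs):
    """Group page_content strings by the (assumed) 'category' metadata key."""
    grouped = defaultdict(list)
    for doc in docs:
        # Fall back to "unknown" when the element carries no category.
        category = doc.metadata.get("category", "unknown")
        grouped[category].append(doc.page_content)
    return dict(grouped)
```

A call such as `group_by_category(load_elements("state_of_the_union.txt"))` would then return a dictionary mapping each element type to the list of text snippets found for it.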