langchain/docs/ecosystem/unstructured.md
Matt Robinson 3d5f56a8a1
docs: add quotes to unstructured[local-inference] install instructions (#1208)
### Summary

Corrects the install instruction for local inference to `pip install
"unstructured[local-inference]"`
2023-02-21 08:06:43 -08:00

Unstructured

This page covers how to use the unstructured ecosystem within LangChain. The unstructured package from Unstructured.IO extracts clean text from raw source documents like PDFs and Word documents.

This page has two parts: installation and setup, followed by references to specific unstructured wrappers.

Installation and Setup

  • Install the Python SDK with pip install "unstructured[local-inference]"
  • Install the following system dependencies if they are not already available on your system. Depending on what document types you're parsing, you may not need all of these.
    • libmagic-dev
    • poppler-utils
    • tesseract-ocr
    • libreoffice
  • Run the following to download the NLTK data that unstructured depends on. unstructured is expected to handle this automatically in a future release.
    • python -c "import nltk; nltk.download('punkt')"
    • python -c "import nltk; nltk.download('averaged_perceptron_tagger')"
  • If you are parsing PDFs, run the following to install the detectron2 model, which unstructured uses for layout detection:
    • pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"
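
As a quick sanity check after the steps above, the following sketch reports which of the listed system packages appear to be missing. It is only a heuristic: the executable names in the mapping (pdftotext, tesseract, soffice) are assumptions about what each package usually installs and may differ on your platform, and libmagic-dev is left out because it provides a library rather than a command. The helper name missing_packages is hypothetical and not part of unstructured or LangChain.

```python
import shutil

# Assumed mapping from package name to an executable it typically provides.
# These executable names are assumptions and can differ between platforms.
# libmagic-dev is omitted because it ships a library, not a command.
ASSUMED_EXECUTABLES = {
    "poppler-utils": "pdftotext",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
}


def missing_packages(executables=ASSUMED_EXECUTABLES, which=shutil.which):
    """Return the package names whose executable is not found on PATH."""
    return [pkg for pkg, exe in executables.items() if which(exe) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing:", ", ".join(missing) if missing else "none")
```

An empty result only means the executables were found on PATH; it does not confirm that unstructured can use them.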

Wrappers

Data Loaders

The primary unstructured wrappers within LangChain are data loaders. The example below uses the most basic one, UnstructuredFileLoader. Other file-specific data loaders are available in the langchain.document_loaders module.

from langchain.document_loaders import UnstructuredFileLoader

# Load the file and parse it into a list of Document objects
loader = UnstructuredFileLoader("state_of_the_union.txt")
docs = loader.load()

If you instantiate the loader with mode="elements", as in UnstructuredFileLoader("state_of_the_union.txt", mode="elements"), the loader tracks additional metadata, such as the page number and text type (e.g. title, narrative text), when that information is available.
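
To illustrate what the elements mode makes possible, here is a minimal sketch that counts how many elements of each text type a file contains. It assumes the text type is stored under a "category" key in each document's metadata; that key name is an assumption and may vary between versions, so the code falls back to "unknown" when it is absent. The helper names count_by_category and load_elements are hypothetical.

```python
from collections import Counter


def count_by_category(metadatas):
    """Count elements per category from an iterable of metadata dicts."""
    # "category" is the assumed metadata key; fall back when it is absent.
    return Counter(meta.get("category", "unknown") for meta in metadatas)


def load_elements(path):
    """Load a file in "elements" mode (requires langchain and unstructured)."""
    # Imported lazily so the helper above works without langchain installed.
    from langchain.document_loaders import UnstructuredFileLoader

    return UnstructuredFileLoader(path, mode="elements").load()
```

With both packages installed, count_by_category(doc.metadata for doc in load_elements("state_of_the_union.txt")) would give a per-type count for the example file.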