{ "cells": [ { "cell_type": "markdown", "id": "20deed05", "metadata": {}, "source": [ "# Unstructured File\n", "\n", "This notebook covers how to use `Unstructured` package to load files of many types. `Unstructured` currently supports loading of text files, powerpoints, html, pdfs, images, and more." ] }, { "cell_type": "code", "execution_count": 1, "id": "2886982e", "metadata": {}, "outputs": [], "source": [ "# # Install package\n", "!pip install \"unstructured[all-docs]\"\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "54d62efd", "metadata": {}, "outputs": [], "source": [ "# # Install other dependencies\n", "# # https://github.com/Unstructured-IO/unstructured/blob/main/docs/source/installing.rst\n", "# !brew install libmagic\n", "# !brew install poppler\n", "# !brew install tesseract\n", "# # If parsing xml / html documents:\n", "# !brew install libxml2\n", "# !brew install libxslt" ] }, { "cell_type": "code", "execution_count": 3, "id": "af6a64f5", "metadata": {}, "outputs": [], "source": [ "# import nltk\n", "# nltk.download('punkt')" ] }, { "cell_type": "code", "execution_count": 2, "id": "79d3e549", "metadata": {}, "outputs": [], "source": [ "from langchain.document_loaders import UnstructuredFileLoader" ] }, { "cell_type": "code", "execution_count": 5, "id": "2593d1dc", "metadata": {}, "outputs": [], "source": [ "loader = UnstructuredFileLoader(\"./example_data/state_of_the_union.txt\")" ] }, { "cell_type": "code", "execution_count": 6, "id": "fe34e941", "metadata": {}, "outputs": [], "source": [ "docs = loader.load()" ] }, { "cell_type": "code", "execution_count": 7, "id": "ee449788", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.\\n\\nLast year COVID-19 kept us apart. This year we are finally together again.\\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.\\n\\nWith a duty to one another to the American people to the Constit'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs[0].page_content[:400]" ] }, { "cell_type": "markdown", "id": "7874d01d", "metadata": {}, "source": [ "## Retain Elements\n", "\n", "Under the hood, Unstructured creates different \"elements\" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode=\"elements\"`." ] }, { "cell_type": "code", "execution_count": 8, "id": "ff5b616d", "metadata": {}, "outputs": [], "source": [ "loader = UnstructuredFileLoader(\n", " \"./example_data/state_of_the_union.txt\", mode=\"elements\"\n", ")" ] }, { "cell_type": "code", "execution_count": 9, "id": "feca3b6c", "metadata": {}, "outputs": [], "source": [ "docs = loader.load()" ] }, { "cell_type": "code", "execution_count": 12, "id": "fec5bbac", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n", " Document(page_content='Last year COVID-19 kept us apart. This year we are finally together again.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n", " Document(page_content='Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n", " Document(page_content='With a duty to one another to the American people to the Constitution.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n", " Document(page_content='And with an unwavering resolve that freedom will always triumph over tyranny.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs[:5]" ] }, { "cell_type": "markdown", "id": "672733fd", "metadata": {}, "source": [ "## Define a Partitioning Strategy\n", "\n", "Unstructured document loader allow users to pass in a `strategy` parameter that lets `unstructured` know how to partition the document. Currently supported strategies are `\"hi_res\"` (the default) and `\"fast\"`. Hi res partitioning strategies are more accurate, but take longer to process. Fast strategies partition the document more quickly, but trade-off accuracy. Not all document types have separate hi res and fast partitioning strategies. For those document types, the `strategy` kwarg is ignored. In some cases, the high res strategy will fallback to fast if there is a dependency missing (i.e. a model for document partitioning). You can see how to apply a strategy to an `UnstructuredFileLoader` below." ] }, { "cell_type": "code", "execution_count": 1, "id": "767238a4", "metadata": {}, "outputs": [], "source": [ "from langchain.document_loaders import UnstructuredFileLoader" ] }, { "cell_type": "code", "execution_count": 2, "id": "9518b425", "metadata": {}, "outputs": [], "source": [ "loader = UnstructuredFileLoader(\n", " \"layout-parser-paper-fast.pdf\", strategy=\"fast\", mode=\"elements\"\n", ")" ] }, { "cell_type": "code", "execution_count": 3, "id": "645f29e9", "metadata": {}, "outputs": [], "source": [ "docs = loader.load()" ] }, { "cell_type": "code", "execution_count": 4, "id": "60685353", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Document(page_content='1', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),\n", " Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),\n", " Document(page_content='0', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),\n", " Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),\n", " Document(page_content='n', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'Title'}, lookup_index=0)]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs[:5]" ] }, { "cell_type": "markdown", "id": "8de9ef16", "metadata": {}, "source": [ "## PDF Example\n", "\n", "Processing PDF documents works exactly the same way. Unstructured detects the file type and extracts the same types of elements. Modes of operation are \n", "- `single` all the text from all elements are combined into one (default)\n", "- `elements` maintain individual elements\n", "- `paged` texts from each page are only combined" ] }, { "cell_type": "code", "execution_count": 1, "id": "8ca8a648", "metadata": {}, "outputs": [], "source": [ "!wget https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper.pdf -P \"../../\"" ] }, { "cell_type": "code", "execution_count": 7, "id": "686e5eb4", "metadata": {}, "outputs": [], "source": [ "loader = UnstructuredFileLoader(\n", " \"./example_data/layout-parser-paper.pdf\", mode=\"elements\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "c90f0e94", "metadata": {}, "outputs": [], "source": [ "docs = loader.load()" ] }, { "cell_type": "code", "execution_count": 1, "id": "6ec859d8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Document(page_content='LayoutParser : A Unified Toolkit for Deep Learning Based Document Image Analysis', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),\n", " Document(page_content='Zejiang Shen 1 ( (ea)\\n ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and Weining Li 5', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),\n", " Document(page_content='Allen Institute for AI shannons@allenai.org', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),\n", " Document(page_content='Brown University ruochen zhang@brown.edu', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),\n", " Document(page_content='Harvard University { melissadell,jacob carlson } @fas.harvard.edu', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0)]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs[:5]" ] }, { "cell_type": "markdown", "id": "1cf27fc8", "metadata": {}, "source": [ "If you need to post process the `unstructured` elements after extraction, you can pass in a list of `str` -> `str` functions to the `post_processors` kwarg when you instantiate the `UnstructuredFileLoader`. This applies to other Unstructured loaders as well. Below is an example." ] }, { "cell_type": "code", "execution_count": 2, "id": "112e5538", "metadata": {}, "outputs": [], "source": [ "from langchain.document_loaders import UnstructuredFileLoader\n", "from unstructured.cleaners.core import clean_extra_whitespace" ] }, { "cell_type": "code", "execution_count": 3, "id": "b9c5ac8d", "metadata": {}, "outputs": [], "source": [ "loader = UnstructuredFileLoader(\n", " \"./example_data/layout-parser-paper.pdf\",\n", " mode=\"elements\",\n", " post_processors=[clean_extra_whitespace],\n", ")" ] }, { "cell_type": "code", "execution_count": 4, "id": "c44d5def", "metadata": {}, "outputs": [], "source": [ "docs = loader.load()" ] }, { "cell_type": "code", "execution_count": 5, "id": "b6f27929", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Document(page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'Title'}),\n", " Document(page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'UncategorizedText'}),\n", " Document(page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'UncategorizedText'}),\n", " Document(page_content='1 2 0 2', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'UncategorizedText'}),\n", " Document(page_content='n u J', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 258.36), (16.34, 286.14), (36.34, 286.14), (36.34, 258.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'Title'})]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs[:5]" ] }, { "cell_type": "markdown", "id": "b066cb5a", "metadata": {}, "source": [ "## Unstructured API\n", "\n", "If you want to get up and running with less set up, you can simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or `UnstructuredAPIFileIOLoader`. That will process your document using the hosted Unstructured API. You can generate a free Unstructured API key [here](https://www.unstructured.io/api-key/). The [Unstructured documentation](https://unstructured-io.github.io/) page will have instructions on how to generate an API key once they’re available. Check out the instructions [here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image) if you’d like to self-host the Unstructured API or run it locally." ] }, { "cell_type": "code", "execution_count": 1, "id": "b50c70bc", "metadata": {}, "outputs": [], "source": [ "from langchain.document_loaders import UnstructuredAPIFileLoader" ] }, { "cell_type": "code", "execution_count": 2, "id": "12b6d2cf", "metadata": {}, "outputs": [], "source": [ "filenames = [\"example_data/fake.docx\", \"example_data/fake-email.eml\"]" ] }, { "cell_type": "code", "execution_count": 3, "id": "39a9894d", "metadata": {}, "outputs": [], "source": [ "loader = UnstructuredAPIFileLoader(\n", " file_path=filenames[0],\n", " api_key=\"FAKE_API_KEY\",\n", ")" ] }, { "cell_type": "code", "execution_count": 4, "id": "386eb63c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Document(page_content='Lorem ipsum dolor sit amet.', metadata={'source': 'example_data/fake.docx'})" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs = loader.load()\n", "docs[0]" ] }, { "cell_type": "markdown", "id": "94158999", "metadata": {}, "source": [ "You can also batch multiple files through the Unstructured API in a single API using `UnstructuredAPIFileLoader`." ] }, { "cell_type": "code", "execution_count": 5, "id": "79a18e7e", "metadata": {}, "outputs": [], "source": [ "loader = UnstructuredAPIFileLoader(\n", " file_path=filenames,\n", " api_key=\"FAKE_API_KEY\",\n", ")" ] }, { "cell_type": "code", "execution_count": 6, "id": "a3d7c846", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Document(page_content='Lorem ipsum dolor sit amet.\\n\\nThis is a test email to use for unit tests.\\n\\nImportant points:\\n\\nRoses are red\\n\\nViolets are blue', metadata={'source': ['example_data/fake.docx', 'example_data/fake-email.eml']})" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs = loader.load()\n", "docs[0]" ] }, { "cell_type": "code", "execution_count": null, "id": "0e510495", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }