langchain/docs/modules/indexes/document_loaders/examples/unstructured_file.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "20deed05",
   "metadata": {},
   "source": [
    "# Unstructured File\n",
    "\n",
    "This notebook covers how to use the `Unstructured` package to load files of many types. `Unstructured` currently supports loading text files, PowerPoint presentations, HTML, PDFs, images, and more."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "2886982e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# # Install package\n",
    "!pip install \"unstructured[local-inference]\"\n",
    "!pip install \"layoutparser[layoutmodels,tesseract]\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "54d62efd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# # Install other dependencies\n",
    "# # https://github.com/Unstructured-IO/unstructured/blob/main/docs/source/installing.rst\n",
    "# !brew install libmagic\n",
    "# !brew install poppler\n",
    "# !brew install tesseract\n",
    "# # If parsing xml / html documents:\n",
    "# !brew install libxml2\n",
    "# !brew install libxslt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "af6a64f5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import nltk\n",
    "# nltk.download('punkt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "79d3e549",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import UnstructuredFileLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "2593d1dc",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = UnstructuredFileLoader(\"./example_data/state_of_the_union.txt\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "fe34e941",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "ee449788",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.\\n\\nLast year COVID-19 kept us apart. This year we are finally together again.\\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.\\n\\nWith a duty to one another to the American people to the Constit'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs[0].page_content[:400]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7874d01d",
   "metadata": {},
   "source": [
    "## Retain Elements\n",
    "\n",
    "Under the hood, Unstructured creates different \"elements\" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode=\"elements\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "ff5b616d",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = UnstructuredFileLoader(\"./example_data/state_of_the_union.txt\", mode=\"elements\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "feca3b6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "fec5bbac",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n",
       " Document(page_content='Last year COVID-19 kept us apart. This year we are finally together again.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n",
       " Document(page_content='Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n",
       " Document(page_content='With a duty to one another to the American people to the Constitution.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n",
       " Document(page_content='And with an unwavering resolve that freedom will always triumph over tyranny.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0)]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs[:5]"
   ]
  },
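  {
   "cell_type": "markdown",
   "id": "3b7e91aa",
   "metadata": {},
   "source": [
    "As a rough illustration of how the two modes relate, the sketch below joins the element-level documents back into one string. The blank-line separator is an assumption inferred from the single-mode output shown earlier, not a documented guarantee."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5c2d40be",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: rebuild a single text from the element-level documents.\n",
    "# Assumption: elements are separated by a blank line, as the single-mode output suggests.\n",
    "combined_text = \"\\n\\n\".join(doc.page_content for doc in docs)\n",
    "combined_text[:400]"
   ]
  },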
  {
   "cell_type": "markdown",
   "id": "672733fd",
   "metadata": {},
   "source": [
    "## Define a Partitioning Strategy\n",
    "\n",
    "Unstructured document loaders accept a `strategy` parameter that tells `unstructured` how to partition the document. The currently supported strategies are `\"hi_res\"` (the default) and `\"fast\"`. The `\"hi_res\"` strategy is more accurate but takes longer to process; the `\"fast\"` strategy partitions the document more quickly at the cost of accuracy. Not all document types have separate `\"hi_res\"` and `\"fast\"` partitioning strategies; for those document types, the `strategy` kwarg is ignored. In some cases, the `\"hi_res\"` strategy will fall back to `\"fast\"` if a dependency is missing (e.g., a model for document partitioning). You can see how to apply a strategy to an `UnstructuredFileLoader` below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "767238a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import UnstructuredFileLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "9518b425",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = UnstructuredFileLoader(\"layout-parser-paper-fast.pdf\", strategy=\"fast\", mode=\"elements\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "645f29e9",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "60685353",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='1', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),\n",
       " Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),\n",
       " Document(page_content='0', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),\n",
       " Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),\n",
       " Document(page_content='n', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'Title'}, lookup_index=0)]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs[:5]"
   ]
  },
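  {
   "cell_type": "markdown",
   "id": "7d09e3f4",
   "metadata": {},
   "source": [
    "In `elements` mode, each `Document` carries metadata about the element it came from. Here is a minimal sketch that keeps only the elements labelled `Title`, assuming the `category` key is present as in the output above. The key may be absent for other file types or versions, hence the use of `.get`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e41b86c9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: keep only elements whose metadata marks them as titles.\n",
    "# Assumption: a 'category' key exists, as in the output above; .get() tolerates its absence.\n",
    "titles = [doc for doc in docs if doc.metadata.get(\"category\") == \"Title\"]\n",
    "[doc.page_content for doc in titles[:5]]"
   ]
  },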
  {
   "cell_type": "markdown",
   "id": "8de9ef16",
   "metadata": {},
   "source": [
    "## PDF Example\n",
    "\n",
    "Processing PDF documents works exactly the same way. Unstructured detects the file type and extracts the same types of `elements`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "8ca8a648",
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper.pdf -P \"./example_data/\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "686e5eb4",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = UnstructuredFileLoader(\"./example_data/layout-parser-paper.pdf\", mode=\"elements\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c90f0e94",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6ec859d8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='LayoutParser : A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),\n",
       " Document(page_content='Zejiang Shen 1 ( (ea)\\n ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and Weining Li 5', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),\n",
       " Document(page_content='Allen Institute for AI shannons@allenai.org', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),\n",
       " Document(page_content='Brown University ruochen zhang@brown.edu', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),\n",
       " Document(page_content='Harvard University { melissadell,jacob carlson } @fas.harvard.edu', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0)]"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b066cb5a",
   "metadata": {},
   "source": [
    "## Unstructured API\n",
    "\n",
    "If you want to get up and running with less setup, you can simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or `UnstructuredAPIFileIOLoader`. These loaders process your document using the hosted Unstructured API. Note that currently (as of 11 May 2023) the Unstructured API is open, but it will soon require an API key. The [Unstructured documentation](https://unstructured-io.github.io/) page will have instructions on how to generate an API key once keys are available. Check out the instructions [here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image) if you’d like to self-host the Unstructured API or run it locally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "b50c70bc",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import UnstructuredAPIFileLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "12b6d2cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "filenames = [\"example_data/fake.docx\", \"example_data/fake-email.eml\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "39a9894d",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = UnstructuredAPIFileLoader(\n",
    "    file_path=filenames[0],\n",
    "    api_key=\"FAKE_API_KEY\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "386eb63c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Document(page_content='Lorem ipsum dolor sit amet.', metadata={'source': 'example_data/fake.docx'})"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs = loader.load()\n",
    "docs[0]"
   ]
  },
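  {
   "cell_type": "markdown",
   "id": "9a62f0d1",
   "metadata": {},
   "source": [
    "If you self-host the API, you can point the loader at your own server instead of the hosted endpoint. This is a minimal sketch, assuming the loader accepts a `url` argument and that a local server listens at the address shown. Both the port and the path are assumptions, so adjust them for your deployment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c8f15b27",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: send the file to a self-hosted Unstructured API instead of the hosted one.\n",
    "# Assumption: the server address below is illustrative; replace it with your own.\n",
    "loader = UnstructuredAPIFileLoader(\n",
    "    file_path=filenames[0],\n",
    "    url=\"http://localhost:8000/general/v0/general\",\n",
    ")\n",
    "docs = loader.load()"
   ]
  },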
  {
   "cell_type": "markdown",
   "id": "94158999",
   "metadata": {},
   "source": [
    "You can also batch multiple files through the Unstructured API in a single API call using `UnstructuredAPIFileLoader`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "79a18e7e",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = UnstructuredAPIFileLoader(\n",
    "    file_path=filenames,\n",
    "    api_key=\"FAKE_API_KEY\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "a3d7c846",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Document(page_content='Lorem ipsum dolor sit amet.\\n\\nThis is a test email to use for unit tests.\\n\\nImportant points:\\n\\nRoses are red\\n\\nViolets are blue', metadata={'source': ['example_data/fake.docx', 'example_data/fake-email.eml']})"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs = loader.load()\n",
    "docs[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e510495",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
-												Harrison/unstructured support (#903)


											
										
										
											2023-02-06 07:02:07 +00:00
+								{
 								 "cells": [
 								  {
 								   "cell_type": "markdown",
 								   "id": "20deed05",
 								   "metadata": {},
 								   "source": [
-												docs: `document_loaders` improvements (#4200)

- made notebooks consistent: titles, service/format descriptions.
- corrected short names to full names, for example, `Word` -> `Microsoft
Word`
- added missed descriptions
- renamed notebook files to make ToC correctly sorted
											
										
										
											2023-05-06 00:44:54 +00:00
+								    "# Unstructured File\n",
 								    "\n",
 								    "This notebook covers how to use `Unstructured` package to load files of many types. `Unstructured` currently supports loading of text files, powerpoints, html, pdfs, images, and more."
-												Harrison/unstructured support (#903)


											
										
										
											2023-02-06 07:02:07 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 1,
-												Harrison/unstructured structured (#1004)


											
										
										
											2023-02-12 15:36:11 +00:00
+								   "id": "2886982e",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "# # Install package\n",
-												docs: add quotes to `unstructured[local-inference]` install instructions (#1208)

### Summary

Corrects the install instruction for local inference to `pip install
"unstructured[local-inference]"`
											
										
										
											2023-02-21 16:06:43 +00:00
+								    "!pip install \"unstructured[local-inference]\"\n",
-												Unstructured example notebook: add a pdf, related deps (#1011)

Updates the Unstructured example notebook with a PDF example. Includes
additional dependencies for PDF processing (and images, etc).
											
										
										
											2023-02-12 22:56:48 +00:00
+								    "!pip install layoutparser[layoutmodels,tesseract]"
-												Harrison/unstructured structured (#1004)


											
										
										
											2023-02-12 15:36:11 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 2,
 								   "id": "54d62efd",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "# # Install other dependencies\n",
 								    "# # https://github.com/Unstructured-IO/unstructured/blob/main/docs/source/installing.rst\n",
-												Unstructured example notebook: add a pdf, related deps (#1011)

Updates the Unstructured example notebook with a PDF example. Includes
additional dependencies for PDF processing (and images, etc).
											
										
										
											2023-02-12 22:56:48 +00:00
+								    "# !brew install libmagic\n",
 								    "# !brew install poppler\n",
 								    "# !brew install tesseract\n",
 								    "# # If parsing xml / html documents:\n",
 								    "# !brew install libxml2\n",
 								    "# !brew install libxslt"
-												Harrison/unstructured structured (#1004)


											
										
										
											2023-02-12 15:36:11 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 3,
 								   "id": "af6a64f5",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "# import nltk\n",
 								    "# nltk.download('punkt')"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												big docs refactor (#1978)

Co-authored-by: Ankush Gola <ankush.gola@gmail.com>
											
										
										
											2023-03-27 02:49:46 +00:00
+								   "execution_count": 2,
-												Harrison/unstructured support (#903)


											
										
										
											2023-02-06 07:02:07 +00:00
+								   "id": "79d3e549",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "from langchain.document_loaders import UnstructuredFileLoader"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												Harrison/unstructured structured (#1004)


											
										
										
											2023-02-12 15:36:11 +00:00
+								   "execution_count": 5,
-												Harrison/unstructured support (#903)


											
										
										
											2023-02-06 07:02:07 +00:00
+								   "id": "2593d1dc",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
-												big docs refactor (#1978)

Co-authored-by: Ankush Gola <ankush.gola@gmail.com>
											
										
										
											2023-03-27 02:49:46 +00:00
+								    "loader = UnstructuredFileLoader(\"./example_data/state_of_the_union.txt\")"
-												Harrison/unstructured support (#903)


											
										
										
											2023-02-06 07:02:07 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												Harrison/unstructured structured (#1004)


											
										
										
											2023-02-12 15:36:11 +00:00
+								   "execution_count": 6,
-												Harrison/unstructured support (#903)


											
										
										
											2023-02-06 07:02:07 +00:00
+								   "id": "fe34e941",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "docs = loader.load()"
 								   ]
 								  },
-												Harrison/unstructured structured (#1004)


											
										
										
											2023-02-12 15:36:11 +00:00
+								  {
 								   "cell_type": "code",
 								   "execution_count": 7,
 								   "id": "ee449788",
 								   "metadata": {},
 								   "outputs": [
 								    {
 								     "data": {
 								      "text/plain": [
 								       "'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.\\n\\nLast year COVID-19 kept us apart. This year we are finally together again.\\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.\\n\\nWith a duty to one another to the American people to the Constit'"
 								      ]
 								     },
 								     "execution_count": 7,
 								     "metadata": {},
 								     "output_type": "execute_result"
 								    }
 								   ],
 								   "source": [
 								    "docs[0].page_content[:400]"
 								   ]
 								  },
 								  {
 								   "cell_type": "markdown",
 								   "id": "7874d01d",
 								   "metadata": {},
 								   "source": [
 								    "## Retain Elements\n",
 								    "\n",
 								    "Under the hood, Unstructured creates different \"elements\" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode=\"elements\"`."
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 8,
 								   "id": "ff5b616d",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
-												big docs refactor (#1978)

Co-authored-by: Ankush Gola <ankush.gola@gmail.com>
											
										
										
											2023-03-27 02:49:46 +00:00
+								    "loader = UnstructuredFileLoader(\"./example_data/state_of_the_union.txt\", mode=\"elements\")"
-												Harrison/unstructured structured (#1004)


											
										
										
											2023-02-12 15:36:11 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 9,
 								   "id": "feca3b6c",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "docs = loader.load()"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 12,
 								   "id": "fec5bbac",
 								   "metadata": {},
 								   "outputs": [
 								    {
 								     "data": {
 								      "text/plain": [
 								       "[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n",
 								       " Document(page_content='Last year COVID-19 kept us apart. This year we are finally together again.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n",
 								       " Document(page_content='Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n",
 								       " Document(page_content='With a duty to one another to the American people to the Constitution.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n",
 								       " Document(page_content='And with an unwavering resolve that freedom will always triumph over tyranny.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0)]"
 								      ]
 								     },
 								     "execution_count": 12,
 								     "metadata": {},
 								     "output_type": "execute_result"
 								    }
 								   ],
 								   "source": [
 								    "docs[:5]"
 								   ]
 								  },
-												Unstructured example notebook: add a pdf, related deps (#1011)

Updates the Unstructured example notebook with a PDF example. Includes
additional dependencies for PDF processing (and images, etc).
											
										
										
											2023-02-12 22:56:48 +00:00
+								  {
 								   "cell_type": "markdown",
-												feat: allow the unstructured kwargs to be passed in to Unstructured document loaders (#1667)

### Summary

Allows users to pass in `**unstructured_kwargs` to Unstructured document
loaders. Implemented with the `strategy` kwargs in mind, but will pass
in other kwargs like `include_page_breaks` as well. The two currently
supported strategies are `"hi_res"`, which is more accurate but takes
longer, and `"fast"`, which processes faster but with lower accuracy.
The `"hi_res"` strategy is the default. For PDFs, if `detectron2` is not
available and the user selects `"hi_res"`, the loader will fallback to
using the `"fast"` strategy.


### Testing

#### Make sure the `strategy` kwarg works

Run the following in iPython to verify that the `"fast"` strategy is
indeed faster.

```python
from langchain.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", strategy="fast", mode="elements")
%timeit loader.load()

loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", mode="elements")
%timeit loader.load()
```

On my system I get:

```python
In [3]: from langchain.document_loaders import UnstructuredFileLoader

In [4]: loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", strategy="fast", mode="elements")

In [5]: %timeit loader.load()
247 ms ± 369 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", mode="elements")

In [7]: %timeit loader.load()
2.45 s ± 31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

#### Make sure older versions of `unstructured` still work

Run `pip install unstructured==0.5.3` and then verify the following runs
without error:

```python
from langchain.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf",  mode="elements")
loader.load()
```
											
										
										
											2023-03-15 01:15:28 +00:00
+								   "id": "672733fd",
 								   "metadata": {},
 								   "source": [
 								    "## Define a Partitioning Strategy\n",
 								    "\n",
-												Update unstructured_file.ipynb (#3377)

Fix typo in docs
											
										
										
											2023-04-24 04:22:38 +00:00
+								    "Unstructured document loader allow users to pass in a `strategy` parameter that lets `unstructured` know how to partition the document. Currently supported strategies are `\"hi_res\"` (the default) and `\"fast\"`. Hi res partitioning strategies are more accurate, but take longer to process. Fast strategies partition the document more quickly, but trade-off accuracy. Not all document types have separate hi res and fast partitioning strategies. For those document types, the `strategy` kwarg is ignored. In some cases, the high res strategy will fallback to fast if there is a dependency missing (i.e. a model for document partitioning). You can see how to apply a strategy to an `UnstructuredFileLoader` below."
-												feat: allow the unstructured kwargs to be passed in to Unstructured document loaders (#1667)

### Summary

Allows users to pass in `**unstructured_kwargs` to Unstructured document
loaders. Implemented with the `strategy` kwargs in mind, but will pass
in other kwargs like `include_page_breaks` as well. The two currently
supported strategies are `"hi_res"`, which is more accurate but takes
longer, and `"fast"`, which processes faster but with lower accuracy.
The `"hi_res"` strategy is the default. For PDFs, if `detectron2` is not
available and the user selects `"hi_res"`, the loader will fallback to
using the `"fast"` strategy.


### Testing

#### Make sure the `strategy` kwarg works

Run the following in iPython to verify that the `"fast"` strategy is
indeed faster.

```python
from langchain.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", strategy="fast", mode="elements")
%timeit loader.load()

loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", mode="elements")
%timeit loader.load()
```

On my system I get:

```python
In [3]: from langchain.document_loaders import UnstructuredFileLoader

In [4]: loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", strategy="fast", mode="elements")

In [5]: %timeit loader.load()
247 ms ± 369 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", mode="elements")

In [7]: %timeit loader.load()
2.45 s ± 31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

#### Make sure older versions of `unstructured` still work

Run `pip install unstructured==0.5.3` and then verify the following runs
without error:

```python
from langchain.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf",  mode="elements")
loader.load()
```
											
										
										
											2023-03-15 01:15:28 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 1,
 								   "id": "767238a4",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "from langchain.document_loaders import UnstructuredFileLoader"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 2,
 								   "id": "9518b425",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "loader = UnstructuredFileLoader(\"layout-parser-paper-fast.pdf\", strategy=\"fast\", mode=\"elements\")"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 3,
 								   "id": "645f29e9",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "docs = loader.load()"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 4,
 								   "id": "60685353",
 								   "metadata": {},
 								   "outputs": [
 								    {
 								     "data": {
 								      "text/plain": [
 								       "[Document(page_content='1', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),\n",
 								       " Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),\n",
 								       " Document(page_content='0', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),\n",
 								       " Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),\n",
 								       " Document(page_content='n', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'Title'}, lookup_index=0)]"
 								      ]
 								     },
 								     "execution_count": 4,
 								     "metadata": {},
 								     "output_type": "execute_result"
 								    }
 								   ],
 								   "source": [
 								    "docs[:5]"
 								   ]
 								  },
 								  {
 								   "cell_type": "markdown",
 								   "id": "8de9ef16",
-												Unstructured example notebook: add a pdf, related deps (#1011)

Updates the Unstructured example notebook with a PDF example. Includes
additional dependencies for PDF processing (and images, etc).
											
										
										
											2023-02-12 22:56:48 +00:00
+								   "metadata": {},
 								   "source": [
 								    "## PDF Example\n",
 								    "\n",
 								    "Processing PDF documents works exactly the same way. Unstructured detects the file type and extracts the same types of `elements`. "
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 1,
 								   "id": "8ca8a648",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "!wget  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper.pdf -P \"../../\""
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 7,
 								   "id": "686e5eb4",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "loader = UnstructuredFileLoader(\"./example_data/layout-parser-paper.pdf\", mode=\"elements\")"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": null,
 								   "id": "c90f0e94",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "docs = loader.load()"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 1,
 								   "id": "6ec859d8",
 								   "metadata": {},
 								   "outputs": [
 								    {
 								     "data": {
 								      "text/plain": [
 								       "[Document(page_content='LayoutParser : A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),\n",
 								       " Document(page_content='Zejiang Shen 1 ( (ea)\\n ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and Weining Li 5', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),\n",
 								       " Document(page_content='Allen Institute for AI shannons@allenai.org', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),\n",
 								       " Document(page_content='Brown University ruochen zhang@brown.edu', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),\n",
 								       " Document(page_content='Harvard University { melissadell,jacob carlson } @fas.harvard.edu', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0)]"
 								      ]
 								     },
 								     "execution_count": 1,
 								     "metadata": {},
 								     "output_type": "execute_result"
 								    }
 								   ],
 								   "source": [
 								    "docs[:5]"
 								   ]
 								  },
 								  {
 								   "cell_type": "markdown",
 								   "id": "b066cb5a",
 								   "metadata": {},
 								   "source": [
 								    "## Unstructured API\n",
 								    "\n",
 								    "If you want to get up and running with less set up, you can simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or `UnstructuredAPIFileIOLoader`. That will process your document using the hosted Unstructured API. Note that currently (as of 11 May 2023) the Unstructured API is open, but it will soon require an API. The [Unstructured documentation](https://unstructured-io.github.io/) page will have instructions on how to generate an API key once they’re available. Check out the instructions [here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image) if you’d like to self-host the Unstructured API or run it locally."
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 1,
 								   "id": "b50c70bc",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "from langchain.document_loaders import UnstructuredAPIFileLoader"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 2,
 								   "id": "12b6d2cf",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "filenames = [\"example_data/fake.docx\", \"example_data/fake-email.eml\"]"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 3,
 								   "id": "39a9894d",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "loader = UnstructuredAPIFileLoader(\n",
 								    "    file_path=filenames[0],\n",
 								    "    api_key=\"FAKE_API_KEY\",\n",
 								    ")"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 4,
 								   "id": "386eb63c",
 								   "metadata": {},
 								   "outputs": [
 								    {
 								     "data": {
 								      "text/plain": [
 								       "Document(page_content='Lorem ipsum dolor sit amet.', metadata={'source': 'example_data/fake.docx'})"
 								      ]
 								     },
 								     "execution_count": 4,
 								     "metadata": {},
 								     "output_type": "execute_result"
 								    }
 								   ],
 								   "source": [
 								    "docs = loader.load()\n",
 								    "docs[0]"
 								   ]
 								  },
 								  {
 								   "cell_type": "markdown",
 								   "id": "94158999",
 								   "metadata": {},
 								   "source": [
 								    "You can also batch multiple files through the Unstructured API in a single API using `UnstructuredAPIFileLoader`."
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 5,
 								   "id": "79a18e7e",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "loader = UnstructuredAPIFileLoader(\n",
 								    "    file_path=filenames,\n",
 								    "    api_key=\"FAKE_API_KEY\",\n",
 								    ")"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 6,
 								   "id": "a3d7c846",
 								   "metadata": {},
 								   "outputs": [
 								    {
 								     "data": {
 								      "text/plain": [
 								       "Document(page_content='Lorem ipsum dolor sit amet.\\n\\nThis is a test email to use for unit tests.\\n\\nImportant points:\\n\\nRoses are red\\n\\nViolets are blue', metadata={'source': ['example_data/fake.docx', 'example_data/fake-email.eml']})"
 								      ]
 								     },
 								     "execution_count": 6,
 								     "metadata": {},
 								     "output_type": "execute_result"
 								    }
 								   ],
 								   "source": [
 								    "docs = loader.load()\n",
 								    "docs[0]"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": null,
 								   "id": "0e510495",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": []
 								  }
 								 ],
 								 "metadata": {
 								  "kernelspec": {
 								   "display_name": "Python 3 (ipykernel)",
 								   "language": "python",
 								   "name": "python3"
 								  },
 								  "language_info": {
 								   "codemirror_mode": {
 								    "name": "ipython",
 								    "version": 3
 								   },
 								   "file_extension": ".py",
 								   "mimetype": "text/x-python",
 								   "name": "python",
 								   "nbconvert_exporter": "python",
 								   "pygments_lexer": "ipython3",
 								   "version": "3.8.13"
 								  }
 								 },
 								 "nbformat": 4,
 								 "nbformat_minor": 5
 								}