langchain/docs/extras/integrations/document_loaders/unstructured_file.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "20deed05",
   "metadata": {},
   "source": [
    "# Unstructured File\n",
    "\n",
    "This notebook covers how to use `Unstructured` package to load files of many types. `Unstructured` currently supports loading of text files, powerpoints, html, pdfs, images, and more."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "2886982e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# # Install package\n",
    "!pip install \"unstructured[all-docs]\"\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "54d62efd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# # Install other dependencies\n",
    "# # https://github.com/Unstructured-IO/unstructured/blob/main/docs/source/installing.rst\n",
    "# !brew install libmagic\n",
    "# !brew install poppler\n",
    "# !brew install tesseract\n",
    "# # If parsing xml / html documents:\n",
    "# !brew install libxml2\n",
    "# !brew install libxslt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "af6a64f5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import nltk\n",
    "# nltk.download('punkt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "79d3e549",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import UnstructuredFileLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "2593d1dc",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = UnstructuredFileLoader(\"./example_data/state_of_the_union.txt\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "fe34e941",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "ee449788",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.\\n\\nLast year COVID-19 kept us apart. This year we are finally together again.\\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.\\n\\nWith a duty to one another to the American people to the Constit'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs[0].page_content[:400]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7874d01d",
   "metadata": {},
   "source": [
    "## Retain Elements\n",
    "\n",
    "Under the hood, Unstructured creates different \"elements\" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode=\"elements\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "ff5b616d",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = UnstructuredFileLoader(\n",
    "    \"./example_data/state_of_the_union.txt\", mode=\"elements\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "feca3b6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "fec5bbac",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n",
       " Document(page_content='Last year COVID-19 kept us apart. This year we are finally together again.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n",
       " Document(page_content='Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n",
       " Document(page_content='With a duty to one another to the American people to the Constitution.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n",
       " Document(page_content='And with an unwavering resolve that freedom will always triumph over tyranny.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0)]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "672733fd",
   "metadata": {},
   "source": [
    "## Define a Partitioning Strategy\n",
    "\n",
    "Unstructured document loader allow users to pass in a `strategy` parameter that lets `unstructured` know how to partition the document. Currently supported strategies are `\"hi_res\"` (the default) and `\"fast\"`. Hi res partitioning strategies are more accurate, but take longer to process. Fast strategies partition the document more quickly, but trade-off accuracy. Not all document types have separate hi res and fast partitioning strategies. For those document types, the `strategy` kwarg is ignored. In some cases, the high res strategy will fallback to fast if there is a dependency missing (i.e. a model for document partitioning). You can see how to apply a strategy to an `UnstructuredFileLoader` below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "767238a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import UnstructuredFileLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "9518b425",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = UnstructuredFileLoader(\n",
    "    \"layout-parser-paper-fast.pdf\", strategy=\"fast\", mode=\"elements\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "645f29e9",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "60685353",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='1', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),\n",
       " Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),\n",
       " Document(page_content='0', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),\n",
       " Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),\n",
       " Document(page_content='n', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'Title'}, lookup_index=0)]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8de9ef16",
   "metadata": {},
   "source": [
    "## PDF Example\n",
    "\n",
    "Processing PDF documents works exactly the same way. Unstructured detects the file type and extracts the same types of elements. Modes of operation are \n",
    "- `single` all the text from all elements are combined into one (default)\n",
    "- `elements` maintain individual elements\n",
    "- `paged` texts from each page are only combined"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "8ca8a648",
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper.pdf -P \"../../\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "686e5eb4",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = UnstructuredFileLoader(\n",
    "    \"./example_data/layout-parser-paper.pdf\", mode=\"elements\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c90f0e94",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6ec859d8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='LayoutParser : A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),\n",
       " Document(page_content='Zejiang Shen 1 ( (ea)\\n ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and Weining Li 5', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),\n",
       " Document(page_content='Allen Institute for AI shannons@allenai.org', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),\n",
       " Document(page_content='Brown University ruochen zhang@brown.edu', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),\n",
       " Document(page_content='Harvard University { melissadell,jacob carlson } @fas.harvard.edu', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0)]"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1cf27fc8",
   "metadata": {},
   "source": [
    "If you need to post process the `unstructured` elements after extraction, you can pass in a list of `str` -> `str` functions to the `post_processors` kwarg when you instantiate the `UnstructuredFileLoader`. This applies to other Unstructured loaders as well. Below is an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "112e5538",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import UnstructuredFileLoader\n",
    "from unstructured.cleaners.core import clean_extra_whitespace"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "b9c5ac8d",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = UnstructuredFileLoader(\n",
    "    \"./example_data/layout-parser-paper.pdf\",\n",
    "    mode=\"elements\",\n",
    "    post_processors=[clean_extra_whitespace],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "c44d5def",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "b6f27929",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='LayoutParser: A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'Title'}),\n",
       " Document(page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'UncategorizedText'}),\n",
       " Document(page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'UncategorizedText'}),\n",
       " Document(page_content='1 2 0 2', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'UncategorizedText'}),\n",
       " Document(page_content='n u J', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 258.36), (16.34, 286.14), (36.34, 286.14), (36.34, 258.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'Title'})]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b066cb5a",
   "metadata": {},
   "source": [
    "## Unstructured API\n",
    "\n",
    "If you want to get up and running with less set up, you can simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or `UnstructuredAPIFileIOLoader`. That will process your document using the hosted Unstructured API. You can generate a free Unstructured API key [here](https://www.unstructured.io/api-key/). The [Unstructured documentation](https://unstructured-io.github.io/) page will have instructions on how to generate an API key once they’re available. Check out the instructions [here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image) if you’d like to self-host the Unstructured API or run it locally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "b50c70bc",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import UnstructuredAPIFileLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "12b6d2cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "filenames = [\"example_data/fake.docx\", \"example_data/fake-email.eml\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "39a9894d",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = UnstructuredAPIFileLoader(\n",
    "    file_path=filenames[0],\n",
    "    api_key=\"FAKE_API_KEY\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "386eb63c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Document(page_content='Lorem ipsum dolor sit amet.', metadata={'source': 'example_data/fake.docx'})"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs = loader.load()\n",
    "docs[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "94158999",
   "metadata": {},
   "source": [
    "You can also batch multiple files through the Unstructured API in a single API using `UnstructuredAPIFileLoader`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "79a18e7e",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = UnstructuredAPIFileLoader(\n",
    "    file_path=filenames,\n",
    "    api_key=\"FAKE_API_KEY\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "a3d7c846",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Document(page_content='Lorem ipsum dolor sit amet.\\n\\nThis is a test email to use for unit tests.\\n\\nImportant points:\\n\\nRoses are red\\n\\nViolets are blue', metadata={'source': ['example_data/fake.docx', 'example_data/fake-email.eml']})"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs = loader.load()\n",
    "docs[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e510495",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
-												Harrison/unstructured support (#903)


											
										
										
											2023-02-06 07:02:07 +00:00
+								{
 								 "cells": [
 								  {
 								   "cell_type": "markdown",
 								   "id": "20deed05",
 								   "metadata": {},
 								   "source": [
-												docs: `document_loaders` improvements (#4200)

- made notebooks consistent: titles, service/format descriptions.
- corrected short names to full names, for example, `Word` -> `Microsoft
Word`
- added missed descriptions
- renamed notebook files to make ToC correctly sorted
											
										
										
											2023-05-06 00:44:54 +00:00
+								    "# Unstructured File\n",
 								    "\n",
 								    "This notebook covers how to use `Unstructured` package to load files of many types. `Unstructured` currently supports loading of text files, powerpoints, html, pdfs, images, and more."
-												Harrison/unstructured support (#903)


											
										
										
											2023-02-06 07:02:07 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 1,
-												Harrison/unstructured structured (#1004)


											
										
										
											2023-02-12 15:36:11 +00:00
+								   "id": "2886982e",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "# # Install package\n",
-												docs: update `unstructured` install instructions (#8596)

### Summary

Updates the `unstructured` install instructions. For
`unstructured>=0.9.0`, dependencies are broken out by document type and
the base `unstructured` package includes fewer dependencies. `pip
install "unstructured[local-inference]"` has been replace by `pip
install "unstructured[all-docs]"`, though the `local-inference` extra is
still supported for the time being.

### Reviewers

- @rlancemartin
- @eyurtsev
- @hwchase17
											
										
										
											2023-08-01 21:17:49 +00:00
+								    "!pip install \"unstructured[all-docs]\"\n"
-												Harrison/unstructured structured (#1004)


											
										
										
											2023-02-12 15:36:11 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 2,
 								   "id": "54d62efd",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "# # Install other dependencies\n",
 								    "# # https://github.com/Unstructured-IO/unstructured/blob/main/docs/source/installing.rst\n",
-												Unstructured example notebook: add a pdf, related deps (#1011)

Updates the Unstructured example notebook with a PDF example. Includes
additional dependencies for PDF processing (and images, etc).
											
										
										
											2023-02-12 22:56:48 +00:00
+								    "# !brew install libmagic\n",
 								    "# !brew install poppler\n",
 								    "# !brew install tesseract\n",
 								    "# # If parsing xml / html documents:\n",
 								    "# !brew install libxml2\n",
 								    "# !brew install libxslt"
-												Harrison/unstructured structured (#1004)


											
										
										
											2023-02-12 15:36:11 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 3,
 								   "id": "af6a64f5",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "# import nltk\n",
 								    "# nltk.download('punkt')"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												big docs refactor (#1978)

Co-authored-by: Ankush Gola <ankush.gola@gmail.com>
											
										
										
											2023-03-27 02:49:46 +00:00
+								   "execution_count": 2,
-												Harrison/unstructured support (#903)


											
										
										
											2023-02-06 07:02:07 +00:00
+								   "id": "79d3e549",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "from langchain.document_loaders import UnstructuredFileLoader"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												Harrison/unstructured structured (#1004)


											
										
										
											2023-02-12 15:36:11 +00:00
+								   "execution_count": 5,
-												Harrison/unstructured support (#903)


											
										
										
											2023-02-06 07:02:07 +00:00
+								   "id": "2593d1dc",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
-												big docs refactor (#1978)

Co-authored-by: Ankush Gola <ankush.gola@gmail.com>
											
										
										
											2023-03-27 02:49:46 +00:00
+								    "loader = UnstructuredFileLoader(\"./example_data/state_of_the_union.txt\")"
-												Harrison/unstructured support (#903)


											
										
										
											2023-02-06 07:02:07 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												Harrison/unstructured structured (#1004)


											
										
										
											2023-02-12 15:36:11 +00:00
+								   "execution_count": 6,
-												Harrison/unstructured support (#903)


											
										
										
											2023-02-06 07:02:07 +00:00
+								   "id": "fe34e941",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "docs = loader.load()"
 								   ]
 								  },
-												Harrison/unstructured structured (#1004)


											
										
										
											2023-02-12 15:36:11 +00:00
+								  {
 								   "cell_type": "code",
 								   "execution_count": 7,
 								   "id": "ee449788",
 								   "metadata": {},
 								   "outputs": [
 								    {
 								     "data": {
 								      "text/plain": [
 								       "'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.\\n\\nLast year COVID-19 kept us apart. This year we are finally together again.\\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.\\n\\nWith a duty to one another to the American people to the Constit'"
 								      ]
 								     },
 								     "execution_count": 7,
 								     "metadata": {},
 								     "output_type": "execute_result"
 								    }
 								   ],
 								   "source": [
 								    "docs[0].page_content[:400]"
 								   ]
 								  },
 								  {
 								   "cell_type": "markdown",
 								   "id": "7874d01d",
 								   "metadata": {},
 								   "source": [
 								    "## Retain Elements\n",
 								    "\n",
 								    "Under the hood, Unstructured creates different \"elements\" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode=\"elements\"`."
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 8,
 								   "id": "ff5b616d",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
-												Doc refactor (#6300)

Co-authored-by: jacoblee93 <jacoblee93@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
											
										
										
											2023-06-16 18:52:56 +00:00
+								    "loader = UnstructuredFileLoader(\n",
 								    "    \"./example_data/state_of_the_union.txt\", mode=\"elements\"\n",
 								    ")"
-												Harrison/unstructured structured (#1004)


											
										
										
											2023-02-12 15:36:11 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 9,
 								   "id": "feca3b6c",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "docs = loader.load()"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 12,
 								   "id": "fec5bbac",
 								   "metadata": {},
 								   "outputs": [
 								    {
 								     "data": {
 								      "text/plain": [
 								       "[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n",
 								       " Document(page_content='Last year COVID-19 kept us apart. This year we are finally together again.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n",
 								       " Document(page_content='Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n",
 								       " Document(page_content='With a duty to one another to the American people to the Constitution.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),\n",
 								       " Document(page_content='And with an unwavering resolve that freedom will always triumph over tyranny.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0)]"
 								      ]
 								     },
 								     "execution_count": 12,
 								     "metadata": {},
 								     "output_type": "execute_result"
 								    }
 								   ],
 								   "source": [
 								    "docs[:5]"
 								   ]
 								  },
-												Unstructured example notebook: add a pdf, related deps (#1011)

Updates the Unstructured example notebook with a PDF example. Includes
additional dependencies for PDF processing (and images, etc).
											
										
										
											2023-02-12 22:56:48 +00:00
+								  {
 								   "cell_type": "markdown",
-												feat: allow the unstructured kwargs to be passed in to Unstructured document loaders (#1667)

### Summary

Allows users to pass in `**unstructured_kwargs` to Unstructured document
loaders. Implemented with the `strategy` kwargs in mind, but will pass
in other kwargs like `include_page_breaks` as well. The two currently
supported strategies are `"hi_res"`, which is more accurate but takes
longer, and `"fast"`, which processes faster but with lower accuracy.
The `"hi_res"` strategy is the default. For PDFs, if `detectron2` is not
available and the user selects `"hi_res"`, the loader will fallback to
using the `"fast"` strategy.


### Testing

#### Make sure the `strategy` kwarg works

Run the following in iPython to verify that the `"fast"` strategy is
indeed faster.

```python
from langchain.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", strategy="fast", mode="elements")
%timeit loader.load()

loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", mode="elements")
%timeit loader.load()
```

On my system I get:

```python
In [3]: from langchain.document_loaders import UnstructuredFileLoader

In [4]: loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", strategy="fast", mode="elements")

In [5]: %timeit loader.load()
247 ms ± 369 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", mode="elements")

In [7]: %timeit loader.load()
2.45 s ± 31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

#### Make sure older versions of `unstructured` still work

Run `pip install unstructured==0.5.3` and then verify the following runs
without error:

```python
from langchain.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf",  mode="elements")
loader.load()
```
											
										
										
											2023-03-15 01:15:28 +00:00
+								   "id": "672733fd",
 								   "metadata": {},
 								   "source": [
 								    "## Define a Partitioning Strategy\n",
 								    "\n",
-												Update unstructured_file.ipynb (#3377)

Fix typo in docs
											
										
										
											2023-04-24 04:22:38 +00:00
+								    "Unstructured document loader allow users to pass in a `strategy` parameter that lets `unstructured` know how to partition the document. Currently supported strategies are `\"hi_res\"` (the default) and `\"fast\"`. Hi res partitioning strategies are more accurate, but take longer to process. Fast strategies partition the document more quickly, but trade-off accuracy. Not all document types have separate hi res and fast partitioning strategies. For those document types, the `strategy` kwarg is ignored. In some cases, the high res strategy will fallback to fast if there is a dependency missing (i.e. a model for document partitioning). You can see how to apply a strategy to an `UnstructuredFileLoader` below."
-												feat: allow the unstructured kwargs to be passed in to Unstructured document loaders (#1667)

### Summary

Allows users to pass in `**unstructured_kwargs` to Unstructured document
loaders. Implemented with the `strategy` kwargs in mind, but will pass
in other kwargs like `include_page_breaks` as well. The two currently
supported strategies are `"hi_res"`, which is more accurate but takes
longer, and `"fast"`, which processes faster but with lower accuracy.
The `"hi_res"` strategy is the default. For PDFs, if `detectron2` is not
available and the user selects `"hi_res"`, the loader will fallback to
using the `"fast"` strategy.


### Testing

#### Make sure the `strategy` kwarg works

Run the following in iPython to verify that the `"fast"` strategy is
indeed faster.

```python
from langchain.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", strategy="fast", mode="elements")
%timeit loader.load()

loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", mode="elements")
%timeit loader.load()
```

On my system I get:

```python
In [3]: from langchain.document_loaders import UnstructuredFileLoader

In [4]: loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", strategy="fast", mode="elements")

In [5]: %timeit loader.load()
247 ms ± 369 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", mode="elements")

In [7]: %timeit loader.load()
2.45 s ± 31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

#### Make sure older versions of `unstructured` still work

Run `pip install unstructured==0.5.3` and then verify the following runs
without error:

```python
from langchain.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf",  mode="elements")
loader.load()
```
											
										
										
											2023-03-15 01:15:28 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 1,
 								   "id": "767238a4",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "from langchain.document_loaders import UnstructuredFileLoader"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 2,
 								   "id": "9518b425",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
-												Doc refactor (#6300)

Co-authored-by: jacoblee93 <jacoblee93@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
											
										
										
											2023-06-16 18:52:56 +00:00
+								    "loader = UnstructuredFileLoader(\n",
 								    "    \"layout-parser-paper-fast.pdf\", strategy=\"fast\", mode=\"elements\"\n",
 								    ")"
-												feat: allow the unstructured kwargs to be passed in to Unstructured document loaders (#1667)

### Summary

Allows users to pass in `**unstructured_kwargs` to Unstructured document
loaders. Implemented with the `strategy` kwargs in mind, but will pass
in other kwargs like `include_page_breaks` as well. The two currently
supported strategies are `"hi_res"`, which is more accurate but takes
longer, and `"fast"`, which processes faster but with lower accuracy.
The `"hi_res"` strategy is the default. For PDFs, if `detectron2` is not
available and the user selects `"hi_res"`, the loader will fallback to
using the `"fast"` strategy.


### Testing

#### Make sure the `strategy` kwarg works

Run the following in iPython to verify that the `"fast"` strategy is
indeed faster.

```python
from langchain.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", strategy="fast", mode="elements")
%timeit loader.load()

loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", mode="elements")
%timeit loader.load()
```

On my system I get:

```python
In [3]: from langchain.document_loaders import UnstructuredFileLoader

In [4]: loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", strategy="fast", mode="elements")

In [5]: %timeit loader.load()
247 ms ± 369 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", mode="elements")

In [7]: %timeit loader.load()
2.45 s ± 31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

#### Make sure older versions of `unstructured` still work

Run `pip install unstructured==0.5.3` and then verify the following runs
without error:

```python
from langchain.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf",  mode="elements")
loader.load()
```
											
										
										
											2023-03-15 01:15:28 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 3,
 								   "id": "645f29e9",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "docs = loader.load()"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 4,
 								   "id": "60685353",
 								   "metadata": {},
 								   "outputs": [
 								    {
 								     "data": {
 								      "text/plain": [
 								       "[Document(page_content='1', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),\n",
 								       " Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),\n",
 								       " Document(page_content='0', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),\n",
 								       " Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),\n",
 								       " Document(page_content='n', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'Title'}, lookup_index=0)]"
 								      ]
 								     },
 								     "execution_count": 4,
 								     "metadata": {},
 								     "output_type": "execute_result"
 								    }
 								   ],
 								   "source": [
 								    "docs[:5]"
 								   ]
 								  },
 								  {
 								   "cell_type": "markdown",
 								   "id": "8de9ef16",
-												Unstructured example notebook: add a pdf, related deps (#1011)

Updates the Unstructured example notebook with a PDF example. Includes
additional dependencies for PDF processing (and images, etc).
											
										
										
											2023-02-12 22:56:48 +00:00
+								   "metadata": {},
 								   "source": [
 								    "## PDF Example\n",
 								    "\n",
-												Harrison/unstructured page number (#6464)

Co-authored-by: Reza Sanaie <reza@sanaie.ca>
											
										
										
											2023-06-20 05:31:43 +00:00
+								    "Processing PDF documents works exactly the same way. Unstructured detects the file type and extracts the same types of elements. Modes of operation are \n",
 								    "- `single` all the text from all elements are combined into one (default)\n",
 								    "- `elements` maintain individual elements\n",
 								    "- `paged` texts from each page are only combined"
-												Unstructured example notebook: add a pdf, related deps (#1011)

Updates the Unstructured example notebook with a PDF example. Includes
additional dependencies for PDF processing (and images, etc).
											
										
										
											2023-02-12 22:56:48 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 1,
 								   "id": "8ca8a648",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "!wget  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper.pdf -P \"../../\""
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												big docs refactor (#1978)

Co-authored-by: Ankush Gola <ankush.gola@gmail.com>
											
										
										
											2023-03-27 02:49:46 +00:00
+								   "execution_count": 7,
-												Unstructured example notebook: add a pdf, related deps (#1011)

Updates the Unstructured example notebook with a PDF example. Includes
additional dependencies for PDF processing (and images, etc).
											
										
										
											2023-02-12 22:56:48 +00:00
+								   "id": "686e5eb4",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
-												Doc refactor (#6300)

Co-authored-by: jacoblee93 <jacoblee93@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
											
										
										
											2023-06-16 18:52:56 +00:00
+								    "loader = UnstructuredFileLoader(\n",
 								    "    \"./example_data/layout-parser-paper.pdf\", mode=\"elements\"\n",
 								    ")"
-												Unstructured example notebook: add a pdf, related deps (#1011)

Updates the Unstructured example notebook with a PDF example. Includes
additional dependencies for PDF processing (and images, etc).
											
										
										
											2023-02-12 22:56:48 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
-												big docs refactor (#1978)

Co-authored-by: Ankush Gola <ankush.gola@gmail.com>
											
										
										
											2023-03-27 02:49:46 +00:00
+								   "execution_count": null,
 								   "id": "c90f0e94",
-												Unstructured example notebook: add a pdf, related deps (#1011)

Updates the Unstructured example notebook with a PDF example. Includes
additional dependencies for PDF processing (and images, etc).
											
										
										
											2023-02-12 22:56:48 +00:00
+								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "docs = loader.load()"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 1,
 								   "id": "6ec859d8",
 								   "metadata": {},
 								   "outputs": [
 								    {
 								     "data": {
 								      "text/plain": [
 								       "[Document(page_content='LayoutParser : A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),\n",
 								       " Document(page_content='Zejiang Shen 1 ( (ea)\\n ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and Weining Li 5', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),\n",
 								       " Document(page_content='Allen Institute for AI shannons@allenai.org', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),\n",
 								       " Document(page_content='Brown University ruochen zhang@brown.edu', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),\n",
 								       " Document(page_content='Harvard University { melissadell,jacob carlson } @fas.harvard.edu', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0)]"
 								      ]
 								     },
 								     "execution_count": 1,
 								     "metadata": {},
 								     "output_type": "execute_result"
 								    }
 								   ],
 								   "source": [
 								    "docs[:5]"
 								   ]
-												docs: add quotes to `unstructured[local-inference]` install instructions (#1208)

### Summary

Corrects the install instruction for local inference to `pip install
"unstructured[local-inference]"`
											
										
										
											2023-02-21 16:06:43 +00:00
+								  },
-												feat: optional post-processing for Unstructured loaders (#7850)

### Summary

Adds a post-processing method for Unstructured loaders that allows users
to optionally modify or clean extracted elements.

### Testing

```python
from langchain.document_loaders import UnstructuredFileLoader
from unstructured.cleaners.core import clean_extra_whitespace

loader = UnstructuredFileLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="elements",
    post_processors=[clean_extra_whitespace],
)

docs = loader.load()
docs[:5]
```


### Reviewrs
  - @rlancemartin
  - @eyurtsev
  - @hwchase17
											
										
										
											2023-07-17 19:13:05 +00:00
+								  {
 								   "cell_type": "markdown",
 								   "id": "1cf27fc8",
 								   "metadata": {},
 								   "source": [
-												fix: apply unstructured preprocess functions (#9473)

### Summary

Fixes a bug from #7850 where post processing functions in Unstructured
loaders were not apply. Adds a assertion to the test to verify the post
processing function was applied and also updates the explanation in the
example notebook.
											
										
										
											2023-08-19 01:54:28 +00:00
+								    "If you need to post process the `unstructured` elements after extraction, you can pass in a list of `str` -> `str` functions to the `post_processors` kwarg when you instantiate the `UnstructuredFileLoader`. This applies to other Unstructured loaders as well. Below is an example."
-												feat: optional post-processing for Unstructured loaders (#7850)

### Summary

Adds a post-processing method for Unstructured loaders that allows users
to optionally modify or clean extracted elements.

### Testing

```python
from langchain.document_loaders import UnstructuredFileLoader
from unstructured.cleaners.core import clean_extra_whitespace

loader = UnstructuredFileLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="elements",
    post_processors=[clean_extra_whitespace],
)

docs = loader.load()
docs[:5]
```


### Reviewrs
  - @rlancemartin
  - @eyurtsev
  - @hwchase17
											
										
										
											2023-07-17 19:13:05 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 2,
 								   "id": "112e5538",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "from langchain.document_loaders import UnstructuredFileLoader\n",
 								    "from unstructured.cleaners.core import clean_extra_whitespace"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 3,
 								   "id": "b9c5ac8d",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "loader = UnstructuredFileLoader(\n",
 								    "    \"./example_data/layout-parser-paper.pdf\",\n",
 								    "    mode=\"elements\",\n",
 								    "    post_processors=[clean_extra_whitespace],\n",
 								    ")"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 4,
 								   "id": "c44d5def",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "docs = loader.load()"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 5,
 								   "id": "b6f27929",
 								   "metadata": {},
 								   "outputs": [
 								    {
 								     "data": {
 								      "text/plain": [
 								       "[Document(page_content='LayoutParser: A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'Title'}),\n",
 								       " Document(page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'UncategorizedText'}),\n",
 								       " Document(page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'UncategorizedText'}),\n",
 								       " Document(page_content='1 2 0 2', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'UncategorizedText'}),\n",
 								       " Document(page_content='n u J', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 258.36), (16.34, 286.14), (36.34, 286.14), (36.34, 258.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'Title'})]"
 								      ]
 								     },
 								     "execution_count": 5,
 								     "metadata": {},
 								     "output_type": "execute_result"
 								    }
 								   ],
 								   "source": [
 								    "docs[:5]"
 								   ]
 								  },
-												feat: batch multiple files in a single Unstructured API request (#4525)

### Submit Multiple Files to the Unstructured API

Enables batching multiple files into a single Unstructured API requests.
Support for requests with multiple files was added to both
`UnstructuredAPIFileLoader` and `UnstructuredAPIFileIOLoader`. Note that
if you submit multiple files in "single" mode, the result will be
concatenated into a single document. We recommend using this feature in
"elements" mode.

### Testing

The following should load both documents, using two of the example docs
from the integration tests folder.

```python
    from langchain.document_loaders import UnstructuredAPIFileLoader

    file_paths = ["examples/layout-parser-paper.pdf",  "examples/whatsapp_chat.txt"]

    loader = UnstructuredAPIFileLoader(
        file_paths=file_paths,
        api_key="FAKE_API_KEY",
        strategy="fast",
        mode="elements",
    )
    docs = loader.load()
```
											
										
										
											2023-05-22 03:48:20 +00:00
+								  {
 								   "cell_type": "markdown",
 								   "id": "b066cb5a",
 								   "metadata": {},
 								   "source": [
 								    "## Unstructured API\n",
 								    "\n",
-												Docs/unstructured api key (#6781)

### Summary

The Unstructured API will soon begin requiring API keys. This PR updates
the Unstructured integrations docs with instructions on how to generate
Unstructured API keys.

### Reviewers

@rlancemartin
@eyurtsev
@hwchase17
											
										
										
											2023-06-27 23:54:15 +00:00
+								    "If you want to get up and running with less set up, you can simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or `UnstructuredAPIFileIOLoader`. That will process your document using the hosted Unstructured API. You can generate a free Unstructured API key [here](https://www.unstructured.io/api-key/). The [Unstructured documentation](https://unstructured-io.github.io/) page will have instructions on how to generate an API key once they’re available. Check out the instructions [here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image) if you’d like to self-host the Unstructured API or run it locally."
-												feat: batch multiple files in a single Unstructured API request (#4525)

### Submit Multiple Files to the Unstructured API

Enables batching multiple files into a single Unstructured API requests.
Support for requests with multiple files was added to both
`UnstructuredAPIFileLoader` and `UnstructuredAPIFileIOLoader`. Note that
if you submit multiple files in "single" mode, the result will be
concatenated into a single document. We recommend using this feature in
"elements" mode.

### Testing

The following should load both documents, using two of the example docs
from the integration tests folder.

```python
    from langchain.document_loaders import UnstructuredAPIFileLoader

    file_paths = ["examples/layout-parser-paper.pdf",  "examples/whatsapp_chat.txt"]

    loader = UnstructuredAPIFileLoader(
        file_paths=file_paths,
        api_key="FAKE_API_KEY",
        strategy="fast",
        mode="elements",
    )
    docs = loader.load()
```
											
										
										
											2023-05-22 03:48:20 +00:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 1,
 								   "id": "b50c70bc",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "from langchain.document_loaders import UnstructuredAPIFileLoader"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 2,
 								   "id": "12b6d2cf",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "filenames = [\"example_data/fake.docx\", \"example_data/fake-email.eml\"]"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 3,
 								   "id": "39a9894d",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "loader = UnstructuredAPIFileLoader(\n",
 								    "    file_path=filenames[0],\n",
 								    "    api_key=\"FAKE_API_KEY\",\n",
 								    ")"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 4,
 								   "id": "386eb63c",
 								   "metadata": {},
 								   "outputs": [
 								    {
 								     "data": {
 								      "text/plain": [
 								       "Document(page_content='Lorem ipsum dolor sit amet.', metadata={'source': 'example_data/fake.docx'})"
 								      ]
 								     },
 								     "execution_count": 4,
 								     "metadata": {},
 								     "output_type": "execute_result"
 								    }
 								   ],
 								   "source": [
 								    "docs = loader.load()\n",
 								    "docs[0]"
 								   ]
 								  },
 								  {
 								   "cell_type": "markdown",
 								   "id": "94158999",
 								   "metadata": {},
 								   "source": [
 								    "You can also batch multiple files through the Unstructured API in a single API using `UnstructuredAPIFileLoader`."
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 5,
 								   "id": "79a18e7e",
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "loader = UnstructuredAPIFileLoader(\n",
 								    "    file_path=filenames,\n",
 								    "    api_key=\"FAKE_API_KEY\",\n",
 								    ")"
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": 6,
 								   "id": "a3d7c846",
 								   "metadata": {},
 								   "outputs": [
 								    {
 								     "data": {
 								      "text/plain": [
 								       "Document(page_content='Lorem ipsum dolor sit amet.\\n\\nThis is a test email to use for unit tests.\\n\\nImportant points:\\n\\nRoses are red\\n\\nViolets are blue', metadata={'source': ['example_data/fake.docx', 'example_data/fake-email.eml']})"
 								      ]
 								     },
 								     "execution_count": 6,
 								     "metadata": {},
 								     "output_type": "execute_result"
 								    }
 								   ],
 								   "source": [
 								    "docs = loader.load()\n",
 								    "docs[0]"
 								   ]
 								  },
-												Harrison/unstructured support (#903)


											
										
										
											2023-02-06 07:02:07 +00:00
+								  {
 								   "cell_type": "code",
 								   "execution_count": null,
-												feat: batch multiple files in a single Unstructured API request (#4525)

### Submit Multiple Files to the Unstructured API

Enables batching multiple files into a single Unstructured API requests.
Support for requests with multiple files was added to both
`UnstructuredAPIFileLoader` and `UnstructuredAPIFileIOLoader`. Note that
if you submit multiple files in "single" mode, the result will be
concatenated into a single document. We recommend using this feature in
"elements" mode.

### Testing

The following should load both documents, using two of the example docs
from the integration tests folder.

```python
    from langchain.document_loaders import UnstructuredAPIFileLoader

    file_paths = ["examples/layout-parser-paper.pdf",  "examples/whatsapp_chat.txt"]

    loader = UnstructuredAPIFileLoader(
        file_paths=file_paths,
        api_key="FAKE_API_KEY",
        strategy="fast",
        mode="elements",
    )
    docs = loader.load()
```
											
										
										
											2023-05-22 03:48:20 +00:00
+								   "id": "0e510495",
-												Harrison/unstructured support (#903)


											
										
										
											2023-02-06 07:02:07 +00:00
+								   "metadata": {},
 								   "outputs": [],
 								   "source": []
 								  }
 								 ],
 								 "metadata": {
 								  "kernelspec": {
 								   "display_name": "Python 3 (ipykernel)",
 								   "language": "python",
 								   "name": "python3"
 								  },
 								  "language_info": {
 								   "codemirror_mode": {
 								    "name": "ipython",
 								    "version": 3
 								   },
 								   "file_extension": ".py",
 								   "mimetype": "text/x-python",
 								   "name": "python",
 								   "nbconvert_exporter": "python",
 								   "pygments_lexer": "ipython3",
-												fix: apply unstructured preprocess functions (#9473)

### Summary

Fixes a bug from #7850 where post processing functions in Unstructured
loaders were not apply. Adds a assertion to the test to verify the post
processing function was applied and also updates the explanation in the
example notebook.
											
										
										
											2023-08-19 01:54:28 +00:00
+								   "version": "3.8.10"
-												Harrison/unstructured support (#903)


											
										
										
											2023-02-06 07:02:07 +00:00
+								  }
 								 },
 								 "nbformat": 4,
 								 "nbformat_minor": 5
 								}