docs: `integrations/providers/unstructured` update (#19892)

Updated a page with existing document loaders with links to examples. Fixed formatting of one example.

Co-authored-by: Erick Friis <erick@langchain.dev>
5 months ago · 69bf6262aa
parent 1b7ed6071a
commit 69bf6262aa
2 changed files with 256 additions and 24 deletions
--- a/docs/docs/integrations/document_loaders/url.ipynb
+++ b/docs/docs/integrations/document_loaders/url.ipynb
@@ -7,7 +7,35 @@
   "source": [
    "# URL\n",
    "\n",
-    "This covers how to load HTML documents from a list of URLs into a document format that we can use downstream."
+    "This example covers how to load `HTML` documents from a list of `URLs` into the `Document` format that we can use downstream."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5ccca101-b167-43bc-849e-9d456b16a123",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2024-04-02T00:13:43.279309Z",
+     "iopub.status.busy": "2024-04-02T00:13:43.278977Z",
+     "iopub.status.idle": "2024-04-02T00:13:43.282230Z",
+     "shell.execute_reply": "2024-04-02T00:13:43.281907Z",
+     "shell.execute_reply.started": "2024-04-02T00:13:43.279282Z"
+    }
+   },
+   "source": [
+    "## Unstructured URL Loader\n",
+    "\n",
+    "You have to install the `unstructured` library:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cb26084d-a2b0-4685-9ec4-346139ffe0fb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -U unstructured"
   ]
  },
  {
@@ -67,15 +95,24 @@
   "id": "f3afa135",
   "metadata": {},
   "source": [
-    "# Selenium URL Loader\n",
+    "## Selenium URL Loader\n",
    "\n",
    "This covers how to load HTML documents from a list of URLs using the `SeleniumURLLoader`.\n",
    "\n",
-    "Using selenium allows us to load pages that require JavaScript to render.\n",
+    "Using `Selenium` allows us to load pages that require JavaScript to render.\n",
    "\n",
-    "## Setup\n",
    "\n",
-    "To use the `SeleniumURLLoader`, you will need to install `selenium` and `unstructured`.\n"
+    "To use the `SeleniumURLLoader`, you have to install `selenium` and `unstructured`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4d2b86cf-55c6-430d-bf31-45591a1aa25a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -U selenium unstructured"
   ]
  },
  {
@@ -127,15 +164,25 @@
   "id": "a2c1c79f",
   "metadata": {},
   "source": [
-    "# Playwright URL Loader\n",
+    "## Playwright URL Loader\n",
    "\n",
    "This covers how to load HTML documents from a list of URLs using the `PlaywrightURLLoader`.\n",
    "\n",
-    "As in the Selenium case, Playwright allows us to load pages that need JavaScript to render.\n",
+    "[Playwright](https://playwright.dev/) enables reliable end-to-end testing for modern web apps.\n",
    "\n",
-    "## Setup\n",
+    "As in the Selenium case, `Playwright` allows us to load and render the JavaScript pages.\n",
    "\n",
-    "To use the `PlaywrightURLLoader`, you will need to install `playwright` and `unstructured`. Additionally, you will need to install the Playwright Chromium browser:"
+    "To use the `PlaywrightURLLoader`, you have to install `playwright` and `unstructured`. Additionally, you have to install the `Playwright Chromium` browser:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "017ba3d2-ccb0-4c24-a079-44a8e524b2fa",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -U playwright unstructured"
   ]
  },
  {
@@ -145,9 +192,6 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "# Install playwright\n",
-    "%pip install --upgrade --quiet  \"playwright\"\n",
-    "%pip install --upgrade --quiet  \"unstructured\"\n",
    "!playwright install"
   ]
  },
@@ -211,7 +255,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.10.6"
+   "version": "3.10.12"
  }
 },
 "nbformat": 4,
--- a/docs/docs/integrations/providers/unstructured.mdx
+++ b/docs/docs/integrations/providers/unstructured.mdx
@@ -27,7 +27,7 @@ simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or
 `UnstructuredAPIFileIOLoader`. That will process your document using the hosted Unstructured API.


-The Unstructured API requires API keys to make requests.
+The `Unstructured API` requires API keys to make requests.
 You can request an API key [here](https://unstructured.io/api-key-hosted) and start using it today!
 Check out the README [here](https://github.com/Unstructured-IO/unstructured-api) to get started making API calls.
 We'd love to hear your feedback, let us know how it goes in our [community slack](https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-1x7cgo0pg-PTptXWylzPQF9xZolzCnwQ).
@@ -35,21 +35,209 @@ And stay tuned for improvements to both quality and performance!
 Check out the instructions
 [here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image) if you'd like to self-host the Unstructured API or run it locally.

-## Wrappers

-### Data Loaders
+## Data Loaders
+
+The primary use of `Unstructured` is in data loaders.
+
+### UnstructuredAPIFileIOLoader
+
+See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api).
+
+```python
+from langchain_community.document_loaders import UnstructuredAPIFileIOLoader
+```
+
+### UnstructuredAPIFileLoader
+
+See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api).
+
+```python
+from langchain_community.document_loaders import UnstructuredAPIFileLoader
+```
+
+### UnstructuredCHMLoader
+
+`CHM` means `Microsoft Compiled HTML Help`.
+
+See a usage example in the API documentation.
+
+```python
+from langchain_community.document_loaders import UnstructuredCHMLoader
+```
+
+### UnstructuredCSVLoader
+
+A `comma-separated values` (`CSV`) file is a delimited text file that uses 
+a comma to separate values. Each line of the file is a data record. 
+Each record consists of one or more fields, separated by commas.
+
+See a [usage example](/docs/integrations/document_loaders/csv#unstructuredcsvloader).
+
+```python
+from langchain_community.document_loaders import UnstructuredCSVLoader
+```
+
+### UnstructuredEmailLoader
+
+See a [usage example](/docs/integrations/document_loaders/email).
+
+```python
+from langchain_community.document_loaders import UnstructuredEmailLoader
+```
+
+### UnstructuredEPubLoader
+
+[EPUB](https://en.wikipedia.org/wiki/EPUB) is an `e-book file format` that uses 
+the “.epub” file extension. The term is short for electronic publication and 
+is sometimes styled `ePub`. `EPUB` is supported by many e-readers, and compatible 
+software is available for most smartphones, tablets, and computers.
+
+See a [usage example](/docs/integrations/document_loaders/epub).
+
+```python
+from langchain_community.document_loaders import UnstructuredEPubLoader
+```
+
+### UnstructuredExcelLoader
+
+See a [usage example](/docs/integrations/document_loaders/microsoft_excel).
+
+```python
+from langchain_community.document_loaders import UnstructuredExcelLoader
+```
+
+### UnstructuredFileIOLoader
+
+See a [usage example](/docs/integrations/document_loaders/google_drive#passing-in-optional-file-loaders).
+
+```python
+from langchain_community.document_loaders import UnstructuredFileIOLoader
+```
+
+### UnstructuredFileLoader
+
+See a [usage example](/docs/integrations/document_loaders/unstructured_file).

-The primary `unstructured` wrappers within `langchain` are data loaders. The following
-shows how to use the most basic unstructured data loader. There are other file-specific
-data loaders available in the `langchain_community.document_loaders` module.

 ```python
 from langchain_community.document_loaders import UnstructuredFileLoader
+```
+
+### UnstructuredHTMLLoader
+
+See a [usage example](/docs/modules/data_connection/document_loaders/html).
+
+```python
+from langchain_community.document_loaders import UnstructuredHTMLLoader
+```
+
+### UnstructuredImageLoader
+
+See a [usage example](/docs/integrations/document_loaders/image).
+
+```python
+from langchain_community.document_loaders import UnstructuredImageLoader
+```

-loader = UnstructuredFileLoader("state_of_the_union.txt")
-loader.load()
+### UnstructuredMarkdownLoader
+
+See a [usage example](/docs/integrations/vectorstores/starrocks).
+
+```python
+from langchain_community.document_loaders import UnstructuredMarkdownLoader
+```
+
+### UnstructuredODTLoader
+
+The `Open Document Format for Office Applications (ODF)`, also known as `OpenDocument`,
+is an open file format for word processing documents, spreadsheets, presentations,
+and graphics that uses ZIP-compressed XML files. It was developed with the aim of
+providing an open, XML-based file format specification for office applications.
+
+See a [usage example](/docs/integrations/document_loaders/odt).
+
+```python
+from langchain_community.document_loaders import UnstructuredODTLoader
+```
+
+### UnstructuredOrgModeLoader
+
+[Org Mode](https://en.wikipedia.org/wiki/Org-mode) is a document editing, formatting, and organizing mode designed for notes, planning, and authoring within the free software text editor Emacs.
+
+See a [usage example](/docs/integrations/document_loaders/org_mode).
+
+```python
+from langchain_community.document_loaders import UnstructuredOrgModeLoader
+```
+
+### UnstructuredPDFLoader
+
+See a [usage example](/docs/modules/data_connection/document_loaders/pdf#using-unstructured).
+
+```python
+from langchain_community.document_loaders import UnstructuredPDFLoader
+```
+
+### UnstructuredPowerPointLoader
+
+See a [usage example](/docs/integrations/document_loaders/microsoft_powerpoint).
+
+```python
+from langchain_community.document_loaders import UnstructuredPowerPointLoader
+```
+
+### UnstructuredRSTLoader
+
+`reStructuredText` (`RST`) is a file format for textual data
+used primarily in the Python programming language community for technical documentation.
+
+See a [usage example](/docs/integrations/document_loaders/rst).
+
+```python
+from langchain_community.document_loaders import UnstructuredRSTLoader
+```
+
+### UnstructuredRTFLoader
+
+See a usage example in the API documentation.
+
+```python
+from langchain_community.document_loaders import UnstructuredRTFLoader
+```
+
+### UnstructuredTSVLoader
+
+A `tab-separated values` (`TSV`) file is a simple, text-based file format for storing tabular data.
+Records are separated by newlines, and values within a record are separated by tab characters.
+
+See a [usage example](/docs/integrations/document_loaders/tsv).
+
+```python
+from langchain_community.document_loaders import UnstructuredTSVLoader
+```
+
+### UnstructuredURLLoader
+
+See a [usage example](/docs/integrations/document_loaders/url).
+
+```python
+from langchain_community.document_loaders import UnstructuredURLLoader
+```
+
+### UnstructuredWordDocumentLoader
+
+See a [usage example](/docs/integrations/document_loaders/microsoft_word#using-unstructured).
+
+```python
+from langchain_community.document_loaders import UnstructuredWordDocumentLoader
+```
+
+### UnstructuredXMLLoader
+
+See a [usage example](/docs/integrations/document_loaders/xml).
+
+```python
+from langchain_community.document_loaders import UnstructuredXMLLoader
 ```

-If you instantiate the loader with `UnstructuredFileLoader(mode="elements")`, the loader
-will track additional metadata like the page number and text type (i.e. title, narrative text)
-when that information is available.
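
Since the updated page lists one loader class per file type, a reader may want a way to pick a class from a file name. The following is only a rough sketch of that idea, not part of this commit: the class names are taken from the list in the diff, while the file extensions, `LOADER_BY_SUFFIX`, `loader_name_for`, and `make_loader` are hypothetical names and assumptions introduced for illustration.

```python
import importlib
from pathlib import Path

# Assumed suffix-to-loader mapping. The class names appear in the diff above;
# the file extensions are illustrative guesses, not taken from the docs.
LOADER_BY_SUFFIX = {
    ".chm": "UnstructuredCHMLoader",
    ".csv": "UnstructuredCSVLoader",
    ".eml": "UnstructuredEmailLoader",
    ".epub": "UnstructuredEPubLoader",
    ".xlsx": "UnstructuredExcelLoader",
    ".html": "UnstructuredHTMLLoader",
    ".md": "UnstructuredMarkdownLoader",
    ".odt": "UnstructuredODTLoader",
    ".org": "UnstructuredOrgModeLoader",
    ".pdf": "UnstructuredPDFLoader",
    ".pptx": "UnstructuredPowerPointLoader",
    ".rst": "UnstructuredRSTLoader",
    ".rtf": "UnstructuredRTFLoader",
    ".tsv": "UnstructuredTSVLoader",
    ".docx": "UnstructuredWordDocumentLoader",
    ".xml": "UnstructuredXMLLoader",
}

# Generic fallback for any suffix not listed above.
DEFAULT_LOADER = "UnstructuredFileLoader"


def loader_name_for(path):
    """Return the loader class name for a path (case-insensitive suffix match)."""
    return LOADER_BY_SUFFIX.get(Path(path).suffix.lower(), DEFAULT_LOADER)


def make_loader(path, **kwargs):
    """Instantiate the matching loader.

    The import happens lazily, so loader_name_for() works even when
    langchain_community is not installed.
    """
    module = importlib.import_module("langchain_community.document_loaders")
    loader_cls = getattr(module, loader_name_for(path))
    return loader_cls(path, **kwargs)
```

With `langchain-community` and `unstructured` installed, something like `make_loader("state_of_the_union.txt", mode="elements").load()` should correspond to the `UnstructuredFileLoader` example that this commit removed from the page.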