From 69bf6262aa18679b5c551487a93b51418149e959 Mon Sep 17 00:00:00 2001
From: Leonid Ganeline <leo.gan.57@gmail.com>
Date: Thu, 4 Apr 2024 14:31:27 -0700
Subject: [PATCH] docs: `integrations/providers/unstructured` update (#19892)

Updated the page to list the existing document loaders, with links to usage examples.
Fixed formatting of one example.

Co-authored-by: Erick Friis <erick@langchain.dev>
---
 .../integrations/document_loaders/url.ipynb   |  70 ++++--
 .../integrations/providers/unstructured.mdx   | 212 +++++++++++++++++-
 2 files changed, 257 insertions(+), 25 deletions(-)

diff --git a/docs/docs/integrations/document_loaders/url.ipynb b/docs/docs/integrations/document_loaders/url.ipynb
index 366a348ff0..bc26f36961 100644
--- a/docs/docs/integrations/document_loaders/url.ipynb
+++ b/docs/docs/integrations/document_loaders/url.ipynb
@@ -7,7 +7,35 @@
    "source": [
     "# URL\n",
     "\n",
-    "This covers how to load HTML documents from a list of URLs into a document format that we can use downstream."
+    "This example covers how to load `HTML` documents from a list of `URLs` into the `Document` format that we can use downstream."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5ccca101-b167-43bc-849e-9d456b16a123",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2024-04-02T00:13:43.279309Z",
+     "iopub.status.busy": "2024-04-02T00:13:43.278977Z",
+     "iopub.status.idle": "2024-04-02T00:13:43.282230Z",
+     "shell.execute_reply": "2024-04-02T00:13:43.281907Z",
+     "shell.execute_reply.started": "2024-04-02T00:13:43.279282Z"
+    }
+   },
+   "source": [
+    "## Unstructured URL Loader\n",
+    "\n",
+    "You have to install the `unstructured` library:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cb26084d-a2b0-4685-9ec4-346139ffe0fb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -U unstructured"
    ]
   },
   {
@@ -67,15 +95,24 @@
    "id": "f3afa135",
    "metadata": {},
    "source": [
-    "# Selenium URL Loader\n",
+    "## Selenium URL Loader\n",
     "\n",
     "This covers how to load HTML documents from a list of URLs using the `SeleniumURLLoader`.\n",
     "\n",
-    "Using selenium allows us to load pages that require JavaScript to render.\n",
+    "Using `Selenium` allows us to load pages that require JavaScript to render.\n",
     "\n",
-    "## Setup\n",
     "\n",
-    "To use the `SeleniumURLLoader`, you will need to install `selenium` and `unstructured`.\n"
+    "To use the `SeleniumURLLoader`, you have to install `selenium` and `unstructured`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4d2b86cf-55c6-430d-bf31-45591a1aa25a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -U selenium unstructured"
    ]
   },
   {
@@ -127,15 +164,25 @@
    "id": "a2c1c79f",
    "metadata": {},
    "source": [
-    "# Playwright URL Loader\n",
+    "## Playwright URL Loader\n",
     "\n",
     "This covers how to load HTML documents from a list of URLs using the `PlaywrightURLLoader`.\n",
     "\n",
-    "As in the Selenium case, Playwright allows us to load pages that need JavaScript to render.\n",
+    "[Playwright](https://playwright.dev/) enables reliable end-to-end testing for modern web apps.\n",
     "\n",
-    "## Setup\n",
+    "As in the Selenium case, `Playwright` allows us to load pages that need JavaScript to render.\n",
     "\n",
-    "To use the `PlaywrightURLLoader`, you will need to install `playwright` and `unstructured`. Additionally, you will need to install the Playwright Chromium browser:"
+    "To use the `PlaywrightURLLoader`, you have to install `playwright` and `unstructured`. Additionally, you have to install the `Playwright Chromium` browser:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "017ba3d2-ccb0-4c24-a079-44a8e524b2fa",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -U playwright unstructured"
    ]
   },
   {
@@ -145,9 +192,6 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Install playwright\n",
-    "%pip install --upgrade --quiet  \"playwright\"\n",
-    "%pip install --upgrade --quiet  \"unstructured\"\n",
     "!playwright install"
    ]
   },
@@ -211,7 +255,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.6"
+   "version": "3.10.12"
   }
  },
  "nbformat": 4,
diff --git a/docs/docs/integrations/providers/unstructured.mdx b/docs/docs/integrations/providers/unstructured.mdx
index e23ce3c502..e210151646 100644
--- a/docs/docs/integrations/providers/unstructured.mdx
+++ b/docs/docs/integrations/providers/unstructured.mdx
@@ -27,7 +27,7 @@ simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or
 `UnstructuredAPIFileIOLoader`. That will process your document using the hosted Unstructured API.
 
 
-The Unstructured API requires API keys to make requests.
+The `Unstructured API` requires API keys to make requests.
 You can request an API key [here](https://unstructured.io/api-key-hosted) and start using it today!
 Checkout the README [here](https://github.com/Unstructured-IO/unstructured-api) here to get started making API calls.
 We'd love to hear your feedback, let us know how it goes in our [community slack](https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-1x7cgo0pg-PTptXWylzPQF9xZolzCnwQ).
@@ -35,21 +35,209 @@ And stay tuned for improvements to both quality and performance!
 Check out the instructions
 [here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image) if you'd like to self-host the Unstructured API or run it locally.
 
-## Wrappers
 
-### Data Loaders
+## Data Loaders
+
+`Unstructured` is primarily used through data loaders.
+
+### UnstructuredAPIFileIOLoader
+
+See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api).
+
+```python
+from langchain_community.document_loaders import UnstructuredAPIFileIOLoader
+```
+
+### UnstructuredAPIFileLoader
+
+See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api).
+
+```python
+from langchain_community.document_loaders import UnstructuredAPIFileLoader
+```
+
+### UnstructuredCHMLoader
+
+`CHM` stands for `Microsoft Compiled HTML Help`.
+
+See a usage example in the API documentation.
+
+```python
+from langchain_community.document_loaders import UnstructuredCHMLoader
+```
+
+### UnstructuredCSVLoader
+
+A `comma-separated values` (`CSV`) file is a delimited text file that uses 
+a comma to separate values. Each line of the file is a data record. 
+Each record consists of one or more fields, separated by commas.
+
+See a [usage example](/docs/integrations/document_loaders/csv#unstructuredcsvloader).
+
+```python
+from langchain_community.document_loaders import UnstructuredCSVLoader
+```
+
+### UnstructuredEmailLoader
+
+See a [usage example](/docs/integrations/document_loaders/email).
+
+```python
+from langchain_community.document_loaders import UnstructuredEmailLoader
+```
+
+### UnstructuredEPubLoader
+
+[EPUB](https://en.wikipedia.org/wiki/EPUB) is an `e-book file format` that uses 
+the “.epub” file extension. The term is short for electronic publication and 
+is sometimes styled `ePub`. `EPUB` is supported by many e-readers, and compatible 
+software is available for most smartphones, tablets, and computers.
+
+See a [usage example](/docs/integrations/document_loaders/epub).
+
+```python
+from langchain_community.document_loaders import UnstructuredEPubLoader
+```
+
+### UnstructuredExcelLoader
+
+See a [usage example](/docs/integrations/document_loaders/microsoft_excel).
+
+```python
+from langchain_community.document_loaders import UnstructuredExcelLoader
+```
+
+### UnstructuredFileIOLoader
+
+See a [usage example](/docs/integrations/document_loaders/google_drive#passing-in-optional-file-loaders).
+
+```python
+from langchain_community.document_loaders import UnstructuredFileIOLoader
+```
+
+### UnstructuredFileLoader
+
+See a [usage example](/docs/integrations/document_loaders/unstructured_file).
 
-The primary `unstructured` wrappers within `langchain` are data loaders. The following
-shows how to use the most basic unstructured data loader. There are other file-specific
-data loaders available in the `langchain_community.document_loaders` module.
 
 ```python
 from langchain_community.document_loaders import UnstructuredFileLoader
-
-loader = UnstructuredFileLoader("state_of_the_union.txt")
-loader.load()
 ```
 
-If you instantiate the loader with `UnstructuredFileLoader(mode="elements")`, the loader
-will track additional metadata like the page number and text type (i.e. title, narrative text)
-when that information is available.
+### UnstructuredHTMLLoader
+
+See a [usage example](/docs/modules/data_connection/document_loaders/html).
+
+```python
+from langchain_community.document_loaders import UnstructuredHTMLLoader
+```
+
+### UnstructuredImageLoader
+
+See a [usage example](/docs/integrations/document_loaders/image).
+
+```python
+from langchain_community.document_loaders import UnstructuredImageLoader
+```
+
+### UnstructuredMarkdownLoader
+
+See a [usage example](/docs/integrations/vectorstores/starrocks).
+
+```python
+from langchain_community.document_loaders import UnstructuredMarkdownLoader
+```
+
+### UnstructuredODTLoader
+
+The `Open Document Format for Office Applications (ODF)`, also known as `OpenDocument`, 
+is an open file format for word processing documents, spreadsheets, presentations,
+and graphics that uses ZIP-compressed XML files. It was developed with the aim of
+providing an open, XML-based file format specification for office applications.
+
+See a [usage example](/docs/integrations/document_loaders/odt).
+
+```python
+from langchain_community.document_loaders import UnstructuredODTLoader
+```
+
+### UnstructuredOrgModeLoader
+
+[Org Mode](https://en.wikipedia.org/wiki/Org-mode) is a document editing, formatting, and organizing mode, designed for notes, planning, and authoring within the free software text editor Emacs.
+
+See a [usage example](/docs/integrations/document_loaders/org_mode).
+
+```python
+from langchain_community.document_loaders import UnstructuredOrgModeLoader
+```
+
+### UnstructuredPDFLoader
+
+See a [usage example](/docs/modules/data_connection/document_loaders/pdf#using-unstructured).
+
+```python
+from langchain_community.document_loaders import UnstructuredPDFLoader
+```
+
+### UnstructuredPowerPointLoader
+
+See a [usage example](/docs/integrations/document_loaders/microsoft_powerpoint).
+
+```python
+from langchain_community.document_loaders import UnstructuredPowerPointLoader
+```
+
+### UnstructuredRSTLoader
+
+`reStructuredText` (`RST`) is a file format for textual data
+used primarily in the Python programming language community for technical documentation.
+
+See a [usage example](/docs/integrations/document_loaders/rst).
+
+```python
+from langchain_community.document_loaders import UnstructuredRSTLoader
+```
+
+### UnstructuredRTFLoader
+
+See a usage example in the API documentation.
+
+```python
+from langchain_community.document_loaders import UnstructuredRTFLoader
+```
+
+### UnstructuredTSVLoader
+
+`Tab-separated values` (`TSV`) is a simple, text-based file format for storing tabular data.
+Records are separated by newlines, and values within a record are separated by tab characters.
+
+See a [usage example](/docs/integrations/document_loaders/tsv).
+
+```python
+from langchain_community.document_loaders import UnstructuredTSVLoader
+```
+
+### UnstructuredURLLoader
+
+See a [usage example](/docs/integrations/document_loaders/url).
+
+```python
+from langchain_community.document_loaders import UnstructuredURLLoader
+```
+
+### UnstructuredWordDocumentLoader
+
+See a [usage example](/docs/integrations/document_loaders/microsoft_word#using-unstructured).
+
+```python
+from langchain_community.document_loaders import UnstructuredWordDocumentLoader
+```
+
+### UnstructuredXMLLoader
+
+See a [usage example](/docs/integrations/document_loaders/xml).
+
+```python
+from langchain_community.document_loaders import UnstructuredXMLLoader
+```
+