docs: `integrations/providers/unstructured` update (#19892)

Updated a page with existing document loaders with links to examples.
Fixed formatting of one example.

Co-authored-by: Erick Friis <erick@langchain.dev>
pull/20030/head
Leonid Ganeline 5 months ago committed by GitHub
parent 1b7ed6071a
commit 69bf6262aa
GPG Key ID: B5690EEEBB952194

@ -7,7 +7,35 @@
"source": [
"# URL\n",
"\n",
"This covers how to load HTML documents from a list of URLs into a document format that we can use downstream."
"This example covers how to load `HTML` documents from a list of `URLs` into the `Document` format that we can use downstream."
]
},
{
"cell_type": "markdown",
"id": "5ccca101-b167-43bc-849e-9d456b16a123",
"metadata": {
"execution": {
"iopub.execute_input": "2024-04-02T00:13:43.279309Z",
"iopub.status.busy": "2024-04-02T00:13:43.278977Z",
"iopub.status.idle": "2024-04-02T00:13:43.282230Z",
"shell.execute_reply": "2024-04-02T00:13:43.281907Z",
"shell.execute_reply.started": "2024-04-02T00:13:43.279282Z"
}
},
"source": [
"## Unstructured URL Loader\n",
"\n",
"You have to install the `unstructured` library:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cb26084d-a2b0-4685-9ec4-346139ffe0fb",
"metadata": {},
"outputs": [],
"source": [
"!pip install -U unstructured"
]
},
{
@ -67,15 +95,24 @@
"id": "f3afa135",
"metadata": {},
"source": [
"# Selenium URL Loader\n",
"## Selenium URL Loader\n",
"\n",
"This covers how to load HTML documents from a list of URLs using the `SeleniumURLLoader`.\n",
"\n",
"Using selenium allows us to load pages that require JavaScript to render.\n",
"Using `Selenium` allows us to load pages that require JavaScript to render.\n",
"\n",
"## Setup\n",
"\n",
"To use the `SeleniumURLLoader`, you will need to install `selenium` and `unstructured`.\n"
"To use the `SeleniumURLLoader`, you have to install `selenium` and `unstructured`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d2b86cf-55c6-430d-bf31-45591a1aa25a",
"metadata": {},
"outputs": [],
"source": [
"!pip install -U selenium unstructured"
]
},
{
@ -127,15 +164,25 @@
"id": "a2c1c79f",
"metadata": {},
"source": [
"# Playwright URL Loader\n",
"## Playwright URL Loader\n",
"\n",
"This covers how to load HTML documents from a list of URLs using the `PlaywrightURLLoader`.\n",
"\n",
"As in the Selenium case, Playwright allows us to load pages that need JavaScript to render.\n",
"[Playwright](https://playwright.dev/) enables reliable end-to-end testing for modern web apps.\n",
"\n",
"## Setup\n",
"As in the Selenium case, `Playwright` allows us to load and render the JavaScript pages.\n",
"\n",
"To use the `PlaywrightURLLoader`, you will need to install `playwright` and `unstructured`. Additionally, you will need to install the Playwright Chromium browser:"
"To use the `PlaywrightURLLoader`, you have to install `playwright` and `unstructured`. Additionally, you have to install the `Playwright Chromium` browser:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "017ba3d2-ccb0-4c24-a079-44a8e524b2fa",
"metadata": {},
"outputs": [],
"source": [
"!pip install -U playwright unstructured"
]
},
{
@ -145,9 +192,6 @@
"metadata": {},
"outputs": [],
"source": [
"# Install playwright\n",
"%pip install --upgrade --quiet \"playwright\"\n",
"%pip install --upgrade --quiet \"unstructured\"\n",
"!playwright install"
]
},
@ -211,7 +255,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
"version": "3.10.12"
}
},
"nbformat": 4,

@ -27,7 +27,7 @@ simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or
`UnstructuredAPIFileIOLoader`. That will process your document using the hosted Unstructured API.
The Unstructured API requires API keys to make requests.
The `Unstructured API` requires API keys to make requests.
You can request an API key [here](https://unstructured.io/api-key-hosted) and start using it today!
Check out the README [here](https://github.com/Unstructured-IO/unstructured-api) to get started making API calls.
We'd love to hear your feedback, let us know how it goes in our [community slack](https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-1x7cgo0pg-PTptXWylzPQF9xZolzCnwQ).
@ -35,21 +35,209 @@ And stay tuned for improvements to both quality and performance!
Check out the instructions
[here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image) if you'd like to self-host the Unstructured API or run it locally.
## Wrappers
### Data Loaders
## Data Loaders
The primary usage of the `Unstructured` is in data loaders.
### UnstructuredAPIFileIOLoader
See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api).
```python
from langchain_community.document_loaders import UnstructuredAPIFileIOLoader
```
### UnstructuredAPIFileLoader
See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api).
```python
from langchain_community.document_loaders import UnstructuredAPIFileLoader
```
### UnstructuredCHMLoader
`CHM` means `Microsoft Compiled HTML Help`.
See a usage example in the API documentation.
```python
from langchain_community.document_loaders import UnstructuredCHMLoader
```
### UnstructuredCSVLoader
A `comma-separated values` (`CSV`) file is a delimited text file that uses
a comma to separate values. Each line of the file is a data record.
Each record consists of one or more fields, separated by commas.
See a [usage example](/docs/integrations/document_loaders/csv#unstructuredcsvloader).
```python
from langchain_community.document_loaders import UnstructuredCSVLoader
```
### UnstructuredEmailLoader
See a [usage example](/docs/integrations/document_loaders/email).
```python
from langchain_community.document_loaders import UnstructuredEmailLoader
```
### UnstructuredEPubLoader
[EPUB](https://en.wikipedia.org/wiki/EPUB) is an `e-book file format` that uses
the “.epub” file extension. The term is short for electronic publication and
is sometimes styled `ePub`. `EPUB` is supported by many e-readers, and compatible
software is available for most smartphones, tablets, and computers.
See a [usage example](/docs/integrations/document_loaders/epub).
```python
from langchain_community.document_loaders import UnstructuredEPubLoader
```
### UnstructuredExcelLoader
See a [usage example](/docs/integrations/document_loaders/microsoft_excel).
```python
from langchain_community.document_loaders import UnstructuredExcelLoader
```
### UnstructuredFileIOLoader
See a [usage example](/docs/integrations/document_loaders/google_drive#passing-in-optional-file-loaders).
```python
from langchain_community.document_loaders import UnstructuredFileIOLoader
```
### UnstructuredFileLoader
See a [usage example](/docs/integrations/document_loaders/unstructured_file).
The primary `unstructured` wrappers within `langchain` are data loaders. The following
shows how to use the most basic unstructured data loader. There are other file-specific
data loaders available in the `langchain_community.document_loaders` module.
```python
from langchain_community.document_loaders import UnstructuredFileLoader
```
### UnstructuredHTMLLoader
See a [usage example](/docs/modules/data_connection/document_loaders/html).
```python
from langchain_community.document_loaders import UnstructuredHTMLLoader
```
### UnstructuredImageLoader
See a [usage example](/docs/integrations/document_loaders/image).
```python
from langchain_community.document_loaders import UnstructuredImageLoader
```
loader = UnstructuredFileLoader("state_of_the_union.txt")
loader.load()
### UnstructuredMarkdownLoader
See a [usage example](/docs/integrations/vectorstores/starrocks).
```python
from langchain_community.document_loaders import UnstructuredMarkdownLoader
```
### UnstructuredODTLoader
The `Open Document Format for Office Applications (ODF)`, also known as `OpenDocument`,
is an open file format for word processing documents, spreadsheets, presentations
and graphics and using ZIP-compressed XML files. It was developed with the aim of
providing an open, XML-based file format specification for office applications.
See a [usage example](/docs/integrations/document_loaders/odt).
```python
from langchain_community.document_loaders import UnstructuredODTLoader
```
### UnstructuredOrgModeLoader
An [Org Mode](https://en.wikipedia.org/wiki/Org-mode) document is a document editing, formatting, and organizing mode, designed for notes, planning, and authoring within the free software text editor Emacs.
See a [usage example](/docs/integrations/document_loaders/org_mode).
```python
from langchain_community.document_loaders import UnstructuredOrgModeLoader
```
### UnstructuredPDFLoader
See a [usage example](/docs/modules/data_connection/document_loaders/pdf#using-unstructured).
```python
from langchain_community.document_loaders import UnstructuredPDFLoader
```
### UnstructuredPowerPointLoader
See a [usage example](/docs/integrations/document_loaders/microsoft_powerpoint).
```python
from langchain_community.document_loaders import UnstructuredPowerPointLoader
```
### UnstructuredRSTLoader
A `reStructured Text` (`RST`) file is a file format for textual data
used primarily in the Python programming language community for technical documentation.
See a [usage example](/docs/integrations/document_loaders/rst).
```python
from langchain_community.document_loaders import UnstructuredRSTLoader
```
### UnstructuredRTFLoader
See a usage example in the API documentation.
```python
from langchain_community.document_loaders import UnstructuredRTFLoader
```
### UnstructuredTSVLoader
A `tab-separated values` (`TSV`) file is a simple, text-based file format for storing tabular data.
Records are separated by newlines, and values within a record are separated by tab characters.
See a [usage example](/docs/integrations/document_loaders/tsv).
```python
from langchain_community.document_loaders import UnstructuredTSVLoader
```
### UnstructuredURLLoader
See a [usage example](/docs/integrations/document_loaders/url).
```python
from langchain_community.document_loaders import UnstructuredURLLoader
```
### UnstructuredWordDocumentLoader
See a [usage example](/docs/integrations/document_loaders/microsoft_word#using-unstructured).
```python
from langchain_community.document_loaders import UnstructuredWordDocumentLoader
```
### UnstructuredXMLLoader
See a [usage example](/docs/integrations/document_loaders/xml).
```python
from langchain_community.document_loaders import UnstructuredXMLLoader
```
If you instantiate the loader with `UnstructuredFileLoader(mode="elements")`, the loader
will track additional metadata like the page number and text type (i.e. title, narrative text)
when that information is available.
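
The note above on `mode="elements"` can be illustrated with a short sketch. It is not part of this commit. It assumes `langchain-community` and `unstructured` are installed, and that each returned `Document` carries a `category` key in its metadata when the element type is known. The helper names `load_elements` and `group_by_category` are hypothetical, and the file name is the one used in the snippet this diff removes.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def load_elements(path: str) -> List[Any]:
    """Load a file as one Document per element instead of one Document per file."""
    # Imported lazily so the grouping helper below can be used without the
    # optional dependencies (`langchain-community`, `unstructured`).
    from langchain_community.document_loaders import UnstructuredFileLoader

    loader = UnstructuredFileLoader(path, mode="elements")
    return loader.load()


def group_by_category(docs: Iterable[Any]) -> Dict[str, List[str]]:
    """Group element text by the `category` metadata key.

    Assumption: element type is exposed as metadata["category"];
    elements without that key are collected under "unknown".
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for doc in docs:
        category = doc.metadata.get("category", "unknown")
        grouped[category].append(doc.page_content)
    return dict(grouped)


# Example usage (requires the optional dependencies and a local file):
# docs = load_elements("state_of_the_union.txt")
# for category, texts in group_by_category(docs).items():
#     print(category, len(texts))
```

Grouping by category is only one way to use the extra metadata. Other keys, such as a page number, may appear depending on the file type, so it is safer to read them with `dict.get` than to index them directly.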
