From 69bf6262aa18679b5c551487a93b51418149e959 Mon Sep 17 00:00:00 2001 From: Leonid Ganeline Date: Thu, 4 Apr 2024 14:31:27 -0700 Subject: [PATCH] docs: `integrations/providers/unstructured` update (#19892) Updated a page with existing document loaders with links to examples. Fixed formatting of one example. Co-authored-by: Erick Friis --- .../integrations/document_loaders/url.ipynb | 70 ++++-- .../integrations/providers/unstructured.mdx | 212 +++++++++++++++++- 2 files changed, 257 insertions(+), 25 deletions(-) diff --git a/docs/docs/integrations/document_loaders/url.ipynb b/docs/docs/integrations/document_loaders/url.ipynb index 366a348ff0..bc26f36961 100644 --- a/docs/docs/integrations/document_loaders/url.ipynb +++ b/docs/docs/integrations/document_loaders/url.ipynb @@ -7,7 +7,35 @@ "source": [ "# URL\n", "\n", - "This covers how to load HTML documents from a list of URLs into a document format that we can use downstream." + "This example covers how to load `HTML` documents from a list of `URLs` into the `Document` format that we can use downstream." 
+ ] + }, + { + "cell_type": "markdown", + "id": "5ccca101-b167-43bc-849e-9d456b16a123", + "metadata": { + "execution": { + "iopub.execute_input": "2024-04-02T00:13:43.279309Z", + "iopub.status.busy": "2024-04-02T00:13:43.278977Z", + "iopub.status.idle": "2024-04-02T00:13:43.282230Z", + "shell.execute_reply": "2024-04-02T00:13:43.281907Z", + "shell.execute_reply.started": "2024-04-02T00:13:43.279282Z" + } + }, + "source": [ + "## Unstructured URL Loader\n", + "\n", + "You have to install the `unstructured` library:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cb26084d-a2b0-4685-9ec4-346139ffe0fb", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -U unstructured" ] }, { @@ -67,15 +95,24 @@ "id": "f3afa135", "metadata": {}, "source": [ - "# Selenium URL Loader\n", + "## Selenium URL Loader\n", "\n", "This covers how to load HTML documents from a list of URLs using the `SeleniumURLLoader`.\n", "\n", - "Using selenium allows us to load pages that require JavaScript to render.\n", + "Using `Selenium` allows us to load pages that require JavaScript to render.\n", "\n", - "## Setup\n", "\n", - "To use the `SeleniumURLLoader`, you will need to install `selenium` and `unstructured`.\n" + "To use the `SeleniumURLLoader`, you have to install `selenium` and `unstructured`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4d2b86cf-55c6-430d-bf31-45591a1aa25a", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -U selenium unstructured" ] }, { @@ -127,15 +164,25 @@ "id": "a2c1c79f", "metadata": {}, "source": [ - "# Playwright URL Loader\n", + "## Playwright URL Loader\n", "\n", "This covers how to load HTML documents from a list of URLs using the `PlaywrightURLLoader`.\n", "\n", - "As in the Selenium case, Playwright allows us to load pages that need JavaScript to render.\n", + "[Playwright](https://playwright.dev/) enables reliable end-to-end testing for modern web apps.\n", "\n", - "## Setup\n", + "As in the Selenium case, `Playwright` allows us to load pages that need JavaScript to render.\n", "\n", - "To use the `PlaywrightURLLoader`, you will need to install `playwright` and `unstructured`. Additionally, you will need to install the Playwright Chromium browser:" + "To use the `PlaywrightURLLoader`, you have to install `playwright` and `unstructured`. 
Additionally, you have to install the `Playwright Chromium` browser:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "017ba3d2-ccb0-4c24-a079-44a8e524b2fa", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -U playwright unstructured" ] }, { @@ -145,9 +192,6 @@ "metadata": {}, "outputs": [], "source": [ - "# Install playwright\n", - "%pip install --upgrade --quiet \"playwright\"\n", - "%pip install --upgrade --quiet \"unstructured\"\n", "!playwright install" ] }, @@ -211,7 +255,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.6" + "version": "3.10.12" } }, "nbformat": 4, diff --git a/docs/docs/integrations/providers/unstructured.mdx b/docs/docs/integrations/providers/unstructured.mdx index e23ce3c502..e210151646 100644 --- a/docs/docs/integrations/providers/unstructured.mdx +++ b/docs/docs/integrations/providers/unstructured.mdx @@ -27,7 +27,7 @@ simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or `UnstructuredAPIFileIOLoader`. That will process your document using the hosted Unstructured API. -The Unstructured API requires API keys to make requests. +The `Unstructured API` requires API keys to make requests. You can request an API key [here](https://unstructured.io/api-key-hosted) and start using it today! Checkout the README [here](https://github.com/Unstructured-IO/unstructured-api) here to get started making API calls. We'd love to hear your feedback, let us know how it goes in our [community slack](https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-1x7cgo0pg-PTptXWylzPQF9xZolzCnwQ). @@ -35,21 +35,209 @@ And stay tuned for improvements to both quality and performance! Check out the instructions [here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image) if you'd like to self-host the Unstructured API or run it locally. 
-## Wrappers -### Data Loaders +## Data Loaders + +The primary use of `Unstructured` is in data loaders. + +### UnstructuredAPIFileIOLoader + +See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api). + +```python +from langchain_community.document_loaders import UnstructuredAPIFileIOLoader +``` + +### UnstructuredAPIFileLoader + +See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api). + +```python +from langchain_community.document_loaders import UnstructuredAPIFileLoader +``` + +### UnstructuredCHMLoader + +`CHM` stands for `Microsoft Compiled HTML Help`. + +See a usage example in the API documentation. + +```python +from langchain_community.document_loaders import UnstructuredCHMLoader +``` + +### UnstructuredCSVLoader + +A `comma-separated values` (`CSV`) file is a delimited text file that uses +a comma to separate values. Each line of the file is a data record. +Each record consists of one or more fields, separated by commas. + +See a [usage example](/docs/integrations/document_loaders/csv#unstructuredcsvloader). + +```python +from langchain_community.document_loaders import UnstructuredCSVLoader +``` + +### UnstructuredEmailLoader + +See a [usage example](/docs/integrations/document_loaders/email). + +```python +from langchain_community.document_loaders import UnstructuredEmailLoader +``` + +### UnstructuredEPubLoader + +[EPUB](https://en.wikipedia.org/wiki/EPUB) is an `e-book file format` that uses +the “.epub” file extension. The term is short for electronic publication and +is sometimes styled `ePub`. `EPUB` is supported by many e-readers, and compatible +software is available for most smartphones, tablets, and computers. + +See a [usage example](/docs/integrations/document_loaders/epub). 
+ +```python +from langchain_community.document_loaders import UnstructuredEPubLoader +``` + +### UnstructuredExcelLoader + +See a [usage example](/docs/integrations/document_loaders/microsoft_excel). + +```python +from langchain_community.document_loaders import UnstructuredExcelLoader +``` + +### UnstructuredFileIOLoader + +See a [usage example](/docs/integrations/document_loaders/google_drive#passing-in-optional-file-loaders). + +```python +from langchain_community.document_loaders import UnstructuredFileIOLoader +``` + +### UnstructuredFileLoader + +See a [usage example](/docs/integrations/document_loaders/unstructured_file). -The primary `unstructured` wrappers within `langchain` are data loaders. The following -shows how to use the most basic unstructured data loader. There are other file-specific -data loaders available in the `langchain_community.document_loaders` module. ```python from langchain_community.document_loaders import UnstructuredFileLoader - -loader = UnstructuredFileLoader("state_of_the_union.txt") -loader.load() ``` -If you instantiate the loader with `UnstructuredFileLoader(mode="elements")`, the loader -will track additional metadata like the page number and text type (i.e. title, narrative text) -when that information is available. +### UnstructuredHTMLLoader + +See a [usage example](/docs/modules/data_connection/document_loaders/html). + +```python +from langchain_community.document_loaders import UnstructuredHTMLLoader +``` + +### UnstructuredImageLoader + +See a [usage example](/docs/integrations/document_loaders/image). + +```python +from langchain_community.document_loaders import UnstructuredImageLoader +``` + +### UnstructuredMarkdownLoader + +See a [usage example](/docs/integrations/vectorstores/starrocks). 
+ +```python +from langchain_community.document_loaders import UnstructuredMarkdownLoader +``` + +### UnstructuredODTLoader + +The `Open Document Format for Office Applications (ODF)`, also known as `OpenDocument`, +is an open file format for word processing documents, spreadsheets, presentations, +and graphics that uses ZIP-compressed XML files. It was developed with the aim of +providing an open, XML-based file format specification for office applications. + +See a [usage example](/docs/integrations/document_loaders/odt). + +```python +from langchain_community.document_loaders import UnstructuredODTLoader +``` + +### UnstructuredOrgModeLoader + +[Org Mode](https://en.wikipedia.org/wiki/Org-mode) is a document editing, formatting, and organizing mode, designed for notes, planning, and authoring within the free software text editor Emacs. + +See a [usage example](/docs/integrations/document_loaders/org_mode). + +```python +from langchain_community.document_loaders import UnstructuredOrgModeLoader +``` + +### UnstructuredPDFLoader + +See a [usage example](/docs/modules/data_connection/document_loaders/pdf#using-unstructured). + +```python +from langchain_community.document_loaders import UnstructuredPDFLoader +``` + +### UnstructuredPowerPointLoader + +See a [usage example](/docs/integrations/document_loaders/microsoft_powerpoint). + +```python +from langchain_community.document_loaders import UnstructuredPowerPointLoader +``` + +### UnstructuredRSTLoader + +`reStructuredText` (`RST`) is a file format for textual data +used primarily in the Python programming language community for technical documentation. + +See a [usage example](/docs/integrations/document_loaders/rst). + +```python +from langchain_community.document_loaders import UnstructuredRSTLoader +``` + +### UnstructuredRTFLoader + +See a usage example in the API documentation. 
+ +```python +from langchain_community.document_loaders import UnstructuredRTFLoader +``` + +### UnstructuredTSVLoader + +A `tab-separated values` (`TSV`) file is a simple, text-based file format for storing tabular data. +Records are separated by newlines, and values within a record are separated by tab characters. + +See a [usage example](/docs/integrations/document_loaders/tsv). + +```python +from langchain_community.document_loaders import UnstructuredTSVLoader +``` + +### UnstructuredURLLoader + +See a [usage example](/docs/integrations/document_loaders/url). + +```python +from langchain_community.document_loaders import UnstructuredURLLoader +``` + +### UnstructuredWordDocumentLoader + +See a [usage example](/docs/integrations/document_loaders/microsoft_word#using-unstructured). + +```python +from langchain_community.document_loaders import UnstructuredWordDocumentLoader +``` + +### UnstructuredXMLLoader + +See a [usage example](/docs/integrations/document_loaders/xml). + +```python +from langchain_community.document_loaders import UnstructuredXMLLoader +``` +