docs: `document_transformers` consistency (#10467)

- Updated `document_transformers` examples: titles, descriptions, links
- Added `integrations/providers` pages for the missing document_transformers
Leonid Ganeline committed cb84f612c9 (parent 240190db3f)

@@ -7,7 +7,12 @@
"source": [
"# Beautiful Soup\n",
"\n",
"Beautiful Soup offers fine-grained control over HTML content, enabling specific tag extraction, removal, and content cleaning. \n",
">[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) is a Python package for parsing \n",
"> HTML and XML documents (including having malformed markup, i.e. non-closed tags, so named after tag soup). \n",
"> It creates a parse tree for parsed pages that can be used to extract data from HTML,[3] which \n",
"> is useful for web scraping.\n",
"\n",
"`Beautiful Soup` offers fine-grained control over HTML content, enabling specific tag extraction, removal, and content cleaning. \n",
"\n",
"It's suited for cases where you want to extract specific information and clean up the HTML content according to your needs.\n",
"\n",
@@ -87,7 +92,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
"version": "3.10.12"
}
},
"nbformat": 4,

@@ -1,14 +1,11 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "48438efb-9f0d-473b-a91c-9f1e29c2539d",
"cell_type": "markdown",
"id": "310fce10-e051-40db-89b0-5b5bb85cd145",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders.blob_loaders import Blob\n",
"from langchain.document_loaders.parsers import DocAIParser"
"# Document AI\n"
]
},
{
@@ -16,7 +13,28 @@
"id": "f95ac25b-f025-40c3-95b8-77919fc4da7f",
"metadata": {},
"source": [
"DocAI is a Google Cloud platform to transform unstructured data from documents into structured data, making it easier to understand, analyze, and consume. You can read more about it: https://cloud.google.com/document-ai/docs/overview "
">[Document AI](https://cloud.google.com/document-ai/docs/overview) is a `Google Cloud Platform` service to transform unstructured data from documents into structured data, making it easier to understand, analyze, and consume. "
]
},
{
"cell_type": "markdown",
"id": "275f2193-248f-4565-a872-93a89589cf2b",
"metadata": {},
"source": [
"The module contains a `PDF` parser based on DocAI from Google Cloud.\n",
"\n",
"You need to install two libraries to use this parser:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "34132fab-0069-4942-b68b-5b093ccfc92a",
"metadata": {},
"outputs": [],
"source": [
"!pip install google-cloud-documentai\n",
"!pip install google-cloud-documentai-toolbox"
]
},
{
@@ -24,8 +42,8 @@
"id": "51946817-798c-4d11-abd6-db2ae53a0270",
"metadata": {},
"source": [
"First, you need to set up a GCS bucket and create your own OCR processor as described here: https://cloud.google.com/document-ai/docs/create-processor\n",
"The GCS_OUTPUT_PATH should be a path to a folder on GCS (starting with `gs://`) and a processor name should look like `projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID`. You can get it either programmatically or copy from the `Prediction endpoint` section of the `Processor details` tab in the Google Cloud Console."
"First, you need to set up a [`GCS` bucket and create your own OCR processor](https://cloud.google.com/document-ai/docs/create-processor) \n",
"The `GCS_OUTPUT_PATH` should be a path to a folder on GCS (starting with `gs://`) and a processor name should look like `projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID`. You can get it either programmatically or copy from the `Prediction endpoint` section of the `Processor details` tab in the Google Cloud Console."
]
},
{
@@ -40,6 +58,17 @@
"PROCESSOR_NAME = \"PUT_SOMETHING_HERE\""
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "48438efb-9f0d-473b-a91c-9f1e29c2539d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders.blob_loaders import Blob\n",
"from langchain.document_loaders.parsers import DocAIParser"
]
},
{
"cell_type": "markdown",
"id": "fad2bcca-1c0e-4888-b82d-15823ba57e60",
@@ -261,7 +290,7 @@
"uri": "gcr.io/deeplearning-platform-release/base-cpu:m109"
},
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -275,7 +304,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
"version": "3.10.12"
}
},
"nbformat": 4,

@@ -4,14 +4,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Doctran Extract Properties\n",
"# Doctran: extract properties\n",
"\n",
"We can extract useful features of documents using the [Doctran](https://github.com/psychic-api/doctran) library, which uses OpenAI's function calling feature to extract specific metadata.\n",
"\n",
"Extracting metadata from documents is helpful for a variety of tasks, including:\n",
"* Classification: classifying documents into different categories\n",
"* Data mining: Extract structured data that can be used for data analysis\n",
"* Style transfer: Change the way text is written to more closely match expected user input, improving vector search results"
"* **Classification:** classifying documents into different categories\n",
"* **Data mining:** Extract structured data that can be used for data analysis\n",
"* **Style transfer:** Change the way text is written to more closely match expected user input, improving vector search results"
]
},
{
@@ -26,9 +26,7 @@
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"scrolled": false
},
"metadata": {},
"outputs": [],
"source": [
"import json\n",
@@ -261,9 +259,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

@@ -4,8 +4,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Doctran Interrogate Documents\n",
"Documents used in a vector store knowledge base are typically stored in narrative or conversational format. However, most user queries are in question format. If we convert documents into Q&A format before vectorizing them, we can increase the liklihood of retrieving relevant documents, and decrease the liklihood of retrieving irrelevant documents.\n",
"# Doctran: interrogate documents\n",
"\n",
"Documents used in a vector store knowledge base are typically stored in a narrative or conversational format. However, most user queries are in question format. If we **convert documents into Q&A format** before vectorizing them, we can increase the likelihood of retrieving relevant documents, and decrease the likelihood of retrieving irrelevant documents.\n",
"\n",
"We can accomplish this using the [Doctran](https://github.com/psychic-api/doctran) library, which uses OpenAI's function calling feature to \"interrogate\" documents.\n",
"\n",
@@ -24,9 +25,7 @@
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"scrolled": false
},
"metadata": {},
"outputs": [],
"source": [
"import json\n",
@@ -258,9 +257,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

@@ -4,10 +4,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Doctran Translate Documents\n",
"# Doctran: language translation\n",
"\n",
"Comparing documents through embeddings has the benefit of working across multiple languages. \"Harrison says hello\" and \"Harrison dice hola\" will occupy similar positions in the vector space because they have the same meaning semantically.\n",
"\n",
"However, it can still be useful to use a LLM translate documents into other languages before vectorizing them. This is especially helpful when users are expected to query the knowledge base in different languages, or when state of the art embeddings models are not available for a given language.\n",
"However, it can still be useful to use an LLM to **translate documents into other languages** before vectorizing them. This is especially helpful when users are expected to query the knowledge base in different languages, or when state-of-the-art embedding models are not available for a given language.\n",
"\n",
"We can accomplish this using the [Doctran](https://github.com/psychic-api/doctran) library, which uses OpenAI's function calling feature to translate documents between languages."
]
@@ -125,9 +126,7 @@
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": false
},
"metadata": {},
"outputs": [],
"source": [
"translated_document = await qa_translator.atransform_documents(documents)"
@@ -200,9 +199,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

@@ -5,11 +5,11 @@
"id": "fe6e5c82",
"metadata": {},
"source": [
"# html2text\n",
"# HTML to text\n",
"\n",
"[html2text](https://github.com/Alir3z4/html2text/) is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. \n",
">[html2text](https://github.com/Alir3z4/html2text/) is a Python package that converts a page of `HTML` into clean, easy-to-read plain `ASCII text`. \n",
"\n",
"The ASCII also happens to be valid Markdown (a text-to-HTML format)."
"The ASCII also happens to be a valid `Markdown` (a text-to-HTML format)."
]
},
{
@@ -125,7 +125,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
"version": "3.10.12"
}
},
"nbformat": 4,

@@ -5,11 +5,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Nuclia Understanding API document transformer\n",
"# Nuclia\n",
"\n",
"[Nuclia](https://nuclia.com) automatically indexes your unstructured data from any internal and external source, providing optimized search results and generative answers. It can handle video and audio transcription, image content extraction, and document parsing.\n",
">[Nuclia](https://nuclia.com) automatically indexes your unstructured data from any internal and external source, providing optimized search results and generative answers. It can handle video and audio transcription, image content extraction, and document parsing.\n",
"\n",
"The Nuclia Understanding API document transformer splits text into paragraphs and sentences, identifies entities, provides a summary of the text and generates embeddings for all the sentences.\n",
"`Nuclia Understanding API` document transformer splits text into paragraphs and sentences, identifies entities, provides a summary of the text and generates embeddings for all the sentences.\n",
"\n",
"To use the Nuclia Understanding API, you need to have a Nuclia account. You can create one for free at [https://nuclia.cloud](https://nuclia.cloud), and then [create a NUA key](https://docs.nuclia.dev/docs/docs/using/understanding/intro).\n",
"\n",
@@ -94,7 +94,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "langchain",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -108,10 +108,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.5"
},
"orig_nbformat": 4
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

@@ -4,15 +4,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# OpenAI Functions Metadata Tagger\n",
"# OpenAI metadata tagger\n",
"\n",
"It can often be useful to tag ingested documents with structured metadata, such as the title, tone, or length of a document, to allow for more targeted similarity search later. However, for large numbers of documents, performing this labelling process manually can be tedious.\n",
"It can often be useful to tag ingested documents with structured metadata, such as the title, tone, or length of a document, to allow for a more targeted similarity search later. However, for large numbers of documents, performing this labelling process manually can be tedious.\n",
"\n",
"The `OpenAIMetadataTagger` document transformer automates this process by extracting metadata from each provided document according to a provided schema. It uses a configurable OpenAI Functions-powered chain under the hood, so if you pass a custom LLM instance, it must be an OpenAI model with functions support. \n",
"The `OpenAIMetadataTagger` document transformer automates this process by extracting metadata from each provided document according to a provided schema. It uses a configurable `OpenAI Functions`-powered chain under the hood, so if you pass a custom LLM instance, it must be an `OpenAI` model with functions support. \n",
"\n",
"**Note:** This document transformer works best with complete documents, so it's best to run it first with whole documents before doing any other splitting or processing!\n",
"\n",
"For example, let's say you wanted to index a set of movie reviews. You could initialize the document transformer with a valid JSON Schema object as follows:"
"For example, let's say you wanted to index a set of movie reviews. You could initialize the document transformer with a valid `JSON Schema` object as follows:"
]
},
{
@@ -239,9 +239,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "venv"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
@@ -253,9 +253,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}

@@ -0,0 +1,20 @@
# Beautiful Soup

>[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) is a Python package for parsing
> HTML and XML documents (including those with malformed markup, i.e. non-closed tags, so named after tag soup).
> It creates a parse tree for parsed pages that can be used to extract data from HTML,
> which is useful for web scraping.

## Installation and Setup

```bash
pip install beautifulsoup4
```

## Document Transformer

See a [usage example](/docs/integrations/document_transformers/beautiful_soup).

```python
from langchain.document_transformers import BeautifulSoupTransformer
```
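
A minimal usage sketch, assuming the HTML is already wrapped in `Document` objects (the sample markup and tag choices below are illustrative only):

```python
from langchain.document_transformers import BeautifulSoupTransformer
from langchain.schema import Document

# Toy HTML; in practice this usually comes from a loader such as AsyncHtmlLoader.
html_doc = Document(
    page_content="<html><body><p>Hello world.</p><a href='/about'>About</a></body></html>"
)

bs_transformer = BeautifulSoupTransformer()
# Keep only the text found inside the listed tags; other markup is stripped.
docs_transformed = bs_transformer.transform_documents([html_doc], tags_to_extract=["p", "a"])
print(docs_transformed[0].page_content)
```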

@@ -0,0 +1,37 @@
# Doctran

>[Doctran](https://github.com/psychic-api/doctran) is a Python package. It uses LLMs and open-source
> NLP libraries to transform raw text into clean, structured, information-dense documents
> that are optimized for vector space retrieval. You can think of `Doctran` as a black box where
> messy strings go in and nice, clean, labelled strings come out.

## Installation and Setup

```bash
pip install doctran
```

## Document Transformers
### Document Interrogator

See a [usage example for DoctranQATransformer](/docs/integrations/document_transformers/doctran_interrogate_document).

```python
from langchain.document_transformers import DoctranQATransformer
```

### Property Extractor

See a [usage example for DoctranPropertyExtractor](/docs/integrations/document_transformers/doctran_extract_properties).

```python
from langchain.document_transformers import DoctranPropertyExtractor
```

### Document Translator

See a [usage example for DoctranTextTranslator](/docs/integrations/document_transformers/doctran_translate_document).

```python
from langchain.document_transformers import DoctranTextTranslator
```
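
All three transformers share the same async interface; below is a sketch with `DoctranQATransformer` (it assumes an `OPENAI_API_KEY` environment variable is set, and the input text is illustrative):

```python
import asyncio

from langchain.document_transformers import DoctranQATransformer
from langchain.schema import Document

# Illustrative input; Doctran calls OpenAI under the hood.
documents = [Document(page_content="LangChain offers document transformers for preprocessing text.")]

qa_transformer = DoctranQATransformer()
# Doctran transformers are async-only; the generated Q&A pairs
# are written into each document's metadata.
transformed = asyncio.run(qa_transformer.atransform_documents(documents))
print(transformed[0].metadata)
```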

@@ -0,0 +1,28 @@
# Google Document AI

>[Document AI](https://cloud.google.com/document-ai/docs/overview) is a `Google Cloud Platform`
> service to transform unstructured data from documents into structured data, making it easier
> to understand, analyze, and consume.

## Installation and Setup

You need to set up a [`GCS` bucket and create your own OCR processor](https://cloud.google.com/document-ai/docs/create-processor).
The `GCS_OUTPUT_PATH` should be a path to a folder on GCS (starting with `gs://`)
and a processor name should look like `projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID`.
You can get it either programmatically or copy it from the `Prediction endpoint` section
of the `Processor details` tab in the Google Cloud Console.

```bash
pip install google-cloud-documentai
pip install google-cloud-documentai-toolbox
```

## Document Transformer

See a [usage example](/docs/integrations/document_transformers/docai).

```python
from langchain.document_loaders.blob_loaders import Blob
from langchain.document_loaders.parsers import DocAIParser
```
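
A minimal parsing sketch; the processor name, bucket, and PDF path below are placeholders to replace with your own values:

```python
from langchain.document_loaders.blob_loaders import Blob
from langchain.document_loaders.parsers import DocAIParser

# Placeholders; use your own processor and a GCS folder you can write to.
parser = DocAIParser(
    location="us",
    processor_name="projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID",
    gcs_output_path="gs://BUCKET_NAME/FOLDER_PATH",
)

# Parse a PDF that is already stored on GCS; lazy_parse yields parsed Documents.
blob = Blob(path="gs://BUCKET_NAME/PATH_TO_FILE.pdf")
docs = list(parser.lazy_parse(blob))
```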

@@ -0,0 +1,19 @@
# HTML to text

>[html2text](https://github.com/Alir3z4/html2text/) is a Python package that converts a page of `HTML` into clean, easy-to-read plain `ASCII text`.

The ASCII also happens to be valid `Markdown` (a text-to-HTML format).

## Installation and Setup

```bash
pip install html2text
```

## Document Transformer

See a [usage example](/docs/integrations/document_transformers/html2text).

```python
from langchain.document_transformers import Html2TextTransformer
```
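
A minimal sketch, assuming the HTML is already wrapped in `Document` objects (the sample markup is illustrative):

```python
from langchain.document_transformers import Html2TextTransformer
from langchain.schema import Document

html_docs = [
    Document(page_content="<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>")
]

# Each document's HTML is converted to clean plain text that doubles as Markdown.
docs_transformed = Html2TextTransformer().transform_documents(html_docs)
print(docs_transformed[0].page_content)
```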

@@ -0,0 +1,37 @@
# Nuclia

>[Nuclia](https://nuclia.com) automatically indexes your unstructured data from any internal
> and external source, providing optimized search results and generative answers.
> It can handle video and audio transcription, image content extraction, and document parsing.
>
> The `Nuclia Understanding API` document transformer splits text into paragraphs and sentences,
> identifies entities, provides a summary of the text, and generates embeddings for all the sentences.

## Installation and Setup

We need to install the `nucliadb-protos` package to use the `Nuclia Understanding API`.

```bash
pip install nucliadb-protos
```

To use the `Nuclia Understanding API`, we need to have a `Nuclia` account.
We can create one for free at [https://nuclia.cloud](https://nuclia.cloud),
and then [create a NUA key](https://docs.nuclia.dev/docs/docs/using/understanding/intro).

To use the Nuclia document transformer, we need to instantiate a `NucliaUnderstandingAPI`
tool with `enable_ml` set to `True`:

```python
from langchain.tools.nuclia import NucliaUnderstandingAPI

nua = NucliaUnderstandingAPI(enable_ml=True)
```

## Document Transformer

See a [usage example](/docs/integrations/document_transformers/nuclia_transformer).

```python
from langchain.document_transformers.nuclia_text_transform import NucliaTextTransformer
```
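
The transformer is async-only; below is a sketch of the full flow (it assumes the NUA key is available to the tool through its environment configuration, and the input text is illustrative):

```python
import asyncio

from langchain.document_transformers.nuclia_text_transform import NucliaTextTransformer
from langchain.schema import Document
from langchain.tools.nuclia import NucliaUnderstandingAPI


async def transform(documents):
    # The tool reads the NUA key from its environment configuration.
    nua = NucliaUnderstandingAPI(enable_ml=True)
    transformer = NucliaTextTransformer(nua)
    return await transformer.atransform_documents(documents)


documents = [Document(page_content="Nuclia indexes unstructured data from any source.")]
transformed = asyncio.run(transform(documents))
```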