mirror of https://github.com/hwchase17/langchain

docs: `document_transformers` consistency (#10467)

- Updated `document_transformers` examples: titles, descriptions, links
- Added `integrations/providers` pages for missed document_transformers

parent 240190db3f, commit cb84f612c9
# Beautiful Soup

>[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) is a Python package for parsing
> HTML and XML documents (including those with malformed markup, i.e. non-closed tags, so named after tag soup).
> It creates a parse tree for parsed pages that can be used to extract data from HTML,
> which is useful for web scraping.

## Installation and Setup

```bash
pip install beautifulsoup4
```

## Document Transformer

See a [usage example](/docs/integrations/document_transformers/beautiful_soup).

```python
from langchain.document_transformers import BeautifulSoupTransformer
```
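The transformer's core idea, extracting readable text from only the HTML tags you care about, can be sketched with Python's standard-library `html.parser` alone. This is a minimal illustration of the technique, not the `BeautifulSoupTransformer` API itself; the class name and tag list below are made up for the example.

```python
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collect text that appears inside any of the given tags."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.depth = 0      # > 0 while inside a wanted tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we are inside one of the wanted tags.
        if self.depth and data.strip():
            self.chunks.append(data.strip())

html = "<html><body><nav>menu</nav><p>Hello <b>world</b></p></body></html>"
parser = TagTextExtractor(tags=["p", "b"])
parser.feed(html)
print(" ".join(parser.chunks))  # -> Hello world
```

Note that the `<nav>` content is dropped because `nav` is not in the wanted-tag set; this tag-scoping is the main reason to prefer a transformer over naively stripping all markup.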
# Doctran

>[Doctran](https://github.com/psychic-api/doctran) is a Python package that uses LLMs and open-source
> NLP libraries to transform raw text into clean, structured, information-dense documents
> optimized for vector space retrieval. You can think of `Doctran` as a black box where
> messy strings go in and nice, clean, labelled strings come out.

## Installation and Setup

```bash
pip install doctran
```

## Document Transformers

### Document Interrogator

See a [usage example for DoctranQATransformer](/docs/integrations/document_transformers/doctran_interrogate_document).

```python
from langchain.document_transformers import DoctranQATransformer
```

### Property Extractor

See a [usage example for DoctranPropertyExtractor](/docs/integrations/document_transformers/doctran_extract_properties).

```python
from langchain.document_transformers import DoctranPropertyExtractor
```

### Document Translator

See a [usage example for DoctranTextTranslator](/docs/integrations/document_transformers/doctran_translate_document).

```python
from langchain.document_transformers import DoctranTextTranslator
```
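All three Doctran transformers share LangChain's document-transformer shape: documents go in, documents with enriched metadata come out via `transform_documents`. The sketch below mirrors only that interface with a toy metadata-filling transformer; the `Document` stand-in and `WordCountPropertyExtractor` are invented for illustration, and the real Doctran transformers call an LLM to produce their metadata.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    # Minimal stand-in for LangChain's Document (page_content + metadata).
    page_content: str
    metadata: dict = field(default_factory=dict)

class WordCountPropertyExtractor:
    """Toy transformer with the same shape as a Doctran transformer:
    documents in, documents with enriched metadata out."""

    def transform_documents(self, documents):
        out = []
        for doc in documents:
            props = {"n_words": len(doc.page_content.split())}
            out.append(Document(doc.page_content, {**doc.metadata, **props}))
        return out

docs = [Document("messy strings go in")]
transformed = WordCountPropertyExtractor().transform_documents(docs)
print(transformed[0].metadata)  # -> {'n_words': 4}
```

Because every transformer exposes the same method, they can be swapped or chained without changing the surrounding pipeline code.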
# Google Document AI

>[Document AI](https://cloud.google.com/document-ai/docs/overview) is a `Google Cloud Platform`
> service that transforms unstructured data from documents into structured data, making it easier
> to understand, analyze, and consume.

## Installation and Setup

You need to set up a `GCS` bucket and [create your own OCR processor](https://cloud.google.com/document-ai/docs/create-processor).
The `GCS_OUTPUT_PATH` should be a path to a folder on GCS (starting with `gs://`),
and the processor name should look like `projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID`.
You can get it either programmatically or copy it from the `Prediction endpoint` section of the `Processor details`
tab in the Google Cloud Console.

```bash
pip install google-cloud-documentai
pip install google-cloud-documentai-toolbox
```

## Document Transformer

See a [usage example](/docs/integrations/document_transformers/docai).

```python
from langchain.document_loaders.blob_loaders import Blob
from langchain.document_loaders.parsers import DocAIParser
```
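A quick sanity check on the two identifiers described above can save a confusing API error later. The sketch below validates the processor-name and `GCS` output-path formats with standard-library regular expressions; it assumes a numeric project number and a hexadecimal processor ID, which is how these values commonly appear, and the helper names are invented for the example.

```python
import re

# Expected shape from the setup notes above:
#   projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID
PROCESSOR_RE = re.compile(
    r"^projects/\d+/locations/[a-z0-9-]+/processors/[a-f0-9]+$"
)

def is_valid_processor_name(name: str) -> bool:
    return PROCESSOR_RE.fullmatch(name) is not None

def is_valid_gcs_output_path(path: str) -> bool:
    # GCS_OUTPUT_PATH must point at a bucket folder, i.e. start with gs://
    return path.startswith("gs://") and len(path) > len("gs://")

name = "projects/123456789/locations/us/processors/0a1b2c3d4e5f"
print(is_valid_processor_name(name))              # -> True
print(is_valid_processor_name("processors/xyz"))  # -> False
print(is_valid_gcs_output_path("gs://my-bucket/docai-output"))  # -> True
```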
# HTML to text

>[html2text](https://github.com/Alir3z4/html2text/) is a Python package that converts a page of `HTML` into clean, easy-to-read plain `ASCII text`.

The ASCII output also happens to be valid `Markdown` (a text-to-HTML format).

## Installation and Setup

```bash
pip install html2text
```

## Document Transformer

See a [usage example](/docs/integrations/document_transformers/html2text).

```python
from langchain.document_transformers import Html2TextTransformer
```
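The "HTML in, Markdown-flavoured text out" idea can be illustrated with the standard library alone. This is a deliberately tiny sketch handling only headings and paragraphs, not the `html2text` package or the `Html2TextTransformer` API, and the class name is made up for the example.

```python
from html.parser import HTMLParser

class TinyHtml2Text(HTMLParser):
    """Very small illustration of html2text's idea: HTML in,
    Markdown-flavoured plain text out (headings and paragraphs only)."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        # Map heading tags to Markdown heading markers.
        if tag == "h1":
            self.prefix = "# "
        elif tag == "h2":
            self.prefix = "## "

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append(self.prefix + text)
            self.prefix = ""

p = TinyHtml2Text()
p.feed("<h1>Title</h1><p>Some body text.</p>")
print("\n\n".join(p.lines))
# -> # Title
#
#    Some body text.
```

The real package handles links, lists, emphasis, tables, and wrapping; the point here is only that the output doubles as readable text and valid Markdown.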
# Nuclia

>[Nuclia](https://nuclia.com) automatically indexes your unstructured data from any internal
> and external source, providing optimized search results and generative answers.
> It can handle video and audio transcription, image content extraction, and document parsing.

>The `Nuclia Understanding API` document transformer splits text into paragraphs and sentences,
> identifies entities, provides a summary of the text, and generates embeddings for all the sentences.

## Installation and Setup

We need to install the `nucliadb-protos` package to use the `Nuclia Understanding API`.

```bash
pip install nucliadb-protos
```

To use the `Nuclia Understanding API`, we need to have a `Nuclia account`.
We can create one for free at [https://nuclia.cloud](https://nuclia.cloud),
and then [create a NUA key](https://docs.nuclia.dev/docs/docs/using/understanding/intro).

To use the Nuclia document transformer, we need to instantiate a `NucliaUnderstandingAPI`
tool with `enable_ml` set to `True`:

```python
from langchain.tools.nuclia import NucliaUnderstandingAPI

nua = NucliaUnderstandingAPI(enable_ml=True)
```

## Document Transformer

See a [usage example](/docs/integrations/document_transformers/nuclia_transformer).

```python
from langchain.document_transformers.nuclia_text_transform import NucliaTextTransformer
```
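Of the steps the transformer performs, the paragraph-and-sentence splitting is easy to picture with a local sketch. The helper below mimics only that splitting step with standard-library regular expressions; entity recognition, summaries, and embeddings are computed server-side by the Nuclia API, and the function name and regexes here are invented for illustration.

```python
import re

def split_paragraphs_and_sentences(text: str):
    """Split text into paragraphs (blank-line separated), then split each
    paragraph into naive sentences on terminal punctuation."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Split after ., ! or ? followed by whitespace, keeping the punctuation.
    return [re.split(r"(?<=[.!?])\s+", p) for p in paragraphs]

text = "First sentence. Second one!\n\nNew paragraph here."
print(split_paragraphs_and_sentences(text))
# -> [['First sentence.', 'Second one!'], ['New paragraph here.']]
```

Real sentence segmentation must handle abbreviations, quotes, and non-Latin scripts, which is part of why the transformer delegates this to the hosted API.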