langchain/docs/extras/modules/data_connection/document_loaders/integrations
Matt Robinson 3c489be773
feat: optional post-processing for Unstructured loaders (#7850)
### Summary

Adds a post-processing method for Unstructured loaders that allows users
to optionally modify or clean extracted elements.

### Testing

```python
from langchain.document_loaders import UnstructuredFileLoader
from unstructured.cleaners.core import clean_extra_whitespace

loader = UnstructuredFileLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="elements",
    post_processors=[clean_extra_whitespace],
)

docs = loader.load()
docs[:5]
```
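The loader call above passes a cleaner from `unstructured`, but the summary says users may modify elements as they see fit. Below is a minimal sketch of what a user-written post-processor could look like. It assumes each entry in `post_processors` is a plain callable that takes an element's text and returns the cleaned text, as `clean_extra_whitespace` does. The names `collapse_whitespace`, `strip_page_markers`, and `apply_post_processors` are hypothetical and only illustrate applying the callables in order; they are not the loader's actual internals.

```python
import re
from typing import Callable, List


def collapse_whitespace(text: str) -> str:
    """Hypothetical cleaner: collapse whitespace runs into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def strip_page_markers(text: str) -> str:
    """Hypothetical cleaner: drop markers such as "[page 3]"."""
    return re.sub(r"\[page \d+\]", "", text)


def apply_post_processors(
    text: str, post_processors: List[Callable[[str], str]]
) -> str:
    """Illustrative stand-in: run each callable over the text, in list order."""
    for post_processor in post_processors:
        text = post_processor(text)
    return text


raw = "1 Introduction   [page 3]\n  Deep learning   "
cleaned = apply_post_processors(raw, [strip_page_markers, collapse_whitespace])
print(cleaned)
```

Under that assumption, functions such as these could be passed in the `post_processors` list in the same way as `clean_extra_whitespace`. Order matters: removing the page markers first lets the whitespace cleaner tidy up the gaps they leave behind.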


### Reviewers
  - @rlancemartin
  - @eyurtsev
  - @hwchase17
2023-07-17 12:13:05 -07:00
example_data codespell: workflow, config + some (quite a few) typos fixed (#6785) 2023-07-12 16:20:08 -04:00
acreom.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
airbyte_json.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
airtable.ipynb docstrings document_loaders 1 (#6847) 2023-07-02 12:13:04 -07:00
alibaba_cloud_maxcompute.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
apify_dataset.ipynb docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
arxiv.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
aws_s3_directory.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
aws_s3_file.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
azlyrics.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
azure_blob_storage_container.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
azure_blob_storage_file.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
bibtex.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
bilibili.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
blackboard.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
blockchain.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
brave_search.ipynb Fix make docs_build and related scripts (#7276) 2023-07-11 22:05:14 -04:00
browserless.ipynb add browserless loader (#7562) 2023-07-13 13:18:28 -07:00
chatgpt_loader.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
college_confidential.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
confluence.ipynb fix titles in documentation 2023-06-17 11:09:11 -07:00
conll-u.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
copypaste.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
csv.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
cube_semantic.ipynb Document loader for Cube Semantic Layer (#6882) 2023-07-05 15:18:12 -07:00
datadog_logs.ipynb Fix make docs_build and related scripts (#7276) 2023-07-11 22:05:14 -04:00
diffbot.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
discord.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
docugami.ipynb docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
duckdb.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
email.ipynb feat: enable UnstructuredEmailLoader to process attachments (#6977) 2023-07-01 06:09:26 -07:00
embaas.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
epub.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
evernote.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
excel.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
facebook_chat.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
fauna.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
figma.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
git.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
gitbook.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
github.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
google_bigquery.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
google_cloud_storage_directory.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
google_cloud_storage_file.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
google_drive.ipynb Fix make docs_build and related scripts (#7276) 2023-07-11 22:05:14 -04:00
grobid.ipynb Fix make docs_build and related scripts (#7276) 2023-07-11 22:05:14 -04:00
gutenberg.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
hacker_news.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
hugging_face_dataset.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
ifixit.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
image_captions.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
image.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
imsdb.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
iugu.ipynb Minor Grammar Fixes in Docs and Comments (#6536) 2023-06-21 09:53:31 -07:00
joplin.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
jupyter_notebook.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
larksuite.ipynb feat (documents): add LarkSuite document loader (#6420) 2023-06-27 23:08:05 -07:00
mastodon.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
mediawikidump.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
merge_doc_loader.ipynb Fix make docs_build and related scripts (#7276) 2023-07-11 22:05:14 -04:00
mhtml.ipynb Fix make docs_build and related scripts (#7276) 2023-07-11 22:05:14 -04:00
microsoft_onedrive.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
microsoft_powerpoint.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
microsoft_word.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
modern_treasury.ipynb Minor Grammar Fixes in Docs and Comments (#6536) 2023-06-21 09:53:31 -07:00
notion.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
notiondb.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
obsidian.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
odt.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
open_city_data.ipynb Fix make docs_build and related scripts (#7276) 2023-07-11 22:05:14 -04:00
org_mode.ipynb Fix make docs_build and related scripts (#7276) 2023-07-11 22:05:14 -04:00
pandas_dataframe.ipynb Fix make docs_build and related scripts (#7276) 2023-07-11 22:05:14 -04:00
psychic.ipynb docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
pyspark_dataframe.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
readthedocs_documentation.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
recursive_url_loader.ipynb Make recursive loader yield while crawling (#7568) 2023-07-13 21:55:20 -07:00
reddit.ipynb docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
roam.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
rockset.ipynb Integrate Rockset as a document loader (#7681) 2023-07-14 07:58:13 -07:00
rst.ipynb Fix make docs_build and related scripts (#7276) 2023-07-11 22:05:14 -04:00
sitemap.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
slack.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
snowflake.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
source_code.ipynb Fix make docs_build and related scripts (#7276) 2023-07-11 22:05:14 -04:00
spreedly.ipynb Minor Grammar Fixes in Docs and Comments (#6536) 2023-06-21 09:53:31 -07:00
stripe.ipynb Minor Grammar Fixes in Docs and Comments (#6536) 2023-06-21 09:53:31 -07:00
subtitle.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
telegram.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
tencent_cos_directory.ipynb Fix make docs_build and related scripts (#7276) 2023-07-11 22:05:14 -04:00
tencent_cos_file.ipynb Fix make docs_build and related scripts (#7276) 2023-07-11 22:05:14 -04:00
tomarkdown.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
toml.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
trello.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
tsv.ipynb feat: Add UnstructuredTSVLoader (#7367) 2023-07-10 03:07:10 -04:00
twitter.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
unstructured_file.ipynb feat: optional post-processing for Unstructured loaders (#7850) 2023-07-17 12:13:05 -07:00
url.ipynb Add markdown to specify important arguments (#6246) 2023-06-18 17:47:00 -07:00
weather.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
web_base.ipynb Fix make docs_build and related scripts (#7276) 2023-07-11 22:05:14 -04:00
whatsapp_chat.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
wikipedia.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
xml.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
xorbits.ipynb Add Xorbits Dataframe as a Document Loader (#7319) 2023-07-10 04:24:47 -04:00
youtube_audio.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
youtube_transcript.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00