langchain/docs/modules/indexes/document_loaders/examples
Eugene Yurtsev 3c490b5ba3
Docugami DataLoader (#4727)
### Adds a document loader for Docugami

Specifically:

1. Adds a data loader that talks to the [Docugami](http://docugami.com)
API to download processed documents as semantic XML
2. Parses the semantic XML into chunks, with additional metadata
capturing chunk semantics
3. Adds a detailed notebook showing how you can use additional metadata
returned by Docugami for techniques like the [self-querying
retriever](https://python.langchain.com/en/latest/modules/indexes/retrievers/examples/self_query_retriever.html)
4. Adds an integration test, and related documentation
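
As a rough illustration of item 2, the sketch below walks a small semantic XML tree and emits one chunk per leaf element, attaching the tag name and element path as metadata. It is not the loader added in this PR: the `Chunk` class, the `chunk_semantic_xml` helper, the sample XML, and its tag names are all hypothetical, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    """Hypothetical container: chunk text plus metadata describing its semantics."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


# Illustrative semantic XML; the tag names are assumptions, not Docugami's schema.
SAMPLE_XML = """
<Lease>
  <Parties>
    <Landlord>Acme Properties LLC</Landlord>
    <Tenant>Jane Doe</Tenant>
  </Parties>
  <Term>
    <StartDate>2023-01-01</StartDate>
  </Term>
</Lease>
"""


def chunk_semantic_xml(xml_text: str, source: str) -> List[Chunk]:
    """Return one chunk per leaf element, in document order."""
    root = ET.fromstring(xml_text)
    chunks: List[Chunk] = []

    def walk(node: ET.Element, path: List[str]) -> None:
        current = path + [node.tag]
        text = (node.text or "").strip()
        # A leaf with non-empty text becomes a chunk; its tag and path are the metadata.
        if len(node) == 0 and text:
            chunks.append(
                Chunk(
                    text=text,
                    metadata={
                        "source": source,
                        "tag": node.tag,
                        "xpath": "/" + "/".join(current),
                    },
                )
            )
        for child in node:
            walk(child, current)

    walk(root, [])
    return chunks


if __name__ == "__main__":
    for chunk in chunk_semantic_xml(SAMPLE_XML, source="lease.xml"):
        print(chunk.metadata["xpath"], "->", chunk.text)
```

Keeping the tag and path next to each chunk is what would let a downstream retriever filter on structure (for example, only `Landlord` chunks) rather than on raw text alone.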

Here is an example of a result that is not possible without the
capabilities added by Docugami (from the notebook):

<img width="1585" alt="image"
src="https://github.com/hwchase17/langchain/assets/749277/bb6c1ce3-13dc-4349-a53b-de16681fdd5b">

---------

Co-authored-by: Taqi Jaffri <tjaffri@docugami.com>
Co-authored-by: Taqi Jaffri <tjaffri@gmail.com>
2023-05-15 10:53:00 -04:00
example_data Harrison/sitemap local (#4704) 2023-05-14 22:04:38 -07:00
airbyte_json.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
apify_dataset.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
arxiv.ipynb Deleted importing Document from document_loaders.base because Documen… (#4068) 2023-05-03 17:54:30 -07:00
aws_s3_directory.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
aws_s3_file.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
azlyrics.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
azure_blob_storage_container.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
azure_blob_storage_file.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
bilibili.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
blackboard.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
blockchain.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
chatgpt_loader.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
college_confidential.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
confluence.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
conll-u.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
copypaste.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
csv.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
diffbot.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
discord_loader.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
docugami.ipynb Docugami DataLoader (#4727) 2023-05-15 10:53:00 -04:00
duckdb.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
email.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
epub.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
evernote.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
facebook_chat.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
figma.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
file_directory.ipynb Harrison/multithreading directory loader (#4650) 2023-05-13 21:46:02 -07:00
git.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
gitbook.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
google_bigquery.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
google_cloud_storage_directory.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
google_cloud_storage_file.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
google_drive.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
gutenberg.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
hacker_news.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
html.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
hugging_face_dataset.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
ifixit.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
image_captions.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
image.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
imsdb.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
json_loader.ipynb JSON loader (#4067) 2023-05-05 14:48:13 -07:00
jupyter_notebook.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
markdown.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
mediawikidump.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
microsoft_onedrive.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
microsoft_powerpoint.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
microsoft_word.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
modern_treasury.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
notion.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
notiondb.ipynb Harrison/param notion db (#4689) 2023-05-14 18:26:25 -07:00
obsidian.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
odt.ipynb feat: add loader for open office odt files (#4405) 2023-05-10 01:37:17 -07:00
pandas_dataframe.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
pdf.ipynb Feature: pdfplumber PDF loader with BaseBlobParser (#4552) 2023-05-15 09:47:02 -04:00
readthedocs_documentation.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
reddit.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
roam.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
sitemap.ipynb Harrison/sitemap local (#4704) 2023-05-14 22:04:38 -07:00
slack.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
spreedly.ipynb Vwp/docs improved document loaders (#4006) 2023-05-02 15:24:53 -07:00
stripe.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
subtitle.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
telegram.ipynb Harrison/telegram chat loader (#4698) 2023-05-14 22:04:27 -07:00
toml.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
twitter.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
unstructured_file.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
url.ipynb Harrison/playwright (#2871) 2023-04-13 22:15:03 -07:00
web_base.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
whatsapp_chat.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00
wikipedia.ipynb added Wikipedia document loader (#4141) 2023-05-06 09:32:45 -07:00
youtube_transcript.ipynb docs: document_loaders improvements (#4200) 2023-05-05 17:44:54 -07:00