diff --git a/docs/modules/indexes/document_loaders.rst b/docs/modules/indexes/document_loaders.rst
index 45307041..4e301fee 100644
--- a/docs/modules/indexes/document_loaders.rst
+++ b/docs/modules/indexes/document_loaders.rst
@@ -6,19 +6,126 @@ Document Loaders
 Combining language models with your own text data is a powerful way to differentiate them.
-The first step in doing this is to load the data into "documents" - a fancy way of say some pieces of text.
-This module is aimed at making this easy.
+The first step in doing this is to load the data into "Documents" - a fancy way of saying some pieces of text.
+Document loaders are aimed at making this easy.

-A primary driver of a lot of this is the `Unstructured `_ python package.
-This package is a great way to transform all types of files - text, powerpoint, images, html, pdf, etc - into text data.
+
+The following document loaders are provided:
+
+
+Transform loaders
+------------------------------
+
+These **transform** loaders transform data from a specific format into the Document format.
+For example, there are **transformers** for CSV and SQL.
+Most of these loaders read data from files, but some read from URLs.
+
+A primary driver of a lot of these transformers is the `Unstructured `_ Python package.
+This package transforms many types of files - text, PowerPoint, images, HTML, PDF, etc. - into text data.

 For detailed instructions on how to get set up with Unstructured, see installation guidelines `here `_.

-The following document loaders are provided:
+
+.. toctree::
+   :maxdepth: 1
+   :glob:
+
+   ./document_loaders/examples/conll-u.ipynb
+   ./document_loaders/examples/copypaste.ipynb
+   ./document_loaders/examples/csv.ipynb
+   ./document_loaders/examples/email.ipynb
+   ./document_loaders/examples/epub.ipynb
+   ./document_loaders/examples/evernote.ipynb
+   ./document_loaders/examples/facebook_chat.ipynb
+   ./document_loaders/examples/file_directory.ipynb
+   ./document_loaders/examples/html.ipynb
+   ./document_loaders/examples/image.ipynb
+   ./document_loaders/examples/jupyter_notebook.ipynb
+   ./document_loaders/examples/markdown.ipynb
+   ./document_loaders/examples/microsoft_powerpoint.ipynb
+   ./document_loaders/examples/microsoft_word.ipynb
+   ./document_loaders/examples/pandas_dataframe.ipynb
+   ./document_loaders/examples/pdf.ipynb
+   ./document_loaders/examples/sitemap.ipynb
+   ./document_loaders/examples/subtitle.ipynb
+   ./document_loaders/examples/telegram.ipynb
+   ./document_loaders/examples/toml.ipynb
+   ./document_loaders/examples/unstructured_file.ipynb
+   ./document_loaders/examples/url.ipynb
+   ./document_loaders/examples/web_base.ipynb
+   ./document_loaders/examples/whatsapp_chat.ipynb
+
+
+
+Public dataset or service loaders
+----------------------------------
+These datasets and services are publicly available; the loaders run queries against them
+and download the matching documents.
+For example, the **Hacker News** service.
+
+No access permissions are needed for these datasets and services.
+
+
+.. toctree::
+   :maxdepth: 1
+   :glob:
+
+   ./document_loaders/examples/arxiv.ipynb
+   ./document_loaders/examples/azlyrics.ipynb
+   ./document_loaders/examples/bilibili.ipynb
+   ./document_loaders/examples/college_confidential.ipynb
+   ./document_loaders/examples/gutenberg.ipynb
+   ./document_loaders/examples/hacker_news.ipynb
+   ./document_loaders/examples/hugging_face_dataset.ipynb
+   ./document_loaders/examples/ifixit.ipynb
+   ./document_loaders/examples/imsdb.ipynb
+   ./document_loaders/examples/mediawikidump.ipynb
+   ./document_loaders/examples/youtube_transcript.ipynb
+
+
+Proprietary dataset or service loaders
+--------------------------------------
+These datasets and services are not from the public domain.
+These loaders mostly transform data from specific formats of applications or cloud services,
+for example **Google Drive**.
+
+We need access tokens and sometimes other parameters to get access to these datasets and services.

 .. toctree::
    :maxdepth: 1
    :glob:

-   ./document_loaders/examples/*
\ No newline at end of file
+   ./document_loaders/examples/airbyte_json.ipynb
+   ./document_loaders/examples/apify_dataset.ipynb
+   ./document_loaders/examples/aws_s3_directory.ipynb
+   ./document_loaders/examples/aws_s3_file.ipynb
+   ./document_loaders/examples/azure_blob_storage_container.ipynb
+   ./document_loaders/examples/azure_blob_storage_file.ipynb
+   ./document_loaders/examples/blackboard.ipynb
+   ./document_loaders/examples/blockchain.ipynb
+   ./document_loaders/examples/chatgpt_loader.ipynb
+   ./document_loaders/examples/confluence.ipynb
+   ./document_loaders/examples/diffbot.ipynb
+   ./document_loaders/examples/discord_loader.ipynb
+   ./document_loaders/examples/duckdb.ipynb
+   ./document_loaders/examples/figma.ipynb
+   ./document_loaders/examples/gitbook.ipynb
+   ./document_loaders/examples/git.ipynb
+   ./document_loaders/examples/google_bigquery.ipynb
+   ./document_loaders/examples/google_cloud_storage_directory.ipynb
+   ./document_loaders/examples/google_cloud_storage_file.ipynb
+   ./document_loaders/examples/google_drive.ipynb
+   ./document_loaders/examples/image_captions.ipynb
+   ./document_loaders/examples/microsoft_onedrive.ipynb
+   ./document_loaders/examples/modern_treasury.ipynb
+   ./document_loaders/examples/notiondb.ipynb
+   ./document_loaders/examples/notion.ipynb
+   ./document_loaders/examples/obsidian.ipynb
+   ./document_loaders/examples/readthedocs_documentation.ipynb
+   ./document_loaders/examples/reddit.ipynb
+   ./document_loaders/examples/roam.ipynb
+   ./document_loaders/examples/slack.ipynb
+   ./document_loaders/examples/spreedly.ipynb
+   ./document_loaders/examples/stripe.ipynb
+   ./document_loaders/examples/twitter.ipynb
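
Note on the "transform loaders" wording in this change: the idea of turning a specific format into the Document format can be illustrated with a minimal, standard-library-only sketch. The `Document` class and the `load_csv_documents` function below are hypothetical stand-ins invented for illustration; they are not the library's own classes or loader API, and the one-document-per-row behaviour is an assumption.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a loaded document: a piece of text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_csv_documents(csv_text: str, source: str = "inline") -> list:
    """Illustrative transform loader: turn each CSV row into one Document.

    This is an assumed design for the sketch, not the library's CSV loader.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        # Render the row as "column: value" lines so it becomes plain text.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            Document(
                page_content=content,
                metadata={"source": source, "row": row_number},
            )
        )
    return documents


sample = "name,role\nAda,engineer\nGrace,admiral\n"
docs = load_csv_documents(sample)
```

Each resulting `Document` carries the row text in `page_content` and its origin in `metadata`, which is the shape the section describes as "some pieces of text".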