langchain/docs/extras/modules/data_connection/document_loaders/integrations
Lance Martin c2b25c17c5
Recursive URL loader (#6455)
We may want to load all URLs under a root directory.

For example, let's look at the [LangChain JS
documentation](https://js.langchain.com/docs/).

This has many interesting child pages that we may want to read in bulk.

Of course, the `WebBaseLoader` can load a list of pages.

The challenge is traversing the tree of child pages and actually
assembling that list.

We do this using the `RecursiveUrlLoader`.

This also gives us the flexibility to exclude some children (e.g., the
`api` directory, which has more than 800 child pages).
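
The traversal described above can be sketched with the standard library alone. This is a minimal illustration of the idea, not the loader's actual implementation: `collect_child_urls`, `LinkParser`, the `fetch` callable, and the in-memory `SITE` mapping are all hypothetical names introduced here, and the `example.com` URLs stand in for a real documentation site.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def collect_child_urls(root, fetch, exclude_dirs=()):
    """Return, sorted, every URL reachable from `root` that stays under `root`.

    `fetch` is a hypothetical callable mapping a URL to its HTML text.
    `exclude_dirs` is a sequence of URL prefixes that are never visited.
    """
    seen = set()
    stack = [root]
    while stack:
        url = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            child, _ = urldefrag(urljoin(url, href))
            if not child.startswith(root):
                continue  # link leaves the root directory
            if any(child.startswith(prefix) for prefix in exclude_dirs):
                continue  # link falls under an excluded directory
            if child not in seen:
                stack.append(child)
    return sorted(seen)


# Tiny in-memory "site" standing in for real HTTP responses (assumption).
SITE = {
    "https://example.com/docs/": (
        '<a href="guide/">guide</a> <a href="api/">api</a> '
        '<a href="https://other.com/">external</a>'
    ),
    "https://example.com/docs/guide/": '<a href="../">up</a> <a href="intro">intro</a>',
    "https://example.com/docs/guide/intro": "",
    "https://example.com/docs/api/": '<a href="Class1">Class1</a>',
    "https://example.com/docs/api/Class1": "",
}

urls = collect_child_urls(
    "https://example.com/docs/",
    fetch=lambda u: SITE.get(u, ""),
    exclude_dirs=["https://example.com/docs/api"],
)
print(urls)
```

With the sample data, the `api` subtree and the external link are skipped, so only the root page and the two `guide` pages are collected. A list assembled this way is the kind of input a page loader that accepts a list of URLs could then consume.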
2023-06-23 13:09:00 -07:00
example_data Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
acreom.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
airbyte_json.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
airtable.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
alibaba_cloud_maxcompute.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
apify_dataset.ipynb docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
arxiv.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
aws_s3_directory.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
aws_s3_file.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
azlyrics.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
azure_blob_storage_container.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
azure_blob_storage_file.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
bibtex.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
bilibili.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
blackboard.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
blockchain.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
chatgpt_loader.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
college_confidential.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
confluence.ipynb fix titles in documentation 2023-06-17 11:09:11 -07:00
conll-u.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
copypaste.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
csv.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
diffbot.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
discord.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
docugami.ipynb docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
duckdb.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
email.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
embaas.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
epub.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
evernote.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
excel.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
facebook_chat.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
fauna.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
figma.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
git.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
gitbook.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
github.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
google_bigquery.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
google_cloud_storage_directory.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
google_cloud_storage_file.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
google_drive.ipynb Harrison/gdrive enhancements (#6375) 2023-06-18 11:07:23 -07:00
gutenberg.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
hacker_news.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
hugging_face_dataset.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
ifixit.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
image_captions.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
image.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
imsdb.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
iugu.ipynb Minor Grammar Fixes in Docs and Comments (#6536) 2023-06-21 09:53:31 -07:00
joplin.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
jupyter_notebook.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
mastodon.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
mediawikidump.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
merge_doc_loader.ipynb Create merge loader that combines documents from a set of loaders (#6659) 2023-06-23 13:02:48 -07:00
microsoft_onedrive.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
microsoft_powerpoint.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
microsoft_word.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
modern_treasury.ipynb Minor Grammar Fixes in Docs and Comments (#6536) 2023-06-21 09:53:31 -07:00
notion.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
notiondb.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
obsidian.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
odt.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
open_city_data.ipynb Loader for OpenCityData and minor cleanups to Pandas, Airtable loaders (#6301) 2023-06-22 22:20:42 -07:00
pandas_dataframe.ipynb Loader for OpenCityData and minor cleanups to Pandas, Airtable loaders (#6301) 2023-06-22 22:20:42 -07:00
psychic.ipynb docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
pyspark_dataframe.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
readthedocs_documentation.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
recursive_url_loader.ipynb Recursive URL loader (#6455) 2023-06-23 13:09:00 -07:00
reddit.ipynb docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
roam.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
sitemap.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
slack.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
snowflake.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
spreedly.ipynb Minor Grammar Fixes in Docs and Comments (#6536) 2023-06-21 09:53:31 -07:00
stripe.ipynb Minor Grammar Fixes in Docs and Comments (#6536) 2023-06-21 09:53:31 -07:00
subtitle.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
telegram.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
tomarkdown.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
toml.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
trello.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
twitter.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
unstructured_file.ipynb Harrison/unstructured page number (#6464) 2023-06-19 22:31:43 -07:00
url.ipynb Add markdown to specify important arguments (#6246) 2023-06-18 17:47:00 -07:00
weather.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
web_base.ipynb Update web_base.ipynb (#6430) 2023-06-19 21:43:35 -07:00
whatsapp_chat.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
wikipedia.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
xml.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
youtube_audio.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
youtube_transcript.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00