{ "cells": [ { "cell_type": "markdown", "id": "5a7cc773", "metadata": {}, "source": [ "# Recursive URL Loader\n", "\n", "We may want to load all URLs under a root directory.\n", "\n", "For example, let's look at the [LangChain JS documentation](https://js.langchain.com/docs/).\n", "\n", "This has many interesting child pages that we may want to read in bulk.\n", "\n", "Of course, the `WebBaseLoader` can load a list of pages.\n", "\n", "The challenge is traversing the tree of child pages and actually assembling that list!\n", "\n", "We do this using the `RecursiveUrlLoader`.\n", "\n", "This also gives us the flexibility to exclude some children (e.g., the `api` directory, which has more than 800 child pages)." ] }, { "cell_type": "code", "execution_count": 1, "id": "2e3532b2", "metadata": {}, "outputs": [], "source": [ "from langchain.document_loaders.recursive_url_loader import RecursiveUrlLoader" ] }, { "cell_type": "markdown", "id": "6384c057", "metadata": {}, "source": [ "Let's try a simple example." 
] }, { "cell_type": "code", "execution_count": 2, "id": "d69e5620", "metadata": {}, "outputs": [], "source": [ "url = \"https://js.langchain.com/docs/modules/memory/examples/\"\n", "loader = RecursiveUrlLoader(url=url)\n", "docs = loader.load()" ] }, { "cell_type": "code", "execution_count": 3, "id": "084fb2ce", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "12" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(docs)" ] }, { "cell_type": "code", "execution_count": 4, "id": "89355b7c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\n\\n\\n\\n\\nBuffer Window Memory | 🦜️🔗 Langchain\\n\\n\\n\\n\\n\\nSki'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs[0].page_content[:50]" ] }, { "cell_type": "code", "execution_count": 5, "id": "13bd7e16", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'source': 'https://js.langchain.com/docs/modules/memory/examples/buffer_window_memory',\n", " 'title': 'Buffer Window Memory | 🦜️🔗 Langchain',\n", " 'description': 'BufferWindowMemory keeps track of the back-and-forths in conversation, and then uses a window of size k to surface the last k back-and-forths to use as memory.',\n", " 'language': 'en'}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs[0].metadata" ] }, { "cell_type": "markdown", "id": "40fc13ef", "metadata": {}, "source": [ "Now, let's try a more extensive example, the `docs` root dir.\n", "\n", "We will skip everything under `api`.\n", "\n", "For this, we can `lazy_load` each page as we crawl the tree, using `WebBaseLoader` to load each as we go." 
] }, { "cell_type": "code", "execution_count": null, "id": "5c938b9f", "metadata": {}, "outputs": [], "source": [ "url = \"https://js.langchain.com/docs/\"\n", "exclude_dirs = [\"https://js.langchain.com/docs/api/\"]\n", "loader = RecursiveUrlLoader(url=url, exclude_dirs=exclude_dirs)\n", "# Lazily load each page, printing it as soon as it is crawled\n", "docs = []\n", "for doc in loader.lazy_load():\n", "    print(doc)\n", "    docs.append(doc)" ] }, { "cell_type": "code", "execution_count": 7, "id": "30ff61d3", "metadata": {}, "outputs": [], "source": [ "# Alternatively, load all pages in a single call\n", "docs = loader.load()" ] }, { "cell_type": "code", "execution_count": 8, "id": "457e30f3", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "188" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(docs)" ] }, { "cell_type": "code", "execution_count": 9, "id": "bca80b4a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\n\\n\\n\\n\\nAgent Simulations | 🦜️🔗 Langchain\\n\\n\\n\\n\\n\\nSkip t'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs[0].page_content[:50]" ] }, { "cell_type": "code", "execution_count": 10, "id": "df97cf22", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'source': 'https://js.langchain.com/docs/use_cases/agent_simulations/',\n", " 'title': 'Agent Simulations | 🦜️🔗 Langchain',\n", " 'description': 'Agent simulations involve taking multiple agents and having them interact with each other.',\n", " 'language': 'en'}" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs[0].metadata" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 5 }