Description: This PR improves the function of recursive_url_loader, such
as limiting the depth of the access, and customizable extractors(from
the raw webpage to the text of the Document object), so that users can
use other tools to extract the webpage. This PR also includes the
document and test for the new loader.
Old PR closed due to project structure change. #7756
Because socket requests are not allowed, the old unit test was removed.
Issue: N/A
Dependencies: asyncio, aiohttp
Tag maintainer: @rlancemartin
Twitter handle: @ Zend_Nihility
---------
Co-authored-by: Lance Martin <lance@langchain.dev>
"We may want to process load all URLs under a root directory.\n",
"\n",
"For example, let's look at the [LangChain JS documentation](https://js.langchain.com/docs/).\n",
"For example, let's look at the [Python 3.9 Document](https://docs.python.org/3.9/).\n",
"\n",
"This has many interesting child pages that we may want to read in bulk.\n",
"\n",
@ -19,13 +19,28 @@
" \n",
"We do this using the `RecursiveUrlLoader`.\n",
"\n",
"This also gives us the flexibility to exclude some children (e.g., the `api` directory with > 800 child pages)."
"This also gives us the flexibility to exclude some children, customize the extractor, and more."
]
},
{
"cell_type": "markdown",
"id": "1be8094f",
"metadata": {},
"source": [
"# Parameters\n",
"- url: str, the target url to crawl.\n",
"- exclude_dirs: Optional[str], webpage directories to exclude.\n",
"- use_async: Optional[bool], wether to use async requests, using async requests is usually faster in large tasks. However, async will disable the lazy loading feature(the function still works, but it is not lazy). By default, it is set to False.\n",
"- extractor: Optional[Callable[[str], str]], a function to extract the text of the document from the webpage, by default it returns the page as it is. It is recommended to use tools like goose3 and beautifulsoup to extract the text. By default, it just returns the page as it is.\n",
"- max_depth: Optional[int] = None, the maximum depth to crawl. By default, it is set to 2. If you need to crawl the whole website, set it to a number that is large enough would simply do the job.\n",
"- timeout: Optional[int] = None, the timeout for each request, in the unit of seconds. By default, it is set to 10.\n",
"- prevent_outside: Optional[bool] = None, whether to prevent crawling outside the root url. By default, it is set to True."
" 'description': 'BufferWindowMemory keeps track of the back-and-forths in conversation, and then uses a window of size k to surface the last k back-and-forths to use as memory.',\n",
" 'title': 'The Python Standard Library — Python 3.9.17 documentation',\n",
" 'language': None}"
]
},
"execution_count": 5,
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0].metadata"
"docs[-1].metadata"
]
},
{
"cell_type": "markdown",
"id": "40fc13ef",
"id": "5866e5a6",
"metadata": {},
"source": [
"Now, let's try a more extensive example, the `docs` root dir.\n",
"\n",
"We will skip everything under `api`.\n",
"\n",
"For this, we can `lazy_load` each page as we crawl the tree, using `WebBaseLoader` to load each as we go."
"However, since it's hard to perform a perfect filter, you may still see some irrelevant results in the results. You can perform a filter on the returned documents by yourself, if it's needed. Most of the time, the returned results are good enough."