diff --git a/docs/modules/indexes/document_loaders/examples/sitemap.ipynb b/docs/modules/indexes/document_loaders/examples/sitemap.ipynb
index f27dad30..98f103b6 100644
--- a/docs/modules/indexes/document_loaders/examples/sitemap.ipynb
+++ b/docs/modules/indexes/document_loaders/examples/sitemap.ipynb
@@ -146,6 +146,73 @@
     "documents[0]"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Add custom scraping rules\n",
+    "\n",
+    "The `SitemapLoader` uses `beautifulsoup4` to scrape each page and, by default, scrapes every element on it. The `SitemapLoader` constructor also accepts a custom scraping function through its `parsing_function` argument, which lets you tailor the scraping process to your needs; for example, you might want to skip headers or navigation elements.\n",
+    "\n",
+    "The following example shows how to define and use a custom function that skips navigation and header elements."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Install the `beautifulsoup4` library, then import it and define the custom function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pip install beautifulsoup4"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from bs4 import BeautifulSoup\n",
+    "\n",
+    "def remove_nav_and_header_elements(content: BeautifulSoup) -> str:\n",
+    "    # Find all 'nav' and 'header' elements in the BeautifulSoup object\n",
+    "    nav_elements = content.find_all('nav')\n",
+    "    header_elements = content.find_all('header')\n",
+    "\n",
+    "    # Remove each 'nav' and 'header' element from the BeautifulSoup object\n",
+    "    for element in nav_elements + header_elements:\n",
+    "        element.decompose()\n",
+    "\n",
+    "    return str(content.get_text())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Pass your custom function to the `SitemapLoader` constructor."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = SitemapLoader(\n",
+    "    \"https://langchain.readthedocs.io/sitemap.xml\",\n",
+    "    filter_urls=[\"https://python.langchain.com/en/latest/\"],\n",
+    "    parsing_function=remove_nav_and_header_elements\n",
+    ")"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},