From 0b4a51930c37ec946aefe5a2d5f621d0a743f4f9 Mon Sep 17 00:00:00 2001
From: Soos3D <99700157+soos3d@users.noreply.github.com>
Date: Wed, 7 Jun 2023 23:16:51 -0300
Subject: [PATCH] Add how to use a custom scraping function with the sitemap
 loader. (#5847)

Hi! I just added an example of how to use a custom scraping function
with the sitemap loader. I recently used this feature and had to dig
into the source code to find it. I thought it might be useful to other
devs to have an example directly in the Jupyter Notebook.

I only added the example to the documentation page.
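
For reference, here is a minimal standalone sketch of the kind of
parsing function the example adds. It is not the exact notebook cell:
the `sample_html` page is hypothetical, and the sketch passes both tag
names to a single `find_all` call instead of making two calls. It only
needs `beautifulsoup4` and does not touch the network.

```python
from bs4 import BeautifulSoup


def remove_nav_and_header_elements(content: BeautifulSoup) -> str:
    """Drop <nav> and <header> elements, then return the remaining text."""
    # Passing a list of tag names matches any of them in one pass.
    for element in content.find_all(["nav", "header"]):
        element.decompose()
    return content.get_text()


# Hypothetical page, used only to illustrate the behavior.
sample_html = (
    "<html><body>"
    "<header>Site header</header>"
    "<nav>Home | Docs</nav>"
    "<p>Main content</p>"
    "</body></html>"
)
soup = BeautifulSoup(sample_html, "html.parser")
text = remove_nav_and_header_elements(soup)
print(text)
```

In the notebook, a function like this is handed to `SitemapLoader`
through the `parsing_function` argument.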

@eyurtsev I was not able to run the linter. Please let me know if I
need to do anything else.

I know this is a very small contribution, but I hope it will be
valuable. My Twitter handle is @web3Dav3.

---
 .../document_loaders/examples/sitemap.ipynb   | 67 +++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/docs/modules/indexes/document_loaders/examples/sitemap.ipynb b/docs/modules/indexes/document_loaders/examples/sitemap.ipynb
index f27dad30..98f103b6 100644
--- a/docs/modules/indexes/document_loaders/examples/sitemap.ipynb
+++ b/docs/modules/indexes/document_loaders/examples/sitemap.ipynb
@@ -146,6 +146,73 @@
     "documents[0]"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Add custom scraping rules\n",
+    "\n",
+    "The `SitemapLoader` uses `beautifulsoup4` for the scraping process, and it scrapes every element on the page by default. The `SitemapLoader` constructor accepts a custom scraping function. This feature can be helpful to tailor the scraping process to your specific needs; for example, you might want to avoid scraping headers or navigation elements.\n",
+    "\n",
+    "The following example shows how to write and use a custom function that skips navigation and header elements."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Install `beautifulsoup4` if needed, then import it and define the custom function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install beautifulsoup4"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from bs4 import BeautifulSoup\n",
+    "\n",
+    "def remove_nav_and_header_elements(content: BeautifulSoup) -> str:\n",
+    "    # Find all 'nav' and 'header' elements in the BeautifulSoup object\n",
+    "    nav_elements = content.find_all('nav')\n",
+    "    header_elements = content.find_all('header')\n",
+    "\n",
+    "    # Remove each 'nav' and 'header' element from the BeautifulSoup object\n",
+    "    for element in nav_elements + header_elements:\n",
+    "        element.decompose()\n",
+    "\n",
+    "    return str(content.get_text())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Add your custom function to the `SitemapLoader` object."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = SitemapLoader(\n",
+    "    \"https://langchain.readthedocs.io/sitemap.xml\",\n",
+    "    filter_urls=[\"https://python.langchain.com/en/latest/\"],\n",
+    "    parsing_function=remove_nav_and_header_elements\n",
+    ")"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},