Add how to use a custom scraping function with the sitemap loader. (#5847)

Hi! I just added an example of how to use a custom scraping function with the sitemap loader. I recently used this feature and had to dig in the source code to find it. I thought it might be useful to other devs to have an example in the Jupyter Notebook directly. I only added the example to the documentation page. @eyurtsev I was not able to run the lint. Please let me know if I have to do anything else. I know this is a very small contribution, but I hope it will be valuable. My Twitter handle is @web3Dav3.
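For anyone reviewing without running the notebook, here is a minimal offline sketch of what the custom function in this example does. It assumes `beautifulsoup4` is installed. The `sample_html` string and the choice of the built-in `html.parser` backend are illustrative assumptions and are not part of the notebook; the function itself mirrors the one added in the diff below.

```python
from bs4 import BeautifulSoup


def remove_nav_and_header_elements(content: BeautifulSoup) -> str:
    # Drop every <nav> and <header> element from the parsed tree
    for element in content.find_all(["nav", "header"]):
        element.decompose()
    # Return the text of whatever remains
    return content.get_text()


# Hypothetical page used only for illustration (not taken from the notebook)
sample_html = (
    "<html><body>"
    "<header><h1>Site title</h1></header>"
    "<nav><a href='/'>Home</a></nav>"
    "<main><p>Article body.</p></main>"
    "</body></html>"
)

soup = BeautifulSoup(sample_html, "html.parser")
text = remove_nav_and_header_elements(soup)
print(text)
```

The same function can then be passed to `SitemapLoader` through `parsing_function`, as the last cell of the diff shows.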
11 months ago · 0b4a51930c
parent c66755b661
commit 0b4a51930c
1 changed file with 67 additions and 0 deletions
--- a/docs/modules/indexes/document_loaders/examples/sitemap.ipynb
+++ b/docs/modules/indexes/document_loaders/examples/sitemap.ipynb
@@ -146,6 +146,73 @@
    "documents[0]"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Add custom scraping rules\n",
+    "\n",
+    "The `SitemapLoader` uses `beautifulsoup4` for the scraping process, and it scrapes every element on the page by default. The `SitemapLoader` constructor accepts a custom scraping function. This feature can be helpful to tailor the scraping process to your specific needs; for example, you might want to avoid scraping headers or navigation elements.\n",
+    "\n",
+    "The following example shows how to define and use a custom function that skips navigation and header elements."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Install the `beautifulsoup4` library, then import it and define the custom function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install beautifulsoup4"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from bs4 import BeautifulSoup\n",
+    "\n",
+    "def remove_nav_and_header_elements(content: BeautifulSoup) -> str:\n",
+    "    # Find all 'nav' and 'header' elements in the BeautifulSoup object\n",
+    "    nav_elements = content.find_all('nav')\n",
+    "    header_elements = content.find_all('header')\n",
+    "\n",
+    "    # Remove each 'nav' and 'header' element from the BeautifulSoup object\n",
+    "    for element in nav_elements + header_elements:\n",
+    "        element.decompose()\n",
+    "\n",
+    "    return content.get_text()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Pass your custom function to the `SitemapLoader` constructor through the `parsing_function` argument."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = SitemapLoader(\n",
+    "    \"https://langchain.readthedocs.io/sitemap.xml\",\n",
+    "    filter_urls=[\"https://python.langchain.com/en/latest/\"],\n",
+    "    parsing_function=remove_nav_and_header_elements\n",
+    ")"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},