From 0b4a51930c37ec946aefe5a2d5f621d0a743f4f9 Mon Sep 17 00:00:00 2001
From: Soos3D <99700157+soos3d@users.noreply.github.com>
Date: Wed, 7 Jun 2023 23:16:51 -0300
Subject: [PATCH] Add how to use a custom scraping function with the sitemap loader. (#5847)

Hi! I just added an example of how to use a custom scraping function with the sitemap loader. I recently used this feature and had to dig in the source code to find it. I thought it might be useful to other devs to have an example in the Jupyter Notebook directly.

I only added the example to the documentation page.

@eyurtsev I was not able to run the lint. Please let me know if I have to do anything else.

I know this is a very small contribution, but I hope it will be valuable. My Twitter handle is @web3Dav3.
---
 .../document_loaders/examples/sitemap.ipynb   | 67 +++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/docs/modules/indexes/document_loaders/examples/sitemap.ipynb b/docs/modules/indexes/document_loaders/examples/sitemap.ipynb
index f27dad30..98f103b6 100644
--- a/docs/modules/indexes/document_loaders/examples/sitemap.ipynb
+++ b/docs/modules/indexes/document_loaders/examples/sitemap.ipynb
@@ -146,6 +146,73 @@
     "documents[0]"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Add custom scraping rules\n",
+    "\n",
+    "The `SitemapLoader` uses `beautifulsoup4` for the scraping process, and it scrapes every element on the page by default. The `SitemapLoader` constructor accepts a custom scraping function. This feature can be helpful to tailor the scraping process to your specific needs; for example, you might want to avoid scraping headers or navigation elements.\n",
+    "\n",
+    " The following example shows how to develop and use a custom function to avoid navigation and header elements."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Import the `beautifulsoup4` library and define the custom function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pip install beautifulsoup4"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from bs4 import BeautifulSoup\n",
+    "\n",
+    "def remove_nav_and_header_elements(content: BeautifulSoup) -> str:\n",
+    "    # Find all 'nav' and 'header' elements in the BeautifulSoup object\n",
+    "    nav_elements = content.find_all('nav')\n",
+    "    header_elements = content.find_all('header')\n",
+    "\n",
+    "    # Remove each 'nav' and 'header' element from the BeautifulSoup object\n",
+    "    for element in nav_elements + header_elements:\n",
+    "        element.decompose()\n",
+    "\n",
+    "    return str(content.get_text())"
+   ]
+},
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Add your custom function to the `SitemapLoader` object."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = SitemapLoader(\n",
+    "    \"https://langchain.readthedocs.io/sitemap.xml\",\n",
+    "    filter_urls=[\"https://python.langchain.com/en/latest/\"],\n",
+    "    parsing_function=remove_nav_and_header_elements\n",
+    ")"
+   ]
+},
   {
    "cell_type": "markdown",
    "metadata": {},