Add how to use a custom scraping function with the sitemap loader. (#5847)

Hi! I just added an example of how to use a custom scraping function
with the sitemap loader. I recently used this feature and had to dig in
the source code to find it. I thought it might be useful to other devs
to have an example in the Jupyter Notebook directly.

I only added the example to the documentation page. 

@eyurtsev I was not able to run the lint. Please let me know if I have
to do anything else.

I know this is a very small contribution, but I hope it will be
valuable. My Twitter handle is @web3Dav3.

<!-- For a quicker response, figure out the right person to tag with @

  @hwchase17 - project lead

  Tracing / Callbacks
  - @agola11

  Async
  - @agola11

  DataLoaders
  - @eyurtsev

  Models
  - @hwchase17
  - @agola11

  Agents / Tools / Toolkits
  - @vowelparrot

  VectorStores / Retrievers / Memory
  - @dev2049

 -->
searx_updates
Soos3D 11 months ago committed by GitHub
parent c66755b661
commit 0b4a51930c
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -146,6 +146,73 @@
"documents[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Add custom scraping rules\n",
"\n",
"The `SitemapLoader` uses `beautifulsoup4` for the scraping process, and it scrapes every element on the page by default. The `SitemapLoader` constructor accepts a custom scraping function. This feature can be helpful to tailor the scraping process to your specific needs; for example, you might want to avoid scraping headers or navigation elements.\n",
"\n",
" The following example shows how to develop and use a custom function to avoid navigation and header elements."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import the `beautifulsoup4` library and define the custom function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pip install beautifulsoup4"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"def remove_nav_and_header_elements(content: BeautifulSoup) -> str:\n",
" # Find all 'nav' and 'header' elements in the BeautifulSoup object\n",
" nav_elements = content.find_all('nav')\n",
" header_elements = content.find_all('header')\n",
"\n",
" # Remove each 'nav' and 'header' element from the BeautifulSoup object\n",
" for element in nav_elements + header_elements:\n",
" element.decompose()\n",
"\n",
" return str(content.get_text())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Add your custom function to the `SitemapLoader` object."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"loader = SitemapLoader(\n",
" \"https://langchain.readthedocs.io/sitemap.xml\",\n",
" filter_urls=[\"https://python.langchain.com/en/latest/\"],\n",
" parsing_function=remove_nav_and_header_elements\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},

Loading…
Cancel
Save