Strip leading and trailing whitespace and newlines from `loc` before filtering sitemap URLs (#5728)

Fixes #5699 



#### Who can review?

Tag maintainers/contributors who might be interested:

@woodworker @LeSphax @johannhartmann

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
searx_updates
Shelby Jenkins 12 months ago committed by GitHub
parent 98dd6d068a
commit 2dcda8a8ac
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@@ -79,8 +79,11 @@ class SitemapLoader(WebBaseLoader):
             if not loc:
                 continue
+            # Strip leading and trailing whitespace and newlines
+            loc_text = loc.text.strip()
             if self.filter_urls and not any(
-                re.match(r, loc.text) for r in self.filter_urls
+                re.match(r, loc_text) for r in self.filter_urls
             ):
                 continue
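
The change matters because `re.match` only matches at the start of the string, so a `<loc>` value that begins with a newline or indentation never matches a pattern such as `https://...`. The sketch below reproduces that behavior outside the loader. The helper `filter_locs`, the sample URLs, and the pattern are hypothetical and are not part of `SitemapLoader`; only the strip-then-match logic mirrors the diff.

```python
import re


def filter_locs(locs, filter_urls):
    """Hypothetical helper: strip each loc, then keep it only if it matches
    at least one pattern (or if no patterns were given)."""
    kept = []
    for raw in locs:
        # Strip leading and trailing whitespace and newlines, as in the diff
        loc_text = raw.strip()
        if filter_urls and not any(re.match(r, loc_text) for r in filter_urls):
            continue
        kept.append(loc_text)
    return kept


# A loc value as it might appear in a pretty-printed sitemap (assumed sample)
raw_loc = "\n    https://example.com/blog/post-1\n  "
pattern = r"https://example\.com/blog/.*"

# Without stripping, the leading newline prevents a match at position 0
unstripped_match = re.match(pattern, raw_loc)

# With stripping, the same pattern matches
stripped_match = re.match(pattern, raw_loc.strip())

blog_urls = filter_locs([raw_loc, "  https://example.com/about  "], [pattern])
```

Here `unstripped_match` is `None` while `stripped_match` is a match object, which is the mismatch that caused padded `<loc>` entries to be dropped by the filter.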
