Respect User-Specified User-Agent in WebBaseLoader (#4579)

# Respect User-Specified User-Agent in WebBaseLoader
This pull request modifies the `WebBaseLoader` class initializer from
the `langchain.document_loaders.web_base` module to preserve any
User-Agent specified by the user in the `header_template` parameter.
Previously, even if a User-Agent was specified in `header_template`, it
would always be overridden by a random User-Agent generated by the
`fake_useragent` library.

With this change, if a User-Agent is specified in `header_template`, it
will be used. Only in the case where no User-Agent is specified will a
random User-Agent be generated and used. This provides additional
flexibility when using the `WebBaseLoader` class, allowing users to
specify their own User-Agent if they have a specific need or preference,
while still providing a reasonable default for cases where no User-Agent
is specified.

This change has no impact on existing users who do not specify a
User-Agent, as the behavior in this case remains the same. However, for
users who do specify a User-Agent, their choice will now be respected
and used for all subsequent requests made using the `WebBaseLoader`
class.


Fixes #4167

## Before submitting

============================= test session starts
==============================
collecting ... collected 1 item


test_web_base.py::TestWebBaseLoader::test_respect_user_specified_user_agent

============================== 1 passed in 3.64s
===============================
PASSED [100%]

## Who can review?

Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested: @eyurtsev

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
This commit is contained in:
Li Yuanzheng 2023-05-15 11:09:27 +08:00 committed by GitHub
parent 372a5113ff
commit 3b6206af49
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 22 additions and 10 deletions

View File

@ -68,17 +68,19 @@ class WebBaseLoader(BaseLoader):
"bs4 package not found, please install it with " "`pip install bs4`"
)
headers = header_template or default_header_template
if not headers.get("User-Agent"):
try:
from fake_useragent import UserAgent
headers = header_template or default_header_template
headers["User-Agent"] = UserAgent().random
self.session.headers = dict(headers)
except ImportError:
logger.info(
"fake_useragent not found, using default user agent. "
"To get a realistic header for requests, `pip install fake_useragent`."
"fake_useragent not found, using default user agent."
"To get a realistic header for requests, "
"`pip install fake_useragent`."
)
self.session.headers = dict(headers)
@property
def web_path(self) -> str:

View File

@ -0,0 +1,10 @@
from langchain.document_loaders.web_base import WebBaseLoader
class TestWebBaseLoader:
def test_respect_user_specified_user_agent(self) -> None:
user_specified_user_agent = "user_specified_user_agent"
header_template = {"User-Agent": user_specified_user_agent}
url = "https://www.example.com"
loader = WebBaseLoader(url, header_template=header_template)
assert loader.session.headers["User-Agent"] == user_specified_user_agent