# URL

This covers how to load HTML documents from a list of URLs into a document format that we can use downstream.

In [1]:
from langchain.document_loaders import UnstructuredURLLoader

In [2]:
urls = [
 "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
 "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023",
]

If loading fails with an SSL verification error, pass `ssl_verify=False` together with `headers=headers` when constructing the loader.

In [3]:
loader = UnstructuredURLLoader(urls=urls)
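
The SSL workaround described above can be sketched as follows. This is a minimal sketch, assuming the installed `langchain` and `unstructured` versions accept `ssl_verify` and `headers` as keyword arguments; the `headers` value and the `build_loader` helper are illustrative and do not come from the original notebook. Disabling certificate verification removes a security check, so it is best reserved for sources you trust.

```python
# Hypothetical request headers; the original notebook does not define `headers`.
headers = {"User-Agent": "Mozilla/5.0"}

# Keyword arguments assumed to be accepted by UnstructuredURLLoader.
loader_kwargs = {"ssl_verify": False, "headers": headers}


def build_loader(urls):
    """Build a loader that skips SSL certificate verification (illustrative helper)."""
    # Imported lazily so the sketch can be read without langchain installed.
    from langchain.document_loaders import UnstructuredURLLoader

    return UnstructuredURLLoader(urls=urls, **loader_kwargs)
```

Calling `build_loader(urls)` would then take the place of the plain `UnstructuredURLLoader(urls=urls)` call.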

In [4]:
data = loader.load()
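
`load()` returns a list of document objects, each exposing the extracted text as `page_content` and a `metadata` dictionary that records the source URL. The sketch below shows one way to get a quick overview of the result. The `summarize` helper and the stand-in `sample` objects are hypothetical; they are used here so the example runs without a network call.

```python
from types import SimpleNamespace


def summarize(docs, preview_chars=80):
    """Return (source, character count, preview) for each loaded document."""
    rows = []
    for doc in docs:
        text = doc.page_content
        source = doc.metadata.get("source", "<unknown>")
        rows.append((source, len(text), text[:preview_chars]))
    return rows


# Stand-in objects with the same two attributes as a loaded document.
# In the notebook you would pass `data` from `loader.load()` instead.
sample = [
    SimpleNamespace(
        page_content="Example page text.",
        metadata={"source": "https://example.com/a"},
    )
]

for source, n_chars, preview in summarize(sample):
    print(f"{source}: {n_chars} chars, starts with {preview!r}")
```

The same helper can be applied to the output of the Selenium and Playwright loaders below, since they return the same kind of objects.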

# Selenium URL Loader

This covers how to load HTML documents from a list of URLs using the `SeleniumURLLoader`.

Using Selenium lets us load pages that require JavaScript to render.

## Setup

To use the `SeleniumURLLoader`, you will need to install `selenium` and `unstructured`.
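
Before running the cells below, it can help to confirm that both packages are importable. The following is a minimal sketch using only the standard library; the `missing_packages` helper is hypothetical, and it assumes the import name matches the package name, which holds for `selenium` and `unstructured`.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["selenium", "unstructured"])
if missing:
    # Print the pip command rather than running it, so nothing is installed implicitly.
    print("Install before continuing: pip install " + " ".join(missing))
```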


In [None]:
from langchain.document_loaders import SeleniumURLLoader

In [None]:
urls = [
 "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
 "https://goo.gl/maps/NDSHwePEyaHMFGwh8",
]

In [None]:
loader = SeleniumURLLoader(urls=urls)

In [None]:
data = loader.load()

# Playwright URL Loader

This covers how to load HTML documents from a list of URLs using the `PlaywrightURLLoader`.

As in the Selenium case, Playwright allows us to load pages that need JavaScript to render.

## Setup

To use the `PlaywrightURLLoader`, you will need to install `playwright` and `unstructured`. Additionally, you will need to install the Playwright browser binaries, which include Chromium:

In [None]:
# Install playwright and unstructured, then download the browser binaries
!pip install "playwright"
!pip install "unstructured"
!playwright install

In [None]:
from langchain.document_loaders import PlaywrightURLLoader

In [None]:
urls = [
 "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
 "https://goo.gl/maps/NDSHwePEyaHMFGwh8",
]

In [None]:
loader = PlaywrightURLLoader(urls=urls, remove_selectors=["header", "footer"])

In [None]:
data = loader.load()
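
The `remove_selectors` argument used above lists page elements to drop before the text is extracted, which keeps navigation and boilerplate out of the resulting documents. The sketch below illustrates that effect on a small HTML string using only the standard library. It is not the loader's implementation: the real loader works inside the browser on CSS selectors, whereas this hypothetical `visible_text` helper only matches plain tag names.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text, ignoring anything nested inside the given tag names."""

    def __init__(self, skip_tags):
        super().__init__()
        self.skip_tags = set(skip_tags)
        self.depth = 0  # number of skipped elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside every skipped element.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html, skip_tags):
    """Return the text of `html` with the contents of `skip_tags` removed."""
    parser = TagStripper(skip_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


html = (
    "<header>Site menu</header>"
    "<main><p>Article body.</p></main>"
    "<footer>Copyright</footer>"
)
print(visible_text(html, ["header", "footer"]))
```

With `["header", "footer"]` only the article text remains; with an empty list the menu and copyright text would be kept as well.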