langchain/tests/unit_tests/document_loaders/test_bshtml.py

import sys
from pathlib import Path

import pytest

from langchain.document_loaders.html_bs import BSHTMLLoader

HERE = Path(__file__).parent
EXAMPLES = HERE.parent.parent / "integration_tests" / "examples"


@pytest.mark.requires("bs4", "lxml")
def test_bs_html_loader() -> None:
    """Test that BSHTMLLoader extracts page text and title metadata."""
    file_path = EXAMPLES / "example.html"
    loader = BSHTMLLoader(str(file_path), get_text_separator="|")
    docs = loader.load()

    assert len(docs) == 1

    metadata = docs[0].metadata
    content = docs[0].page_content

    assert metadata["title"] == "Chew dad's slippers"
    assert metadata["source"] == str(file_path)
    # The custom separator should appear right after the leading newline.
    assert content[:2] == "\n|"


@pytest.mark.skipif(
    bool(sys.flags.utf8_mode) or not sys.platform.startswith("win"),
    reason="requires a non-UTF-8 default encoding (Windows without UTF-8 mode)",
)
@pytest.mark.requires("bs4", "lxml")
def test_bs_html_loader_non_utf8() -> None:
    """Test providing encoding to BSHTMLLoader."""
    file_path = EXAMPLES / "example-utf8.html"

    # Without an explicit encoding, the platform default cannot decode the file.
    with pytest.raises(UnicodeDecodeError):
        BSHTMLLoader(str(file_path)).load()

    loader = BSHTMLLoader(str(file_path), open_encoding="utf8")
    docs = loader.load()

    assert len(docs) == 1

    metadata = docs[0].metadata

    assert metadata["title"] == "Chew dad's slippers"
    assert metadata["source"] == str(file_path)

2023-03-17 04:47:17 +00:00  Add HTML document_loader that includes page title metadata (#1720). This `BSHTMLLoader` document_loader loads an HTML document, extracts text and adds the page title to the returned Document's metadata. The loader uses the already installed bs4 package to extract both text content and the page title. Included in this PR is an example HTML file and an integration test that tests against this file. Co-authored-by: Daniel Chalef <daniel.chalef@private.org>
2023-04-01 19:48:27 +00:00  Add ability to pass kwargs to loader classes in `DirectoryLoader`, add ability to modify encoding and BeautifulSoup behaviour in `BSHTMLLoader` (#2275). Solves #2247. Noted that the only test I added checks for the BeautifulSoup behaviour change. Happy to add a test for `DirectoryLoader` if deemed necessary.
2023-04-26 23:10:16 +00:00  Add get_text_separator parameter to BSHTMLLoader (#3551). By default get_text doesn't separate the content of different HTML tags. Adding an option for specifying a separator helps with document splitting.
2023-05-18 02:39:11 +00:00  Add html parsers (#4874). Add bs4 html parser: some minor refactors; extract the bs4 html parsing code from the bs html loader; move some tests from integration tests to unit tests.