langchain

mirror of https://github.com/hwchase17/langchain synced 2024-11-08 07:10:35 +00:00

Commit Graph

Author SHA1 Message Date

Maciej Bryński

aa345a4bb7

Add get_text_separator parameter to BSHTMLLoader (#3551)

By default, get_text doesn't separate the content of different HTML tags.
Adding an option to specify a separator helps with document splitting.

2023-04-26 16:10:16 -07:00
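
To illustrate why a separator helps with splitting, here is a minimal sketch. It does not use BeautifulSoup or `BSHTMLLoader`; it stands in the standard library's `html.parser` for bs4, and the `TextCollector` class and the `get_text` helper are hypothetical names invented for this example. The assumption is that text extraction amounts to joining the document's text nodes with a separator string, which is empty by default.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called once per run of text found between tags.
        self.chunks.append(data)


def get_text(html, separator=""):
    """Hypothetical helper: join all text nodes with `separator`."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)


html = "<h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p>"

# With an empty separator, text from adjacent tags runs together.
joined = get_text(html)

# With a newline separator, each tag's text lands on its own line,
# which gives a text splitter a natural boundary to cut on.
separated = get_text(html, separator="\n")

print(repr(joined))
print(repr(separated))
```

With the empty separator, the heading and paragraphs fuse into one unbroken string; with `"\n"`, a downstream splitter can cut on line boundaries.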

Sam Cordner-Matthews

1ddd6dbf0b

Add ability to pass kwargs to loader classes in `DirectoryLoader`, add ability to modify encoding and BeautifulSoup behaviour in `BSHTMLLoader` (#2275)

Solves #2247. Note that the only test I added checks the
BeautifulSoup behaviour change. Happy to add a test for
`DirectoryLoader` if deemed necessary.

2023-04-01 12:48:27 -07:00
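
The kwargs pass-through described in this commit can be sketched roughly as follows. This is not the library's code: `SketchDirectoryLoader`, `TextFileLoader`, and the `loader_kwargs` parameter name are assumptions made for illustration. The idea shown is only that a directory-level loader forwards a dictionary of keyword arguments to each per-file loader it constructs, so that settings such as the file encoding can be changed from the outside.

```python
import tempfile
from pathlib import Path


class TextFileLoader:
    """Hypothetical per-file loader configured through keyword arguments."""

    def __init__(self, file_path, encoding="utf-8"):
        self.file_path = file_path
        self.encoding = encoding

    def load(self):
        # Return a one-element list holding the decoded file contents.
        return [Path(self.file_path).read_text(encoding=self.encoding)]


class SketchDirectoryLoader:
    """Hypothetical directory loader that forwards kwargs to `loader_cls`."""

    def __init__(self, path, glob="**/*", loader_cls=TextFileLoader,
                 loader_kwargs=None):
        self.path = path
        self.glob = glob
        self.loader_cls = loader_cls
        # Avoid a shared mutable default by normalising None to a new dict.
        self.loader_kwargs = loader_kwargs or {}

    def load(self):
        docs = []
        for file_path in sorted(Path(self.path).glob(self.glob)):
            if file_path.is_file():
                loader = self.loader_cls(str(file_path), **self.loader_kwargs)
                docs.extend(loader.load())
        return docs


with tempfile.TemporaryDirectory() as tmp:
    # Write a Latin-1 encoded file that would not decode as UTF-8.
    Path(tmp, "page.html").write_bytes("caf\u00e9".encode("latin-1"))

    loader = SketchDirectoryLoader(
        tmp, glob="*.html", loader_kwargs={"encoding": "latin-1"}
    )
    docs = loader.load()

print(docs)
```

Without the forwarded `encoding` argument, the per-file loader would fall back to UTF-8 and fail on this file, which is the kind of situation the pass-through is meant to handle.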

Daniel Chalef

b157e0c1c3

Add HTML document_loader that includes page title metadata (#1720)

This `BSHTMLLoader` document_loader loads an HTML document, extracts
text and adds the page title to the returned Document's metadata. The
loader uses the already installed bs4 package to extract both text
content and the page title.

Included in this PR is an example HTML file and an integration test that
tests against this file.

---------

Co-authored-by: Daniel Chalef <daniel.chalef@private.org>

2023-03-16 21:47:17 -07:00
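
The behaviour this commit describes (extract the text, and store the page title in the document's metadata) can be sketched as below. This is an illustration under stated assumptions, not the loader's implementation: it uses the standard library's `html.parser` in place of bs4, and `Document`, `TitleAndTextParser`, `load_html`, and the `source` metadata key are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Document:
    """Hypothetical stand-in for a document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class TitleAndTextParser(HTMLParser):
    """Collects all text nodes and, separately, the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # In this sketch, every text node (title included) is page content.
        self.text_parts.append(data)
        if self.in_title:
            self.title_parts.append(data)


def load_html(html, source):
    """Hypothetical helper: build a Document from an HTML string."""
    parser = TitleAndTextParser()
    parser.feed(html)
    parser.close()
    return Document(
        page_content="".join(parser.text_parts),
        metadata={"source": source, "title": "".join(parser.title_parts)},
    )


doc = load_html(
    "<html><head><title>Test Title</title></head>"
    "<body><p>Hello world.</p></body></html>",
    source="example.html",
)

print(doc.metadata["title"])
print(doc.page_content)
```

Keeping the title in metadata rather than only in the text lets downstream code label or filter chunks by page even after the content has been split.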

3 Commits