mirror of https://github.com/hwchase17/langchain
feat: add support for non-html in `UnstructuredURLLoader` (#2793)
### Summary

Adds support for processing non-HTML document types in the URL loader. For example, the URL loader can now process a PDF or Markdown file hosted at a URL.

### Testing

```python
from langchain.document_loaders import UnstructuredURLLoader

# A PDF hosted at a URL, i.e. non-HTML content
urls = ["https://www.understandingwar.org/sites/default/files/Russian%20Offensive%20Campaign%20Assessment%2C%20April%2011%2C%202023.pdf"]

loader = UnstructuredURLLoader(urls=urls, strategy="fast")
docs = loader.load()

# Print the first 1000 characters of the first loaded document
print(docs[0].page_content[:1000])
```

pull/2651/head
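To illustrate what "non-HTML" means for a URL, here is a minimal sketch that guesses a document type from the URL path using only the Python standard library. The helper name `guess_document_type` and the `text/html` fallback are assumptions made for illustration; this is not how `UnstructuredURLLoader` decides how to process a URL, and a real loader may rely on other signals such as the response's content type.

```python
import mimetypes
import posixpath
from urllib.parse import unquote, urlparse


def guess_document_type(url: str, default: str = "text/html") -> str:
    """Guess a MIME type from the file extension in a URL path.

    Hypothetical helper for illustration only. Falls back to `default`
    (an assumption) when the path has no recognizable extension.
    """
    # Decode percent-escapes (e.g. %20, %2C) and keep only the last path segment.
    path = unquote(urlparse(url).path)
    filename = posixpath.basename(path)
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type or default


if __name__ == "__main__":
    pdf_url = (
        "https://www.understandingwar.org/sites/default/files/"
        "Russian%20Offensive%20Campaign%20Assessment%2C%20April%2011%2C%202023.pdf"
    )
    print(guess_document_type(pdf_url))
    print(guess_document_type("https://example.com/"))
```

Guessing from the extension is cheap but unreliable for URLs without one, which is why the sketch treats such URLs as HTML by default.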
parent e081c62aac
commit f0be3b0689