forked from Archives/langchain
enhancement: support headers for non-html urls (#3166)
### Summary Updates the `UnstructuredURLLoader` to support passing in headers for non HTML content types. While this update maintains backward compatibility with older versions of `unstructured`, we strongly recommended upgrading to `unstructured>=0.5.13` if you are using the `UnstructuredURLLoader`. ### Testing #### With headers ```python from langchain.document_loaders import UnstructuredURLLoader urls = ["https://www.understandingwar.org/sites/default/files/Russian%20Offensive%20Campaign%20Assessment%2C%20April%2011%2C%202023.pdf"] loader = UnstructuredURLLoader(urls=urls, headers={"Accept": "application/json"}, strategy="fast") docs = loader.load() print(docs[0].page_content[:1000]) ``` #### Without headers ```python from langchain.document_loaders import UnstructuredURLLoader urls = ["https://www.understandingwar.org/sites/default/files/Russian%20Offensive%20Campaign%20Assessment%2C%20April%2011%2C%202023.pdf"] loader = UnstructuredURLLoader(urls=urls, strategy="fast") docs = loader.load() print(docs[0].page_content[:1000]) ``` --------- Co-authored-by: Zander Chase <130414180+vowelparrot@users.noreply.github.com>fix_agent_callbacks
parent
7b1f0656b8
commit
3e0c44bae8
Loading…
Reference in New Issue