langchain/libs/community/langchain_community
Lei Zhang 748a6ae609
community[patch]: add HTTP response headers Content-Type to metadata of RecursiveUrlLoader document (#20875)
**Description:** 
The RecursiveUrlLoader offers a link_regex parameter for filtering
URLs. However, this filtering is limited: if the internal links of the
website change, unexpected resources may be loaded. Such resources, for
example font files, can cause problems in subsequent embedding
processing.

> https://blog.langchain.dev/assets/fonts/source-sans-pro-v21-latin-ext_latin-regular.woff2?v=0312715cbf

We can add the Content-Type from the HTTP response headers to the
document metadata so that developers can choose which resources to use.

For example, the following types may be good choices for a text knowledge base:

- text/plain - simple text file
- text/html - HTML web page
- text/xml - XML format file
- application/json - JSON format data
- application/pdf - PDF file
- application/msword - Word document

The following can be ignored:

- text/css - CSS stylesheet
- text/javascript - JavaScript script
- application/octet-stream - binary data
- image/jpeg - JPEG image
- image/png - PNG image
- image/gif - GIF image
- image/svg+xml - SVG image
- audio/mpeg - MPEG audio files
- video/mp4 - MP4 video file
- application/font-woff - WOFF font file
- application/font-ttf - TTF font file
- application/zip - ZIP compressed file
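
The allow-list idea above can be sketched as a small post-load filter. This is a minimal sketch, not the loader's own code: it assumes the new metadata key is named `content_type`, and `Doc`, `ALLOWED_TYPES`, `media_type`, and `keep_text_docs` are hypothetical names introduced here for illustration. `Doc` is only a stand-in for a loaded document with a `metadata` dict.

```python
from dataclasses import dataclass, field

# Hypothetical allow-list built from the "good choice" types listed above.
# Both spellings of the JSON type are included in case a server sends either.
ALLOWED_TYPES = {
    "text/plain",
    "text/html",
    "text/xml",
    "text/json",
    "application/json",
    "application/pdf",
    "application/msword",
}


@dataclass
class Doc:
    """Stand-in for a loaded document (assumption: metadata is a plain dict)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def media_type(content_type):
    """Drop parameters such as '; charset=utf-8' and normalize case."""
    return (content_type or "").split(";", 1)[0].strip().lower()


def keep_text_docs(docs, allowed=ALLOWED_TYPES):
    """Keep documents whose 'content_type' metadata (assumed key) is allowed."""
    return [
        doc
        for doc in docs
        if media_type(doc.metadata.get("content_type")) in allowed
    ]
```

Documents without the metadata key are dropped in this sketch; a caller who prefers to keep them could add the empty string to the allow-list.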

**Twitter handle:** @coolbeevip

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2024-04-25 11:29:41 -07:00
..
adapters
agent_toolkits community[patch],core[minor]: Move BaseToolKit to core.tools (#20669) 2024-04-22 14:04:30 -04:00
callbacks patch: remove usage of llm, chat model __call__ (#20788) 2024-04-24 19:39:23 -04:00
chat_loaders community[patch]: import flattening fix (#20110) 2024-04-10 13:01:19 -04:00
chat_message_histories core[patch],community[patch]: Move file chat history back to community (#20834) 2024-04-24 12:47:25 -04:00
chat_models patch: remove usage of llm, chat model __call__ (#20788) 2024-04-24 19:39:23 -04:00
cross_encoders community[patch]: cross_encoders flatten namespaces (#20183) 2024-04-08 20:50:23 -04:00
docstore community[patch]: docstrings update (#20301) 2024-04-11 16:23:27 -04:00
document_compressors community[mionr]: add Jina Reranker in retrievers module (#19406) 2024-04-25 10:27:10 -07:00
document_loaders community[patch]: add HTTP response headers Content-Type to metadata of RecursiveUrlLoader document (#20875) 2024-04-25 11:29:41 -07:00
document_transformers community[patch]: add BeautifulSoupTransformer remove_unwanted_classnames method (#20467) 2024-04-25 17:04:04 +00:00
embeddings community[patch]: YandexGPT API add ability to disable request logging (#20670) 2024-04-19 21:40:37 -04:00
example_selectors
graphs community[patch]: Add driver config param for neo4j graph (#20772) 2024-04-24 21:14:41 +00:00
indexes community[patch]: docstrings update (#20301) 2024-04-11 16:23:27 -04:00
llms patch: remove usage of llm, chat model __call__ (#20788) 2024-04-24 19:39:23 -04:00
output_parsers
retrievers patch: deprecate (a)get_relevant_documents (#20477) 2024-04-22 11:14:53 -04:00
storage community[patch]: import flattening fix (#20110) 2024-04-10 13:01:19 -04:00
tools community[patch]: deprecating remaining google_community integrations (#20471) 2024-04-15 09:57:12 -04:00
utilities community[minor]: Add async methods to CassandraVectorStore (#20602) 2024-04-20 02:09:58 +00:00
utils community[patch]: docstrings update (#20301) 2024-04-11 16:23:27 -04:00
vectorstores core[minor], langchain[patch], community[patch]: mv StructuredQuery (#20849) 2024-04-25 09:40:26 -07:00
__init__.py
cache.py patch: remove usage of llm, chat model __call__ (#20788) 2024-04-24 19:39:23 -04:00
py.typed