langchain/docs/integrations/diffbot.md
Leonid Ganeline a3598193a0
docs: ecosystem/integrations update 2 (#5282)
# docs: ecosystem/integrations update 2

#5219 - part 1 
The second part of this update (parts are independent of each other! no
overlap):

- added diffbot.md
- updated confluence.ipynb; added confluence.md
- updated college_confidential.md
- updated openai.md
- added blackboard.md
- added bilibili.md
- added azure_blob_storage.md
- added azlyrics.md
- added aws_s3.md

## Who can review?

@hwchase17@agola11
@agola11
 @vowelparrot
 @dev2049
2023-05-29 07:19:43 -07:00

823 B

Diffbot

Diffbot is a service to read web pages. Unlike traditional web scraping tools, Diffbot doesn't require any rules to read the content on a page. It starts with computer vision, which classifies a page into one of 20 possible types. Content is then interpreted by a machine learning model trained to identify the key attributes on a page based on its type. The result is a website transformed into clean-structured data (like JSON or CSV), ready for your application.

Installation and Setup

Read instructions how to get the Diffbot API Token.

Document Loader

See a usage example.

from langchain.document_loaders import DiffbotLoader