Mirror of https://github.com/hwchase17/langchain, synced 2024-11-10 01:10:59 +00:00 (a3598193a0)
# docs: ecosystem/integrations update 2

#5219 - part 1

The second part of this update (parts are independent of each other! no overlap):

- added diffbot.md
- updated confluence.ipynb; added confluence.md
- updated college_confidential.md
- updated openai.md
- added blackboard.md
- added bilibili.md
- added azure_blob_storage.md
- added azlyrics.md
- added aws_s3.md

## Who can review?

@hwchase17 @agola11 @vowelparrot @dev2049
19 lines
823 B
Markdown
# Diffbot

>[Diffbot](https://docs.diffbot.com/docs) is a service to read web pages. Unlike traditional web scraping tools,
> `Diffbot` doesn't require any rules to read the content on a page.
>It starts with computer vision, which classifies a page into one of 20 possible types. Content is then interpreted by a machine learning model trained to identify the key attributes on a page based on its type.
>The result is a website transformed into clean-structured data (like JSON or CSV), ready for your application.

## Installation and Setup

Read the [instructions](https://docs.diffbot.com/reference/authentication) on how to get a Diffbot API token.

## Document Loader

See a [usage example](../modules/indexes/document_loaders/examples/diffbot.ipynb).

```python
from langchain.document_loaders import DiffbotLoader
```
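
The import above can be combined with the API token from the setup step roughly as follows. This is a minimal sketch, not the canonical usage: the environment variable name `DIFFBOT_API_TOKEN` and the helpers `get_diffbot_token` and `load_pages` are assumptions made for illustration, and the `urls` and `api_token` arguments reflect the loader's constructor as best understood here, so check the linked notebook for the exact signature in your LangChain version.

```python
import os


def get_diffbot_token(env_var: str = "DIFFBOT_API_TOKEN") -> str:
    """Return the Diffbot API token from the environment, or raise if unset.

    The variable name is an assumption; use whatever your deployment defines.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} to your Diffbot API token first.")
    return token


def load_pages(urls):
    """Hypothetical helper: load each URL through Diffbot as LangChain documents."""
    # Imported lazily so the token helper works even without langchain installed.
    from langchain.document_loaders import DiffbotLoader

    # Assumed constructor arguments: a list of URLs and the API token.
    loader = DiffbotLoader(urls=list(urls), api_token=get_diffbot_token())
    return loader.load()
```

With the token exported in the environment, calling `load_pages` with a list of page URLs should return one document per page that Diffbot was able to process. Reading the token from the environment keeps it out of notebooks and version control.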