community: Spider integration (#20937)

Added the [Spider.cloud](https://spider.cloud) document loader.
[Spider](https://github.com/spider-rs/spider) is the
[fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md)
and cheapest crawler that returns LLM-ready data.

- **Description:** Adds Spider data loader
- **Dependencies:** spider-client
- **Twitter handle:** @WilliamEspegren

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: = <=>
Co-authored-by: Chester Curme <chester.curme@gmail.com>

@@ -0,0 +1,95 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Spider\n",
"[Spider](https://spider.cloud/) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md) and most affordable crawler and scraper that returns LLM-ready data.\n",
"\n",
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pip install spider-client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage\n",
"To use spider you need to have an API key from [spider.cloud](https://spider.cloud/)."
]
},
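{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the environment-variable route (`YOUR_API_KEY` is a placeholder):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Placeholder key; the loader falls back to SPIDER_API_KEY when api_key is not passed\n",
"os.environ[\"SPIDER_API_KEY\"] = \"YOUR_API_KEY\""
]
},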
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Document(page_content='Spider - Fastest Web Crawler built for AI Agents and Large Language Models[Spider v1 Logo Spider ](/)The World\\'s Fastest and Cheapest Crawler API==========View Demo* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"], \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/crawl\\', headers=headers, json=json_data)print(response.json())```Example ResponseScrape with no headaches----------* Proxy rotations* Agent headers* Avoid anti-bot detections* Headless chrome* Markdown LLM ResponsesThe Fastest Web Crawler----------* Powered by [spider-rs](https://github.com/spider-rs/spider)* Do 20,000 pages in seconds* Full concurrency* Powerful and simple API* Cost effectiveScrape Anything with AI----------* Custom scripting browser* Custom data extraction* Data pipelines* Detailed insights* Advanced labeling[API](/docs/api) [Price](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', metadata={'description': 'Collect data rapidly from any website. Seamlessly scrape websites and get data tailored for LLM workloads.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 33743, 'keywords': None, 'pathname': '/', 'resource_type': 'html', 'title': 'Spider - Fastest Web Crawler built for AI Agents and Large Language Models', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/index.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'})]\n"
]
}
],
"source": [
"from langchain_community.document_loaders import SpiderLoader\n",
"\n",
"loader = SpiderLoader(\n",
" api_key=\"YOUR_API_KEY\",\n",
" url=\"https://spider.cloud\",\n",
" mode=\"scrape\", # if no API key is provided it looks for SPIDER_API_KEY in env\n",
")\n",
"\n",
"data = loader.load()\n",
"print(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modes\n",
"- `scrape`: Default mode that scrapes a single URL\n",
"- `crawl`: Crawl all subpages of the domain url provided"
]
},
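{
"cell_type": "markdown",
"metadata": {},
"source": [
"A rough sketch of crawl mode, combined with a small `params` dictionary (covered in the next section). `limit` and `return_format` are Spider API parameters; `YOUR_API_KEY` is a placeholder:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"loader = SpiderLoader(\n",
"    api_key=\"YOUR_API_KEY\",\n",
"    url=\"https://spider.cloud\",\n",
"    mode=\"crawl\",  # follow subpages instead of scraping a single URL\n",
"    params={\"limit\": 5, \"return_format\": \"markdown\"},\n",
")\n",
"\n",
"crawled_data = loader.load()"
]
},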
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Crawler options\n",
"The `params` parameter is a dictionary that can be passed to the loader. See the [Spider documentation](https://spider.cloud/docs/api) to see all available parameters"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@@ -9,7 +9,7 @@
"\n",
"This covers how to use `WebBaseLoader` to load all text from `HTML` webpages into a document format that we can use downstream. For more custom logic for loading webpages look at some child class examples such as `IMSDbLoader`, `AZLyricsLoader`, and `CollegeConfidentialLoader`. \n",
"\n",
"If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader`.\n"
"If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader` or the faster option `SpiderLoader`.\n"
]
},
{

@@ -55,6 +55,32 @@ data
</CodeOutputBlock>
## Loading HTML with SpiderLoader
[Spider](https://spider.cloud/?ref=langchain) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md#benchmark-results) crawler. It converts any website into pure HTML, markdown, metadata, or text, and lets you crawl with custom actions using AI.
Spider offers high-performance proxies to avoid detection, caching of AI actions, webhooks for crawl status, scheduled crawls, and more.
## Prerequisite
You need a Spider API key to use this loader. You can get one at [spider.cloud](https://spider.cloud).
```python
%pip install --upgrade --quiet langchain langchain-community spider-client
```
```python
from langchain_community.document_loaders import SpiderLoader
loader = SpiderLoader(
api_key="YOUR_API_KEY", url="https://spider.cloud", mode="crawl"
)
data = loader.load()
```
For guides and documentation, visit [Spider](https://spider.cloud/docs/api).
## Loading HTML with FireCrawlLoader
[FireCrawl](https://firecrawl.dev/?ref=langchain) crawls and converts any website into markdown. It crawls all accessible subpages and gives you clean markdown and metadata for each page.

@@ -14,6 +14,7 @@
Document, <name>TextSplitter
"""
import importlib
from typing import TYPE_CHECKING, Any
@@ -409,6 +410,9 @@ if TYPE_CHECKING:
from langchain_community.document_loaders.snowflake_loader import (
SnowflakeLoader, # noqa: F401
)
from langchain_community.document_loaders.spider import (
SpiderLoader, # noqa: F401
)
from langchain_community.document_loaders.spreedly import (
SpreedlyLoader, # noqa: F401
)
@@ -647,6 +651,7 @@ __all__ = [
"SitemapLoader",
"SlackDirectoryLoader",
"SnowflakeLoader",
"SpiderLoader",
"SpreedlyLoader",
"StripeLoader",
"SurrealDBLoader",
@@ -836,6 +841,7 @@ _module_lookup = {
"SitemapLoader": "langchain_community.document_loaders.sitemap",
"SlackDirectoryLoader": "langchain_community.document_loaders.slack_directory",
"SnowflakeLoader": "langchain_community.document_loaders.snowflake_loader",
"SpiderLoader": "langchain_community.document_loaders.spider",
"SpreedlyLoader": "langchain_community.document_loaders.spreedly",
"StripeLoader": "langchain_community.document_loaders.stripe",
"SurrealDBLoader": "langchain_community.document_loaders.surrealdb",

@@ -0,0 +1,94 @@
from typing import Iterator, Literal, Optional

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document
from langchain_core.utils import get_from_env


class SpiderLoader(BaseLoader):
    """Load web pages as Documents using Spider AI.

    Must have the Python package `spider-client` installed and a Spider API key.
    See https://spider.cloud for more.
    """

    def __init__(
        self,
        url: str,
        *,
        api_key: Optional[str] = None,
        mode: Literal["scrape", "crawl"] = "scrape",
        params: Optional[dict] = {"return_format": "markdown"},
    ):
        """Initialize with API key and URL.

        Args:
            url: The URL to be processed.
            api_key: The Spider API key. If not specified, will be read from env
                var `SPIDER_API_KEY`.
            mode: The mode to run the loader in. Default is "scrape".
                Options include "scrape" (single page) and "crawl" (with deeper
                crawling following subpages).
            params: Additional parameters for the Spider API.
        """
        try:
            from spider import Spider  # noqa: F401
        except ImportError:
            raise ImportError(
                "`spider` package not found, please run `pip install spider-client`"
            )
        if mode not in ("scrape", "crawl"):
            raise ValueError(
                f"Unrecognized mode '{mode}'. Expected one of 'scrape', 'crawl'."
            )
        # If `params` is `None`, initialize it as an empty dictionary
        if params is None:
            params = {}

        # Add a default value for 'metadata' if it's not already present
        if "metadata" not in params:
            params["metadata"] = True

        # Use the environment variable if the API key isn't provided
        api_key = api_key or get_from_env("api_key", "SPIDER_API_KEY")
        self.spider = Spider(api_key=api_key)
        self.url = url
        self.mode = mode
        self.params = params

    def lazy_load(self) -> Iterator[Document]:
        """Load documents based on the specified mode."""
        spider_docs = []

        if self.mode == "scrape":
            # Scrape a single page
            response = self.spider.scrape_url(self.url, params=self.params)
            if response:
                spider_docs.append(response)
        elif self.mode == "crawl":
            # Crawl multiple pages
            response = self.spider.crawl_url(self.url, params=self.params)
            if response:
                spider_docs.extend(response)

        for doc in spider_docs:
            if self.mode == "scrape":
                # Ensure page_content is also not None
                page_content = doc[0].get("content", "")

                # Ensure metadata is also not None
                metadata = doc[0].get("metadata", {})

                yield Document(page_content=page_content, metadata=metadata)
            if self.mode == "crawl":
                # Ensure page_content is also not None
                page_content = doc.get("content", "")

                # Ensure metadata is also not None
                metadata = doc.get("metadata", {})

                if page_content is not None:
                    yield Document(
                        page_content=page_content,
                        metadata=metadata,
                    )
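Because the loader implements `lazy_load`, documents can be consumed one at a time rather than materializing an entire crawl. A minimal usage sketch, assuming `SPIDER_API_KEY` is set in the environment (`limit` and `return_format` are Spider API parameters):

```python
from langchain_community.document_loaders import SpiderLoader

# The API key is read from the SPIDER_API_KEY environment variable when not passed.
loader = SpiderLoader(
    url="https://spider.cloud",
    mode="crawl",
    params={"limit": 5, "return_format": "markdown"},
)

for doc in loader.lazy_load():
    # Each Document carries the page content plus Spider metadata (title, url, ...).
    print(doc.metadata.get("title"), len(doc.page_content))
```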

@@ -143,6 +143,7 @@ EXPECTED_ALL = [
"SitemapLoader",
"SlackDirectoryLoader",
"SnowflakeLoader",
"SpiderLoader",
"SpreedlyLoader",
"StripeLoader",
"SurrealDBLoader",
