Compare commits

...

4 Commits

Author SHA1 Message Date
WilliamEspegren 804390ba4b
community: Spider integration (#20937)
Added the [Spider.cloud](https://spider.cloud) document loader.
[Spider](https://github.com/spider-rs/spider) is the
[fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md)
and cheapest crawler that returns LLM-ready data.

```
- **Description:** Adds Spider data loader
- **Dependencies:** spider-client
- **Twitter handle:** @WilliamEspegren 
```

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: = <=>
Co-authored-by: Chester Curme <chester.curme@gmail.com>
2 weeks ago
Jamie Lemon 6342217b93
docs: Moves "Using PyMuPDF" to higher up the page. (#20832)
**Description:**
This PR moves the **PyMuPDF** PDF loader solution to sit directly underneath
**PyPDF**, because it is the 2nd most popular PyPI package of the options
listed, after **PyPDF**.

Please refer to these download numbers, correct at the time of writing:

- PyPDF: 160 million (https://www.pepy.tech/projects/PyPDF2)
- PyMuPDF: 60 million (https://www.pepy.tech/projects/pymupdf)
- PDFPlumber: 23 million (https://www.pepy.tech/projects/pdfplumber)
- PDFMiner: 16 million (https://www.pepy.tech/projects/pdfminer)
- PyPDFium2: 8 million (https://www.pepy.tech/projects/pypdfium2)
- Unstructured: 8 million (https://www.pepy.tech/projects/unstructured)


Please note I am an active contributor to
https://github.com/pymupdf/PyMuPDF

Many thanks!

----

**Twitter handle:**
@artifex
2 weeks ago
Chouaieb Nemri 8097bec472
Added LogEntry, Any, Dict, List, Optional, TypedDict imports (#20970)
Thank you for contributing to LangChain!

- [ ] **PR title**: "package: docs"

- [ ] **PR message**:
- **Description:** Updated docs: RAG streaming use-cases notebook with
LogEntry, Any, Dict, List, Optional, TypedDict imports
    - **Twitter handle:** c_nemri

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
2 weeks ago
ccurme 9ec7151317
fireworks: fix integration tests (#20973) 2 weeks ago

@ -0,0 +1,95 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Spider\n",
"[Spider](https://spider.cloud/) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md) and most affordable crawler and scraper that returns LLM-ready data.\n",
"\n",
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pip install spider-client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage\n",
"To use spider you need to have an API key from [spider.cloud](https://spider.cloud/)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Document(page_content='Spider - Fastest Web Crawler built for AI Agents and Large Language Models[Spider v1 Logo Spider ](/)The World\\'s Fastest and Cheapest Crawler API==========View Demo* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"], \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/crawl\\', headers=headers, json=json_data)print(response.json())```Example ResponseScrape with no headaches----------* Proxy rotations* Agent headers* Avoid anti-bot detections* Headless chrome* Markdown LLM ResponsesThe Fastest Web Crawler----------* Powered by [spider-rs](https://github.com/spider-rs/spider)* Do 20,000 pages in seconds* Full concurrency* Powerful and simple API* Cost effectiveScrape Anything with AI----------* Custom scripting browser* Custom data extraction* Data pipelines* Detailed insights* Advanced labeling[API](/docs/api) [Price](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', metadata={'description': 'Collect data rapidly from any website. Seamlessly scrape websites and get data tailored for LLM workloads.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 33743, 'keywords': None, 'pathname': '/', 'resource_type': 'html', 'title': 'Spider - Fastest Web Crawler built for AI Agents and Large Language Models', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/index.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'})]\n"
]
}
],
"source": [
"from langchain_community.document_loaders import SpiderLoader\n",
"\n",
"loader = SpiderLoader(\n",
" api_key=\"YOUR_API_KEY\",\n",
" url=\"https://spider.cloud\",\n",
" mode=\"scrape\", # if no API key is provided it looks for SPIDER_API_KEY in env\n",
")\n",
"\n",
"data = loader.load()\n",
"print(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modes\n",
"- `scrape`: Default mode that scrapes a single URL\n",
"- `crawl`: Crawl all subpages of the domain url provided"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Crawler options\n",
"The `params` parameter is a dictionary that can be passed to the loader. See the [Spider documentation](https://spider.cloud/docs/api) to see all available parameters"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -9,7 +9,7 @@
"\n",
"This covers how to use `WebBaseLoader` to load all text from `HTML` webpages into a document format that we can use downstream. For more custom logic for loading webpages look at some child class examples such as `IMSDbLoader`, `AZLyricsLoader`, and `CollegeConfidentialLoader`. \n",
"\n",
"If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader`.\n"
"If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader` or the faster option `SpiderLoader`.\n"
]
},
{

@ -55,6 +55,32 @@ data
</CodeOutputBlock>
## Loading HTML with SpiderLoader
[Spider](https://spider.cloud/?ref=langchain) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md#benchmark-results) crawler. It converts any website into pure HTML, markdown, metadata, or text while letting you crawl with custom actions using AI.
Spider lets you use high-performance proxies to prevent detection, cache AI actions, set webhooks for crawling status, schedule crawls, and more.
## Prerequisite
You need a Spider API key to use this loader. You can get one at [spider.cloud](https://spider.cloud).
```python
%pip install --upgrade --quiet langchain langchain-community spider-client
```
```python
from langchain_community.document_loaders import SpiderLoader
loader = SpiderLoader(
api_key="YOUR_API_KEY", url="https://spider.cloud", mode="crawl"
)
data = loader.load()
```
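Spider's behavior can be tuned through the optional `params` dictionary. A minimal sketch, assuming the `limit` and `return_format` options shown in Spider's example request and the loader's defaults (treat the exact keys as assumptions and confirm them against the API docs):
```python
from langchain_community.document_loaders import SpiderLoader

# Assumed params: "limit" caps the number of pages crawled,
# "return_format" asks Spider for LLM-ready markdown
loader = SpiderLoader(
    api_key="YOUR_API_KEY",
    url="https://spider.cloud",
    mode="crawl",
    params={"limit": 5, "return_format": "markdown"},
)
data = loader.load()
```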
For guides and documentation, visit [Spider](https://spider.cloud/docs/api).
## Loading HTML with FireCrawlLoader
[FireCrawl](https://firecrawl.dev/?ref=langchain) crawls and converts any website into markdown. It crawls all accessible subpages and gives you clean markdown and metadata for each.

@ -108,6 +108,41 @@ pages[4].page_content
</CodeOutputBlock>
## Using PyMuPDF
This is the fastest of the PDF parsing options; it provides detailed metadata about the PDF and its pages and returns one document per page.
```python
from langchain_community.document_loaders import PyMuPDFLoader
```
```python
loader = PyMuPDFLoader("example_data/layout-parser-paper.pdf")
```
```python
data = loader.load()
```
```python
data[0]
```
<CodeOutputBlock lang="python">
```
Document(page_content='LayoutParser: A Unified Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1 (<28>), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1 Allen Institute for AI\nshannons@allenai.org\n2 Brown University\nruochen zhang@brown.edu\n3 Harvard University\n{melissadell,jacob carlson}@fas.harvard.edu\n4 University of Washington\nbcgl@cs.washington.edu\n5 University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model configurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\nefforts to improve reusability and simplify deep learning (DL) model\ndevelopment in disciplines like natural language processing and computer\nvision, none of them are optimized for challenges in the domain of DIA.\nThis represents a major gap in the existing toolkit, as DIA is central to\nacademic research across a wide range of disciplines in the social sciences\nand humanities. This paper introduces LayoutParser, an open-source\nlibrary for streamlining the usage of DL in DIA research and applica-\ntions. The core LayoutParser library comes with a set of simple and\nintuitive interfaces for applying and customizing DL models for layout de-\ntection, character recognition, and many other document processing tasks.\nTo promote extensibility, LayoutParser also incorporates a community\nplatform for sharing both pre-trained models and full document digiti-\nzation pipelines. We demonstrate that LayoutParser is helpful for both\nlightweight and large-scale digitization pipelines in real-word use cases.\nThe library is publicly available at https://layout-parser.github.io.\nKeywords: Document Image Analysis · Deep Learning · Layout Analysis\n· Character Recognition · Open Source library · Toolkit.\n1\nIntroduction\nDeep Learning(DL)-based approaches are the state-of-the-art for a wide range of\ndocument image analysis (DIA) tasks including document image classification [11,\narXiv:2103.15348v2 [cs.CV] 21 Jun 2021\n', lookup_str='', metadata={'file_path': 'example_data/layout-parser-paper.pdf', 'page_number': 1, 'total_pages': 16, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.21', 'creationDate': 'D:20210622012710Z', 'modDate': 'D:20210622012710Z', 'trapped': '', 'encryption': None}, lookup_index=0)
```
</CodeOutputBlock>
Additionally, you can pass along any of the options from the [PyMuPDF documentation](https://pymupdf.readthedocs.io/en/latest/app1.html#plain-text/) as keyword arguments in the `load` call, and they will be passed along to the `get_text()` call.
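For example, a minimal sketch using the `sort` option from the PyMuPDF text-extraction docs (assuming you want text in reading order; other `get_text()` options can be passed the same way):
```python
# "sort" is a get_text() option documented by PyMuPDF; it orders text
# blocks by position on the page rather than by insertion order
data = loader.load(sort=True)
```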
## Using MathPix
Inspired by Daniel Gross's [https://gist.github.com/danielgross/3ab4104e14faccc12b49200843adab21](https://gist.github.com/danielgross/3ab4104e14faccc12b49200843adab21)
@ -349,39 +384,6 @@ semantic_snippets[4]
</CodeOutputBlock>
## Using PyMuPDF
This is the fastest of the PDF parsing options; it provides detailed metadata about the PDF and its pages and returns one document per page.
```python
from langchain_community.document_loaders import PyMuPDFLoader
```
```python
loader = PyMuPDFLoader("example_data/layout-parser-paper.pdf")
```
```python
data = loader.load()
```
```python
data[0]
```
<CodeOutputBlock lang="python">
```
Document(page_content='LayoutParser: A Unified Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1 (<28>), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1 Allen Institute for AI\nshannons@allenai.org\n2 Brown University\nruochen zhang@brown.edu\n3 Harvard University\n{melissadell,jacob carlson}@fas.harvard.edu\n4 University of Washington\nbcgl@cs.washington.edu\n5 University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model configurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\nefforts to improve reusability and simplify deep learning (DL) model\ndevelopment in disciplines like natural language processing and computer\nvision, none of them are optimized for challenges in the domain of DIA.\nThis represents a major gap in the existing toolkit, as DIA is central to\nacademic research across a wide range of disciplines in the social sciences\nand humanities. This paper introduces LayoutParser, an open-source\nlibrary for streamlining the usage of DL in DIA research and applica-\ntions. The core LayoutParser library comes with a set of simple and\nintuitive interfaces for applying and customizing DL models for layout de-\ntection, character recognition, and many other document processing tasks.\nTo promote extensibility, LayoutParser also incorporates a community\nplatform for sharing both pre-trained models and full document digiti-\nzation pipelines. We demonstrate that LayoutParser is helpful for both\nlightweight and large-scale digitization pipelines in real-word use cases.\nThe library is publicly available at https://layout-parser.github.io.\nKeywords: Document Image Analysis · Deep Learning · Layout Analysis\n· Character Recognition · Open Source library · Toolkit.\n1\nIntroduction\nDeep Learning(DL)-based approaches are the state-of-the-art for a wide range of\ndocument image analysis (DIA) tasks including document image classification [11,\narXiv:2103.15348v2 [cs.CV] 21 Jun 2021\n', lookup_str='', metadata={'file_path': 'example_data/layout-parser-paper.pdf', 'page_number': 1, 'total_pages': 16, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.21', 'creationDate': 'D:20210622012710Z', 'modDate': 'D:20210622012710Z', 'trapped': '', 'encryption': None}, lookup_index=0)
```
</CodeOutputBlock>
Additionally, you can pass along any of the options from the [PyMuPDF documentation](https://pymupdf.readthedocs.io/en/latest/app1.html#plain-text/) as keyword arguments in the `load` call, and they will be passed along to the `get_text()` call.
## PyPDF Directory

@ -346,7 +346,7 @@
"from operator import itemgetter\n",
"\n",
"from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n",
"from langchain_core.tracers.log_stream import LogStreamCallbackHandler\n",
"from langchain_core.tracers.log_stream import LogEntry, LogStreamCallbackHandler\n",
"\n",
"contextualize_q_system_prompt = \"\"\"Given a chat history and the latest user question \\\n",
"which might reference context in the chat history, formulate a standalone question \\\n",
@ -400,6 +400,8 @@
"To stream intermediate steps we'll use the `astream_log` method. This is an async method that yields JSONPatch ops that when applied in the same order as received build up the RunState:\n",
"\n",
"```python\n",
"from typing import Any, Dict, List, Optional, TypedDict\n",
"\n",
"class RunState(TypedDict):\n",
" id: str\n",
" \"\"\"ID of the run.\"\"\"\n",

@ -14,6 +14,7 @@
Document, <name>TextSplitter
"""
import importlib
from typing import TYPE_CHECKING, Any
@ -409,6 +410,9 @@ if TYPE_CHECKING:
    from langchain_community.document_loaders.snowflake_loader import (
        SnowflakeLoader,  # noqa: F401
    )
    from langchain_community.document_loaders.spider import (
        SpiderLoader,  # noqa: F401
    )
    from langchain_community.document_loaders.spreedly import (
        SpreedlyLoader,  # noqa: F401
    )
@ -647,6 +651,7 @@ __all__ = [
"SitemapLoader",
"SlackDirectoryLoader",
"SnowflakeLoader",
"SpiderLoader",
"SpreedlyLoader",
"StripeLoader",
"SurrealDBLoader",
@ -836,6 +841,7 @@ _module_lookup = {
"SitemapLoader": "langchain_community.document_loaders.sitemap",
"SlackDirectoryLoader": "langchain_community.document_loaders.slack_directory",
"SnowflakeLoader": "langchain_community.document_loaders.snowflake_loader",
"SpiderLoader": "langchain_community.document_loaders.spider",
"SpreedlyLoader": "langchain_community.document_loaders.spreedly",
"StripeLoader": "langchain_community.document_loaders.stripe",
"SurrealDBLoader": "langchain_community.document_loaders.surrealdb",

@ -0,0 +1,94 @@
from typing import Iterator, Literal, Optional

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document
from langchain_core.utils import get_from_env


class SpiderLoader(BaseLoader):
    """Load web pages as Documents using Spider AI.

    Must have the Python package `spider-client` installed and a Spider API key.
    See https://spider.cloud for more.
    """

    def __init__(
        self,
        url: str,
        *,
        api_key: Optional[str] = None,
        mode: Literal["scrape", "crawl"] = "scrape",
        params: Optional[dict] = None,
    ):
        """Initialize with API key and URL.

        Args:
            url: The URL to be processed.
            api_key: The Spider API key. If not specified, will be read from env
                var `SPIDER_API_KEY`.
            mode: The mode to run the loader in. Default is "scrape".
                Options include "scrape" (single page) and "crawl" (with deeper
                crawling following subpages).
            params: Additional parameters for the Spider API. Defaults to
                `{"return_format": "markdown"}`.
        """
        try:
            from spider import Spider  # noqa: F401
        except ImportError:
            raise ImportError(
                "`spider` package not found, please run `pip install spider-client`"
            )
        if mode not in ("scrape", "crawl"):
            raise ValueError(
                f"Unrecognized mode '{mode}'. Expected one of 'scrape', 'crawl'."
            )
        # Build the default per instance rather than as a shared mutable
        # default argument
        if params is None:
            params = {"return_format": "markdown"}
        # Request metadata unless the caller explicitly opted out
        if "metadata" not in params:
            params["metadata"] = True
        # Use the environment variable if the API key isn't provided
        api_key = api_key or get_from_env("api_key", "SPIDER_API_KEY")
        self.spider = Spider(api_key=api_key)
        self.url = url
        self.mode = mode
        self.params = params

    def lazy_load(self) -> Iterator[Document]:
        """Load documents based on the specified mode."""
        spider_docs = []
        if self.mode == "scrape":
            # Scrape a single page; the API returns a list with one record
            response = self.spider.scrape_url(self.url, params=self.params)
            if response:
                spider_docs.append(response)
        elif self.mode == "crawl":
            # Crawl multiple pages; the API returns one record per page
            response = self.spider.crawl_url(self.url, params=self.params)
            if response:
                spider_docs.extend(response)
        for doc in spider_docs:
            if self.mode == "scrape":
                # Guard against records whose content or metadata is None
                page_content = doc[0].get("content") or ""
                metadata = doc[0].get("metadata") or {}
                yield Document(page_content=page_content, metadata=metadata)
            elif self.mode == "crawl":
                # Missing content defaults to ""; records whose content is
                # explicitly None are skipped
                page_content = doc.get("content", "")
                metadata = doc.get("metadata") or {}
                if page_content is not None:
                    yield Document(
                        page_content=page_content,
                        metadata=metadata,
                    )
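A minimal usage sketch (URL and metadata key are illustrative; `load()` is inherited from `BaseLoader` and simply collects what `lazy_load()` yields):
```python
from langchain_community.document_loaders import SpiderLoader

# Assumes SPIDER_API_KEY is set in the environment
loader = SpiderLoader(url="https://spider.cloud", mode="crawl")

# lazy_load() streams Documents one page at a time
for doc in loader.lazy_load():
    print(doc.metadata.get("title"), len(doc.page_content))
```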

@ -143,6 +143,7 @@ EXPECTED_ALL = [
"SitemapLoader",
"SlackDirectoryLoader",
"SnowflakeLoader",
"SpiderLoader",
"SpreedlyLoader",
"StripeLoader",
"SurrealDBLoader",

@ -5,7 +5,7 @@ all: help
# Define a variable for the test file path.
TEST_FILE ?= tests/unit_tests/
integration_test integration_tests: TEST_FILE ?= tests/integration_tests/
integration_test integration_tests: TEST_FILE = tests/integration_tests/
test tests integration_test integration_tests:
	poetry run pytest $(TEST_FILE)
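Note: `?=` assigns only when a variable is still undefined, and `TEST_FILE` already has the global default `tests/unit_tests/` above, so the target-specific `?=` never took effect and the integration targets ran the unit tests. The plain `=` pins them to `tests/integration_tests/`.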

@ -13,3 +13,21 @@ class TestFireworksStandard(ChatModelIntegrationTests):
    @pytest.fixture
    def chat_model_class(self) -> Type[BaseChatModel]:
        return ChatFireworks

    @pytest.fixture
    def chat_model_params(self) -> dict:
        return {
            "model": "accounts/fireworks/models/firefunction-v1",
            "temperature": 0,
        }

    @pytest.mark.xfail(reason="Not yet implemented.")
    def test_tool_message_histories_list_content(
        self,
        chat_model_class: Type[BaseChatModel],
        chat_model_params: dict,
        chat_model_has_tool_calling: bool,
    ) -> None:
        super().test_tool_message_histories_list_content(
            chat_model_class, chat_model_params, chat_model_has_tool_calling
        )
