`Arxiv` document loader (#3627)

It makes sense to use `arxiv` as another source of documents for downloading.
- Added the `arxiv` document loader, based on `utilities/arxiv.py:ArxivAPIWrapper`
- Added tests
- Added an example notebook
- Sorted `__all__` in `__init__.py` (otherwise it is hard to find a class in the very long list)
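
A minimal usage sketch of the new loader (it assumes the `arxiv` and `pymupdf` packages are installed; the query value is just an example arXiv identifier, mirroring the notebook below):

from langchain.document_loaders import ArxivLoader

# Download up to two matching papers and convert each PDF to plain text.
docs = ArxivLoader(query="1605.08386", load_max_docs=2).load()
print(len(docs))
print(docs[0].metadata)  # Published, Title, Authors, Summary
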
leo-gan 1 year ago committed by GitHub
parent 539142f8d5
commit 36c59e0c25

@ -0,0 +1,177 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "bda1f3f5",
"metadata": {},
"source": [
"# Arxiv\n",
"\n",
"[arXiv](https://arxiv.org/) is an open-access archive for 2 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.\n",
"\n",
"This notebook shows how to load scientific articles from `Arxiv.org` into a document format that we can use downstream."
]
},
{
"cell_type": "markdown",
"id": "1b7a1eef-7bf7-4e7d-8bfc-c4e27c9488cb",
"metadata": {},
"source": [
"## Installation"
]
},
{
"cell_type": "markdown",
"id": "2abd5578-aa3d-46b9-99af-8b262f0b3df8",
"metadata": {},
"source": [
"First, you need to install the `arxiv` python package."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b674aaea-ed3a-4541-8414-260a8f67f623",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!pip install arxiv"
]
},
{
"cell_type": "markdown",
"id": "094b5f13-7e54-4354-9d83-26d6926ecaa0",
"metadata": {
"tags": []
},
"source": [
"Second, you need to install the `PyMuPDF` python package, which transforms the PDF files downloaded from the `arxiv.org` site into text format."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7cd91121-2e96-43ba-af50-319853695f86",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!pip install pymupdf"
]
},
{
"cell_type": "markdown",
"id": "95f05e1c-195e-4e2b-ae8e-8d6637f15be6",
"metadata": {},
"source": [
"## Examples"
]
},
{
"cell_type": "markdown",
"id": "e29b954c-1407-4797-ae21-6ba8937156be",
"metadata": {},
"source": [
"`ArxivLoader` has these arguments:\n",
"- `query`: free text used to find documents on arXiv\n",
"- optional `load_max_docs`: default=100. Use it to limit the number of downloaded documents. It takes time to download all 100 documents, so use a small number for experiments.\n",
"- optional `load_all_available_meta`: default=False. By default, only the most important fields are downloaded: `Published` (the date the document was published or last updated), `Title`, `Authors`, `Summary`. If True, the other available fields are also downloaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9bfd5e46",
"metadata": {},
"outputs": [],
"source": [
"from langchain.schema import Document\n",
"from langchain.document_loaders import ArxivLoader"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "700e4ef2",
"metadata": {},
"outputs": [],
"source": [
"docs = ArxivLoader(query=\"1605.08386\", load_max_docs=2).load()\n",
"len(docs)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "8977bac0-0042-4f23-9754-247dbd32439b",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'Published': '2016-05-26',\n",
" 'Title': 'Heat-bath random walks with Markov bases',\n",
" 'Authors': 'Caprice Stanley, Tobias Windisch',\n",
" 'Summary': 'Graphs on lattice points are studied whose edges come from a finite set of\\nallowed moves of arbitrary length. We show that the diameter of these graphs on\\nfibers of a fixed integer matrix can be bounded from above by a constant. We\\nthen study the mixing behaviour of heat-bath random walks on these graphs. We\\nalso state explicit conditions on the set of moves so that the heat-bath random\\nwalk, a generalization of the Glauber dynamics, is an expander in fixed\\ndimension.'}"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0].metadata  # meta-information of the Document"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "46969806-45a9-4c4d-a61b-cfb9658fc9de",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'arXiv:1605.08386v1 [math.CO] 26 May 2016\\nHEAT-BATH RANDOM WALKS WITH MARKOV BASES\\nCAPRICE STANLEY AND TOBIAS WINDISCH\\nAbstract. Graphs on lattice points are studied whose edges come from a finite set of\\nallowed moves of arbitrary length. We show that the diameter of these graphs on fibers of a\\nfixed integer matrix can be bounded from above by a constant. We then study the mixing\\nbehaviour of heat-b'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0].page_content[:400]  # the first 400 characters of the Document content\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -47,7 +47,7 @@ from langchain.prompts import (
    PromptTemplate,
)
from langchain.sql_database import SQLDatabase
from langchain.utilities.arxiv import ArxivAPIWrapper
from langchain.utilities.google_search import GoogleSearchAPIWrapper
from langchain.utilities.google_serper import GoogleSerperAPIWrapper
from langchain.utilities.powerbi import PowerBIDataset

@ -2,6 +2,7 @@
from langchain.document_loaders.airbyte_json import AirbyteJSONLoader
from langchain.document_loaders.apify_dataset import ApifyDatasetLoader
from langchain.document_loaders.arxiv import ArxivLoader
from langchain.document_loaders.azlyrics import AZLyricsLoader
from langchain.document_loaders.azure_blob_storage_container import (
    AzureBlobStorageContainerLoader,
@ -90,78 +91,79 @@ from langchain.document_loaders.youtube import (
PagedPDFSplitter = PyPDFLoader

__all__ = [
    "AZLyricsLoader",
    "AirbyteJSONLoader",
    "ApifyDatasetLoader",
    "ArxivLoader",
    "AzureBlobStorageContainerLoader",
    "AzureBlobStorageFileLoader",
    "BSHTMLLoader",
    "BigQueryLoader",
    "BiliBiliLoader",
    "BlackboardLoader",
    "BlockchainDocumentLoader",
    "CSVLoader",
    "ChatGPTLoader",
    "CoNLLULoader",
    "CollegeConfidentialLoader",
    "ConfluenceLoader",
    "DataFrameLoader",
    "DiffbotLoader",
    "DirectoryLoader",
    "DiscordChatLoader",
    "DuckDBLoader",
    "EverNoteLoader",
    "FacebookChatLoader",
    "GCSDirectoryLoader",
    "GCSFileLoader",
    "GitLoader",
    "GitbookLoader",
    "GoogleApiClient",
    "GoogleApiYoutubeLoader",
    "GoogleDriveLoader",
    "GutenbergLoader",
    "HNLoader",
    "HuggingFaceDatasetLoader",
    "IFixitLoader",
    "IMSDbLoader",
    "ImageCaptionLoader",
    "NotebookLoader",
    "NotionDBLoader",
    "NotionDirectoryLoader",
    "ObsidianLoader",
    "OnlinePDFLoader",
    "OutlookMessageLoader",
    "PDFMinerLoader",
    "PDFMinerPDFasHTMLLoader",
    "PagedPDFSplitter",
    "PlaywrightURLLoader",
    "PyMuPDFLoader",
    "PyPDFLoader",
    "PythonLoader",
    "ReadTheDocsLoader",
    "RoamLoader",
    "S3DirectoryLoader",
    "S3FileLoader",
    "SRTLoader",
    "SeleniumURLLoader",
    "SitemapLoader",
    "SlackDirectoryLoader",
    "TelegramChatLoader",
    "TextLoader",
    "TwitterTweetLoader",
    "UnstructuredEPubLoader",
    "UnstructuredEmailLoader",
    "UnstructuredFileIOLoader",
    "UnstructuredFileLoader",
    "UnstructuredHTMLLoader",
    "UnstructuredImageLoader",
    "UnstructuredMarkdownLoader",
    "UnstructuredPDFLoader",
    "UnstructuredPowerPointLoader",
    "UnstructuredRTFLoader",
    "UnstructuredURLLoader",
    "UnstructuredWordDocumentLoader",
    "WebBaseLoader",
    "WhatsAppChatLoader",
    "YoutubeLoader",
]

@ -0,0 +1,31 @@
from typing import List, Optional

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
from langchain.utilities.arxiv import ArxivAPIWrapper


class ArxivLoader(BaseLoader):
    """Loads a query result from arxiv.org into a list of Documents.

    Each arXiv result becomes one Document; the loader converts the
    original PDF into plain text.
    """

    def __init__(
        self,
        query: str,
        load_max_docs: Optional[int] = 100,
        load_all_available_meta: Optional[bool] = False,
    ):
        self.query = query
        self.load_max_docs = load_max_docs
        self.load_all_available_meta = load_all_available_meta

    def load(self) -> List[Document]:
        arxiv_client = ArxivAPIWrapper(
            load_max_docs=self.load_max_docs,
            load_all_available_meta=self.load_all_available_meta,
        )
        docs = arxiv_client.load(self.query)
        return docs
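
Since `ArxivLoader.load` simply constructs an `ArxivAPIWrapper` and delegates to its `load` method, the two calls in this sketch should return equivalent documents (the query value is an arbitrary example):

from langchain.document_loaders import ArxivLoader
from langchain.utilities.arxiv import ArxivAPIWrapper

# Both paths go through ArxivAPIWrapper.load under the hood.
docs_via_loader = ArxivLoader(query="1605.08386", load_max_docs=1).load()
docs_via_wrapper = ArxivAPIWrapper(load_max_docs=1).load("1605.08386")
assert docs_via_loader[0].metadata == docs_via_wrapper[0].metadata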

@ -1,8 +1,13 @@
"""Util that calls Arxiv."""
import logging
from typing import Any, Dict, List

from pydantic import BaseModel, Extra, root_validator

from langchain.schema import Document

logger = logging.getLogger(__name__)


class ArxivAPIWrapper(BaseModel):
    """Wrapper around ArxivAPI.

@ -12,12 +17,23 @@ class ArxivAPIWrapper(BaseModel):
    This wrapper will use the Arxiv API to conduct searches and
    fetch document summaries. By default, it will return the document summaries
    of the top-k results of an input search.

    Parameters:
        top_k_results: number of the top-scored document used for the arxiv tool
        ARXIV_MAX_QUERY_LENGTH: the cut limit on the query used for the arxiv tool.
        load_max_docs: a limit to the number of loaded documents
        load_all_available_meta:
            if True: the `metadata` of the loaded Documents gets all available meta info
                (see https://lukasschwab.me/arxiv.py/index.html#Result),
            if False: the `metadata` gets only the most informative fields.
    """

    arxiv_client: Any  #: :meta private:
    arxiv_exceptions: Any  # :meta private:
    top_k_results: int = 3
    ARXIV_MAX_QUERY_LENGTH = 300
    load_max_docs: int = 100
    load_all_available_meta: bool = False

    class Config:
        """Configuration for this pydantic object."""

@ -36,6 +52,7 @@ class ArxivAPIWrapper(BaseModel):
                arxiv.UnexpectedEmptyPageError,
                arxiv.HTTPError,
            )
            values["arxiv_result"] = arxiv.Result
        except ImportError:
            raise ValueError(
                "Could not import arxiv python package. "

@ -62,3 +79,64 @@ class ArxivAPIWrapper(BaseModel):
            return "\n\n".join(docs) if docs else "No good Arxiv Result was found"
        except self.arxiv_exceptions as ex:
            return f"Arxiv exception: {ex}"

    def load(self, query: str) -> List[Document]:
        """
        Run Arxiv search and get the PDF documents plus the meta information.
        See https://lukasschwab.me/arxiv.py/index.html#Search

        Returns: a list of documents with the document.page_content in text format
        """
        try:
            import fitz
        except ImportError:
            raise ValueError(
                "PyMuPDF package not found, please install it with "
                "`pip install pymupdf`"
            )

        try:
            docs: List[Document] = []
            for result in self.arxiv_search(  # type: ignore
                query[: self.ARXIV_MAX_QUERY_LENGTH], max_results=self.load_max_docs
            ).results():
                try:
                    doc_file_name: str = result.download_pdf()
                    with fitz.open(doc_file_name) as doc_file:
                        text: str = "".join(page.get_text() for page in doc_file)
                    add_meta = (
                        {
                            "entry_id": result.entry_id,
                            "published_first_time": str(result.published.date()),
                            "comment": result.comment,
                            "journal_ref": result.journal_ref,
                            "doi": result.doi,
                            "primary_category": result.primary_category,
                            "categories": result.categories,
                            "links": [link.href for link in result.links],
                        }
                        if self.load_all_available_meta
                        else {}
                    )
                    doc = Document(
                        page_content=text,
                        metadata=(
                            {
                                "Published": str(result.updated.date()),
                                "Title": result.title,
                                "Authors": ", ".join(a.name for a in result.authors),
                                "Summary": result.summary,
                                **add_meta,
                            }
                        ),
                    )
                    docs.append(doc)
                except FileNotFoundError as f_ex:
                    logger.debug(f_ex)
            return docs
        except self.arxiv_exceptions as ex:
            logger.debug("Error on arxiv: %s", ex)
            return []
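
For reference, the PyMuPDF step above boils down to a small helper; this is only a sketch (the function name and the file path are illustrative, not part of the change):

import fitz  # PyMuPDF

def pdf_to_text(pdf_path: str) -> str:
    # Concatenate the extracted text of every page in the PDF.
    with fitz.open(pdf_path) as pdf:
        return "".join(page.get_text() for page in pdf)

# e.g. text = pdf_to_text("downloaded_paper.pdf")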

@ -0,0 +1,55 @@
from typing import List

from langchain.document_loaders.arxiv import ArxivLoader
from langchain.schema import Document


def assert_docs(docs: List[Document]) -> None:
    for doc in docs:
        assert doc.page_content
        assert doc.metadata
        assert set(doc.metadata) == {"Published", "Title", "Authors", "Summary"}


def test_load_success() -> None:
    """Test that returns one document"""
    loader = ArxivLoader(query="1605.08386", load_max_docs=2)
    docs = loader.load()
    assert len(docs) == 1
    print(docs[0].metadata)
    print(docs[0].page_content)
    assert_docs(docs)


def test_load_returns_no_result() -> None:
    """Test that returns no docs"""
    loader = ArxivLoader(query="1605.08386WWW", load_max_docs=2)
    docs = loader.load()
    assert len(docs) == 0


def test_load_returns_limited_docs() -> None:
    """Test that returns several docs"""
    expected_docs = 2
    loader = ArxivLoader(query="ChatGPT", load_max_docs=expected_docs)
    docs = loader.load()
    assert len(docs) == expected_docs
    assert_docs(docs)


def test_load_returns_full_set_of_metadata() -> None:
    """Test that returns several docs"""
    loader = ArxivLoader(query="ChatGPT", load_max_docs=1, load_all_available_meta=True)
    docs = loader.load()
    assert len(docs) == 1
    for doc in docs:
        assert doc.page_content
        assert doc.metadata
        assert set(doc.metadata).issuperset(
            {"Published", "Title", "Authors", "Summary"}
        )
        print(doc.metadata)
        assert len(set(doc.metadata)) > 4

@ -1,6 +1,9 @@
"""Integration test for Arxiv API Wrapper."""
from typing import List

import pytest

from langchain.schema import Document
from langchain.utilities import ArxivAPIWrapper

@ -9,22 +12,68 @@ def api_client() -> ArxivAPIWrapper:
    return ArxivAPIWrapper()


def test_run_success(api_client: ArxivAPIWrapper) -> None:
    """Test that returns the correct answer"""
    output = api_client.run("1605.08386")
    assert "Heat-bath random walks with Markov bases" in output


def test_run_returns_several_docs(api_client: ArxivAPIWrapper) -> None:
    """Test that returns several docs"""
    output = api_client.run("Caprice Stanley")
    assert "On Mixing Behavior of a Family of Random Walks" in output


def test_run_returns_no_result(api_client: ArxivAPIWrapper) -> None:
    """Test that gives no result."""
    output = api_client.run("1605.08386WWW")
    assert "No good Arxiv Result was found" == output


def assert_docs(docs: List[Document]) -> None:
    for doc in docs:
        assert doc.page_content
        assert doc.metadata
        assert set(doc.metadata) == {"Published", "Title", "Authors", "Summary"}


def test_load_success(api_client: ArxivAPIWrapper) -> None:
    """Test that returns one document"""
    docs = api_client.load("1605.08386")
    assert len(docs) == 1
    assert_docs(docs)


def test_load_returns_no_result(api_client: ArxivAPIWrapper) -> None:
    """Test that returns no docs"""
    docs = api_client.load("1605.08386WWW")
    assert len(docs) == 0


def test_load_returns_limited_docs() -> None:
    """Test that returns several docs"""
    expected_docs = 2
    api_client = ArxivAPIWrapper(load_max_docs=expected_docs)
    docs = api_client.load("ChatGPT")
    assert len(docs) == expected_docs
    assert_docs(docs)


def test_load_returns_full_set_of_metadata() -> None:
    """Test that returns several docs"""
    api_client = ArxivAPIWrapper(load_max_docs=1, load_all_available_meta=True)
    docs = api_client.load("ChatGPT")
    assert len(docs) == 1
    for doc in docs:
        assert doc.page_content
        assert doc.metadata
        assert set(doc.metadata).issuperset(
            {"Published", "Title", "Authors", "Summary"}
        )
        print(doc.metadata)
        assert len(set(doc.metadata)) > 4
