Harrison/apify (#2215)
Co-authored-by: Jiří Moravčík <jiri.moravcik@gmail.com>
This commit is contained in:
parent
e6a9ee64b3
commit
2eeaccf01c
BIN
docs/_static/ApifyActors.png
vendored
Normal file
Binary file not shown.
After Width: | Height: | Size: 559 KiB
46
docs/ecosystem/apify.md
Normal file
@@ -0,0 +1,46 @@
# Apify

This page covers how to use [Apify](https://apify.com) within LangChain.

## Overview

Apify is a cloud platform for web scraping and data extraction,
which provides an [ecosystem](https://apify.com/store) of more than a thousand
ready-made apps called *Actors* for various scraping, crawling, and extraction use cases.

[![Apify Actors](../_static/ApifyActors.png)](https://apify.com/store)

This integration enables you to run Actors on the Apify platform and load their results into LangChain to feed your vector
indexes with documents and data from the web, e.g. to generate answers from websites with documentation,
blogs, or knowledge bases.

## Installation and Setup

- Install the Apify API client for Python with `pip install apify-client`
- Get your [Apify API token](https://console.apify.com/account/integrations) and either set it as
  an environment variable (`APIFY_API_TOKEN`) or pass it to the `ApifyWrapper` as `apify_api_token` in the constructor.
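
For example, a minimal sketch of both options (the token strings are placeholders for your real token; passing it to the constructor follows the wrapper's docstring):

```python
import os

from langchain.utilities import ApifyWrapper

# Option 1: set the token as an environment variable before constructing the wrapper.
os.environ["APIFY_API_TOKEN"] = "Your Apify API token"  # placeholder value
apify = ApifyWrapper()

# Option 2: pass the token directly to the constructor, as the wrapper's
# docstring describes (placeholder value again).
apify = ApifyWrapper(apify_api_token="Your Apify API token")
```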

## Wrappers

### Utility

You can use the `ApifyWrapper` to run Actors on the Apify platform.

```python
from langchain.utilities import ApifyWrapper
```
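
For example, here is a minimal sketch, mirroring the walkthrough notebook below, that runs the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor and wraps its results in a document loader (the start URL is illustrative, and `APIFY_API_TOKEN` is assumed to be set in the environment):

```python
from langchain.document_loaders.base import Document
from langchain.utilities import ApifyWrapper

apify = ApifyWrapper()  # assumes APIFY_API_TOKEN is set in the environment

# Run the Actor, wait for it to finish, and get a loader over its default dataset.
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={"startUrls": [{"url": "https://python.langchain.com/en/latest/"}]},
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "", metadata={"source": item["url"]}
    ),
)
```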

For a more detailed walkthrough of this wrapper, see [this notebook](../modules/agents/tools/examples/apify.ipynb).

### Loader

You can also use our `ApifyDatasetLoader` to get data from an Apify dataset.

```python
from langchain.document_loaders import ApifyDatasetLoader
```
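
For example, a minimal sketch (the dataset ID is a placeholder for the ID of your own dataset on the Apify platform):

```python
from langchain.document_loaders import ApifyDatasetLoader
from langchain.document_loaders.base import Document

# Map each dataset item (a plain dict) to a LangChain Document.
loader = ApifyDatasetLoader(
    dataset_id="your-dataset-id",  # placeholder
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "", metadata={"source": item["url"]}
    ),
)
documents = loader.load()
```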

For a more detailed walkthrough of this loader, see [this notebook](../modules/indexes/document_loaders/examples/apify_dataset.ipynb).
164
docs/modules/agents/tools/examples/apify.ipynb
Normal file
@@ -0,0 +1,164 @@
{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Apify\n",
    "\n",
    "This notebook shows how to use the [Apify integration](../../../../ecosystem/apify.md) for LangChain.\n",
    "\n",
    "[Apify](https://apify.com) is a cloud platform for web scraping and data extraction,\n",
    "which provides an [ecosystem](https://apify.com/store) of more than a thousand\n",
    "ready-made apps called *Actors* for various web scraping, crawling, and data extraction use cases.\n",
    "For example, you can use it to extract Google Search results, Instagram and Facebook profiles, products from Amazon or Shopify, Google Maps reviews, etc.\n",
    "\n",
    "In this example, we'll use the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor,\n",
    "which can deeply crawl websites such as documentation, knowledge bases, help centers, or blogs,\n",
    "and extract text content from the web pages. Then we feed the documents into a vector index and answer questions from it.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, import `ApifyWrapper` into your source code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders.base import Document\n",
    "from langchain.indexes import VectorstoreIndexCreator\n",
    "from langchain.utilities import ApifyWrapper"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize it using your [Apify API token](https://console.apify.com/account/integrations) and, for the purpose of this example, also with your OpenAI API key:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"Your OpenAI API key\"\n",
    "os.environ[\"APIFY_API_TOKEN\"] = \"Your Apify API token\"\n",
    "\n",
    "apify = ApifyWrapper()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then run the Actor, wait for it to finish, and fetch its results from the Apify dataset into a LangChain document loader.\n",
    "\n",
    "Note that if you already have some results in an Apify dataset, you can load them directly using `ApifyDatasetLoader`, as shown in [this notebook](../../../indexes/document_loaders/examples/apify_dataset.ipynb). In that notebook, you'll also find an explanation of the `dataset_mapping_function`, which is used to map fields from the Apify dataset records to LangChain `Document` fields."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = apify.call_actor(\n",
    "    actor_id=\"apify/website-content-crawler\",\n",
    "    run_input={\"startUrls\": [{\"url\": \"https://python.langchain.com/en/latest/\"}]},\n",
    "    dataset_mapping_function=lambda item: Document(\n",
    "        page_content=item[\"text\"] or \"\", metadata={\"source\": item[\"url\"]}\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize the vector index from the crawled documents:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "index = VectorstoreIndexCreator().from_loaders([loader])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And finally, query the vector index:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"What is LangChain?\"\n",
    "result = index.query_with_sources(query)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " LangChain is a standard interface through which you can interact with a variety of large language models (LLMs). It provides modules that can be used to build language model applications, and it also provides chains and agents with memory capabilities.\n",
      "\n",
      "https://python.langchain.com/en/latest/modules/models/llms.html, https://python.langchain.com/en/latest/getting_started/getting_started.html\n"
     ]
    }
   ],
   "source": [
    "print(result[\"answer\"])\n",
    "print(result[\"sources\"])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
175
docs/modules/indexes/document_loaders/examples/apify_dataset.ipynb
Normal file
@@ -0,0 +1,175 @@
{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Apify Dataset\n",
    "\n",
    "This notebook shows how to load Apify datasets into LangChain.\n",
    "\n",
    "[Apify Dataset](https://docs.apify.com/platform/storage/dataset) is a scalable append-only storage with sequential access, built for storing structured web scraping results, such as a list of products or Google SERPs, and then exporting them to various formats like JSON, CSV, or Excel. Datasets are mainly used to save results of [Apify Actors](https://apify.com/store)—serverless cloud programs for various web scraping, crawling, and data extraction use cases.\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "You need to have an existing dataset on the Apify platform. If you don't have one, please first check out [this notebook](../../../agents/tools/examples/apify.ipynb) on how to use Apify to extract content from documentation, knowledge bases, help centers, or blogs."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, import `ApifyDatasetLoader` into your source code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import ApifyDatasetLoader\n",
    "from langchain.document_loaders.base import Document"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then provide a function that maps Apify dataset record fields to LangChain `Document` format.\n",
    "\n",
    "For example, if your dataset items are structured like this:\n",
    "\n",
    "```json\n",
    "{\n",
    "    \"url\": \"https://apify.com\",\n",
    "    \"text\": \"Apify is the best web scraping and automation platform.\"\n",
    "}\n",
    "```\n",
    "\n",
    "The mapping function in the code below will convert them to LangChain `Document` format, so that you can use them further with any LLM model (e.g. for question answering)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = ApifyDatasetLoader(\n",
    "    dataset_id=\"your-dataset-id\",\n",
    "    dataset_mapping_function=lambda dataset_item: Document(\n",
    "        page_content=dataset_item[\"text\"], metadata={\"source\": dataset_item[\"url\"]}\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = loader.load()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## An example with question answering\n",
    "\n",
    "In this example, we use data from a dataset to answer a question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.docstore.document import Document\n",
    "from langchain.document_loaders import ApifyDatasetLoader\n",
    "from langchain.indexes import VectorstoreIndexCreator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = ApifyDatasetLoader(\n",
    "    dataset_id=\"your-dataset-id\",\n",
    "    dataset_mapping_function=lambda item: Document(\n",
    "        page_content=item[\"text\"] or \"\", metadata={\"source\": item[\"url\"]}\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "index = VectorstoreIndexCreator().from_loaders([loader])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"What is Apify?\"\n",
    "result = index.query_with_sources(query)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " Apify is a platform for developing, running, and sharing serverless cloud programs. It enables users to create web scraping and automation tools and publish them on the Apify platform.\n",
      "\n",
      "https://docs.apify.com/platform/actors, https://docs.apify.com/platform/actors/running/actors-in-store, https://docs.apify.com/platform/security, https://docs.apify.com/platform/actors/examples\n"
     ]
    }
   ],
   "source": [
    "print(result[\"answer\"])\n",
    "print(result[\"sources\"])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
@@ -19,6 +19,7 @@ from langchain.tools.python.tool import PythonREPLTool
 from langchain.tools.requests.tool import RequestsGetTool
 from langchain.tools.wikipedia.tool import WikipediaQueryRun
 from langchain.tools.wolfram_alpha.tool import WolframAlphaQueryRun
+from langchain.utilities.apify import ApifyWrapper
 from langchain.utilities.bash import BashProcess
 from langchain.utilities.bing_search import BingSearchAPIWrapper
 from langchain.utilities.google_search import GoogleSearchAPIWrapper
@@ -1,6 +1,7 @@
 """All different types of document loaders."""

 from langchain.document_loaders.airbyte_json import AirbyteJSONLoader
+from langchain.document_loaders.apify_dataset import ApifyDatasetLoader
 from langchain.document_loaders.azlyrics import AZLyricsLoader
 from langchain.document_loaders.azure_blob_storage_container import (
     AzureBlobStorageContainerLoader,
@@ -119,6 +120,7 @@ __all__ = [
     "GoogleApiClient",
     "CSVLoader",
     "BlackboardLoader",
+    "ApifyDatasetLoader",
     "WhatsAppChatLoader",
     "DataFrameLoader",
     "AzureBlobStorageFileLoader",
54
langchain/document_loaders/apify_dataset.py
Normal file
@@ -0,0 +1,54 @@
"""Logic for loading documents from Apify datasets."""
from typing import Any, Callable, Dict, List

from pydantic import BaseModel, root_validator

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader


class ApifyDatasetLoader(BaseLoader, BaseModel):
    """Logic for loading documents from Apify datasets."""

    apify_client: Any
    dataset_id: str
    """The ID of the dataset on the Apify platform."""
    dataset_mapping_function: Callable[[Dict], Document]
    """A custom function that takes a single dictionary (an Apify dataset item)
    and converts it to an instance of the Document class."""

    def __init__(
        self, dataset_id: str, dataset_mapping_function: Callable[[Dict], Document]
    ):
        """Initialize the loader with an Apify dataset ID and a mapping function.

        Args:
            dataset_id (str): The ID of the dataset on the Apify platform.
            dataset_mapping_function (Callable): A function that takes a single
                dictionary (an Apify dataset item) and converts it to an instance
                of the Document class.
        """
        super().__init__(
            dataset_id=dataset_id, dataset_mapping_function=dataset_mapping_function
        )

    @root_validator()
    def validate_environment(cls, values: Dict) -> Dict:
        """Validate environment."""

        try:
            from apify_client import ApifyClient

            values["apify_client"] = ApifyClient()
        except ImportError:
            raise ValueError(
                "Could not import apify-client Python package. "
                "Please install it with `pip install apify-client`."
            )

        return values

    def load(self) -> List[Document]:
        """Load documents."""
        dataset_items = self.apify_client.dataset(self.dataset_id).list_items().items
        return list(map(self.dataset_mapping_function, dataset_items))
@@ -1,6 +1,7 @@
 """General utilities."""
 from langchain.python import PythonREPL
 from langchain.requests import RequestsWrapper
+from langchain.utilities.apify import ApifyWrapper
 from langchain.utilities.bash import BashProcess
 from langchain.utilities.bing_search import BingSearchAPIWrapper
 from langchain.utilities.google_search import GoogleSearchAPIWrapper
@@ -12,6 +13,7 @@ from langchain.utilities.wikipedia import WikipediaAPIWrapper
 from langchain.utilities.wolfram_alpha import WolframAlphaAPIWrapper

 __all__ = [
+    "ApifyWrapper",
     "BashProcess",
     "RequestsWrapper",
     "PythonREPL",
123
langchain/utilities/apify.py
Normal file
@@ -0,0 +1,123 @@
from typing import Any, Callable, Dict, Optional

from pydantic import BaseModel, root_validator

from langchain.document_loaders import ApifyDatasetLoader
from langchain.document_loaders.base import Document
from langchain.utils import get_from_dict_or_env


class ApifyWrapper(BaseModel):
    """Wrapper around Apify.

    To use, you should have the ``apify-client`` python package installed,
    and the environment variable ``APIFY_API_TOKEN`` set with your API key, or pass
    `apify_api_token` as a named parameter to the constructor.
    """

    apify_client: Any
    apify_client_async: Any

    @root_validator()
    def validate_environment(cls, values: Dict) -> Dict:
        """Validate environment.

        Validate that an Apify API token is set and the apify-client
        Python package exists in the current environment.
        """
        apify_api_token = get_from_dict_or_env(
            values, "apify_api_token", "APIFY_API_TOKEN"
        )

        try:
            from apify_client import ApifyClient, ApifyClientAsync

            values["apify_client"] = ApifyClient(apify_api_token)
            values["apify_client_async"] = ApifyClientAsync(apify_api_token)
        except ImportError:
            raise ValueError(
                "Could not import apify-client Python package. "
                "Please install it with `pip install apify-client`."
            )

        return values

    def call_actor(
        self,
        actor_id: str,
        run_input: Dict,
        dataset_mapping_function: Callable[[Dict], Document],
        *,
        build: Optional[str] = None,
        memory_mbytes: Optional[int] = None,
        timeout_secs: Optional[int] = None,
    ) -> ApifyDatasetLoader:
        """Run an Actor on the Apify platform and wait for results to be ready.

        Args:
            actor_id (str): The ID or name of the Actor on the Apify platform.
            run_input (Dict): The input object of the Actor that you're trying to run.
            dataset_mapping_function (Callable): A function that takes a single
                dictionary (an Apify dataset item) and converts it to an
                instance of the Document class.
            build (str, optional): Optionally specifies the Actor build to run.
                It can be either a build tag or build number.
            memory_mbytes (int, optional): Optional memory limit for the run,
                in megabytes.
            timeout_secs (int, optional): Optional timeout for the run, in seconds.

        Returns:
            ApifyDatasetLoader: A loader that will fetch the records from the
                Actor run's default dataset.
        """
        actor_call = self.apify_client.actor(actor_id).call(
            run_input=run_input,
            build=build,
            memory_mbytes=memory_mbytes,
            timeout_secs=timeout_secs,
        )

        return ApifyDatasetLoader(
            dataset_id=actor_call["defaultDatasetId"],
            dataset_mapping_function=dataset_mapping_function,
        )

    async def acall_actor(
        self,
        actor_id: str,
        run_input: Dict,
        dataset_mapping_function: Callable[[Dict], Document],
        *,
        build: Optional[str] = None,
        memory_mbytes: Optional[int] = None,
        timeout_secs: Optional[int] = None,
    ) -> ApifyDatasetLoader:
        """Run an Actor on the Apify platform and wait for results to be ready.

        Args:
            actor_id (str): The ID or name of the Actor on the Apify platform.
            run_input (Dict): The input object of the Actor that you're trying to run.
            dataset_mapping_function (Callable): A function that takes a single
                dictionary (an Apify dataset item) and converts it to
                an instance of the Document class.
            build (str, optional): Optionally specifies the Actor build to run.
                It can be either a build tag or build number.
            memory_mbytes (int, optional): Optional memory limit for the run,
                in megabytes.
            timeout_secs (int, optional): Optional timeout for the run, in seconds.

        Returns:
            ApifyDatasetLoader: A loader that will fetch the records from the
                Actor run's default dataset.
        """
        actor_call = await self.apify_client_async.actor(actor_id).call(
            run_input=run_input,
            build=build,
            memory_mbytes=memory_mbytes,
            timeout_secs=timeout_secs,
        )

        return ApifyDatasetLoader(
            dataset_id=actor_call["defaultDatasetId"],
            dataset_mapping_function=dataset_mapping_function,
        )
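
For reference, a hedged sketch of calling the async variant `acall_actor` added above (assumes `APIFY_API_TOKEN` is set in the environment; the Actor ID and start URL follow the walkthrough notebook):

```python
import asyncio

from langchain.document_loaders.base import Document
from langchain.utilities import ApifyWrapper


async def main() -> None:
    apify = ApifyWrapper()
    # Start the Actor run without blocking the event loop and wait for it to finish.
    loader = await apify.acall_actor(
        actor_id="apify/website-content-crawler",
        run_input={"startUrls": [{"url": "https://python.langchain.com/en/latest/"}]},
        dataset_mapping_function=lambda item: Document(
            page_content=item["text"] or "", metadata={"source": item["url"]}
        ),
    )
    # loader.load() itself is synchronous; it fetches the dataset records.
    print(len(loader.load()))


asyncio.run(main())
```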