diff --git a/docs/_static/ApifyActors.png b/docs/_static/ApifyActors.png
new file mode 100644
index 00000000..5c2a7bc1
Binary files /dev/null and b/docs/_static/ApifyActors.png differ
diff --git a/docs/ecosystem/apify.md b/docs/ecosystem/apify.md
new file mode 100644
index 00000000..f1f14efb
--- /dev/null
+++ b/docs/ecosystem/apify.md
@@ -0,0 +1,46 @@
+# Apify
+
+This page covers how to use [Apify](https://apify.com) within LangChain.
+
+## Overview
+
+Apify is a cloud platform for web scraping and data extraction,
+which provides an [ecosystem](https://apify.com/store) of more than a thousand
+ready-made apps called *Actors* for various scraping, crawling, and extraction use cases.
+
+[![Apify Actors](../_static/ApifyActors.png)](https://apify.com/store)
+
+This integration enables you to run Actors on the Apify platform and load their results into LangChain to feed your vector
+indexes with documents and data from the web, e.g. to generate answers from websites with documentation,
+blogs, or knowledge bases.
+
+
+## Installation and Setup
+
+- Install the Apify API client for Python with `pip install apify-client`
+- Get your [Apify API token](https://console.apify.com/account/integrations) and either set it as
+  an environment variable (`APIFY_API_TOKEN`) or pass it to the `ApifyWrapper` as `apify_api_token` in the constructor, as shown below.
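+
+For example, a minimal setup might look like this (a sketch; the token value is a placeholder):
+
+```python
+import os
+
+from langchain.utilities import ApifyWrapper
+
+# Option 1: read the API token from the APIFY_API_TOKEN environment variable
+os.environ["APIFY_API_TOKEN"] = "Your Apify API token"
+apify = ApifyWrapper()
+
+# Option 2: pass the API token to the constructor directly
+apify = ApifyWrapper(apify_api_token="Your Apify API token")
+```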
+
+
+## Wrappers
+
+### Utility
+
+You can use the `ApifyWrapper` to run Actors on the Apify platform.
+
+```python
+from langchain.utilities import ApifyWrapper
+```
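+
+For example, here is a minimal sketch of running an Actor and loading its results (the Actor ID and run input come from the walkthrough notebook below; adjust them to your use case):
+
+```python
+from langchain.document_loaders.base import Document
+from langchain.utilities import ApifyWrapper
+
+apify = ApifyWrapper()
+
+# Run the Actor, wait for it to finish, and get a loader for its default dataset
+loader = apify.call_actor(
+    actor_id="apify/website-content-crawler",
+    run_input={"startUrls": [{"url": "https://python.langchain.com/en/latest/"}]},
+    dataset_mapping_function=lambda item: Document(
+        page_content=item["text"] or "", metadata={"source": item["url"]}
+    ),
+)
+documents = loader.load()
+```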
+
+For a more detailed walkthrough of this wrapper, see [this notebook](../modules/agents/tools/examples/apify.ipynb).
+
+
+### Loader
+
+You can also use our `ApifyDatasetLoader` to get data from an existing Apify dataset.
+
+```python
+from langchain.document_loaders import ApifyDatasetLoader
+```
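+
+For example, a minimal sketch (the dataset ID is a placeholder, and the mapping function assumes records with `text` and `url` fields):
+
+```python
+from langchain.document_loaders import ApifyDatasetLoader
+from langchain.document_loaders.base import Document
+
+loader = ApifyDatasetLoader(
+    dataset_id="your-dataset-id",
+    dataset_mapping_function=lambda item: Document(
+        page_content=item["text"], metadata={"source": item["url"]}
+    ),
+)
+documents = loader.load()
+```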
+
+For a more detailed walkthrough of this loader, see [this notebook](../modules/indexes/document_loaders/examples/apify_dataset.ipynb).
diff --git a/docs/modules/agents/tools/examples/apify.ipynb b/docs/modules/agents/tools/examples/apify.ipynb
new file mode 100644
index 00000000..26d6072b
--- /dev/null
+++ b/docs/modules/agents/tools/examples/apify.ipynb
@@ -0,0 +1,164 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Apify\n",
+    "\n",
+    "This notebook shows how to use the [Apify integration](../../../../ecosystem/apify.md) for LangChain.\n",
+    "\n",
+    "[Apify](https://apify.com) is a cloud platform for web scraping and data extraction,\n",
+    "which provides an [ecosystem](https://apify.com/store) of more than a thousand\n",
+    "ready-made apps called *Actors* for various web scraping, crawling, and data extraction use cases.\n",
+    "For example, you can use it to extract Google Search results, Instagram and Facebook profiles, products from Amazon or Shopify, Google Maps reviews, and more.\n",
+    "\n",
+    "In this example, we'll use the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor,\n",
+    "which can deeply crawl websites such as documentation, knowledge bases, help centers, or blogs,\n",
+    "and extract text content from the web pages.\n",
+    "Then we feed the documents into a vector index and answer questions from it."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "First, import `ApifyWrapper` and the other required classes into your source code:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.document_loaders.base import Document\n",
+    "from langchain.indexes import VectorstoreIndexCreator\n",
+    "from langchain.utilities import ApifyWrapper"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Initialize it using your [Apify API token](https://console.apify.com/account/integrations) and, for the purpose of this example, also your OpenAI API key:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "os.environ[\"OPENAI_API_KEY\"] = \"Your OpenAI API key\"\n",
+    "os.environ[\"APIFY_API_TOKEN\"] = \"Your Apify API token\"\n",
+    "\n",
+    "apify = ApifyWrapper()"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Then run the Actor, wait for it to finish, and fetch its results from the Apify dataset into a LangChain document loader.\n",
+    "\n",
+    "Note that if you already have some results in an Apify dataset, you can load them directly using `ApifyDatasetLoader`, as shown in [this notebook](../../../indexes/document_loaders/examples/apify_dataset.ipynb). In that notebook, you'll also find an explanation of the `dataset_mapping_function`, which is used to map fields from the Apify dataset records to LangChain `Document` fields."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = apify.call_actor(\n",
+    "    actor_id=\"apify/website-content-crawler\",\n",
+    "    run_input={\"startUrls\": [{\"url\": \"https://python.langchain.com/en/latest/\"}]},\n",
+    "    dataset_mapping_function=lambda item: Document(\n",
+    "        page_content=item[\"text\"] or \"\", metadata={\"source\": item[\"url\"]}\n",
+    "    ),\n",
+    ")"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Initialize the vector index from the crawled documents:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "index = VectorstoreIndexCreator().from_loaders([loader])"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "And finally, query the vector index:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query = \"What is LangChain?\"\n",
+    "result = index.query_with_sources(query)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      " LangChain is a standard interface through which you can interact with a variety of large language models (LLMs). It provides modules that can be used to build language model applications, and it also provides chains and agents with memory capabilities.\n",
+      "\n",
+      "https://python.langchain.com/en/latest/modules/models/llms.html, https://python.langchain.com/en/latest/getting_started/getting_started.html\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(result[\"answer\"])\n",
+    "print(result[\"sources\"])"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.16"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/docs/modules/indexes/document_loaders/examples/apify_dataset.ipynb b/docs/modules/indexes/document_loaders/examples/apify_dataset.ipynb
new file mode 100644
index 00000000..e02e836d
--- /dev/null
+++ b/docs/modules/indexes/document_loaders/examples/apify_dataset.ipynb
@@ -0,0 +1,175 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Apify Dataset\n",
+    "\n",
+    "This notebook shows how to load Apify datasets into LangChain.\n",
+    "\n",
+    "[Apify Dataset](https://docs.apify.com/platform/storage/dataset) is a scalable, append-only storage with sequential access, built for storing structured web scraping results, such as a list of products or Google SERPs, and for exporting them to various formats like JSON, CSV, or Excel. Datasets are mainly used to save results of [Apify Actors](https://apify.com/store), serverless cloud programs for various web scraping, crawling, and data extraction use cases.\n",
+    "\n",
+    "## Prerequisites\n",
+    "\n",
+    "You need to have an existing dataset on the Apify platform. If you don't have one, please first check out [this notebook](../../../agents/tools/examples/apify.ipynb) on how to use Apify to extract content from documentation, knowledge bases, help centers, or blogs."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "First, import `ApifyDatasetLoader` and LangChain's `Document` class into your source code:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.document_loaders import ApifyDatasetLoader\n",
+    "from langchain.document_loaders.base import Document"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Then provide a function that maps Apify dataset record fields to the LangChain `Document` format.\n",
+    "\n",
+    "For example, if your dataset items are structured like this:\n",
+    "\n",
+    "```json\n",
+    "{\n",
+    "    \"url\": \"https://apify.com\",\n",
+    "    \"text\": \"Apify is the best web scraping and automation platform.\"\n",
+    "}\n",
+    "```\n",
+    "\n",
+    "the mapping function in the code below will convert them to the LangChain `Document` format, so that you can use them further with any LLM (e.g. for question answering)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = ApifyDatasetLoader(\n",
+    "    dataset_id=\"your-dataset-id\",\n",
+    "    dataset_mapping_function=lambda dataset_item: Document(\n",
+    "        page_content=dataset_item[\"text\"], metadata={\"source\": dataset_item[\"url\"]}\n",
+    "    ),\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data = loader.load()"
+   ]
+  },
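+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To quickly check that the mapping worked, you can inspect the first loaded document (the output depends entirely on your dataset):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(data[0].page_content)\n",
+    "print(data[0].metadata)"
+   ]
+  },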
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## An example with question answering\n",
+    "\n",
+    "In this example, we use data from a dataset to answer a question."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.docstore.document import Document\n",
+    "from langchain.document_loaders import ApifyDatasetLoader\n",
+    "from langchain.indexes import VectorstoreIndexCreator"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = ApifyDatasetLoader(\n",
+    "    dataset_id=\"your-dataset-id\",\n",
+    "    dataset_mapping_function=lambda item: Document(\n",
+    "        page_content=item[\"text\"] or \"\", metadata={\"source\": item[\"url\"]}\n",
+    "    ),\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "index = VectorstoreIndexCreator().from_loaders([loader])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query = \"What is Apify?\"\n",
+    "result = index.query_with_sources(query)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      " Apify is a platform for developing, running, and sharing serverless cloud programs. It enables users to create web scraping and automation tools and publish them on the Apify platform.\n",
+      "\n",
+      "https://docs.apify.com/platform/actors, https://docs.apify.com/platform/actors/running/actors-in-store, https://docs.apify.com/platform/security, https://docs.apify.com/platform/actors/examples\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(result[\"answer\"])\n",
+    "print(result[\"sources\"])"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.16"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/langchain/agents/load_tools.py b/langchain/agents/load_tools.py
index 32b1463d..269c5e48 100644
--- a/langchain/agents/load_tools.py
+++ b/langchain/agents/load_tools.py
@@ -19,6 +19,7 @@ from langchain.tools.python.tool import PythonREPLTool
 from langchain.tools.requests.tool import RequestsGetTool
 from langchain.tools.wikipedia.tool import WikipediaQueryRun
 from langchain.tools.wolfram_alpha.tool import WolframAlphaQueryRun
+from langchain.utilities.apify import ApifyWrapper
 from langchain.utilities.bash import BashProcess
 from langchain.utilities.bing_search import BingSearchAPIWrapper
 from langchain.utilities.google_search import GoogleSearchAPIWrapper
diff --git a/langchain/document_loaders/__init__.py b/langchain/document_loaders/__init__.py
index 2b6a35d4..746a94be 100644
--- a/langchain/document_loaders/__init__.py
+++ b/langchain/document_loaders/__init__.py
@@ -1,6 +1,7 @@
 """All different types of document loaders."""
 from langchain.document_loaders.airbyte_json import AirbyteJSONLoader
+from langchain.document_loaders.apify_dataset import ApifyDatasetLoader
 from langchain.document_loaders.azlyrics import AZLyricsLoader
 from langchain.document_loaders.azure_blob_storage_container import (
     AzureBlobStorageContainerLoader,
 )
@@ -119,6 +120,7 @@ __all__ = [
     "GoogleApiClient",
     "CSVLoader",
     "BlackboardLoader",
+    "ApifyDatasetLoader",
     "WhatsAppChatLoader",
     "DataFrameLoader",
     "AzureBlobStorageFileLoader",
diff --git a/langchain/document_loaders/apify_dataset.py b/langchain/document_loaders/apify_dataset.py
new file mode 100644
index 00000000..aae71aa7
--- /dev/null
+++ b/langchain/document_loaders/apify_dataset.py
@@ -0,0 +1,54 @@
+"""Logic for loading documents from Apify datasets."""
+from typing import Any, Callable, Dict, List
+
+from pydantic import BaseModel, root_validator
+
+from langchain.docstore.document import Document
+from langchain.document_loaders.base import BaseLoader
+
+
+class ApifyDatasetLoader(BaseLoader, BaseModel):
+    """Logic for loading documents from Apify datasets."""
+
+    apify_client: Any
+    dataset_id: str
+    """The ID of the dataset on the Apify platform."""
+    dataset_mapping_function: Callable[[Dict], Document]
+    """A custom function that takes a single dictionary (an Apify dataset item)
+    and converts it to an instance of the Document class."""
+
+    def __init__(
+        self, dataset_id: str, dataset_mapping_function: Callable[[Dict], Document]
+    ):
+        """Initialize the loader with an Apify dataset ID and a mapping function.
+
+        Args:
+            dataset_id (str): The ID of the dataset on the Apify platform.
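+        # Fetch all items from the dataset and convert each record to a Document
+        # using the user-supplied mapping function.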
+        dataset_items = self.apify_client.dataset(self.dataset_id).list_items().items
+        return list(map(self.dataset_mapping_function, dataset_items))
diff --git a/langchain/utilities/__init__.py b/langchain/utilities/__init__.py
index b8103348..c9822364 100644
--- a/langchain/utilities/__init__.py
+++ b/langchain/utilities/__init__.py
@@ -1,6 +1,7 @@
 """General utilities."""
 from langchain.python import PythonREPL
 from langchain.requests import RequestsWrapper
+from langchain.utilities.apify import ApifyWrapper
 from langchain.utilities.bash import BashProcess
 from langchain.utilities.bing_search import BingSearchAPIWrapper
 from langchain.utilities.google_search import GoogleSearchAPIWrapper
@@ -12,6 +13,7 @@ from langchain.utilities.wikipedia import WikipediaAPIWrapper
 from langchain.utilities.wolfram_alpha import WolframAlphaAPIWrapper
 
 __all__ = [
+    "ApifyWrapper",
     "BashProcess",
     "RequestsWrapper",
     "PythonREPL",
diff --git a/langchain/utilities/apify.py b/langchain/utilities/apify.py
new file mode 100644
index 00000000..bf1527f1
--- /dev/null
+++ b/langchain/utilities/apify.py
@@ -0,0 +1,123 @@
+from typing import Any, Callable, Dict, Optional
+
+from pydantic import BaseModel, root_validator
+
+from langchain.document_loaders import ApifyDatasetLoader
+from langchain.document_loaders.base import Document
+from langchain.utils import get_from_dict_or_env
+
+
+class ApifyWrapper(BaseModel):
+    """Wrapper around Apify.
+
+    To use, you should have the ``apify-client`` python package installed,
+    and the environment variable ``APIFY_API_TOKEN`` set with your API token, or pass
+    `apify_api_token` as a named parameter to the constructor.
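+
+    Example (an illustrative sketch; the Actor ID and its input are placeholders):
+        .. code-block:: python
+
+            from langchain.document_loaders.base import Document
+            from langchain.utilities import ApifyWrapper
+
+            apify = ApifyWrapper()
+            loader = apify.call_actor(
+                actor_id="apify/website-content-crawler",
+                run_input={"startUrls": [{"url": "https://python.langchain.com"}]},
+                dataset_mapping_function=lambda item: Document(
+                    page_content=item["text"] or "", metadata={"source": item["url"]}
+                ),
+            )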
+    """
+
+    apify_client: Any
+    apify_client_async: Any
+
+    @root_validator()
+    def validate_environment(cls, values: Dict) -> Dict:
+        """Validate environment.
+
+        Validate that an Apify API token is set and the apify-client
+        Python package is installed in the current environment.
+        """
+        apify_api_token = get_from_dict_or_env(
+            values, "apify_api_token", "APIFY_API_TOKEN"
+        )
+
+        try:
+            from apify_client import ApifyClient, ApifyClientAsync
+
+            values["apify_client"] = ApifyClient(apify_api_token)
+            values["apify_client_async"] = ApifyClientAsync(apify_api_token)
+        except ImportError:
+            raise ValueError(
+                "Could not import apify-client Python package. "
+                "Please install it with `pip install apify-client`."
+            )
+
+        return values
+
+    def call_actor(
+        self,
+        actor_id: str,
+        run_input: Dict,
+        dataset_mapping_function: Callable[[Dict], Document],
+        *,
+        build: Optional[str] = None,
+        memory_mbytes: Optional[int] = None,
+        timeout_secs: Optional[int] = None,
+    ) -> ApifyDatasetLoader:
+        """Run an Actor on the Apify platform and wait for results to be ready.
+
+        Args:
+            actor_id (str): The ID or name of the Actor on the Apify platform.
+            run_input (Dict): The input object of the Actor that you're trying to run.
+            dataset_mapping_function (Callable): A function that takes a single
+                dictionary (an Apify dataset item) and converts it to an
+                instance of the Document class.
+            build (str, optional): Optionally specifies the Actor build to run.
+                It can be either a build tag or build number.
+            memory_mbytes (int, optional): Optional memory limit for the run,
+                in megabytes.
+            timeout_secs (int, optional): Optional timeout for the run, in seconds.
+
+        Returns:
+            ApifyDatasetLoader: A loader that will fetch the records from the
+                Actor run's default dataset.
+        """
+        actor_call = self.apify_client.actor(actor_id).call(
+            run_input=run_input,
+            build=build,
+            memory_mbytes=memory_mbytes,
+            timeout_secs=timeout_secs,
+        )
+
+        return ApifyDatasetLoader(
+            dataset_id=actor_call["defaultDatasetId"],
+            dataset_mapping_function=dataset_mapping_function,
+        )
+
+    async def acall_actor(
+        self,
+        actor_id: str,
+        run_input: Dict,
+        dataset_mapping_function: Callable[[Dict], Document],
+        *,
+        build: Optional[str] = None,
+        memory_mbytes: Optional[int] = None,
+        timeout_secs: Optional[int] = None,
+    ) -> ApifyDatasetLoader:
+        """Run an Actor on the Apify platform and wait for results to be ready.
+
+        Args:
+            actor_id (str): The ID or name of the Actor on the Apify platform.
+            run_input (Dict): The input object of the Actor that you're trying to run.
+            dataset_mapping_function (Callable): A function that takes a single
+                dictionary (an Apify dataset item) and converts it to
+                an instance of the Document class.
+            build (str, optional): Optionally specifies the Actor build to run.
+                It can be either a build tag or build number.
+            memory_mbytes (int, optional): Optional memory limit for the run,
+                in megabytes.
+            timeout_secs (int, optional): Optional timeout for the run, in seconds.
+
+        Returns:
+            ApifyDatasetLoader: A loader that will fetch the records from the
+                Actor run's default dataset.
+        """
+        actor_call = await self.apify_client_async.actor(actor_id).call(
+            run_input=run_input,
+            build=build,
+            memory_mbytes=memory_mbytes,
+            timeout_secs=timeout_secs,
+        )
+
+        return ApifyDatasetLoader(
+            dataset_id=actor_call["defaultDatasetId"],
+            dataset_mapping_function=dataset_mapping_function,
+        )