Harrison/apify (#2215)

Co-authored-by: Jiří Moravčík <jiri.moravcik@gmail.com>
2023-03-30 20:58:14 -07:00 · 2023-03-30 20:58:14 -07:00 · 2eeaccf01c
commit 2eeaccf01c
parent e6a9ee64b3
9 changed files with 567 additions and 0 deletions
--- a/docs/_static/ApifyActors.png
+++ b/docs/_static/ApifyActors.png
--- a/docs/ecosystem/apify.md
+++ b/docs/ecosystem/apify.md
@ -0,0 +1,46 @@
 # Apify
 This page covers how to use [Apify](https://apify.com) within LangChain.
 ## Overview
 Apify is a cloud platform for web scraping and data extraction,
 which provides an [ecosystem](https://apify.com/store) of more than a thousand
 ready-made apps called *Actors* for various scraping, crawling, and extraction use cases.
 [![Apify Actors](../_static/ApifyActors.png)](https://apify.com/store)
 This integration enables you run Actors on the Apify platform and load their results into LangChain to feed your vector
 indexes with documents and data from the web, e.g. to generate answers from websites with documentation,
 blogs, or knowledge bases.
 ## Installation and Setup
 - Install the Apify API client for Python with `pip install apify-client`
 - Get your [Apify API token](https://console.apify.com/account/integrations) and either set it as
  an environment variable (`APIFY_API_TOKEN`) or pass it to the `ApifyWrapper` as `apify_api_token` in the constructor.
 ## Wrappers
 ### Utility
 You can use the `ApifyWrapper` to run Actors on the Apify platform.
 ```python
 from langchain.utilities import ApifyWrapper
 ```
 For a more detailed walkthrough of this wrapper, see [this notebook](../modules/agents/tools/examples/apify.ipynb).
 ### Loader
 You can also use our `ApifyDatasetLoader` to get data from Apify dataset.
 ```python
 from langchain.document_loaders import ApifyDatasetLoader
 ```
 For a more detailed walkthrough of this loader, see [this notebook](../modules/indexes/document_loaders/examples/apify_dataset.ipynb).
--- a/docs/modules/agents/tools/examples/apify.ipynb
+++ b/docs/modules/agents/tools/examples/apify.ipynb
@ -0,0 +1,164 @@
 {
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Apify\n",
    "\n",
    "This notebook shows how to use the [Apify integration](../../../../ecosystem/apify.md) for LangChain.\n",
    "\n",
    "[Apify](https://apify.com) is a cloud platform for web scraping and data extraction,\n",
    "which provides an [ecosystem](https://apify.com/store) of more than a thousand\n",
    "ready-made apps called *Actors* for various web scraping, crawling, and data extraction use cases.\n",
    "For example, you can use it to extract Google Search results, Instagram and Facebook profiles, products from Amazon or Shopify, Google Maps reviews, etc. etc.\n",
    "\n",
    "In this example, we'll use the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor,\n",
    "which can deeply crawl websites such as documentation, knowledge bases, help centers, or blogs,\n",
    "and extract text content from the web pages. Then we feed the documents into a vector index and answer questions from it.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, import `ApifyWrapper` into your source code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders.base import Document\n",
    "from langchain.indexes import VectorstoreIndexCreator\n",
    "from langchain.utilities import ApifyWrapper"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize it using your [Apify API token](https://console.apify.com/account/integrations) and for the purpose of this example, also with your OpenAI API key:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"Your OpenAI API key\"\n",
    "os.environ[\"APIFY_API_TOKEN\"] = \"Your Apify API token\"\n",
    "\n",
    "apify = ApifyWrapper()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then run the Actor, wait for it to finish, and fetch its results from the Apify dataset into a LangChain document loader.\n",
    "\n",
    "Note that if you already have some results in an Apify dataset, you can load them directly using `ApifyDatasetLoader`, as shown in [this notebook](../../../indexes/document_loaders/examples/apify_dataset.ipynb). In that notebook, you'll also find the explanation of the `dataset_mapping_function`, which is used to map fields from the Apify dataset records to LangChain `Document` fields."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = apify.call_actor(\n",
    "    actor_id=\"apify/website-content-crawler\",\n",
    "    run_input={\"startUrls\": [{\"url\": \"https://python.langchain.com/en/latest/\"}]},\n",
    "    dataset_mapping_function=lambda item: Document(\n",
    "        page_content=item[\"text\"] or \"\", metadata={\"source\": item[\"url\"]}\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize the vector index from the crawled documents:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "index = VectorstoreIndexCreator().from_loaders([loader])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And finally, query the vector index:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"What is LangChain?\"\n",
    "result = index.query_with_sources(query)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " LangChain is a standard interface through which you can interact with a variety of large language models (LLMs). It provides modules that can be used to build language model applications, and it also provides chains and agents with memory capabilities.\n",
      "\n",
      "https://python.langchain.com/en/latest/modules/models/llms.html, https://python.langchain.com/en/latest/getting_started/getting_started.html\n"
     ]
    }
   ],
   "source": [
    "print(result[\"answer\"])\n",
    "print(result[\"sources\"])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
--- a/docs/modules/indexes/document_loaders/examples/apify_dataset.ipynb
+++ b/docs/modules/indexes/document_loaders/examples/apify_dataset.ipynb
@ -0,0 +1,175 @@
 {
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Apify Dataset\n",
    "\n",
    "This notebook shows how to load Apify datasets to LangChain.\n",
    "\n",
    "[Apify Dataset](https://docs.apify.com/platform/storage/dataset) is a scaleable append-only storage with sequential access built for storing structured web scraping results, such as a list of products or Google SERPs, and then export them to various formats like JSON, CSV, or Excel. Datasets are mainly used to save results of [Apify Actors](https://apify.com/store)—serverless cloud programs for varius web scraping, crawling, and data extraction use cases.\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "You need to have an existing dataset on the Apify platform. If you don't have one, please first check out [this notebook](../../../agents/tools/examples/apify.ipynb) on how to use Apify to extract content from documentation, knowledge bases, help centers, or blogs."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, import `ApifyDatasetLoader` into your source code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import ApifyDatasetLoader\n",
    "from langchain.document_loaders.base import Document"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then provide a function that maps Apify dataset record fields to LangChain `Document` format.\n",
    "\n",
    "For example, if your dataset items are structured like this:\n",
    "\n",
    "```json\n",
    "{\n",
    "    \"url\": \"https://apify.com\",\n",
    "    \"text\": \"Apify is the best web scraping and automation platform.\"\n",
    "}\n",
    "```\n",
    "\n",
    "The mapping function in the code below will convert them to LangChain `Document` format, so that you can use them further with any LLM model (e.g. for question answering)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = ApifyDatasetLoader(\n",
    "    dataset_id=\"your-dataset-id\",\n",
    "    dataset_mapping_function=lambda dataset_item: Document(\n",
    "        page_content=dataset_item[\"text\"], metadata={\"source\": dataset_item[\"url\"]}\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = loader.load()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## An example with question answering\n",
    "\n",
    "In this example, we use data from a dataset to answer a question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.docstore.document import Document\n",
    "from langchain.document_loaders import ApifyDatasetLoader\n",
    "from langchain.indexes import VectorstoreIndexCreator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = ApifyDatasetLoader(\n",
    "    dataset_id=\"your-dataset-id\",\n",
    "    dataset_mapping_function=lambda item: Document(\n",
    "        page_content=item[\"text\"] or \"\", metadata={\"source\": item[\"url\"]}\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "index = VectorstoreIndexCreator().from_loaders([loader])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"What is Apify?\"\n",
    "result = index.query_with_sources(query)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " Apify is a platform for developing, running, and sharing serverless cloud programs. It enables users to create web scraping and automation tools and publish them on the Apify platform.\n",
      "\n",
      "https://docs.apify.com/platform/actors, https://docs.apify.com/platform/actors/running/actors-in-store, https://docs.apify.com/platform/security, https://docs.apify.com/platform/actors/examples\n"
     ]
    }
   ],
   "source": [
    "print(result[\"answer\"])\n",
    "print(result[\"sources\"])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
--- a/langchain/agents/load_tools.py
+++ b/langchain/agents/load_tools.py
@ -19,6 +19,7 @@ from langchain.tools.python.tool import PythonREPLTool
 from langchain.tools.requests.tool import RequestsGetTool
 from langchain.tools.wikipedia.tool import WikipediaQueryRun
 from langchain.tools.wolfram_alpha.tool import WolframAlphaQueryRun
 from langchain.utilities.apify import ApifyWrapper
 from langchain.utilities.bash import BashProcess
 from langchain.utilities.bing_search import BingSearchAPIWrapper
 from langchain.utilities.google_search import GoogleSearchAPIWrapper
--- a/langchain/document_loaders/init.py
+++ b/langchain/document_loaders/init.py
@ -1,6 +1,7 @@
 """All different types of document loaders."""
 from langchain.document_loaders.airbyte_json import AirbyteJSONLoader
 from langchain.document_loaders.apify_dataset import ApifyDatasetLoader
 from langchain.document_loaders.azlyrics import AZLyricsLoader
 from langchain.document_loaders.azure_blob_storage_container import (
    AzureBlobStorageContainerLoader,
@ -119,6 +120,7 @@ __all__ = [
    "GoogleApiClient",
    "CSVLoader",
    "BlackboardLoader",
    "ApifyDatasetLoader",
    "WhatsAppChatLoader",
    "DataFrameLoader",
    "AzureBlobStorageFileLoader",
--- a/langchain/document_loaders/apify_dataset.py
+++ b/langchain/document_loaders/apify_dataset.py
@ -0,0 +1,54 @@
 """Logic for loading documents from Apify datasets."""
 from typing import Any, Callable, Dict, List
 from pydantic import BaseModel, root_validator
 from langchain.docstore.document import Document
 from langchain.document_loaders.base import BaseLoader
 class ApifyDatasetLoader(BaseLoader, BaseModel):
    """Logic for loading documents from Apify datasets."""
    apify_client: Any
    dataset_id: str
    """The ID of the dataset on the Apify platform."""
    dataset_mapping_function: Callable[[Dict], Document]
    """A custom function that takes a single dictionary (an Apify dataset item)
     and converts it to an instance of the Document class."""
    def __init__(
        self, dataset_id: str, dataset_mapping_function: Callable[[Dict], Document]
    ):
        """Initialize the loader with an Apify dataset ID and a mapping function.
        Args:
            dataset_id (str): The ID of the dataset on the Apify platform.
            dataset_mapping_function (Callable): A function that takes a single
                dictionary (an Apify dataset item) and converts it to an instance
                of the Document class.
        """
        super().__init__(
            dataset_id=dataset_id, dataset_mapping_function=dataset_mapping_function
        )
    @root_validator()
    def validate_environment(cls, values: Dict) -> Dict:
        """Validate environment."""
        try:
            from apify_client import ApifyClient
            values["apify_client"] = ApifyClient()
        except ImportError:
            raise ValueError(
                "Could not import apify-client Python package. "
                "Please install it with `pip install apify-client`."
            )
        return values
    def load(self) -> List[Document]:
        """Load documents."""
        dataset_items = self.apify_client.dataset(self.dataset_id).list_items().items
        return list(map(self.dataset_mapping_function, dataset_items))
--- a/langchain/utilities/init.py
+++ b/langchain/utilities/init.py
@ -1,6 +1,7 @@
 """General utilities."""
 from langchain.python import PythonREPL
 from langchain.requests import RequestsWrapper
 from langchain.utilities.apify import ApifyWrapper
 from langchain.utilities.bash import BashProcess
 from langchain.utilities.bing_search import BingSearchAPIWrapper
 from langchain.utilities.google_search import GoogleSearchAPIWrapper
@ -12,6 +13,7 @@ from langchain.utilities.wikipedia import WikipediaAPIWrapper
 from langchain.utilities.wolfram_alpha import WolframAlphaAPIWrapper
 __all__ = [
    "ApifyWrapper",
    "BashProcess",
    "RequestsWrapper",
    "PythonREPL",
--- a/langchain/utilities/apify.py
+++ b/langchain/utilities/apify.py
@ -0,0 +1,123 @@
 from typing import Any, Callable, Dict, Optional
 from pydantic import BaseModel, root_validator
 from langchain.document_loaders import ApifyDatasetLoader
 from langchain.document_loaders.base import Document
 from langchain.utils import get_from_dict_or_env
 class ApifyWrapper(BaseModel):
    """Wrapper around Apify.
    To use, you should have the ``apify-client`` python package installed,
    and the environment variable ``APIFY_API_TOKEN`` set with your API key, or pass
    `apify_api_token` as a named parameter to the constructor.
    """
    apify_client: Any
    apify_client_async: Any
    @root_validator()
    def validate_environment(cls, values: Dict) -> Dict:
        """Validate environment.
        Validate that an Apify API token is set and the apify-client
        Python package exists in the current environment.
        """
        apify_api_token = get_from_dict_or_env(
            values, "apify_api_token", "APIFY_API_TOKEN"
        )
        try:
            from apify_client import ApifyClient, ApifyClientAsync
            values["apify_client"] = ApifyClient(apify_api_token)
            values["apify_client_async"] = ApifyClientAsync(apify_api_token)
        except ImportError:
            raise ValueError(
                "Could not import apify-client Python package. "
                "Please install it with `pip install apify-client`."
            )
        return values
    def call_actor(
        self,
        actor_id: str,
        run_input: Dict,
        dataset_mapping_function: Callable[[Dict], Document],
        *,
        build: Optional[str] = None,
        memory_mbytes: Optional[int] = None,
        timeout_secs: Optional[int] = None,
    ) -> ApifyDatasetLoader:
        """Run an Actor on the Apify platform and wait for results to be ready.
        Args:
            actor_id (str): The ID or name of the Actor on the Apify platform.
            run_input (Dict): The input object of the Actor that you're trying to run.
            dataset_mapping_function (Callable): A function that takes a single
                dictionary (an Apify dataset item) and converts it to an
                instance of the Document class.
            build (str, optional): Optionally specifies the actor build to run.
                It can be either a build tag or build number.
            memory_mbytes (int, optional): Optional memory limit for the run,
                in megabytes.
            timeout_secs (int, optional): Optional timeout for the run, in seconds.
        Returns:
            ApifyDatasetLoader: A loader that will fetch the records from the
                Actor run's default dataset.
        """
        actor_call = self.apify_client.actor(actor_id).call(
            run_input=run_input,
            build=build,
            memory_mbytes=memory_mbytes,
            timeout_secs=timeout_secs,
        )
        return ApifyDatasetLoader(
            dataset_id=actor_call["defaultDatasetId"],
            dataset_mapping_function=dataset_mapping_function,
        )
    async def acall_actor(
        self,
        actor_id: str,
        run_input: Dict,
        dataset_mapping_function: Callable[[Dict], Document],
        *,
        build: Optional[str] = None,
        memory_mbytes: Optional[int] = None,
        timeout_secs: Optional[int] = None,
    ) -> ApifyDatasetLoader:
        """Run an Actor on the Apify platform and wait for results to be ready.
        Args:
            actor_id (str): The ID or name of the Actor on the Apify platform.
            run_input (Dict): The input object of the Actor that you're trying to run.
            dataset_mapping_function (Callable): A function that takes a single
                dictionary (an Apify dataset item) and converts it to
                an instance of the Document class.
            build (str, optional): Optionally specifies the actor build to run.
                It can be either a build tag or build number.
            memory_mbytes (int, optional): Optional memory limit for the run,
                in megabytes.
            timeout_secs (int, optional): Optional timeout for the run, in seconds.
        Returns:
            ApifyDatasetLoader: A loader that will fetch the records from the
                Actor run's default dataset.
        """
        actor_call = await self.apify_client_async.actor(actor_id).call(
            run_input=run_input,
            build=build,
            memory_mbytes=memory_mbytes,
            timeout_secs=timeout_secs,
        )
        return ApifyDatasetLoader(
            dataset_id=actor_call["defaultDatasetId"],
            dataset_mapping_function=dataset_mapping_function,
        )