langchain/docs/extras/modules/agents/tools/integrations/apify.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Apify\n",
    "\n",
    "This notebook shows how to use the [Apify integration](../../../../ecosystem/apify.md) for LangChain.\n",
    "\n",
    "[Apify](https://apify.com) is a cloud platform for web scraping and data extraction,\n",
    "which provides an [ecosystem](https://apify.com/store) of more than a thousand\n",
    "ready-made apps called *Actors* for various web scraping, crawling, and data extraction use cases.\n",
    "For example, you can use it to extract Google Search results, Instagram and Facebook profiles, products from Amazon or Shopify, Google Maps reviews, etc. etc.\n",
    "\n",
    "In this example, we'll use the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor,\n",
    "which can deeply crawl websites such as documentation, knowledge bases, help centers, or blogs,\n",
    "and extract text content from the web pages. Then we feed the documents into a vector index and answer questions from it.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!pip install apify-client"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, import `ApifyWrapper` into your source code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders.base import Document\n",
    "from langchain.indexes import VectorstoreIndexCreator\n",
    "from langchain.utilities import ApifyWrapper"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize it using your [Apify API token](https://console.apify.com/account/integrations) and for the purpose of this example, also with your OpenAI API key:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"Your OpenAI API key\"\n",
    "os.environ[\"APIFY_API_TOKEN\"] = \"Your Apify API token\"\n",
    "\n",
    "apify = ApifyWrapper()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then run the Actor, wait for it to finish, and fetch its results from the Apify dataset into a LangChain document loader.\n",
    "\n",
    "Note that if you already have some results in an Apify dataset, you can load them directly using `ApifyDatasetLoader`, as shown in [this notebook](../../../data_connection/document_loaders/examples/apify_dataset.html). In that notebook, you'll also find the explanation of the `dataset_mapping_function`, which is used to map fields from the Apify dataset records to LangChain `Document` fields."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = apify.call_actor(\n",
    "    actor_id=\"apify/website-content-crawler\",\n",
    "    run_input={\"startUrls\": [{\"url\": \"https://python.langchain.com/en/latest/\"}]},\n",
    "    dataset_mapping_function=lambda item: Document(\n",
    "        page_content=item[\"text\"] or \"\", metadata={\"source\": item[\"url\"]}\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize the vector index from the crawled documents:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "index = VectorstoreIndexCreator().from_loaders([loader])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And finally, query the vector index:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"What is LangChain?\"\n",
    "result = index.query_with_sources(query)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " LangChain is a standard interface through which you can interact with a variety of large language models (LLMs). It provides modules that can be used to build language model applications, and it also provides chains and agents with memory capabilities.\n",
      "\n",
      "https://python.langchain.com/en/latest/modules/models/llms.html, https://python.langchain.com/en/latest/getting_started/getting_started.html\n"
     ]
    }
   ],
   "source": [
    "print(result[\"answer\"])\n",
    "print(result[\"sources\"])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
Harrison/apify (#2215) Co-authored-by: Jiří Moravčík <jiri.moravcik@gmail.com> 2023-03-31 03:58:14 +00:00			`{`
			`"cells": [`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Apify\n",`
			`"\n",`
			`"This notebook shows how to use the [Apify integration](../../../../ecosystem/apify.md) for LangChain.\n",`
			`"\n",`
			`"[Apify](https://apify.com) is a cloud platform for web scraping and data extraction,\n",`
			`"which provides an [ecosystem](https://apify.com/store) of more than a thousand\n",`
			`"ready-made apps called Actors for various web scraping, crawling, and data extraction use cases.\n",`
			`"For example, you can use it to extract Google Search results, Instagram and Facebook profiles, products from Amazon or Shopify, Google Maps reviews, etc. etc.\n",`
			`"\n",`
			`"In this example, we'll use the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor,\n",`
			`"which can deeply crawl websites such as documentation, knowledge bases, help centers, or blogs,\n",`
			`"and extract text content from the web pages. Then we feed the documents into a vector index and answer questions from it.\n"`
			`]`
			`},`
			`{`
Vwp/docs improved document loaders (#4006) Huge thanks to @leo-gan for improving the document loaders notebooks --------- Co-authored-by: Leonid Ganeline <leo.gan.57@gmail.com> 2023-05-02 22:24:53 +00:00			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"#!pip install apify-client"`
			`]`
			`},`
			`{`
Harrison/apify (#2215) Co-authored-by: Jiří Moravčík <jiri.moravcik@gmail.com> 2023-03-31 03:58:14 +00:00			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			"First, import `ApifyWrapper` into your source code:"
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 1,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"from langchain.document_loaders.base import Document\n",`
			`"from langchain.indexes import VectorstoreIndexCreator\n",`
			`"from langchain.utilities import ApifyWrapper"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Initialize it using your [Apify API token](https://console.apify.com/account/integrations) and for the purpose of this example, also with your OpenAI API key:"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 2,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"import os\n",`
Doc refactor (#6300) Co-authored-by: jacoblee93 <jacoblee93@gmail.com> Co-authored-by: Harrison Chase <hw.chase.17@gmail.com> 2023-06-16 18:52:56 +00:00			`"\n",`
Harrison/apify (#2215) Co-authored-by: Jiří Moravčík <jiri.moravcik@gmail.com> 2023-03-31 03:58:14 +00:00			`"os.environ[\"OPENAI_API_KEY\"] = \"Your OpenAI API key\"\n",`
			`"os.environ[\"APIFY_API_TOKEN\"] = \"Your Apify API token\"\n",`
			`"\n",`
			`"apify = ApifyWrapper()"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Then run the Actor, wait for it to finish, and fetch its results from the Apify dataset into a LangChain document loader.\n",`
			`"\n",`
Doc refactor (#6300) Co-authored-by: jacoblee93 <jacoblee93@gmail.com> Co-authored-by: Harrison Chase <hw.chase.17@gmail.com> 2023-06-16 18:52:56 +00:00			"Note that if you already have some results in an Apify dataset, you can load them directly using `ApifyDatasetLoader`, as shown in [this notebook](../../../data_connection/document_loaders/examples/apify_dataset.html). In that notebook, you'll also find the explanation of the `dataset_mapping_function`, which is used to map fields from the Apify dataset records to LangChain `Document` fields."
Harrison/apify (#2215) Co-authored-by: Jiří Moravčík <jiri.moravcik@gmail.com> 2023-03-31 03:58:14 +00:00			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 3,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"loader = apify.call_actor(\n",`
			`" actor_id=\"apify/website-content-crawler\",\n",`
			`" run_input={\"startUrls\": [{\"url\": \"https://python.langchain.com/en/latest/\"}]},\n",`
			`" dataset_mapping_function=lambda item: Document(\n",`
			`" page_content=item[\"text\"] or \"\", metadata={\"source\": item[\"url\"]}\n",`
			`" ),\n",`
			`")"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Initialize the vector index from the crawled documents:"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"index = VectorstoreIndexCreator().from_loaders([loader])"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"And finally, query the vector index:"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 5,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"query = \"What is LangChain?\"\n",`
			`"result = index.query_with_sources(query)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 6,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`" LangChain is a standard interface through which you can interact with a variety of large language models (LLMs). It provides modules that can be used to build language model applications, and it also provides chains and agents with memory capabilities.\n",`
			`"\n",`
			`"https://python.langchain.com/en/latest/modules/models/llms.html, https://python.langchain.com/en/latest/getting_started/getting_started.html\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"print(result[\"answer\"])\n",`
			`"print(result[\"sources\"])"`
			`]`
			`}`
			`],`
			`"metadata": {`
			`"kernelspec": {`
			`"display_name": "Python 3 (ipykernel)",`
			`"language": "python",`
			`"name": "python3"`
			`},`
			`"language_info": {`
			`"codemirror_mode": {`
			`"name": "ipython",`
			`"version": 3`
			`},`
			`"file_extension": ".py",`
			`"mimetype": "text/x-python",`
			`"name": "python",`
			`"nbconvert_exporter": "python",`
			`"pygments_lexer": "ipython3",`
Vwp/docs improved document loaders (#4006) Huge thanks to @leo-gan for improving the document loaders notebooks --------- Co-authored-by: Leonid Ganeline <leo.gan.57@gmail.com> 2023-05-02 22:24:53 +00:00			`"version": "3.10.6"`
Harrison/apify (#2215) Co-authored-by: Jiří Moravčík <jiri.moravcik@gmail.com> 2023-03-31 03:58:14 +00:00			`}`
			`},`
			`"nbformat": 4,`
Vwp/docs improved document loaders (#4006) Huge thanks to @leo-gan for improving the document loaders notebooks --------- Co-authored-by: Leonid Ganeline <leo.gan.57@gmail.com> 2023-05-02 22:24:53 +00:00			`"nbformat_minor": 4`
Harrison/apify (#2215) Co-authored-by: Jiří Moravčík <jiri.moravcik@gmail.com> 2023-03-31 03:58:14 +00:00			`}`