langchain/docs/extras/integrations/document_loaders/sitemap.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sitemap\n",
    "\n",
    "Extends from the `WebBaseLoader`, `SitemapLoader` loads a sitemap from a given URL, and then scrape and load all pages in the sitemap, returning each page as a Document.\n",
    "\n",
    "The scraping is done concurrently.  There are reasonable limits to concurrent requests, defaulting to 2 per second.  If you aren't concerned about being a good citizen, or you control the scrapped server, or don't care about load. Note, while this will speed up the scraping process, but it may cause the server to block you.  Be careful!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: nest_asyncio in /Users/tasp/Code/projects/langchain/.venv/lib/python3.10/site-packages (1.5.6)\n",
      "\n",
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip available: \u001b[0m\u001b[31;49m22.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.0.1\u001b[0m\n",
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "!pip install nest_asyncio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# fixes a bug with asyncio and jupyter\n",
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders.sitemap import SitemapLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "sitemap_loader = SitemapLoader(web_path=\"https://langchain.readthedocs.io/sitemap.xml\")\n",
    "\n",
    "docs = sitemap_loader.load()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can change the `requests_per_second` parameter to increase the max concurrent requests. and use `requests_kwargs` to pass kwargs when send requests."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sitemap_loader.requests_per_second = 2\n",
    "# Optional: avoid `[SSL: CERTIFICATE_VERIFY_FAILED]` issue\n",
    "sitemap_loader.requests_kwargs = {\"verify\": False}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Document(page_content='\\n\\n\\n\\n\\n\\nWelcome to LangChain — 🦜🔗 LangChain 0.0.123\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nSkip to main content\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nCtrl+K\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n🦜🔗 LangChain 0.0.123\\n\\n\\n\\nGetting Started\\n\\nQuickstart Guide\\n\\nModules\\n\\nPrompt Templates\\nGetting Started\\nKey Concepts\\nHow-To Guides\\nCreate a custom prompt template\\nCreate a custom example selector\\nProvide few shot examples to a prompt\\nPrompt Serialization\\nExample Selectors\\nOutput Parsers\\n\\n\\nReference\\nPromptTemplates\\nExample Selector\\n\\n\\n\\n\\nLLMs\\nGetting Started\\nKey Concepts\\nHow-To Guides\\nGeneric Functionality\\nCustom LLM\\nFake LLM\\nLLM Caching\\nLLM Serialization\\nToken Usage Tracking\\n\\n\\nIntegrations\\nAI21\\nAleph Alpha\\nAnthropic\\nAzure OpenAI LLM Example\\nBanana\\nCerebriumAI LLM Example\\nCohere\\nDeepInfra LLM Example\\nForefrontAI LLM Example\\nGooseAI LLM Example\\nHugging Face Hub\\nManifest\\nModal\\nOpenAI\\nPetals LLM Example\\nPromptLayer OpenAI\\nSageMakerEndpoint\\nSelf-Hosted Models via Runhouse\\nStochasticAI\\nWriter\\n\\n\\nAsync API for LLM\\nStreaming with LLMs\\n\\n\\nReference\\n\\n\\nDocument Loaders\\nKey Concepts\\nHow To Guides\\nCoNLL-U\\nAirbyte JSON\\nAZLyrics\\nBlackboard\\nCollege Confidential\\nCopy Paste\\nCSV Loader\\nDirectory Loader\\nEmail\\nEverNote\\nFacebook Chat\\nFigma\\nGCS Directory\\nGCS File Storage\\nGitBook\\nGoogle Drive\\nGutenberg\\nHacker News\\nHTML\\niFixit\\nImages\\nIMSDb\\nMarkdown\\nNotebook\\nNotion\\nObsidian\\nPDF\\nPowerPoint\\nReadTheDocs Documentation\\nRoam\\ns3 Directory\\ns3 File\\nSubtitle Files\\nTelegram\\nUnstructured File Loader\\nURL\\nWeb Base\\nWord Documents\\nYouTube\\n\\n\\n\\n\\nUtils\\nKey Concepts\\nGeneric Utilities\\nBash\\nBing Search\\nGoogle Search\\nGoogle Serper API\\nIFTTT WebHooks\\nPython REPL\\nRequests\\nSearxNG Search API\\nSerpAPI\\nWolfram Alpha\\nZapier Natural Language Actions API\\n\\n\\nReference\\nPython REPL\\nSerpAPI\\nSearxNG Search\\nDocstore\\nText Splitter\\nEmbeddings\\nVectorStores\\n\\n\\n\\n\\nIndexes\\nGetting Started\\nKey Concepts\\nHow To Guides\\nEmbeddings\\nHypothetical Document Embeddings\\nText Splitter\\nVectorStores\\nAtlasDB\\nChroma\\nDeep Lake\\nElasticSearch\\nFAISS\\nMilvus\\nOpenSearch\\nPGVector\\nPinecone\\nQdrant\\nRedis\\nWeaviate\\nChatGPT Plugin Retriever\\nVectorStore Retriever\\nAnalyze Document\\nChat Index\\nGraph QA\\nQuestion Answering with Sources\\nQuestion Answering\\nSummarization\\nRetrieval Question/Answering\\nRetrieval Question Answering with Sources\\nVector DB Text Generation\\n\\n\\n\\n\\nChains\\nGetting Started\\nHow-To Guides\\nGeneric Chains\\nLoading from LangChainHub\\nLLM Chain\\nSequential Chains\\nSerialization\\nTransformation Chain\\n\\n\\nUtility Chains\\nAPI Chains\\nSelf-Critique Chain with Constitutional AI\\nBashChain\\nLLMCheckerChain\\nLLM Math\\nLLMRequestsChain\\nLLMSummarizationCheckerChain\\nModeration\\nPAL\\nSQLite example\\n\\n\\nAsync API for Chain\\n\\n\\nKey Concepts\\nReference\\n\\n\\nAgents\\nGetting Started\\nKey Concepts\\nHow-To Guides\\nAgents and Vectorstores\\nAsync API for Agent\\nConversation Agent (for Chat Models)\\nChatGPT Plugins\\nCustom Agent\\nDefining Custom Tools\\nHuman as a tool\\nIntermediate Steps\\nLoading from LangChainHub\\nMax Iterations\\nMulti Input Tools\\nSearch Tools\\nSerialization\\nAdding SharedMemory to an Agent and its Tools\\nCSV Agent\\nJSON Agent\\nOpenAPI Agent\\nPandas Dataframe Agent\\nPython Agent\\nSQL Database Agent\\nVectorstore Agent\\nMRKL\\nMRKL Chat\\nReAct\\nSelf Ask With Search\\n\\n\\nReference\\n\\n\\nMemory\\nGetting Started\\nKey Concepts\\nHow-To Guides\\nConversationBufferMemory\\nConversationBufferWindowMemory\\nEntity Memory\\nConversation Knowledge Graph Memory\\nConversationSummaryMemory\\nConversationSummaryBufferMemory\\nConversationTokenBufferMemory\\nAdding Memory To an L
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filtering sitemap URLs\n",
    "\n",
    "Sitemaps can be massive files, with thousands of URLs.  Often you don't need every single one of them.  You can filter the URLs by passing a list of strings or regex patterns to the `url_filter` parameter.  Only URLs that match one of the patterns will be loaded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = SitemapLoader(\n",
    "    \"https://langchain.readthedocs.io/sitemap.xml\",\n",
    "    filter_urls=[\"https://python.langchain.com/en/latest/\"],\n",
    ")\n",
    "documents = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Document(page_content='\\n\\n\\n\\n\\n\\nWelcome to LangChain — 🦜🔗 LangChain 0.0.123\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nSkip to main content\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nCtrl+K\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n🦜🔗 LangChain 0.0.123\\n\\n\\n\\nGetting Started\\n\\nQuickstart Guide\\n\\nModules\\n\\nModels\\nLLMs\\nGetting Started\\nGeneric Functionality\\nHow to use the async API for LLMs\\nHow to write a custom LLM wrapper\\nHow (and why) to use the fake LLM\\nHow to cache LLM calls\\nHow to serialize LLM classes\\nHow to stream LLM responses\\nHow to track token usage\\n\\n\\nIntegrations\\nAI21\\nAleph Alpha\\nAnthropic\\nAzure OpenAI LLM Example\\nBanana\\nCerebriumAI LLM Example\\nCohere\\nDeepInfra LLM Example\\nForefrontAI LLM Example\\nGooseAI LLM Example\\nHugging Face Hub\\nManifest\\nModal\\nOpenAI\\nPetals LLM Example\\nPromptLayer OpenAI\\nSageMakerEndpoint\\nSelf-Hosted Models via Runhouse\\nStochasticAI\\nWriter\\n\\n\\nReference\\n\\n\\nChat Models\\nGetting Started\\nHow-To Guides\\nHow to use few shot examples\\nHow to stream responses\\n\\n\\nIntegrations\\nAzure\\nOpenAI\\nPromptLayer ChatOpenAI\\n\\n\\n\\n\\nText Embedding Models\\nAzureOpenAI\\nCohere\\nFake Embeddings\\nHugging Face Hub\\nInstructEmbeddings\\nOpenAI\\nSageMaker Endpoint Embeddings\\nSelf Hosted Embeddings\\nTensorflowHub\\n\\n\\n\\n\\nPrompts\\nPrompt Templates\\nGetting Started\\nHow-To Guides\\nHow to create a custom prompt template\\nHow to create a prompt template that uses few shot examples\\nHow to work with partial Prompt Templates\\nHow to serialize prompts\\n\\n\\nReference\\nPromptTemplates\\nExample Selector\\n\\n\\n\\n\\nChat Prompt Template\\nExample Selectors\\nHow to create a custom example selector\\nLengthBased ExampleSelector\\nMaximal Marginal Relevance ExampleSelector\\nNGram Overlap ExampleSelector\\nSimilarity ExampleSelector\\n\\n\\nOutput Parsers\\nOutput Parsers\\nCommaSeparatedListOutputParser\\nOutputFixingParser\\nPydanticOutputParser\\nRetryOutputParser\\nStructured Output Parser\\n\\n\\n\\n\\nIndexes\\nGetting Started\\nDocument Loaders\\nCoNLL-U\\nAirbyte JSON\\nAZLyrics\\nBlackboard\\nCollege Confidential\\nCopy Paste\\nCSV Loader\\nDirectory Loader\\nEmail\\nEverNote\\nFacebook Chat\\nFigma\\nGCS Directory\\nGCS File Storage\\nGitBook\\nGoogle Drive\\nGutenberg\\nHacker News\\nHTML\\niFixit\\nImages\\nIMSDb\\nMarkdown\\nNotebook\\nNotion\\nObsidian\\nPDF\\nPowerPoint\\nReadTheDocs Documentation\\nRoam\\ns3 Directory\\ns3 File\\nSubtitle Files\\nTelegram\\nUnstructured File Loader\\nURL\\nWeb Base\\nWord Documents\\nYouTube\\n\\n\\nText Splitters\\nGetting Started\\nCharacter Text Splitter\\nHuggingFace Length Function\\nLatex Text Splitter\\nMarkdown Text Splitter\\nNLTK Text Splitter\\nPython Code Text Splitter\\nRecursiveCharacterTextSplitter\\nSpacy Text Splitter\\ntiktoken (OpenAI) Length Function\\nTiktokenText Splitter\\n\\n\\nVectorstores\\nGetting Started\\nAtlasDB\\nChroma\\nDeep Lake\\nElasticSearch\\nFAISS\\nMilvus\\nOpenSearch\\nPGVector\\nPinecone\\nQdrant\\nRedis\\nWeaviate\\n\\n\\nRetrievers\\nChatGPT Plugin Retriever\\nVectorStore Retriever\\n\\n\\n\\n\\nMemory\\nGetting Started\\nHow-To Guides\\nConversationBufferMemory\\nConversationBufferWindowMemory\\nEntity Memory\\nConversation Knowledge Graph Memory\\nConversationSummaryMemory\\nConversationSummaryBufferMemory\\nConversationTokenBufferMemory\\nHow to add Memory to an LLMChain\\nHow to add memory to a Multi-Input Chain\\nHow to add Memory to an Agent\\nHow to customize conversational memory\\nHow to create a custom Memory class\\nHow to use multiple memroy classes in the same chain\\n\\n\\n\\n\\nChains\\nGetting Started\\nHow-To Guides\\nAsync API for Chain\\nLoading from LangChainHub\\nLLM Chain\\nSequential Chains\\nSerialization\\nTransformation Chain\\nAnalyze Document\\nChat Index\\nGraph QA\\nHypothetical Document Embeddings\\nQuestion Answering with Sources\\nQuestion Answering\\nSummariza
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "documents[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Add custom scraping rules\n",
    "\n",
    "The `SitemapLoader` uses `beautifulsoup4` for the scraping process, and it scrapes every element on the page by default. The `SitemapLoader` constructor accepts a custom scraping function. This feature can be helpful to tailor the scraping process to your specific needs; for example, you might want to avoid scraping headers or navigation elements.\n",
    "\n",
    " The following example shows how to develop and use a custom function to avoid navigation and header elements."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import the `beautifulsoup4` library and define the custom function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pip install beautifulsoup4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "\n",
    "def remove_nav_and_header_elements(content: BeautifulSoup) -> str:\n",
    "    # Find all 'nav' and 'header' elements in the BeautifulSoup object\n",
    "    nav_elements = content.find_all(\"nav\")\n",
    "    header_elements = content.find_all(\"header\")\n",
    "\n",
    "    # Remove each 'nav' and 'header' element from the BeautifulSoup object\n",
    "    for element in nav_elements + header_elements:\n",
    "        element.decompose()\n",
    "\n",
    "    return str(content.get_text())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Add your custom function to the `SitemapLoader` object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = SitemapLoader(\n",
    "    \"https://langchain.readthedocs.io/sitemap.xml\",\n",
    "    filter_urls=[\"https://python.langchain.com/en/latest/\"],\n",
    "    parsing_function=remove_nav_and_header_elements,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Local Sitemap\n",
    "\n",
    "The sitemap loader can also be used to load local files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Fetching pages: 100%|####################################################################################################################################| 3/3 [00:00<00:00,  3.91it/s]\n"
     ]
    }
   ],
   "source": [
    "sitemap_loader = SitemapLoader(web_path=\"example_data/sitemap.xml\", is_local=True)\n",
    "\n",
    "docs = sitemap_loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}