"[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/web_scraping.ipynb)\n",
"\n",
"## Use case\n",
"\n",
"[Web research](https://blog.langchain.dev/automating-web-research/) is one of the killer LLM applications:\n",
"\n",
"* Users have [highlighted it](https://twitter.com/GregKamradt/status/1679913813297225729?s=20) as one of his top desired AI tools. \n",
"* OSS repos like [gpt-researcher](https://github.com/assafelovic/gpt-researcher) are growing in popularity. \n",
" \n",
"![Image description](/img/web_scraping.png)\n",
" \n",
"## Overview\n",
"\n",
"Gathering content from the web has a few components:\n",
"\n",
"* `Search`: Query to url (e.g., using `GoogleSearchAPIWrapper`).\n",
"* `Loading`: Url to HTML (e.g., using `AsyncHtmlLoader`, `AsyncChromiumLoader`, etc).\n",
"* `Transforming`: HTML to formatted text (e.g., using `HTML2Text` or `Beautiful Soup`).\n",
"'English EditionEnglish中文 (Chinese)日本語 (Japanese) More Other Products from WSJBuy Side from WSJWSJ ShopWSJ Wine Other Products from WSJ Search Quotes and Companies Search Quotes and Companies 0.15% 0.03% 0.12% -0.42% 4.102% -0.69% -0.25% -0.15% -1.82% 0.24% 0.19% -1.10% About Evan His Family Reflects His Reporting How You Can Help Write a Message Life in Detention Latest News Get Email Updates Four Americans Released From Iranian Prison The Americans will remain under house arrest until they are '"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Result\n",
"docs_transformed[0].page_content[0:500]"
]
},
{
"cell_type": "markdown",
"id": "7d26d185",
"metadata": {},
"source": [
"These `Documents` now are staged for downstream usage in various LLM apps, as discussed below.\n",
"\n",
"## Loader\n",
"\n",
"### AsyncHtmlLoader\n",
"\n",
"The [AsyncHtmlLoader](docs/integrations/document_loaders/async_html) uses the `aiohttp` library to make asynchronous HTTP requests, suitable for simpler and lightweight scraping.\n",
"\n",
"### AsyncChromiumLoader\n",
"\n",
"The [AsyncChromiumLoader](docs/integrations/document_loaders/async_chromium) uses Playwright to launch a Chromium instance, which can handle JavaScript rendering and more complex web interactions.\n",
"\n",
"Chromium is one of the browsers supported by Playwright, a library used to control browser automation. \n",
"[HTML2Text](docs/integrations/document_transformers/html2text) provides a straightforward conversion of HTML content into plain text (with markdown-like formatting) without any specific tag manipulation. \n",
"\n",
"It's best suited for scenarios where the goal is to extract human-readable text without needing to manipulate specific HTML elements.\n",
"\n",
"### Beautiful Soup\n",
" \n",
"Beautiful Soup offers more fine-grained control over HTML content, enabling specific tag extraction, removal, and content cleaning. \n",
"\n",
"It's suited for cases where you want to extract specific information and clean up the HTML content according to your needs."
"Web scraping is challenging for many reasons. \n",
"\n",
"One of them is the changing nature of modern websites' layouts and content, which requires modifying scraping scripts to accommodate the changes.\n",
"\n",
"Using Function (e.g., OpenAI) with an extraction chain, we avoid having to change your code constantly when websites change. \n",
"\n",
"We're using `gpt-3.5-turbo-0613` to guarantee access to OpenAI Functions feature (although this might be available to everyone by time of writing). \n",
"\n",
"We're also keeping `temperature` at `0` to keep randomness of the LLM down."
"We can compare the headlines scraped to the page:\n",
"\n",
"![Image description](/img/wsj_page.png)\n",
"\n",
"Looking at the [LangSmith trace](https://smith.langchain.com/public/c3070198-5b13-419b-87bf-3821cdf34fa6/r), we can see what is going on under the hood:\n",
"\n",
"* It's following what is explained in the [extraction](docs/use_cases/extraction).\n",
"* We call the `information_extraction` function on the input text.\n",
"* It will attempt to populate the provided schema from the url content."
]
},
{
"cell_type": "markdown",
"id": "a5a6f11e",
"metadata": {},
"source": [
"## Research automation\n",
"\n",
"Related to scraping, we may want to answer specific questions using searched content.\n",
"We can automate the process of [web research](https://blog.langchain.dev/automating-web-research/) using a retriever, such as the `WebResearchRetriever` ([docs](https://python.langchain.com/docs/modules/data_connection/retrievers/web_research)).\n",
"INFO:langchain.retrievers.web_research:Generating questions for Google Search ...\n",
"INFO:langchain.retrievers.web_research:Questions for Google Search (raw): {'question': 'How do LLM Powered Autonomous Agents work?', 'text': LineList(lines=['1. What is the functioning principle of LLM Powered Autonomous Agents?\\n', '2. How do LLM Powered Autonomous Agents operate?\\n'])}\n",
"INFO:langchain.retrievers.web_research:Questions for Google Search: ['1. What is the functioning principle of LLM Powered Autonomous Agents?\\n', '2. How do LLM Powered Autonomous Agents operate?\\n']\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Search results: [{'title': 'LLM Powered Autonomous Agents | Hacker News', 'link': 'https://news.ycombinator.com/item?id=36488871', 'snippet': 'Jun 26, 2023 ... Exactly. A temperature of 0 means you always pick the highest probability token (i.e. the \"max\" function), while a temperature of 1 means you\\xa0...'}]\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Search results: [{'title': \"LLM Powered Autonomous Agents | Lil'Log\", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... Task decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\\\\n1.\" , \"What are the subgoals for achieving XYZ?\" , (2) by\\xa0...'}]\n",
"INFO:langchain.retrievers.web_research:New URLs to load: []\n",
"INFO:langchain.retrievers.web_research:Grabbing most relevant splits from urls...\n"
]
},
{
"data": {
"text/plain": [
"{'question': 'How do LLM Powered Autonomous Agents work?',\n",
" 'answer': \"LLM-powered autonomous agents work by using LLM as the agent's brain, complemented by several key components such as planning, memory, and tool use. In terms of planning, the agent breaks down large tasks into smaller subgoals and can reflect and refine its actions based on past experiences. Memory is divided into short-term memory, which is used for in-context learning, and long-term memory, which allows the agent to retain and recall information over extended periods. Tool use involves the agent calling external APIs for additional information. These agents have been used in various applications, including scientific discovery and generative agents simulation.\",\n",
"To answer questions over a specific website, you can use Apify's [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor, which can deeply crawl websites such as documentation, knowledge bases, help centers, or blogs,\n",
"and extract text content from the web pages.\n",
"\n",
"In the example below, we will deeply crawl the Python documentation of LangChain's Chat LLM models and answer a question over it.\n",