{ "cells": [ { "cell_type": "raw", "id": "e254cf03-49fc-4051-a4df-3a8e4e7d2688", "metadata": {}, "source": [ "---\n", "sidebar_position: 1\n", "title: Web scraping\n", "---" ] }, { "cell_type": "markdown", "id": "6605e7f7", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/web_scraping.ipynb)\n", "\n", "## Use case\n", "\n", "[Web research](https://blog.langchain.dev/automating-web-research/) is one of the killer LLM applications:\n", "\n", "* Users have [highlighted it](https://twitter.com/GregKamradt/status/1679913813297225729?s=20) as one of their top desired AI tools. \n", "* OSS repos like [gpt-researcher](https://github.com/assafelovic/gpt-researcher) are growing in popularity. \n", " \n", "![Image description](/img/web_scraping.png)\n", " \n", "## Overview\n", "\n", "Gathering content from the web has a few components:\n", "\n", "* `Search`: Query to URL (e.g., using `GoogleSearchAPIWrapper`).\n", "* `Loading`: URL to HTML (e.g., using `AsyncHtmlLoader`, `AsyncChromiumLoader`, etc.).\n", "* `Transforming`: HTML to formatted text (e.g., using `HTML2Text` or `Beautiful Soup`).\n", "\n", "## Quickstart" ] }, { "cell_type": "code", "execution_count": null, "id": "1803c182", "metadata": {}, "outputs": [], "source": [ "%pip install -q openai langchain playwright beautifulsoup4\n", "!playwright install\n", "\n", "# Set env var OPENAI_API_KEY or load from a .env file:\n", "# import dotenv\n", "# dotenv.load_dotenv()" ] }, { "cell_type": "markdown", "id": "50741083", "metadata": {}, "source": [ "Scrape HTML content using a headless instance of Chromium.\n", "\n", "* The async nature of the scraping process is handled using Python's `asyncio` library.\n", "* The actual interaction with the web pages is handled by Playwright." 
] }, { "cell_type": "code", "execution_count": 2, "id": "cd457cb1", "metadata": {}, "outputs": [], "source": [ "from langchain.document_loaders import AsyncChromiumLoader\n", "from langchain.document_transformers import BeautifulSoupTransformer\n", "\n", "# Load HTML\n", "loader = AsyncChromiumLoader([\"https://www.wsj.com\"])\n", "html = loader.load()" ] }, { "cell_type": "markdown", "id": "2a879806", "metadata": {}, "source": [ "Scrape text content from tags such as `<p>`, `<li>`, `<div>`, and `<a>` in the HTML content:\n", "\n", "* `<p>`: The paragraph tag. It defines a paragraph in HTML and is used to group together related sentences and/or phrases.\n", " \n", "* `<li>`: The list item tag. It is used within ordered (`<ol>`) and unordered (`<ul>