Added new use case docs for Web Scraping, Chromium loader, BS4 transformer (#8732)

- Description: Added a new use case category called "Web Scraping", and
a tutorial to scrape websites using OpenAI Functions Extraction chain to
the docs.
  - Tag maintainer:@baskaryan @hwchase17 ,
- Twitter handle: https://www.linkedin.com/in/haiphunghiem/ (I'm on
LinkedIn mostly)

---------

Co-authored-by: Lance Martin <lance@langchain.dev>
harrison/clean-up-imports
Hai The Dude 1 year ago committed by GitHub
parent 6cb763507c
commit e4418d1b7e

@ -0,0 +1,9 @@
---
sidebar_position: 3
---
# Web Scraping
Web scraping has historically been challenging because website structures change constantly, which makes scraping scripts tedious to maintain. Traditional methods often rely on specific HTML tags and patterns, so data extraction breaks when those are altered.
Enter the LLM-based method for parsing HTML. By leveraging LLMs, and especially OpenAI Functions in LangChain's extraction chain, developers can instruct the model to extract only the desired data in a specified format. This streamlines extraction and reduces the time spent on manual debugging and script modifications. Because the model works from a description of the desired data rather than from fixed tags, extraction can remain consistent even when a website's design changes significantly. That resilience translates into less maintenance, lower costs, and higher-quality extracted data. Compared to its predecessors, the LLM-based approach turns a historically cumbersome task into a more automated and efficient process.
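
To make "a specified format" concrete, here is a minimal sketch of the kind of schema the tutorial passes to the extraction chain, together with a sanity check on the returned records. The `filter_valid_records` helper and the sample records are hypothetical and only for illustration; they are not part of LangChain, and no LLM is called here.

```python
from typing import Any, Dict, List

# Schema in the shape the tutorial passes to the extraction chain.
schema: Dict[str, Any] = {
    "properties": {
        "news_article_title": {"type": "string"},
        "news_article_summary": {"type": "string"},
    },
    "required": ["news_article_title", "news_article_summary"],
}


def filter_valid_records(
    records: List[Dict[str, Any]], schema: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Hypothetical helper: keep records with every required, non-empty string field."""
    required = schema["required"]
    return [
        record
        for record in records
        if all(
            isinstance(record.get(key), str) and bool(record[key].strip())
            for key in required
        )
    ]


# Stand-in for LLM output (assumed shape: a list of dicts keyed by schema properties).
sample_output = [
    {"news_article_title": "Example headline", "news_article_summary": "Example summary."},
    {"news_article_title": "Headline without a summary"},
]

valid_records = filter_valid_records(sample_output, schema)
print(valid_records)
```

A check like this is optional, but it gives a cheap way to notice when the model returns incomplete records after a site changes.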

Binary file not shown. Size: 152 KiB

Binary file not shown. Size: 172 KiB

Binary file not shown. Size: 716 KiB

@ -0,0 +1,101 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "ad553e51",
"metadata": {},
"source": [
"# Async Chromium\n",
"\n",
"Chromium is one of the browsers supported by Playwright, a library used to control browser automation. \n",
"\n",
"By running `p.chromium.launch(headless=True)`, we are launching a headless instance of Chromium. \n",
"\n",
"Headless mode means that the browser is running without a graphical user interface.\n",
"\n",
"`AsyncChromiumLoader` load the page, and then we use `Html2TextTransformer` to trasnform to text."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1c3a4c19",
"metadata": {},
"outputs": [],
"source": [
"! pip install -q playwright beautifulsoup4\n",
"! playwright install"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "dd2cdea7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'<!DOCTYPE html><html lang=\"en\"><head><script src=\"https://s0.2mdn.net/instream/video/client.js\" asyn'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.document_loaders import AsyncChromiumLoader\n",
"urls = [\"https://www.wsj.com\"]\n",
"loader = AsyncChromiumLoader(urls)\n",
"docs = loader.load()\n",
"docs[0].page_content[0:100]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "013caa7e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"Skip to Main ContentSkip to SearchSkip to... Select * Top News * What's News *\\nFeatured Stories * Retirement * Life & Arts * Hip-Hop * Sports * Video *\\nEconomy * Real Estate * Sports * CMO * CIO * CFO * Risk & Compliance *\\nLogistics Report * Sustainable Business * Heard on the Street * Barrons *\\nMarketWatch * Mansion Global * Penta * Opinion * Journal Reports * Sponsored\\nOffers Explore Our Brands * WSJ * * * * * Barron's * * * * * MarketWatch * * *\\n* * IBD # The Wall Street Journal SubscribeSig\""
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.document_transformers import Html2TextTransformer\n",
"html2text = Html2TextTransformer()\n",
"docs_transformed = html2text.transform_documents(docs)\n",
"docs_transformed[0].page_content[0:500]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,95 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "2ed9a4c2",
"metadata": {},
"source": [
"# Beautiful Soup\n",
"\n",
"Beautiful Soup offers fine-grained control over HTML content, enabling specific tag extraction, removal, and content cleaning. \n",
"\n",
"It's suited for cases where you want to extract specific information and clean up the HTML content according to your needs.\n",
"\n",
"For example, we can scrape text content within `<p>, <li>, <div>, and <a>` tags from the HTML content:\n",
"\n",
"* `<p>`: The paragraph tag. It defines a paragraph in HTML and is used to group together related sentences and/or phrases.\n",
" \n",
"* `<li>`: The list item tag. It is used within ordered (`<ol>`) and unordered (`<ul>`) lists to define individual items within the list.\n",
" \n",
"* `<div>`: The division tag. It is a block-level element used to group other inline or block-level elements.\n",
" \n",
"* `<a>`: The anchor tag. It is used to define hyperlinks."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "dd710e5b",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import AsyncChromiumLoader\n",
"from langchain.document_transformers import BeautifulSoupTransformer\n",
"\n",
"# Load HTML\n",
"loader = AsyncChromiumLoader([\"https://www.wsj.com\"])\n",
"html = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "052b64dd",
"metadata": {},
"outputs": [],
"source": [
"# Transform\n",
"bs_transformer = BeautifulSoupTransformer()\n",
"docs_transformed = bs_transformer.transform_documents(html,tags_to_extract=[\"p\", \"li\", \"div\", \"a\"])"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "b53a5307",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Conservative legal activists are challenging Amazon, Comcast and others using many of the same tools that helped kill affirmative-action programs in colleges.1,2099 min read U.S. stock indexes fell and government-bond prices climbed, after Moodys lowered credit ratings for 10 smaller U.S. banks and said it was reviewing ratings for six larger ones. The Dow industrials dropped more than 150 points.3 min read Penn Entertainments Barstool Sportsbook app will be rebranded as ESPN Bet this fall as '"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs_transformed[0].page_content[0:500]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1,599 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "6605e7f7",
"metadata": {},
"source": [
"# Web scraping\n",
"\n",
"[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/web_scraping.ipynb)\n",
"\n",
"## Use case\n",
"\n",
"[Web research](https://blog.langchain.dev/automating-web-research/) is one of the killer LLM applications:\n",
"\n",
"* Users have [highlighted it](https://twitter.com/GregKamradt/status/1679913813297225729?s=20) as one of his top desired AI tools. \n",
"* OSS repos like [gpt-researcher](https://github.com/assafelovic/gpt-researcher) are growing in popularity. \n",
" \n",
"![Image description](/img/web_scraping.png)\n",
" \n",
"## Overview\n",
"\n",
"Gathering content from the web has a few components:\n",
"\n",
"* `Search`: Query to url (e.g., using `GoogleSearchAPIWrapper`).\n",
"* `Loading`: Url to HTML (e.g., using `AsyncHtmlLoader`, `AsyncChromiumLoader`, etc).\n",
"* `Transforming`: HTML to formatted text (e.g., using `HTML2Text` or `Beautiful Soup`).\n",
"\n",
"## Quickstart"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1803c182",
"metadata": {},
"outputs": [],
"source": [
"pip install -q openai langchain playwright beautifulsoup4\n",
"playwright install\n",
"\n",
"# Set env var OPENAI_API_KEY or load from a .env file:\n",
"# import dotenv\n",
"# dotenv.load_env()"
]
},
{
"cell_type": "markdown",
"id": "50741083",
"metadata": {},
"source": [
"Scraping HTML content using a headless instance of Chromium.\n",
"\n",
"* The async nature of the scraping process is handled using Python's asyncio library.\n",
"* The actual interaction with the web pages is handled by Playwright."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "cd457cb1",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import AsyncChromiumLoader\n",
"from langchain.document_transformers import BeautifulSoupTransformer\n",
"\n",
"# Load HTML\n",
"loader = AsyncChromiumLoader([\"https://www.wsj.com\"])\n",
"html = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "2a879806",
"metadata": {},
"source": [
"Scrape text content tags such as `<p>, <li>, <div>, and <a>` tags from the HTML content:\n",
"\n",
"* `<p>`: The paragraph tag. It defines a paragraph in HTML and is used to group together related sentences and/or phrases.\n",
" \n",
"* `<li>`: The list item tag. It is used within ordered (`<ol>`) and unordered (`<ul>`) lists to define individual items within the list.\n",
" \n",
"* `<div>`: The division tag. It is a block-level element used to group other inline or block-level elements.\n",
" \n",
"* `<a>`: The anchor tag. It is used to define hyperlinks.\n",
"\n",
"* `<span>`: an inline container used to mark up a part of a text, or a part of a document. \n",
"\n",
"For many news websites (e.g., WSJ, CNN), headlines and summaries are all in `<span>` tags."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "141f206b",
"metadata": {},
"outputs": [],
"source": [
"# Transform\n",
"bs_transformer = BeautifulSoupTransformer()\n",
"docs_transformed = bs_transformer.transform_documents(html,tags_to_extract=[\"span\"])"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "73ddb234",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'English EditionEnglish中文 (Chinese)日本語 (Japanese) More Other Products from WSJBuy Side from WSJWSJ ShopWSJ Wine Other Products from WSJ Search Quotes and Companies Search Quotes and Companies 0.15% 0.03% 0.12% -0.42% 4.102% -0.69% -0.25% -0.15% -1.82% 0.24% 0.19% -1.10% About Evan His Family Reflects His Reporting How You Can Help Write a Message Life in Detention Latest News Get Email Updates Four Americans Released From Iranian Prison The Americans will remain under house arrest until they are '"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Result\n",
"docs_transformed[0].page_content[0:500]"
]
},
{
"cell_type": "markdown",
"id": "7d26d185",
"metadata": {},
"source": [
"These `Documents` now are staged for downstream usage in various LLM apps, as discussed below.\n",
"\n",
"## Loader\n",
"\n",
"### AsyncHtmlLoader\n",
"\n",
"The [AsyncHtmlLoader](docs/integrations/document_loaders/async_html) uses the `aiohttp` library to make asynchronous HTTP requests, suitable for simpler and lightweight scraping.\n",
"\n",
"### AsyncChromiumLoader\n",
"\n",
"The [AsyncChromiumLoader](docs/integrations/document_loaders/async_chromium) uses Playwright to launch a Chromium instance, which can handle JavaScript rendering and more complex web interactions.\n",
"\n",
"Chromium is one of the browsers supported by Playwright, a library used to control browser automation. \n",
"\n",
"Headless mode means that the browser is running without a graphical user interface, which is commonly used for web scrapin."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8941e855",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import AsyncHtmlLoader\n",
"urls = [\"https://www.espn.com\",\"https://lilianweng.github.io/posts/2023-06-23-agent/\"]\n",
"loader = AsyncHtmlLoader(urls)\n",
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "e47f4bf0",
"metadata": {},
"source": [
"## Transformer\n",
"\n",
"### HTML2Text\n",
"\n",
"[HTML2Text](docs/integrations/document_transformers/html2text) provides a straightforward conversion of HTML content into plain text (with markdown-like formatting) without any specific tag manipulation. \n",
"\n",
"It's best suited for scenarios where the goal is to extract human-readable text without needing to manipulate specific HTML elements.\n",
"\n",
"### Beautiful Soup\n",
" \n",
"Beautiful Soup offers more fine-grained control over HTML content, enabling specific tag extraction, removal, and content cleaning. \n",
"\n",
"It's suited for cases where you want to extract specific information and clean up the HTML content according to your needs."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "99a7e2a8",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Fetching pages: 100%|#############################################################################################################| 2/2 [00:00<00:00, 7.01it/s]\n"
]
}
],
"source": [
"from langchain.document_loaders import AsyncHtmlLoader\n",
"urls = [\"https://www.espn.com\", \"https://lilianweng.github.io/posts/2023-06-23-agent/\"]\n",
"loader = AsyncHtmlLoader(urls)\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "a2cd3e8d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"Skip to main content Skip to navigation\\n\\n<\\n\\n>\\n\\nMenu\\n\\n## ESPN\\n\\n * Search\\n\\n * * scores\\n\\n * NFL\\n * MLB\\n * NBA\\n * NHL\\n * Soccer\\n * NCAAF\\n * …\\n\\n * Women's World Cup\\n * LLWS\\n * NCAAM\\n * NCAAW\\n * Sports Betting\\n * Boxing\\n * CFL\\n * NCAA\\n * Cricket\\n * F1\\n * Golf\\n * Horse\\n * MMA\\n * NASCAR\\n * NBA G League\\n * Olympic Sports\\n * PLL\\n * Racing\\n * RN BB\\n * RN FB\\n * Rugby\\n * Tennis\\n * WNBA\\n * WWE\\n * X Games\\n * XFL\\n\\n * More\""
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.document_transformers import Html2TextTransformer\n",
"html2text = Html2TextTransformer()\n",
"docs_transformed = html2text.transform_documents(docs)\n",
"docs_transformed[0].page_content[0:500]"
]
},
{
"cell_type": "markdown",
"id": "8aef9861",
"metadata": {},
"source": [
"## Scraping with extraction\n",
"\n",
"### LLM with function calling\n",
"\n",
"Web scraping is challenging for many reasons. \n",
"\n",
"One of them is the changing nature of modern websites' layouts and content, which requires modifying scraping scripts to accommodate the changes.\n",
"\n",
"Using Function (e.g., OpenAI) with an extraction chain, we avoid having to change your code constantly when websites change. \n",
"\n",
"We're using `gpt-3.5-turbo-0613` to guarantee access to OpenAI Functions feature (although this might be available to everyone by time of writing). \n",
"\n",
"We're also keeping `temperature` at `0` to keep randomness of the LLM down."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "52d49f6f",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"llm = ChatOpenAI(temperature=0, model=\"gpt-3.5-turbo-0613\")"
]
},
{
"cell_type": "markdown",
"id": "fc5757ce",
"metadata": {},
"source": [
"### Define a schema\n",
"\n",
"Next, you define a schema to specify what kind of data you want to extract. \n",
"\n",
"Here, the key names matter as they tell the LLM what kind of information they want. \n",
"\n",
"So, be as detailed as possible. \n",
"\n",
"In this example, we want to scrape only news article's name and summary from The Wall Street Journal website."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "95506f8e",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains import create_extraction_chain\n",
"\n",
"schema = {\n",
" \"properties\": {\n",
" \"news_article_title\": {\"type\": \"string\"},\n",
" \"news_article_summary\": {\"type\": \"string\"},\n",
" },\n",
" \"required\": [\"news_article_title\", \"news_article_summary\"],\n",
"}\n",
"\n",
"def extract(content: str, schema: dict):\n",
" return create_extraction_chain(schema=schema, llm=llm).run(content)"
]
},
{
"cell_type": "markdown",
"id": "97f7de42",
"metadata": {},
"source": [
"### Run the web scraper w/ BeautifulSoup\n",
"\n",
"As shown above, we'll using `BeautifulSoupTransformer`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "977560ba",
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Extracting content with LLM\n",
"[{'news_article_summary': 'The Americans will remain under house arrest until '\n",
" 'they are allowed to return to the U.S. in coming '\n",
" 'weeks, following a monthslong diplomatic push by '\n",
" 'the Biden administration.',\n",
" 'news_article_title': 'Four Americans Released From Iranian Prison'},\n",
" {'news_article_summary': 'Price pressures continued cooling last month, with '\n",
" 'the CPI rising a mild 0.2% from June, likely '\n",
" 'deterring the Federal Reserve from raising interest '\n",
" 'rates at its September meeting.',\n",
" 'news_article_title': 'Cooler July Inflation Opens Door to Fed Pause on '\n",
" 'Rates'},\n",
" {'news_article_summary': 'The company has decided to eliminate 27 of its 30 '\n",
" 'clothing labels, such as Lark & Ro and Goodthreads, '\n",
" 'as it works to fend off antitrust scrutiny and cut '\n",
" 'costs.',\n",
" 'news_article_title': 'Amazon Cuts Dozens of House Brands'},\n",
" {'news_article_summary': 'President Bidens order comes on top of a slowing '\n",
" 'Chinese economy, Covid lockdowns and rising '\n",
" 'tensions between the two powers.',\n",
" 'news_article_title': 'U.S. Investment Ban on China Poised to Deepen Divide'},\n",
" {'news_article_summary': 'The proposed trial date in the '\n",
" 'election-interference case comes on the same day as '\n",
" 'the former presidents not guilty plea on '\n",
" 'additional Mar-a-Lago charges.',\n",
" 'news_article_title': 'Trump Should Be Tried in January, Prosecutors Tell '\n",
" 'Judge'},\n",
" {'news_article_summary': 'The CEO who started in June says the platform has '\n",
" '“an entirely different road map” for the future.',\n",
" 'news_article_title': 'Yaccarino Says X Is Watching Threads but Has Its Own '\n",
" 'Vision'},\n",
" {'news_article_summary': 'Students foot the bill for flagship state '\n",
" 'universities that pour money into new buildings and '\n",
" 'programs with little pushback.',\n",
" 'news_article_title': 'Colleges Spend Like Theres No Tomorrow. These '\n",
" 'Places Are Just Devouring Money.'},\n",
" {'news_article_summary': 'Wildfires fanned by hurricane winds have torn '\n",
" 'through parts of the Hawaiian island, devastating '\n",
" 'the popular tourist town of Lahaina.',\n",
" 'news_article_title': 'Maui Wildfires Leave at Least 36 Dead'},\n",
" {'news_article_summary': 'After its large armored push stalled, Kyiv has '\n",
" 'fallen back on the kind of tactics that brought it '\n",
" 'success earlier in the war.',\n",
" 'news_article_title': 'Ukraine Uses Small-Unit Tactics to Retake Captured '\n",
" 'Territory'},\n",
" {'news_article_summary': 'President Guillermo Lasso says the Aug. 20 election '\n",
" 'will proceed, as the Andean country grapples with '\n",
" 'rising drug gang violence.',\n",
" 'news_article_title': 'Ecuador Declares State of Emergency After '\n",
" 'Presidential Hopeful Killed'},\n",
" {'news_article_summary': 'This years hurricane season, which typically runs '\n",
" 'from June to the end of November, has been '\n",
" 'difficult to predict, climate scientists said.',\n",
" 'news_article_title': 'Atlantic Hurricane Season Prediction Increased to '\n",
" 'Above Normal, NOAA Says'},\n",
" {'news_article_summary': 'The NFL is raising the price of its NFL+ streaming '\n",
" 'packages as it adds the NFL Network and RedZone.',\n",
" 'news_article_title': 'NFL to Raise Price of NFL+ Streaming Packages as It '\n",
" 'Adds NFL Network, RedZone'},\n",
" {'news_article_summary': 'Russia is planning a moon mission as part of the '\n",
" 'new space race.',\n",
" 'news_article_title': 'Russias Moon Mission and the New Space Race'},\n",
" {'news_article_summary': 'Tapestrys $8.5 billion acquisition of Capri would '\n",
" 'create a conglomerate with more than $12 billion in '\n",
" 'annual sales, but it would still lack the '\n",
" 'high-wattage labels and diversity that have fueled '\n",
" 'LVMHs success.',\n",
" 'news_article_title': \"Why the Coach and Kors Marriage Doesn't Scare LVMH\"},\n",
" {'news_article_summary': 'The Supreme Court has blocked Purdue Pharmas $6 '\n",
" 'billion Sackler opioid settlement.',\n",
" 'news_article_title': 'Supreme Court Blocks Purdue Pharmas $6 Billion '\n",
" 'Sackler Opioid Settlement'},\n",
" {'news_article_summary': 'The Social Security COLA is expected to rise in '\n",
" '2024, but not by a lot.',\n",
" 'news_article_title': 'Social Security COLA Expected to Rise in 2024, but '\n",
" 'Not by a Lot'}]\n"
]
}
],
"source": [
"import pprint\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"\n",
"def scrape_with_playwright(urls, schema):\n",
" \n",
" loader = AsyncChromiumLoader(urls)\n",
" docs = loader.load()\n",
" bs_transformer = BeautifulSoupTransformer()\n",
" docs_transformed = bs_transformer.transform_documents(docs,tags_to_extract=[\"span\"])\n",
" print(\"Extracting content with LLM\")\n",
" \n",
" # Grab the first 1000 tokens of the site\n",
" splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=1000, \n",
" chunk_overlap=0)\n",
" splits = splitter.split_documents(docs_transformed)\n",
" \n",
" # Process the first split \n",
" extracted_content = extract(\n",
" schema=schema, content=splits[0].page_content\n",
" )\n",
" pprint.pprint(extracted_content)\n",
" return extracted_content\n",
"\n",
"urls = [\"https://www.wsj.com\"]\n",
"extracted_content = scrape_with_playwright(urls, schema=schema)"
]
},
{
"cell_type": "markdown",
"id": "b08a8cef",
"metadata": {},
"source": [
"We can compare the headlines scraped to the page:\n",
"\n",
"![Image description](/img/wsj_page.png)\n",
"\n",
"Looking at the [LangSmith trace](https://smith.langchain.com/public/c3070198-5b13-419b-87bf-3821cdf34fa6/r), we can see what is going on under the hood:\n",
"\n",
"* It's following what is explained in the [extraction](docs/use_cases/extraction).\n",
"* We call the `information_extraction` function on the input text.\n",
"* It will attempt to populate the provided schema from the url content."
]
},
{
"cell_type": "markdown",
"id": "a5a6f11e",
"metadata": {},
"source": [
"## Research automation\n",
"\n",
"Related to scraping, we may want to answer specific questions using searched content.\n",
"\n",
"We can automate the process of [web research](https://blog.langchain.dev/automating-web-research/) using a retriver, such as the `WebResearchRetriever` ([docs](https://python.langchain.com/docs/modules/data_connection/retrievers/web_research)).\n",
"\n",
"![Image description](/img/web_research.png)\n",
"\n",
"Copy requirments [from here](https://github.com/langchain-ai/web-explorer/blob/main/requirements.txt):\n",
"\n",
"`pip install -r requirements.txt`\n",
" \n",
"Set `GOOGLE_CSE_ID` and `GOOGLE_API_KEY`."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "414f0d41",
"metadata": {},
"outputs": [],
"source": [
"from langchain.vectorstores import Chroma\n",
"from langchain.embeddings import OpenAIEmbeddings\n",
"from langchain.chat_models.openai import ChatOpenAI\n",
"from langchain.utilities import GoogleSearchAPIWrapper\n",
"from langchain.retrievers.web_research import WebResearchRetriever"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "5d1ce098",
"metadata": {},
"outputs": [],
"source": [
"# Vectorstore\n",
"vectorstore = Chroma(embedding_function=OpenAIEmbeddings(),persist_directory=\"./chroma_db_oai\")\n",
"\n",
"# LLM\n",
"llm = ChatOpenAI(temperature=0)\n",
"\n",
"# Search \n",
"search = GoogleSearchAPIWrapper()"
]
},
{
"cell_type": "markdown",
"id": "6d808b9d",
"metadata": {},
"source": [
"Initialize retriever with the above tools to:\n",
" \n",
"* Use an LLM to generate multiple relevant search queries (one LLM call)\n",
"* Execute a search for each query\n",
"* Choose the top K links per query (multiple search calls in parallel)\n",
"* Load the information from all chosen links (scrape pages in parallel)\n",
"* Index those documents into a vectorstore\n",
"* Find the most relevant documents for each original generated search query"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "e3e3a589",
"metadata": {},
"outputs": [],
"source": [
"# Initialize\n",
"web_research_retriever = WebResearchRetriever.from_llm(\n",
" vectorstore=vectorstore,\n",
" llm=llm, \n",
" search=search)"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "20655b74",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:langchain.retrievers.web_research:Generating questions for Google Search ...\n",
"INFO:langchain.retrievers.web_research:Questions for Google Search (raw): {'question': 'How do LLM Powered Autonomous Agents work?', 'text': LineList(lines=['1. What is the functioning principle of LLM Powered Autonomous Agents?\\n', '2. How do LLM Powered Autonomous Agents operate?\\n'])}\n",
"INFO:langchain.retrievers.web_research:Questions for Google Search: ['1. What is the functioning principle of LLM Powered Autonomous Agents?\\n', '2. How do LLM Powered Autonomous Agents operate?\\n']\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Search results: [{'title': 'LLM Powered Autonomous Agents | Hacker News', 'link': 'https://news.ycombinator.com/item?id=36488871', 'snippet': 'Jun 26, 2023 ... Exactly. A temperature of 0 means you always pick the highest probability token (i.e. the \"max\" function), while a temperature of 1 means you\\xa0...'}]\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Search results: [{'title': \"LLM Powered Autonomous Agents | Lil'Log\", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... Task decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\\\\n1.\" , \"What are the subgoals for achieving XYZ?\" , (2) by\\xa0...'}]\n",
"INFO:langchain.retrievers.web_research:New URLs to load: []\n",
"INFO:langchain.retrievers.web_research:Grabbing most relevant splits from urls...\n"
]
},
{
"data": {
"text/plain": [
"{'question': 'How do LLM Powered Autonomous Agents work?',\n",
" 'answer': \"LLM-powered autonomous agents work by using LLM as the agent's brain, complemented by several key components such as planning, memory, and tool use. In terms of planning, the agent breaks down large tasks into smaller subgoals and can reflect and refine its actions based on past experiences. Memory is divided into short-term memory, which is used for in-context learning, and long-term memory, which allows the agent to retain and recall information over extended periods. Tool use involves the agent calling external APIs for additional information. These agents have been used in various applications, including scientific discovery and generative agents simulation.\",\n",
" 'sources': ''}"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Run\n",
"import logging\n",
"logging.basicConfig()\n",
"logging.getLogger(\"langchain.retrievers.web_research\").setLevel(logging.INFO)\n",
"from langchain.chains import RetrievalQAWithSourcesChain\n",
"user_input = \"How do LLM Powered Autonomous Agents work?\"\n",
"qa_chain = RetrievalQAWithSourcesChain.from_chain_type(llm,retriever=web_research_retriever)\n",
"result = qa_chain({\"question\": user_input})\n",
"result"
]
},
{
"cell_type": "markdown",
"id": "ff62e5f5",
"metadata": {},
"source": [
"### Going deeper \n",
"\n",
"* Here's a [app](https://github.com/langchain-ai/web-explorer/tree/main) that wraps this retriver with a lighweight UI."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -52,6 +52,7 @@ from langchain.document_loaders.blockchain import BlockchainDocumentLoader
from langchain.document_loaders.brave_search import BraveSearchLoader
from langchain.document_loaders.browserless import BrowserlessLoader
from langchain.document_loaders.chatgpt import ChatGPTLoader
from langchain.document_loaders.chromium import AsyncChromiumLoader
from langchain.document_loaders.college_confidential import CollegeConfidentialLoader
from langchain.document_loaders.concurrent import ConcurrentLoader
from langchain.document_loaders.confluence import ConfluenceLoader
@ -196,6 +197,9 @@ PagedPDFSplitter = PyPDFLoader
TelegramChatLoader = TelegramChatFileLoader
__all__ = [
"AcreomLoader",
"AsyncHtmlLoader",
"AsyncChromiumLoader",
"AZLyricsLoader",
"AcreomLoader",
"AirbyteCDKLoader",

@ -0,0 +1,90 @@
import asyncio
import logging
from typing import Iterator, List

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

logger = logging.getLogger(__name__)


class AsyncChromiumLoader(BaseLoader):
    """Scrape HTML content from provided URLs using a
    headless instance of the Chromium browser."""

    def __init__(
        self,
        urls: List[str],
    ):
        """
        Initialize the loader with a list of URL paths.

        Args:
            urls (List[str]): A list of URLs to scrape content from.

        Raises:
            ImportError: If the required 'playwright' package is not installed.
        """
        self.urls = urls

        try:
            import playwright  # noqa: F401
        except ImportError:
            raise ImportError(
                "playwright is required for AsyncChromiumLoader. "
                "Please install it with `pip install playwright`."
            )

    async def ascrape_playwright(self, url: str) -> str:
        """
        Asynchronously scrape the content of a given URL using Playwright's async API.

        Args:
            url (str): The URL to scrape.

        Returns:
            str: The scraped HTML content or an error message if an exception occurs.
        """
        from playwright.async_api import async_playwright

        logger.info("Starting scraping...")
        results = ""
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            try:
                page = await browser.new_page()
                await page.goto(url)
                results = await page.content()  # Simply get the HTML content
                logger.info("Content scraped")
            except Exception as e:
                # The error is returned as the content instead of being raised.
                results = f"Error: {e}"
            await browser.close()
        return results

    def lazy_load(self) -> Iterator[Document]:
        """
        Lazily load HTML content from the provided URLs.

        This method yields Documents one at a time as they're scraped,
        instead of waiting to scrape all URLs before returning.

        Each URL is scraped with its own `asyncio.run` call, so this method
        cannot be called from inside an already running event loop.

        Yields:
            Document: The scraped content encapsulated within a Document object.
        """
        for url in self.urls:
            html_content = asyncio.run(self.ascrape_playwright(url))
            metadata = {"source": url}
            yield Document(page_content=html_content, metadata=metadata)

    def load(self) -> List[Document]:
        """
        Load and return all Documents from the provided URLs.

        Returns:
            List[Document]: A list of Document objects
            containing the scraped content from each URL.
        """
        return list(self.lazy_load())
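
The loading pattern above can be sketched without a browser. This is a minimal illustration under stated assumptions, not the loader itself: `fake_scrape` is a hypothetical stand-in for `ascrape_playwright`, and plain `(url, content)` tuples stand in for `Document`. It shows two behaviors of the code above: each URL gets its own `asyncio.run` call inside a generator, and a failure comes back as an `Error: ...` string rather than an exception, so callers have to check for it.

```python
import asyncio
from typing import Iterator, List, Tuple


async def fake_scrape(url: str) -> str:
    """Hypothetical stand-in for ascrape_playwright; no browser, no network."""
    try:
        if not url.startswith("http"):
            raise ValueError(f"unsupported URL: {url}")
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return f"<html><body>{url}</body></html>"
    except Exception as e:
        # Same shape as the loader: the failure becomes the returned content.
        return f"Error: {e}"


def lazy_load(urls: List[str]) -> Iterator[Tuple[str, str]]:
    """Yield (url, content) pairs one at a time, one event loop run per URL."""
    for url in urls:
        yield url, asyncio.run(fake_scrape(url))


pages = list(lazy_load(["https://example.com", "not-a-url"]))

# Failures are not raised, so they have to be filtered out by inspecting content.
failed = [url for url, content in pages if content.startswith("Error: ")]
```

Because errors are folded into the content, a caller that passes the results straight to a transformer or an LLM would process the error text as if it were a page; filtering as in the last line is one way to avoid that.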

@ -15,6 +15,9 @@
Document
""" # noqa: E501
from langchain.document_transformers.beautiful_soup_transformer import (
    BeautifulSoupTransformer,
)
from langchain.document_transformers.doctran_text_extract import (
    DoctranPropertyExtractor,
)
@ -31,6 +34,7 @@ from langchain.document_transformers.nuclia_text_transform import NucliaTextTran
from langchain.document_transformers.openai_functions import OpenAIMetadataTagger
__all__ = [
    "BeautifulSoupTransformer",
    "DoctranQATransformer",
    "DoctranTextTranslator",
    "DoctranPropertyExtractor",

@ -0,0 +1,143 @@
from typing import Any, List, Sequence

from langchain.schema import BaseDocumentTransformer, Document


class BeautifulSoupTransformer(BaseDocumentTransformer):
    """Transform HTML content by extracting specific tags and removing unwanted ones.

    Example:
        .. code-block:: python

            from langchain.document_transformers import BeautifulSoupTransformer

            bs4_transformer = BeautifulSoupTransformer()
            docs_transformed = bs4_transformer.transform_documents(docs)
    """

    def __init__(self) -> None:
        """
        Initialize the transformer.

        This checks if the BeautifulSoup4 package is installed.
        If not, it raises an ImportError.
        """
        try:
            import bs4  # noqa:F401
        except ImportError:
            raise ImportError(
                "BeautifulSoup4 is required for BeautifulSoupTransformer. "
                "Please install it with `pip install beautifulsoup4`."
            )

    def transform_documents(
        self,
        documents: Sequence[Document],
        unwanted_tags: List[str] = ["script", "style"],
        tags_to_extract: List[str] = ["p", "li", "div", "a"],
        remove_lines: bool = True,
        **kwargs: Any,
    ) -> Sequence[Document]:
        """
        Transform a list of Document objects by cleaning their HTML content.

        Args:
            documents: A sequence of Document objects containing HTML content.
            unwanted_tags: A list of tags to be removed from the HTML.
            tags_to_extract: A list of tags whose content will be extracted.
            remove_lines: If set to True, empty and duplicate lines are
                removed from the extracted text.

        Returns:
            The same Document objects, with their page_content replaced in place.
        """
        for doc in documents:
            cleaned_content = doc.page_content

            cleaned_content = self.remove_unwanted_tags(cleaned_content, unwanted_tags)

            cleaned_content = self.extract_tags(cleaned_content, tags_to_extract)

            if remove_lines:
                cleaned_content = self.remove_unnecessary_lines(cleaned_content)

            doc.page_content = cleaned_content

        return documents

    @staticmethod
    def remove_unwanted_tags(html_content: str, unwanted_tags: List[str]) -> str:
        """
        Remove unwanted tags from a given HTML content.

        Args:
            html_content: The original HTML content string.
            unwanted_tags: A list of tags to be removed from the HTML.

        Returns:
            A cleaned HTML string with unwanted tags removed.
        """
        from bs4 import BeautifulSoup

        soup = BeautifulSoup(html_content, "html.parser")
        for tag in unwanted_tags:
            for element in soup.find_all(tag):
                element.decompose()
        return str(soup)

    @staticmethod
    def extract_tags(html_content: str, tags: List[str]) -> str:
        """
        Extract specific tags from a given HTML content.

        Tags are processed in the order given, so the output is grouped by tag
        rather than by document order, and the text of a nested match (for
        example a <p> inside a <div>) appears once for each matching tag.

        Args:
            html_content: The original HTML content string.
            tags: A list of tags to be extracted from the HTML.

        Returns:
            A string combining the content of the extracted tags.
        """
        from bs4 import BeautifulSoup

        soup = BeautifulSoup(html_content, "html.parser")
        text_parts = []
        for tag in tags:
            elements = soup.find_all(tag)
            for element in elements:
                if tag == "a":
                    # Links keep their target: "link text (href)".
                    href = element.get("href")
                    if href:
                        text_parts.append(f"{element.get_text()} ({href})")
                    else:
                        text_parts.append(element.get_text())
                else:
                    text_parts.append(element.get_text())
        return " ".join(text_parts)

    @staticmethod
    def remove_unnecessary_lines(content: str) -> str:
        """
        Clean up the content by removing unnecessary lines.

        Each line is stripped of surrounding whitespace; empty lines and
        repeated lines are dropped, and the rest are joined with single spaces.

        Args:
            content: A string, which may contain unnecessary lines or spaces.

        Returns:
            A cleaned string with unnecessary lines removed.
        """
        lines = content.split("\n")
        stripped_lines = [line.strip() for line in lines]
        non_empty_lines = [line for line in stripped_lines if line]
        seen = set()
        deduped_lines = []
        for line in non_empty_lines:
            if line not in seen:
                seen.add(line)
                deduped_lines.append(line)
        cleaned_content = " ".join(deduped_lines)
        return cleaned_content

    async def atransform_documents(
        self,
        documents: Sequence[Document],
        **kwargs: Any,
    ) -> Sequence[Document]:
        raise NotImplementedError
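
The effect of `remove_unnecessary_lines` is easiest to see on a small input. The sketch below copies that method's logic into a standalone function so it runs without `bs4` or LangChain; the sample string `raw` is made up for illustration.

```python
def remove_unnecessary_lines(content: str) -> str:
    """Standalone copy of the cleanup step: strip, drop empties, dedupe, join."""
    seen = set()
    kept = []
    for line in content.split("\n"):
        line = line.strip()
        # Keep only the first occurrence of each non-empty line.
        if line and line not in seen:
            seen.add(line)
            kept.append(line)
    return " ".join(kept)


# Hypothetical extracted text: a nav label repeated, plus stray blank lines.
raw = "  Home \n\nHome\n  Pricing\n\n\nContact  \nPricing"
cleaned = remove_unnecessary_lines(raw)
print(cleaned)  # Home Pricing Contact
```

The deduplication is by whole line after stripping, so a line that legitimately repeats further down the page (here the second `Pricing`) is dropped as well. Passing `remove_lines=False` to `transform_documents` skips this step.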