diff --git a/docs/docs_skeleton/docs/use_cases/web_scraping/index.mdx b/docs/docs_skeleton/docs/use_cases/web_scraping/index.mdx
new file mode 100644
index 0000000000..c62c8e6821
--- /dev/null
+++ b/docs/docs_skeleton/docs/use_cases/web_scraping/index.mdx
@@ -0,0 +1,9 @@
+---
+sidebar_position: 3
+---
+
+# Web Scraping
+
+Web scraping has historically been challenging because website structures change constantly, which makes scraping scripts tedious to maintain. Traditional methods often rely on specific HTML tags and patterns, and data extraction breaks when those are altered.
+
+Enter the LLM-based method for parsing HTML: by leveraging LLMs, and especially OpenAI Functions in LangChain's extraction chain, developers can instruct the model to extract only the desired data in a specified format. This method streamlines the extraction process and significantly reduces the time spent on manual debugging and script modifications. Its adaptability means that extraction can remain consistent and robust even if websites undergo significant design changes. This resilience translates to reduced maintenance effort, cost savings, and higher-quality extracted data. Compared to traditional methods, the LLM-based approach turns a historically cumbersome task into a more automated and efficient process.
diff --git a/docs/docs_skeleton/static/img/web_research.png b/docs/docs_skeleton/static/img/web_research.png
new file mode 100644
index 0000000000..3192f7570e
Binary files /dev/null and b/docs/docs_skeleton/static/img/web_research.png differ
diff --git a/docs/docs_skeleton/static/img/web_scraping.png b/docs/docs_skeleton/static/img/web_scraping.png
new file mode 100644
index 0000000000..738fbdd449
Binary files /dev/null and b/docs/docs_skeleton/static/img/web_scraping.png differ
diff --git a/docs/docs_skeleton/static/img/wsj_page.png b/docs/docs_skeleton/static/img/wsj_page.png
new file mode 100644
index 0000000000..65644746ae
Binary files /dev/null and b/docs/docs_skeleton/static/img/wsj_page.png differ
diff --git a/docs/extras/integrations/document_loaders/async_chromium.ipynb b/docs/extras/integrations/document_loaders/async_chromium.ipynb
new file mode 100644
index 0000000000..ddb220913c
--- /dev/null
+++ b/docs/extras/integrations/document_loaders/async_chromium.ipynb
@@ -0,0 +1,101 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ad553e51",
+   "metadata": {},
+   "source": [
+    "# Async Chromium\n",
+    "\n",
+    "Chromium is one of the browsers supported by Playwright, a library for browser automation. \n",
+    "\n",
+    "Running `p.chromium.launch(headless=True)` launches a headless instance of Chromium. \n",
+    "\n",
+    "Headless mode means that the browser runs without a graphical user interface.\n",
+    "\n",
+    "`AsyncChromiumLoader` loads the page, and then we use `Html2TextTransformer` to transform the HTML to text."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1c3a4c19",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! pip install -q playwright beautifulsoup4\n",
+    "! playwright install"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "dd2cdea7",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'