langchain/docs/extras/integrations/document_loaders/browserless.ipynb

{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Browserless\n",
    "\n",
    "Browserless is a cloud service that runs headless Chrome instances and exposes them through a REST API, so you can run browser-based automation at scale without managing your own browser infrastructure.\n",
    "\n",
    "To use Browserless as a document loader, initialize a `BrowserlessLoader` instance as shown in this notebook. By default, `BrowserlessLoader` returns the `innerText` of the page's `body` element. To get the raw HTML instead, set `text_content` to `False`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import BrowserlessLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Replace the placeholder with your own Browserless API token.\n",
    "BROWSERLESS_API_TOKEN = \"YOUR_BROWSERLESS_API_TOKEN\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Jump to content\n",
      "Main menu\n",
      "Search\n",
      "Create account\n",
      "Log in\n",
      "Personal tools\n",
      "Toggle the table of contents\n",
      "Document classification\n",
      "17 languages\n",
      "Article\n",
      "Talk\n",
      "Read\n",
      "Edit\n",
      "View history\n",
      "Tools\n",
      "From Wikipedia, the free encyclopedia\n",
      "\n",
      "Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done \"manually\" (or \"intellectually\") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification.\n",
      "\n",
      "The documents to be classified may be texts, images, music, etc. Each kind of document possesses its special classification problems. When not otherwise specified, text classification is implied.\n",
      "\n",
      "Do\n"
     ]
    }
   ],
   "source": [
    "loader = BrowserlessLoader(\n",
    "    api_token=BROWSERLESS_API_TOKEN,\n",
    "    urls=[\n",
    "        \"https://en.wikipedia.org/wiki/Document_classification\",\n",
    "    ],\n",
    "    text_content=True,  # default: return the body's innerText, not raw HTML\n",
    ")\n",
    "\n",
    "documents = loader.load()\n",
    "\n",
    "print(documents[0].page_content[:1000])"
   ]
  }
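  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next cell is a minimal sketch of the raw-HTML mode mentioned at the top of this notebook. It assumes the same `BROWSERLESS_API_TOKEN` and URL as the previous cell and only changes `text_content` to `False`, so `page_content` should hold the page's HTML rather than its `innerText`. The names `html_loader` and `html_documents` are illustrative, and the cell has not been executed here, so no output is shown."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: same loader as above, but request raw HTML instead of innerText.\n",
    "# `html_loader` and `html_documents` are illustrative names.\n",
    "html_loader = BrowserlessLoader(\n",
    "    api_token=BROWSERLESS_API_TOKEN,\n",
    "    urls=[\n",
    "        \"https://en.wikipedia.org/wiki/Document_classification\",\n",
    "    ],\n",
    "    text_content=False,\n",
    ")\n",
    "\n",
    "html_documents = html_loader.load()\n",
    "\n",
    "# Print the first 1000 characters of the returned HTML.\n",
    "print(html_documents[0].page_content[:1000])"
   ]
  }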
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

add browserless loader (#7562), 2023-07-13 20:18:28 +00:00: Added support for Browserless' `/content` endpoint as a document loader. Browserless is a cloud service that provides access to headless Chrome browsers via a REST API. It allows developers to automate Chromium in a serverless fashion without having to configure and maintain their own Chrome infrastructure. Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>, Lance Martin <lance@langchain.dev>

Add text_content kwarg to BrowserlessLoader (#7856), 2023-07-18 00:02:19 +00:00: Added keyword argument to toggle between getting the text content of a site versus its HTML when using the `BrowserlessLoader`.