langchain/docs/modules/indexes/document_loaders/examples/html.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2dfc4698",
   "metadata": {},
   "source": [
    "# HTML\n",
    "\n",
    "This notebook covers how to load HTML documents into `Document` objects that we can use downstream."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "24b434b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import UnstructuredHTMLLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "00f46fda",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = UnstructuredHTMLLoader(\"example_data/fake-content.html\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "b68a26b3",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "34de48fa",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='My First Heading\\n\\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content.html'}, lookup_index=0)]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "00337aae",
   "metadata": {},
   "source": [
    "## Loading HTML with BeautifulSoup4\n",
    "\n",
    "We can also load HTML documents with the `BSHTMLLoader`, which uses `BeautifulSoup4`. It extracts the text from the HTML into `page_content` and stores the page title as `title` in `metadata`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "79b1bce4",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.document_loaders import BSHTMLLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "4be99e6c",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='\\n\\nTest Title\\n\\n\\nMy First Heading\\nMy first paragraph.\\n\\n\\n', metadata={'source': 'example_data/fake-content.html', 'title': 'Test Title'})]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "loader = BSHTMLLoader(\"example_data/fake-content.html\")\n",
    "data = loader.load()\n",
    "data"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}