add HTMLHeaderTextSplitter (#11039)

Description: Similar in concept to the `MarkdownHeaderTextSplitter`, the
`HTMLHeaderTextSplitter` is a "structure-aware" chunker that splits text
at the element level and adds metadata for each header "relevant" to any
given chunk. It can return chunks element by element or combine elements
with the same metadata, with the objectives of (a) keeping related text
grouped (more or less) semantically and (b) preserving context-rich
information encoded in document structures. It can be used with other
text splitters as part of a chunking pipeline.
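
A minimal usage sketch (mirroring the notebook added below; the HTML string here is illustrative):

```python
from langchain.text_splitter import HTMLHeaderTextSplitter

# Map each header tag to track onto an (arbitrary) metadata key.
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]

splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
docs = splitter.split_text("<h1>Foo</h1><p>Some intro text about Foo.</p>")
for doc in docs:
    print(doc.metadata, doc.page_content)
```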

Dependency: the `lxml` Python package

Maintainer: @hwchase17

Twitter handle: @MartinZirulnik

---------

Co-authored-by: PresidioVantage <github@presidiovantage.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>

@@ -0,0 +1,241 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "c95fcd15cd52c944",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"# HTMLHeaderTextSplitter\n",
"## Description and motivation\n",
"Similar in concept to the <a href=\"https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/markdown_header_metadata\">`MarkdownHeaderTextSplitter`</a>, the `HTMLHeaderTextSplitter` is a \"structure-aware\" chunker that splits text at the element level and adds metadata for each header \"relevant\" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline.\n",
"\n",
"## Usage examples\n",
"#### 1) With an HTML string:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "initial_id",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-02T18:57:49.208965400Z",
"start_time": "2023-10-02T18:57:48.899756Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Foo'),\n",
" Document(page_content='Some intro text about Foo. \\nBar main section Bar subsection 1 Bar subsection 2', metadata={'Header 1': 'Foo'}),\n",
" Document(page_content='Some intro text about Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}),\n",
" Document(page_content='Some text about the first subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}),\n",
" Document(page_content='Some text about the second subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}),\n",
" Document(page_content='Baz', metadata={'Header 1': 'Foo'}),\n",
" Document(page_content='Some text about Baz', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}),\n",
" Document(page_content='Some concluding text about Foo', metadata={'Header 1': 'Foo'})]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.text_splitter import HTMLHeaderTextSplitter\n",
"\n",
"html_string =\"\"\"\n",
"<!DOCTYPE html>\n",
"<html>\n",
"<body>\n",
" <div>\n",
" <h1>Foo</h1>\n",
" <p>Some intro text about Foo.</p>\n",
" <div>\n",
" <h2>Bar main section</h2>\n",
" <p>Some intro text about Bar.</p>\n",
" <h3>Bar subsection 1</h3>\n",
" <p>Some text about the first subtopic of Bar.</p>\n",
" <h3>Bar subsection 2</h3>\n",
" <p>Some text about the second subtopic of Bar.</p>\n",
" </div>\n",
" <div>\n",
" <h2>Baz</h2>\n",
" <p>Some text about Baz</p>\n",
" </div>\n",
" <br>\n",
" <p>Some concluding text about Foo</p>\n",
" </div>\n",
"</body>\n",
"</html>\n",
"\"\"\"\n",
"\n",
"headers_to_split_on = [\n",
" (\"h1\", \"Header 1\"),\n",
" (\"h2\", \"Header 2\"),\n",
" (\"h3\", \"Header 3\"),\n",
"]\n",
"\n",
"html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
"html_header_splits = html_splitter.split_text(html_string)\n",
"html_header_splits"
]
},
{
"cell_type": "markdown",
"id": "e29b4aade2a0070c",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"#### 2) Pipelined to another splitter, with html loaded from a web URL:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "6ada8ea093ea0475",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-02T18:57:51.016141300Z",
"start_time": "2023-10-02T18:57:50.647495400Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='We see that Gödel first tried to reduce the consistency problem for analysis to that of arithmetic. This seemed to require a truth definition for arithmetic, which in turn led to paradoxes, such as the Liar paradox (“This sentence is false”) and Berrys paradox (“The least number not defined by an expression consisting of just fourteen English words”). Gödel then noticed that such paradoxes would not necessarily arise if truth were replaced by provability. But this means that arithmetic truth', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödels Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}),\n",
" Document(page_content='means that arithmetic truth and arithmetic provability are not co-extensive — whence the First Incompleteness Theorem.', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödels Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}),\n",
" Document(page_content='This account of Gödels discovery was told to Hao Wang very much after the fact; but in Gödels contemporary correspondence with Bernays and Zermelo, essentially the same description of his path to the theorems is given. (See Gödel 2003a and Gödel 2003b respectively.) From those accounts we see that the undefinability of truth in arithmetic, a result credited to Tarski, was likely obtained in some form by Gödel by 1931. But he neither publicized nor published the result; the biases logicians', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödels Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}),\n",
" Document(page_content='result; the biases logicians had expressed at the time concerning the notion of truth, biases which came vehemently to the fore when Tarski announced his results on the undefinability of truth in formal systems 1935, may have served as a deterrent to Gödels publication of that theorem.', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödels Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}),\n",
" Document(page_content='We now describe the proof of the two theorems, formulating Gödels results in Peano arithmetic. Gödel himself used a system related to that defined in Principia Mathematica, but containing Peano arithmetic. In our presentation of the First and Second Incompleteness Theorems we refer to Peano arithmetic as P, following Gödels notation.', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödels Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.2 The proof of the First Incompleteness Theorem'})]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"\n",
"url = \"https://plato.stanford.edu/entries/goedel/\"\n",
"\n",
"headers_to_split_on = [\n",
" (\"h1\", \"Header 1\"),\n",
" (\"h2\", \"Header 2\"),\n",
" (\"h3\", \"Header 3\"),\n",
" (\"h4\", \"Header 4\"),\n",
"]\n",
"\n",
"html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
"\n",
"#for local file use html_splitter.split_text_from_file(<path_to_file>)\n",
"html_header_splits = html_splitter.split_text_from_url(url)\n",
"\n",
"chunk_size = 500\n",
"chunk_overlap = 30\n",
"text_splitter = RecursiveCharacterTextSplitter(\n",
" chunk_size=chunk_size, chunk_overlap=chunk_overlap\n",
")\n",
"\n",
"# Split\n",
"splits = text_splitter.split_documents(html_header_splits)\n",
"splits[80:85]"
]
},
{
"cell_type": "markdown",
"id": "ac0930371d79554a",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"## Limitations\n",
"\n",
"There can be quite a bit of structural variation from one HTML document to another, and while `HTMLHeaderTextSplitter` will attempt to attach all \"relevant\" headers to any given chunk, it can sometimes miss certain headers. For example, the algorithm assumes an informational hierarchy in which headers are always at nodes \"above\" associated text, i.e. prior siblings, ancestors, and combinations thereof. In the following news article (as of the writing of this document), the document is structured such that the text of the top-level headline, while tagged \"h1\", is in a *distinct* subtree from the text elements that we'd expect it to be *\"above\"*&mdash;so we can observe that the \"h1\" element and its associated text do not show up in the chunk metadata (but, where applicable, we do see \"h2\" and its associated text): \n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "5a5ec1482171b119",
"metadata": {
"ExecuteTime": {
"end_time": "2023-10-02T19:03:25.943524300Z",
"start_time": "2023-10-02T19:03:25.691641Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"No two El Niño winters are the same, but many have temperature and precipitation trends in common. \n",
"Average conditions during an El Niño winter across the continental US. \n",
"One of the major reasons is the position of the jet stream, which often shifts south during an El Niño winter. This shift typically brings wetter and cooler weather to the South while the North becomes drier and warmer, according to NOAA. \n",
"Because the jet stream is essentially a river of air that storms flow through, the\n"
]
}
],
"source": [
"url = \"https://www.cnn.com/2023/09/25/weather/el-nino-winter-us-climate/index.html\"\n",
"\n",
"headers_to_split_on = [\n",
" (\"h1\", \"Header 1\"),\n",
" (\"h2\", \"Header 2\"),\n",
"]\n",
"\n",
"html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
"html_header_splits = html_splitter.split_text_from_url(url)\n",
"print(html_header_splits[1].page_content[:500])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "poetry-venv",
"language": "python",
"name": "poetry-venv"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@@ -40,9 +40,14 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 1,
"id": "ceb3c1fb",
"metadata": {},
"metadata": {
"ExecuteTime": {
"end_time": "2023-09-25T19:12:27.243781300Z",
"start_time": "2023-09-25T19:12:24.943559400Z"
}
},
"outputs": [],
"source": [
"from langchain.text_splitter import MarkdownHeaderTextSplitter"
@@ -50,19 +55,20 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 2,
"id": "2ae3649b",
"metadata": {},
"metadata": {
"ExecuteTime": {
"end_time": "2023-09-25T19:12:31.917013600Z",
"start_time": "2023-09-25T19:12:31.905694500Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Hi this is Jim \\nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),\n",
" Document(page_content='Hi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),\n",
" Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]"
]
"text/plain": "[Document(page_content='Hi this is Jim \\nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),\n Document(page_content='Hi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),\n Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]"
},
"execution_count": 5,
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
@@ -83,17 +89,20 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 3,
"id": "aac1738c",
"metadata": {},
"metadata": {
"ExecuteTime": {
"end_time": "2023-09-25T19:12:35.672077100Z",
"start_time": "2023-09-25T19:12:35.666731400Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"langchain.schema.Document"
]
"text/plain": "langchain.schema.document.Document"
},
"execution_count": 4,
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
@@ -112,21 +121,20 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 4,
"id": "480e0e3a",
"metadata": {},
"metadata": {
"ExecuteTime": {
"end_time": "2023-09-25T19:12:41.337249Z",
"start_time": "2023-09-25T19:12:41.326099200Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),\n",
" Document(page_content='Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),\n",
" Document(page_content='As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \\nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \\n#### Standardization', metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}),\n",
" Document(page_content='#### Standardization \\nFrom 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.', metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}),\n",
" Document(page_content='Implementations of Markdown are available for over a dozen programming languages.', metadata={'Header 1': 'Intro', 'Header 2': 'Implementations'})]"
]
"text/plain": "[Document(page_content='Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),\n Document(page_content='Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),\n Document(page_content='As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \\nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \\n#### Standardization', metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}),\n Document(page_content='#### Standardization \\nFrom 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.', metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}),\n Document(page_content='Implementations of Markdown are available for over a dozen programming languages.', metadata={'Header 1': 'Intro', 'Header 2': 'Implementations'})]"
},
"execution_count": 8,
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
@@ -156,6 +164,16 @@
"splits = text_splitter.split_documents(md_header_splits)\n",
"splits"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [],
"metadata": {
"collapsed": false
},
"id": "4017f148d414a45c"
}
],
"metadata": {

@@ -0,0 +1,199 @@
<?xml version="1.0" encoding="UTF-8" ?>
<!-- HTML PRE CHUNK:
This performs a best-effort preliminary "chunking" of text in an HTML file,
matching each chunk with a "headers" metadata value based on header tags in proximity.
recursively visits every element (template mode=list).
for every element with tagname of interest (only):
1. serializes a div (and metadata marking the element's xpath).
2. calculates all text-content for the given element, including descendant elements which are *not* themselves tags of interest.
3. if any such text-content was found, serializes a "headers" (span.headers) along with this text (span.chunk).
to calculate the "headers" of an element:
1. recursively gets the *nearest* prior-siblings for headings of *each* level
2. recursively repeats that step#1 for each ancestor (regardless of tag)
n.b. this recursion is only performed (beginning with) elements which are
both (1) tags-of-interest and (2) have their own text-content.
-->
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns="http://www.w3.org/1999/xhtml">
<xsl:param name="tags">div|p|blockquote|ol|ul</xsl:param>
<xsl:template match="/">
<html>
<head>
<style>
div {
border: solid;
margin-top: .5em;
padding-left: .5em;
}
h1, h2, h3, h4, h5, h6 {
margin: 0;
}
.xpath {
color: blue;
}
.chunk {
margin: .5em 1em;
}
</style>
</head>
<body>
<!-- create "filtered tree" with only tags of interest -->
<xsl:apply-templates select="*" />
</body>
</html>
</xsl:template>
<xsl:template match="*">
<xsl:choose>
<!-- tags of interest get serialized into the filtered tree (and recurse down child elements) -->
<xsl:when test="contains(
concat('|', $tags, '|'),
concat('|', local-name(), '|'))">
<xsl:variable name="xpath">
<xsl:apply-templates mode="xpath" select="." />
</xsl:variable>
<xsl:variable name="txt">
<!-- recurse down child text-nodes and elements -->
<xsl:apply-templates mode="text" />
</xsl:variable>
<xsl:variable name="txt-norm" select="normalize-space($txt)" />
<div title="{$xpath}">
<small class="xpath">
<xsl:value-of select="$xpath" />
</small>
<xsl:if test="$txt-norm">
<xsl:variable name="headers">
<xsl:apply-templates mode="headingsWithAncestors" select="." />
</xsl:variable>
<xsl:if test="normalize-space($headers)">
<span class="headers">
<xsl:copy-of select="$headers" />
</span>
</xsl:if>
<p class="chunk">
<xsl:value-of select="$txt-norm" />
</p>
</xsl:if>
<xsl:apply-templates select="*" />
</div>
</xsl:when>
<!-- all other tags get "skipped" and recurse down child elements -->
<xsl:otherwise>
<xsl:apply-templates select="*" />
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<!-- text mode:
prints text nodes;
for elements, recurses down child nodes (text and elements) *except* certain exceptions:
tags of interest (handled in their own list-mode match),
non-content text (e.g. script|style)
-->
<!-- ignore non-content text -->
<xsl:template mode="text" match="
script|style" />
<!-- for all other elements *except tags of interest*, recurse on child-nodes (text and elements) -->
<xsl:template mode="text" match="*">
<xsl:choose>
<!-- ignore tags of interest -->
<xsl:when test="contains(
concat('|', $tags, '|'),
concat('|', local-name(), '|'))" />
<xsl:otherwise>
<xsl:apply-templates mode="text" />
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<!-- xpath mode:
return an xpath which matches this element uniquely
-->
<xsl:template mode="xpath" match="*">
<!-- recurse up parents -->
<xsl:apply-templates mode="xpath" select="parent::*" />
<xsl:value-of select="name()" />
<xsl:text>[</xsl:text>
<xsl:value-of select="1+count(preceding-sibling::*)" />
<xsl:text>]/</xsl:text>
</xsl:template>
<!-- headingsWithAncestors mode:
recurses up parents (ALL ancestors)
-->
<xsl:template mode="headingsWithAncestors" match="*">
<!-- recurse -->
<xsl:apply-templates mode="headingsWithAncestors" select="parent::*" />
<xsl:apply-templates mode="headingsWithPriorSiblings" select=".">
<xsl:with-param name="maxHead" select="6" />
</xsl:apply-templates>
</xsl:template>
<!-- headingsWithPriorSiblings mode:
recurses up preceding-siblings
-->
<xsl:template mode="headingsWithPriorSiblings" match="*">
<xsl:param name="maxHead" />
<xsl:variable name="headLevel" select="number(substring(local-name(), 2))" />
<xsl:choose>
<xsl:when test="'h' = substring(local-name(), 1, 1) and $maxHead >= $headLevel">
<!-- recurse up to prior sibling; max level one less than current -->
<xsl:apply-templates mode="headingsWithPriorSiblings" select="preceding-sibling::*[1]">
<xsl:with-param name="maxHead" select="$headLevel - 1" />
</xsl:apply-templates>
<xsl:apply-templates mode="heading" select="." />
</xsl:when>
<!-- special case for 'header' tag, serialize child-headers -->
<xsl:when test="self::header">
<xsl:apply-templates mode="heading" select="h1|h2|h3|h4|h5|h6" />
<!--
we choose not to recurse further up prior-siblings in this case,
but n.b. the 'headingsWithAncestors' template above will still continue recursion.
-->
</xsl:when>
<xsl:otherwise>
<!-- recurse up to prior sibling; no other work on this element -->
<xsl:apply-templates mode="headingsWithPriorSiblings" select="preceding-sibling::*[1]">
<xsl:with-param name="maxHead" select="$maxHead" />
</xsl:apply-templates>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<xsl:template mode="heading" match="h1|h2|h3|h4|h5|h6">
<xsl:copy>
<xsl:value-of select="normalize-space(.)" />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>

@@ -8,7 +8,7 @@
BaseDocumentTransformer --> TextSplitter --> <name>TextSplitter # Example: CharacterTextSplitter
RecursiveCharacterTextSplitter --> <name>TextSplitter
Note: **MarkdownHeaderTextSplitter** does not derive from TextSplitter.
Note: **MarkdownHeaderTextSplitter** and **HTMLHeaderTextSplitter** do not derive from TextSplitter.
**Main helpers:**
@@ -23,10 +23,12 @@ from __future__ import annotations
import copy
import logging
import pathlib
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from io import BytesIO, StringIO
from typing import (
AbstractSet,
Any,
@@ -46,6 +48,8 @@ from typing import (
cast,
)
import requests
from langchain.docstore.document import Document
from langchain.schema import BaseDocumentTransformer
@@ -463,6 +467,159 @@ class MarkdownHeaderTextSplitter:
]
class ElementType(TypedDict):
"""Element type as typed dict."""
url: str
xpath: str
content: str
metadata: Dict[str, str]
class HTMLHeaderTextSplitter:
"""
Splitting HTML files based on specified headers.
Requires lxml package.
"""
def __init__(
self,
headers_to_split_on: List[Tuple[str, str]],
return_each_element: bool = False,
):
"""Create a new HTMLHeaderTextSplitter.
Args:
headers_to_split_on: list of tuples of headers we want to track mapped to
(arbitrary) keys for metadata. Allowed header values: h1, h2, h3, h4,
h5, h6 e.g. [("h1", "Header 1"), ("h2", "Header 2)].
return_each_element: Return each element w/ associated headers.
"""
# Output element-by-element or aggregated into chunks w/ common headers
self.return_each_element = return_each_element
self.headers_to_split_on = sorted(headers_to_split_on)
def aggregate_elements_to_chunks(
self, elements: List[ElementType]
) -> List[Document]:
"""Combine elements with common metadata into chunks
Args:
elements: HTML element content with associated identifying info and metadata
"""
aggregated_chunks: List[ElementType] = []
for element in elements:
if (
aggregated_chunks
and aggregated_chunks[-1]["metadata"] == element["metadata"]
):
# If the last element in the aggregated list
# has the same metadata as the current element,
# append the current content to the last element's content
aggregated_chunks[-1]["content"] += " \n" + element["content"]
else:
# Otherwise, append the current element to the aggregated list
aggregated_chunks.append(element)
return [
Document(page_content=chunk["content"], metadata=chunk["metadata"])
for chunk in aggregated_chunks
]
def split_text_from_url(self, url: str) -> List[Document]:
"""Split HTML from web URL
Args:
url: web URL
"""
r = requests.get(url)
return self.split_text_from_file(BytesIO(r.content))
def split_text(self, text: str) -> List[Document]:
"""Split HTML text string
Args:
text: HTML text
"""
return self.split_text_from_file(StringIO(text))
def split_text_from_file(self, file: Any) -> List[Document]:
"""Split HTML file
Args:
file: HTML file
"""
try:
from lxml import etree
except ImportError as e:
raise ImportError(
"Unable to import lxml, please install with `pip install lxml`."
) from e
# use lxml library to parse html document and return xml ElementTree
parser = etree.HTMLParser()
tree = etree.parse(file, parser)
# document transformation for "structure-aware" chunking is handled with xsl.
# see comments in html_chunks_with_headers.xslt for more detailed information.
xslt_path = (
pathlib.Path(__file__).parent
/ "document_transformers/xsl/html_chunks_with_headers.xslt"
)
xslt_tree = etree.parse(xslt_path)
transform = etree.XSLT(xslt_tree)
result = transform(tree)
result_dom = etree.fromstring(str(result))
# create filter and mapping for header metadata
header_filter = [header[0] for header in self.headers_to_split_on]
header_mapping = dict(self.headers_to_split_on)
# map xhtml namespace prefix
ns_map = {"h": "http://www.w3.org/1999/xhtml"}
# build list of elements from DOM
elements = []
for element in result_dom.findall("*//*", ns_map):
if element.findall("*[@class='headers']") or element.findall(
"*[@class='chunk']"
):
elements.append(
ElementType(
url=file,  # n.b. may be a path or file-like object, not necessarily a URL
xpath="".join(
[
node.text
for node in element.findall("*[@class='xpath']", ns_map)
]
),
content="".join(
[
node.text
for node in element.findall("*[@class='chunk']", ns_map)
]
),
metadata={
# Add text of specified headers to metadata using header
# mapping.
header_mapping[node.tag]: node.text
for node in filter(
lambda x: x.tag in header_filter,
element.findall("*[@class='headers']/*", ns_map),
)
},
)
)
if not self.return_each_element:
return self.aggregate_elements_to_chunks(elements)
else:
return [
Document(page_content=chunk["content"], metadata=chunk["metadata"])
for chunk in elements
]
# in newer Python versions (3.10+), this could be:
# @dataclass(frozen=True, kw_only=True, slots=True)
@dataclass(frozen=True)
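
A short sketch of the two output modes described above (the local file name is hypothetical):

```python
from langchain.text_splitter import HTMLHeaderTextSplitter

splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")],
    return_each_element=True,  # one Document per element, no aggregation
)

# split_text_from_file accepts a path or file-like object; "example.html"
# is a hypothetical local file.
docs = splitter.split_text_from_file("example.html")
for doc in docs:
    print(doc.metadata, doc.page_content[:60])
```

With the default return_each_element=False, consecutive elements that share the same header metadata are instead merged into a single Document.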
