diff --git a/docs/modules/indexes/text_splitters/examples/html.ipynb b/docs/modules/indexes/text_splitters/examples/html.ipynb
new file mode 100644
index 00000000..53905136
--- /dev/null
+++ b/docs/modules/indexes/text_splitters/examples/html.ipynb
@@ -0,0 +1,172 @@
+{
+ "cells": [
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "80f6cd99",
+ "metadata": {},
+ "source": [
+ "# HTML\n",
+ "\n",
+ ">[HTML](https://en.wikipedia.org/wiki/HTML) is the standard markup language for documents designed to be displayed in a web browser.\n",
+ "\n",
+ "`HtmlTextSplitter` splits text along HTML tags. It's implemented as a simple subclass of `RecursiveCharacterTextSplitter` with HTML-specific separators. See the source code for the HTML tags used as separators by default.\n",
+ "\n",
+ "1. How the text is split: by a list of HTML-specific separators\n",
+ "2. How the chunk size is measured: by number of characters"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "96d64839",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "from langchain.text_splitter import HtmlTextSplitter"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "cfb0da17",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "html_text = \"\"\"\n",
+ "\n",
+ "\n",
+ "
\n", + "⚡ Building applications with LLMs through composability ⚡
\\n