"Within each markdown group we can then apply any text splitter we want. "
"md_header_splits"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "480e0e3a",
"metadata": {},
"outputs": [],
"source": [
"markdown_document = \"# Intro \\n\\n ## History \\n\\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \\n\\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \\n\\n ## Rise and divergence \\n\\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \\n\\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \\n\\n #### Standardization \\n\\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \\n\\n ## Implementations \\n\\n Implementations of Markdown are available for over a dozen programming languages.\"\n",
" _metadatas = [header_group['metadata'] for _ in _splits]\n",
" all_splits += _splits\n",
" all_metadatas += _metadatas"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "3f5d775e",
"execution_count": 4,
"id": "aac1738c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Markdown[9'"
"langchain.schema.Document"
]
},
"execution_count": 7,
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_splits[0]"
"type(md_header_splits[0])"
]
},
{
"cell_type": "markdown",
"id": "9bd8977a",
"metadata": {},
"source": [
"Within each markdown group we can then apply any text splitter we want. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "33ab0d5c",
"id": "480e0e3a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Header 1': 'Intro', 'Header 2': 'History'}"
"[Document(page_content='Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),\n",
" Document(page_content='Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),\n",
" Document(page_content='As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \\nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \\n#### Standardization', metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}),\n",
" Document(page_content='#### Standardization \\nFrom 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.', metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}),\n",
" Document(page_content='Implementations of Markdown are available for over a dozen programming languages.', metadata={'Header 1': 'Intro', 'Header 2': 'Implementations'})]"
]
},
"execution_count": 8,
@ -170,7 +142,26 @@
}
],
"source": [
"all_metadatas[0]"
"markdown_document = \"# Intro \\n\\n ## History \\n\\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \\n\\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \\n\\n ## Rise and divergence \\n\\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \\n\\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \\n\\n #### Standardization \\n\\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \\n\\n ## Implementations \\n\\n Implementations of Markdown are available for over a dozen programming languages.\"\n",
"Text splitting for vector storage often uses sentences or other delimiters [to keep related text together](https://www.pinecone.io/learn/chunking-strategies/). \n",
"\n",
"But many documents (such as Markdown) have structure (headers) that can be explicitly used in splitting. \n",
"But many documents (such as `Markdown` files) have structure (headers) that can be explicitly used in splitting. \n",
"\n",
"We added a new text splitter for Markdown files that lets a user split based specified headers. \n",
"The `MarkdownHeaderTextSplitter` lets a user split `Markdown` files files based on specified headers. \n",
"\n",
"This results in chunks that retain the header(s) that it came from (e.g., Introduction) in the chunk metadata.\n",
"This results in chunks that retain the header(s) that it came from in the metadata.\n",
"\n",
"This works nicely w/ `SelfQueryRetriever`.\n",
"\n",
@ -30,19 +30,10 @@
},
{
"cell_type": "code",
"execution_count": 1,
"id": "cda52c2c",
"execution_count": null,
"id": "2e587f65",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/31treehaus/miniconda3/envs/langchain-new/lib/python3.9/site-packages/deeplake/util/check_latest_version.py:32: UserWarning: A newer version of deeplake (3.6.4) is available. It's recommended that you update to the latest version using `pip install -U deeplake`.\n",
"{'content': 'We previously introduced [auto-evaluator](https://blog.langchain.dev/auto-evaluator-opportunities/), an open-source tool for grading LLM question-answer chains. Here, we extend auto-evaluator with a [lightweight Streamlit app](https://github.com/langchain-ai/auto-evaluator/tree/main/streamlit) that can connect to any existing Pinecone index. We add the ability to test metadata filtering using `SelfQueryRetriever` as well as some other approaches that we’ve found to be useful, as discussed below. \\n[ret_trim.mov](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/ret_trim.mov)',\n",
" 'metadata': {'Section': 'Evaluation'}}"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"outputs": [],
"source": [
"# Let's create groups based on the section headers in our page\n",
" _metadatas = [header_group['metadata'] for _ in _splits]\n",
" all_splits += _splits\n",
" all_metadatas += _metadatas"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "7424f78b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'In these cases, semantic search will look for the concept `episode 53` in the chunks, but instead we simply want to filter the chunks for `episode 53` and then perform semantic search to extract those that best summarize the episode. Metadata filtering does this, so long as we 1) we have a metadata filter for episode number and 2) we can extract the value from the query (e.g., `54` or `252`) that we want to extract. The LangChain `SelfQueryRetriever` does the latter (see'"