Update notebook for MD header splitter and create new cookbook (#6399)

Move MD header text splitter example to its own cookbook.
Lance Martin 11 months ago committed by GitHub
parent 22af93d851
commit ae6196507d

@ -172,315 +172,6 @@
"source": [
"all_metadatas[0]"
]
},
{
"cell_type": "markdown",
"id": "dcf70760",
"metadata": {},
"source": [
"### Use case\n",
"\n",
"Let's apply `MarkdownHeaderTextSplitter` to a Notion page [here](https://rlancemartin.notion.site/Auto-Evaluation-of-Metadata-Filtering-18502448c85240828f33716740f9574b?pvs=4) as a test.\n",
"\n",
"The page is downloaded as markdown and stored locally as shown [here](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/notion)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "73313d6c",
"metadata": {},
"outputs": [],
"source": [
"# Load Notion database as a markdown file\n",
"from langchain.document_loaders import NotionDirectoryLoader\n",
"loader = NotionDirectoryLoader(\"../Notion_DB_Metadata\")\n",
"docs = loader.load()\n",
"md_file=docs[0].page_content"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "6fa341d7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'content': 'We previously introduced [auto-evaluator](https://blog.langchain.dev/auto-evaluator-opportunities/), an open-source tool for grading LLM question-answer chains. Here, we extend auto-evaluator with a [lightweight Streamlit app](https://github.com/langchain-ai/auto-evaluator/tree/main/streamlit) that can connect to any existing Pinecone index. We add the ability to test metadata filtering using `SelfQueryRetriever` as well as some other approaches that weve found to be useful, as discussed below. \\n[ret_trim.mov](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/ret_trim.mov)',\n",
" 'metadata': {'Section': 'Evaluation'}}"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Let's create groups based on the section headers\n",
"headers_to_split_on = [\n",
" (\"###\", \"Section\"),\n",
"]\n",
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
"md_header_splits = markdown_splitter.split_text(md_file)\n",
"md_header_splits[3]"
]
},
{
"cell_type": "markdown",
"id": "42d8bb9b",
"metadata": {},
"source": [
"Now, we split the text in each group and keep the group as metadata."
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "a9831de2",
"metadata": {},
"outputs": [],
"source": [
"# Define our text splitter\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"chunk_size = 500\n",
"chunk_overlap = 50\n",
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)\n",
" \n",
"# Create splits within each header group\n",
"all_splits=[]\n",
"all_metadatas=[]\n",
"for header_group in md_header_splits:\n",
" _splits = text_splitter.split_text(header_group['content'])\n",
" _metadatas = [header_group['metadata'] for _ in _splits]\n",
" all_splits += _splits\n",
" all_metadatas += _metadatas"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "b5691ee5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'In these cases, semantic search will look for the concept `episode 53` in the chunks, but instead we simply want to filter the chunks for `episode 53` and then perform semantic search to extract those that best summarize the episode. Metadata filtering does this, so long as we 1) we have a metadata filter for episode number and 2) we can extract the value from the query (e.g., `54` or `252`) that we want to extract. The LangChain `SelfQueryRetriever` does the latter (see'"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_splits[6]"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "e1dfb405",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Section': 'Motivation'}"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_metadatas[6]"
]
},
{
"cell_type": "markdown",
"id": "79868606",
"metadata": {},
"source": [
"This sets us up well to perform metadata filtering based on the document structure.\n",
"\n",
"Let's bring this all together by building a vectorstore first."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "143d7347",
"metadata": {},
"outputs": [],
"source": [
"! pip install chromadb"
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "cbcb917a",
"metadata": {},
"outputs": [],
"source": [
"# Build vectorstore\n",
"from langchain.vectorstores import Chroma\n",
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"embeddings = OpenAIEmbeddings()\n",
"vectorstore = Chroma.from_texts(texts=all_splits, metadatas=all_metadatas, embedding=embeddings)"
]
},
{
"cell_type": "markdown",
"id": "3f6031fc",
"metadata": {},
"source": [
"Let's create a `SelfQueryRetriever` that can filter based upon metadata we defined."
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "5b1b6a75",
"metadata": {},
"outputs": [],
"source": [
"# Create retriever \n",
"from langchain.llms import OpenAI\n",
"from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
"from langchain.chains.query_constructor.base import AttributeInfo\n",
"\n",
"# Define our metadata\n",
"metadata_field_info = [\n",
" AttributeInfo(\n",
" name=\"Section\",\n",
" description=\"Headers of the markdown document that organize the ideas\",\n",
" type=\"string or list[string]\",\n",
" ),\n",
"]\n",
"document_content_description = \"Headers of the markdown document\"\n",
"\n",
"# Define self query retriever\n",
"llm = OpenAI(temperature=0)\n",
"sq_retriever = SelfQueryRetriever.from_llm(llm, vectorstore, document_content_description, metadata_field_info, verbose=True)"
]
},
{
"cell_type": "markdown",
"id": "9d0dbed8",
"metadata": {},
"source": [
"Now we can fetch chunks specifically from any section of the doc!"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "6c37fe1b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"query='Introduction' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Introduction') limit=None\n"
]
},
{
"data": {
"text/plain": [
"[Document(page_content='![Untitled](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/Untitled.png)', metadata={'Section': 'Introduction'}),\n",
" Document(page_content='Q+A systems often use a two-step approach: retrieve relevant text chunks and then synthesize them into an answer. There many ways to approach this. For example, we recently [discussed](https://blog.langchain.dev/auto-evaluation-of-anthropic-100k-context-window/) the Retriever-Less option (at bottom in the below diagram), highlighting the Anthropic 100k context window model. Metadata filtering is an alternative approach that pre-filters chunks based on a user-defined criteria in a VectorDB using', metadata={'Section': 'Introduction'}),\n",
" Document(page_content='on a user-defined criteria in a VectorDB using metadata tags prior to semantic search.', metadata={'Section': 'Introduction'})]"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Test\n",
"question=\"Summarize the Introduction section of the document\"\n",
"sq_retriever.get_relevant_documents(question)"
]
},
{
"cell_type": "markdown",
"id": "bb0efebd",
"metadata": {},
"source": [
"Now, we can create chat or Q+A apps that are aware of the explicit document structure. \n",
"\n",
"Of course, semantic search without specific metadata filtering would probably work reasonably well for this simple document.\n",
"\n",
"But, the ability to retain document structure for metadata filtering can be helpful for more complicated or longer documents."
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "3b40e24e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"query='Introduction' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Introduction') limit=None\n"
]
},
{
"data": {
"text/plain": [
"'The document discusses different approaches to retrieve relevant text chunks and synthesize them into an answer in Q+A systems. One of the approaches is metadata filtering, which pre-filters chunks based on user-defined criteria in a VectorDB using metadata tags prior to semantic search. The Retriever-Less option, which uses the Anthropic 100k context window model, is also mentioned as an alternative approach.'"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.chains import RetrievalQA\n",
"from langchain.chat_models import ChatOpenAI\n",
"llm = ChatOpenAI(model_name=\"gpt-3.5-turbo\", temperature=0)\n",
"qa_chain = RetrievalQA.from_chain_type(llm,retriever=sq_retriever)\n",
"qa_chain.run(question)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "dfeeb327",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"query='Testing' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Testing') limit=None\n"
]
},
{
"data": {
"text/plain": [
"'The Testing section of the document describes how the performance of the SelfQueryRetriever was evaluated using various test cases. The tests were designed to evaluate the ability of the SelfQueryRetriever to correctly infer metadata filters from the query using metadata_field_info. The results of the tests showed that the SelfQueryRetriever performed well in some cases, but failed in others. The document also provides a link to the code for the auto-evaluator and instructions on how to use it. Additionally, the document mentions the use of the Kor library for structured data extraction to explicitly specify transformations that the auto-evaluator can use.'"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"question=\"Summarize the Testing section of the document\"\n",
"qa_chain.run(question)"
]
}
],
"metadata": {

@ -0,0 +1,366 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "88d7cc8c",
"metadata": {},
"source": [
"# Context aware text splitting and QA / Chat\n",
"\n",
"Text splitting for vector storage often uses sentences or other delimiters [to keep related text together](https://www.pinecone.io/learn/chunking-strategies/). \n",
"\n",
"But many documents (such as Markdown) have structure (headers) that can be explicitly used in splitting. \n",
"\n",
"We added a new text splitter for Markdown files that lets a user split based on specified headers. \n",
"\n",
"This results in chunks that retain the header(s) they came from (e.g., Introduction) in the chunk metadata.\n",
"\n",
"This works nicely w/ `SelfQueryRetriever`.\n",
"\n",
"First, tell the retriever about our splits.\n",
"\n",
"Then, query based on the doc structure (e.g., \"summarize the doc introduction\"). \n",
"\n",
"Chunks only from that section of the Document will be filtered and used in chat / Q+A.\n",
"\n",
"Let's test this out on an [example Notion page](https://rlancemartin.notion.site/Auto-Evaluation-of-Metadata-Filtering-18502448c85240828f33716740f9574b?pvs=4)!\n",
"\n",
"First, I download the page to Markdown as explained [here](https://python.langchain.com/docs/ecosystem/integrations/notion)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "cda52c2c",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/31treehaus/miniconda3/envs/langchain-new/lib/python3.9/site-packages/deeplake/util/check_latest_version.py:32: UserWarning: A newer version of deeplake (3.6.4) is available. It's recommended that you update to the latest version using `pip install -U deeplake`.\n",
" warnings.warn(\n"
]
}
],
"source": [
"# Load Notion page as a markdown file\n",
"from langchain.document_loaders import NotionDirectoryLoader\n",
"path='.../Notion_Folder_With_Markdown_File'\n",
"loader = NotionDirectoryLoader(path)\n",
"docs = loader.load()\n",
"md_file=docs[0].page_content"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "730b84f2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'content': 'We previously introduced [auto-evaluator](https://blog.langchain.dev/auto-evaluator-opportunities/), an open-source tool for grading LLM question-answer chains. Here, we extend auto-evaluator with a [lightweight Streamlit app](https://github.com/langchain-ai/auto-evaluator/tree/main/streamlit) that can connect to any existing Pinecone index. We add the ability to test metadata filtering using `SelfQueryRetriever` as well as some other approaches that weve found to be useful, as discussed below. \\n[ret_trim.mov](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/ret_trim.mov)',\n",
" 'metadata': {'Section': 'Evaluation'}}"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Let's create groups based on the section headers in our page\n",
"from langchain.text_splitter import MarkdownHeaderTextSplitter\n",
"headers_to_split_on = [\n",
" (\"###\", \"Section\"),\n",
"]\n",
"markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
"md_header_splits = markdown_splitter.split_text(md_file)\n",
"md_header_splits[3]"
]
},
{
"cell_type": "markdown",
"id": "4f73a609",
"metadata": {},
"source": [
"Now, we split the text in each header group and keep the group as metadata."
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "7fbff95f",
"metadata": {},
"outputs": [],
"source": [
"# Define our text splitter\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"chunk_size = 500\n",
"chunk_overlap = 0\n",
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)\n",
" \n",
"# Create splits within each header group and combine them\n",
"all_splits=[]\n",
"all_metadatas=[]\n",
"for header_group in md_header_splits:\n",
" _splits = text_splitter.split_text(header_group['content'])\n",
" _metadatas = [header_group['metadata'] for _ in _splits]\n",
" all_splits += _splits\n",
" all_metadatas += _metadatas"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "7424f78b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'In these cases, semantic search will look for the concept `episode 53` in the chunks, but instead we simply want to filter the chunks for `episode 53` and then perform semantic search to extract those that best summarize the episode. Metadata filtering does this, so long as we 1) we have a metadata filter for episode number and 2) we can extract the value from the query (e.g., `54` or `252`) that we want to extract. The LangChain `SelfQueryRetriever` does the latter (see'"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_splits[6]"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "08f5db3a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Section': 'Motivation'}"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_metadatas[6]"
]
},
{
"cell_type": "markdown",
"id": "5bd72546",
"metadata": {},
"source": [
"This sets us up well to perform metadata filtering based on the document structure.\n",
"\n",
"Let's bring this all together by building a vectorstore first."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b050b4de",
"metadata": {},
"outputs": [],
"source": [
"! pip install chromadb "
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "01d59c39",
"metadata": {},
"outputs": [],
"source": [
"# Build vectorstore and keep the metadata\n",
"from langchain.vectorstores import Chroma\n",
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"vectorstore = Chroma.from_texts(texts=all_splits, metadatas=all_metadatas, embedding=OpenAIEmbeddings())"
]
},
{
"cell_type": "markdown",
"id": "310346dd",
"metadata": {},
"source": [
"Let's create a `SelfQueryRetriever` that can filter based upon metadata we defined."
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "7fd4d283",
"metadata": {},
"outputs": [],
"source": [
"# Create retriever \n",
"from langchain.llms import OpenAI\n",
"from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
"from langchain.chains.query_constructor.base import AttributeInfo\n",
"\n",
"# Define our metadata\n",
"metadata_field_info = [\n",
" AttributeInfo(\n",
" name=\"Section\",\n",
" description=\"Part of the document that the text comes from\",\n",
" type=\"string or list[string]\",\n",
" ),\n",
"]\n",
"document_content_description = \"Major sections of the document\"\n",
"\n",
"# Define self query retriever\n",
"llm = OpenAI(temperature=0)\n",
"retriever = SelfQueryRetriever.from_llm(llm, vectorstore, document_content_description, metadata_field_info, verbose=True)"
]
},
{
"cell_type": "markdown",
"id": "218b9820",
"metadata": {},
"source": [
"We can see that we can query *only for texts* in the `Introduction` of the document!"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "f8064987",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"query='Introduction' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Introduction') limit=None\n"
]
},
{
"data": {
"text/plain": [
"[Document(page_content='![Untitled](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/Untitled.png)', metadata={'Section': 'Introduction'}),\n",
" Document(page_content='Q+A systems often use a two-step approach: retrieve relevant text chunks and then synthesize them into an answer. There many ways to approach this. For example, we recently [discussed](https://blog.langchain.dev/auto-evaluation-of-anthropic-100k-context-window/) the Retriever-Less option (at bottom in the below diagram), highlighting the Anthropic 100k context window model. Metadata filtering is an alternative approach that pre-filters chunks based on a user-defined criteria in a VectorDB using', metadata={'Section': 'Introduction'}),\n",
" Document(page_content='metadata tags prior to semantic search.', metadata={'Section': 'Introduction'})]"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Test\n",
"retriever.get_relevant_documents(\"Summarize the Introduction section of the document\")"
]
},
{
"cell_type": "markdown",
"id": "f35999b3",
"metadata": {},
"source": [
"We can also look at other parts of the document."
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "47929be4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"query='Testing' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Testing') limit=None\n"
]
},
{
"data": {
"text/plain": [
"[Document(page_content='![Untitled](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/Untitled%202.png)', metadata={'Section': 'Testing'}),\n",
" Document(page_content='`SelfQueryRetriever` works well in [many cases](https://twitter.com/hwchase17/status/1656791488569954304/photo/1). For example, given [this test case](https://twitter.com/hwchase17/status/1656791488569954304?s=20): \\n![Untitled](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/Untitled%201.png) \\nThe query can be nicely broken up into semantic query and metadata filter: \\n```python\\nsemantic query: \"prompt injection\"', metadata={'Section': 'Testing'}),\n",
" Document(page_content='Below, we can see detailed results from the app: \\n- Kor extraction is above to perform the transformation between query and metadata format ✅\\n- Self-querying attempts to filter using the episode ID (`252`) in the query and fails 🚫\\n- Baseline returns docs from 3 different episodes (one from `252`), confusing the answer 🚫', metadata={'Section': 'Testing'}),\n",
" Document(page_content='will use in retrieval [here](https://github.com/langchain-ai/auto-evaluator/blob/main/streamlit/kor_retriever_lex.py).', metadata={'Section': 'Testing'})]"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever.get_relevant_documents(\"Summarize the Testing section of the document\")"
]
},
{
"cell_type": "markdown",
"id": "1af7720f",
"metadata": {},
"source": [
"Now, we can create chat or Q+A apps that are aware of the explicit document structure. \n",
"\n",
"The ability to retain document structure for metadata filtering can be helpful for complicated or longer documents."
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "565822a1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"query='Testing' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Testing') limit=None\n"
]
},
{
"data": {
"text/plain": [
"'The Testing section of the document describes the evaluation of the `SelfQueryRetriever` component in comparison to a baseline model. The evaluation was performed on a test case where the query was broken down into a semantic query and a metadata filter. The results showed that the `SelfQueryRetriever` component was able to perform the transformation between query and metadata format, but failed to filter using the episode ID in the query. The baseline model returned documents from three different episodes, which confused the answer. The `SelfQueryRetriever` component was deemed to work well in many cases and will be used in retrieval.'"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.chains import RetrievalQA\n",
"from langchain.chat_models import ChatOpenAI\n",
"llm = ChatOpenAI(model_name=\"gpt-3.5-turbo\", temperature=0)\n",
"qa_chain = RetrievalQA.from_chain_type(llm,retriever=retriever)\n",
"qa_chain.run(\"Summarize the Testing section of the document\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
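The header-based grouping that `MarkdownHeaderTextSplitter` performs in the notebook can be sketched in plain Python. This is a simplified, dependency-free stand-in, not the LangChain implementation; the function name `split_on_headers` and the toy document are illustrative assumptions. It produces the same `{'content': ..., 'metadata': {'Section': ...}}` shape the notebook's cells show.

```python
# Simplified stand-in for header-based Markdown splitting: group the lines
# under each "### " header and record that header in the group's metadata.
def split_on_headers(md_text, header_prefix="### ", key="Section"):
    groups, current, lines = [], None, []
    for line in md_text.splitlines():
        if line.startswith(header_prefix):
            if lines:  # close out the previous header group
                groups.append({"content": "\n".join(lines).strip(),
                               "metadata": {key: current} if current else {}})
            current, lines = line[len(header_prefix):].strip(), []
        else:
            lines.append(line)
    if lines:  # flush the final group
        groups.append({"content": "\n".join(lines).strip(),
                       "metadata": {key: current} if current else {}})
    return groups

doc = "### Introduction\nSome intro text.\n### Testing\nSome test text."
splits = split_on_headers(doc)
# splits[0] -> {'content': 'Some intro text.', 'metadata': {'Section': 'Introduction'}}
```

Each group can then be chunked further (as the notebook does with `RecursiveCharacterTextSplitter`) while every chunk inherits its group's `Section` metadata.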

@ -87,3 +87,4 @@ Additional related resources include:
For examples of this done in an end-to-end manner, please see the following resources:
- [Semantic search over a group chat with Sources Notebook](./semantic-search-over-chat.html): A notebook that semantically searches over a group chat conversation.
- [Document context aware text splitting and QA](./document-context-aware-QA.html): A notebook that shows context aware splitting on markdown files and SelfQueryRetriever for QA using the resulting metadata.
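The "filter on metadata first, then search" pattern that `SelfQueryRetriever` automates can be illustrated with a small sketch. Here the section filter is supplied explicitly rather than inferred from the query by an LLM, and relevance is a toy word-overlap score; the names `filtered_search`, `chunks`, and `metas` are illustrative assumptions, not LangChain APIs.

```python
# Sketch of metadata filtering before search: keep only chunks whose
# Section matches, then rank the survivors by naive word overlap.
def filtered_search(chunks, metadatas, query, section, k=2):
    candidates = [(c, m) for c, m in zip(chunks, metadatas)
                  if m.get("Section") == section]
    q_words = set(query.lower().split())
    scored = sorted(candidates,
                    key=lambda cm: len(q_words & set(cm[0].lower().split())),
                    reverse=True)
    return [c for c, _ in scored[:k]]

chunks = ["Q+A systems retrieve relevant chunks.", "Tests showed mixed results."]
metas = [{"Section": "Introduction"}, {"Section": "Testing"}]
hits = filtered_search(chunks, metas, "retrieve chunks", "Introduction")
```

In the notebook, the filter (`Section == 'Introduction'`) is inferred from the natural-language question by `SelfQueryRetriever`, and ranking is done by embedding similarity in the vectorstore rather than word overlap.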
