{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "88d7cc8c",
   "metadata": {},
   "source": [
    "# Context aware text splitting and QA / Chat\n",
    "\n",
    "Text splitting for vector storage often uses sentences or other delimiters [to keep related text together](https://www.pinecone.io/learn/chunking-strategies/).\n",
    "\n",
    "But many documents (such as `Markdown` files) have structure (headers) that can be explicitly used in splitting.\n",
    "\n",
    "The `MarkdownHeaderTextSplitter` lets a user split `Markdown` files based on specified headers.\n",
    "\n",
    "This results in chunks that retain, in their metadata, the header(s) they came from.\n",
    "\n",
    "This works nicely with `SelfQueryRetriever`.\n",
    "\n",
    "First, tell the retriever about our splits.\n",
    "\n",
    "Then, query based on the doc structure (e.g., \"summarize the doc introduction\").\n",
    "\n",
    "Only chunks from that section of the document will pass the filter and be used in chat / Q+A.\n",
    "\n",
    "Let's test this out on an [example Notion page](https://rlancemartin.notion.site/Auto-Evaluation-of-Metadata-Filtering-18502448c85240828f33716740f9574b?pvs=4)!\n",
    "\n",
    "First, I download the page to Markdown as explained [here](https://python.langchain.com/docs/ecosystem/integrations/notion)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "2e587f65",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the Notion page as a Markdown file\n",
    "from langchain.document_loaders import NotionDirectoryLoader\n",
    "\n",
    "path = \"../Notion_DB/\"\n",
    "loader = NotionDirectoryLoader(path)\n",
    "docs = loader.load()\n",
    "md_file = docs[0].page_content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "1cd3fd7e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's create groups based on the section headers in our page\n",
    "from langchain.text_splitter import MarkdownHeaderTextSplitter\n",
    "\n",
    "headers_to_split_on = [\n",
    "    (\"###\", \"Section\"),\n",
    "]\n",
    "markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
    "md_header_splits = markdown_splitter.split_text(md_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f73a609",
   "metadata": {},
   "source": [
    "Now, perform text splitting on the header-grouped documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "7fbff95f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define our text splitter\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "\n",
    "chunk_size = 500\n",
    "chunk_overlap = 0\n",
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=chunk_size, chunk_overlap=chunk_overlap\n",
    ")\n",
    "all_splits = text_splitter.split_documents(md_header_splits)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5bd72546",
   "metadata": {},
   "source": [
    "This sets us up well to perform metadata filtering based on the document structure.\n",
    "\n",
    "Let's bring this all together by building a vectorstore first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b050b4de",
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install chromadb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "01d59c39",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Build vectorstore and keep the metadata\n",
    "from langchain.embeddings import OpenAIEmbeddings\n",
    "from langchain.vectorstores import Chroma\n",
    "\n",
    "vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "310346dd",
   "metadata": {},
   "source": [
    "Let's create a `SelfQueryRetriever` that can filter based upon metadata we defined."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "7fd4d283",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create retriever\n",
    "from langchain.llms import OpenAI\n",
    "from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
    "from langchain.chains.query_constructor.base import AttributeInfo\n",
    "\n",
    "# Define our metadata\n",
    "metadata_field_info = [\n",
    "    AttributeInfo(\n",
    "        name=\"Section\",\n",
    "        description=\"Part of the document that the text comes from\",\n",
    "        type=\"string or list[string]\",\n",
    "    ),\n",
    "]\n",
    "document_content_description = \"Major sections of the document\"\n",
    "\n",
    "# Define self query retriever\n",
    "llm = OpenAI(temperature=0)\n",
    "retriever = SelfQueryRetriever.from_llm(\n",
    "    llm, vectorstore, document_content_description, metadata_field_info, verbose=True\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "218b9820",
   "metadata": {},
   "source": [
    "Now we can query *only for texts* in the `Introduction` section of the document!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "d688db6e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "query='Introduction' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Introduction') limit=None\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[Document(page_content='![Untitled](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/Untitled.png)', metadata={'Section': 'Introduction'}),\n",
       " Document(page_content='Q+A systems often use a two-step approach: retrieve relevant text chunks and then synthesize them into an answer. There many ways to approach this. For example, we recently [discussed](https://blog.langchain.dev/auto-evaluation-of-anthropic-100k-context-window/) the Retriever-Less option (at bottom in the below diagram), highlighting the Anthropic 100k context window model. Metadata filtering is an alternative approach that pre-filters chunks based on a user-defined criteria in a VectorDB using', metadata={'Section': 'Introduction'}),\n",
       " Document(page_content='metadata tags prior to semantic search.', metadata={'Section': 'Introduction'})]"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Test\n",
    "retriever.get_relevant_documents(\"Summarize the Introduction section of the document\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f35999b3",
   "metadata": {},
   "source": [
    "We can also look at other parts of the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "47929be4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "query='Testing' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Testing') limit=None\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[Document(page_content='![Untitled](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/Untitled%202.png)', metadata={'Section': 'Testing'}),\n",
       " Document(page_content='`SelfQueryRetriever` works well in [many cases](https://twitter.com/hwchase17/status/1656791488569954304/photo/1). For example, given [this test case](https://twitter.com/hwchase17/status/1656791488569954304?s=20): \\n![Untitled](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/Untitled%201.png) \\nThe query can be nicely broken up into semantic query and metadata filter: \\n```python\\nsemantic query: \"prompt injection\"', metadata={'Section': 'Testing'}),\n",
       " Document(page_content='Below, we can see detailed results from the app: \\n- Kor extraction is above to perform the transformation between query and metadata format ✅\\n- Self-querying attempts to filter using the episode ID (`252`) in the query and fails 🚫\\n- Baseline returns docs from 3 different episodes (one from `252`), confusing the answer 🚫', metadata={'Section': 'Testing'}),\n",
       " Document(page_content='will use in retrieval [here](https://github.com/langchain-ai/auto-evaluator/blob/main/streamlit/kor_retriever_lex.py).', metadata={'Section': 'Testing'})]"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "retriever.get_relevant_documents(\"Summarize the Testing section of the document\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1af7720f",
   "metadata": {},
   "source": [
    "Now, we can create chat or Q+A apps that are aware of the explicit document structure.\n",
    "\n",
    "The ability to retain document structure for metadata filtering can be helpful for complicated or longer documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "565822a1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "query='Testing' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Testing') limit=None\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'The Testing section of the document describes the evaluation of the `SelfQueryRetriever` component in comparison to a baseline model. The evaluation was performed on a test case where the query was broken down into a semantic query and a metadata filter. The results showed that the `SelfQueryRetriever` component was able to perform the transformation between query and metadata format, but failed to filter using the episode ID in the query. The baseline model returned documents from three different episodes, which confused the answer. The `SelfQueryRetriever` component was deemed to work well in many cases and will be used in retrieval.'"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.chains import RetrievalQA\n",
    "from langchain.chat_models import ChatOpenAI\n",
    "\n",
    "llm = ChatOpenAI(model_name=\"gpt-3.5-turbo\", temperature=0)\n",
    "qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever)\n",
    "qa_chain.run(\"Summarize the Testing section of the document\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  },
  "vscode": {
   "interpreter": {
    "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}