Flesh out semi-structured cookbook (#11904)

Lance Martin committed 12 months ago (via GitHub)
parent e8c1850369
commit eca8a5e5b8

@@ -14,12 +14,19 @@
"\n",
"Many documents contain a mixture of content types, including text and tables. \n",
"\n",
"Semi-structured data can be challenging for conventional RAG for at least two reasons: \n",
"\n",
"* Text splitting may break up tables, corrupting the data in retrieval\n",
"* Embedding tables may pose challenges for semantic similarity search \n",
"\n",
"This cookbook shows how to perform RAG on documents with semi-structured data: \n",
"\n",
"* We will use [Unstructured](https://unstructured.io/) to parse both text and tables from documents (PDFs).\n",
"* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store raw tables, text along with summaries for retrieval.\n",
"* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store raw tables, text along with table summaries better suited for retrieval.\n",
"* We will use [LCEL](https://python.langchain.com/docs/expression_language/) to implement the chains used.\n",
"\n",
"The overall flow is here:\n",
"\n",
"![MVR.png](attachment:7b5c5a30-393c-4b27-8fa1-688306ef2aef.png)\n",
"\n",
"## Packages"
@@ -32,7 +39,29 @@
"metadata": {},
"outputs": [],
"source": [
"! pip install langchain unstructured[all-docs] pydantic lxml"
"! pip install langchain unstructured[all-docs] pydantic lxml langchainhub"
]
},
{
"cell_type": "markdown",
"id": "44349a83-e1dc-4eed-ba75-587f309d8c88",
"metadata": {},
"source": [
"The PDF partitioning used by Unstructured will use: \n",
"\n",
"* `tesseract` for Optical Character Recognition (OCR)\n",
"* `poppler` for PDF rendering and processing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f7880871-4949-4ea2-aed8-540a09188a41",
"metadata": {},
"outputs": [],
"source": [
"! brew install tesseract \n",
"! brew install poppler"
]
},
{
@@ -44,8 +73,16 @@
"\n",
"### Partition PDF tables and text\n",
"\n",
"* `LLaMA2` Paper: https://arxiv.org/pdf/2307.09288.pdf\n",
"* Use `Unstructured` to partition elements"
"Apply to the [`LLaMA2`](https://arxiv.org/pdf/2307.09288.pdf) paper. \n",
"\n",
"We use the Unstructured [`partition_pdf`](https://unstructured-io.github.io/unstructured/bricks/partition.html#partition-pdf), which segments a PDF document by using a layout model. \n",
"\n",
"This layout model makes it possible to extract elements, such as tables, from pdfs. \n",
"\n",
"We also can use `Unstructured` chunking, which:\n",
"\n",
"* Tries to identify document sections (e.g., Introduction, etc)\n",
"* Then, builds text blocks that maintain sections while also honoring user-defined chunk sizes"
]
},
{
@@ -72,7 +109,7 @@
"\n",
"# Get elements\n",
"raw_pdf_elements = partition_pdf(filename=path+\"LLaMA2.pdf\",\n",
" # Using pdf format to find embedded image blocks\n",
" # Unstructured first finds embedded image blocks\n",
" extract_images_in_pdf=False,\n",
" # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles\n",
" # Titles are any sub-section of the document \n",
@@ -82,13 +119,22 @@
" # Chunking params to aggregate text blocks\n",
" # Attempt to create a new chunk 3800 chars\n",
" # Attempt to keep chunks > 2000 chars \n",
" # Hard max on chunks\n",
" max_characters=4000, \n",
" new_after_n_chars=3800, \n",
" combine_text_under_n_chars=2000,\n",
" image_output_dir_path=path)"
]
},
{
"cell_type": "markdown",
"id": "b09cd727-aeab-49af-8a51-0dc377321e7c",
"metadata": {},
"source": [
"We can examine the elements extracted by `partition_pdf`.\n",
"\n",
"`CompositeElement` are aggregated chunks."
]
},
{
"cell_type": "code",
"execution_count": 13,
@@ -168,7 +214,13 @@
"source": [
"## Multi-vector retriever\n",
"\n",
"Use [multi-vector-retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary).\n",
"Use [multi-vector-retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) to produce summaries of tables and, optionally, text. \n",
"\n",
"With the summary, we will also store the raw table elements.\n",
"\n",
"The summaries are used to improve the quality of retrieval, [as explained in the multi vector retriever docs](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector).\n",
"\n",
"The raw tables are passed to the LLM, providing the full table context for the LLM to generate the answer. \n",
"\n",
"### Summaries"
]
@@ -185,6 +237,21 @@
"from langchain.schema.output_parser import StrOutputParser"
]
},
{
"cell_type": "markdown",
"id": "37b65677-aeb4-44fd-b06d-4539341ede97",
"metadata": {},
"source": [
"We create a simple summarize chain for each element.\n",
"\n",
"You can also see, re-use, or modify the prompt in the Hub [here](https://smith.langchain.com/hub/rlm/multi-vector-retriever-summarization).\n",
"\n",
"```\n",
"from langchain import hub\n",
"obj = hub.pull(\"rlm/multi-vector-retriever-summarization\")\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 17,
@@ -233,7 +300,10 @@
"source": [
"### Add to vectorstore\n",
"\n",
"Use [Multi Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) with summaries."
"Use [Multi Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) with summaries: \n",
"\n",
"* `InMemoryStore` stores the raw text, tables\n",
"* `vectorstore` stores the embedded summaries"
]
},
{
