"Many documents contain a mixture of content types, including text and tables. \n",
"\n",
"Semi-structured data can be challenging for conventional RAG for at least two reasons: \n",
"\n",
"* Text splitting may break up tables, corrupting the data in retrieval\n",
"* Embedding tables may pose challenges for semantic similarity search \n",
"\n",
"This cookbook shows how to perform RAG on documents with semi-structured data: \n",
"\n",
"* We will use [Unstructured](https://unstructured.io/) to parse both text and tables from documents (PDFs).\n",
"* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store raw tables, text along with summaries for retrieval.\n",
"* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store raw tables, text along with table summaries better suited for retrieval.\n",
"* We will use [LCEL](https://python.langchain.com/docs/expression_language/) to implement the chains used.\n",
"Apply to the [`LLaMA2`](https://arxiv.org/pdf/2307.09288.pdf) paper. \n",
"\n",
"We use the Unstructured [`partition_pdf`](https://unstructured-io.github.io/unstructured/bricks/partition.html#partition-pdf), which segments a PDF document by using a layout model. \n",
"\n",
"This layout model makes it possible to extract elements, such as tables, from pdfs. \n",
"\n",
"We also can use `Unstructured` chunking, which:\n",
"\n",
"* Tries to identify document sections (e.g., Introduction, etc)\n",
"* Then, builds text blocks that maintain sections while also honoring user-defined chunk sizes"
"Use [multi-vector-retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) to produce summaries of tables and, optionally, text. \n",
"\n",
"With the summary, we will also store the raw table elements.\n",
"\n",
"The summaries are used to improve the quality of retrieval, [as explained in the multi vector retriever docs](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector).\n",
"\n",
"The raw tables are passed to the LLM, providing the full table context for the LLM to generate the answer. \n",