docs[minor]: Add "Build a PDF ingestion and Question/Answering system" tutorial (#22570)

More direct entrypoint for a common use-case. Meant to give people a more hands-on intro to document loaders/loading data from different data sources as well. Some duplicate content for RAG and extraction (to show what you can do with the loaded documents), but defers to the appropriate sections rather than going too in-depth. @baskaryan @hwchase17
4 months ago · 234394f631
parent 5fc5ed463c
commit 234394f631
3 changed files with 340 additions and 0 deletions
--- a/docs/docs/example_data/nke-10k-2023.pdf
+++ b/docs/docs/example_data/nke-10k-2023.pdf
--- a/docs/docs/tutorials/index.mdx
+++ b/docs/docs/tutorials/index.mdx
@ -19,6 +19,7 @@ New to LangChain or to LLM app development in general? Read this material to qui
 - [Build a Query Analysis System](/docs/tutorials/query_analysis)
 - [Build a local RAG application](/docs/tutorials/local_rag)
 - [Build a Question Answering application over a Graph Database](/docs/tutorials/graph)
+- [Build a PDF ingestion and Question/Answering system](/docs/tutorials/pdf_qa/)

 ### Specialized tasks
 - [Build an Extraction Chain](/docs/tutorials/extraction)
--- a/docs/docs/tutorials/pdf_qa.ipynb
+++ b/docs/docs/tutorials/pdf_qa.ipynb
@ -0,0 +1,339 @@
+{
+ "cells": [
+  {
+   "cell_type": "raw",
+   "metadata": {
+    "vscode": {
+     "languageId": "raw"
+    }
+   },
+   "source": [
+    "---\n",
+    "keywords: [pdf, document loader]\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Build a PDF ingestion and Question/Answering system\n",
+    "\n",
+    "PDF files often hold crucial unstructured data unavailable from other sources. They can be quite lengthy, and unlike plain text files, cannot generally be fed directly into the prompt of a language model.\n",
+    "\n",
+    "In this tutorial, you'll create a system that can answer questions about PDF files. More specifically, you'll use a [Document Loader](/docs/concepts/#document-loaders) to load text in a format usable by an LLM, then build a retrieval-augmented generation (RAG) pipeline to answer questions, including citations from the source material.\n",
+    "\n",
+    "This tutorial will gloss over some concepts more deeply covered in our [RAG](/docs/tutorials/rag/) tutorial, so you may want to go through those first if you haven't already.\n",
+    "\n",
+    "Let's dive in!\n",
+    "\n",
+    "## Loading documents\n",
+    "\n",
+    "First, you'll need to choose a PDF to load. We'll use a document from [Nike's annual public SEC report](https://s1.q4cdn.com/806093406/files/doc_downloads/2023/414759-1-_5_Nike-NPS-Combo_Form-10-K_WR.pdf). It's over 100 pages long, and contains some crucial data mixed with longer explanatory text. However, you can feel free to use a PDF of your choosing.\n",
+    "\n",
+    "Once you've chosen your PDF, the next step is to load it into a format that an LLM can more easily handle, since LLMs generally require text inputs. LangChain has a few different [built-in document loaders](/docs/how_to/document_loader_pdf/) for this purpose which you can experiment with. Below, we'll use one powered by the [`pypdf`](https://pypi.org/project/pypdf/) package that reads from a filepath:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install -qU pypdf langchain_community"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "107\n"
+     ]
+    }
+   ],
+   "source": [
+    "from langchain_community.document_loaders import PyPDFLoader\n",
+    "\n",
+    "file_path = \"../example_data/nke-10k-2023.pdf\"\n",
+    "loader = PyPDFLoader(file_path)\n",
+    "\n",
+    "docs = loader.load()\n",
+    "\n",
+    "print(len(docs))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Table of Contents\n",
+      "UNITED STATES\n",
+      "SECURITIES AND EXCHANGE COMMISSION\n",
+      "Washington, D.C. 20549\n",
+      "FORM 10-K\n",
+      "\n",
+      "{'source': '../example_data/nke-10k-2023.pdf', 'page': 0}\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(docs[0].page_content[0:100])\n",
+    "print(docs[0].metadata)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "So what just happened?\n",
+    "\n",
+    "- The loader reads the PDF at the specified path into memory.\n",
+    "- It then extracts text data using the `pypdf` package.\n",
+    "- Finally, it creates a LangChain [Document](/docs/concepts/#documents) for each page of the PDF with the page's content and some metadata about where in the document the text came from.\n",
+    "\n",
+    "LangChain has [many other document loaders](/docs/integrations/document_loaders/) for other data sources, or you can create a [custom document loader](/docs/how_to/document_loader_custom/).\n",
+    "\n",
+    "## Question answering with RAG\n",
+    "\n",
+    "Next, you'll prepare the loaded documents for later retrieval. Using a [text splitter](/docs/concepts/#text-splitters), you'll split your loaded documents into smaller documents that can more easily fit into an LLM's context window, then load them into a [vector store](/docs/concepts/#vector-stores). You can then create a [retriever](/docs/concepts/#retrievers) from the vector store for use in our RAG chain:\n",
+    "\n",
+    "```{=mdx}\n",
+    "import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
+    "\n",
+    "<ChatModelTabs openaiParams={`model=\"gpt-4o\"`} />\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# | output: false\n",
+    "# | echo: false\n",
+    "\n",
+    "%pip install langchain_anthropic\n",
+    "\n",
+    "import getpass\n",
+    "import os\n",
+    "\n",
+    "from langchain_anthropic import ChatAnthropic\n",
+    "\n",
+    "os.environ[\"ANTHROPIC_API_KEY\"] = getpass.getpass(\"Anthropic API Key:\")\n",
+    "\n",
+    "llm = ChatAnthropic(model=\"claude-3-sonnet-20240229\", temperature=0)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install langchain_chroma langchain_openai"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# | output: false\n",
+    "# | echo: false\n",
+    "\n",
+    "import getpass\n",
+    "import os\n",
+    "\n",
+    "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_chroma import Chroma\n",
+    "from langchain_openai import OpenAIEmbeddings\n",
+    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
+    "\n",
+    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)\n",
+    "splits = text_splitter.split_documents(docs)\n",
+    "vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())\n",
+    "\n",
+    "retriever = vectorstore.as_retriever()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Finally, you'll use some built-in helpers to construct the final `rag_chain`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'input': \"What was Nike's revenue in 2023?\",\n",
+       " 'context': [Document(page_content='Table of Contents\\nFISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS\\nThe following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:\\nFISCAL 2023 COMPARED TO FISCAL 2022\\n•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.\\nThe increase was due to higher revenues in North America, Europe, Middle East & Africa (\"EMEA\"), APLA and Greater China, which contributed approximately 7, 6,\\n2 and 1 percentage points to NIKE, Inc. Revenues, respectively.\\n•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This\\nincrease was primarily due to higher revenues in Men\\'s, the Jordan Brand, Women\\'s and Kids\\' which grew 17%, 35%,11% and 10%, respectively, on a wholesale\\nequivalent basis.', metadata={'page': 35, 'source': '../example_data/nke-10k-2023.pdf'}),\n",
+       "  Document(page_content='Enterprise Resource Planning Platform, data and analytics, demand sensing, insight gathering, and other areas to create an end-to-end technology foundation, which we\\nbelieve will further accelerate our digital transformation. We believe this unified approach will accelerate growth and unlock more efficiency for our business, while driving\\nspeed and responsiveness as we serve consumers globally.\\nFINANCIAL HIGHLIGHTS\\n•In fiscal 2023, NIKE, Inc. achieved record Revenues of $51.2 billion, which increased 10% and 16% on a reported and currency-neutral basis, respectively\\n•NIKE Direct revenues grew 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023, and represented approximately 44% of total NIKE Brand revenues for\\nfiscal 2023\\n•Gross margin for the fiscal year decreased 250 basis points to 43.5% primarily driven by higher product costs, higher markdowns and unfavorable changes in foreign\\ncurrency exchange rates, partially offset by strategic pricing actions', metadata={'page': 30, 'source': '../example_data/nke-10k-2023.pdf'}),\n",
+       "  Document(page_content=\"Table of Contents\\nNORTH AMERICA\\n(Dollars in millions) FISCAL 2023FISCAL 2022 % CHANGE% CHANGE\\nEXCLUDING\\nCURRENCY\\nCHANGESFISCAL 2021 % CHANGE% CHANGE\\nEXCLUDING\\nCURRENCY\\nCHANGES\\nRevenues by:\\nFootwear $ 14,897 $ 12,228 22 % 22 %$ 11,644 5 % 5 %\\nApparel 5,947 5,492 8 % 9 % 5,028 9 % 9 %\\nEquipment 764 633 21 % 21 % 507 25 % 25 %\\nTOTAL REVENUES $ 21,608 $ 18,353 18 % 18 %$ 17,179 7 % 7 %\\nRevenues by:    \\nSales to Wholesale Customers $ 11,273 $ 9,621 17 % 18 %$ 10,186 -6 % -6 %\\nSales through NIKE Direct 10,335 8,732 18 % 18 % 6,993 25 % 25 %\\nTOTAL REVENUES $ 21,608 $ 18,353 18 % 18 %$ 17,179 7 % 7 %\\nEARNINGS BEFORE INTEREST AND TAXES $ 5,454 $ 5,114 7 % $ 5,089 0 %\\nFISCAL 2023 COMPARED TO FISCAL 2022\\n•North America revenues increased 18% on a currency-neutral basis, primarily due to higher revenues in Men's and the Jordan Brand. NIKE Direct revenues\\nincreased 18%, driven by strong digital sales growth of 23%, comparable store sales growth of 9% and the addition of new stores.\", metadata={'page': 39, 'source': '../example_data/nke-10k-2023.pdf'}),\n",
+       "  Document(page_content=\"Table of Contents\\nEUROPE, MIDDLE EAST & AFRICA\\n(Dollars in millions) FISCAL 2023FISCAL 2022 % CHANGE% CHANGE\\nEXCLUDING\\nCURRENCY\\nCHANGESFISCAL 2021 % CHANGE% CHANGE\\nEXCLUDING\\nCURRENCY\\nCHANGES\\nRevenues by:\\nFootwear $ 8,260 $ 7,388 12 % 25 %$ 6,970 6 % 9 %\\nApparel 4,566 4,527 1 % 14 % 3,996 13 % 16 %\\nEquipment 592 564 5 % 18 % 490 15 % 17 %\\nTOTAL REVENUES $ 13,418 $ 12,479 8 % 21 %$ 11,456 9 % 12 %\\nRevenues by:    \\nSales to Wholesale Customers $ 8,522 $ 8,377 2 % 15 %$ 7,812 7 % 10 %\\nSales through NIKE Direct 4,896 4,102 19 % 33 % 3,644 13 % 15 %\\nTOTAL REVENUES $ 13,418 $ 12,479 8 % 21 %$ 11,456 9 % 12 %\\nEARNINGS BEFORE INTEREST AND TAXES $ 3,531 $ 3,293 7 % $ 2,435 35 % \\nFISCAL 2023 COMPARED TO FISCAL 2022\\n•EMEA revenues increased 21% on a currency-neutral basis, due to higher revenues in Men's, the Jordan Brand, Women's and Kids'. NIKE Direct revenues\\nincreased 33%, driven primarily by strong digital sales growth of 43% and comparable store sales growth of 22%.\", metadata={'page': 40, 'source': '../example_data/nke-10k-2023.pdf'})],\n",
+       " 'answer': 'According to the financial highlights, Nike, Inc. achieved record revenues of $51.2 billion in fiscal 2023, which increased 10% on a reported basis and 16% on a currency-neutral basis compared to fiscal 2022.'}"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from langchain.chains import create_retrieval_chain\n",
+    "from langchain.chains.combine_documents import create_stuff_documents_chain\n",
+    "from langchain_core.prompts import ChatPromptTemplate\n",
+    "\n",
+    "system_prompt = (\n",
+    "    \"You are an assistant for question-answering tasks. \"\n",
+    "    \"Use the following pieces of retrieved context to answer \"\n",
+    "    \"the question. If you don't know the answer, say that you \"\n",
+    "    \"don't know. Use three sentences maximum and keep the \"\n",
+    "    \"answer concise.\"\n",
+    "    \"\\n\\n\"\n",
+    "    \"{context}\"\n",
+    ")\n",
+    "\n",
+    "prompt = ChatPromptTemplate.from_messages(\n",
+    "    [\n",
+    "        (\"system\", system_prompt),\n",
+    "        (\"human\", \"{input}\"),\n",
+    "    ]\n",
+    ")\n",
+    "\n",
+    "\n",
+    "question_answer_chain = create_stuff_documents_chain(llm, prompt)\n",
+    "rag_chain = create_retrieval_chain(retriever, question_answer_chain)\n",
+    "\n",
+    "results = rag_chain.invoke({\"input\": \"What was Nike's revenue in 2023?\"})\n",
+    "\n",
+    "results"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can see that you get both a final answer in the `answer` key of the results dict, and the `context` the LLM used to generate an answer.\n",
+    "\n",
+    "Examining the values under the `context` further, you can see that they are documents that each contain a chunk of the ingested page content. Usefully, these documents also preserve the original metadata from way back when you first loaded them:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Table of Contents\n",
+      "FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS\n",
+      "The following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:\n",
+      "FISCAL 2023 COMPARED TO FISCAL 2022\n",
+      "•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.\n",
+      "The increase was due to higher revenues in North America, Europe, Middle East & Africa (\"EMEA\"), APLA and Greater China, which contributed approximately 7, 6,\n",
+      "2 and 1 percentage points to NIKE, Inc. Revenues, respectively.\n",
+      "•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This\n",
+      "increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale\n",
+      "equivalent basis.\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(results[\"context\"][0].page_content)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'page': 35, 'source': '../example_data/nke-10k-2023.pdf'}\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(results[\"context\"][0].metadata)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This particular chunk came from page 35 in the original PDF. You can use this data to show which page in the PDF the answer came from, allowing users to quickly verify that answers are based on the source material.\n",
+    "\n",
+    ":::info\n",
+    "For a deeper dive into RAG, see [this more focused tutorial](/docs/tutorials/rag/) or [our how-to guides](/docs/how_to/#qa-with-rag).\n",
+    ":::\n",
+    "\n",
+    "## Next steps\n",
+    "\n",
+    "You've now seen how to load documents from a PDF file with a Document Loader and some techniques you can use to prepare that loaded data for RAG.\n",
+    "\n",
+    "For more on document loaders, you can check out:\n",
+    "\n",
+    "- [The entry in the conceptual guide](/docs/concepts/#document-loaders)\n",
+    "- [Related how-to guides](/docs/how_to/#document-loaders)\n",
+    "- [Available integrations](/docs/integrations/document_loaders/)\n",
+    "- [How to create a custom document loader](/docs/how_to/document_loader_custom/)\n",
+    "\n",
+    "For more on RAG, see:\n",
+    "\n",
+    "- [Build a Retrieval Augmented Generation (RAG) App](/docs/tutorials/rag/)\n",
+    "- [Related how-to guides](/docs/how_to/#qa-with-rag)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}