Add an example tutorial for using PDFMinerPDFasHTMLLoader (#2960)

Last week I added the `PDFMinerPDFasHTMLLoader`. This commit adds example code to the notebook as a tutorial showing how that loader can be used to create snippets of a PDF that are structured by section. All the other loaders only provide `Document` objects segmented by page, which is a loose segmentation given how much other metadata can be extracted. With the new loader, one can use the font size of the text to decide when a new section starts and segment the text more semantically, as shown in the tutorial notebook. The last added cell shows that we can retrieve the content of the entire **Related Work** section of the example PDF, which is spread across 2 pages and is therefore stored as two separate documents by other loaders.
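For readers skimming the description, here is a minimal, dependency-free sketch of the font-size grouping rule. It is not the notebook code itself: `group_by_font_size` and the toy `runs` list are hypothetical names made up for illustration, and it uses plain dicts where the notebook builds `Document` objects. It assumes the input has already been reduced to `(text, font_size)` pairs.

```python
from typing import Dict, List, Tuple


def group_by_font_size(runs: List[Tuple[str, int]]) -> List[Dict]:
    """Group (text, font_size) runs into sections.

    Assumption: a heading uses a larger font than the body text below it.
    """
    sections: List[Dict] = []
    for text, size in runs:
        current = sections[-1] if sections else None

        # Body text: the section has no body yet and the run is not larger
        # than its heading, or the run is no larger than the body font seen so far.
        is_body = (
            current is not None
            and size <= current["heading_font"]
            and (current["content_font"] == 0 or size <= current["content_font"])
        )
        if is_body:
            current["content"] += text
            current["content_font"] = max(size, current["content_font"])
        else:
            # Anything larger than the current body font starts a new section.
            sections.append(
                {"heading": text, "content": "", "heading_font": size, "content_font": 0}
            )
    return sections


# Toy input (hypothetical): the introduction body is split into two runs,
# as it would be when a section continues on the next page.
runs = [
    ("Title\n", 14),
    ("Abstract text\n", 9),
    ("1 Introduction\n", 11),
    ("Intro on page 1. ", 9),
    ("Intro continued on page 2.\n", 9),
    ("2 Related Work\n", 11),
    ("Prior work.\n", 9),
]
sections = group_by_font_size(runs)
for s in sections:
    print(repr(s["heading"]), "->", repr(s["content"]))
```

Both introduction runs end up under the single `1 Introduction` heading, which is the behaviour the tutorial relies on for sections that span a page break.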
1 year ago · aead062a70
parent 51894ddd98
commit aead062a70
1 changed files with 108 additions and 5 deletions
--- a/docs/modules/indexes/document_loaders/examples/pdf.ipynb
+++ b/docs/modules/indexes/document_loaders/examples/pdf.ipynb
@@ -376,7 +376,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 2,
   "id": "a5525fb0",
   "metadata": {},
   "outputs": [],
@@ -386,12 +386,115 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 3,
   "id": "dac7ff68",
   "metadata": {},
   "outputs": [],
   "source": [
-    "data = loader.load()"
+    "data = loader.load()[0]   # entire pdf is loaded as a single Document"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "0ba9f645",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from bs4 import BeautifulSoup\n",
+    "soup = BeautifulSoup(data.page_content,'html.parser')\n",
+    "content = soup.find_all('div')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "35304e21",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import re\n",
+    "cur_fs = None\n",
+    "cur_text = ''\n",
+    "snippets = []   # first collect all snippets that have the same font size\n",
+    "for c in content:\n",
+    "    sp = c.find('span')\n",
+    "    if not sp:\n",
+    "        continue\n",
+    "    st = sp.get('style')\n",
+    "    if not st:\n",
+    "        continue\n",
+    "    fs = re.findall(r'font-size:(\\d+)px',st)\n",
+    "    if not fs:\n",
+    "        continue\n",
+    "    fs = int(fs[0])\n",
+    "    if not cur_fs:\n",
+    "        cur_fs = fs\n",
+    "    if fs == cur_fs:\n",
+    "        cur_text += c.text\n",
+    "    else:\n",
+    "        snippets.append((cur_text,cur_fs))\n",
+    "        cur_fs = fs\n",
+    "        cur_text = c.text\n",
+    "snippets.append((cur_text,cur_fs))\n",
+    "# Note: The above logic is very straightforward. One can also add more strategies, such as removing duplicate snippets\n",
+    "# (headers/footers in a PDF appear on multiple pages, so if we find duplicates it is safe to assume they are redundant)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "af8adf2f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.docstore.document import Document\n",
+    "cur_idx = -1\n",
+    "semantic_snippets = []\n",
+    "# Assumption: headings have higher font size than their respective content\n",
+    "for s in snippets:\n",
+    "    # if current snippet's font size > previous section's heading => it is a new heading\n",
+    "    if not semantic_snippets or s[1] > semantic_snippets[cur_idx].metadata['heading_font']:\n",
+    "        metadata={'heading':s[0], 'content_font': 0, 'heading_font': s[1]}\n",
+    "        metadata.update(data.metadata)\n",
+    "        semantic_snippets.append(Document(page_content='',metadata=metadata))\n",
+    "        cur_idx += 1\n",
+    "        continue\n",
+    "    \n",
+    "    # if current snippet's font size <= previous section's content => content belongs to the same section (one can also create\n",
+    "    # a tree like structure for sub sections if needed but that may require some more thinking and may be data specific)\n",
+    "    if not semantic_snippets[cur_idx].metadata['content_font'] or s[1] <= semantic_snippets[cur_idx].metadata['content_font']:\n",
+    "        semantic_snippets[cur_idx].page_content += s[0]\n",
+    "        semantic_snippets[cur_idx].metadata['content_font'] = max(s[1], semantic_snippets[cur_idx].metadata['content_font'])\n",
+    "        continue\n",
+    "    \n",
+    "    # if current snippet's font size > previous section's content but not greater than previous section's heading, then also make a new\n",
+    "    # section (e.g. title of a pdf will have the highest font size but we don't want it to subsume all sections)\n",
+    "    metadata={'heading':s[0], 'content_font': 0, 'heading_font': s[1]}\n",
+    "    metadata.update(data.metadata)\n",
+    "    semantic_snippets.append(Document(page_content='',metadata=metadata))\n",
+    "    cur_idx += 1"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "db7f6674",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Document(page_content='Recently, various DL models and datasets have been developed for layout analysis\\ntasks. The dhSegment [22] utilizes fully convolutional networks [20] for segmen-\\ntation tasks on historical documents. Object detection-based methods like Faster\\nR-CNN [28] and Mask R-CNN [12] are used for identifying document elements [38]\\nand detecting tables [30, 26]. Most recently, Graph Neural Networks [29] have also\\nbeen used in table detection [27]. However, these models are usually implemented\\nindividually and there is no uniﬁed framework to load and use such models.\\nThere has been a surge of interest in creating open-source tools for document\\nimage processing: a search of document image analysis in Github leads to 5M\\nrelevant code pieces 6; yet most of them rely on traditional rule-based methods\\nor provide limited functionalities. The closest prior research to our work is the\\nOCR-D project7, which also tries to build a complete toolkit for DIA. However,\\nsimilar to the platform developed by Neudecker et al. [21], it is designed for\\nanalyzing historical documents, and provides no supports for recent DL models.\\nThe DocumentLayoutAnalysis project8 focuses on processing born-digital PDF\\ndocuments via analyzing the stored PDF data. Repositories like DeepLayout9\\nand Detectron2-PubLayNet10 are individual deep learning models trained on\\nlayout analysis datasets without support for the full DIA pipeline. The Document\\nAnalysis and Exploitation (DAE) platform [15] and the DeepDIVA project [2]\\naim to improve the reproducibility of DIA methods (or DL models), yet they\\nare not actively maintained. OCR engines like Tesseract [14], easyOCR11 and\\npaddleOCR12 usually do not come with comprehensive functionalities for other\\nDIA tasks like layout analysis.\\nRecent years have also seen numerous eﬀorts to create libraries for promoting\\nreproducibility and reusability in the ﬁeld of DL. 
Libraries like Dectectron2 [35],\\n6 The number shown is obtained by specifying the search type as ‘code’.\\n7 https://ocr-d.de/en/about\\n8 https://github.com/BobLd/DocumentLayoutAnalysis\\n9 https://github.com/leonlulu/DeepLayout\\n10 https://github.com/hpanwar08/detectron2\\n11 https://github.com/JaidedAI/EasyOCR\\n12 https://github.com/PaddlePaddle/PaddleOCR\\n4\\nZ. Shen et al.\\nFig. 1: The overall architecture of LayoutParser. For an input document image,\\nthe core LayoutParser library provides a set of oﬀ-the-shelf tools for layout\\ndetection, OCR, visualization, and storage, backed by a carefully designed layout\\ndata structure. LayoutParser also supports high level customization via eﬃcient\\nlayout annotation and model training functions. These improve model accuracy\\non the target samples. The community platform enables the easy sharing of DIA\\nmodels and whole digitization pipelines to promote reusability and reproducibility.\\nA collection of detailed documentation, tutorials and exemplar projects make\\nLayoutParser easy to learn and use.\\nAllenNLP [8] and transformers [34] have provided the community with complete\\nDL-based support for developing and deploying models for general computer\\nvision and natural language processing problems. LayoutParser, on the other\\nhand, specializes speciﬁcally in DIA tasks. LayoutParser is also equipped with a\\ncommunity platform inspired by established model hubs such as Torch Hub [23]\\nand TensorFlow Hub [1]. It enables the sharing of pretrained models as well as\\nfull document processing pipelines that are unique to DIA tasks.\\nThere have been a variety of document data collections to facilitate the\\ndevelopment of DL models. Some examples include PRImA [3](magazine layouts),\\nPubLayNet [38](academic paper layouts), Table Bank [18](tables in academic\\npapers), Newspaper Navigator Dataset [16, 17](newspaper ﬁgure layouts) and\\nHJDataset [31](historical Japanese document layouts). 
A spectrum of models\\ntrained on these datasets are currently available in the LayoutParser model zoo\\nto support diﬀerent use cases.\\n', metadata={'heading': '2 Related Work\\n', 'content_font': 9, 'heading_font': 11, 'source': 'example_data/layout-parser-paper.pdf'})"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "semantic_snippets[4]"
   ]
  },
  {
@@ -474,9 +577,9 @@
 ],
 "metadata": {
  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
+   "display_name": "langchain_dev",
   "language": "python",
-   "name": "python3"
+   "name": "langchain_dev"
  },
  "language_info": {
   "codemirror_mode": {