langchain-notebooks/pdf-loader.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "jukit_cell_id": "ut22SE2PmJ"
      },
      "source": [
        "## Loading a PDF"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "jukit_cell_id": "EQL3ZDG6Dt"
      },
      "source": [
        "from langchain.document_loaders import PagedPDFSplitter\n",
        "\n",
        "# Load the PDF and split it into documents; each one records its page in metadata\n",
        "loader = PagedPDFSplitter(\"./documents/layout-parser-paper.pdf\")\n",
        "pages = loader.load_and_split()"
      ],
      "outputs": [],
      "execution_count": null
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "jukit_cell_id": "6LWg1c7vN6"
      },
      "source": [
        "Documents can be retrieved together with their page numbers."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "jukit_cell_id": "0kFnbEI7yL"
      },
      "source": [
        "from langchain.vectorstores import FAISS\n",
        "from langchain.embeddings.openai import OpenAIEmbeddings"
      ],
      "outputs": [],
      "execution_count": null
    },
    {
      "cell_type": "code",
      "metadata": {
        "jukit_cell_id": "KkXwCS4JHN"
      },
      "source": [
        "# Embed every page and build a FAISS index over the embeddings\n",
        "faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())\n",
        "\n",
        "# Find the docs (i.e. pages) most similar to the query\n",
        "# k: number of docs to return\n",
        "docs = faiss_index.similarity_search(\"How will the community be engaged ?\", k=2)"
      ],
      "outputs": [],
      "execution_count": null
    },
    {
      "cell_type": "code",
      "metadata": {
        "jukit_cell_id": "RDajVoEdqh"
      },
      "source": [
        "# Print the page number and content of each doc returned for the query\n",
        "for doc in docs:\n",
        "    print(\"\\n----\\n\")\n",
        "    # metadata[\"page\"] is zero-based, so add 1 for a human-readable page number\n",
        "    print(\"page: \" + str(doc.metadata[\"page\"] + 1))\n",
        "    print(\"content:\")\n",
        "    print(doc.page_content)"
      ],
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": "\n----\n\npage: 10\ncontent:\n10 Z. Shen et al.\nFig. 4: Illustration of (a) the original historical Japanese document with layout\ndetection results and (b) a recreated version of the document image that achieves\nmuch better character recognition recall. The reorganization algorithm rearranges\nthe tokens based on the their detected bounding boxes given a maximum allowed\nheight.\n4LayoutParser Community Platform\nAnother focus of LayoutParser is promoting the reusability of layout detection\nmodels and full digitization pipelines. Similar to many existing deep learning\nlibraries, LayoutParser comes with a community model hub for distributing\nlayout models. End-users can upload their self-trained models to the model hub,\nand these models can be loaded into a similar interface as the currently available\nLayoutParser pre-trained models. For example, the model trained on the News\nNavigator dataset [17] has been incorporated in the model hub.\nBeyond DL models, LayoutParser also promotes the sharing of entire doc-\nument digitization pipelines. For example, sometimes the pipeline requires the\ncombination of multiple DL models to achieve better accuracy. Currently, pipelines\nare mainly described in academic papers and implementations are often not pub-\nlicly available. To this end, the LayoutParser community platform also enables\nthe sharing of layout pipelines to promote the discussion and reuse of techniques.\nFor each shared pipeline, it has a dedicated project page, with links to the source\ncode, documentation, and an outline of the approaches. A discussion panel is\nprovided for exchanging ideas. Combined with the core LayoutParser library,\nusers can easily build reusable components based on the shared pipelines and\napply them to solve their unique problems.\n5 Use Cases\nThe core objective of LayoutParser is to make it easier to create both large-scale\nand light-weight document digitization pipelines. 
Large-scale document processing\n\n----\n\npage: 4\ncontent:\n4 Z. Shen et al.\nEfficient Data AnnotationC u s t o m i z e d  M o d e l  T r a i n i n gModel Cust omizationDI A Model HubDI A Pipeline SharingCommunity PlatformLa y out Detection ModelsDocument Images \nT h e  C o r e  L a y o u t P a r s e r  L i b r a r yOCR ModuleSt or age & VisualizationLa y out Data Structur e\nFig. 1: The overall architecture of LayoutParser . For an input document image,\nthe core LayoutParser library provides a set of o\u000b-the-shelf tools for layout\ndetection, OCR, visualization, and storage, backed by a carefully designed layout\ndata structure. LayoutParser also supports high level customization via e\u000ecient\nlayout annotation and model training functions. These improve model accuracy\non the target samples. The community platform enables the easy sharing of DIA\nmodels and whole digitization pipelines to promote reusability and reproducibility.\nA collection of detailed documentation, tutorials and exemplar projects make\nLayoutParser easy to learn and use.\nAllenNLP [ 8] and transformers [ 34] have provided the community with complete\nDL-based support for developing and deploying models for general computer\nvision and natural language processing problems. LayoutParser , on the other\nhand, specializes speci\fcally in DIA tasks. LayoutParser is also equipped with a\ncommunity platform inspired by established model hubs such as Torch Hub [23]\nandTensorFlow Hub [1]. It enables the sharing of pretrained models as well as\nfull document processing pipelines that are unique to DIA tasks.\nThere have been a variety of document data collections to facilitate the\ndevelopment of DL models. Some examples include PRImA [ 3](magazine layouts),\nPubLayNet [ 38](academic paper layouts), Table Bank [ 18](tables in academic\npapers), Newspaper Navigator Dataset [ 16,17](newspaper \fgure layouts) and\nHJDataset [31](historical Japanese document layouts). 
A spectrum of models\ntrained on these datasets are currently available in the LayoutParser model zoo\nto support di\u000berent use cases.\n3 The Core LayoutParser Library\nAt the core of LayoutParser is an o\u000b-the-shelf toolkit that streamlines DL-\nbased document image analysis. Five components support a simple interface\nwith comprehensive functionalities: 1) The layout detection models enable using\npre-trained or self-trained DL models for layout detection with just four lines\nof code. 2) The detected layout information is stored in carefully engineered\n"
        }
      ],
      "execution_count": 1
    },
    {
      "cell_type": "code",
      "metadata": {
        "jukit_cell_id": "cqoPocvVBS"
      },
      "source": [],
      "outputs": [],
      "execution_count": null
    }
  ],
  "metadata": {
    "anaconda-cloud": {},
    "kernelspec": {
      "display_name": "python",
      "language": "python",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}