forked from Archives/langchain
c64f98e2bb
Co-authored-by: Andrew White <white.d.andrew@gmail.com> Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MBP.attlocal.net> Co-authored-by: Peng Qu <82029664+pengqu123@users.noreply.github.com>
201 lines
7.7 KiB
Plaintext
201 lines
7.7 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "f70e6118",
|
|
"metadata": {},
|
|
"source": [
|
|
"# PDF\n",
|
|
"\n",
|
|
"This covers how to load pdfs into a document format that we can use downstream."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "743f9413",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Using PyPDF\n",
|
|
"\n",
|
|
"Allows for tracking of page numbers as well."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"id": "c428b0c5",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from langchain.document_loaders import PagedPDFSplitter\n",
|
|
"\n",
|
|
"loader = PagedPDFSplitter(\"example_data/layout-parser-paper.pdf\")\n",
|
|
"pages = loader.load_and_split()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "ebd895e4",
|
|
"metadata": {},
|
|
"source": [
|
|
"An advantage of this approach is that documents can be retrieved with page numbers."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"id": "87fa7b3a",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"9: 10 Z. Shen et al.\n",
|
|
"Fig. 4: Illustration of (a) the original historical Japanese document with layout\n",
|
|
"detection results and (b) a recreated version of the document image that achieves\n",
|
|
"much better character recognition recall. The reorganization algorithm rearranges\n",
|
|
"the tokens based on the their detected bounding boxes given a maximum allowed\n",
|
|
"height.\n",
|
|
"4LayoutParser Community Platform\n",
|
|
"Another focus of LayoutParser is promoting the reusability of layout detection\n",
|
|
"models and full digitization pipelines. Similar to many existing deep learning\n",
|
|
"libraries, LayoutParser comes with a community model hub for distributing\n",
|
|
"layout models. End-users can upload their self-trained models to the model hub,\n",
|
|
"and these models can be loaded into a similar interface as the currently available\n",
|
|
"LayoutParser pre-trained models. For example, the model trained on the News\n",
|
|
"Navigator dataset [17] has been incorporated in the model hub.\n",
|
|
"Beyond DL models, LayoutParser also promotes the sharing of entire doc-\n",
|
|
"ument digitization pipelines. For example, sometimes the pipeline requires the\n",
|
|
"combination of multiple DL models to achieve better accuracy. Currently, pipelines\n",
|
|
"are mainly described in academic papers and implementations are often not pub-\n",
|
|
"licly available. To this end, the LayoutParser community platform also enables\n",
|
|
"the sharing of layout pipelines to promote the discussion and reuse of techniques.\n",
|
|
"For each shared pipeline, it has a dedicated project page, with links to the source\n",
|
|
"code, documentation, and an outline of the approaches. A discussion panel is\n",
|
|
"provided for exchanging ideas. Combined with the core LayoutParser library,\n",
|
|
"users can easily build reusable components based on the shared pipelines and\n",
|
|
"apply them to solve their unique problems.\n",
|
|
"5 Use Cases\n",
|
|
"The core objective of LayoutParser is to make it easier to create both large-scale\n",
|
|
"and light-weight document digitization pipelines. Large-scale document processing\n",
|
|
"3: 4 Z. Shen et al.\n",
|
|
"Efficient Data AnnotationC u s t o m i z e d M o d e l T r a i n i n gModel Cust omizationDI A Model HubDI A Pipeline SharingCommunity PlatformLa y out Detection ModelsDocument Images \n",
|
|
"T h e C o r e L a y o u t P a r s e r L i b r a r yOCR ModuleSt or age & VisualizationLa y out Data Structur e\n",
|
|
"Fig. 1: The overall architecture of LayoutParser . For an input document image,\n",
|
|
"the core LayoutParser library provides a set of o\u000b",
|
|
"-the-shelf tools for layout\n",
|
|
"detection, OCR, visualization, and storage, backed by a carefully designed layout\n",
|
|
"data structure. LayoutParser also supports high level customization via e\u000ecient\n",
|
|
"layout annotation and model training functions. These improve model accuracy\n",
|
|
"on the target samples. The community platform enables the easy sharing of DIA\n",
|
|
"models and whole digitization pipelines to promote reusability and reproducibility.\n",
|
|
"A collection of detailed documentation, tutorials and exemplar projects make\n",
|
|
"LayoutParser easy to learn and use.\n",
|
|
"AllenNLP [ 8] and transformers [ 34] have provided the community with complete\n",
|
|
"DL-based support for developing and deploying models for general computer\n",
|
|
"vision and natural language processing problems. LayoutParser , on the other\n",
|
|
"hand, specializes speci\f",
|
|
"cally in DIA tasks. LayoutParser is also equipped with a\n",
|
|
"community platform inspired by established model hubs such as Torch Hub [23]\n",
|
|
"andTensorFlow Hub [1]. It enables the sharing of pretrained models as well as\n",
|
|
"full document processing pipelines that are unique to DIA tasks.\n",
|
|
"There have been a variety of document data collections to facilitate the\n",
|
|
"development of DL models. Some examples include PRImA [ 3](magazine layouts),\n",
|
|
"PubLayNet [ 38](academic paper layouts), Table Bank [ 18](tables in academic\n",
|
|
"papers), Newspaper Navigator Dataset [ 16,17](newspaper \f",
|
|
"gure layouts) and\n",
|
|
"HJDataset [31](historical Japanese document layouts). A spectrum of models\n",
|
|
"trained on these datasets are currently available in the LayoutParser model zoo\n",
|
|
"to support di\u000b",
|
|
"erent use cases.\n",
|
|
"3 The Core LayoutParser Library\n",
|
|
"At the core of LayoutParser is an o\u000b",
|
|
"-the-shelf toolkit that streamlines DL-\n",
|
|
"based document image analysis. Five components support a simple interface\n",
|
|
"with comprehensive functionalities: 1) The layout detection models enable using\n",
|
|
"pre-trained or self-trained DL models for layout detection with just four lines\n",
|
|
"of code. 2) The detected layout information is stored in carefully engineered\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from langchain.vectorstores import FAISS\n",
|
|
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
|
|
"\n",
|
|
"faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())\n",
|
|
"docs = faiss_index.similarity_search(\"How will the community be engaged?\", k=2)\n",
|
|
"for doc in docs:\n",
|
|
" print(str(doc.metadata[\"page\"]) + \":\", doc.page_content)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "09d64998",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Using Unstructured"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"id": "0cc0cd42",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from langchain.document_loaders import UnstructuredPDFLoader"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"id": "082d557c",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"loader = UnstructuredPDFLoader(\"example_data/layout-parser-paper.pdf\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"id": "5c41106f",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"data = loader.load()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "54fb6b62",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.9.1"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|