Grobid parser for Scientific Articles from PDF (#6729)

### Scientific Article PDF Parsing via Grobid

`Description:`
This change adds the GrobidParser class, which uses the Grobid library
to parse scientific articles into a universal XML format containing the
article title, references, sections, section text etc. The GrobidParser
uses a local Grobid server to return PDFs document as XML and parses the
XML to optionally produce documents of individual sentences or of whole
paragraphs. Metadata includes the text, paragraph number, pdf relative
bboxes, pages (text may overlap over two pages), section title
(Introduction, Methodology etc), section_number (i.e 1.1, 2.3), the
title of the paper and finally the file path.
      
Grobid parsing is useful beyond standard pdf parsing as it accurately
outputs sections and paragraphs within them. This allows for
post-fitering of results for specific sections i.e. limiting results to
the methodology section or results. While sections are split via
headings, ideally they could be classified specifically into
introduction, methodology, results, discussion, conclusion. I'm
currently experimenting with chatgpt-3.5 for this function, which could
later be implemented as a textsplitter.

`Dependencies:`
For use, the grobid repo must be cloned and Java must be installed, for
colab this is:

```
!apt-get install -y openjdk-11-jdk -q
!update-alternatives --set java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
!git clone https://github.com/kermitt2/grobid.git
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.chdir('grobid')
!./gradlew clean install
```

Once installed the server is ran on localhost:8070 via
```
get_ipython().system_raw('nohup ./gradlew run > grobid.log 2>&1 &')
```

@rlancemartin, @eyurtsev

Twitter Handle: @Corranmac

Grobid Demo Notebook is
[here](https://colab.research.google.com/drive/1X-St_mQRmmm8YWtct_tcJNtoktbdGBmd?usp=sharing).

---------

Co-authored-by: rlm <pexpresss31@gmail.com>
pull/6944/head
corranmac 1 year ago committed by GitHub
parent 6157bdf9d9
commit 20c6ade2fc
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -0,0 +1,44 @@
# Grobid
This page covers how to use the Grobid to parse articles for LangChain.
It is seperated into two parts: installation and running the server
## Installation and Setup
#Ensure You have Java installed
!apt-get install -y openjdk-11-jdk -q
!update-alternatives --set java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
#Clone and install the Grobid Repo
import os
!git clone https://github.com/kermitt2/grobid.git
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.chdir('grobid')
!./gradlew clean install
#Run the server,
get_ipython().system_raw('nohup ./gradlew run > grobid.log 2>&1 &')
You can now use the GrobidParser to produce documents
```python
from langchain.document_loaders.parsers import GrobidParser
from langchain.document_loaders.generic import GenericLoader
#Produce chunks from article paragraphs
loader = GenericLoader.from_filesystem(
"/Users/31treehaus/Desktop/Papers/",
glob="*",
suffixes=[".pdf"],
parser= GrobidParser(segment_sentences=False)
)
docs = loader.load()
#Produce chunks from article sentences
loader = GenericLoader.from_filesystem(
"/Users/31treehaus/Desktop/Papers/",
glob="*",
suffixes=[".pdf"],
parser= GrobidParser(segment_sentences=True)
)
docs = loader.load()
```
Chunk metadata will include bboxes although these are a bit funky to parse, see https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/

@ -0,0 +1,180 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "bdccb278",
"metadata": {},
"source": [
"# Grobid\n",
"\n",
"GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents.\n",
"\n",
"It is particularly good for sturctured PDFs, like academic papers.\n",
"\n",
"This loader uses GROBIB to parse PDFs into `Documents` that retain metadata associated with the section of text.\n",
"\n",
"---\n",
"\n",
"For users on `Mac` - \n",
"\n",
"(Note: additional instructions can be found [here](https://python.langchain.com/docs/ecosystem/integrations/grobid.mdx).)\n",
"\n",
"Install Java (Apple Silicon):\n",
"```\n",
"$ arch -arm64 brew install openjdk@11\n",
"$ brew --prefix openjdk@11\n",
"/opt/homebrew/opt/openjdk@ 11\n",
"```\n",
"\n",
"In `~/.zshrc`:\n",
"```\n",
"export JAVA_HOME=/opt/homebrew/opt/openjdk@11\n",
"export PATH=$JAVA_HOME/bin:$PATH\n",
"```\n",
"\n",
"Then, in Terminal:\n",
"```\n",
"$ source ~/.zshrc\n",
"```\n",
"\n",
"Confirm install:\n",
"```\n",
"$ which java\n",
"/opt/homebrew/opt/openjdk@11/bin/java\n",
"$ java -version \n",
"openjdk version \"11.0.19\" 2023-04-18\n",
"OpenJDK Runtime Environment Homebrew (build 11.0.19+0)\n",
"OpenJDK 64-Bit Server VM Homebrew (build 11.0.19+0, mixed mode)\n",
"```\n",
"\n",
"Then, get [Grobid](https://grobid.readthedocs.io/en/latest/Install-Grobid/#getting-grobid):\n",
"```\n",
"$ curl -LO https://github.com/kermitt2/grobid/archive/0.7.3.zip\n",
"$ unzip 0.7.3.zip\n",
"```\n",
" \n",
"Build\n",
"```\n",
"$ ./gradlew clean install\n",
"```\n",
"\n",
"Then, run the server:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "2d8992fc",
"metadata": {},
"outputs": [],
"source": [
"! get_ipython().system_raw('nohup ./gradlew run > grobid.log 2>&1 &')"
]
},
{
"cell_type": "markdown",
"id": "4b41bfb1",
"metadata": {},
"source": [
"Now, we can use the data loader."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "640e9a4b",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders.parsers import GrobidParser\n",
"from langchain.document_loaders.generic import GenericLoader"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ecdc1fb9",
"metadata": {},
"outputs": [],
"source": [
"loader = GenericLoader.from_filesystem(\n",
" \"../Papers/\",\n",
" glob=\"*\",\n",
" suffixes=[\".pdf\"],\n",
" parser= GrobidParser(segment_sentences=False)\n",
")\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "efe9e356",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Unlike Chinchilla, PaLM, or GPT-3, we only use publicly available data, making our work compatible with open-sourcing, while most existing models rely on data which is either not publicly available or undocumented (e.g.\"Books -2TB\" or \"Social media conversations\").There exist some exceptions, notably OPT (Zhang et al., 2022), GPT-NeoX (Black et al., 2022), BLOOM (Scao et al., 2022) and GLM (Zeng et al., 2022), but none that are competitive with PaLM-62B or Chinchilla.'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[3].page_content"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "5be03d17",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'text': 'Unlike Chinchilla, PaLM, or GPT-3, we only use publicly available data, making our work compatible with open-sourcing, while most existing models rely on data which is either not publicly available or undocumented (e.g.\"Books -2TB\" or \"Social media conversations\").There exist some exceptions, notably OPT (Zhang et al., 2022), GPT-NeoX (Black et al., 2022), BLOOM (Scao et al., 2022) and GLM (Zeng et al., 2022), but none that are competitive with PaLM-62B or Chinchilla.',\n",
" 'para': '2',\n",
" 'bboxes': \"[[{'page': '1', 'x': '317.05', 'y': '509.17', 'h': '207.73', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '522.72', 'h': '220.08', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '536.27', 'h': '218.27', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '549.82', 'h': '218.65', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '563.37', 'h': '136.98', 'w': '9.46'}], [{'page': '1', 'x': '446.49', 'y': '563.37', 'h': '78.11', 'w': '9.46'}, {'page': '1', 'x': '304.69', 'y': '576.92', 'h': '138.32', 'w': '9.46'}], [{'page': '1', 'x': '447.75', 'y': '576.92', 'h': '76.66', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '590.47', 'h': '219.63', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '604.02', 'h': '218.27', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '617.56', 'h': '218.27', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '631.11', 'h': '220.18', 'w': '9.46'}]]\",\n",
" 'pages': \"('1', '1')\",\n",
" 'section_title': 'Introduction',\n",
" 'section_number': '1',\n",
" 'paper_title': 'LLaMA: Open and Efficient Foundation Language Models',\n",
" 'file_path': '/Users/31treehaus/Desktop/Papers/2302.13971.pdf'}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[3].metadata"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -1,4 +1,5 @@
from langchain.document_loaders.parsers.audio import OpenAIWhisperParser
from langchain.document_loaders.parsers.grobid import GrobidParser
from langchain.document_loaders.parsers.html import BS4HTMLParser
from langchain.document_loaders.parsers.language import LanguageParser
from langchain.document_loaders.parsers.pdf import (
@ -11,6 +12,7 @@ from langchain.document_loaders.parsers.pdf import (
__all__ = [
"BS4HTMLParser",
"GrobidParser",
"LanguageParser",
"OpenAIWhisperParser",
"PDFMinerParser",

@ -0,0 +1,142 @@
from typing import Dict, Iterator, List, Union
import requests
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseBlobParser
from langchain.document_loaders.blob_loaders import Blob
class ServerUnavailableException(Exception):
pass
class GrobidParser(BaseBlobParser):
"""Loader that uses Grobid to load article PDF files."""
def __init__(
self,
segment_sentences: bool,
grobid_server: str = "http://localhost:8070/api/processFulltextDocument",
) -> None:
self.segment_sentences = segment_sentences
self.grobid_server = grobid_server
try:
requests.get(grobid_server)
except requests.exceptions.RequestException:
print(
"GROBID server does not appear up and running, \
please ensure Grobid is installed and the server is running"
)
raise ServerUnavailableException
def process_xml(
self, file_path: str, xml_data: str, segment_sentences: bool
) -> Iterator[Document]:
"""Process the XML file from Grobin."""
try:
from bs4 import BeautifulSoup
except ImportError:
raise ImportError(
"`bs4` package not found, please install it with " "`pip install bs4`"
)
soup = BeautifulSoup(xml_data, "xml")
sections = soup.find_all("div")
title = soup.find_all("title")[0].text
chunks = []
for section in sections:
sect = section.find("head")
if sect is not None:
for i, paragraph in enumerate(section.find_all("p")):
chunk_bboxes = []
paragraph_text = []
for i, sentence in enumerate(paragraph.find_all("s")):
paragraph_text.append(sentence.text)
sbboxes = []
for bbox in sentence.get("coords").split(";"):
box = bbox.split(",")
sbboxes.append(
{
"page": box[0],
"x": box[1],
"y": box[2],
"h": box[3],
"w": box[4],
}
)
chunk_bboxes.append(sbboxes)
if segment_sentences is True:
fpage, lpage = sbboxes[0]["page"], sbboxes[-1]["page"]
sentence_dict = {
"text": sentence.text,
"para": str(i),
"bboxes": [sbboxes],
"section_title": sect.text,
"section_number": sect.get("n"),
"pages": (fpage, lpage),
}
chunks.append(sentence_dict)
if segment_sentences is not True:
fpage, lpage = (
chunk_bboxes[0][0]["page"],
chunk_bboxes[-1][-1]["page"],
)
paragraph_dict = {
"text": "".join(paragraph_text),
"para": str(i),
"bboxes": chunk_bboxes,
"section_title": sect.text,
"section_number": sect.get("n"),
"pages": (fpage, lpage),
}
chunks.append(paragraph_dict)
yield from [
Document(
page_content=chunk["text"],
metadata=dict(
{
"text": str(chunk["text"]),
"para": str(chunk["para"]),
"bboxes": str(chunk["bboxes"]),
"pages": str(chunk["pages"]),
"section_title": str(chunk["section_title"]),
"section_number": str(chunk["section_number"]),
"paper_title": str(title),
"file_path": str(file_path),
}
),
)
for chunk in chunks
]
def lazy_parse(self, blob: Blob) -> Iterator[Document]:
file_path = blob.source
if file_path is None:
raise ValueError("blob.source cannot be None.")
pdf = open(file_path, "rb")
files = {"input": (file_path, pdf, "application/pdf", {"Expires": "0"})}
try:
data: Dict[str, Union[str, List[str]]] = {}
for param in ["generateIDs", "consolidateHeader", "segmentSentences"]:
data[param] = "1"
data["teiCoordinates"] = ["head", "s"]
files = files or {}
r = requests.request(
"POST",
self.grobid_server,
headers=None,
params=None,
files=files,
data=data,
timeout=60,
)
xml_data = r.text
except requests.exceptions.ReadTimeout:
xml_data = None
if xml_data is None:
return iter([])
else:
return self.process_xml(file_path, xml_data, self.segment_sentences)

@ -5,6 +5,7 @@ def test_parsers_public_api_correct() -> None:
"""Test public API of parsers for breaking changes."""
assert set(__all__) == {
"BS4HTMLParser",
"GrobidParser",
"LanguageParser",
"OpenAIWhisperParser",
"PyPDFParser",

Loading…
Cancel
Save