# Grobid

This page covers how to use Grobid to parse articles for LangChain. It is separated into two parts: installation and running the server.

## Installation and Setup

Ensure you have Java installed:

```python
!apt-get install -y openjdk-11-jdk -q
!update-alternatives --set java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
```
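
Before building Grobid, it can be worth confirming that Java 11 is the active JDK. This quick check is not part of the original instructions, just a sanity step:

```python
# Optional sanity check: the reported version should be 11.x
!java -version
```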

Clone and install the Grobid repo:

```python
import os

!git clone https://github.com/kermitt2/grobid.git
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.chdir('grobid')
!./gradlew clean install
```

Run the server:

```python
get_ipython().system_raw('nohup ./gradlew run > grobid.log 2>&1 &')
```
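
Because the server starts in the background, it may take a little while before it accepts requests. Below is a minimal sketch for waiting on it, assuming the default Grobid port 8070 and its `isalive` endpoint; adjust the URL if you changed the Grobid configuration:

```python
import time

import requests

# Poll the Grobid service until it responds; give up after ~2.5 minutes.
for _ in range(30):
    try:
        if requests.get("http://localhost:8070/api/isalive", timeout=2).ok:
            print("Grobid server is up")
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(5)
else:
    print("Grobid server did not come up; check grobid.log")
```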

You can now use the `GrobidParser` to produce documents:

```python
from langchain.document_loaders.parsers import GrobidParser
from langchain.document_loaders.generic import GenericLoader

# Produce chunks from article paragraphs
loader = GenericLoader.from_filesystem(
    "/Users/31treehaus/Desktop/Papers/",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=False),
)
docs = loader.load()

# Produce chunks from article sentences
loader = GenericLoader.from_filesystem(
    "/Users/31treehaus/Desktop/Papers/",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=True),
)
docs = loader.load()
```

Chunk metadata will include bounding boxes. These are a bit awkward to parse; see https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/ for how the coordinates are encoded.
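
To see what actually ends up in the chunk metadata, it can help to inspect one of the returned documents. This is a minimal sketch; the exact metadata keys (including the one holding the bounding boxes) depend on the parser version, so print them rather than assuming specific names:

```python
# Inspect the first parsed chunk and its metadata.
first = docs[0]
print(first.page_content[:200])

# Key names vary between versions, so list them before relying on any of them.
print(sorted(first.metadata.keys()))
print(first.metadata)
```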