# Grobid This page covers how to use the Grobid to parse articles for LangChain. It is seperated into two parts: installation and running the server ## Installation and Setup #Ensure You have Java installed !apt-get install -y openjdk-11-jdk -q !update-alternatives --set java /usr/lib/jvm/java-11-openjdk-amd64/bin/java #Clone and install the Grobid Repo import os !git clone https://github.com/kermitt2/grobid.git os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64" os.chdir('grobid') !./gradlew clean install #Run the server, get_ipython().system_raw('nohup ./gradlew run > grobid.log 2>&1 &') You can now use the GrobidParser to produce documents ```python from langchain.document_loaders.parsers import GrobidParser from langchain.document_loaders.generic import GenericLoader #Produce chunks from article paragraphs loader = GenericLoader.from_filesystem( "/Users/31treehaus/Desktop/Papers/", glob="*", suffixes=[".pdf"], parser= GrobidParser(segment_sentences=False) ) docs = loader.load() #Produce chunks from article sentences loader = GenericLoader.from_filesystem( "/Users/31treehaus/Desktop/Papers/", glob="*", suffixes=[".pdf"], parser= GrobidParser(segment_sentences=True) ) docs = loader.load() ``` Chunk metadata will include bboxes although these are a bit funky to parse, see https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/