mirror of
https://github.com/hwchase17/langchain
synced 2024-10-31 15:20:26 +00:00
dfb93dd2b5
- Description: Improvement in the Grobid loader documentation, typos and suggesting to use the docker image instead of installing Grobid in local (the documentation was also limited to Mac, while docker allow running in any platform) - Tag maintainer: @rlancemartin, @eyurtsev - Twitter handle: @whitenoise
# Grobid
GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents.
It is designed to parse academic papers, and it works particularly well on them.
*Note*: if the articles supplied to Grobid are large documents (e.g., dissertations) exceeding a certain number of elements, they might not be processed.
This page covers how to use Grobid to parse articles for LangChain.
## Installation

The Grobid installation is described in detail at https://grobid.readthedocs.io/en/latest/Install-Grobid/.
However, it is easier and less error-prone to run Grobid through a Docker container,
as documented [here](https://grobid.readthedocs.io/en/latest/Grobid-docker/).
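
A minimal sketch of the Docker route, assuming the `lfoppiano/grobid` image from the Grobid Docker docs (the tag is an assumption; check the linked page for the current one):

```shell
# Run the Grobid image, exposing its default port 8070
# (the 0.7.3 tag is an assumption; see the Grobid Docker docs for current tags)
docker run --rm -p 8070:8070 lfoppiano/grobid:0.7.3

# In another terminal, confirm the service is up
curl http://localhost:8070/api/isalive
```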
## Use Grobid with LangChain
Once Grobid is installed and up and running (you can check by accessing http://localhost:8070), you're ready to go.
You can now use the GrobidParser to produce documents:

```python
from langchain.document_loaders.parsers import GrobidParser
from langchain.document_loaders.generic import GenericLoader

# Produce chunks from article paragraphs
loader = GenericLoader.from_filesystem(
    "/Users/31treehaus/Desktop/Papers/",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=False),
)
docs = loader.load()

# Produce chunks from article sentences
loader = GenericLoader.from_filesystem(
    "/Users/31treehaus/Desktop/Papers/",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=True),
)
docs = loader.load()
```
Chunk metadata will include bounding boxes. Although these are a bit tricky to parse,
they are explained at https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/.
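
Per the Grobid coordinates documentation linked above, each box is encoded as `page,x,y,width,height`, and multiple boxes are separated by semicolons. A minimal sketch of parsing such a coordinate string (the sample values are made up for illustration):

```python
from typing import List, NamedTuple


class BBox(NamedTuple):
    page: int
    x: float
    y: float
    width: float
    height: float


def parse_coords(coords: str) -> List[BBox]:
    """Parse a Grobid coordinate string: boxes are 'page,x,y,w,h', joined by ';'."""
    boxes = []
    for part in coords.split(";"):
        if not part.strip():
            continue
        p, x, y, w, h = part.split(",")
        boxes.append(BBox(int(p), float(x), float(y), float(w), float(h)))
    return boxes


# Illustrative sample, not taken from a real paper
sample = "2,108.0,654.6,201.9,9.5;2,108.0,666.6,198.4,9.5"
print(parse_coords(sample))
```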