langchain/tests/unit_tests/document_loaders
corranmac 20c6ade2fc
Grobid parser for Scientific Articles from PDF (#6729)
### Scientific Article PDF Parsing via Grobid

`Description:`
This change adds the GrobidParser class, which uses the Grobid library
to parse scientific articles into a universal XML format containing the
article title, references, sections, section text etc. The GrobidParser
uses a local Grobid server to convert PDF documents to XML and parses the
XML to produce documents of either individual sentences or whole
paragraphs. Metadata includes the text, paragraph number, PDF-relative
bboxes, pages (text may span two pages), section title
(Introduction, Methodology etc.), section_number (e.g. 1.1, 2.3), the
title of the paper and finally the file path.
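
To illustrate the paragraph-level output described above, here is a minimal
sketch that walks a small TEI-style XML fragment and tags each paragraph with
its section heading. It is not the GrobidParser implementation: the sample
XML is hand-written, the function name `paragraphs_with_sections` is
hypothetical, and the metadata keys (`para`, `section_title`,
`section_number`) are assumed names chosen to mirror the description.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by Grobid's XML output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written fragment; real Grobid output is much richer.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head n="1">Introduction</head>
      <p>Grobid converts PDFs to TEI XML.</p>
      <p>Sections keep their headings.</p>
    </div>
    <div><head n="2">Methodology</head>
      <p>We parse each paragraph.</p>
    </div>
  </body></text>
</TEI>"""


def paragraphs_with_sections(tei_xml):
    """Return one dict per paragraph, tagged with its section heading."""
    root = ET.fromstring(tei_xml)
    records = []
    for div in root.iterfind(".//tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        title = head.text if head is not None else ""
        number = head.get("n", "") if head is not None else ""
        for p in div.iterfind("tei:p", TEI_NS):
            records.append({
                "text": "".join(p.itertext()).strip(),
                "para": len(records),  # running paragraph index (assumed scheme)
                "section_title": title,
                "section_number": number,
            })
    return records


for record in paragraphs_with_sections(SAMPLE_TEI):
    print(record["section_number"], record["section_title"], "|", record["text"])
```

Sentence-level output would follow the same pattern with one extra loop over
sentence elements inside each paragraph.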
      
Grobid parsing is useful beyond standard PDF parsing as it accurately
outputs sections and the paragraphs within them. This allows for
post-filtering of results for specific sections, e.g. limiting results to
the methodology or results section. While sections are split via
headings, ideally they could be classified specifically into
introduction, methodology, results, discussion, conclusion. I'm
currently experimenting with ChatGPT (GPT-3.5) for this function, which could
later be implemented as a text splitter.
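
The post-filtering idea can be sketched as follows. This is an illustration
only: plain dicts stand in for retrieved documents, the `section_title`
metadata key is an assumed name, and `filter_by_section` is a hypothetical
helper, not part of this change.

```python
def filter_by_section(records, wanted_titles):
    """Keep records whose section title matches one of wanted_titles.

    Matching is case-insensitive and ignores surrounding whitespace.
    """
    wanted = {title.strip().lower() for title in wanted_titles}
    return [
        record
        for record in records
        if record.get("metadata", {}).get("section_title", "").strip().lower() in wanted
    ]


# Hypothetical retrieval results; the dicts stand in for document objects.
results = [
    {"page_content": "Prior work has focused on layout.",
     "metadata": {"section_title": "Introduction", "section_number": "1"}},
    {"page_content": "We fine-tune on annotated articles.",
     "metadata": {"section_title": "Methodology", "section_number": "2"}},
    {"page_content": "Accuracy improves on held-out papers.",
     "metadata": {"section_title": "Results", "section_number": "3"}},
]

methods_only = filter_by_section(results, ["methodology"])
print([r["page_content"] for r in methods_only])
```

Exact title matching only works when headings are named consistently, which
is why classifying sections into fixed categories would make this filter more
robust.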

`Dependencies:`
To use it, the Grobid repo must be cloned and Java must be installed. For
Colab this is:

```
import os

# Install Java 11 and make it the default JVM
!apt-get install -y openjdk-11-jdk -q
!update-alternatives --set java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
# Clone Grobid and build it from source
!git clone https://github.com/kermitt2/grobid.git
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.chdir('grobid')
!./gradlew clean install
```

Once installed, the server is run on localhost:8070 via:
```
get_ipython().system_raw('nohup ./gradlew run > grobid.log 2>&1 &')
```
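
Because the server starts in the background, it may help to wait until it
responds before parsing anything. The sketch below polls a health-check URL
using only the standard library. It assumes Grobid's `/api/isalive` endpoint
on the default port; the helper names `is_grobid_alive` and `wait_for_grobid`
are hypothetical and not part of this change.

```python
import time
import urllib.request

# Assumed health-check endpoint on the default local port.
GROBID_ALIVE_URL = "http://localhost:8070/api/isalive"


def is_grobid_alive(url=GROBID_ALIVE_URL, timeout=2.0):
    """Return True if the server answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, refused connections and timeouts
        # are all OSError subclasses.
        return False


def wait_for_grobid(url=GROBID_ALIVE_URL, attempts=30, delay=2.0):
    """Poll the health check until it succeeds or the attempts run out."""
    for _ in range(attempts):
        if is_grobid_alive(url):
            return True
        time.sleep(delay)
    return False
```

Calling `wait_for_grobid()` after launching the server returns `True` once
the server is reachable and `False` if it never comes up, at which point
`grobid.log` is the place to look.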

@rlancemartin, @eyurtsev

Twitter Handle: @Corranmac

Grobid Demo Notebook is
[here](https://colab.research.google.com/drive/1X-St_mQRmmm8YWtct_tcJNtoktbdGBmd?usp=sharing).

---------

Co-authored-by: rlm <pexpresss31@gmail.com>
1 year ago
| Name | Last commit | Updated |
| --- | --- | --- |
| blob_loaders | YoutubeAudioLoader and updates to OpenAIWhisperParser (#5772) | 1 year ago |
| loaders | fix(document_loaders/telegram): fix pandas calls + add tests (#4806) | 1 year ago |
| parsers | Grobid parser for Scientific Articles from PDF (#6729) | 1 year ago |
| sample_documents | Bibtex integration for document loader and retriever (#5137) | 1 year ago |
| test_docs | Allow readthedoc loader to pass custom html tag (#5175) | 1 year ago |
| \_\_init\_\_.py | fix(document_loaders/telegram): fix pandas calls + add tests (#4806) | 1 year ago |
| test_base.py | fix(document_loaders/telegram): fix pandas calls + add tests (#4806) | 1 year ago |
| test_bibtex.py | Bibtex integration for document loader and retriever (#5137) | 1 year ago |
| test_bshtml.py | Add html parsers (#4874) | 1 year ago |
| test_confluence.py | Add Confluence Loader unit tests (#3333) | 1 year ago |
| test_csv_loader.py | fix(document_loaders/telegram): fix pandas calls + add tests (#4806) | 1 year ago |
| test_detect_encoding.py | feat #4479: TextLoader auto detect encoding and improved exceptions (#4927) | 1 year ago |
| test_directory.py | Add path validation to DirectoryLoader (#5327) | 1 year ago |
| test_evernote_loader.py | feature/4493 Improve Evernote Document Loader (#4577) | 1 year ago |
| test_generic_loader.py | Add a generic document loader (#4875) | 1 year ago |
| test_github.py | DocumentLoader for GitHub (#5408) | 1 year ago |
| test_json_loader.py | fix(document_loaders/telegram): fix pandas calls + add tests (#4806) | 1 year ago |
| test_mhtml.py | Added a MHTML document loader (#6311) | 1 year ago |
| test_psychic.py | Update to the latest Psychic python library version (#6804) | 1 year ago |
| test_readthedoc.py | Allow readthedoc loader to pass custom html tag (#5175) | 1 year ago |
| test_telegram.py | fix(document_loaders/telegram): fix pandas calls + add tests (#4806) | 1 year ago |
| test_trello.py | New Trello document loader (#4767) | 1 year ago |
| test_web_base.py | fix(document_loaders/telegram): fix pandas calls + add tests (#4806) | 1 year ago |
| test_youtube.py | fix(document_loaders/telegram): fix pandas calls + add tests (#4806) | 1 year ago |