langchain/tests/unit_tests
corranmac 20c6ade2fc
Grobid parser for Scientific Articles from PDF (#6729)
### Scientific Article PDF Parsing via Grobid

`Description:`
This change adds the GrobidParser class, which uses the Grobid library
to parse scientific articles into a universal XML format containing the
article title, references, sections, section text, etc. The GrobidParser
sends PDF documents to a local Grobid server, receives them back as XML,
and parses the XML to produce documents of either individual sentences or
whole paragraphs. Metadata includes the text, paragraph number, PDF-relative
bboxes, pages (text may span two pages), section title
(Introduction, Methodology, etc.), section_number (e.g. 1.1, 2.3), the
title of the paper, and finally the file path.
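
As a rough illustration of the XML-to-paragraph step described above, here is a minimal sketch using only the standard library. It is not the GrobidParser implementation: the function name `tei_to_paragraphs`, the metadata key names, and the hand-written `SAMPLE_TEI` string are all assumptions made for the example, and bboxes and pages are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Grobid returns TEI XML; elements live in this namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Tiny hand-written sample standing in for a Grobid response (illustrative only).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample Paper</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <div><head n="1">Introduction</head><p>First paragraph.</p><p>Second paragraph.</p></div>
    <div><head n="2">Methodology</head><p>We did things.</p></div>
  </body></text>
</TEI>"""


def tei_to_paragraphs(xml_text, file_path):
    """Return one record per paragraph with section-level metadata (key names are assumed)."""
    root = ET.fromstring(xml_text)
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    paper_title = (title_el.text or "") if title_el is not None else ""

    records = []
    for div in root.iterfind(".//tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        section_title = (head.text or "") if head is not None else ""
        section_number = head.get("n", "") if head is not None else ""
        for p in div.iterfind("tei:p", TEI_NS):
            records.append(
                {
                    "text": "".join(p.itertext()).strip(),
                    "para": len(records),  # running paragraph index
                    "section_title": section_title,
                    "section_number": section_number,
                    "paper_title": paper_title,
                    "file_path": file_path,
                }
            )
    return records


records = tei_to_paragraphs(SAMPLE_TEI, "sample.pdf")
for r in records:
    print(r["section_number"], r["section_title"], "|", r["text"])
```

Each record carries its section title and number, which is what makes section-level filtering possible later.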
      
Grobid parsing is useful beyond standard PDF parsing because it accurately
outputs sections and the paragraphs within them. This allows for
post-filtering of results by section, e.g. limiting results to
the methodology or results section. While sections are split via
headings, ideally they could be classified specifically into
introduction, methodology, results, discussion, conclusion. I'm
currently experimenting with ChatGPT (GPT-3.5) for this function, which could
later be implemented as a text splitter.
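
The post-filtering idea above could look something like the following sketch. The `Doc` dataclass is a hypothetical stand-in for a parsed document, and the `section_title` metadata key is an assumed name; matching is a simple case-insensitive comparison on the heading text.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a parsed document with metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_section(docs, wanted_sections):
    """Keep only documents whose section title matches one of wanted_sections."""
    wanted = {name.lower() for name in wanted_sections}
    return [d for d in docs if d.metadata.get("section_title", "").lower() in wanted]


docs = [
    Doc("We study X.", {"section_title": "Introduction"}),
    Doc("We sampled 100 papers.", {"section_title": "Methodology"}),
    Doc("Accuracy improved.", {"section_title": "Results"}),
]

methods_and_results = filter_by_section(docs, ["methodology", "results"])
for d in methods_and_results:
    print(d.metadata["section_title"], "|", d.page_content)
```

Exact matching only works when headings use the expected names, which is why classifying sections into a fixed set of categories would be a useful follow-up.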

`Dependencies:`
To use it, the Grobid repo must be cloned and Java must be installed. For
Colab, this is:

```
import os

# Install Java 11 and make it the default JVM
!apt-get install -y openjdk-11-jdk -q
!update-alternatives --set java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
# Fetch the Grobid sources and build them
!git clone https://github.com/kermitt2/grobid.git
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.chdir('grobid')
!./gradlew clean install
```

Once installed, the server is run on localhost:8070 via:
```
get_ipython().system_raw('nohup ./gradlew run > grobid.log 2>&1 &')
```
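
Since the server is started in the background, it may help to confirm it is reachable before parsing. The sketch below assumes Grobid's `/api/isalive` health endpoint and the default port from above; the helper name `grobid_is_alive` is hypothetical, and any connection failure is simply treated as "not up yet".

```python
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070", timeout=2.0):
    """Return True if the (assumed) health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, timeouts and refused connections are all OSError subclasses.
        return False


if grobid_is_alive():
    print("Grobid server is up")
else:
    print("Grobid server is not reachable yet; check grobid.log")
```

The first startup can take a while, so polling this check in a loop with a short sleep may be more practical than a single call.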

@rlancemartin, @eyurtsev

Twitter Handle: @Corranmac

Grobid Demo Notebook is
[here](https://colab.research.google.com/drive/1X-St_mQRmmm8YWtct_tcJNtoktbdGBmd?usp=sharing).

---------

Co-authored-by: rlm <pexpresss31@gmail.com>
2023-06-29 14:29:29 -07:00
agents Fix breaking tags (#6765) 2023-06-26 09:28:11 -07:00
callbacks split up batch llm calls into separate runs (#5804) 2023-06-24 21:03:31 -07:00
chains split up batch llm calls into separate runs (#5804) 2023-06-24 21:03:31 -07:00
chat_models add FunctionMessage support to _convert_dict_to_message() in OpenAI chat model (#6382) 2023-06-20 08:25:55 -07:00
client Update to RunOnDataset helper functions to accept evaluator callbacks (#6629) 2023-06-26 23:58:13 -07:00
data
docstore
document_loaders Grobid parser for Scientific Articles from PDF (#6729) 2023-06-29 14:29:29 -07:00
evaluation Permit Constitutional Principles (#6807) 2023-06-27 00:23:54 -07:00
examples Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
llms split up batch llm calls into separate runs (#5804) 2023-06-24 21:03:31 -07:00
load Include placeholder value for all secrets, not just kwargs (#6421) 2023-06-19 15:41:45 +01:00
memory Implemented appending arbitrary messages (#5293) 2023-05-29 07:18:59 -07:00
output_parsers Update String Evaluator (#6615) 2023-06-26 14:16:14 -07:00
prompts Fix for #6431 - chatprompt template with partial variables giving validation error (#6456) 2023-06-19 22:08:15 -07:00
retrievers Harrison/myscale self query (#6376) 2023-06-18 16:53:10 -07:00
tools add async to zapier nla tools (#6791) 2023-06-27 16:53:35 -07:00
utilities Fix graphql tool (#4984) 2023-05-19 15:27:50 -07:00
vectorstores Add maximal relevance search to SKLearnVectorStore (#5430) 2023-05-30 16:13:33 -07:00
__init__.py
conftest.py Add pytest --only-extended and --only-core options (#4494) 2023-05-12 11:35:22 -04:00
test_bash.py Add Mastodon toots loader (#5036) 2023-05-22 16:43:07 -07:00
test_cache.py Add caching to BaseChatModel (issue #1644) (#5089) 2023-06-24 11:45:09 -07:00
test_dependencies.py Fix class promotion (#6187) 2023-06-18 16:55:18 -07:00
test_document_transformers.py
test_formatting.py
test_math_utils.py add get_top_k_cosine_similarity method to get max top k score and index (#5059) 2023-05-22 11:55:48 -07:00
test_pytest_config.py Block sockets for unit-tests (#4803) 2023-05-16 14:41:24 -04:00
test_python.py option for csv agent to not include df in prompt (#4610) 2023-05-12 21:55:22 -07:00
test_schema.py
test_sql_database_schema.py
test_sql_database.py Fix SQLAlchemy truncating text when it is too big (#5206) 2023-06-01 21:33:31 -04:00
test_text_splitter.py MD header text splitter returns Documents (#6571) 2023-06-22 09:25:38 -07:00