### Scientific Article PDF Parsing via Grobid
`Description:`
This change adds the GrobidParser class, which uses the Grobid library
to parse scientific articles into a universal XML format containing the
article title, references, sections, section text etc. The GrobidParser
uses a local Grobid server to return PDFs document as XML and parses the
XML to optionally produce documents of individual sentences or of whole
paragraphs. Metadata includes the text, paragraph number, pdf relative
bboxes, pages (text may overlap over two pages), section title
(Introduction, Methodology etc), section_number (i.e 1.1, 2.3), the
title of the paper and finally the file path.
Grobid parsing is useful beyond standard pdf parsing as it accurately
outputs sections and paragraphs within them. This allows for
post-fitering of results for specific sections i.e. limiting results to
the methodology section or results. While sections are split via
headings, ideally they could be classified specifically into
introduction, methodology, results, discussion, conclusion. I'm
currently experimenting with chatgpt-3.5 for this function, which could
later be implemented as a textsplitter.
`Dependencies:`
For use, the grobid repo must be cloned and Java must be installed, for
colab this is:
```
!apt-get install -y openjdk-11-jdk -q
!update-alternatives --set java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
!git clone https://github.com/kermitt2/grobid.git
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.chdir('grobid')
!./gradlew clean install
```
Once installed the server is ran on localhost:8070 via
```
get_ipython().system_raw('nohup ./gradlew run > grobid.log 2>&1 &')
```
@rlancemartin, @eyurtsev
Twitter Handle: @Corranmac
Grobid Demo Notebook is
[here](https://colab.research.google.com/drive/1X-St_mQRmmm8YWtct_tcJNtoktbdGBmd?usp=sharing).
---------
Co-authored-by: rlm <pexpresss31@gmail.com>
#### Summary
A new approach to loading source code is implemented:
Each top-level function and class in the code is loaded into separate
documents. Then, an additional document is created with the top-level
code, but without the already loaded functions and classes.
This could improve the accuracy of QA chains over source code.
For instance, having this script:
```
class MyClass:
def __init__(self, name):
self.name = name
def greet(self):
print(f"Hello, {self.name}!")
def main():
name = input("Enter your name: ")
obj = MyClass(name)
obj.greet()
if __name__ == '__main__':
main()
```
The loader will create three documents with this content:
First document:
```
class MyClass:
def __init__(self, name):
self.name = name
def greet(self):
print(f"Hello, {self.name}!")
```
Second document:
```
def main():
name = input("Enter your name: ")
obj = MyClass(name)
obj.greet()
```
Third document:
```
# Code for: class MyClass:
# Code for: def main():
if __name__ == '__main__':
main()
```
A threshold parameter is added to control whether small scripts are
split in this way or not.
At this moment, only Python and JavaScript are supported. The
appropriate parser is determined by examining the file extension.
#### Tests
This PR adds:
- Unit tests
- Integration tests
#### Dependencies
Only one dependency was added as optional (needed for the JavaScript
parser).
#### Documentation
A notebook is added showing how the loader can be used.
#### Who can review?
@eyurtsev @hwchase17
---------
Co-authored-by: rlm <pexpresss31@gmail.com>
# OpenAIWhisperParser
This PR creates a new parser, `OpenAIWhisperParser`, that uses the
[OpenAI Whisper
model](https://platform.openai.com/docs/guides/speech-to-text/quickstart)
to perform transcription of audio files to text (`Documents`). Please
see the notebook for usage.
# Add bs4 html parser
* Some minor refactors
* Extract the bs4 html parsing code from the bs html loader
* Move some tests from integration tests to unit tests
# Fix Telegram API loader + add tests.
I was testing this integration and it was broken with next error:
```python
message_threads = loader._get_message_threads(df)
KeyError: False
```
Also, this particular loader didn't have any tests / related group in
poetry, so I added those as well.
@hwchase17 / @eyurtsev please take a look on this fix PR.
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>