langchain

mirror of https://github.com/hwchase17/langchain synced 2024-10-29 17:07:25 +00:00

corranmac 20c6ade2fc Grobid parser for Scientific Articles from PDF (#6729)

### Scientific Article PDF Parsing via Grobid

`Description:`
This change adds the GrobidParser class, which uses the Grobid library to parse scientific articles into a universal XML format containing the article title, references, sections, section text, etc. The GrobidParser uses a local Grobid server to return PDF documents as XML and parses the XML to optionally produce documents of individual sentences or of whole paragraphs. Metadata includes the text, paragraph number, PDF-relative bboxes, pages (text may span two pages), section title (Introduction, Methodology, etc.), section_number (e.g. 1.1, 2.3), the title of the paper, and finally the file path.

Grobid parsing is useful beyond standard PDF parsing as it accurately outputs sections and the paragraphs within them. This allows for post-filtering of results for specific sections, e.g. limiting results to the methodology or results section. While sections are split via headings, ideally they could be classified specifically into introduction, methodology, results, discussion, conclusion. I'm currently experimenting with chatgpt-3.5 for this function, which could later be implemented as a textsplitter.

`Dependencies:`
For use, the grobid repo must be cloned and Java must be installed. For Colab this is:

```
import os  # needed for os.environ and os.chdir below

!apt-get install -y openjdk-11-jdk -q
!update-alternatives --set java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
!git clone https://github.com/kermitt2/grobid.git
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.chdir('grobid')
!./gradlew clean install
```

Once installed, the server is run on localhost:8070 via

```
get_ipython().system_raw('nohup ./gradlew run > grobid.log 2>&1 &')
```

@rlancemartin, @eyurtsev
Twitter Handle: @Corranmac

Grobid Demo Notebook is [here](https://colab.research.google.com/drive/1X-St_mQRmmm8YWtct_tcJNtoktbdGBmd?usp=sharing).
--------- Co-authored-by: rlm <pexpresss31@gmail.com>		2023-06-29 14:29:29 -07:00
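
The post-filtering by section described in the commit message could look roughly like the sketch below. It does not call GrobidParser or a Grobid server. `Doc` is a hypothetical stand-in for a parsed chunk, and the metadata keys `section_title` and `section_number` are assumptions based on the field names listed in the description, so check them against the actual parser output before relying on them.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for one parsed chunk: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_section(docs, keyword):
    """Keep chunks whose section title contains `keyword`, ignoring case."""
    needle = keyword.lower()
    return [
        d for d in docs
        if needle in str(d.metadata.get("section_title", "")).lower()
    ]


# Made-up sample chunks; the metadata keys are assumed, not confirmed.
docs = [
    Doc("We study ...", {"section_title": "Introduction", "section_number": "1"}),
    Doc("Samples were ...", {"section_title": "Methodology", "section_number": "2.1"}),
    Doc("Accuracy improved ...", {"section_title": "Results", "section_number": "3"}),
]

# Limit results to the methodology section.
methods = filter_by_section(docs, "method")
print([d.page_content for d in methods])
```

Matching on a substring of the heading is a deliberate simplification: headings vary between papers ("Methods", "Methodology", "Materials and Methods"), which is the gap the section classification mentioned in the description would close.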
example_data	feat: Add `UnstructuredOrgModeLoader` (#6842 )	2023-06-27 16:34:17 -07:00
acreom.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
airbyte_json.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
airtable.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
alibaba_cloud_maxcompute.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
apify_dataset.ipynb	docs/fix links (#6498 )	2023-06-20 14:06:50 -07:00
arxiv.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
aws_s3_directory.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
aws_s3_file.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
azlyrics.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
azure_blob_storage_container.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
azure_blob_storage_file.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
bibtex.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
bilibili.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
blackboard.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
blockchain.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
chatgpt_loader.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
college_confidential.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
confluence.ipynb	fix titles in documentation	2023-06-17 11:09:11 -07:00
conll-u.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
copypaste.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
csv.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
diffbot.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
discord.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
docugami.ipynb	docs/fix links (#6498 )	2023-06-20 14:06:50 -07:00
duckdb.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
email.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
embaas.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
epub.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
evernote.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
excel.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
facebook_chat.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
fauna.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
figma.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
git.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
gitbook.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
github.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
google_bigquery.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
google_cloud_storage_directory.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
google_cloud_storage_file.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
google_drive.ipynb	Harrison/gdrive enhancements (#6375 )	2023-06-18 11:07:23 -07:00
grobid.ipynb	Grobid parser for Scientific Articles from PDF (#6729 )	2023-06-29 14:29:29 -07:00
gutenberg.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
hacker_news.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
hugging_face_dataset.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
ifixit.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
image_captions.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
image.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
imsdb.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
iugu.ipynb	Minor Grammar Fixes in Docs and Comments (#6536 )	2023-06-21 09:53:31 -07:00
joplin.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
jupyter_notebook.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
larksuite.ipynb	feat (documents): add LarkSuite document loader (#6420 )	2023-06-27 23:08:05 -07:00
mastodon.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
mediawikidump.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
merge_doc_loader.ipynb	Create merge loader that combines documents from a set of loaders (#6659 )	2023-06-23 13:02:48 -07:00
mhtml.ipynb	Added a MHTML document loader (#6311 )	2023-06-25 13:12:08 -07:00
microsoft_onedrive.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
microsoft_powerpoint.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
microsoft_word.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
modern_treasury.ipynb	Minor Grammar Fixes in Docs and Comments (#6536 )	2023-06-21 09:53:31 -07:00
notion.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
notiondb.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
obsidian.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
odt.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
open_city_data.ipynb	Loader for OpenCityData and minor cleanups to Pandas, Airtable loaders (#6301 )	2023-06-22 22:20:42 -07:00
org_mode.ipynb	feat: Add `UnstructuredOrgModeLoader` (#6842 )	2023-06-27 16:34:17 -07:00
pandas_dataframe.ipynb	Loader for OpenCityData and minor cleanups to Pandas, Airtable loaders (#6301 )	2023-06-22 22:20:42 -07:00
psychic.ipynb	docs/fix links (#6498 )	2023-06-20 14:06:50 -07:00
pyspark_dataframe.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
readthedocs_documentation.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
recursive_url_loader.ipynb	`RecusiveUrlLoader` to `RecursiveUrlLoader` (#6787 )	2023-06-26 23:12:14 -07:00
reddit.ipynb	docs/fix links (#6498 )	2023-06-20 14:06:50 -07:00
roam.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
rst.ipynb	feat: Add `UnstructuredRSTLoader` (#6594 )	2023-06-25 12:41:57 -07:00
sitemap.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
slack.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
snowflake.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
source_code.ipynb	feat (documents): add a source code loader based on AST manipulation (#6486 )	2023-06-27 15:58:47 -07:00
spreedly.ipynb	Minor Grammar Fixes in Docs and Comments (#6536 )	2023-06-21 09:53:31 -07:00
stripe.ipynb	Minor Grammar Fixes in Docs and Comments (#6536 )	2023-06-21 09:53:31 -07:00
subtitle.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
telegram.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
tencent_cos_directory.ipynb	feat(document_loaders): add tencent cos directory and file loader (#6401 )	2023-06-27 23:07:20 -07:00
tencent_cos_file.ipynb	feat(document_loaders): add tencent cos directory and file loader (#6401 )	2023-06-27 23:07:20 -07:00
tomarkdown.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
toml.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
trello.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
twitter.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
unstructured_file.ipynb	Docs/unstructured api key (#6781 )	2023-06-27 16:54:15 -07:00
url.ipynb	Add markdown to specify important arguments (#6246 )	2023-06-18 17:47:00 -07:00
weather.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
web_base.ipynb	Web Loader: Add proxy support (#6792 )	2023-06-27 22:27:49 -07:00
whatsapp_chat.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
wikipedia.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
xml.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
youtube_audio.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00
youtube_transcript.ipynb	Doc refactor (#6300 )	2023-06-16 11:52:56 -07:00