langchain/docs/extras/integrations/document_loaders/grobid.ipynb

Grobid parser for Scientific Articles from PDF (#6729) ### Scientific Article PDF Parsing via Grobid `Description:` This change adds the GrobidParser class, which uses the Grobid library to parse scientific articles into a universal XML format containing the article title, references, sections, section text etc. The GrobidParser uses a local Grobid server to return PDFs document as XML and parses the XML to optionally produce documents of individual sentences or of whole paragraphs. Metadata includes the text, paragraph number, pdf relative bboxes, pages (text may overlap over two pages), section title (Introduction, Methodology etc), section_number (i.e 1.1, 2.3), the title of the paper and finally the file path. Grobid parsing is useful beyond standard pdf parsing as it accurately outputs sections and paragraphs within them. This allows for post-fitering of results for specific sections i.e. limiting results to the methodology section or results. While sections are split via headings, ideally they could be classified specifically into introduction, methodology, results, discussion, conclusion. I'm currently experimenting with chatgpt-3.5 for this function, which could later be implemented as a textsplitter. `Dependencies:` For use, the grobid repo must be cloned and Java must be installed, for colab this is: ``` !apt-get install -y openjdk-11-jdk -q !update-alternatives --set java /usr/lib/jvm/java-11-openjdk-amd64/bin/java !git clone https://github.com/kermitt2/grobid.git os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64" os.chdir('grobid') !./gradlew clean install ``` Once installed the server is ran on localhost:8070 via ``` get_ipython().system_raw('nohup ./gradlew run > grobid.log 2>&1 &') ``` @rlancemartin, @eyurtsev Twitter Handle: @Corranmac Grobid Demo Notebook is [here](https://colab.research.google.com/drive/1X-St_mQRmmm8YWtct_tcJNtoktbdGBmd?usp=sharing). --------- Co-authored-by: rlm <pexpresss31@gmail.com>
2023-06-29 21:29:29 +00:00
{
"cells": [
{
"cell_type": "markdown",
"id": "bdccb278",
"metadata": {},
"source": [
"# Grobid\n",
"\n",
"GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents.\n",
"\n",
"It is designed and expected to be used to parse academic papers, where it works particularly well. Note: if the articles supplied to Grobid are large documents (e.g. dissertations) exceeding a certain number of elements, they might not be processed. \n",
"\n",
"This loader uses Grobid to parse PDFs into `Documents` that retain metadata associated with the section of text.\n",
"\n",
"---\n",
"The best approach is to install Grobid via docker, see https://grobid.readthedocs.io/en/latest/Grobid-docker/. \n",
"\n",
"(Note: additional instructions can be found [here](https://python.langchain.com/docs/extras/integrations/providers/grobid.mdx).)\n",
"\n",
"Once grobid is up-and-running you can interact as described below. \n"
]
},
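{
"cell_type": "markdown",
"id": "grobid-alive-note",
"metadata": {},
"source": [
"Before loading documents, it can help to confirm that the server is reachable. The cell below is a minimal sketch, not part of the loader itself: it assumes Grobid is listening on its default local address, `http://localhost:8070`, and calls the service's `/api/isalive` health endpoint using only the standard library. Adjust the URL if your container maps a different port."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "grobid-alive-check",
"metadata": {},
"outputs": [],
"source": [
"from urllib.error import URLError\n",
"from urllib.request import urlopen\n",
"\n",
"# Assumption: Grobid runs locally on its default port (8070).\n",
"GROBID_ALIVE_URL = \"http://localhost:8070/api/isalive\"\n",
"\n",
"try:\n",
"    # A short timeout keeps the notebook from hanging if nothing is listening.\n",
"    with urlopen(GROBID_ALIVE_URL, timeout=5) as response:\n",
"        print(\"Grobid responded:\", response.read().decode().strip())\n",
"except (URLError, OSError) as exc:\n",
"    print(\"Grobid is not reachable:\", exc)"
]
},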
{
"cell_type": "markdown",
"id": "4b41bfb1",
"metadata": {},
"source": [
"Now, we can use the data loader."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "640e9a4b",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders.parsers import GrobidParser\n",
"from langchain.document_loaders.generic import GenericLoader"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ecdc1fb9",
"metadata": {},
"outputs": [],
"source": [
"loader = GenericLoader.from_filesystem(\n",
" \"../Papers/\",\n",
" glob=\"*\",\n",
" suffixes=[\".pdf\"],\n",
" parser=GrobidParser(segment_sentences=False),\n",
")\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "efe9e356",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Unlike Chinchilla, PaLM, or GPT-3, we only use publicly available data, making our work compatible with open-sourcing, while most existing models rely on data which is either not publicly available or undocumented (e.g.\"Books -2TB\" or \"Social media conversations\").There exist some exceptions, notably OPT (Zhang et al., 2022), GPT-NeoX (Black et al., 2022), BLOOM (Scao et al., 2022) and GLM (Zeng et al., 2022), but none that are competitive with PaLM-62B or Chinchilla.'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[3].page_content"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "5be03d17",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'text': 'Unlike Chinchilla, PaLM, or GPT-3, we only use publicly available data, making our work compatible with open-sourcing, while most existing models rely on data which is either not publicly available or undocumented (e.g.\"Books -2TB\" or \"Social media conversations\").There exist some exceptions, notably OPT (Zhang et al., 2022), GPT-NeoX (Black et al., 2022), BLOOM (Scao et al., 2022) and GLM (Zeng et al., 2022), but none that are competitive with PaLM-62B or Chinchilla.',\n",
" 'para': '2',\n",
" 'bboxes': \"[[{'page': '1', 'x': '317.05', 'y': '509.17', 'h': '207.73', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '522.72', 'h': '220.08', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '536.27', 'h': '218.27', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '549.82', 'h': '218.65', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '563.37', 'h': '136.98', 'w': '9.46'}], [{'page': '1', 'x': '446.49', 'y': '563.37', 'h': '78.11', 'w': '9.46'}, {'page': '1', 'x': '304.69', 'y': '576.92', 'h': '138.32', 'w': '9.46'}], [{'page': '1', 'x': '447.75', 'y': '576.92', 'h': '76.66', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '590.47', 'h': '219.63', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '604.02', 'h': '218.27', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '617.56', 'h': '218.27', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '631.11', 'h': '220.18', 'w': '9.46'}]]\",\n",
" 'pages': \"('1', '1')\",\n",
" 'section_title': 'Introduction',\n",
" 'section_number': '1',\n",
" 'paper_title': 'LLaMA: Open and Efficient Foundation Language Models',\n",
" 'file_path': '/Users/31treehaus/Desktop/Papers/2302.13971.pdf'}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[3].metadata"
]
}
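,
{
"cell_type": "markdown",
"id": "grobid-filter-note",
"metadata": {},
"source": [
"Because each `Document` keeps its section metadata, results can be filtered after loading. The cell below is a minimal sketch under two assumptions: `docs` is the list loaded earlier, and `bboxes` and `pages` are stored as Python-literal strings, as they appear in the output above. The section heading `Introduction` is taken from that same output; other papers may use different headings."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "grobid-filter-code",
"metadata": {},
"outputs": [],
"source": [
"import ast\n",
"\n",
"# Keep only the chunks whose section heading is \"Introduction\".\n",
"intro_docs = [d for d in docs if d.metadata.get(\"section_title\") == \"Introduction\"]\n",
"\n",
"if intro_docs:\n",
"    meta = intro_docs[0].metadata\n",
"    # Assumption: both fields are string representations of Python literals,\n",
"    # so they are decoded before use.\n",
"    bboxes = ast.literal_eval(meta[\"bboxes\"])\n",
"    pages = ast.literal_eval(meta[\"pages\"])\n",
"    print(len(intro_docs), pages, len(bboxes))"
]
}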
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}