langchain/docs/extras/integrations/document_loaders/grobid.ipynb
Luca Foppiano dfb93dd2b5
Improved grobid documentation (#9025)
- Description: Improvement in the Grobid loader documentation, typos and
suggesting to use the docker image instead of installing Grobid in local
(the documentation was also limited to Mac, while docker allow running
in any platform)
  - Tag maintainer: @rlancemartin, @eyurtsev
  - Twitter handle: @whitenoise
2023-08-10 10:47:22 -04:00

131 lines
5.0 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"id": "bdccb278",
"metadata": {},
"source": [
"# Grobid\n",
"\n",
"GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents.\n",
"\n",
"It is designed and expected to be used to parse academic papers, where it works particularly well. Note: if the articles supplied to Grobid are large documents (e.g. dissertations) exceeding a certain number of elements, they might not be processed. \n",
"\n",
"This loader uses Grobid to parse PDFs into `Documents` that retain metadata associated with the section of text.\n",
"\n",
"---\n",
"The best approach is to install Grobid via docker, see https://grobid.readthedocs.io/en/latest/Grobid-docker/. \n",
"\n",
"(Note: additional instructions can be found [here](https://python.langchain.com/docs/extras/integrations/providers/grobid.mdx).)\n",
"\n",
"Once grobid is up-and-running you can interact as described below. \n"
]
},
{
"cell_type": "markdown",
"id": "4b41bfb1",
"metadata": {},
"source": [
"Now, we can use the data loader."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "640e9a4b",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders.parsers import GrobidParser\n",
"from langchain.document_loaders.generic import GenericLoader"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ecdc1fb9",
"metadata": {},
"outputs": [],
"source": [
"loader = GenericLoader.from_filesystem(\n",
" \"../Papers/\",\n",
" glob=\"*\",\n",
" suffixes=[\".pdf\"],\n",
" parser=GrobidParser(segment_sentences=False),\n",
")\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "efe9e356",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Unlike Chinchilla, PaLM, or GPT-3, we only use publicly available data, making our work compatible with open-sourcing, while most existing models rely on data which is either not publicly available or undocumented (e.g.\"Books -2TB\" or \"Social media conversations\").There exist some exceptions, notably OPT (Zhang et al., 2022), GPT-NeoX (Black et al., 2022), BLOOM (Scao et al., 2022) and GLM (Zeng et al., 2022), but none that are competitive with PaLM-62B or Chinchilla.'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[3].page_content"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "5be03d17",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'text': 'Unlike Chinchilla, PaLM, or GPT-3, we only use publicly available data, making our work compatible with open-sourcing, while most existing models rely on data which is either not publicly available or undocumented (e.g.\"Books -2TB\" or \"Social media conversations\").There exist some exceptions, notably OPT (Zhang et al., 2022), GPT-NeoX (Black et al., 2022), BLOOM (Scao et al., 2022) and GLM (Zeng et al., 2022), but none that are competitive with PaLM-62B or Chinchilla.',\n",
" 'para': '2',\n",
" 'bboxes': \"[[{'page': '1', 'x': '317.05', 'y': '509.17', 'h': '207.73', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '522.72', 'h': '220.08', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '536.27', 'h': '218.27', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '549.82', 'h': '218.65', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '563.37', 'h': '136.98', 'w': '9.46'}], [{'page': '1', 'x': '446.49', 'y': '563.37', 'h': '78.11', 'w': '9.46'}, {'page': '1', 'x': '304.69', 'y': '576.92', 'h': '138.32', 'w': '9.46'}], [{'page': '1', 'x': '447.75', 'y': '576.92', 'h': '76.66', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '590.47', 'h': '219.63', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '604.02', 'h': '218.27', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '617.56', 'h': '218.27', 'w': '9.46'}, {'page': '1', 'x': '306.14', 'y': '631.11', 'h': '220.18', 'w': '9.46'}]]\",\n",
" 'pages': \"('1', '1')\",\n",
" 'section_title': 'Introduction',\n",
" 'section_number': '1',\n",
" 'paper_title': 'LLaMA: Open and Efficient Foundation Language Models',\n",
" 'file_path': '/Users/31treehaus/Desktop/Papers/2302.13971.pdf'}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[3].metadata"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}