langchain/docs/modules/indexes/document_loaders/examples/bibtex.ipynb
Eugene Yurtsev 5cfa72a130
Bibtex integration for document loader and retriever (#5137)
# Bibtex integration

Wrap bibtexparser to retrieve a list of docs from a bibtex file.
* Get the metadata from the bibtex entries
* `page_content` get from the local pdf referenced in the `file` field
of the bibtex entry using `pymupdf`
* If no valid pdf file, `page_content` set to the `abstract` field of
the bibtex entry
* Support Zotero flavour using regex to get the file path
* Added usage example in
`docs/modules/indexes/document_loaders/examples/bibtex.ipynb`
---------

Co-authored-by: Sébastien M. Popoff <sebastien.popoff@espci.fr>
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
2023-05-25 00:21:31 -07:00

191 lines
6.3 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

{
"cells": [
{
"cell_type": "markdown",
"id": "bda1f3f5",
"metadata": {},
"source": [
"# BibTeX\n",
"\n",
"> BibTeX is a file format and reference management system commonly used in conjunction with LaTeX typesetting. It serves as a way to organize and store bibliographic information for academic and research documents.\n",
"\n",
"BibTeX files have a .bib extension and consist of plain text entries representing references to various publications, such as books, articles, conference papers, theses, and more. Each BibTeX entry follows a specific structure and contains fields for different bibliographic details like author names, publication title, journal or book title, year of publication, page numbers, and more.\n",
"\n",
"Bibtex files can also store the path to documents, such as `.pdf` files that can be retrieved."
]
},
{
"cell_type": "markdown",
"id": "1b7a1eef-7bf7-4e7d-8bfc-c4e27c9488cb",
"metadata": {},
"source": [
"## Installation\n",
"First, you need to install `bibtexparser` and `PyMuPDF`."
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "b674aaea-ed3a-4541-8414-260a8f67f623",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"#!pip install bibtexparser pymupdf"
]
},
{
"cell_type": "markdown",
"id": "95f05e1c-195e-4e2b-ae8e-8d6637f15be6",
"metadata": {},
"source": [
"## Examples"
]
},
{
"cell_type": "markdown",
"id": "e29b954c-1407-4797-ae21-6ba8937156be",
"metadata": {},
"source": [
"`BibtexLoader` has these arguments:\n",
"- `file_path`: the path the the `.bib` bibtex file\n",
"- optional `max_docs`: default=None, i.e. not limit. Use it to limit number of retrieved documents.\n",
"- optional `max_content_chars`: default=4000. Use it to limit the number of characters in a single document.\n",
"- optional `load_extra_meta`: default=False. By default only the most important fields from the bibtex entries: `Published` (publication year), `Title`, `Authors`, `Summary`, `Journal`, `Keywords`, and `URL`. If True, it will also try to load return `entry_id`, `note`, `doi`, and `links` fields. \n",
"- optional `file_pattern`: default=`r'[^:]+\\.pdf'`. Regex pattern to find files in the `file` entry. Default pattern supports `Zotero` flavour bibtex style and bare file path."
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "9bfd5e46",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import BibtexLoader"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "01971b53",
"metadata": {},
"outputs": [],
"source": [
"# Create a dummy bibtex file and download a pdf.\n",
"import urllib.request\n",
"\n",
"urllib.request.urlretrieve(\"https://www.fourmilab.ch/etexts/einstein/specrel/specrel.pdf\", \"einstein1905.pdf\")\n",
"\n",
"bibtex_text = \"\"\"\n",
" @article{einstein1915,\n",
" title={Die Feldgleichungen der Gravitation},\n",
" abstract={Die Grundgleichungen der Gravitation, die ich hier entwickeln werde, wurden von mir in einer Abhandlung: ,,Die formale Grundlage der allgemeinen Relativit{\\\"a}tstheorie`` in den Sitzungsberichten der Preu{\\ss}ischen Akademie der Wissenschaften 1915 ver{\\\"o}ffentlicht.},\n",
" author={Einstein, Albert},\n",
" journal={Sitzungsberichte der K{\\\"o}niglich Preu{\\ss}ischen Akademie der Wissenschaften},\n",
" volume={1915},\n",
" number={1},\n",
" pages={844--847},\n",
" year={1915},\n",
" doi={10.1002/andp.19163540702},\n",
" link={https://onlinelibrary.wiley.com/doi/abs/10.1002/andp.19163540702},\n",
" file={einstein1905.pdf}\n",
" }\n",
" \"\"\"\n",
"# save bibtex_text to biblio.bib file\n",
"with open(\"./biblio.bib\", \"w\") as file:\n",
" file.write(bibtex_text)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "2631f46b",
"metadata": {},
"outputs": [],
"source": [
"docs = BibtexLoader(\"./biblio.bib\").load()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "33ef1fb2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'id': 'einstein1915',\n",
" 'published_year': '1915',\n",
" 'title': 'Die Feldgleichungen der Gravitation',\n",
" 'publication': 'Sitzungsberichte der K{\"o}niglich Preu{\\\\ss}ischen Akademie der Wissenschaften',\n",
" 'authors': 'Einstein, Albert',\n",
" 'abstract': 'Die Grundgleichungen der Gravitation, die ich hier entwickeln werde, wurden von mir in einer Abhandlung: ,,Die formale Grundlage der allgemeinen Relativit{\"a}tstheorie`` in den Sitzungsberichten der Preu{\\\\ss}ischen Akademie der Wissenschaften 1915 ver{\"o}ffentlicht.',\n",
" 'url': 'https://doi.org/10.1002/andp.19163540702'}"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0].metadata"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "46969806-45a9-4c4d-a61b-cfb9658fc9de",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ON THE ELECTRODYNAMICS OF MOVING\n",
"BODIES\n",
"By A. EINSTEIN\n",
"June 30, 1905\n",
"It is known that Maxwells electrodynamics—as usually understood at the\n",
"present time—when applied to moving bodies, leads to asymmetries which do\n",
"not appear to be inherent in the phenomena. Take, for example, the recipro-\n",
"cal electrodynamic action of a magnet and a conductor. The observable phe-\n",
"nomenon here depends only on the r\n"
]
}
],
"source": [
"print(docs[0].page_content[:400]) # all pages of the pdf content"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}