community[minor]: add support for llmsherpa (#19741)

Thank you for contributing to LangChain!

- [x] **PR title**: "community: added support for llmsherpa library"

- [x] **Add tests and docs**: 
1. Integration test:
`docs/docs/integrations/document_loaders/test_llmsherpa.py`.
2. An example notebook:
`docs/docs/integrations/document_loaders/llmsherpa.ipynb`.


- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, hwchase17.

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>

@@ -0,0 +1,419 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "7f5437a835409a57",
"metadata": {
"collapsed": false
},
"source": [
"# LLM Sherpa\n",
"\n",
"This notebook covers how to use `LLM Sherpa` to load files of many types. `LLM Sherpa` supports different file formats including DOCX, PPTX, HTML, TXT, and XML.\n",
"\n",
"`LLMSherpaFileLoader` use LayoutPDFReader, which is part of the LLMSherpa library. This tool is designed to parse PDFs while preserving their layout information, which is often lost when using most PDF to text parsers.\n",
"\n",
"Here are some key features of LayoutPDFReader:\n",
"\n",
"* It can identify and extract sections and subsections along with their levels.\n",
"* It combines lines to form paragraphs.\n",
"* It can identify links between sections and paragraphs.\n",
"* It can extract tables along with the section the tables are found in.\n",
"* It can identify and extract lists and nested lists.\n",
"* It can join content spread across pages.\n",
"* It can remove repeating headers and footers.\n",
"* It can remove watermarks.\n",
"\n",
"check [llmsherpa](https://llmsherpa.readthedocs.io/en/latest/) documentation.\n",
"\n",
"`INFO: this library fail with some pdf files so use it with caution.`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "initial_id",
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Install package\n",
"# !pip install --upgrade --quiet llmsherpa"
]
},
{
"cell_type": "markdown",
"id": "baa8d2672ac6dd4b",
"metadata": {
"collapsed": false
},
"source": [
"## LLMSherpaFileLoader\n",
"\n",
"Under the hood LLMSherpaFileLoader defined some strategist to load file content: [\"sections\", \"chunks\", \"html\", \"text\"], setup [nlm-ingestor](https://github.com/nlmatics/nlm-ingestor) to get `llmsherpa_api_url` or use the default."
]
},
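{
"cell_type": "markdown",
"id": "nlm_ingestor_setup_md",
"metadata": {
"collapsed": false
},
"source": [
"To run a local [nlm-ingestor](https://github.com/nlmatics/nlm-ingestor) server and point `llmsherpa_api_url` at it, the next cell gives a minimal sketch; the image name and port mapping follow the nlm-ingestor README, so verify them against that project's docs. If you skip this step, the loader simply uses the default hosted endpoint."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "nlm_ingestor_setup_code",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch: start a local nlm-ingestor server (verify the image name and ports against its README)\n",
"# !docker pull ghcr.io/nlmatics/nlm-ingestor:latest\n",
"# !docker run -d -p 5010:5001 ghcr.io/nlmatics/nlm-ingestor:latest\n",
"# llmsherpa_api_url = \"http://localhost:5010/api/parseDocument?renderFormat=all\""
]
},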
{
"cell_type": "markdown",
"id": "6fb0104dde44091b",
"metadata": {
"collapsed": false
},
"source": [
"### sections strategy: return the file parsed into sections"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "14150b3110143a43",
"metadata": {
"ExecuteTime": {
"end_time": "2024-03-28T23:06:03.648268Z",
"start_time": "2024-03-28T23:05:51.734372Z"
},
"collapsed": false
},
"outputs": [],
"source": [
"from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader\n",
"\n",
"loader = LLMSherpaFileLoader(\n",
" file_path=\"https://arxiv.org/pdf/2402.14207.pdf\",\n",
" new_indent_parser=True,\n",
" apply_ocr=True,\n",
" strategy=\"sections\",\n",
" llmsherpa_api_url=\"http://localhost:5010/api/parseDocument?renderFormat=all\",\n",
")\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e639aa0010ed3579",
"metadata": {
"ExecuteTime": {
"end_time": "2024-03-28T23:06:11.568739Z",
"start_time": "2024-03-28T23:06:11.557702Z"
},
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": "Document(page_content='Abstract\\nWe study how to apply large language models to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages.\\nThis underexplored problem poses new challenges at the pre-writing stage, including how to research the topic and prepare an outline prior to writing.\\nWe propose STORM, a writing system for the Synthesis of Topic Outlines through\\nReferences\\nFull-length Article\\nTopic\\nOutline\\n2022 Winter Olympics\\nOpening Ceremony\\nResearch via Question Asking\\nRetrieval and Multi-perspective Question Asking.\\nSTORM models the pre-writing stage by\\nLLM\\n(1) discovering diverse perspectives in researching the given topic, (2) simulating conversations where writers carrying different perspectives pose questions to a topic expert grounded on trusted Internet sources, (3) curating the collected information to create an outline.\\nFor evaluation, we curate FreshWiki, a dataset of recent high-quality Wikipedia articles, and formulate outline assessments to evaluate the pre-writing stage.\\nWe further gather feedback from experienced Wikipedia editors.\\nCompared to articles generated by an outlinedriven retrieval-augmented baseline, more of STORMs articles are deemed to be organized (by a 25% absolute increase) and broad in coverage (by 10%).\\nThe expert feedback also helps identify new challenges for generating grounded long articles, such as source bias transfer and over-association of unrelated facts.\\n1. Can you provide any information about the transportation arrangements for the opening ceremony?\\nLLM\\n2. Can you provide any information about the budget for the 2022 Winter Olympics opening ceremony?…\\nLLM- Role1\\nLLM- Role2\\nLLM- Role1', metadata={'source': 'https://arxiv.org/pdf/2402.14207.pdf', 'section_number': 1, 'section_title': 'Abstract'})"
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[1]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "818977c1a0505814",
"metadata": {
"ExecuteTime": {
"end_time": "2024-03-28T23:06:28.900386Z",
"start_time": "2024-03-28T23:06:28.891805Z"
},
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": "79"
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(docs)"
]
},
{
"cell_type": "markdown",
"id": "e424ce828ea64c01",
"metadata": {
"collapsed": false
},
"source": [
"### chunks strategy: return the file parsed into chunks"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "4c0ff1a52b9dd4e3",
"metadata": {
"ExecuteTime": {
"end_time": "2024-03-28T23:06:44.507836Z",
"start_time": "2024-03-28T23:06:32.507326Z"
},
"collapsed": false
},
"outputs": [],
"source": [
"from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader\n",
"\n",
"loader = LLMSherpaFileLoader(\n",
" file_path=\"https://arxiv.org/pdf/2402.14207.pdf\",\n",
" new_indent_parser=True,\n",
" apply_ocr=True,\n",
" strategy=\"chunks\",\n",
" llmsherpa_api_url=\"http://localhost:5010/api/parseDocument?renderFormat=all\",\n",
")\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "33dc25e83f6e0430",
"metadata": {
"ExecuteTime": {
"end_time": "2024-03-28T23:06:49.951741Z",
"start_time": "2024-03-28T23:06:49.938331Z"
},
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": "Document(page_content='Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models\\nStanford University {shaoyj, yuchengj, tkanell, peterxu, okhattab}@stanford.edu lam@cs.stanford.edu', metadata={'source': 'https://arxiv.org/pdf/2402.14207.pdf', 'chunk_number': 1, 'chunk_type': 'para'})"
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[1]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "2310e24f3d081cb4",
"metadata": {
"ExecuteTime": {
"end_time": "2024-03-28T23:06:56.933007Z",
"start_time": "2024-03-28T23:06:56.922196Z"
},
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": "306"
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(docs)"
]
},
{
"cell_type": "markdown",
"id": "6bb9b715b0d2b4b0",
"metadata": {
"collapsed": false
},
"source": [
"### html strategy: return the file as one html document"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "f3fbe9f3c4d8a6ee",
"metadata": {
"ExecuteTime": {
"end_time": "2024-03-28T22:59:15.869599Z",
"start_time": "2024-03-28T22:58:54.306814Z"
},
"collapsed": false
},
"outputs": [],
"source": [
"from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader\n",
"\n",
"loader = LLMSherpaFileLoader(\n",
" file_path=\"https://arxiv.org/pdf/2402.14207.pdf\",\n",
" new_indent_parser=True,\n",
" apply_ocr=True,\n",
" strategy=\"html\",\n",
" llmsherpa_api_url=\"http://localhost:5010/api/parseDocument?renderFormat=all\",\n",
")\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "b8fcbfcd58126e09",
"metadata": {
"ExecuteTime": {
"end_time": "2024-03-28T22:59:33.386455Z",
"start_time": "2024-03-28T22:59:33.381274Z"
},
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": "'<html><h1>Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models</h1><table><th><td colSpan=1>Yijia Shao</td><td colSpan=1>Yucheng Jiang</td><td colSpan=1>Theodore A. Kanell</td><td colSpan=1>Peter Xu</td></th><tr><td colSpan=1></td><td colSpan=1>Omar Khattab</td><td colSpan=1>Monica S. Lam</td><td colSpan=1></td></tr></table><p>Stanford University {shaoyj, yuchengj, '"
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0].page_content[:400]"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "8cbe691320144cf6",
"metadata": {
"ExecuteTime": {
"end_time": "2024-03-28T22:59:49.667979Z",
"start_time": "2024-03-28T22:59:49.661572Z"
},
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": "1"
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(docs)"
]
},
{
"cell_type": "markdown",
"id": "634af5a1c58a7766",
"metadata": {
"collapsed": false
},
"source": [
"### text strategy: return the file as one text document"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ee47c6e36c952534",
"metadata": {
"ExecuteTime": {
"end_time": "2024-03-28T23:04:56.549898Z",
"start_time": "2024-03-28T23:04:38.148264Z"
},
"collapsed": false
},
"outputs": [],
"source": [
"from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader\n",
"\n",
"loader = LLMSherpaFileLoader(\n",
" file_path=\"https://arxiv.org/pdf/2402.14207.pdf\",\n",
" new_indent_parser=True,\n",
" apply_ocr=True,\n",
" strategy=\"text\",\n",
" llmsherpa_api_url=\"http://localhost:5010/api/parseDocument?renderFormat=all\",\n",
")\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "998649675f14c50e",
"metadata": {
"ExecuteTime": {
"end_time": "2024-03-28T23:05:28.558467Z",
"start_time": "2024-03-28T23:05:28.543132Z"
},
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": "'Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models\\n | Yijia Shao | Yucheng Jiang | Theodore A. Kanell | Peter Xu\\n | --- | --- | --- | ---\\n | | Omar Khattab | Monica S. Lam | \\n\\nStanford University {shaoyj, yuchengj, tkanell, peterxu, okhattab}@stanford.edu lam@cs.stanford.edu\\nAbstract\\nWe study how to apply large language models to write grounded and organized long'"
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0].page_content[:400]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "7fec7a95023ea8e9",
"metadata": {
"ExecuteTime": {
"end_time": "2024-03-28T23:05:39.207693Z",
"start_time": "2024-03-28T23:05:39.199663Z"
},
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": "1"
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(docs)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@@ -102,6 +102,7 @@ _module_lookup = {
"JoplinLoader": "langchain_community.document_loaders.joplin",
"LakeFSLoader": "langchain_community.document_loaders.lakefs",
"LarkSuiteDocLoader": "langchain_community.document_loaders.larksuite",
"LLMSherpaFileLoader": "langchain_community.document_loaders.llmsherpa",
"MHTMLLoader": "langchain_community.document_loaders.mhtml",
"MWDumpLoader": "langchain_community.document_loaders.mediawikidump",
"MastodonTootsLoader": "langchain_community.document_loaders.mastodon",

@@ -0,0 +1,142 @@
from pathlib import Path
from typing import Iterator, Union
from urllib.parse import urlparse
from langchain_core.documents import Document
from langchain_community.document_loaders.pdf import BaseLoader
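# Default LLMSherpa parsing endpoint, used when no llmsherpa_api_url is provided.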
DEFAULT_API = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
class LLMSherpaFileLoader(BaseLoader):
"""Load Documents using `LLMSherpa`.
LLMSherpaFileLoader uses LayoutPDFReader, which is part of the LLMSherpa library.
This tool is designed to parse PDFs while preserving their layout information,
which is often lost when using most PDF-to-text parsers.
Examples
--------
from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader
loader = LLMSherpaFileLoader(
"example.pdf",
strategy="chunks",
llmsherpa_api_url="http://localhost:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()
"""
def __init__(
self,
file_path: Union[str, Path],
new_indent_parser: bool = True,
apply_ocr: bool = True,
strategy: str = "chunks",
llmsherpa_api_url: str = DEFAULT_API,
):
"""Initialize with a file path."""
try:
import llmsherpa # noqa:F401
except ImportError:
raise ImportError(
"llmsherpa package not found, please install it with "
"`pip install llmsherpa`"
)
_valid_strategies = ["sections", "chunks", "html", "text"]
if strategy not in _valid_strategies:
raise ValueError(
f"Got {strategy} for `strategy`, "
f"but should be one of `{_valid_strategies}`"
)
# validate llmsherpa url
if not self._is_valid_url(llmsherpa_api_url):
raise ValueError(f"Invalid URL: {llmsherpa_api_url}")
self.url = self._validate_llmsherpa_url(
url=llmsherpa_api_url,
new_indent_parser=new_indent_parser,
apply_ocr=apply_ocr,
)
self.strategy = strategy
self.file_path = str(file_path)
@staticmethod
def _is_valid_url(url: str) -> bool:
"""Check if the url is valid."""
parsed = urlparse(url)
return bool(parsed.netloc) and bool(parsed.scheme)
@staticmethod
def _validate_llmsherpa_url(
url: str, new_indent_parser: bool = True, apply_ocr: bool = True
) -> str:
"""Check if the llmsherpa url is valid."""
parsed = urlparse(url)
valid_url = url
if ("/api/parseDocument" not in parsed.path) and (
"/api/document/developer/parseDocument" not in parsed.path
):
raise ValueError(f"Invalid LLMSherpa URL: {url}")
if "renderFormat=all" not in parsed.query:
valid_url = valid_url + "?renderFormat=all"
if new_indent_parser and "useNewIndentParser=true" not in parsed.query:
valid_url = valid_url + "&useNewIndentParser=true"
if apply_ocr and "applyOcr=yes" not in parsed.query:
valid_url = valid_url + "&applyOcr=yes"
return valid_url
def lazy_load(
self,
) -> Iterator[Document]:
"""Load file."""
from llmsherpa.readers import LayoutPDFReader
docs_reader = LayoutPDFReader(self.url)
doc = docs_reader.read_pdf(self.file_path)
if self.strategy == "sections":
yield from [
Document(
page_content=section.to_text(include_children=True, recurse=True),
metadata={
"source": self.file_path,
"section_number": section_num,
"section_title": section.title,
},
)
for section_num, section in enumerate(doc.sections())
]
if self.strategy == "chunks":
yield from [
Document(
page_content=chunk.to_context_text(),
metadata={
"source": self.file_path,
"chunk_number": chunk_num,
"chunk_type": chunk.tag,
},
)
for chunk_num, chunk in enumerate(doc.chunks())
]
if self.strategy == "html":
yield from [
Document(
page_content=doc.to_html(),
metadata={
"source": self.file_path,
},
)
]
if self.strategy == "text":
yield from [
Document(
page_content=doc.to_text(),
metadata={
"source": self.file_path,
},
)
]
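
As a quick reference, here is a minimal usage sketch of the loader's `lazy_load` method, which the notebook above does not cover. It assumes the optional `llmsherpa` package is installed and that the default hosted API (or a local nlm-ingestor `llmsherpa_api_url`) is reachable:

from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader

loader = LLMSherpaFileLoader(
    file_path="https://arxiv.org/pdf/2402.14207.pdf",
    strategy="sections",
)

# lazy_load yields Documents one at a time instead of materializing the full list,
# which helps when a file parses into many sections.
for doc in loader.lazy_load():
    print(doc.metadata["section_title"])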

@@ -0,0 +1,46 @@
from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader
file_path = "https://arxiv.org/pdf/2402.14207.pdf"
def test_llmsherpa_file_loader_initialization() -> None:
loader = LLMSherpaFileLoader(
file_path=file_path,
)
docs = loader.load()
assert isinstance(loader, LLMSherpaFileLoader)
assert hasattr(docs, "__iter__")
assert loader.strategy == "chunks"
assert (
loader.url
== "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all&useNewIndentParser=true&applyOcr=yes"
)
assert len(docs) > 1
def test_apply_ocr() -> None:
loader = LLMSherpaFileLoader(
file_path=file_path,
apply_ocr=True,
new_indent_parser=False,
)
docs = loader.load()
assert (
loader.url
== "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all&applyOcr=yes"
)
assert len(docs) > 1
def test_new_indent_parser() -> None:
loader = LLMSherpaFileLoader(
file_path=file_path,
apply_ocr=False,
new_indent_parser=True,
)
docs = loader.load()
assert (
loader.url
== "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all&useNewIndentParser=true"
)
assert len(docs) > 1

@@ -86,6 +86,7 @@ EXPECTED_ALL = [
"IuguLoader",
"JSONLoader",
"JoplinLoader",
"LLMSherpaFileLoader",
"LarkSuiteDocLoader",
"LakeFSLoader",
"MHTMLLoader",
