langchain/tests/unit_tests/document_loaders/parsers/test_public_api.py

from langchain.document_loaders.parsers import __all__


def test_parsers_public_api_correct() -> None:
    """Test public API of parsers for breaking changes."""
    assert set(__all__) == {
        "BS4HTMLParser",
        "GrobidParser",
        "LanguageParser",
        "OpenAIWhisperParser",
        "PyPDFParser",
        "PDFMinerParser",
        "PyMuPDFParser",
        "PyPDFium2Parser",
        "PDFPlumberParser",
    }
Add workflow for testing with all deps (#4410) # Add action to test with all dependencies installed PR adds a custom action for setting up poetry that allows specifying a cache key: https://github.com/actions/setup-python/issues/505#issuecomment-1273013236 This makes it possible to run 2 types of unit tests: (1) unit tests with only core dependencies (2) unit tests with extended dependencies (e.g., those that rely on an optional pdf parsing library) As part of this PR, we're moving some pdf parsing tests into the unit-tests section and making sure that these unit tests get executed when running with extended dependencies. 2023-05-10 13:35:07 +00:00			`from langchain.document_loaders.parsers import __all__`


			`def test_parsers_public_api_correct() -> None:`
			`"""Test public API of parsers for breaking changes."""`
			`assert set(__all__) == {`
Add html parsers (#4874) # Add bs4 html parser * Some minor refactors * Extract the bs4 html parsing code from the bs html loader * Move some tests from integration tests to unit tests 2023-05-18 02:39:11 +00:00			`"BS4HTMLParser",`
Grobid parser for Scientific Articles from PDF (#6729) ### Scientific Article PDF Parsing via Grobid `Description:` This change adds the GrobidParser class, which uses the Grobid library to parse scientific articles into a universal XML format containing the article title, references, sections, section text etc. The GrobidParser uses a local Grobid server to return PDFs document as XML and parses the XML to optionally produce documents of individual sentences or of whole paragraphs. Metadata includes the text, paragraph number, pdf relative bboxes, pages (text may overlap over two pages), section title (Introduction, Methodology etc), section_number (i.e 1.1, 2.3), the title of the paper and finally the file path. Grobid parsing is useful beyond standard pdf parsing as it accurately outputs sections and paragraphs within them. This allows for post-fitering of results for specific sections i.e. limiting results to the methodology section or results. While sections are split via headings, ideally they could be classified specifically into introduction, methodology, results, discussion, conclusion. I'm currently experimenting with chatgpt-3.5 for this function, which could later be implemented as a textsplitter. `Dependencies:` For use, the grobid repo must be cloned and Java must be installed, for colab this is: ``` !apt-get install -y openjdk-11-jdk -q !update-alternatives --set java /usr/lib/jvm/java-11-openjdk-amd64/bin/java !git clone https://github.com/kermitt2/grobid.git os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64" os.chdir('grobid') !./gradlew clean install ``` Once installed the server is ran on localhost:8070 via ``` get_ipython().system_raw('nohup ./gradlew run > grobid.log 2>&1 &') ``` @rlancemartin, @eyurtsev Twitter Handle: @Corranmac Grobid Demo Notebook is [here](https://colab.research.google.com/drive/1X-St_mQRmmm8YWtct_tcJNtoktbdGBmd?usp=sharing). --------- Co-authored-by: rlm <pexpresss31@gmail.com> 2023-06-29 21:29:29 +00:00			`"GrobidParser",`
feat (documents): add a source code loader based on AST manipulation (#6486) #### Summary A new approach to loading source code is implemented: Each top-level function and class in the code is loaded into separate documents. Then, an additional document is created with the top-level code, but without the already loaded functions and classes. This could improve the accuracy of QA chains over source code. For instance, having this script: ``` class MyClass: def __init__(self, name): self.name = name def greet(self): print(f"Hello, {self.name}!") def main(): name = input("Enter your name: ") obj = MyClass(name) obj.greet() if __name__ == '__main__': main() ``` The loader will create three documents with this content: First document: ``` class MyClass: def __init__(self, name): self.name = name def greet(self): print(f"Hello, {self.name}!") ``` Second document: ``` def main(): name = input("Enter your name: ") obj = MyClass(name) obj.greet() ``` Third document: ``` # Code for: class MyClass: # Code for: def main(): if __name__ == '__main__': main() ``` A threshold parameter is added to control whether small scripts are split in this way or not. At this moment, only Python and JavaScript are supported. The appropriate parser is determined by examining the file extension. #### Tests This PR adds: - Unit tests - Integration tests #### Dependencies Only one dependency was added as optional (needed for the JavaScript parser). #### Documentation A notebook is added showing how the loader can be used. #### Who can review? @eyurtsev @hwchase17 --------- Co-authored-by: rlm <pexpresss31@gmail.com> 2023-06-27 22:58:47 +00:00			`"LanguageParser",`
Create OpenAIWhisperParser for generating Documents from audio files (#5580) # OpenAIWhisperParser This PR creates a new parser, `OpenAIWhisperParser`, that uses the [OpenAI Whisper model](https://platform.openai.com/docs/guides/speech-to-text/quickstart) to perform transcription of audio files to text (`Documents`). Please see the notebook for usage. 2023-06-05 22:51:13 +00:00			`"OpenAIWhisperParser",`
Add workflow for testing with all deps (#4410) # Add action to test with all dependencies installed PR adds a custom action for setting up poetry that allows specifying a cache key: https://github.com/actions/setup-python/issues/505#issuecomment-1273013236 This makes it possible to run 2 types of unit tests: (1) unit tests with only core dependencies (2) unit tests with extended dependencies (e.g., those that rely on an optional pdf parsing library) As part of this PR, we're moving some pdf parsing tests into the unit-tests section and making sure that these unit tests get executed when running with extended dependencies. 2023-05-10 13:35:07 +00:00			`"PyPDFParser",`
			`"PDFMinerParser",`
			`"PyMuPDFParser",`
			`"PyPDFium2Parser",`
Feature: pdfplumber PDF loader with BaseBlobParser (#4552) # Feature: pdfplumber PDF loader with BaseBlobParser * Adds pdfplumber as a PDF loader * Adds pdfplumber as a blob parser. 2023-05-15 13:47:02 +00:00			`"PDFPlumberParser",`
Add workflow for testing with all deps (#4410) # Add action to test with all dependencies installed PR adds a custom action for setting up poetry that allows specifying a cache key: https://github.com/actions/setup-python/issues/505#issuecomment-1273013236 This makes it possible to run 2 types of unit tests: (1) unit tests with only core dependencies (2) unit tests with extended dependencies (e.g., those that rely on an optional pdf parsing library) As part of this PR, we're moving some pdf parsing tests into the unit-tests section and making sure that these unit tests get executed when running with extended dependencies. 2023-05-10 13:35:07 +00:00			`}`