feat: Add `UnstructuredOrgModeLoader` (#6842)

### Summary

Adds `UnstructuredOrgModeLoader` for processing
[Org-mode](https://en.wikipedia.org/wiki/Org-mode) documents.

### Testing

```python
from langchain.document_loaders import UnstructuredOrgModeLoader

loader = UnstructuredOrgModeLoader(
    file_path="example_data/README.org", mode="elements"
)
docs = loader.load()
print(docs[0])
```

### Reviewers

- @rlancemartin
- @eyurtsev
- @hwchase17
pull/6853/head
Matt Robinson 1 year ago committed by GitHub
parent e53995836a
commit b24472eae3
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -0,0 +1,27 @@
* Example Docs
The sample docs directory contains the following files:
- ~example-10k.html~ - A 10-K SEC filing in HTML format
- ~layout-parser-paper.pdf~ - A PDF copy of the layout parser paper
- ~factbook.xml~ / ~factbook.xsl~ - Example XML/XLS files that you
can use to test stylesheets
These documents can be used to test out the parsers in the library. In
addition, here are instructions for pulling in some sample docs that are
too big to store in the repo.
** XBRL 10-K
You can get an example 10-K in inline XBRL format using the following
~curl~. Note, you need to have the user agent set in the header or the
SEC site will reject your request.
#+BEGIN_SRC bash
curl -O \
-A '${organization} ${email}'
https://www.sec.gov/Archives/edgar/data/311094/000117184321001344/0001171843-21-001344.txt
#+END_SRC
You can parse this document using the HTML parser.

@ -0,0 +1,88 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Org-mode\n",
"\n",
">A [Org Mode document](https://en.wikipedia.org/wiki/Org-mode) is a document editing, formatting, and organizing mode, designed for notes, planning, and authoring within the free software text editor Emacs."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## `UnstructuredOrgModeLoader`\n",
"\n",
"You can load data from Org-mode files with `UnstructuredOrgModeLoader` using the following workflow."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import UnstructuredOrgModeLoader"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"loader = UnstructuredOrgModeLoader(\n",
" file_path=\"example_data/README.org\", mode=\"elements\"\n",
")\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='Example Docs' metadata={'source': 'example_data/README.org', 'filename': 'README.org', 'file_directory': 'example_data', 'filetype': 'text/org', 'page_number': 1, 'category': 'Title'}\n"
]
}
],
"source": [
"print(docs[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

@ -78,6 +78,7 @@ from langchain.document_loaders.odt import UnstructuredODTLoader
from langchain.document_loaders.onedrive import OneDriveLoader
from langchain.document_loaders.onedrive_file import OneDriveFileLoader
from langchain.document_loaders.open_city_data import OpenCityDataLoader
from langchain.document_loaders.org_mode import UnstructuredOrgModeLoader
from langchain.document_loaders.pdf import (
MathpixPDFLoader,
OnlinePDFLoader,
@ -262,6 +263,7 @@ __all__ = [
"UnstructuredImageLoader",
"UnstructuredMarkdownLoader",
"UnstructuredODTLoader",
"UnstructuredOrgModeLoader",
"UnstructuredPDFLoader",
"UnstructuredPowerPointLoader",
"UnstructuredRSTLoader",

@ -0,0 +1,22 @@
"""Loader that loads Org-Mode files."""
from typing import Any, List
from langchain.document_loaders.unstructured import (
UnstructuredFileLoader,
validate_unstructured_version,
)
class UnstructuredOrgModeLoader(UnstructuredFileLoader):
"""Loader that uses unstructured to load Org-Mode files."""
def __init__(
self, file_path: str, mode: str = "single", **unstructured_kwargs: Any
):
validate_unstructured_version(min_unstructured_version="0.7.9")
super().__init__(file_path=file_path, mode=mode, **unstructured_kwargs)
def _get_elements(self) -> List:
from unstructured.partition.org import partition_org
return partition_org(filename=self.file_path, **self.unstructured_kwargs)

@ -0,0 +1,15 @@
import os
from pathlib import Path
from langchain.document_loaders import UnstructuredOrgModeLoader
EXAMPLE_DIRECTORY = file_path = Path(__file__).parent.parent / "examples"
def test_unstructured_org_mode_loader() -> None:
"""Test unstructured loader."""
file_path = os.path.join(EXAMPLE_DIRECTORY, "README.org")
loader = UnstructuredOrgModeLoader(str(file_path))
docs = loader.load()
assert len(docs) == 1

@ -0,0 +1,27 @@
* Example Docs
The sample docs directory contains the following files:
- ~example-10k.html~ - A 10-K SEC filing in HTML format
- ~layout-parser-paper.pdf~ - A PDF copy of the layout parser paper
- ~factbook.xml~ / ~factbook.xsl~ - Example XML/XLS files that you
can use to test stylesheets
These documents can be used to test out the parsers in the library. In
addition, here are instructions for pulling in some sample docs that are
too big to store in the repo.
** XBRL 10-K
You can get an example 10-K in inline XBRL format using the following
~curl~. Note, you need to have the user agent set in the header or the
SEC site will reject your request.
#+BEGIN_SRC bash
curl -O \
-A '${organization} ${email}'
https://www.sec.gov/Archives/edgar/data/311094/000117184321001344/0001171843-21-001344.txt
#+END_SRC
You can parse this document using the HTML parser.
Loading…
Cancel
Save