mirror of
https://github.com/hwchase17/langchain
synced 2024-10-29 17:07:25 +00:00
b24472eae3
### Summary

Adds `UnstructuredOrgModeLoader` for processing [Org-mode](https://en.wikipedia.org/wiki/Org-mode) documents.

### Testing

```python
from langchain.document_loaders import UnstructuredOrgModeLoader

loader = UnstructuredOrgModeLoader(
    file_path="example_data/README.org", mode="elements"
)
docs = loader.load()
print(docs[0])
```

### Reviewers

- @rlancemartin
- @eyurtsev
- @hwchase17
28 lines
889 B
Org Mode
* Example Docs

The sample docs directory contains the following files:

- ~example-10k.html~ - A 10-K SEC filing in HTML format
- ~layout-parser-paper.pdf~ - A PDF copy of the layout parser paper
- ~factbook.xml~ / ~factbook.xsl~ - Example XML/XSL files that you
  can use to test stylesheets

These documents can be used to test out the parsers in the library. In
addition, here are instructions for pulling in some sample docs that are
too big to store in the repo.

** XBRL 10-K

You can get an example 10-K in inline XBRL format using the following
~curl~ command. Note that you must set the user agent in the request
header, or the SEC site will reject your request.

#+BEGIN_SRC bash
  # ${organization} and ${email} are placeholders: set them as shell
  # variables first, or substitute your own values.
  curl -O \
    -A "${organization} ${email}" \
    https://www.sec.gov/Archives/edgar/data/311094/000117184321001344/0001171843-21-001344.txt
#+END_SRC
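
For readers who would rather fetch the filing from Python, the same request can be sketched with the standard library. This is only an illustration under stated assumptions: ~build_request~ and ~download~ are hypothetical helper names, and the organization and email values are placeholders you supply yourself. The User-Agent header plays the role of the ~-A~ flag in the ~curl~ command.

```python
import shutil
import urllib.request

# URL taken from the curl command above.
FILING_URL = (
    "https://www.sec.gov/Archives/edgar/data/311094/"
    "000117184321001344/0001171843-21-001344.txt"
)


def build_request(organization, email, url=FILING_URL):
    """Return a Request whose User-Agent identifies the caller (like curl -A).

    Hypothetical helper for illustration; not part of any library.
    """
    user_agent = f"{organization} {email}"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(request, dest):
    """Stream the response body to dest. This performs a network call."""
    with urllib.request.urlopen(request) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

Calling ~download(build_request("Your Org", "you@example.com"), "0001171843-21-001344.txt")~ would then save the file under the same name that ~curl -O~ uses.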

You can parse this document using the HTML parser.
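
To get a feel for what HTML parsing of such a file involves, here is a minimal stand-in built on Python's standard ~html.parser~ module. It is not the library's own HTML parser, and ~TextCollector~ and ~extract_text~ are hypothetical names introduced for this sketch; it simply collects visible text and skips ~<script>~ and ~<style>~ bodies.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and drop chunks that are empty.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text chunks of an HTML string, in document order."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.chunks
```

Passing the contents of the downloaded file to ~extract_text~ yields a flat list of text chunks; a real document parser would additionally classify them into titles, paragraphs, tables, and so on.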