langchain/tests/integration_tests/examples
Pau Ramon Revilla 87802c86d9
Added a MHTML document loader (#6311)
MHTML is an interesting format because it is used both for emails and
for archived webpages. Some scraping projects want to store pages on
disk and process them later, and MHTML is well suited to that use case.

This is heavily inspired by the BeautifulSoup HTML loader, but it
extracts the HTML part from the MHTML file first.
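
The extraction step described above can be sketched with Python's standard-library email parser, since an MHTML file is a MIME multipart message. This is a minimal illustration and not necessarily how the loader itself is written; the sample document and the extract_html helper are hypothetical.

```python
import email
from email import policy

# Hypothetical, hand-written MHTML sample (a MIME multipart/related message).
SAMPLE_MHTML = """\
From: <Saved by a browser>
Subject: Example page
MIME-Version: 1.0
Content-Type: multipart/related; type="text/html"; boundary="----boundary"

------boundary
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Content-Location: https://example.com/

<html><head><title>Example</title></head>
<body><p>Hello =3D world</p></body></html>
------boundary--
"""


def extract_html(mhtml_text: str) -> str:
    """Return the decoded body of the first text/html part in an MHTML string."""
    message = email.message_from_string(mhtml_text, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding and decodes the charset.
            return part.get_content()
    raise ValueError("no text/html part found")


html = extract_html(SAMPLE_MHTML)
```

The resulting string can then be handed to any HTML parser, such as BeautifulSoup, to pull out the text and title.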

---------

Co-authored-by: rlm <pexpresss31@gmail.com>
1 year ago
README.rst feat: Add `UnstructuredRSTLoader` (#6594) 1 year ago
default-encoding.py Add PythonLoader which auto-detects encoding of Python files (#3311) 1 year ago
example-utf8.html Add ability to pass kwargs to loader classes in `DirectoryLoader`, add ability to modify encoding and BeautifulSoup behaviour in `BSHTMLLoader` (#2275) 2 years ago
example.html Add HTML document_loader that includes page title metadata (#1720) 2 years ago
example.json JSON loader (#4067) 1 year ago
example.mht Added a MHTML document loader (#6311) 1 year ago
facebook_chat.json Refactor TelegramChatLoader and FacebookChatLoader classes and add tests (#3863) 1 year ago
factbook.xml feat: Add `UnstructuredXMLLoader` for `.xml` files (#5955) 1 year ago
fake.odt feat: add loader for open office odt files (#4405) 1 year ago
hello.msg Harrison/msg files (#2375) 2 years ago
hello.pdf Harrison/format agent instructions (#973) 2 years ago
layout-parser-paper.pdf Harrison/remote paths pdf (#1544) 2 years ago
non-utf8-encoding.py Add PythonLoader which auto-detects encoding of Python files (#3311) 1 year ago
sitemap.xml Harrison/sitemap local (#4704) 1 year ago
slack_export.zip Add Slack Directory Loader (#2841) 1 year ago
stanley-cups.csv feat: Add `UnstructuredCSVLoader` for CSV files (#5844) 1 year ago
stanley-cups.xlsx feat: add `UnstructuredExcelLoader` for `.xlsx` and `.xls` files (#5617) 1 year ago
whatsapp_chat.txt Fix WhatsAppChatLoader : Enable parsing additional formats (#6663) 1 year ago

README.rst

Example Docs
------------

The sample docs directory contains the following files:

-  ``example-10k.html`` - A 10-K SEC filing in HTML format
-  ``layout-parser-paper.pdf`` - A PDF copy of the layout parser paper
-  ``factbook.xml``/``factbook.xsl`` - Example XML/XSL files that you
   can use to test stylesheets

These documents can be used to test out the parsers in the library. In
addition, here are instructions for pulling in some sample docs that are
too big to store in the repo.

XBRL 10-K
^^^^^^^^^

You can get an example 10-K in inline XBRL format using the following
``curl`` command. Note that you must set a ``User-Agent`` header that
identifies you, or the SEC site will reject your request.

.. code:: bash

   # Replace ${organization} and ${email} with your own details.
   curl -O \
     -A '${organization} ${email}' \
     https://www.sec.gov/Archives/edgar/data/311094/000117184321001344/0001171843-21-001344.txt

You can parse this document using the HTML parser.
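
If you would rather fetch the file from Python, the same request can be sketched with the standard library. This is only an illustration under stated assumptions: the ``build_request`` and ``download`` helpers, the organization name, and the email address are hypothetical, and the ``User-Agent`` value simply mirrors the ``curl`` example above.

```python
import shutil
import urllib.request

URL = (
    "https://www.sec.gov/Archives/edgar/data/311094/"
    "000117184321001344/0001171843-21-001344.txt"
)


def build_request(organization: str, email: str, url: str = URL) -> urllib.request.Request:
    """Build a request whose User-Agent identifies the caller."""
    user_agent = f"{organization} {email}"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(request: urllib.request.Request, destination: str) -> None:
    """Stream the response body to a local file (performs a network call)."""
    with urllib.request.urlopen(request) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)


# Placeholder identity; substitute your own organization and email.
request = build_request("Example Org", "admin@example.com")
# download(request, "0001171843-21-001344.txt")  # uncomment to fetch the file
```

The download call is left commented out so that the sketch can be run without network access.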