docs: added missed `document_loaders` examples (#5150)

# DOCS added missed document_loader examples

Added missed examples: `JSON`, `Open Document Format (ODT)`,
`Wikipedia`, `tomarkdown`.
Updated them to a consistent format.

## Who can review?

@hwchase17 
@dev2049
searx_updates
Leonid Ganeline 12 months ago committed by GitHub
parent c111134a55
commit 33929489b9
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -40,10 +40,11 @@ For detailed instructions on how to get set up with Unstructured, see installati
./document_loaders/examples/file_directory.ipynb
./document_loaders/examples/html.ipynb
./document_loaders/examples/image.ipynb
./document_loaders/examples/jupyter_notebook.ipynb
./document_loaders/examples/json.ipynb
./document_loaders/examples/markdown.ipynb
./document_loaders/examples/microsoft_powerpoint.ipynb
./document_loaders/examples/microsoft_word.ipynb
./document_loaders/examples/odt.ipynb
./document_loaders/examples/pandas_dataframe.ipynb
./document_loaders/examples/pdf.ipynb
./document_loaders/examples/sitemap.ipynb
@ -81,6 +82,7 @@ We don't need any access permissions to these datasets and services.
./document_loaders/examples/ifixit.ipynb
./document_loaders/examples/imsdb.ipynb
./document_loaders/examples/mediawikidump.ipynb
./document_loaders/examples/wikipedia.ipynb
./document_loaders/examples/youtube_transcript.ipynb
@ -131,4 +133,5 @@ We need access tokens and sometime other parameters to get access to these datas
./document_loaders/examples/slack.ipynb
./document_loaders/examples/spreedly.ipynb
./document_loaders/examples/stripe.ipynb
./document_loaders/examples/tomarkdown.ipynb
./document_loaders/examples/twitter.ipynb

@ -4,28 +4,30 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# JSON Files\n",
"# JSON\n",
"\n",
"The `JSONLoader` uses a specified [jq schema](https://en.wikipedia.org/wiki/Jq_(programming_language)) to parse the JSON files.\n",
"\n",
"This notebook shows how to use the `JSONLoader` to load [JSON](https://en.wikipedia.org/wiki/JSON) files into documents. A few examples of `jq` schema extracting different parts of a JSON file are also shown.\n",
">[JSON (JavaScript Object Notation)](https://en.wikipedia.org/wiki/JSON) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attributevalue pairs and arrays (or other serializable values).\n",
"\n",
">The `JSONLoader` uses a specified [jq schema](https://en.wikipedia.org/wiki/Jq_(programming_language)) to parse the JSON files. It uses the `jq` python package.\n",
"Check this [manual](https://stedolan.github.io/jq/manual/#Basicfilters) for a detailed documentation of the `jq` syntax."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"execution_count": 1,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!pip install jq"
"#!pip install jq"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
@ -359,7 +361,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -5,9 +5,13 @@
"id": "22a849cc",
"metadata": {},
"source": [
"## Unstructured ODT Loader\n",
"# Open Document Format (ODT)\n",
"\n",
"The `UnstructuredODTLoader` can be used to load Open Office ODT files."
">The [Open Document Format for Office Applications (ODF)](https://en.wikipedia.org/wiki/OpenDocument), also known as `OpenDocument`, is an open file format for word processing documents, spreadsheets, presentations and graphics and using ZIP-compressed XML files. It was developed with the aim of providing an open, XML-based file format specification for office applications.\n",
"\n",
">The standard is developed and maintained by a technical committee in the Organization for the Advancement of Structured Information Standards (`OASIS`) consortium. It was based on the Sun Microsystems specification for OpenOffice.org XML, the default format for `OpenOffice.org` and `LibreOffice`. It was originally developed for `StarOffice` \"to provide an open standard for office documents.\"\n",
"\n",
"The `UnstructuredODTLoader` is used to load `Open Office ODT` files."
]
},
{
@ -68,7 +72,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
"version": "3.10.6"
}
},
"nbformat": 4,

@ -7,7 +7,7 @@
"source": [
"# 2Markdown\n",
"\n",
"Uses [2markdown](https://2markdown.com/) to convert any webpage into a standard markdown file"
">[2markdown](https://2markdown.com/) service transforms website content into structured markdown files.\n"
]
},
{
@ -17,7 +17,7 @@
"metadata": {},
"outputs": [],
"source": [
"# You will need to get your own API key\n",
"# You will need to get your own API key. See https://2markdown.com/login\n",
"\n",
"api_key = \"\""
]
@ -56,9 +56,7 @@
"cell_type": "code",
"execution_count": 8,
"id": "706304e9",
"metadata": {
"scrolled": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@ -220,7 +218,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.6"
}
},
"nbformat": 4,

Loading…
Cancel
Save