JSON loader (#4067)

This implements a loader for text passages stored in JSON format. A
`jq` schema defines which parts of the JSON file are extracted as
content. This requires the `jq` package as a dependency:
https://pypi.org/project/jq/.

---------

Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>
Aivin V. Solatorio 2023-05-05 17:48:13 -04:00 committed by GitHub
parent bb6d97c18c
commit 6567b73e1a
7 changed files with 584 additions and 4 deletions
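
As a rough illustration of what the jq schema mentioned in the commit description selects, the sketch below emulates the path `.messages[].content` in plain Python so that it runs without the `jq` package. The helper name `select_messages_content` and the sample data are hypothetical and exist only for this example; `jq` itself is far more general than this single path.

```python
from typing import Any, Dict, List


def select_messages_content(data: Dict[str, Any]) -> List[Any]:
    """Emulate the jq path `.messages[].content` (illustrative only)."""
    # `.messages` picks the "messages" key, `[]` iterates over its items,
    # and `.content` reads each item's "content" key (None when it is missing,
    # which corresponds to jq's null).
    return [message.get("content") for message in data["messages"]]


# Hypothetical sample shaped like the chat export used in this commit.
sample = {
    "messages": [
        {"sender_name": "User 2", "content": "Bye!"},
        {"sender_name": "User 1", "content": "Oh no worries! Bye"},
        {"sender_name": "User 2", "photos": [{"uri": "url_of_some_picture.jpg"}]},
    ]
}

contents = select_messages_content(sample)
print(contents)
```

The last message has no `content` key, so its entry is `None`; the loader in this commit converts such values to empty strings.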


@ -0,0 +1,367 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# JSON Files\n",
"\n",
"The `JSONLoader` uses a specified [jq schema](https://en.wikipedia.org/wiki/Jq_(programming_language)) to parse the JSON files.\n",
"\n",
"This notebook shows how to use the `JSONLoader` to load [JSON](https://en.wikipedia.org/wiki/JSON) files into documents. It also shows a few examples of `jq` schemas that extract different parts of a JSON file.\n",
"\n",
"See the [jq manual](https://stedolan.github.io/jq/manual/#Basicfilters) for detailed documentation of the `jq` syntax."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install jq"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"from langchain.document_loaders import JSONLoader"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"from pathlib import Path\n",
"from pprint import pprint\n",
"\n",
"\n",
"file_path='./example_data/facebook_chat.json'\n",
"data = json.loads(Path(file_path).read_text())"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_chat.jpg'},\n",
" 'is_still_participant': True,\n",
" 'joinable_mode': {'link': '', 'mode': 1},\n",
" 'magic_words': [],\n",
" 'messages': [{'content': 'Bye!',\n",
" 'sender_name': 'User 2',\n",
" 'timestamp_ms': 1675597571851},\n",
" {'content': 'Oh no worries! Bye',\n",
" 'sender_name': 'User 1',\n",
" 'timestamp_ms': 1675597435669},\n",
" {'content': 'No Im sorry it was my mistake, the blue one is not '\n",
" 'for sale',\n",
" 'sender_name': 'User 2',\n",
" 'timestamp_ms': 1675596277579},\n",
" {'content': 'I thought you were selling the blue one!',\n",
" 'sender_name': 'User 1',\n",
" 'timestamp_ms': 1675595140251},\n",
" {'content': 'Im not interested in this bag. Im interested in the '\n",
" 'blue one!',\n",
" 'sender_name': 'User 1',\n",
" 'timestamp_ms': 1675595109305},\n",
" {'content': 'Here is $129',\n",
" 'sender_name': 'User 2',\n",
" 'timestamp_ms': 1675595068468},\n",
" {'photos': [{'creation_timestamp': 1675595059,\n",
" 'uri': 'url_of_some_picture.jpg'}],\n",
" 'sender_name': 'User 2',\n",
" 'timestamp_ms': 1675595060730},\n",
" {'content': 'Online is at least $100',\n",
" 'sender_name': 'User 2',\n",
" 'timestamp_ms': 1675595045152},\n",
" {'content': 'How much do you want?',\n",
" 'sender_name': 'User 1',\n",
" 'timestamp_ms': 1675594799696},\n",
" {'content': 'Goodmorning! $50 is too low.',\n",
" 'sender_name': 'User 2',\n",
" 'timestamp_ms': 1675577876645},\n",
" {'content': 'Hi! Im interested in your bag. Im offering $50. Let '\n",
" 'me know if you are interested. Thanks!',\n",
" 'sender_name': 'User 1',\n",
" 'timestamp_ms': 1675549022673}],\n",
" 'participants': [{'name': 'User 1'}, {'name': 'User 2'}],\n",
" 'thread_path': 'inbox/User 1 and User 2 chat',\n",
" 'title': 'User 1 and User 2 chat'}\n"
]
}
],
"source": [
"pprint(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using `JSONLoader`\n",
"\n",
"Suppose we are interested in extracting the values under the `content` field within the `messages` key of the JSON data. This can easily be done through the `JSONLoader` as shown below."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"loader = JSONLoader(\n",
"    file_path='./example_data/facebook_chat.json',\n",
"    jq_schema='.messages[].content')\n",
"\n",
"data = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Document(page_content='Bye!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1}),\n",
" Document(page_content='Oh no worries! Bye', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2}),\n",
" Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 3}),\n",
" Document(page_content='I thought you were selling the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 4}),\n",
" Document(page_content='Im not interested in this bag. Im interested in the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 5}),\n",
" Document(page_content='Here is $129', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 6}),\n",
" Document(page_content='', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 7}),\n",
" Document(page_content='Online is at least $100', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 8}),\n",
" Document(page_content='How much do you want?', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 9}),\n",
" Document(page_content='Goodmorning! $50 is too low.', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 10}),\n",
" Document(page_content='Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 11})]\n"
]
}
],
"source": [
"pprint(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extracting metadata\n",
"\n",
"Generally, we want to include metadata available in the JSON file in the documents that we create from the content.\n",
"\n",
"The following demonstrates how metadata can be extracted using the `JSONLoader`.\n",
"\n",
"There are some key changes to note. In the previous example, where we didn't collect metadata, we specified directly in the schema where the value for `page_content` is extracted from:\n",
"\n",
"```\n",
".messages[].content\n",
"```\n",
"\n",
"In the current example, we have to tell the loader to iterate over the records in the `messages` field. The `jq_schema` then has to be:\n",
"\n",
"```\n",
".messages[]\n",
"```\n",
"\n",
"This allows us to pass each record (a dict) to a `metadata_func` that we have to implement. The `metadata_func` is responsible for identifying which pieces of information in the record should be included in the metadata stored in the final `Document` object.\n",
"\n",
"Additionally, we now have to explicitly specify in the loader, via the `content_key` argument, the key in the record from which the value for `page_content` is extracted."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Define the metadata extraction function.\n",
"def metadata_func(record: dict, metadata: dict) -> dict:\n",
"\n",
"    metadata[\"sender_name\"] = record.get(\"sender_name\")\n",
"    metadata[\"timestamp_ms\"] = record.get(\"timestamp_ms\")\n",
"\n",
"    return metadata\n",
"\n",
"\n",
"loader = JSONLoader(\n",
"    file_path='./example_data/facebook_chat.json',\n",
"    jq_schema='.messages[]',\n",
"    content_key=\"content\",\n",
"    metadata_func=metadata_func\n",
")\n",
"\n",
"data = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Document(page_content='Bye!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1, 'sender_name': 'User 2', 'timestamp_ms': 1675597571851}),\n",
" Document(page_content='Oh no worries! Bye', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2, 'sender_name': 'User 1', 'timestamp_ms': 1675597435669}),\n",
" Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 3, 'sender_name': 'User 2', 'timestamp_ms': 1675596277579}),\n",
" Document(page_content='I thought you were selling the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 4, 'sender_name': 'User 1', 'timestamp_ms': 1675595140251}),\n",
" Document(page_content='Im not interested in this bag. Im interested in the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 5, 'sender_name': 'User 1', 'timestamp_ms': 1675595109305}),\n",
" Document(page_content='Here is $129', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 6, 'sender_name': 'User 2', 'timestamp_ms': 1675595068468}),\n",
" Document(page_content='', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 7, 'sender_name': 'User 2', 'timestamp_ms': 1675595060730}),\n",
" Document(page_content='Online is at least $100', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 8, 'sender_name': 'User 2', 'timestamp_ms': 1675595045152}),\n",
" Document(page_content='How much do you want?', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 9, 'sender_name': 'User 1', 'timestamp_ms': 1675594799696}),\n",
" Document(page_content='Goodmorning! $50 is too low.', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 10, 'sender_name': 'User 2', 'timestamp_ms': 1675577876645}),\n",
" Document(page_content='Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 11, 'sender_name': 'User 1', 'timestamp_ms': 1675549022673})]\n"
]
}
],
"source": [
"pprint(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, you will see that the documents contain the metadata associated with the content we extracted."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The `metadata_func`\n",
"\n",
"As shown above, the `metadata_func` accepts the default metadata generated by the `JSONLoader`. This gives the user full control over how the metadata is formatted.\n",
"\n",
"For example, the default metadata contains the `source` and the `seq_num` keys. However, the JSON data may contain these keys as well. The user can then use the `metadata_func` to rename the default keys and use the ones from the JSON data.\n",
"\n",
"The example below shows how we can modify the `source` to contain only the file path relative to the `langchain` directory."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Define the metadata extraction function.\n",
"def metadata_func(record: dict, metadata: dict) -> dict:\n",
"\n",
"    metadata[\"sender_name\"] = record.get(\"sender_name\")\n",
"    metadata[\"timestamp_ms\"] = record.get(\"timestamp_ms\")\n",
"\n",
"    if \"source\" in metadata:\n",
"        source = metadata[\"source\"].split(\"/\")\n",
"        source = source[source.index(\"langchain\"):]\n",
"        metadata[\"source\"] = \"/\".join(source)\n",
"\n",
"    return metadata\n",
"\n",
"\n",
"loader = JSONLoader(\n",
"    file_path='./example_data/facebook_chat.json',\n",
"    jq_schema='.messages[]',\n",
"    content_key=\"content\",\n",
"    metadata_func=metadata_func\n",
")\n",
"\n",
"data = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Document(page_content='Bye!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1, 'sender_name': 'User 2', 'timestamp_ms': 1675597571851}),\n",
" Document(page_content='Oh no worries! Bye', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2, 'sender_name': 'User 1', 'timestamp_ms': 1675597435669}),\n",
" Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 3, 'sender_name': 'User 2', 'timestamp_ms': 1675596277579}),\n",
" Document(page_content='I thought you were selling the blue one!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 4, 'sender_name': 'User 1', 'timestamp_ms': 1675595140251}),\n",
" Document(page_content='Im not interested in this bag. Im interested in the blue one!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 5, 'sender_name': 'User 1', 'timestamp_ms': 1675595109305}),\n",
" Document(page_content='Here is $129', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 6, 'sender_name': 'User 2', 'timestamp_ms': 1675595068468}),\n",
" Document(page_content='', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 7, 'sender_name': 'User 2', 'timestamp_ms': 1675595060730}),\n",
" Document(page_content='Online is at least $100', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 8, 'sender_name': 'User 2', 'timestamp_ms': 1675595045152}),\n",
" Document(page_content='How much do you want?', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 9, 'sender_name': 'User 1', 'timestamp_ms': 1675594799696}),\n",
" Document(page_content='Goodmorning! $50 is too low.', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 10, 'sender_name': 'User 2', 'timestamp_ms': 1675577876645}),\n",
" Document(page_content='Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 11, 'sender_name': 'User 1', 'timestamp_ms': 1675549022673})]\n"
]
}
],
"source": [
"pprint(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Common JSON structures with jq schema\n",
"\n",
"The list below shows which `jq_schema` to use to extract content from JSON data with some common structures.\n",
"\n",
"```\n",
"JSON -> [{\"text\": ...}, {\"text\": ...}, {\"text\": ...}]\n",
"jq_schema -> \".[].text\"\n",
" \n",
"JSON -> {\"key\": [{\"text\": ...}, {\"text\": ...}, {\"text\": ...}]}\n",
"jq_schema -> \".key[].text\"\n",
"\n",
"JSON -> [\"...\", \"...\", \"...\"]\n",
"jq_schema -> \".[]\"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
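
The notebook's `metadata_func` section notes that the default `source` and `seq_num` keys may collide with keys already present in the JSON data. A minimal sketch of one way to handle that is shown here. It assumes, hypothetically, that each record carries its own `source` field, and the key name `file_source` is invented for illustration. The function relies only on the `(record, metadata) -> dict` signature used in the notebook, so it runs without `langchain` or `jq` installed.

```python
def metadata_func(record: dict, metadata: dict) -> dict:
    # Keep the loader-generated file path under a different (hypothetical) key.
    if "source" in metadata:
        metadata["file_source"] = metadata.pop("source")

    # Prefer the record's own `source` value when it has one.
    if "source" in record:
        metadata["source"] = record["source"]

    return metadata


# Hypothetical inputs standing in for what the loader would pass.
default_metadata = {"source": "/tmp/chat.json", "seq_num": 1}
record = {"content": "Bye!", "source": "messenger"}

updated = metadata_func(record, default_metadata)
print(updated)
```

Records without a `source` field simply end up with the file path under `file_source` and no `source` key at all, which may or may not be what a downstream consumer expects.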


@ -45,6 +45,7 @@ from langchain.document_loaders.ifixit import IFixitLoader
from langchain.document_loaders.image import UnstructuredImageLoader
from langchain.document_loaders.image_captions import ImageCaptionLoader
from langchain.document_loaders.imsdb import IMSDbLoader
from langchain.document_loaders.json_loader import JSONLoader
from langchain.document_loaders.markdown import UnstructuredMarkdownLoader
from langchain.document_loaders.mediawikidump import MWDumpLoader
from langchain.document_loaders.modern_treasury import ModernTreasuryLoader
@ -144,6 +145,7 @@ __all__ = [
"IFixitLoader",
"IMSDbLoader",
"ImageCaptionLoader",
"JSONLoader",
"ModernTreasuryLoader",
"MWDumpLoader",
"NotebookLoader",


@ -0,0 +1,104 @@
"""Loader that loads data from JSON."""
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader


class JSONLoader(BaseLoader):
    """Loads a JSON file and references a jq schema provided to load the text into
    documents.

    Example:
        [{"text": ...}, {"text": ...}, {"text": ...}] -> schema = .[].text
        {"key": [{"text": ...}, {"text": ...}, {"text": ...}]} -> schema = .key[].text
        ["", "", ""] -> schema = .[]
    """

    def __init__(
        self,
        file_path: Union[str, Path],
        jq_schema: str,
        content_key: Optional[str] = None,
        metadata_func: Optional[Callable[[Dict, Dict], Dict]] = None,
    ):
        """Initialize the JSONLoader.

        Args:
            file_path (Union[str, Path]): The path to the JSON file.
            jq_schema (str): The jq schema to use to extract the data or text from
                the JSON.
            content_key (str): The key to use to extract the content from the JSON if
                the jq_schema results in a list of objects (dict).
            metadata_func (Callable[[Dict, Dict], Dict]): A function that takes in the
                JSON object extracted by the jq_schema and the default metadata and
                returns a dict of the updated metadata.
        """
        try:
            import jq  # noqa:F401
        except ImportError:
            raise ValueError(
                "jq package not found, please install it with `pip install jq`"
            )

        self.file_path = Path(file_path).resolve()
        self._jq_schema = jq.compile(jq_schema)
        self._content_key = content_key
        self._metadata_func = metadata_func

    def load(self) -> List[Document]:
        """Load and return documents from the JSON file."""
        data = self._jq_schema.input(json.loads(self.file_path.read_text()))

        # Perform some validation
        # This is not a perfect validation, but it should catch most cases
        # and prevent the user from getting a cryptic error later on.
        if self._content_key is not None:
            sample = data.first()
            if not isinstance(sample, dict):
                raise ValueError(
                    "Expected the jq schema to result in a list of objects (dict), "
                    f"so sample must be a dict but got `{type(sample)}`"
                )

            if sample.get(self._content_key) is None:
                raise ValueError(
                    "Expected the jq schema to result in a list of objects (dict) "
                    f"with the key `{self._content_key}`"
                )

            if self._metadata_func is not None:
                sample_metadata = self._metadata_func(sample, {})
                if not isinstance(sample_metadata, dict):
                    raise ValueError(
                        "Expected the metadata_func to return a dict but got "
                        f"`{type(sample_metadata)}`"
                    )

        docs = []
        for i, sample in enumerate(data, 1):
            metadata = dict(
                source=str(self.file_path),
                seq_num=i,
            )

            if self._content_key is not None:
                text = sample.get(self._content_key)
                if self._metadata_func is not None:
                    # We pass in the metadata dict to the metadata_func
                    # so that the user can customize the default metadata
                    # based on the content of the JSON object.
                    metadata = self._metadata_func(sample, metadata)
            else:
                text = sample

            # In case the text is None, set it to an empty string
            text = text or ""

            docs.append(Document(page_content=text, metadata=metadata))

        return docs
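
The document-building loop in `load` can be sketched without `jq` or `langchain` installed. The version below is a simplified stand-in, not the loader itself: `build_documents` is a hypothetical helper that takes already-extracted records and returns plain dicts instead of `Document` objects, and the sample records are made up for the example.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional

MetadataFunc = Callable[[Dict, Dict], Dict]


def build_documents(
    records: Iterable[Any],
    source: str,
    content_key: Optional[str] = None,
    metadata_func: Optional[MetadataFunc] = None,
) -> List[Dict[str, Any]]:
    """Mirror the loop in `JSONLoader.load` using plain dicts (illustrative)."""
    docs = []
    for i, record in enumerate(records, 1):  # seq_num starts at 1
        metadata = {"source": source, "seq_num": i}

        if content_key is not None:
            text = record.get(content_key)
            if metadata_func is not None:
                # The default metadata is handed to the user's function,
                # which returns the metadata to store.
                metadata = metadata_func(record, metadata)
        else:
            text = record

        # Missing content (None) becomes an empty string.
        docs.append({"page_content": text or "", "metadata": metadata})
    return docs


# Hypothetical records, as if produced by the schema `.messages[]`.
records = [
    {"sender_name": "User 2", "content": "Bye!"},
    {"sender_name": "User 2", "photos": []},
]

docs = build_documents(
    records,
    source="example.json",
    content_key="content",
    metadata_func=lambda rec, meta: {**meta, "sender_name": rec.get("sender_name")},
)
print(docs)
```

In this sketch, as in the loop it mirrors, `metadata_func` is consulted only when `content_key` is set, so a schema that yields bare strings keeps the default `source` and `seq_num` metadata.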

poetry.lock generated

@ -1,4 +1,4 @@
# This file is automatically @generated by Poetry 1.4.0 and should not be changed by hand.
# This file is automatically @generated by Poetry 1.4.2 and should not be changed by hand.
[[package]]
name = "absl-py"
@ -3220,6 +3220,71 @@ files = [
{file = "joblib-1.2.0.tar.gz", hash = "sha256:e1cee4a79e4af22881164f218d4311f60074197fb707e082e803b61f6d137018"},
]
[[package]]
name = "jq"
version = "1.4.1"
description = "jq is a lightweight and flexible JSON processor."
category = "main"
optional = true
python-versions = ">=3.5"
files = [
{file = "jq-1.4.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:1708cad6ee0f173ce38c6ebfc81b98a545b35387ae6471c8d7f9f3a02ffb723e"},
{file = "jq-1.4.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c94e70e5f0798d87018cd4a58175f4eed2afa08727389a0f3f246bf7e7b98d1e"},
{file = "jq-1.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ddc2c6b55c5461c6f155c4b717927bdd29a83a6356250c4e6016297bcea80498"},
{file = "jq-1.4.1-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:e2e71f5a921542efbea12386ca9d91ea1aeb6bd393681073e4a47a720613715f"},
{file = "jq-1.4.1-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:b2bf666002d23ee8cf9e619d2d1e46d86a089e028367665386b9d67d22b31ceb"},
{file = "jq-1.4.1-cp310-cp310-musllinux_1_1_i686.whl", hash = "sha256:e33954fe47e61a533556d38e045ddd7b3fa8a8186a70981462a207ed22594d83"},
{file = "jq-1.4.1-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:07905774df7706588014ca49789548328e8f66738b004089b3f0c42f7f389405"},
{file = "jq-1.4.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:959b2e677e56dc31c8572c0852ad26d3b351a8a458ca72c96f8cedfcde49419f"},
{file = "jq-1.4.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:e74ab69d39b171f1625fa666baa8f9a1ff49e7295047082bcb537fcc2d359dfe"},
{file = "jq-1.4.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:103412f7f35175eb9a1005e4e2067b363dfcdb413d02fa962ddf288b2b16cc54"},
{file = "jq-1.4.1-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:1f70d5e0c6445cc58f720de2ab44c156c69ce6d898c4d4ad04f07815868e31ed"},
{file = "jq-1.4.1-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:db980118c02321c56b6e0ddf817ad1cbbd8b6c90f4637bdebb695e84ee41a296"},
{file = "jq-1.4.1-cp311-cp311-musllinux_1_1_i686.whl", hash = "sha256:9b295a51a9ea7e324aa7ad2ce2cca3d51d7492a525cd7a59773666a07b1cc0f7"},
{file = "jq-1.4.1-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:82b44474641dcdb07b43267d17f77914595768e9464b31de114e6c229a16ac6e"},
{file = "jq-1.4.1-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:582c40d7e212e310cf1ed0fddc4590853b64a5e09aed1f740613765c83cff072"},
{file = "jq-1.4.1-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:75f4269f709f746bf3d52df2c4ebc316d4985e0db97b7c1a293f02202befcdcb"},
{file = "jq-1.4.1-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1a060fd3172f8833828cb26151ea2f6c0f99f0191109ad580baee7befbdd6e65"},
{file = "jq-1.4.1-cp36-cp36m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:2bfd61be72ad1e35622a7525e55615954ccfbe6ccadabd7f964e879bb4a53ad6"},
{file = "jq-1.4.1-cp36-cp36m-musllinux_1_1_aarch64.whl", hash = "sha256:4364c45113407f1316a99bd7a8661aa9304eb3578c80b201917aa8568fa40ee1"},
{file = "jq-1.4.1-cp36-cp36m-musllinux_1_1_i686.whl", hash = "sha256:0a8c37073a335596c645f0260fd3ea7b6141c2fb0115a0b8082252b0169f70c8"},
{file = "jq-1.4.1-cp36-cp36m-musllinux_1_1_x86_64.whl", hash = "sha256:96e5160f77498389e388e7ba3cd1771abc386b52788c82dee897c95bc87efe6f"},
{file = "jq-1.4.1-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:fac91eb91bec60dee28e2325f863c43d12ffc904ee72248522c6d0157ae98a54"},
{file = "jq-1.4.1-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:581e771e7c4aad728f9696ce6faee0f3d535cb0c845a49ac20188d8c7918e19d"},
{file = "jq-1.4.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:31b6526533cbc298ae0c0084d22452fbd3b4600ace488dc961ecf9a1dcb51a83"},
{file = "jq-1.4.1-cp37-cp37m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:1830a9fd394673758010e41e8d0e00be7126b0ea9f3ede017a555c0c805435bc"},
{file = "jq-1.4.1-cp37-cp37m-musllinux_1_1_aarch64.whl", hash = "sha256:6b11e71b4d00928898f494d8e2945b80aab0447a4f2e7fb4603ac32cccc4e28e"},
{file = "jq-1.4.1-cp37-cp37m-musllinux_1_1_i686.whl", hash = "sha256:3e4dd3ba62e284479528a5a00084c2923a08de7cb7fe154036a345190ed5bc24"},
{file = "jq-1.4.1-cp37-cp37m-musllinux_1_1_x86_64.whl", hash = "sha256:7dfa6ff7424339ed361d911a13635e7c2f888e18e42920a8603e8806d85fdfdc"},
{file = "jq-1.4.1-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:419f8d28e737b96476ac9ba66e000e4d93e54dd8003f1374269315086b98d822"},
{file = "jq-1.4.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:de27a580663825b493b061682b59704f29a748011f2e5bc4701b34f8f17ed405"},
{file = "jq-1.4.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ebfec7c54b3252ec59663a21885e97d49b1dd455d8db0223bb77073b9b248fc3"},
{file = "jq-1.4.1-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:56a21666412dd1a6b8306475d0ec6e1eba7965100b3dfd6ecf1eb537aabec513"},
{file = "jq-1.4.1-cp38-cp38-musllinux_1_1_aarch64.whl", hash = "sha256:f97b1e2582d64b65069f2d8b5e08f94f1d0998233c98c0d6edcf0a610262cd3a"},
{file = "jq-1.4.1-cp38-cp38-musllinux_1_1_i686.whl", hash = "sha256:33b5fcbf32c24557dd638e59b919f2ecfa98e65cf4b96f63c327ed10ea24495d"},
{file = "jq-1.4.1-cp38-cp38-musllinux_1_1_x86_64.whl", hash = "sha256:a16fb7e2e0942b4661a8d210e9ac3292b5f021abbcddbbcb6b783f9eb5d7a6cb"},
{file = "jq-1.4.1-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:4c4d6b9f30556d5f17552ac2ef8563872a2c0271cc7c8789c87546270135ae15"},
{file = "jq-1.4.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:3f82346544116503cbdfd56ac5e90f837c2b96d69b64a3444df2770156dc8d64"},
{file = "jq-1.4.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1799792f34ca8441fb1c4b3cf05c644ef2a4b28ad07bae65b1c7cde8f26721b4"},
{file = "jq-1.4.1-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:2403bfcaedbe860ffaa3258b65ad3dcf72d2d97c59acf6f8fd5f663a1b0a183a"},
{file = "jq-1.4.1-cp39-cp39-musllinux_1_1_aarch64.whl", hash = "sha256:c59ebcd4f0bb99d5d69085905c80d8ebf95df522750d95e33985121daa4e1de4"},
{file = "jq-1.4.1-cp39-cp39-musllinux_1_1_i686.whl", hash = "sha256:aa7fadeca796eb385b93217fb65ac2c54150ac3fcea2722c0c76390f0d6b2681"},
{file = "jq-1.4.1-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:11fb7e41c4931127cfe5c53b1eb812d797ed7d47a8ab22f6cb294cf470d5038b"},
{file = "jq-1.4.1-pp37-pypy37_pp73-macosx_10_9_x86_64.whl", hash = "sha256:fc8f67f7b8140e51bd291686055d63f62b60fa3bea861265309f54fd74f5517d"},
{file = "jq-1.4.1-pp37-pypy37_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:30ce02d9c01ffea7c92b4ec006b114c4047816f15016173dced3fc046760b854"},
{file = "jq-1.4.1-pp37-pypy37_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:bbbfdfbb0bc2d615edfa8213720423885c022a827ea3c8e8593bce98b6086c99"},
{file = "jq-1.4.1-pp37-pypy37_pp73-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:9053a8e9f3636d367e8bb0841a62d839f2116e6965096d95c38a8f9da57eed66"},
{file = "jq-1.4.1-pp38-pypy38_pp73-macosx_10_9_x86_64.whl", hash = "sha256:3ecdffb3abc9f1611465b761eebcdb3008ae57946a86a99e76bc6b09fe611f29"},
{file = "jq-1.4.1-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:e5f0688f98dedb49a5c680b961a4f453fe84b34795aa3203eec77f306fa823d5"},
{file = "jq-1.4.1-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:342f901a9330d12d2c2baf17684b77ae198fade920d061bb844d1b3733097792"},
{file = "jq-1.4.1-pp38-pypy38_pp73-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:761713740c19dd0e0da8b6eaea7f588df2af64d8e32d1157a3a05028b0fec2b3"},
{file = "jq-1.4.1-pp39-pypy39_pp73-macosx_10_9_x86_64.whl", hash = "sha256:6343d929e48ba4d75febcd987752931dc7a70e1b2f6f17b74baf3d5179dfb6a5"},
{file = "jq-1.4.1-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4ec82f8925f7a88547cd302f2b479c81af17468dbd3473d688c3714a264f90c0"},
{file = "jq-1.4.1-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:95edc023b97d1a44fd1e8243119a3532bc0e7d121dfdf2722471ec36763b85aa"},
{file = "jq-1.4.1-pp39-pypy39_pp73-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:cc4dd73782c039c66b25fc103b07fd46bac5d2f5a62dba29b45ae97ca88ba988"},
{file = "jq-1.4.1.tar.gz", hash = "sha256:52284ee3cb51670e6f537b0ec813654c064c1c0705bd910097ea0fe17313516d"},
]
[[package]]
name = "jsonlines"
version = "3.1.0"
@ -9617,7 +9682,7 @@ cffi = {version = ">=1.11", markers = "platform_python_implementation == \"PyPy\
cffi = ["cffi (>=1.11)"]
[extras]
all = ["O365", "aleph-alpha-client", "anthropic", "arxiv", "atlassian-python-api", "azure-cosmos", "azure-identity", "beautifulsoup4", "clickhouse-connect", "cohere", "deeplake", "duckduckgo-search", "elasticsearch", "faiss-cpu", "google-api-python-client", "google-search-results", "gptcache", "html2text", "huggingface_hub", "jina", "jinja2", "lancedb", "lark", "manifest-ml", "networkx", "nlpcloud", "nltk", "nomic", "openai", "opensearch-py", "pexpect", "pgvector", "pinecone-client", "pinecone-text", "psycopg2-binary", "pyowm", "pypdf", "pytesseract", "pyvespa", "qdrant-client", "redis", "sentence-transformers", "spacy", "tensorflow-text", "tiktoken", "torch", "transformers", "weaviate-client", "wikipedia", "wolframalpha"]
all = ["O365", "aleph-alpha-client", "anthropic", "arxiv", "atlassian-python-api", "azure-cosmos", "azure-identity", "beautifulsoup4", "clickhouse-connect", "cohere", "deeplake", "duckduckgo-search", "elasticsearch", "faiss-cpu", "google-api-python-client", "google-search-results", "gptcache", "html2text", "huggingface_hub", "jina", "jinja2", "jq", "lancedb", "lark", "manifest-ml", "networkx", "nlpcloud", "nltk", "nomic", "openai", "opensearch-py", "pexpect", "pgvector", "pinecone-client", "pinecone-text", "psycopg2-binary", "pyowm", "pypdf", "pytesseract", "pyvespa", "qdrant-client", "redis", "sentence-transformers", "spacy", "tensorflow-text", "tiktoken", "torch", "transformers", "weaviate-client", "wikipedia", "wolframalpha"]
azure = ["azure-core", "azure-cosmos", "azure-identity", "openai"]
cohere = ["cohere"]
embeddings = ["sentence-transformers"]
@ -9628,4 +9693,4 @@ qdrant = ["qdrant-client"]
[metadata]
lock-version = "2.0"
python-versions = ">=3.8.1,<4.0"
content-hash = "aad9c9a1fc1b6fbd67225e0762298a49b3837a42ecb564a19f6161c2c37d0fd4"
content-hash = "2352db14ae75227c4d1ab34d48c74da3a16ceaeb5c5fa5df1a1dfcc5ae8e69e6"


@ -76,6 +76,7 @@ lancedb = {version = "^0.1", optional = true}
pexpect = {version = "^4.8.0", optional = true}
pyvespa = {version = "^0.33.0", optional = true}
O365 = {version = "^2.0.26", optional = true}
jq = {version = "^1.4.1", optional = true}
[tool.poetry.group.docs.dependencies]
autodoc_pydantic = "^1.8.0"
@ -156,7 +157,7 @@ openai = ["openai"]
cohere = ["cohere"]
embeddings = ["sentence-transformers"]
azure = ["azure-identity", "azure-cosmos", "openai", "azure-core"]
all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "jina", "manifest-ml", "elasticsearch", "opensearch-py", "google-search-results", "faiss-cpu", "sentence-transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "pinecone-text", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx", "nomic", "aleph-alpha-client", "deeplake", "pgvector", "psycopg2-binary", "boto3", "pyowm", "pytesseract", "html2text", "atlassian-python-api", "gptcache", "duckduckgo-search", "arxiv", "azure-identity", "clickhouse-connect", "azure-cosmos", "lancedb", "lark", "pexpect", "pyvespa", "O365"]
all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "jina", "manifest-ml", "elasticsearch", "opensearch-py", "google-search-results", "faiss-cpu", "sentence-transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "pinecone-text", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx", "nomic", "aleph-alpha-client", "deeplake", "pgvector", "psycopg2-binary", "boto3", "pyowm", "pytesseract", "html2text", "atlassian-python-api", "gptcache", "duckduckgo-search", "arxiv", "azure-identity", "clickhouse-connect", "azure-cosmos", "lancedb", "lark", "pexpect", "pyvespa", "O365", "jq"]
[tool.ruff]
select = [


@ -0,0 +1,16 @@
from pathlib import Path

from langchain.document_loaders import JSONLoader


def test_json_loader() -> None:
    """Test JSON loader."""
    file_path = Path(__file__).parent.parent / "examples/example.json"
    loader = JSONLoader(str(file_path), ".messages[].content")
    docs = loader.load()

    # Check that the correct number of documents is loaded.
    assert len(docs) == 3

    # Make sure that None content is converted to an empty string.
    assert docs[-1].page_content == ""


@ -0,0 +1,25 @@
{
"messages": [
{
"sender_name": "User 2",
"timestamp_ms": 1675597571851,
"content": "Bye!"
},
{
"sender_name": "User 1",
"timestamp_ms": 1675597435669,
"content": "Oh no worries! Bye"
},
{
"sender_name": "User 2",
"timestamp_ms": 1675595060730,
"photos": [
{
"uri": "url_of_some_picture.jpg",
"creation_timestamp": 1675595059
}
]
}
],
"title": "User 1 and User 2 chat"
}