JSON loader (#4067)

This adds a loader for text passages stored in JSON files. A `jq` schema selects the relevant content from the JSON file. The loader depends on the `jq` package: https://pypi.org/project/jq/.

Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>
1 year ago · 6567b73e1a
parent bb6d97c18c
commit 6567b73e1a
7 changed files with 584 additions and 4 deletions
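
As a rough illustration of what the jq schemas used in this change select, the sketch below reproduces three of them in plain Python. It does not use the `jq` package and is not how `JSONLoader` evaluates a schema; the helper names and sample data are hypothetical.

```python
import json

# Hypothetical sample inputs, one per JSON shape.
LIST_OF_OBJECTS = json.loads('[{"text": "a"}, {"text": "b"}]')
KEYED_LIST = json.loads('{"key": [{"text": "a"}, {"text": "b"}]}')
LIST_OF_STRINGS = json.loads('["a", "b"]')


def select_list_text(data):
    """Plain-Python counterpart of the jq schema `.[].text`."""
    # A missing "text" key yields None, much as jq yields null.
    return [item.get("text") for item in data]


def select_keyed_text(data, key="key"):
    """Plain-Python counterpart of the jq schema `.key[].text`."""
    return [item.get("text") for item in data[key]]


def select_items(data):
    """Plain-Python counterpart of the jq schema `.[]`."""
    return list(data)


print(select_list_text(LIST_OF_OBJECTS))
print(select_keyed_text(KEYED_LIST))
print(select_items(LIST_OF_STRINGS))
```

Each helper returns one value per array element, which corresponds to the one-`Document`-per-result behavior of the loader added in this commit.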
--- a/docs/modules/indexes/document_loaders/examples/json_loader.ipynb
+++ b/docs/modules/indexes/document_loaders/examples/json_loader.ipynb
@@ -0,0 +1,367 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# JSON Files\n",
+    "\n",
+    "The `JSONLoader` uses a specified [jq schema](https://en.wikipedia.org/wiki/Jq_(programming_language)) to parse the JSON files.\n",
+    "\n",
+    "This notebook shows how to use the `JSONLoader` to load [JSON](https://en.wikipedia.org/wiki/JSON) files into documents. A few examples of `jq` schema extracting different parts of a JSON file are also shown.\n",
+    "\n",
+    "See the [manual](https://stedolan.github.io/jq/manual/#Basicfilters) for detailed documentation of the `jq` syntax."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install jq"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "jupyter": {
+     "outputs_hidden": true
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from langchain.document_loaders import JSONLoader"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "from pathlib import Path\n",
+    "from pprint import pprint\n",
+    "\n",
+    "\n",
+    "file_path='./example_data/facebook_chat.json'\n",
+    "data = json.loads(Path(file_path).read_text())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_chat.jpg'},\n",
+      " 'is_still_participant': True,\n",
+      " 'joinable_mode': {'link': '', 'mode': 1},\n",
+      " 'magic_words': [],\n",
+      " 'messages': [{'content': 'Bye!',\n",
+      "               'sender_name': 'User 2',\n",
+      "               'timestamp_ms': 1675597571851},\n",
+      "              {'content': 'Oh no worries! Bye',\n",
+      "               'sender_name': 'User 1',\n",
+      "               'timestamp_ms': 1675597435669},\n",
+      "              {'content': 'No Im sorry it was my mistake, the blue one is not '\n",
+      "                          'for sale',\n",
+      "               'sender_name': 'User 2',\n",
+      "               'timestamp_ms': 1675596277579},\n",
+      "              {'content': 'I thought you were selling the blue one!',\n",
+      "               'sender_name': 'User 1',\n",
+      "               'timestamp_ms': 1675595140251},\n",
+      "              {'content': 'Im not interested in this bag. Im interested in the '\n",
+      "                          'blue one!',\n",
+      "               'sender_name': 'User 1',\n",
+      "               'timestamp_ms': 1675595109305},\n",
+      "              {'content': 'Here is $129',\n",
+      "               'sender_name': 'User 2',\n",
+      "               'timestamp_ms': 1675595068468},\n",
+      "              {'photos': [{'creation_timestamp': 1675595059,\n",
+      "                           'uri': 'url_of_some_picture.jpg'}],\n",
+      "               'sender_name': 'User 2',\n",
+      "               'timestamp_ms': 1675595060730},\n",
+      "              {'content': 'Online is at least $100',\n",
+      "               'sender_name': 'User 2',\n",
+      "               'timestamp_ms': 1675595045152},\n",
+      "              {'content': 'How much do you want?',\n",
+      "               'sender_name': 'User 1',\n",
+      "               'timestamp_ms': 1675594799696},\n",
+      "              {'content': 'Goodmorning! $50 is too low.',\n",
+      "               'sender_name': 'User 2',\n",
+      "               'timestamp_ms': 1675577876645},\n",
+      "              {'content': 'Hi! Im interested in your bag. Im offering $50. Let '\n",
+      "                          'me know if you are interested. Thanks!',\n",
+      "               'sender_name': 'User 1',\n",
+      "               'timestamp_ms': 1675549022673}],\n",
+      " 'participants': [{'name': 'User 1'}, {'name': 'User 2'}],\n",
+      " 'thread_path': 'inbox/User 1 and User 2 chat',\n",
+      " 'title': 'User 1 and User 2 chat'}\n"
+     ]
+    }
+   ],
+   "source": [
+    "pprint(data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Using `JSONLoader`\n",
+    "\n",
+    "Suppose we are interested in extracting the values under the `content` field within the `messages` key of the JSON data. This can easily be done through the `JSONLoader` as shown below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = JSONLoader(\n",
+    "    file_path='./example_data/facebook_chat.json',\n",
+    "    jq_schema='.messages[].content')\n",
+    "\n",
+    "data = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[Document(page_content='Bye!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1}),\n",
+      " Document(page_content='Oh no worries! Bye', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2}),\n",
+      " Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 3}),\n",
+      " Document(page_content='I thought you were selling the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 4}),\n",
+      " Document(page_content='Im not interested in this bag. Im interested in the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 5}),\n",
+      " Document(page_content='Here is $129', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 6}),\n",
+      " Document(page_content='', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 7}),\n",
+      " Document(page_content='Online is at least $100', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 8}),\n",
+      " Document(page_content='How much do you want?', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 9}),\n",
+      " Document(page_content='Goodmorning! $50 is too low.', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 10}),\n",
+      " Document(page_content='Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 11})]\n"
+     ]
+    }
+   ],
+   "source": [
+    "pprint(data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Extracting metadata\n",
+    "\n",
+    "Generally, we want to include metadata available in the JSON file in the documents that we create from the content.\n",
+    "\n",
+    "The following demonstrates how metadata can be extracted using the `JSONLoader`.\n",
+    "\n",
+    "There are some key changes to note. In the previous example, where we did not collect metadata, the schema directly specified where the value for `page_content` is extracted from.\n",
+    "\n",
+    "```\n",
+    ".messages[].content\n",
+    "```\n",
+    "\n",
+    "In the current example, we have to tell the loader to iterate over the records in the `messages` field. The jq_schema then has to be:\n",
+    "\n",
+    "```\n",
+    ".messages[]\n",
+    "```\n",
+    "\n",
+    "This allows us to pass the records (dict) into the `metadata_func` that has to be implemented. The `metadata_func` is responsible for identifying which pieces of information in the record should be included in the metadata stored in the final `Document` object.\n",
+    "\n",
+    "Additionally, we now have to explicitly specify in the loader, via the `content_key` argument, the key in the record from which the value for `page_content` is extracted."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Define the metadata extraction function.\n",
+    "def metadata_func(record: dict, metadata: dict) -> dict:\n",
+    "\n",
+    "    metadata[\"sender_name\"] = record.get(\"sender_name\")\n",
+    "    metadata[\"timestamp_ms\"] = record.get(\"timestamp_ms\")\n",
+    "\n",
+    "    return metadata\n",
+    "\n",
+    "\n",
+    "loader = JSONLoader(\n",
+    "    file_path='./example_data/facebook_chat.json',\n",
+    "    jq_schema='.messages[]',\n",
+    "    content_key=\"content\",\n",
+    "    metadata_func=metadata_func\n",
+    ")\n",
+    "\n",
+    "data = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[Document(page_content='Bye!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1, 'sender_name': 'User 2', 'timestamp_ms': 1675597571851}),\n",
+      " Document(page_content='Oh no worries! Bye', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2, 'sender_name': 'User 1', 'timestamp_ms': 1675597435669}),\n",
+      " Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 3, 'sender_name': 'User 2', 'timestamp_ms': 1675596277579}),\n",
+      " Document(page_content='I thought you were selling the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 4, 'sender_name': 'User 1', 'timestamp_ms': 1675595140251}),\n",
+      " Document(page_content='Im not interested in this bag. Im interested in the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 5, 'sender_name': 'User 1', 'timestamp_ms': 1675595109305}),\n",
+      " Document(page_content='Here is $129', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 6, 'sender_name': 'User 2', 'timestamp_ms': 1675595068468}),\n",
+      " Document(page_content='', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 7, 'sender_name': 'User 2', 'timestamp_ms': 1675595060730}),\n",
+      " Document(page_content='Online is at least $100', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 8, 'sender_name': 'User 2', 'timestamp_ms': 1675595045152}),\n",
+      " Document(page_content='How much do you want?', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 9, 'sender_name': 'User 1', 'timestamp_ms': 1675594799696}),\n",
+      " Document(page_content='Goodmorning! $50 is too low.', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 10, 'sender_name': 'User 2', 'timestamp_ms': 1675577876645}),\n",
+      " Document(page_content='Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 11, 'sender_name': 'User 1', 'timestamp_ms': 1675549022673})]\n"
+     ]
+    }
+   ],
+   "source": [
+    "pprint(data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now, you will see that the documents contain the metadata associated with the content we extracted."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## The `metadata_func`\n",
+    "\n",
+    "As shown above, the `metadata_func` accepts the default metadata generated by the `JSONLoader`. This gives the user full control over how the metadata is formatted.\n",
+    "\n",
+    "For example, the default metadata contains the `source` and the `seq_num` keys. However, it is possible that the JSON data contain these keys as well. The user can then exploit the `metadata_func` to rename the default keys and use the ones from the JSON data.\n",
+    "\n",
+    "The example below shows how we can modify the `source` so that it contains only the part of the file path starting at the `langchain` directory."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Define the metadata extraction function.\n",
+    "def metadata_func(record: dict, metadata: dict) -> dict:\n",
+    "\n",
+    "    metadata[\"sender_name\"] = record.get(\"sender_name\")\n",
+    "    metadata[\"timestamp_ms\"] = record.get(\"timestamp_ms\")\n",
+    "    \n",
+    "    if \"source\" in metadata:\n",
+    "        source = metadata[\"source\"].split(\"/\")\n",
+    "        source = source[source.index(\"langchain\"):]\n",
+    "        metadata[\"source\"] = \"/\".join(source)\n",
+    "\n",
+    "    return metadata\n",
+    "\n",
+    "\n",
+    "loader = JSONLoader(\n",
+    "    file_path='./example_data/facebook_chat.json',\n",
+    "    jq_schema='.messages[]',\n",
+    "    content_key=\"content\",\n",
+    "    metadata_func=metadata_func\n",
+    ")\n",
+    "\n",
+    "data = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[Document(page_content='Bye!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1, 'sender_name': 'User 2', 'timestamp_ms': 1675597571851}),\n",
+      " Document(page_content='Oh no worries! Bye', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2, 'sender_name': 'User 1', 'timestamp_ms': 1675597435669}),\n",
+      " Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 3, 'sender_name': 'User 2', 'timestamp_ms': 1675596277579}),\n",
+      " Document(page_content='I thought you were selling the blue one!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 4, 'sender_name': 'User 1', 'timestamp_ms': 1675595140251}),\n",
+      " Document(page_content='Im not interested in this bag. Im interested in the blue one!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 5, 'sender_name': 'User 1', 'timestamp_ms': 1675595109305}),\n",
+      " Document(page_content='Here is $129', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 6, 'sender_name': 'User 2', 'timestamp_ms': 1675595068468}),\n",
+      " Document(page_content='', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 7, 'sender_name': 'User 2', 'timestamp_ms': 1675595060730}),\n",
+      " Document(page_content='Online is at least $100', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 8, 'sender_name': 'User 2', 'timestamp_ms': 1675595045152}),\n",
+      " Document(page_content='How much do you want?', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 9, 'sender_name': 'User 1', 'timestamp_ms': 1675594799696}),\n",
+      " Document(page_content='Goodmorning! $50 is too low.', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 10, 'sender_name': 'User 2', 'timestamp_ms': 1675577876645}),\n",
+      " Document(page_content='Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 11, 'sender_name': 'User 1', 'timestamp_ms': 1675549022673})]\n"
+     ]
+    }
+   ],
+   "source": [
+    "pprint(data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Common JSON structures with jq schema\n",
+    "\n",
+    "The list below is a reference for the `jq_schema` values that can be used to extract content from the JSON data, depending on its structure.\n",
+    "\n",
+    "```\n",
+    "JSON        -> [{\"text\": ...}, {\"text\": ...}, {\"text\": ...}]\n",
+    "jq_schema   -> \".[].text\"\n",
+    "        \n",
+    "JSON        -> {\"key\": [{\"text\": ...}, {\"text\": ...}, {\"text\": ...}]}\n",
+    "jq_schema   -> \".key[].text\"\n",
+    "\n",
+    "JSON        -> [\"...\", \"...\", \"...\"]\n",
+    "jq_schema   -> \".[]\"\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.16"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
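
The notebook's second `metadata_func` calls `source.index("langchain")`, which raises `ValueError` when the resolved path has no `langchain` component. A more defensive variant might look like the sketch below. It assumes POSIX-style paths, and `trim_source` is a hypothetical helper that is not part of this change.

```python
from pathlib import PurePosixPath


def trim_source(source: str, anchor: str = "langchain") -> str:
    """Return `source` starting at the first path component equal to `anchor`.

    If `anchor` is not a component of the path, `source` is returned
    unchanged instead of raising.
    """
    parts = PurePosixPath(source).parts
    if anchor not in parts:
        return source
    return "/".join(parts[parts.index(anchor):])


def metadata_func(record: dict, metadata: dict) -> dict:
    # Same fields as in the notebook example.
    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")

    if "source" in metadata:
        metadata["source"] = trim_source(metadata["source"])

    return metadata


print(trim_source("/home/user/langchain/docs/example_data/facebook_chat.json"))
print(trim_source("/tmp/data/facebook_chat.json"))
```

The function keeps the `(record, metadata) -> dict` signature that the loader expects, so it can be passed as `metadata_func` in the same way as the notebook version.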
--- a/langchain/document_loaders/__init__.py
+++ b/langchain/document_loaders/__init__.py
@@ -45,6 +45,7 @@ from langchain.document_loaders.ifixit import IFixitLoader
 from langchain.document_loaders.image import UnstructuredImageLoader
 from langchain.document_loaders.image_captions import ImageCaptionLoader
 from langchain.document_loaders.imsdb import IMSDbLoader
+from langchain.document_loaders.json_loader import JSONLoader
 from langchain.document_loaders.markdown import UnstructuredMarkdownLoader
 from langchain.document_loaders.mediawikidump import MWDumpLoader
 from langchain.document_loaders.modern_treasury import ModernTreasuryLoader
@@ -144,6 +145,7 @@ __all__ = [
    "IFixitLoader",
    "IMSDbLoader",
    "ImageCaptionLoader",
+    "JSONLoader",
    "ModernTreasuryLoader",
    "MWDumpLoader",
    "NotebookLoader",
--- a/langchain/document_loaders/json_loader.py
+++ b/langchain/document_loaders/json_loader.py
@@ -0,0 +1,104 @@
+"""Loader that loads data from JSON."""
+import json
+from pathlib import Path
+from typing import Callable, Dict, List, Optional, Union
+
+from langchain.docstore.document import Document
+from langchain.document_loaders.base import BaseLoader
+
+
+class JSONLoader(BaseLoader):
+    """Loads a JSON file and references a jq schema provided to load the text into
+    documents.
+
+    Example:
+        [{"text": ...}, {"text": ...}, {"text": ...}] -> schema = .[].text
+        {"key": [{"text": ...}, {"text": ...}, {"text": ...}]} -> schema = .key[].text
+        ["", "", ""] -> schema = .[]
+    """
+
+    def __init__(
+        self,
+        file_path: Union[str, Path],
+        jq_schema: str,
+        content_key: Optional[str] = None,
+        metadata_func: Optional[Callable[[Dict, Dict], Dict]] = None,
+    ):
+        """Initialize the JSONLoader.
+
+        Args:
+            file_path (Union[str, Path]): The path to the JSON file.
+            jq_schema (str): The jq schema to use to extract the data or text from
+                the JSON.
+            content_key (str): The key to use to extract the content from the JSON if
+                the jq_schema results in a list of objects (dict).
+            metadata_func (Callable[[Dict, Dict], Dict]): A function that takes in the
+                JSON object extracted by the jq_schema and the default metadata and
+                returns a dict of the updated metadata.
+        """
+        try:
+            import jq  # noqa:F401
+        except ImportError:
+            raise ValueError(
+                "jq package not found, please install it with `pip install jq`"
+            )
+
+        self.file_path = Path(file_path).resolve()
+        self._jq_schema = jq.compile(jq_schema)
+        self._content_key = content_key
+        self._metadata_func = metadata_func
+
+    def load(self) -> List[Document]:
+        """Load and return documents from the JSON file."""
+
+        data = self._jq_schema.input(json.loads(self.file_path.read_text()))
+
+        # Perform some validation
+        # This is not a perfect validation, but it should catch most cases
+        # and prevent the user from getting a cryptic error later on.
+        if self._content_key is not None:
+            sample = data.first()
+            if not isinstance(sample, dict):
+                raise ValueError(
+                    "Expected the jq schema to result in a list of objects (dict), "
+                    f"so sample must be a dict but got `{type(sample)}`"
+                )
+
+            if sample.get(self._content_key) is None:
+                raise ValueError(
+                    "Expected the jq schema to result in a list of objects (dict) "
+                    f"with the key `{self._content_key}`"
+                )
+
+            if self._metadata_func is not None:
+                sample_metadata = self._metadata_func(sample, {})
+                if not isinstance(sample_metadata, dict):
+                    raise ValueError(
+                        "Expected the metadata_func to return a dict but got "
+                        f"`{type(sample_metadata)}`"
+                    )
+
+        docs = []
+
+        for i, sample in enumerate(data, 1):
+            metadata = dict(
+                source=str(self.file_path),
+                seq_num=i,
+            )
+
+            if self._content_key is not None:
+                text = sample.get(self._content_key)
+                if self._metadata_func is not None:
+                    # We pass in the metadata dict to the metadata_func
+                    # so that the user can customize the default metadata
+                    # based on the content of the JSON object.
+                    metadata = self._metadata_func(sample, metadata)
+            else:
+                text = sample
+
+            # In case the text is None, set it to an empty string
+            text = text or ""
+
+            docs.append(Document(page_content=text, metadata=metadata))
+
+        return docs
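
To make the document-building loop in `load` easier to follow without installing `jq` or `langchain`, here is a minimal standalone sketch of the same logic. It assumes the records have already been extracted by the schema, and it returns `(page_content, metadata)` tuples rather than `Document` objects. `build_documents` and `add_sender` are hypothetical names used only for this illustration.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple

MetadataFunc = Callable[[Dict, Dict], Dict]


def build_documents(
    records: Iterable[Any],
    source: str,
    content_key: Optional[str] = None,
    metadata_func: Optional[MetadataFunc] = None,
) -> List[Tuple[str, Dict]]:
    """Mirror the loop in `JSONLoader.load` on already-extracted records."""
    docs = []
    for seq_num, record in enumerate(records, 1):
        # Default metadata: the source path and a 1-based sequence number.
        metadata = {"source": source, "seq_num": seq_num}
        if content_key is not None:
            text = record.get(content_key)
            if metadata_func is not None:
                metadata = metadata_func(record, metadata)
        else:
            text = record
        # A missing or null value becomes an empty string, as in the loader.
        docs.append((text or "", metadata))
    return docs


def add_sender(record: dict, metadata: dict) -> dict:
    metadata["sender_name"] = record.get("sender_name")
    return metadata


records = [
    {"content": "Bye!", "sender_name": "User 2"},
    {"photos": [], "sender_name": "User 2"},  # no "content" key
]

docs = build_documents(records, "chat.json", "content", add_sender)
print(docs)
```

The second record has no `content` key, so it produces an empty `page_content`, which matches the empty document at `seq_num` 7 in the notebook output. The sketch omits the up-front validation that `load` performs on the first record.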
--- a/poetry.lock
+++ b/poetry.lock
@@ -1,4 +1,4 @@
-# This file is automatically @generated by Poetry 1.4.0 and should not be changed by hand.
+# This file is automatically @generated by Poetry 1.4.2 and should not be changed by hand.

 [[package]]
 name = "absl-py"
@@ -3220,6 +3220,71 @@ files = [
    {file = "joblib-1.2.0.tar.gz", hash = "sha256:e1cee4a79e4af22881164f218d4311f60074197fb707e082e803b61f6d137018"},
 ]

+[[package]]
+name = "jq"
+version = "1.4.1"
+description = "jq is a lightweight and flexible JSON processor."
+category = "main"
+optional = true
+python-versions = ">=3.5"
+files = [
+    {file = "jq-1.4.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:1708cad6ee0f173ce38c6ebfc81b98a545b35387ae6471c8d7f9f3a02ffb723e"},
+    {file = "jq-1.4.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c94e70e5f0798d87018cd4a58175f4eed2afa08727389a0f3f246bf7e7b98d1e"},
+    {file = "jq-1.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ddc2c6b55c5461c6f155c4b717927bdd29a83a6356250c4e6016297bcea80498"},
+    {file = "jq-1.4.1-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:e2e71f5a921542efbea12386ca9d91ea1aeb6bd393681073e4a47a720613715f"},
+    {file = "jq-1.4.1-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:b2bf666002d23ee8cf9e619d2d1e46d86a089e028367665386b9d67d22b31ceb"},
+    {file = "jq-1.4.1-cp310-cp310-musllinux_1_1_i686.whl", hash = "sha256:e33954fe47e61a533556d38e045ddd7b3fa8a8186a70981462a207ed22594d83"},
+    {file = "jq-1.4.1-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:07905774df7706588014ca49789548328e8f66738b004089b3f0c42f7f389405"},
+    {file = "jq-1.4.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:959b2e677e56dc31c8572c0852ad26d3b351a8a458ca72c96f8cedfcde49419f"},
+    {file = "jq-1.4.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:e74ab69d39b171f1625fa666baa8f9a1ff49e7295047082bcb537fcc2d359dfe"},
+    {file = "jq-1.4.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:103412f7f35175eb9a1005e4e2067b363dfcdb413d02fa962ddf288b2b16cc54"},
+    {file = "jq-1.4.1-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:1f70d5e0c6445cc58f720de2ab44c156c69ce6d898c4d4ad04f07815868e31ed"},
+    {file = "jq-1.4.1-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:db980118c02321c56b6e0ddf817ad1cbbd8b6c90f4637bdebb695e84ee41a296"},
+    {file = "jq-1.4.1-cp311-cp311-musllinux_1_1_i686.whl", hash = "sha256:9b295a51a9ea7e324aa7ad2ce2cca3d51d7492a525cd7a59773666a07b1cc0f7"},
+    {file = "jq-1.4.1-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:82b44474641dcdb07b43267d17f77914595768e9464b31de114e6c229a16ac6e"},
+    {file = "jq-1.4.1-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:582c40d7e212e310cf1ed0fddc4590853b64a5e09aed1f740613765c83cff072"},
+    {file = "jq-1.4.1-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:75f4269f709f746bf3d52df2c4ebc316d4985e0db97b7c1a293f02202befcdcb"},
+    {file = "jq-1.4.1-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1a060fd3172f8833828cb26151ea2f6c0f99f0191109ad580baee7befbdd6e65"},
+    {file = "jq-1.4.1-cp36-cp36m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:2bfd61be72ad1e35622a7525e55615954ccfbe6ccadabd7f964e879bb4a53ad6"},
+    {file = "jq-1.4.1-cp36-cp36m-musllinux_1_1_aarch64.whl", hash = "sha256:4364c45113407f1316a99bd7a8661aa9304eb3578c80b201917aa8568fa40ee1"},
+    {file = "jq-1.4.1-cp36-cp36m-musllinux_1_1_i686.whl", hash = "sha256:0a8c37073a335596c645f0260fd3ea7b6141c2fb0115a0b8082252b0169f70c8"},
+    {file = "jq-1.4.1-cp36-cp36m-musllinux_1_1_x86_64.whl", hash = "sha256:96e5160f77498389e388e7ba3cd1771abc386b52788c82dee897c95bc87efe6f"},
+    {file = "jq-1.4.1-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:fac91eb91bec60dee28e2325f863c43d12ffc904ee72248522c6d0157ae98a54"},
+    {file = "jq-1.4.1-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:581e771e7c4aad728f9696ce6faee0f3d535cb0c845a49ac20188d8c7918e19d"},
+    {file = "jq-1.4.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:31b6526533cbc298ae0c0084d22452fbd3b4600ace488dc961ecf9a1dcb51a83"},
+    {file = "jq-1.4.1-cp37-cp37m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:1830a9fd394673758010e41e8d0e00be7126b0ea9f3ede017a555c0c805435bc"},
+    {file = "jq-1.4.1-cp37-cp37m-musllinux_1_1_aarch64.whl", hash = "sha256:6b11e71b4d00928898f494d8e2945b80aab0447a4f2e7fb4603ac32cccc4e28e"},
+    {file = "jq-1.4.1-cp37-cp37m-musllinux_1_1_i686.whl", hash = "sha256:3e4dd3ba62e284479528a5a00084c2923a08de7cb7fe154036a345190ed5bc24"},
+    {file = "jq-1.4.1-cp37-cp37m-musllinux_1_1_x86_64.whl", hash = "sha256:7dfa6ff7424339ed361d911a13635e7c2f888e18e42920a8603e8806d85fdfdc"},
+    {file = "jq-1.4.1-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:419f8d28e737b96476ac9ba66e000e4d93e54dd8003f1374269315086b98d822"},
+    {file = "jq-1.4.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:de27a580663825b493b061682b59704f29a748011f2e5bc4701b34f8f17ed405"},
+    {file = "jq-1.4.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ebfec7c54b3252ec59663a21885e97d49b1dd455d8db0223bb77073b9b248fc3"},
+    {file = "jq-1.4.1-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:56a21666412dd1a6b8306475d0ec6e1eba7965100b3dfd6ecf1eb537aabec513"},
+    {file = "jq-1.4.1-cp38-cp38-musllinux_1_1_aarch64.whl", hash = "sha256:f97b1e2582d64b65069f2d8b5e08f94f1d0998233c98c0d6edcf0a610262cd3a"},
+    {file = "jq-1.4.1-cp38-cp38-musllinux_1_1_i686.whl", hash = "sha256:33b5fcbf32c24557dd638e59b919f2ecfa98e65cf4b96f63c327ed10ea24495d"},
+    {file = "jq-1.4.1-cp38-cp38-musllinux_1_1_x86_64.whl", hash = "sha256:a16fb7e2e0942b4661a8d210e9ac3292b5f021abbcddbbcb6b783f9eb5d7a6cb"},
+    {file = "jq-1.4.1-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:4c4d6b9f30556d5f17552ac2ef8563872a2c0271cc7c8789c87546270135ae15"},
+    {file = "jq-1.4.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:3f82346544116503cbdfd56ac5e90f837c2b96d69b64a3444df2770156dc8d64"},
+    {file = "jq-1.4.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1799792f34ca8441fb1c4b3cf05c644ef2a4b28ad07bae65b1c7cde8f26721b4"},
+    {file = "jq-1.4.1-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:2403bfcaedbe860ffaa3258b65ad3dcf72d2d97c59acf6f8fd5f663a1b0a183a"},
+    {file = "jq-1.4.1-cp39-cp39-musllinux_1_1_aarch64.whl", hash = "sha256:c59ebcd4f0bb99d5d69085905c80d8ebf95df522750d95e33985121daa4e1de4"},
+    {file = "jq-1.4.1-cp39-cp39-musllinux_1_1_i686.whl", hash = "sha256:aa7fadeca796eb385b93217fb65ac2c54150ac3fcea2722c0c76390f0d6b2681"},
+    {file = "jq-1.4.1-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:11fb7e41c4931127cfe5c53b1eb812d797ed7d47a8ab22f6cb294cf470d5038b"},
+    {file = "jq-1.4.1-pp37-pypy37_pp73-macosx_10_9_x86_64.whl", hash = "sha256:fc8f67f7b8140e51bd291686055d63f62b60fa3bea861265309f54fd74f5517d"},
+    {file = "jq-1.4.1-pp37-pypy37_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:30ce02d9c01ffea7c92b4ec006b114c4047816f15016173dced3fc046760b854"},
+    {file = "jq-1.4.1-pp37-pypy37_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:bbbfdfbb0bc2d615edfa8213720423885c022a827ea3c8e8593bce98b6086c99"},
+    {file = "jq-1.4.1-pp37-pypy37_pp73-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:9053a8e9f3636d367e8bb0841a62d839f2116e6965096d95c38a8f9da57eed66"},
+    {file = "jq-1.4.1-pp38-pypy38_pp73-macosx_10_9_x86_64.whl", hash = "sha256:3ecdffb3abc9f1611465b761eebcdb3008ae57946a86a99e76bc6b09fe611f29"},
+    {file = "jq-1.4.1-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:e5f0688f98dedb49a5c680b961a4f453fe84b34795aa3203eec77f306fa823d5"},
+    {file = "jq-1.4.1-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:342f901a9330d12d2c2baf17684b77ae198fade920d061bb844d1b3733097792"},
+    {file = "jq-1.4.1-pp38-pypy38_pp73-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:761713740c19dd0e0da8b6eaea7f588df2af64d8e32d1157a3a05028b0fec2b3"},
+    {file = "jq-1.4.1-pp39-pypy39_pp73-macosx_10_9_x86_64.whl", hash = "sha256:6343d929e48ba4d75febcd987752931dc7a70e1b2f6f17b74baf3d5179dfb6a5"},
+    {file = "jq-1.4.1-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4ec82f8925f7a88547cd302f2b479c81af17468dbd3473d688c3714a264f90c0"},
+    {file = "jq-1.4.1-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:95edc023b97d1a44fd1e8243119a3532bc0e7d121dfdf2722471ec36763b85aa"},
+    {file = "jq-1.4.1-pp39-pypy39_pp73-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:cc4dd73782c039c66b25fc103b07fd46bac5d2f5a62dba29b45ae97ca88ba988"},
+    {file = "jq-1.4.1.tar.gz", hash = "sha256:52284ee3cb51670e6f537b0ec813654c064c1c0705bd910097ea0fe17313516d"},
+]
+
 [[package]]
 name = "jsonlines"
 version = "3.1.0"
@ -9617,7 +9682,7 @@ cffi = {version = ">=1.11", markers = "platform_python_implementation == \"PyPy\
 cffi = ["cffi (>=1.11)"]

 [extras]
-all = ["O365", "aleph-alpha-client", "anthropic", "arxiv", "atlassian-python-api", "azure-cosmos", "azure-identity", "beautifulsoup4", "clickhouse-connect", "cohere", "deeplake", "duckduckgo-search", "elasticsearch", "faiss-cpu", "google-api-python-client", "google-search-results", "gptcache", "html2text", "huggingface_hub", "jina", "jinja2", "lancedb", "lark", "manifest-ml", "networkx", "nlpcloud", "nltk", "nomic", "openai", "opensearch-py", "pexpect", "pgvector", "pinecone-client", "pinecone-text", "psycopg2-binary", "pyowm", "pypdf", "pytesseract", "pyvespa", "qdrant-client", "redis", "sentence-transformers", "spacy", "tensorflow-text", "tiktoken", "torch", "transformers", "weaviate-client", "wikipedia", "wolframalpha"]
+all = ["O365", "aleph-alpha-client", "anthropic", "arxiv", "atlassian-python-api", "azure-cosmos", "azure-identity", "beautifulsoup4", "clickhouse-connect", "cohere", "deeplake", "duckduckgo-search", "elasticsearch", "faiss-cpu", "google-api-python-client", "google-search-results", "gptcache", "html2text", "huggingface_hub", "jina", "jinja2", "jq", "lancedb", "lark", "manifest-ml", "networkx", "nlpcloud", "nltk", "nomic", "openai", "opensearch-py", "pexpect", "pgvector", "pinecone-client", "pinecone-text", "psycopg2-binary", "pyowm", "pypdf", "pytesseract", "pyvespa", "qdrant-client", "redis", "sentence-transformers", "spacy", "tensorflow-text", "tiktoken", "torch", "transformers", "weaviate-client", "wikipedia", "wolframalpha"]
 azure = ["azure-core", "azure-cosmos", "azure-identity", "openai"]
 cohere = ["cohere"]
 embeddings = ["sentence-transformers"]
@ -9628,4 +9693,4 @@ qdrant = ["qdrant-client"]
 [metadata]
 lock-version = "2.0"
 python-versions = ">=3.8.1,<4.0"
-content-hash = "aad9c9a1fc1b6fbd67225e0762298a49b3837a42ecb564a19f6161c2c37d0fd4"
+content-hash = "2352db14ae75227c4d1ab34d48c74da3a16ceaeb5c5fa5df1a1dfcc5ae8e69e6"
--- a/pyproject.toml
+++ b/pyproject.toml
@ -76,6 +76,7 @@ lancedb = {version = "^0.1", optional = true}
 pexpect = {version = "^4.8.0", optional = true}
 pyvespa = {version = "^0.33.0", optional = true}
 O365 = {version = "^2.0.26", optional = true}
+jq = {version = "^1.4.1", optional = true}

 [tool.poetry.group.docs.dependencies]
 autodoc_pydantic = "^1.8.0"
@ -156,7 +157,7 @@ openai = ["openai"]
 cohere = ["cohere"]
 embeddings = ["sentence-transformers"]
 azure = ["azure-identity", "azure-cosmos", "openai", "azure-core"]
-all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "jina", "manifest-ml", "elasticsearch", "opensearch-py", "google-search-results", "faiss-cpu", "sentence-transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "pinecone-text", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx", "nomic", "aleph-alpha-client", "deeplake", "pgvector", "psycopg2-binary", "boto3", "pyowm", "pytesseract", "html2text", "atlassian-python-api", "gptcache", "duckduckgo-search", "arxiv", "azure-identity", "clickhouse-connect", "azure-cosmos", "lancedb", "lark", "pexpect", "pyvespa", "O365"]
+all = ["anthropic", "cohere", "openai", "nlpcloud", "huggingface_hub", "jina", "manifest-ml", "elasticsearch", "opensearch-py", "google-search-results", "faiss-cpu", "sentence-transformers", "transformers", "spacy", "nltk", "wikipedia", "beautifulsoup4", "tiktoken", "torch", "jinja2", "pinecone-client", "pinecone-text", "weaviate-client", "redis", "google-api-python-client", "wolframalpha", "qdrant-client", "tensorflow-text", "pypdf", "networkx", "nomic", "aleph-alpha-client", "deeplake", "pgvector", "psycopg2-binary", "boto3", "pyowm", "pytesseract", "html2text", "atlassian-python-api", "gptcache", "duckduckgo-search", "arxiv", "azure-identity", "clickhouse-connect", "azure-cosmos", "lancedb", "lark", "pexpect", "pyvespa", "O365", "jq"]

 [tool.ruff]
 select = [
--- a/tests/integration_tests/document_loaders/test_json_loader.py
+++ b/tests/integration_tests/document_loaders/test_json_loader.py
@ -0,0 +1,16 @@
+from pathlib import Path
+
+from langchain.document_loaders import JSONLoader
+
+
+def test_json_loader() -> None:
+    """Test JSON loader."""
+    file_path = Path(__file__).parent.parent / "examples/example.json"
+    loader = JSONLoader(str(file_path), ".messages[].content")
+    docs = loader.load()
+
+    # Check that the correct number of documents is loaded.
+    assert len(docs) == 3
+
+    # Make sure that missing (None) content is converted to an empty string.
+    assert docs[-1].page_content == ""
--- a/tests/integration_tests/examples/example.json
+++ b/tests/integration_tests/examples/example.json
@ -0,0 +1,25 @@
+{
+    "messages": [
+        {
+            "sender_name": "User 2",
+            "timestamp_ms": 1675597571851,
+            "content": "Bye!"
+        },
+        {
+            "sender_name": "User 1",
+            "timestamp_ms": 1675597435669,
+            "content": "Oh no worries! Bye"
+        },
+        {
+            "sender_name": "User 2",
+            "timestamp_ms": 1675595060730,
+            "photos": [
+                {
+                    "uri": "url_of_some_picture.jpg",
+                    "creation_timestamp": 1675595059
+                }
+            ]
+        }
+    ],
+    "title": "User 1 and User 2 chat"
+}
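
Note on the fixture and the test above: the third message has no "content" key, which is why the test expects three documents and an empty final page_content. The sketch below illustrates that expectation using only the standard library. It is not the loader's implementation and does not call `jq`; `extract_contents` is a hypothetical helper that approximates what the schema `.messages[].content` selects, assuming a missing key yields null/None and that None is then mapped to an empty string, as the test asserts.

```python
import json

# Inline copy of the fixture added in tests/integration_tests/examples/example.json.
EXAMPLE = json.loads("""
{
    "messages": [
        {"sender_name": "User 2", "timestamp_ms": 1675597571851, "content": "Bye!"},
        {"sender_name": "User 1", "timestamp_ms": 1675597435669, "content": "Oh no worries! Bye"},
        {"sender_name": "User 2", "timestamp_ms": 1675595060730,
         "photos": [{"uri": "url_of_some_picture.jpg", "creation_timestamp": 1675595059}]}
    ],
    "title": "User 1 and User 2 chat"
}
""")


def extract_contents(data):
    """Hypothetical stand-in for the jq schema `.messages[].content`."""
    # A missing "content" key yields None here, mirroring jq's null.
    raw = [message.get("content") for message in data["messages"]]
    # Mirror the behaviour the test asserts: None becomes an empty string.
    return ["" if value is None else value for value in raw]


page_contents = extract_contents(EXAMPLE)
print(page_contents)
```

One document is produced per element of `messages`, regardless of whether the element carries text, so a caller that wants to skip photo-only messages would need to filter out empty strings afterwards.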