{ "cells": [ { "cell_type": "markdown", "id": "79f24a6b", "metadata": {}, "source": [ "# File Directory\n", "\n", "This notebook covers how to use the `DirectoryLoader` to load all documents in a directory. By default, it uses the [UnstructuredFileLoader](./unstructured_file.ipynb) under the hood." ] }, { "cell_type": "code", "execution_count": 1, "id": "019d8520", "metadata": {}, "outputs": [], "source": [ "from langchain.document_loaders import DirectoryLoader" ] }, { "cell_type": "markdown", "id": "0c76cdc5", "metadata": {}, "source": [ "We can use the `glob` parameter to control which files to load. Note that the pattern below matches only `.md` files, so it doesn't load the `.rst` file or the `.ipynb` files." ] }, { "cell_type": "code", "execution_count": 2, "id": "891fe56f", "metadata": {}, "outputs": [], "source": [ "loader = DirectoryLoader('../', glob=\"**/*.md\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "addfe9cf", "metadata": {}, "outputs": [], "source": [ "docs = loader.load()" ] }, { "cell_type": "code", "execution_count": 4, "id": "b042086d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(docs)" ] }, { "cell_type": "markdown", "id": "e633d62f", "metadata": {}, "source": [ "## Show a progress bar" ] }, { "cell_type": "markdown", "id": "43911860", "metadata": {}, "source": [ "By default, a progress bar is not shown. To show one, install the `tqdm` library (e.g. `pip install tqdm`) and set the `show_progress` parameter to `True`."
] }, { "cell_type": "code", "execution_count": 10, "id": "bb93daac", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: tqdm in /Users/jon/.pyenv/versions/3.9.16/envs/microbiome-app/lib/python3.9/site-packages (4.65.0)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0it [00:00, ?it/s]\n" ] } ], "source": [ "%pip install tqdm\n", "loader = DirectoryLoader('../', glob=\"**/*.md\", show_progress=True)\n", "docs = loader.load()" ] }, { "cell_type": "markdown", "id": "c16ed46a", "metadata": {}, "source": [ "## Use multithreading" ] }, { "attachments": {}, "cell_type": "markdown", "id": "5752e23e", "metadata": {}, "source": [ "By default, loading happens in a single thread. To use several threads, set the `use_multithreading` flag to `True`." ] }, { "cell_type": "code", "execution_count": null, "id": "f8d84f52", "metadata": {}, "outputs": [], "source": [ "loader = DirectoryLoader('../', glob=\"**/*.md\", use_multithreading=True)\n", "docs = loader.load()" ] }, { "cell_type": "markdown", "id": "c5652850", "metadata": {}, "source": [ "## Change loader class\n", "\n", "By default, `DirectoryLoader` uses the `UnstructuredFileLoader` class. However, you can swap in a different loader by passing it as the `loader_cls` parameter."
] }, { "cell_type": "code", "execution_count": 15, "id": "81c92da3", "metadata": {}, "outputs": [], "source": [ "from langchain.document_loaders import TextLoader" ] }, { "cell_type": "code", "execution_count": 6, "id": "ab38ee36", "metadata": {}, "outputs": [], "source": [ "loader = DirectoryLoader('../', glob=\"**/*.md\", loader_cls=TextLoader)" ] }, { "cell_type": "code", "execution_count": 7, "id": "25c8740f", "metadata": {}, "outputs": [], "source": [ "docs = loader.load()" ] }, { "cell_type": "code", "execution_count": 8, "id": "38337763", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(docs)" ] }, { "cell_type": "markdown", "id": "598a2805", "metadata": {}, "source": [ "If you need to load Python source code files, use the `PythonLoader`." ] }, { "cell_type": "code", "execution_count": 14, "id": "c558bd73", "metadata": {}, "outputs": [], "source": [ "from langchain.document_loaders import PythonLoader" ] }, { "cell_type": "code", "execution_count": 13, "id": "a3cfaba7", "metadata": {}, "outputs": [], "source": [ "loader = DirectoryLoader('../../../../../', glob=\"**/*.py\", loader_cls=PythonLoader)" ] }, { "cell_type": "code", "execution_count": 14, "id": "e2e1e26a", "metadata": {}, "outputs": [], "source": [ "docs = loader.load()" ] }, { "cell_type": "code", "execution_count": 15, "id": "ffb8ff36", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "691" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(docs)" ] }, { "cell_type": "markdown", "id": "6411a0cb", "metadata": {}, "source": [ "## Auto-detect file encodings with TextLoader\n", "\n", "This example shows some strategies that are useful when loading a large set of arbitrary files from a directory with the `TextLoader` class.\n", "\n", "First, to illustrate the problem, let's try to load multiple text files with arbitrary encodings."
] }, { "cell_type": "code", "execution_count": 16, "id": "2c787a69", "metadata": {}, "outputs": [], "source": [ "path = '../../../../../tests/integration_tests/examples'\n", "loader = DirectoryLoader(path, glob=\"**/*.txt\", loader_cls=TextLoader)" ] }, { "cell_type": "markdown", "id": "e9001e12", "metadata": {}, "source": [ "### A. Default Behavior" ] }, { "cell_type": "code", "execution_count": 19, "id": "b1e88c31", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮\n", "│ /data/source/langchain/langchain/document_loaders/text.py:29 in load │\n", "│ │\n", "│ 26 │ │ text = \"\" │\n", "│ 27 │ │ with open(self.file_path, encoding=self.encoding) as f: │\n", "│ 28 │ │ │ try: │\n", "│ ❱ 29 │ │ │ │ text = f.read() │\n", "│ 30 │ │ │ except UnicodeDecodeError as e: │\n", "│ 31 │ │ │ │ if self.autodetect_encoding: │\n", "│ 32 │ │ │ │ │ detected_encodings = self.detect_file_encodings() │\n", "│ │\n", "│ /home/spike/.pyenv/versions/3.9.11/lib/python3.9/codecs.py:322 in decode │\n", "│ │\n", "│ 319 │ def decode(self, input, final=False): │\n", "│ 320 │ │ # decode input (taking the buffer into account) │\n", "│ 321 │ │ data = self.buffer + input │\n", "│ ❱ 322 │ │ (result, consumed) = self._buffer_decode(data, self.errors, final) │\n", "│ 323 │ │ # keep undecoded input until the next call │\n", "│ 324 │ │ self.buffer = data[consumed:] │\n", "│ 325 │ │ return result │\n", "╰──────────────────────────────────────────────────────────────────────────────────────────────────╯\n", "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 0: invalid continuation byte\n", "\n", "The above exception was the direct cause of the following exception:\n", "\n", "╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮\n", "│ in <module>:1 │\n", "│ │\n", "│ ❱ 1 loader.load() │\n", "│ 2 │\n", "│ │\n", "│ /data/source/langchain/langchain/document_loaders/directory.py:84 in load │\n", "│ │\n", "│ 81 │ │ │ │ │ │ if self.silent_errors: │\n", "│ 82 │ │ │ │ │ │ │ logger.warning(e) │\n", "│ 83 │ │ │ │ │ │ else: │\n", "│ ❱ 84 │ │ │ │ │ │ │ raise e │\n", "│ 85 │ │ │ │ │ finally: │\n", "│ 86 │ │ │ │ │ │ if pbar: │\n", "│ 87 │ │ │ │ │ │ │ pbar.update(1) │\n", "│ │\n", "│ /data/source/langchain/langchain/document_loaders/directory.py:78 in load │\n", "│ │\n", "│ 75 │ │ │ if i.is_file(): │\n", "│ 76 │ │ │ │ if 
_is_visible(i.relative_to(p)) or self.load_hidden: │\n", "│ 77 │ │ │ │ │ try: │\n", "│ ❱ 78 │ │ │ │ │ │ sub_docs = self.loader_cls(str(i), **self.loader_kwargs).load() │\n", "│ 79 │ │ │ │ │ │ docs.extend(sub_docs) │\n", "│ 80 │ │ │ │ │ except Exception as e: │\n", "│ 81 │ │ │ │ │ │ if self.silent_errors: │\n", "│ │\n", "│ /data/source/langchain/langchain/document_loaders/text.py:44 in load │\n", "│ │\n", "│ 41 │ │ │ │ │ │ except UnicodeDecodeError: │\n", "│ 42 │ │ │ │ │ │ │ continue │\n", "│ 43 │ │ │ │ else: │\n", "│ ❱ 44 │ │ │ │ │ raise RuntimeError(f\"Error loading {self.file_path}\") from e │\n", "│ 45 │ │ │ except Exception as e: │\n", "│ 46 │ │ │ │ raise RuntimeError(f\"Error loading {self.file_path}\") from e │\n", "│ 47 │\n", "╰──────────────────────────────────────────────────────────────────────────────────────────────────╯\n", "RuntimeError: Error loading ../../../../../tests/integration_tests/examples/example-non-utf8.txt\n", "\n" ], "text/plain": [ "\u001b[31m╭─\u001b[0m\u001b[31m──────────────────────────────\u001b[0m\u001b[31m \u001b[0m\u001b[1;31mTraceback \u001b[0m\u001b[1;2;31m(most recent call last)\u001b[0m\u001b[31m \u001b[0m\u001b[31m───────────────────────────────\u001b[0m\u001b[31m─╮\u001b[0m\n", "\u001b[31m│\u001b[0m \u001b[2;33m/data/source/langchain/langchain/document_loaders/\u001b[0m\u001b[1;33mtext.py\u001b[0m:\u001b[94m29\u001b[0m in \u001b[92mload\u001b[0m \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m \u001b[2m26 \u001b[0m\u001b[2m│ │ \u001b[0mtext = \u001b[33m\"\u001b[0m\u001b[33m\"\u001b[0m \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m \u001b[2m27 \u001b[0m\u001b[2m│ │ \u001b[0m\u001b[94mwith\u001b[0m \u001b[96mopen\u001b[0m(\u001b[96mself\u001b[0m.file_path, encoding=\u001b[96mself\u001b[0m.encoding) \u001b[94mas\u001b[0m f: \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m \u001b[2m28 \u001b[0m\u001b[2m│ │ │ \u001b[0m\u001b[94mtry\u001b[0m: \u001b[31m│\u001b[0m\n", 
"\u001b[31m│\u001b[0m \u001b[31m❱ \u001b[0m29 \u001b[2m│ │ │ │ \u001b[0mtext = f.read() \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m \u001b[2m30 \u001b[0m\u001b[2m│ │ │ \u001b[0m\u001b[94mexcept\u001b[0m \u001b[96mUnicodeDecodeError\u001b[0m \u001b[94mas\u001b[0m e: \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m \u001b[2m31 \u001b[0m\u001b[2m│ │ │ │ \u001b[0m\u001b[94mif\u001b[0m \u001b[96mself\u001b[0m.autodetect_encoding: \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m \u001b[2m32 \u001b[0m\u001b[2m│ │ │ │ │ \u001b[0mdetected_encodings = \u001b[96mself\u001b[0m.detect_file_encodings() \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m \u001b[2;33m/home/spike/.pyenv/versions/3.9.11/lib/python3.9/\u001b[0m\u001b[1;33mcodecs.py\u001b[0m:\u001b[94m322\u001b[0m in \u001b[92mdecode\u001b[0m \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m \u001b[2m 319 \u001b[0m\u001b[2m│ \u001b[0m\u001b[94mdef\u001b[0m \u001b[92mdecode\u001b[0m(\u001b[96mself\u001b[0m, \u001b[96minput\u001b[0m, final=\u001b[94mFalse\u001b[0m): \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m \u001b[2m 320 \u001b[0m\u001b[2m│ │ \u001b[0m\u001b[2m# decode input (taking the buffer into account)\u001b[0m \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m \u001b[2m 321 \u001b[0m\u001b[2m│ │ \u001b[0mdata = \u001b[96mself\u001b[0m.buffer + \u001b[96minput\u001b[0m \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m \u001b[31m❱ \u001b[0m 322 \u001b[2m│ │ \u001b[0m(result, consumed) = \u001b[96mself\u001b[0m._buffer_decode(data, \u001b[96mself\u001b[0m.errors, final) \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m \u001b[2m 323 \u001b[0m\u001b[2m│ │ \u001b[0m\u001b[2m# keep undecoded input until the next call\u001b[0m \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m \u001b[2m 324 \u001b[0m\u001b[2m│ │ \u001b[0m\u001b[96mself\u001b[0m.buffer = data[consumed:] \u001b[31m│\u001b[0m\n", "\u001b[31m│\u001b[0m \u001b[2m 325 
\u001b[0m\u001b[2m│ │ \u001b[0m\u001b[94mreturn\u001b[0m result \u001b[31m│\u001b[0m\n", "\u001b[31m╰──────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n", "\u001b[1;91mUnicodeDecodeError: \u001b[0m\u001b[32m'utf-8'\u001b[0m codec can't decode byte \u001b[1;36m0xca\u001b[0m in position \u001b[1;36m0\u001b[0m: invalid continuation byte\n", "\n", "\u001b[3mThe above exception was the direct cause of the following exception:\u001b[0m\n", "\n", "\u001b[31m╭─\u001b[0m\u001b[31m──────────────────────────────\u001b[0m\u001b[31m \u001b[0m\u001b[1;31mTraceback \u001b[0m\u001b[1;2;31m(most recent call last)\u001b[0m\u001b[31m \u001b[0m\u001b[31m───────────────────────────────\u001b[0m\u001b[31m─╮\u001b[0m\n", "\u001b[31m│\u001b[0m in \u001b[92m