langchain/docs/docs/integrations/document_loaders/source_code.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "213a38a2",
   "metadata": {},
   "source": [
    "# Source Code\n",
    "\n",
    "This notebook covers how to load source code files using language parsing: each top-level function and class in the code is loaded into a separate document. Any remaining top-level code outside the already loaded functions and classes is loaded into its own document.\n",
    "\n",
    "This approach can potentially improve the accuracy of QA models over source code.\n",
    "\n",
    "The supported languages for code parsing are:\n",
    "\n",
    "- C (*)\n",
    "- C++ (*)\n",
    "- C# (*)\n",
    "- COBOL\n",
    "- Go (*)\n",
    "- Java (*)\n",
    "- JavaScript (requires package `esprima`)\n",
    "- Kotlin (*)\n",
    "- Lua (*)\n",
    "- Perl (*)\n",
    "- Python\n",
    "- Ruby (*)\n",
    "- Rust (*)\n",
    "- Scala (*)\n",
    "- TypeScript (*)\n",
    "\n",
    "Items marked with (*) require the packages `tree_sitter` and `tree_sitter_languages`.\n",
    "It is straightforward to add support for additional languages using `tree_sitter`,\n",
    "although this currently requires modifying LangChain.\n",
    "\n",
    "The language used for parsing can be configured, along with the minimum number of\n",
    "lines required to activate the splitting based on syntax.\n",
    "\n",
    "If a language is not explicitly specified, `LanguageParser` will infer one from\n",
    "filename extensions, if present."
   ]
  },
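  {
   "cell_type": "markdown",
   "id": "e5a1b2c3",
   "metadata": {},
   "source": [
    "The splitting strategy described above can be sketched with Python's standard `ast` module. This is a minimal illustration, not the actual `LanguageParser` implementation: the function name `split_top_level` is hypothetical, and the sketch assumes Python source without decorators on top-level definitions.\n",
    "\n",
    "```python\n",
    "import ast\n",
    "\n",
    "\n",
    "def split_top_level(source: str) -> list[str]:\n",
    "    # Hypothetical helper: one chunk per top-level function or class,\n",
    "    # followed by the remaining code with placeholder comments.\n",
    "    tree = ast.parse(source)\n",
    "    lines = source.splitlines()\n",
    "    chunks = []\n",
    "    simplified = list(lines)\n",
    "    dropped = set()\n",
    "    for node in tree.body:\n",
    "        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):\n",
    "            start, end = node.lineno - 1, node.end_lineno\n",
    "            chunks.append(\"\\n\".join(lines[start:end]))\n",
    "            # Keep a one-line marker where the definition used to be.\n",
    "            simplified[start] = f\"# Code for: {lines[start]}\"\n",
    "            dropped.update(range(start + 1, end))\n",
    "    remainder = \"\\n\".join(\n",
    "        line for i, line in enumerate(simplified) if i not in dropped\n",
    "    )\n",
    "    return chunks + [remainder]\n",
    "```\n",
    "\n",
    "Called on a file's text, this returns one string per top-level definition, in source order, plus a final string holding the remaining top-level code with a `# Code for:` marker in place of each definition."
   ]
  },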
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7fa47b2e",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -qU esprima tree_sitter tree_sitter_languages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "beb55c2f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "from pprint import pprint\n",
    "\n",
    "from langchain_community.document_loaders.generic import GenericLoader\n",
    "from langchain_community.document_loaders.parsers import LanguageParser\n",
    "from langchain_text_splitters import Language"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "64056e07",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = GenericLoader.from_filesystem(\n",
    "    \"./example_data/source_code\",\n",
    "    glob=\"*\",\n",
    "    suffixes=[\".py\", \".js\"],\n",
    "    parser=LanguageParser(),\n",
    ")\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "8af79bd7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "85edf3fc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'content_type': 'functions_classes',\n",
      " 'language': <Language.PYTHON: 'python'>,\n",
      " 'source': 'example_data/source_code/example.py'}\n",
      "{'content_type': 'functions_classes',\n",
      " 'language': <Language.PYTHON: 'python'>,\n",
      " 'source': 'example_data/source_code/example.py'}\n",
      "{'content_type': 'simplified_code',\n",
      " 'language': <Language.PYTHON: 'python'>,\n",
      " 'source': 'example_data/source_code/example.py'}\n",
      "{'content_type': 'functions_classes',\n",
      " 'language': <Language.JS: 'js'>,\n",
      " 'source': 'example_data/source_code/example.js'}\n",
      "{'content_type': 'functions_classes',\n",
      " 'language': <Language.JS: 'js'>,\n",
      " 'source': 'example_data/source_code/example.js'}\n",
      "{'content_type': 'simplified_code',\n",
      " 'language': <Language.JS: 'js'>,\n",
      " 'source': 'example_data/source_code/example.js'}\n"
     ]
    }
   ],
   "source": [
    "for document in docs:\n",
    "    pprint(document.metadata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "f44e3e37",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "class MyClass:\n",
      "    def __init__(self, name):\n",
      "        self.name = name\n",
      "\n",
      "    def greet(self):\n",
      "        print(f\"Hello, {self.name}!\")\n",
      "\n",
      "--8<--\n",
      "\n",
      "def main():\n",
      "    name = input(\"Enter your name: \")\n",
      "    obj = MyClass(name)\n",
      "    obj.greet()\n",
      "\n",
      "--8<--\n",
      "\n",
      "# Code for: class MyClass:\n",
      "\n",
      "\n",
      "# Code for: def main():\n",
      "\n",
      "\n",
      "if __name__ == \"__main__\":\n",
      "    main()\n",
      "\n",
      "--8<--\n",
      "\n",
      "class MyClass {\n",
      "  constructor(name) {\n",
      "    this.name = name;\n",
      "  }\n",
      "\n",
      "  greet() {\n",
      "    console.log(`Hello, ${this.name}!`);\n",
      "  }\n",
      "}\n",
      "\n",
      "--8<--\n",
      "\n",
      "function main() {\n",
      "  const name = prompt(\"Enter your name:\");\n",
      "  const obj = new MyClass(name);\n",
      "  obj.greet();\n",
      "}\n",
      "\n",
      "--8<--\n",
      "\n",
      "// Code for: class MyClass {\n",
      "\n",
      "// Code for: function main() {\n",
      "\n",
      "main();\n"
     ]
    }
   ],
   "source": [
    "print(\"\\n\\n--8<--\\n\\n\".join([document.page_content for document in docs]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69aad0ed",
   "metadata": {},
   "source": [
    "Syntax-based splitting can be skipped for small files.\n",
    "\n",
    "The `parser_threshold` parameter sets the minimum number of lines a source code file must have before it is segmented by the parser. Shorter files are loaded as a single document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "ae024794",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = GenericLoader.from_filesystem(\n",
    "    \"./example_data/source_code\",\n",
    "    glob=\"*\",\n",
    "    suffixes=[\".py\"],\n",
    "    parser=LanguageParser(language=Language.PYTHON, parser_threshold=1000),\n",
    ")\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "5d3b372a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "89e546ad",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "class MyClass:\n",
      "    def __init__(self, name):\n",
      "        self.name = name\n",
      "\n",
      "    def greet(self):\n",
      "        print(f\"Hello, {self.name}!\")\n",
      "\n",
      "\n",
      "def main():\n",
      "    name = input(\"Enter your name: \")\n",
      "    obj = MyClass(name)\n",
      "    obj.greet()\n",
      "\n",
      "\n",
      "if __name__ == \"__main__\":\n",
      "    main()\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(docs[0].page_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9c71e61",
   "metadata": {},
   "source": [
    "## Splitting\n",
    "\n",
    "Functions, classes, or scripts that are still too large after parsing may need additional splitting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "adbaa79f",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = GenericLoader.from_filesystem(\n",
    "    \"./example_data/source_code\",\n",
    "    glob=\"*\",\n",
    "    suffixes=[\".js\"],\n",
    "    parser=LanguageParser(language=Language.JS),\n",
    ")\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "c44c0d3f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_text_splitters import (\n",
    "    Language,\n",
    "    RecursiveCharacterTextSplitter,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "b1e0053d",
   "metadata": {},
   "outputs": [],
   "source": [
    "js_splitter = RecursiveCharacterTextSplitter.from_language(\n",
    "    language=Language.JS, chunk_size=60, chunk_overlap=0\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "7dbe6188",
   "metadata": {},
   "outputs": [],
   "source": [
    "result = js_splitter.split_documents(docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "8a80d089",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "000a6011",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "class MyClass {\n",
      "  constructor(name) {\n",
      "    this.name = name;\n",
      "\n",
      "--8<--\n",
      "\n",
      "}\n",
      "\n",
      "--8<--\n",
      "\n",
      "greet() {\n",
      "    console.log(`Hello, ${this.name}!`);\n",
      "  }\n",
      "}\n",
      "\n",
      "--8<--\n",
      "\n",
      "function main() {\n",
      "  const name = prompt(\"Enter your name:\");\n",
      "\n",
      "--8<--\n",
      "\n",
      "const obj = new MyClass(name);\n",
      "  obj.greet();\n",
      "}\n",
      "\n",
      "--8<--\n",
      "\n",
      "// Code for: class MyClass {\n",
      "\n",
      "// Code for: function main() {\n",
      "\n",
      "--8<--\n",
      "\n",
      "main();\n"
     ]
    }
   ],
   "source": [
    "print(\"\\n\\n--8<--\\n\\n\".join([document.page_content for document in result]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7fb27b941602401d91542211134fc71a",
   "metadata": {},
   "source": [
    "## Adding Languages using Tree-sitter Template\n",
    "\n",
    "Adding a language with the Tree-sitter template involves four steps:\n",
    "\n",
    "1. **Creating a New Language File**:\n",
    "    - Create a new file in `langchain/libs/community/langchain_community/document_loaders/parsers/language`.\n",
    "    - Model this file on the structure and parsing logic of an existing language file such as **`cpp.py`**.\n",
    "    - You will also need to create a corresponding file in the `langchain` package directory (`langchain/libs/langchain/langchain/document_loaders/parsers/language`).\n",
    "2. **Parsing Language Specifics**:\n",
    "    - Follow the structure of **`cpp.py`**, adapting it to the language you are adding.\n",
    "    - The main change is the chunk query array, which must match the syntax and structure of the new language.\n",
    "3. **Testing the Language Parser**:\n",
    "    - Create a test file for the new language, named **`test_language.py`**, in `langchain/libs/community/tests/unit_tests/document_loaders/parsers/language`.\n",
    "    - Follow the example of **`test_cpp.py`** to write basic tests for the elements parsed from the new language.\n",
    "4. **Integration into the Parser and Text Splitter**:\n",
    "    - Register the new language in **`language_parser.py`**: update `LANGUAGE_EXTENSIONS` and `LANGUAGE_SEGMENTERS`, along with the `LanguageParser` docstring, so that the parser recognizes and handles it.\n",
    "    - Also confirm that the language is included in the `Language` class in **`text_splitter.py`** so that it is parsed correctly.\n",
    "\n",
    "Following these steps, with thorough testing and integration, extends language support using the Tree-sitter template.\n",
    "\n",
    "Best of luck!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}