mirror of https://github.com/hwchase17/langchain
You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
478 lines
12 KiB
Plaintext
478 lines
12 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "213a38a2",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Source Code\n",
|
|
"\n",
|
|
"This notebook covers how to load source code files using a special approach with language parsing: each top-level function and class in the code is loaded into separate documents. Any remaining code top-level code outside the already loaded functions and classes will be loaded into a separate document.\n",
|
|
"\n",
|
|
"This approach can potentially improve the accuracy of QA models over source code.\n",
|
|
"\n",
|
|
"The supported languages for code parsing are:\n",
|
|
"\n",
|
|
"- C (*)\n",
|
|
"- C++ (*)\n",
|
|
"- C# (*)\n",
|
|
"- COBOL\n",
|
|
"- Go (*)\n",
|
|
"- Java (*)\n",
|
|
"- JavaScript (requires package `esprima`)\n",
|
|
"- Kotlin (*)\n",
|
|
"- Lua (*)\n",
|
|
"- Perl (*)\n",
|
|
"- Python\n",
|
|
"- Ruby (*)\n",
|
|
"- Rust (*)\n",
|
|
"- Scala (*)\n",
|
|
"- TypeScript (*)\n",
|
|
"\n",
|
|
"Items marked with (*) require the packages `tree_sitter` and `tree_sitter_languages`.\n",
|
|
"It is straightforward to add support for additional languages using `tree_sitter`,\n",
|
|
"although this currently requires modifying LangChain.\n",
|
|
"\n",
|
|
"The language used for parsing can be configured, along with the minimum number of\n",
|
|
"lines required to activate the splitting based on syntax.\n",
|
|
"\n",
|
|
"If a language is not explicitly specified, `LanguageParser` will infer one from\n",
|
|
"filename extensions, if present."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "7fa47b2e",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%pip install -qU esprima esprima tree_sitter tree_sitter_languages"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"id": "beb55c2f",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import warnings\n",
|
|
"\n",
|
|
"warnings.filterwarnings(\"ignore\")\n",
|
|
"from pprint import pprint\n",
|
|
"\n",
|
|
"from langchain_community.document_loaders.generic import GenericLoader\n",
|
|
"from langchain_community.document_loaders.parsers import LanguageParser\n",
|
|
"from langchain_text_splitters import Language"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"id": "64056e07",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"loader = GenericLoader.from_filesystem(\n",
|
|
" \"./example_data/source_code\",\n",
|
|
" glob=\"*\",\n",
|
|
" suffixes=[\".py\", \".js\"],\n",
|
|
" parser=LanguageParser(),\n",
|
|
")\n",
|
|
"docs = loader.load()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"id": "8af79bd7",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"6"
|
|
]
|
|
},
|
|
"execution_count": 3,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"len(docs)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"id": "85edf3fc",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"{'content_type': 'functions_classes',\n",
|
|
" 'language': <Language.PYTHON: 'python'>,\n",
|
|
" 'source': 'example_data/source_code/example.py'}\n",
|
|
"{'content_type': 'functions_classes',\n",
|
|
" 'language': <Language.PYTHON: 'python'>,\n",
|
|
" 'source': 'example_data/source_code/example.py'}\n",
|
|
"{'content_type': 'simplified_code',\n",
|
|
" 'language': <Language.PYTHON: 'python'>,\n",
|
|
" 'source': 'example_data/source_code/example.py'}\n",
|
|
"{'content_type': 'functions_classes',\n",
|
|
" 'language': <Language.JS: 'js'>,\n",
|
|
" 'source': 'example_data/source_code/example.js'}\n",
|
|
"{'content_type': 'functions_classes',\n",
|
|
" 'language': <Language.JS: 'js'>,\n",
|
|
" 'source': 'example_data/source_code/example.js'}\n",
|
|
"{'content_type': 'simplified_code',\n",
|
|
" 'language': <Language.JS: 'js'>,\n",
|
|
" 'source': 'example_data/source_code/example.js'}\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"for document in docs:\n",
|
|
" pprint(document.metadata)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"id": "f44e3e37",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"class MyClass:\n",
|
|
" def __init__(self, name):\n",
|
|
" self.name = name\n",
|
|
"\n",
|
|
" def greet(self):\n",
|
|
" print(f\"Hello, {self.name}!\")\n",
|
|
"\n",
|
|
"--8<--\n",
|
|
"\n",
|
|
"def main():\n",
|
|
" name = input(\"Enter your name: \")\n",
|
|
" obj = MyClass(name)\n",
|
|
" obj.greet()\n",
|
|
"\n",
|
|
"--8<--\n",
|
|
"\n",
|
|
"# Code for: class MyClass:\n",
|
|
"\n",
|
|
"\n",
|
|
"# Code for: def main():\n",
|
|
"\n",
|
|
"\n",
|
|
"if __name__ == \"__main__\":\n",
|
|
" main()\n",
|
|
"\n",
|
|
"--8<--\n",
|
|
"\n",
|
|
"class MyClass {\n",
|
|
" constructor(name) {\n",
|
|
" this.name = name;\n",
|
|
" }\n",
|
|
"\n",
|
|
" greet() {\n",
|
|
" console.log(`Hello, ${this.name}!`);\n",
|
|
" }\n",
|
|
"}\n",
|
|
"\n",
|
|
"--8<--\n",
|
|
"\n",
|
|
"function main() {\n",
|
|
" const name = prompt(\"Enter your name:\");\n",
|
|
" const obj = new MyClass(name);\n",
|
|
" obj.greet();\n",
|
|
"}\n",
|
|
"\n",
|
|
"--8<--\n",
|
|
"\n",
|
|
"// Code for: class MyClass {\n",
|
|
"\n",
|
|
"// Code for: function main() {\n",
|
|
"\n",
|
|
"main();\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"print(\"\\n\\n--8<--\\n\\n\".join([document.page_content for document in docs]))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "69aad0ed",
|
|
"metadata": {},
|
|
"source": [
|
|
"The parser can be disabled for small files. \n",
|
|
"\n",
|
|
"The parameter `parser_threshold` indicates the minimum number of lines that the source code file must have to be segmented using the parser."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"id": "ae024794",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"loader = GenericLoader.from_filesystem(\n",
|
|
" \"./example_data/source_code\",\n",
|
|
" glob=\"*\",\n",
|
|
" suffixes=[\".py\"],\n",
|
|
" parser=LanguageParser(language=Language.PYTHON, parser_threshold=1000),\n",
|
|
")\n",
|
|
"docs = loader.load()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"id": "5d3b372a",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"1"
|
|
]
|
|
},
|
|
"execution_count": 9,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"len(docs)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"id": "89e546ad",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"class MyClass:\n",
|
|
" def __init__(self, name):\n",
|
|
" self.name = name\n",
|
|
"\n",
|
|
" def greet(self):\n",
|
|
" print(f\"Hello, {self.name}!\")\n",
|
|
"\n",
|
|
"\n",
|
|
"def main():\n",
|
|
" name = input(\"Enter your name: \")\n",
|
|
" obj = MyClass(name)\n",
|
|
" obj.greet()\n",
|
|
"\n",
|
|
"\n",
|
|
"if __name__ == \"__main__\":\n",
|
|
" main()\n",
|
|
"\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"print(docs[0].page_content)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "c9c71e61",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Splitting\n",
|
|
"\n",
|
|
"Additional splitting could be needed for those functions, classes, or scripts that are too big."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"id": "adbaa79f",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"loader = GenericLoader.from_filesystem(\n",
|
|
" \"./example_data/source_code\",\n",
|
|
" glob=\"*\",\n",
|
|
" suffixes=[\".js\"],\n",
|
|
" parser=LanguageParser(language=Language.JS),\n",
|
|
")\n",
|
|
"docs = loader.load()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 12,
|
|
"id": "c44c0d3f",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from langchain_text_splitters import (\n",
|
|
" Language,\n",
|
|
" RecursiveCharacterTextSplitter,\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 13,
|
|
"id": "b1e0053d",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"js_splitter = RecursiveCharacterTextSplitter.from_language(\n",
|
|
" language=Language.JS, chunk_size=60, chunk_overlap=0\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 14,
|
|
"id": "7dbe6188",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"result = js_splitter.split_documents(docs)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 15,
|
|
"id": "8a80d089",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"7"
|
|
]
|
|
},
|
|
"execution_count": 15,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"len(result)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 16,
|
|
"id": "000a6011",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"class MyClass {\n",
|
|
" constructor(name) {\n",
|
|
" this.name = name;\n",
|
|
"\n",
|
|
"--8<--\n",
|
|
"\n",
|
|
"}\n",
|
|
"\n",
|
|
"--8<--\n",
|
|
"\n",
|
|
"greet() {\n",
|
|
" console.log(`Hello, ${this.name}!`);\n",
|
|
" }\n",
|
|
"}\n",
|
|
"\n",
|
|
"--8<--\n",
|
|
"\n",
|
|
"function main() {\n",
|
|
" const name = prompt(\"Enter your name:\");\n",
|
|
"\n",
|
|
"--8<--\n",
|
|
"\n",
|
|
"const obj = new MyClass(name);\n",
|
|
" obj.greet();\n",
|
|
"}\n",
|
|
"\n",
|
|
"--8<--\n",
|
|
"\n",
|
|
"// Code for: class MyClass {\n",
|
|
"\n",
|
|
"// Code for: function main() {\n",
|
|
"\n",
|
|
"--8<--\n",
|
|
"\n",
|
|
"main();\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"print(\"\\n\\n--8<--\\n\\n\".join([document.page_content for document in result]))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "7fb27b941602401d91542211134fc71a",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Adding Languages using Tree-sitter Template\n",
|
|
"\n",
|
|
"Expanding language support using the Tree-Sitter template involves a few essential steps:\n",
|
|
"\n",
|
|
"1. **Creating a New Language File**:\n",
|
|
" - Begin by creating a new file in the designated directory (langchain/libs/community/langchain_community/document_loaders/parsers/language).\n",
|
|
" - Model this file based on the structure and parsing logic of existing language files like **`cpp.py`**.\n",
|
|
" - You will also need to create a file in the langchain directory (langchain/libs/langchain/langchain/document_loaders/parsers/language).\n",
|
|
"2. **Parsing Language Specifics**:\n",
|
|
" - Mimic the structure used in the **`cpp.py`** file, adapting it to suit the language you are incorporating.\n",
|
|
" - The primary alteration involves adjusting the chunk query array to suit the syntax and structure of the language you are parsing.\n",
|
|
"3. **Testing the Language Parser**:\n",
|
|
" - For thorough validation, generate a test file specific to the new language. Create **`test_language.py`** in the designated directory(langchain/libs/community/tests/unit_tests/document_loaders/parsers/language).\n",
|
|
" - Follow the example set by **`test_cpp.py`** to establish fundamental tests for the parsed elements in the new language.\n",
|
|
"4. **Integration into the Parser and Text Splitter**:\n",
|
|
" - Incorporate your new language within the **`language_parser.py`** file. Ensure to update LANGUAGE_EXTENSIONS and LANGUAGE_SEGMENTERS along with the docstring for LanguageParser to recognize and handle the added language.\n",
|
|
" - Also, confirm that your language is included in **`text_splitter.py`** in class Language for proper parsing.\n",
|
|
"\n",
|
|
"By following these steps and ensuring comprehensive testing and integration, you'll successfully extend language support using the Tree-Sitter template.\n",
|
|
"\n",
|
|
"Best of luck!"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.11.5"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|