Code understanding use case (#8801)

Code understanding docs --------- Co-authored-by: Manuel Soria <manuel.soria@greyscaleai.com> Co-authored-by: Lance Martin <lance@langchain.dev>
1 year ago · 31cfc00845
parent f7ae183f40
commit 31cfc00845
6 changed files with 354 additions and 0 deletions
--- a/docs/docs_skeleton/static/img/code_retrieval.png
+++ b/docs/docs_skeleton/static/img/code_retrieval.png
--- a/docs/docs_skeleton/static/img/code_understanding.png
+++ b/docs/docs_skeleton/static/img/code_understanding.png
--- a/docs/extras/use_cases/code_understanding.ipynb
+++ b/docs/extras/use_cases/code_understanding.ipynb
@ -0,0 +1,354 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Code Understanding\n",
+    "\n",
+    "[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/code_understanding.ipynb)\n",
+    "\n",
+    "## Use case\n",
+    "\n",
+    "Source code analysis is one of the most popular LLM applications (e.g., [GitHub Co-Pilot](https://github.com/features/copilot), [Code Interpreter](https://chat.openai.com/auth/login?next=%2F%3Fmodel%3Dgpt-4-code-interpreter), [Codium](https://www.codium.ai/), and [Codeium](https://codeium.com/about)) for use-cases such as:\n",
+    "\n",
+    "- Q&A over the code base to understand how it works\n",
+    "- Using LLMs for suggesting refactors or improvements\n",
+    "- Using LLMs for documenting the code\n",
+    "\n",
+    "![Image description](/img/code_understanding.png)\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "The pipeline for QA over code follows [the steps we do for document question answering](/docs/extras/use_cases/question_answering), with some differences:\n",
+    "\n",
+    "In particular, we can employ a [splitting strategy](https://python.langchain.com/docs/integrations/document_loaders/source_code) that does a few things:\n",
+    "\n",
+    "* Keeps each top-level function and class in the code is loaded into separate documents. \n",
+    "* Puts remaining into a seperate document.\n",
+    "* Retains metadata about where each split comes from\n",
+    "\n",
+    "## Quickstart"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 29,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install openai tiktoken chromadb langchain\n",
+    "\n",
+    "# Set env var OPENAI_API_KEY or load from a .env file\n",
+    "# import dotenv\n",
+    "\n",
+    "# dotenv.load_env()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We'lll follow the structure of [this notebook](https://github.com/cristobalcl/LearningLangChain/blob/master/notebooks/04%20-%20QA%20with%20code.ipynb) and employ [context aware code splitting](https://python.langchain.com/docs/integrations/document_loaders/source_code)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Loading\n",
+    "\n",
+    "\n",
+    "We will upload all python project files using the `langchain.document_loaders.TextLoader`.\n",
+    "\n",
+    "The following script iterates over the files in the LangChain repository and loads every `.py` file (a.k.a. **documents**):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from git import Repo\n",
+    "from langchain.text_splitter import Language\n",
+    "from langchain.document_loaders.generic import GenericLoader\n",
+    "from langchain.document_loaders.parsers import LanguageParser"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Clone\n",
+    "repo_path = \"/Users/rlm/Desktop/test_repo\"\n",
+    "repo = Repo.clone_from(\"https://github.com/hwchase17/langchain\", to_path=repo_path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We load the py code using [`LanguageParser`](https://python.langchain.com/docs/integrations/document_loaders/source_code), which will:\n",
+    "\n",
+    "* Keep top-level functions and classes together (into a single document)\n",
+    "* Put remaining code into a seperate document\n",
+    "* Retains metadata about where each split comes from"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "1293"
+      ]
+     },
+     "execution_count": 14,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Load\n",
+    "loader = GenericLoader.from_filesystem(\n",
+    "    repo_path+\"/libs/langchain/langchain\",\n",
+    "    glob=\"**/*\",\n",
+    "    suffixes=[\".py\"],\n",
+    "    parser=LanguageParser(language=Language.PYTHON, parser_threshold=500)\n",
+    ")\n",
+    "documents = loader.load()\n",
+    "len(documents)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Splittng\n",
+    "\n",
+    "Split the `Document` into chunks for embedding and vector storage.\n",
+    "\n",
+    "We can use `RecursiveCharacterTextSplitter` w/ `language` specified."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "3748"
+      ]
+     },
+     "execution_count": 17,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+    "python_splitter = RecursiveCharacterTextSplitter.from_language(language=Language.PYTHON, \n",
+    "                                                               chunk_size=2000, \n",
+    "                                                               chunk_overlap=200)\n",
+    "texts = python_splitter.split_documents(documents)\n",
+    "len(texts)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### RetrievalQA\n",
+    "\n",
+    "We need to store the documents in a way we can semantically search for their content. \n",
+    "\n",
+    "The most common approach is to embed the contents of each document then store the embedding and document in a vector store. \n",
+    "\n",
+    "When setting up the vectorstore retriever:\n",
+    "\n",
+    "* We test [max marginal relevance](/docs/extras/use_cases/question_answering) for retrieval\n",
+    "* And 8 documents returned\n",
+    "\n",
+    "#### Go deeper\n",
+    "\n",
+    "- Browse the > 40 vectorstores integrations [here](https://integrations.langchain.com/).\n",
+    "- See further documentation on vectorstores [here](/docs/modules/data_connection/vectorstores/).\n",
+    "- Browse the > 30 text embedding integrations [here](https://integrations.langchain.com/).\n",
+    "- See further documentation on embedding models [here](/docs/modules/data_connection/text_embedding/)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.vectorstores import Chroma\n",
+    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
+    "db = Chroma.from_documents(texts, OpenAIEmbeddings(disallowed_special=()))\n",
+    "retriever = db.as_retriever(\n",
+    "    search_type=\"mmr\",  # Also test \"similarity\"\n",
+    "    search_kwargs={\"k\": 8},\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Chat\n",
+    "\n",
+    "Test chat, just as we do for [chatbots](/docs/extras/use_cases/chatbots).\n",
+    "\n",
+    "#### Go deeper\n",
+    "\n",
+    "- Browse the > 55 LLM and chat model integrations [here](https://integrations.langchain.com/).\n",
+    "- See further documentation on LLMs and chat models [here](/docs/modules/model_io/models/).\n",
+    "- Use local LLMS: The popularity of [PrivateGPT](https://github.com/imartinez/privateGPT) and [GPT4All](https://github.com/nomic-ai/gpt4all) underscore the importance of running LLMs locally."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.chat_models import ChatOpenAI\n",
+    "from langchain.memory import ConversationSummaryMemory\n",
+    "from langchain.chains import ConversationalRetrievalChain\n",
+    "llm = ChatOpenAI(model_name=\"gpt-4\") \n",
+    "memory = ConversationSummaryMemory(llm=llm,memory_key=\"chat_history\",return_messages=True)\n",
+    "qa = ConversationalRetrievalChain.from_llm(llm, retriever=retriever, memory=memory)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 30,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'To load a source code as documents for a QA over code, you can use the `CodeLoader` class. This class allows you to load source code files and split them into classes and functions.\\n\\nHere is an example of how to use the `CodeLoader` class:\\n\\n```python\\nfrom langchain.document_loaders.code import CodeLoader\\n\\n# Specify the path to the source code file\\ncode_file_path = \"path/to/code/file.py\"\\n\\n# Create an instance of the CodeLoader class\\ncode_loader = CodeLoader(code_file_path)\\n\\n# Load the code as documents\\ndocuments = code_loader.load()\\n\\n# Iterate over the documents\\nfor document in documents:\\n    # Access the class or function name\\n    name = document.metadata[\"name\"]\\n    \\n    # Access the code content\\n    code = document.page_content\\n    \\n    # Process the code as needed\\n    # ...\\n```\\n\\nIn the example above, `code_file_path` should be replaced with the actual path to your source code file. The `load()` method of the `CodeLoader` class will return a list of `Document` objects, where each document represents a class or function in the source code. You can access the class or function name using the `metadata[\"name\"]` attribute, and the code content using the `page_content` attribute of each `Document` object.\\n\\nYou can then process the code as needed for your QA task.'"
+      ]
+     },
+     "execution_count": 30,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "question = \"How can I load a source code as documents, for a QA over code, spliting the code in classes and functions?\"\n",
+    "result = qa(question)\n",
+    "result['answer']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 33,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "-> **Question**: What is the class hierarchy? \n",
+      "\n",
+      "**Answer**: The class hierarchy in object-oriented programming is the structure that forms when classes are derived from other classes. The derived class is a subclass of the base class also known as the superclass. This hierarchy is formed based on the concept of inheritance in object-oriented programming where a subclass inherits the properties and functionalities of the superclass. \n",
+      "\n",
+      "In the given context, we have the following examples of class hierarchies:\n",
+      "\n",
+      "1. `BaseCallbackHandler --> <name>CallbackHandler` means `BaseCallbackHandler` is a base class and `<name>CallbackHandler` (like `AimCallbackHandler`, `ArgillaCallbackHandler` etc.) are derived classes that inherit from `BaseCallbackHandler`.\n",
+      "\n",
+      "2. `BaseLoader --> <name>Loader` means `BaseLoader` is a base class and `<name>Loader` (like `TextLoader`, `UnstructuredFileLoader` etc.) are derived classes that inherit from `BaseLoader`.\n",
+      "\n",
+      "3. `ToolMetaclass --> BaseTool --> <name>Tool` means `ToolMetaclass` is a base class, `BaseTool` is a derived class that inherits from `ToolMetaclass`, and `<name>Tool` (like `AIPluginTool`, `BaseGraphQLTool` etc.) are further derived classes that inherit from `BaseTool`. \n",
+      "\n",
+      "-> **Question**: What classes are derived from the Chain class? \n",
+      "\n",
+      "**Answer**: The classes that are derived from the Chain class are:\n",
+      "\n",
+      "1. LLMSummarizationCheckerChain\n",
+      "2. MapReduceChain\n",
+      "3. OpenAIModerationChain\n",
+      "4. NatBotChain\n",
+      "5. QAGenerationChain\n",
+      "6. QAWithSourcesChain\n",
+      "7. RetrievalQAWithSourcesChain\n",
+      "8. VectorDBQAWithSourcesChain\n",
+      "9. RetrievalQA\n",
+      "10. VectorDBQA\n",
+      "11. LLMRouterChain\n",
+      "12. MultiPromptChain\n",
+      "13. MultiRetrievalQAChain\n",
+      "14. MultiRouteChain\n",
+      "15. RouterChain\n",
+      "16. SequentialChain\n",
+      "17. SimpleSequentialChain\n",
+      "18. TransformChain\n",
+      "19. BaseConversationalRetrievalChain\n",
+      "20. ConstitutionalChain \n",
+      "\n",
+      "-> **Question**: What one improvement do you propose in code in relation to the class herarchy for the Chain class? \n",
+      "\n",
+      "**Answer**: As an AI model, I don't have personal opinions. However, one suggestion could be to improve the documentation of the Chain class hierarchy. The current comments and docstrings provide some details but it could be helpful to include more explicit explanations about the hierarchy, roles of each subclass, and their relationships with one another. Also, incorporating UML diagrams or other visuals could help developers better understand the structure and interactions of the classes. \n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "questions = [\n",
+    "    \"What is the class hierarchy?\",\n",
+    "    \"What classes are derived from the Chain class?\",\n",
+    "    \"What one improvement do you propose in code in relation to the class herarchy for the Chain class?\",\n",
+    "]\n",
+    "\n",
+    "for question in questions:\n",
+    "    result = qa(question)\n",
+    "    print(f\"-> **Question**: {question} \\n\")\n",
+    "    print(f\"**Answer**: {result['answer']} \\n\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The can look at the [LangSmith trace](https://smith.langchain.com/public/2b23045f-4e49-4d2d-8980-dec85259af36/r) to see what is happening under the hood:\n",
+    "\n",
+    "* In particular, the code well structured and kept together in the retrival output\n",
+    "* The retrieved code and chat history are passed to the LLM for answer distillation\n",
+    "\n",
+    "![Image description](/img/code_retrieval.png)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.16"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
--- a/docs/extras/use_cases/question_answering/how_to/code/code-analysis-deeplake.ipynb
+++ b/docs/extras/use_cases/question_answering/how_to/code/code-analysis-deeplake.ipynb
--- a/docs/extras/use_cases/question_answering/how_to/code/index.mdx
+++ b/docs/extras/use_cases/question_answering/how_to/code/index.mdx
--- a/docs/extras/use_cases/question_answering/how_to/code/twitter-the-algorithm-analysis-deeplake.ipynb
+++ b/docs/extras/use_cases/question_answering/how_to/code/twitter-the-algorithm-analysis-deeplake.ipynb