{
"cells": [
{
"cell_type": "raw",
"metadata": {},
"source": [
"---\n",
"title: Code understanding\n",
"sidebar_class_name: hidden\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/use_cases/code_understanding.ipynb)\n",
"\n",
"## Use case\n",
"\n",
"Source code analysis is one of the most popular LLM applications (e.g., [GitHub Copilot](https://github.com/features/copilot), [Code Interpreter](https://chat.openai.com/auth/login?next=%2F%3Fmodel%3Dgpt-4-code-interpreter), [Codium](https://www.codium.ai/), and [Codeium](https://codeium.com/about)) for use-cases such as:\n",
"\n",
"- Q&A over the code base to understand how it works\n",
"- Using LLMs for suggesting refactors or improvements\n",
"- Using LLMs for documenting the code\n",
"\n",
"![Image description](../../static/img/code_understanding.png)\n",
"\n",
"## Overview\n",
"\n",
"The pipeline for QA over code follows [the steps we do for document question answering](/docs/use_cases/question_answering), with some differences:\n",
"\n",
"In particular, we can employ a [splitting strategy](/docs/integrations/document_loaders/source_code) that does a few things:\n",
"\n",
"* Keeps each top-level function and class in the code is loaded into separate documents. \n",
"* Puts remaining into a separate document.\n",
"* Retains metadata about where each split comes from\n",
"\n",
"## Quickstart"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet langchain-openai tiktoken chromadb langchain git\n",
"\n",
"# Set env var OPENAI_API_KEY or load from a .env file\n",
"# import dotenv\n",
"\n",
"# dotenv.load_dotenv()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll follow the structure of [this notebook](https://github.com/cristobalcl/LearningLangChain/blob/master/notebooks/04%20-%20QA%20with%20code.ipynb) and employ [context aware code splitting](/docs/integrations/document_loaders/source_code)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading\n",
"\n",
"\n",
"We will upload all python project files using the `langchain_community.document_loaders.TextLoader`.\n",
"\n",
"The following script iterates over the files in the LangChain repository and loads every `.py` file (a.k.a. **documents**):"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from git import Repo\n",
"from langchain_community.document_loaders.generic import GenericLoader\n",
"from langchain_community.document_loaders.parsers import LanguageParser\n",
"from langchain_text_splitters import Language"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Clone\n",
"repo_path = \"/Users/jacoblee/Desktop/test_repo\"\n",
"repo = Repo.clone_from(\"https://github.com/langchain-ai/langchain\", to_path=repo_path)"
]
},
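{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `Repo.clone_from` raises an error if `repo_path` already exists. As a small convenience (a minimal sketch, not part of the original flow), you can reuse an existing clone instead:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Reuse an existing clone if present; otherwise clone fresh\n",
"if os.path.exists(repo_path):\n",
"    repo = Repo(repo_path)\n",
"else:\n",
"    repo = Repo.clone_from(\n",
"        \"https://github.com/langchain-ai/langchain\", to_path=repo_path\n",
"    )"
]
},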
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We load the py code using [`LanguageParser`](/docs/integrations/document_loaders/source_code), which will:\n",
"\n",
"* Keep top-level functions and classes together (into a single document)\n",
"* Put remaining code into a separate document\n",
"* Retains metadata about where each split comes from"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"295"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load\n",
"loader = GenericLoader.from_filesystem(\n",
" repo_path + \"/libs/core/langchain_core\",\n",
" glob=\"**/*\",\n",
" suffixes=[\".py\"],\n",
" exclude=[\"**/non-utf8-encoding.py\"],\n",
" parser=LanguageParser(language=Language.PYTHON, parser_threshold=500),\n",
")\n",
"documents = loader.load()\n",
"len(documents)"
]
},
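{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (exact keys and counts will vary with the repository state), we can inspect one loaded document to see the metadata the parser retains:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Each document records the source file it came from,\n",
"# plus any segmentation metadata added by the parser\n",
"print(documents[0].metadata)\n",
"print(documents[0].page_content[:300])"
]
},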
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Splitting\n",
"\n",
"Split the `Document` into chunks for embedding and vector storage.\n",
"\n",
"We can use `RecursiveCharacterTextSplitter` w/ `language` specified."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"898"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
"\n",
"python_splitter = RecursiveCharacterTextSplitter.from_language(\n",
" language=Language.PYTHON, chunk_size=2000, chunk_overlap=200\n",
")\n",
"texts = python_splitter.split_documents(documents)\n",
"len(texts)"
]
},
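{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each chunk inherits the metadata of the document it was split from, and chunk sizes should respect the budget we set above (a quick check; your numbers may differ):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Chunks should stay near the ~2000-character budget\n",
"# (long unsplittable spans can occasionally exceed it)\n",
"print(max(len(chunk.page_content) for chunk in texts))\n",
"print(texts[0].metadata)"
]
},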
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### RetrievalQA\n",
"\n",
"We need to store the documents in a way we can semantically search for their content. \n",
"\n",
"The most common approach is to embed the contents of each document then store the embedding and document in a vector store. \n",
"\n",
"When setting up the vectorstore retriever:\n",
"\n",
"* We test [max marginal relevance](/docs/use_cases/question_answering) for retrieval\n",
"* And 8 documents returned\n",
"\n",
"#### Go deeper\n",
"\n",
"- Browse the > 40 vectorstores integrations [here](https://integrations.langchain.com/).\n",
"- See further documentation on vectorstores [here](/docs/modules/data_connection/vectorstores/).\n",
"- Browse the > 30 text embedding integrations [here](https://integrations.langchain.com/).\n",
"- See further documentation on embedding models [here](/docs/modules/data_connection/text_embedding/)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.vectorstores import Chroma\n",
"from langchain_openai import OpenAIEmbeddings\n",
"\n",
"db = Chroma.from_documents(texts, OpenAIEmbeddings(disallowed_special=()))\n",
"retriever = db.as_retriever(\n",
" search_type=\"mmr\", # Also test \"similarity\"\n",
" search_kwargs={\"k\": 8},\n",
")"
]
},
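{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before wiring the retriever into a chain, we can query it directly to confirm it surfaces relevant code (your exact results will differ):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Retrievers are runnables, so they can be invoked directly with a query\n",
"docs = retriever.invoke(\"What is a RunnableBinding?\")\n",
"for doc in docs:\n",
"    print(doc.metadata[\"source\"])"
]
},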
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Chat\n",
"\n",
"Test chat, just as we do for [chatbots](/docs/use_cases/chatbots).\n",
"\n",
"#### Go deeper\n",
"\n",
"- Browse the > 55 LLM and chat model integrations [here](https://integrations.langchain.com/).\n",
"- See further documentation on LLMs and chat models [here](/docs/modules/model_io/).\n",
"- Use local LLMS: The popularity of [PrivateGPT](https://github.com/imartinez/privateGPT) and [GPT4All](https://github.com/nomic-ai/gpt4all) underscore the importance of running LLMs locally."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains import create_history_aware_retriever, create_retrieval_chain\n",
"from langchain.chains.combine_documents import create_stuff_documents_chain\n",
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4\")\n",
"\n",
"# First we need a prompt that we can pass into an LLM to generate this search query\n",
"\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\"placeholder\", \"{chat_history}\"),\n",
" (\"user\", \"{input}\"),\n",
" (\n",
" \"user\",\n",
" \"Given the above conversation, generate a search query to look up to get information relevant to the conversation\",\n",
" ),\n",
" ]\n",
")\n",
"\n",
"retriever_chain = create_history_aware_retriever(llm, retriever, prompt)\n",
"\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\n",
" \"system\",\n",
" \"Answer the user's questions based on the below context:\\n\\n{context}\",\n",
" ),\n",
" (\"placeholder\", \"{chat_history}\"),\n",
" (\"user\", \"{input}\"),\n",
" ]\n",
")\n",
"document_chain = create_stuff_documents_chain(llm, prompt)\n",
"\n",
"qa = create_retrieval_chain(retriever_chain, document_chain)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'A RunnableBinding is a class in the LangChain library that is used to bind arguments to a Runnable. This is useful when a runnable in a chain requires an argument that is not in the output of the previous runnable or included in the user input. It returns a new Runnable with the bound arguments and configuration. The bind method in the RunnableBinding class is used to perform this operation.'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"question = \"What is a RunnableBinding?\"\n",
"result = qa.invoke({\"input\": question})\n",
"result[\"answer\"]"
]
},
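{
"cell_type": "markdown",
"metadata": {},
"source": [
"The chain built by `create_retrieval_chain` also returns the retrieved documents under the `context` key, which is useful for checking which files grounded the answer:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect which source files were retrieved as context for the answer\n",
"for doc in result[\"context\"]:\n",
"    print(doc.metadata[\"source\"])"
]
},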
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-> **Question**: What classes are derived from the Runnable class? \n",
"\n",
"**Answer**: The classes derived from the `Runnable` class as mentioned in the context are: `RunnableLambda`, `RunnableLearnable`, `RunnableSerializable`, `RunnableWithFallbacks`. \n",
"\n",
"-> **Question**: What one improvement do you propose in code in relation to the class hierarchy for the Runnable class? \n",
"\n",
"**Answer**: One potential improvement could be the introduction of abstract base classes (ABCs) or interfaces for different types of Runnable classes. Currently, it seems like there are lots of different Runnable types, like RunnableLambda, RunnableParallel, etc., each with their own methods and attributes. By defining a common interface or ABC for all these classes, we can ensure consistency and better organize the codebase. It would also make it easier to add new types of Runnable classes in the future, as they would just need to implement the methods defined in the interface or ABC. \n",
"\n"
]
}
],
"source": [
"questions = [\n",
" \"What classes are derived from the Runnable class?\",\n",
" \"What one improvement do you propose in code in relation to the class hierarchy for the Runnable class?\",\n",
"]\n",
"\n",
"for question in questions:\n",
" result = qa.invoke({\"input\": question})\n",
" print(f\"-> **Question**: {question} \\n\")\n",
" print(f\"**Answer**: {result['answer']} \\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we can look at the [LangSmith trace](https://smith.langchain.com/public/616f6620-f49f-46c7-8f4b-dae847705c5d/r) to see what is happening under the hood:\n",
"\n",
"* In particular, the code well structured and kept together in the retrieval output\n",
"* The retrieved code and chat history are passed to the LLM for answer distillation\n",
"\n",
"![Image description](../../static/img/code_retrieval.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Open source LLMs\n",
"\n",
"We'll use LangChain's [Ollama integration](https://ollama.com/) to query a local OSS model.\n",
"\n",
"Check out the latest available models [here](https://ollama.com/library)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet langchain-community"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.chat_models.ollama import ChatOllama\n",
"\n",
"llm = ChatOllama(model=\"codellama\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's run it with a generic coding question to test its knowledge:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"You can use the `find` command with the `-mtime` option to find all the text files in the current directory that have been modified in the last month. Here's an example command:\n",
"```bash\n",
"find . -type f -name \"*.txt\" -mtime -30\n",
"```\n",
"This will list all the text files in the current directory (`.`) that have been modified in the last 30 days. The `-type f` option ensures that only regular files are matched, and not directories or other types of files. The `-name \"*.txt\"` option restricts the search to files with a `.txt` extension. Finally, the `-mtime -30` option specifies that we want to find files that have been modified in the last 30 days.\n",
"\n",
"You can also use `find` command with `-mmin` option to find all the text files in the current directory that have been modified within the last month. Here's an example command:\n",
"```bash\n",
"find . -type f -name \"*.txt\" -mmin -4320\n",
"```\n",
"This will list all the text files in the current directory (`.`) that have been modified within the last 30 days. The `-type f` option ensures that only regular files are matched, and not directories or other types of files. The `-name \"*.txt\"` option restricts the search to files with a `.txt` extension. Finally, the `-mmin -4320` option specifies that we want to find files that have been modified within the last 4320 minutes (which is equivalent to one month).\n",
"\n",
"You can also use `ls` command with `-l` option and pipe it to `grep` command to filter out the text files. Here's an example command:\n",
"```bash\n",
"ls -l | grep \"*.txt\"\n",
"```\n",
"This will list all the text files in the current directory (`.`) that have been modified within the last 30 days. The `-l` option of `ls` command lists the files in a long format, including the modification time, and the `grep` command filters out the files that do not match the specified pattern.\n",
"\n",
"Please note that these commands are case-sensitive, so if you have any files with different extensions (e.g., `.TXT`), they will not be matched by these commands.\n",
"{'model': 'codellama', 'created_at': '2024-04-03T00:41:44.014203Z', 'message': {'role': 'assistant', 'content': ''}, 'done': True, 'total_duration': 27078466916, 'load_duration': 12947208, 'prompt_eval_count': 44, 'prompt_eval_duration': 11497468000, 'eval_count': 510, 'eval_duration': 15548191000}\n"
]
}
],
"source": [
"response_message = llm.invoke(\n",
" \"In bash, how do I list all the text files in the current directory that have been modified in the last month?\"\n",
")\n",
"\n",
"print(response_message.content)\n",
"print(response_message.response_metadata)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks reasonable! Now let's set it up with our previously loaded vectorstore.\n",
"\n",
"We omit the conversational aspect to keep things more manageable for the lower-powered local model:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# from langchain.chains.question_answering import load_qa_chain\n",
"\n",
"# # Prompt\n",
"# template = \"\"\"Use the following pieces of context to answer the question at the end.\n",
"# If you don't know the answer, just say that you don't know, don't try to make up an answer.\n",
"# Use three sentences maximum and keep the answer as concise as possible.\n",
"# {context}\n",
"# Question: {question}\n",
"# Helpful Answer:\"\"\"\n",
"# QA_CHAIN_PROMPT = PromptTemplate(\n",
"# input_variables=[\"context\", \"question\"],\n",
"# template=template,\n",
"# )\n",
"\n",
"system_template = \"\"\"\n",
"Answer the user's questions based on the below context.\n",
"If you don't know the answer, just say that you don't know, don't try to make up an answer. \n",
"Use three sentences maximum and keep the answer as concise as possible:\n",
"\n",
"{context}\n",
"\"\"\"\n",
"\n",
"# First we need a prompt that we can pass into an LLM to generate this search query\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\"system\", system_template),\n",
" (\"user\", \"{input}\"),\n",
" ]\n",
")\n",
"document_chain = create_stuff_documents_chain(llm, prompt)\n",
"\n",
"qa_chain = create_retrieval_chain(retriever, document_chain)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"A RunnableBinding is a high-level class in the LangChain framework. It's an abstraction layer that sits between a program and an LLM or other data source.\\n\\nThe main goal of a RunnableBinding is to enable a program, which may be a chat bot or a backend service, to fetch responses from an LLM or other data sources in a way that is easy for both the program and the data sources to use. This is achieved through a set of predefined protocols that are implemented by the RunnableBinding.\\n\\nThe protocols defined by a RunnableBinding include:\\n\\n1. Fetching inputs from the program. The RunnableBinding should be able to receive inputs from the program and translate them into a format that can be processed by the LLM or other data sources.\\n2. Translating outputs from the LLM or other data sources into something that can be returned to the program. This includes converting the raw output of an LLM into something that is easier for the program to process, such as text or a structured object.\\n3. Handling errors that may arise during the fetching, processing, and returning of responses from the LLM or other data sources. The RunnableBinding should be able to catch exceptions and errors that occur during these operations and return a suitable error message or response to the program.\\n4. Managing concurrency and parallelism in the communication with the LLM or other data sources. This may include things like allowing multiple requests to be sent to the LLM or other data sources simultaneously, handling the responses asynchronously, and retrying failed requests.\\n5. Providing a way for the program to set configuration options that affect how the RunnableBinding interacts with the LLM or other data sources. This could include things like setting up credentials, providing additional contextual information to the LLM or other data sources, and controlling logging or error handling behavior.\\n\\nIn summary, a RunnableBinding provides a way for a program to easily communicate with an LLM or other data sources without having to know about the details of how they work. By providing a consistent interface between the program and the data sources, the RunnableBinding enables more robust and scalable communication protocols that are easier for both parties to use.\\n\\nIn the context of the chatbot tutorial, a RunnableBinding may be used to fetch responses from an LLM and return them as output for the bot to process. The RunnableBinding could also be used to handle errors that occur during this process, such as providing error messages or retrying failed requests to the LLM.\\n\\nTo summarize:\\n\\n* A RunnableBinding provides a way for a program to communicate with an LLM or other data sources without having to know about the details of how they work.\\n* It enables more robust and scalable communication protocols that are easier for both parties to use.\\n* It manages concurrency and parallelism in the communication with the LLM or other data sources.\\n* It provides a way for the program to set configuration options that affect how the RunnableBinding interacts with the LLM or other data sources.\""
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Run, only returning the value under the answer key for readability\n",
"qa_chain.pick(\"answer\").invoke({\"input\": \"What is a RunnableBinding?\"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Not perfect, but it did pick up on the fact that it lets the developer set configuration option!\n",
"\n",
"Here's the [LangSmith trace](https://smith.langchain.com/public/d8bb2af8-99cd-406b-a870-f255f4a2423c/r) showing the retrieved docs used as context."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}