{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Use LangChain, GPT and Deep Lake to work with a code base\n", "In this tutorial, we use LangChain and Deep Lake with GPT to analyze the code base of LangChain itself. " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Design" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "1. Prepare data:\n", "    1. Load all Python project files using `langchain.document_loaders.TextLoader`. We will call these files the **documents**.\n", "    2. Split all documents into chunks using `langchain.text_splitter.CharacterTextSplitter`.\n", "    3. Embed the chunks and upload them to Deep Lake using `langchain.embeddings.openai.OpenAIEmbeddings` and `langchain.vectorstores.DeepLake`.\n", "2. Question-Answering:\n", "    1. Build a chain from `langchain.chat_models.ChatOpenAI` and `langchain.chains.ConversationalRetrievalChain`.\n", "    2. Prepare questions.\n", "    3. Get answers by running the chain.\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Implementation" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### Integration preparations" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We need to set up keys for external services and install the necessary Python libraries." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [], "source": [ "#!python3 -m pip install --upgrade langchain deeplake openai" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Set up OpenAI embeddings and the Deep Lake multi-modal vector store API, then authenticate. 
\n", "\n", "For the full Deep Lake documentation, see https://docs.activeloop.ai/ and the API reference at https://docs.deeplake.ai/en/latest/" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " ········\n" ] } ], "source": [ "import os\n", "from getpass import getpass\n", "\n", "# Enter your OpenAI API key at the prompt (input is hidden)\n", "os.environ['OPENAI_API_KEY'] = getpass()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Authenticate with Deep Lake if you want to create your own dataset and publish it. You can get an API key from the platform at [app.activeloop.ai](https://app.activeloop.ai). The dataset paths used later also need your account name, set here as `DEEPLAKE_ACCOUNT_NAME`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " ········\n" ] } ], "source": [ "# `getpass` was imported as a function above, so call it directly\n", "os.environ['ACTIVELOOP_TOKEN'] = getpass('Activeloop Token:')\n", "\n", "# Placeholder (assumption): replace with your own Activeloop account or organization name\n", "DEEPLAKE_ACCOUNT_NAME = 'user_name'" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare data" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Load all repository files. Here we assume this notebook was downloaded as part of a LangChain fork, so we work with the Python files of the `langchain` repo.\n", "\n", "If you want to use files from a different repo, change `root_dir` to the root directory of that repo." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1147\n" ] } ], "source": [ "from langchain.document_loaders import TextLoader\n", "\n", "root_dir = '../../../..'\n", "\n", "docs = []\n", "for dirpath, dirnames, filenames in os.walk(root_dir):\n", "    for file in filenames:\n", "        # Only load Python files, skipping anything inside a virtual environment\n", "        if file.endswith('.py') and '/.venv/' not in dirpath:\n", "            try:\n", "                loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')\n", "                docs.extend(loader.load_and_split())\n", "            except Exception:\n", "                # Skip files that cannot be read or decoded\n", "                pass\n", "print(f'{len(docs)}')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Then, split the files into chunks." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Created a chunk of size 1620, which is longer than the specified 1000\n", "Created a chunk of size 1213, which is longer than the specified 1000\n", "Created a chunk of size 1263, which is longer than the specified 1000\n", "Created a chunk of size 1448, which is longer than the specified 1000\n", "Created a chunk of size 1120, which is longer than the specified 1000\n", "Created a chunk of size 1148, which is longer than the specified 1000\n", "Created a chunk of size 1826, which is longer than the specified 1000\n", "Created a chunk of size 1260, which is longer than the specified 1000\n", "Created a chunk of size 1195, which is longer than the specified 1000\n", "Created a chunk of size 2147, which is longer than the specified 1000\n", "Created a chunk of size 1410, which is longer than the specified 1000\n", "Created a chunk of size 1269, which is longer than the specified 1000\n", "Created a chunk of size 1030, which is longer than the specified 1000\n", "Created a chunk of size 1046, which is longer than the specified 1000\n", "Created a chunk of size 1024, which is longer than the specified 
1000\n", "Created a chunk of size 1026, which is longer than the specified 1000\n", "Created a chunk of size 1285, which is longer than the specified 1000\n", "Created a chunk of size 1370, which is longer than the specified 1000\n", "Created a chunk of size 1031, which is longer than the specified 1000\n", "Created a chunk of size 1999, which is longer than the specified 1000\n", "Created a chunk of size 1029, which is longer than the specified 1000\n", "Created a chunk of size 1120, which is longer than the specified 1000\n", "Created a chunk of size 1033, which is longer than the specified 1000\n", "Created a chunk of size 1143, which is longer than the specified 1000\n", "Created a chunk of size 1416, which is longer than the specified 1000\n", "Created a chunk of size 2482, which is longer than the specified 1000\n", "Created a chunk of size 1890, which is longer than the specified 1000\n", "Created a chunk of size 1418, which is longer than the specified 1000\n", "Created a chunk of size 1848, which is longer than the specified 1000\n", "Created a chunk of size 1069, which is longer than the specified 1000\n", "Created a chunk of size 2369, which is longer than the specified 1000\n", "Created a chunk of size 1045, which is longer than the specified 1000\n", "Created a chunk of size 1501, which is longer than the specified 1000\n", "Created a chunk of size 1208, which is longer than the specified 1000\n", "Created a chunk of size 1950, which is longer than the specified 1000\n", "Created a chunk of size 1283, which is longer than the specified 1000\n", "Created a chunk of size 1414, which is longer than the specified 1000\n", "Created a chunk of size 1304, which is longer than the specified 1000\n", "Created a chunk of size 1224, which is longer than the specified 1000\n", "Created a chunk of size 1060, which is longer than the specified 1000\n", "Created a chunk of size 2461, which is longer than the specified 1000\n", "Created a chunk of size 1099, which is 
longer than the specified 1000\n", "Created a chunk of size 1178, which is longer than the specified 1000\n", "Created a chunk of size 1449, which is longer than the specified 1000\n", "Created a chunk of size 1345, which is longer than the specified 1000\n", "Created a chunk of size 3359, which is longer than the specified 1000\n", "Created a chunk of size 2248, which is longer than the specified 1000\n", "Created a chunk of size 1589, which is longer than the specified 1000\n", "Created a chunk of size 2104, which is longer than the specified 1000\n", "Created a chunk of size 1505, which is longer than the specified 1000\n", "Created a chunk of size 1387, which is longer than the specified 1000\n", "Created a chunk of size 1215, which is longer than the specified 1000\n", "Created a chunk of size 1240, which is longer than the specified 1000\n", "Created a chunk of size 1635, which is longer than the specified 1000\n", "Created a chunk of size 1075, which is longer than the specified 1000\n", "Created a chunk of size 2180, which is longer than the specified 1000\n", "Created a chunk of size 1791, which is longer than the specified 1000\n", "Created a chunk of size 1555, which is longer than the specified 1000\n", "Created a chunk of size 1082, which is longer than the specified 1000\n", "Created a chunk of size 1225, which is longer than the specified 1000\n", "Created a chunk of size 1287, which is longer than the specified 1000\n", "Created a chunk of size 1085, which is longer than the specified 1000\n", "Created a chunk of size 1117, which is longer than the specified 1000\n", "Created a chunk of size 1966, which is longer than the specified 1000\n", "Created a chunk of size 1150, which is longer than the specified 1000\n", "Created a chunk of size 1285, which is longer than the specified 1000\n", "Created a chunk of size 1150, which is longer than the specified 1000\n", "Created a chunk of size 1585, which is longer than the specified 1000\n", "Created a 
chunk of size 1208, which is longer than the specified 1000\n", "Created a chunk of size 1267, which is longer than the specified 1000\n", "Created a chunk of size 1542, which is longer than the specified 1000\n", "Created a chunk of size 1183, which is longer than the specified 1000\n", "Created a chunk of size 2424, which is longer than the specified 1000\n", "Created a chunk of size 1017, which is longer than the specified 1000\n", "Created a chunk of size 1304, which is longer than the specified 1000\n", "Created a chunk of size 1379, which is longer than the specified 1000\n", "Created a chunk of size 1324, which is longer than the specified 1000\n", "Created a chunk of size 1205, which is longer than the specified 1000\n", "Created a chunk of size 1056, which is longer than the specified 1000\n", "Created a chunk of size 1195, which is longer than the specified 1000\n", "Created a chunk of size 3608, which is longer than the specified 1000\n", "Created a chunk of size 1058, which is longer than the specified 1000\n", "Created a chunk of size 1075, which is longer than the specified 1000\n", "Created a chunk of size 1217, which is longer than the specified 1000\n", "Created a chunk of size 1109, which is longer than the specified 1000\n", "Created a chunk of size 1440, which is longer than the specified 1000\n", "Created a chunk of size 1046, which is longer than the specified 1000\n", "Created a chunk of size 1220, which is longer than the specified 1000\n", "Created a chunk of size 1403, which is longer than the specified 1000\n", "Created a chunk of size 1241, which is longer than the specified 1000\n", "Created a chunk of size 1427, which is longer than the specified 1000\n", "Created a chunk of size 1049, which is longer than the specified 1000\n", "Created a chunk of size 1580, which is longer than the specified 1000\n", "Created a chunk of size 1565, which is longer than the specified 1000\n", "Created a chunk of size 1131, which is longer than the 
specified 1000\n", "Created a chunk of size 1425, which is longer than the specified 1000\n", "Created a chunk of size 1054, which is longer than the specified 1000\n", "Created a chunk of size 1027, which is longer than the specified 1000\n", "Created a chunk of size 2559, which is longer than the specified 1000\n", "Created a chunk of size 1028, which is longer than the specified 1000\n", "Created a chunk of size 1382, which is longer than the specified 1000\n", "Created a chunk of size 1888, which is longer than the specified 1000\n", "Created a chunk of size 1475, which is longer than the specified 1000\n", "Created a chunk of size 1652, which is longer than the specified 1000\n", "Created a chunk of size 1891, which is longer than the specified 1000\n", "Created a chunk of size 1899, which is longer than the specified 1000\n", "Created a chunk of size 1021, which is longer than the specified 1000\n", "Created a chunk of size 1085, which is longer than the specified 1000\n", "Created a chunk of size 1854, which is longer than the specified 1000\n", "Created a chunk of size 1672, which is longer than the specified 1000\n", "Created a chunk of size 2537, which is longer than the specified 1000\n", "Created a chunk of size 1251, which is longer than the specified 1000\n", "Created a chunk of size 1734, which is longer than the specified 1000\n", "Created a chunk of size 1642, which is longer than the specified 1000\n", "Created a chunk of size 1376, which is longer than the specified 1000\n", "Created a chunk of size 1253, which is longer than the specified 1000\n", "Created a chunk of size 1642, which is longer than the specified 1000\n", "Created a chunk of size 1419, which is longer than the specified 1000\n", "Created a chunk of size 1438, which is longer than the specified 1000\n", "Created a chunk of size 1427, which is longer than the specified 1000\n", "Created a chunk of size 1684, which is longer than the specified 1000\n", "Created a chunk of size 1760, 
which is longer than the specified 1000\n", "Created a chunk of size 1157, which is longer than the specified 1000\n", "Created a chunk of size 2504, which is longer than the specified 1000\n", "Created a chunk of size 1082, which is longer than the specified 1000\n", "Created a chunk of size 2268, which is longer than the specified 1000\n", "Created a chunk of size 1784, which is longer than the specified 1000\n", "Created a chunk of size 1311, which is longer than the specified 1000\n", "Created a chunk of size 2972, which is longer than the specified 1000\n", "Created a chunk of size 1144, which is longer than the specified 1000\n", "Created a chunk of size 1825, which is longer than the specified 1000\n", "Created a chunk of size 1508, which is longer than the specified 1000\n", "Created a chunk of size 2901, which is longer than the specified 1000\n", "Created a chunk of size 1715, which is longer than the specified 1000\n", "Created a chunk of size 1062, which is longer than the specified 1000\n", "Created a chunk of size 1206, which is longer than the specified 1000\n", "Created a chunk of size 1102, which is longer than the specified 1000\n", "Created a chunk of size 1184, which is longer than the specified 1000\n", "Created a chunk of size 1002, which is longer than the specified 1000\n", "Created a chunk of size 1065, which is longer than the specified 1000\n", "Created a chunk of size 1871, which is longer than the specified 1000\n", "Created a chunk of size 1754, which is longer than the specified 1000\n", "Created a chunk of size 2413, which is longer than the specified 1000\n", "Created a chunk of size 1771, which is longer than the specified 1000\n", "Created a chunk of size 2054, which is longer than the specified 1000\n", "Created a chunk of size 2000, which is longer than the specified 1000\n", "Created a chunk of size 2061, which is longer than the specified 1000\n", "Created a chunk of size 1066, which is longer than the specified 1000\n", 
"Created a chunk of size 1419, which is longer than the specified 1000\n", "Created a chunk of size 1368, which is longer than the specified 1000\n", "Created a chunk of size 1008, which is longer than the specified 1000\n", "Created a chunk of size 1227, which is longer than the specified 1000\n", "Created a chunk of size 1745, which is longer than the specified 1000\n", "Created a chunk of size 2296, which is longer than the specified 1000\n", "Created a chunk of size 1083, which is longer than the specified 1000\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "3477\n" ] } ], "source": [ "from langchain.text_splitter import CharacterTextSplitter\n", "\n", "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n", "texts = text_splitter.split_documents(docs)\n", "print(f\"{len(texts)}\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Then embed chunks and upload them to the DeepLake.\n", "\n", "This can take several minutes. 
" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "OpenAIEmbeddings(client=, model='text-embedding-ada-002', document_model_name='text-embedding-ada-002', query_model_name='text-embedding-ada-002', embedding_ctx_length=8191, openai_api_key=None, openai_organization=None, allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=6)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from langchain.embeddings.openai import OpenAIEmbeddings\n", "\n", "embeddings = OpenAIEmbeddings()\n", "embeddings" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from langchain.vectorstores import DeepLake\n", "\n", "db = DeepLake.from_documents(texts, embeddings, dataset_path=f\"hub://{DEEPLAKE_ACCOUNT_NAME}/langchain-code\")\n", "db" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Question Answering\n", "First load the dataset, construct the retriever, then construct the Conversational Chain" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "-" ] }, { "name": "stdout", "output_type": "stream", "text": [ "This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/user_name/langchain-code\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/" ] }, { "name": "stdout", "output_type": "stream", "text": [ "hub://user_name/langchain-code loaded successfully.\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Deep Lake Dataset in hub://user_name/langchain-code already exists, loading from the storage\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Dataset(path='hub://user_name/langchain-code', read_only=True, tensors=['embedding', 'ids', 'metadata', 'text'])\n", "\n", " 
tensor htype shape dtype compression\n", " ------- ------- ------- ------- ------- \n", " embedding generic (3477, 1536) float32 None \n", " ids text (3477, 1) str None \n", " metadata json (3477, 1) str None \n", " text text (3477, 1) str None \n" ] } ], "source": [ "db = DeepLake(dataset_path=f\"hub://{DEEPLAKE_ACCOUNT_NAME}/langchain-code\", read_only=True, embedding_function=embeddings)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "tags": [] }, "outputs": [], "source": [ "retriever = db.as_retriever()\n", "retriever.search_kwargs['distance_metric'] = 'cos'\n", "retriever.search_kwargs['fetch_k'] = 20\n", "retriever.search_kwargs['maximal_marginal_relevance'] = True\n", "retriever.search_kwargs['k'] = 20" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "You can also filter results with user-defined functions using [Deep Lake filters](https://docs.deeplake.ai/en/latest/deeplake.core.dataset.html#deeplake.core.dataset.Dataset.filter)." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "tags": [] }, "outputs": [], "source": [ "def filter(x):\n", "    # filter based on source code\n", "    if 'something' in x['text'].data()['value']:\n", "        return False\n", "\n", "    # filter based on path, e.g. extension\n", "    metadata = x['metadata'].data()['value']\n", "    return 'only_this' in metadata['source'] or 'also_that' in metadata['source']\n", "\n", "# Uncomment the line below to enable custom filtering\n", "# retriever.search_kwargs['filter'] = filter" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "tags": [] }, "outputs": [], "source": [ "from langchain.chat_models import ChatOpenAI\n", "from langchain.chains import ConversationalRetrievalChain\n", "\n", "model = ChatOpenAI(model_name='gpt-3.5-turbo')  # or another chat model, e.g. 'gpt-4'\n", "qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "questions = [\n", "    \"What is the class hierarchy?\",\n", "    # \"What classes are derived from the Chain class?\",\n", "    # \"What classes and functions in the ./langchain/utilities/ folder are not covered by unit tests?\",\n", "    # \"What one improvement do you propose in code in relation to the class hierarchy for the Chain class?\",\n", "]\n", "chat_history = []\n", "\n", "for question in questions:\n", "    # Pass the accumulated history so follow-up questions keep their context\n", "    result = qa({\"question\": question, \"chat_history\": chat_history})\n", "    chat_history.append((question, result['answer']))\n", "    print(f\"-> **Question**: {question} \\n\")\n", "    print(f\"**Answer**: {result['answer']} \\n\")\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "-> **Question**: What is the class hierarchy? \n", "\n", "**Answer**: There are several class hierarchies in the provided code, so I'll list a few:\n", "\n", "1. `BaseModel` -> `ConstitutionalPrinciple`: `ConstitutionalPrinciple` is a subclass of `BaseModel`.\n", "2. 
`BasePromptTemplate` -> `StringPromptTemplate`, `AIMessagePromptTemplate`, `BaseChatPromptTemplate`, `ChatMessagePromptTemplate`, `ChatPromptTemplate`, `HumanMessagePromptTemplate`, `MessagesPlaceholder`, `SystemMessagePromptTemplate`, `FewShotPromptTemplate`, `FewShotPromptWithTemplates`, `Prompt`, `PromptTemplate`: All of these classes are subclasses of `BasePromptTemplate`.\n", "3. `APIChain`, `Chain`, `MapReduceDocumentsChain`, `MapRerankDocumentsChain`, `RefineDocumentsChain`, `StuffDocumentsChain`, `HypotheticalDocumentEmbedder`, `LLMChain`, `LLMBashChain`, `LLMCheckerChain`, `LLMMathChain`, `LLMRequestsChain`, `PALChain`, `QAWithSourcesChain`, `VectorDBQAWithSourcesChain`, `VectorDBQA`, `SQLDatabaseChain`: All of these classes are subclasses of `Chain`.\n", "4. `BaseLoader`: `BaseLoader` is a subclass of `ABC`.\n", "5. `BaseTracer` -> `ChainRun`, `LLMRun`, `SharedTracer`, `ToolRun`, `Tracer`, `TracerException`, `TracerSession`: All of these classes are subclasses of `BaseTracer`.\n", "6. `OpenAIEmbeddings`, `HuggingFaceEmbeddings`, `CohereEmbeddings`, `JinaEmbeddings`, `LlamaCppEmbeddings`, `HuggingFaceHubEmbeddings`, `TensorflowHubEmbeddings`, `SagemakerEndpointEmbeddings`, `HuggingFaceInstructEmbeddings`, `SelfHostedEmbeddings`, `SelfHostedHuggingFaceEmbeddings`, `SelfHostedHuggingFaceInstructEmbeddings`, `FakeEmbeddings`, `AlephAlphaAsymmetricSemanticEmbedding`, `AlephAlphaSymmetricSemanticEmbedding`: All of these classes are subclasses of `BaseLLM`. \n", "\n", "\n", "-> **Question**: What classes are derived from the Chain class? \n", "\n", "**Answer**: There are multiple classes that are derived from the Chain class. 
Some of them are:\n", "- APIChain\n", "- AnalyzeDocumentChain\n", "- ChatVectorDBChain\n", "- CombineDocumentsChain\n", "- ConstitutionalChain\n", "- ConversationChain\n", "- GraphQAChain\n", "- HypotheticalDocumentEmbedder\n", "- LLMChain\n", "- LLMCheckerChain\n", "- LLMRequestsChain\n", "- LLMSummarizationCheckerChain\n", "- MapReduceChain\n", "- OpenAPIEndpointChain\n", "- PALChain\n", "- QAWithSourcesChain\n", "- RetrievalQA\n", "- RetrievalQAWithSourcesChain\n", "- SequentialChain\n", "- SQLDatabaseChain\n", "- TransformChain\n", "- VectorDBQA\n", "- VectorDBQAWithSourcesChain\n", "\n", "There might be more classes that are derived from the Chain class as it is possible to create custom classes that extend the Chain class.\n", "\n", "\n", "-> **Question**: What classes and functions in the ./langchain/utilities/ folder are not covered by unit tests? \n", "\n", "**Answer**: All classes and functions in the `./langchain/utilities/` folder seem to have unit tests written for them. \n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }