Add llama-v2 to local document QA (#7952)
This commit is contained in:
parent d9b5bcd691
commit dfc533aa74
@@ -11,17 +11,22 @@
|
||||
"\n",
|
||||
"LangChain has [integrations](https://integrations.langchain.com/) with many open source LLMs that can be run locally.\n",
|
||||
"\n",
|
||||
"For example, here we show how to run `GPT4All` locally (e.g., on your laptop) using local embeddings and a local LLM."
|
||||
"For example, here we show how to run `GPT4All` or `Llama-v2` locally (e.g., on your laptop) using local embeddings and a local LLM.\n",
|
||||
"\n",
|
||||
"## Document Loading \n",
|
||||
"\n",
|
||||
"First, install packages needed for local embeddings and vector storage."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "11514b36",
|
||||
"id": "a7dc1ec5",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"! pip install gpt4all"
|
||||
"! pip install gpt4all\n",
|
||||
"! pip install chromadb"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -29,12 +34,14 @@
|
||||
"id": "5e7543fa",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Load and split an example docucment."
|
||||
"Load and split an example docucment.\n",
|
||||
"\n",
|
||||
"We'll use a blog post on agents as an example."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"execution_count": 24,
|
||||
"id": "f8cf5765",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
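The loading-and-splitting cell itself is elided from this hunk; below is a minimal sketch of what it typically contains, assuming `WebBaseLoader` and `RecursiveCharacterTextSplitter` from `langchain`. The chunking values and the variable name `all_splits` are assumptions, not taken from the notebook.

```python
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the blog post on agents referenced in the cell outputs below.
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()

# Split into chunks for embedding; chunk_size/chunk_overlap are illustrative values.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
```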
@@ -55,18 +62,12 @@
|
||||
"id": "131d5059",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This will download the `GPT4All` embeddings locally if you don't already have them.\n",
|
||||
"\n",
|
||||
"For example, mine are here:\n",
|
||||
" \n",
|
||||
"```\n",
|
||||
"Model downloaded at: /Users/rlm/.cache/gpt4all/ggml-all-MiniLM-L6-v2-f16.bin\n",
|
||||
"```"
|
||||
"Next, the below steps will download the `GPT4All` embeddings locally (if you don't already have them)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"execution_count": 25,
|
||||
"id": "fdce8923",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@@ -74,8 +75,7 @@
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Found model file at /Users/rlm/.cache/gpt4all/ggml-all-MiniLM-L6-v2-f16.bin\n",
|
||||
"llama_new_context_with_model: max tensor size = 87.89 MB\n"
|
||||
"Found model file at /Users/rlm/.cache/gpt4all/ggml-all-MiniLM-L6-v2-f16.bin\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
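The cell that builds the vector store is likewise elided; a minimal sketch, assuming `GPT4AllEmbeddings` and `Chroma` from `langchain` and the `all_splits` list from the splitting step sketched earlier (`vectorstore` is the name the later cells use):

```python
from langchain.embeddings import GPT4AllEmbeddings
from langchain.vectorstores import Chroma

# Index the splits with local GPT4All embeddings in a local Chroma store.
vectorstore = Chroma.from_documents(documents=all_splits, embedding=GPT4AllEmbeddings())
```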
@@ -96,7 +96,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"execution_count": 10,
|
||||
"id": "b0c55e98",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
@@ -106,7 +106,7 @@
|
||||
"4"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -119,17 +119,17 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"execution_count": 11,
|
||||
"id": "32b43339",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"Document(page_content='Task decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\\\\n1.\", \"What are the subgoals for achieving XYZ?\", (2) by using task-specific instructions; e.g. \"Write a story outline.\" for writing a novel, or (3) with human inputs.', metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'title': \"LLM Powered Autonomous Agents | Lil'Log\", 'description': 'Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\\nAgent System Overview In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:', 'language': 'en'})"
|
||||
"Document(page_content='Task decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\\\\n1.\", \"What are the subgoals for achieving XYZ?\", (2) by using task-specific instructions; e.g. \"Write a story outline.\" for writing a novel, or (3) with human inputs.', metadata={'description': 'Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\\nAgent System Overview In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:', 'language': 'en', 'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'title': \"LLM Powered Autonomous Agents | Lil'Log\"})"
|
||||
]
|
||||
},
|
||||
"execution_count": 7,
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -138,11 +138,245 @@
|
||||
"docs[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "557cd9b8",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Model \n",
|
||||
"\n",
|
||||
"### Llama-v2\n",
|
||||
"\n",
|
||||
"Download a GGML converted model (e.g., [here](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/tree/main))."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "9f218576",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"! pip install llama-cpp-python"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0dd1804f",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To enable use of GPU on Apple Silicon, follow the steps [here](https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md) to use the Python binding `with Metal support`.\n",
|
||||
"\n",
|
||||
"In particular, ensure that `conda` is using the correct virtual enviorment that you created (`miniforge3`).\n",
|
||||
"\n",
|
||||
"E.g., for me:\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"conda activate /Users/rlm/miniforge3/envs/llama\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"With this confirmed:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "2fd6fe25",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"! CMAKE_ARGS=\"-DLLAMA_METAL=on\" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "cd7164e3",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.llms import LlamaCpp\n",
|
||||
"from langchain.callbacks.manager import CallbackManager\n",
|
||||
"from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "fcf81052",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Setting model parameters as noted in the [llama.cpp docs](https://python.langchain.com/docs/modules/model_io/models/llms/integrations/llamacpp)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 28,
|
||||
"id": "74718579",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"llama.cpp: loading model from /Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin\n",
|
||||
"llama_model_load_internal: format = ggjt v3 (latest)\n",
|
||||
"llama_model_load_internal: n_vocab = 32000\n",
|
||||
"llama_model_load_internal: n_ctx = 2048\n",
|
||||
"llama_model_load_internal: n_embd = 5120\n",
|
||||
"llama_model_load_internal: n_mult = 256\n",
|
||||
"llama_model_load_internal: n_head = 40\n",
|
||||
"llama_model_load_internal: n_layer = 40\n",
|
||||
"llama_model_load_internal: n_rot = 128\n",
|
||||
"llama_model_load_internal: freq_base = 10000.0\n",
|
||||
"llama_model_load_internal: freq_scale = 1\n",
|
||||
"llama_model_load_internal: ftype = 2 (mostly Q4_0)\n",
|
||||
"llama_model_load_internal: n_ff = 13824\n",
|
||||
"llama_model_load_internal: model size = 13B\n",
|
||||
"llama_model_load_internal: ggml ctx size = 0.09 MB\n",
|
||||
"llama_model_load_internal: mem required = 8819.71 MB (+ 1608.00 MB per state)\n",
|
||||
"llama_new_context_with_model: kv self size = 1600.00 MB\n",
|
||||
"ggml_metal_init: allocating\n",
|
||||
"ggml_metal_init: using MPS\n",
|
||||
"ggml_metal_init: loading '/Users/rlm/miniforge3/envs/llama/lib/python3.9/site-packages/llama_cpp/ggml-metal.metal'\n",
|
||||
"ggml_metal_init: loaded kernel_add 0x76add7460\n",
|
||||
"ggml_metal_init: loaded kernel_mul 0x76add5090\n",
|
||||
"ggml_metal_init: loaded kernel_mul_row 0x76addae00\n",
|
||||
"ggml_metal_init: loaded kernel_scale 0x76adb2940\n",
|
||||
"ggml_metal_init: loaded kernel_silu 0x76adb8610\n",
|
||||
"ggml_metal_init: loaded kernel_relu 0x76addb700\n",
|
||||
"ggml_metal_init: loaded kernel_gelu 0x76addc100\n",
|
||||
"ggml_metal_init: loaded kernel_soft_max 0x76addcb80\n",
|
||||
"ggml_metal_init: loaded kernel_diag_mask_inf 0x76addd600\n",
|
||||
"ggml_metal_init: loaded kernel_get_rows_f16 0x295f16380\n",
|
||||
"ggml_metal_init: loaded kernel_get_rows_q4_0 0x295f165e0\n",
|
||||
"ggml_metal_init: loaded kernel_get_rows_q4_1 0x295f16840\n",
|
||||
"ggml_metal_init: loaded kernel_get_rows_q2_K 0x295f16aa0\n",
|
||||
"ggml_metal_init: loaded kernel_get_rows_q3_K 0x295f16d00\n",
|
||||
"ggml_metal_init: loaded kernel_get_rows_q4_K 0x295f16f60\n",
|
||||
"ggml_metal_init: loaded kernel_get_rows_q5_K 0x295f171c0\n",
|
||||
"ggml_metal_init: loaded kernel_get_rows_q6_K 0x295f17420\n",
|
||||
"ggml_metal_init: loaded kernel_rms_norm 0x295f17680\n",
|
||||
"ggml_metal_init: loaded kernel_norm 0x295f178e0\n",
|
||||
"ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x295f17b40\n",
|
||||
"ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x295f17da0\n",
|
||||
"ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x295f18000\n",
|
||||
"ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x7962b9900\n",
|
||||
"ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x7962bf5f0\n",
|
||||
"ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x7962bc630\n",
|
||||
"ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x142045960\n",
|
||||
"ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x7962ba2b0\n",
|
||||
"ggml_metal_init: loaded kernel_rope 0x7962c35f0\n",
|
||||
"ggml_metal_init: loaded kernel_alibi_f32 0x7962c30b0\n",
|
||||
"ggml_metal_init: loaded kernel_cpy_f32_f16 0x7962c15b0\n",
|
||||
"ggml_metal_init: loaded kernel_cpy_f32_f32 0x7962beb10\n",
|
||||
"ggml_metal_init: loaded kernel_cpy_f16_f16 0x7962bf060\n",
|
||||
"ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB\n",
|
||||
"ggml_metal_init: hasUnifiedMemory = true\n",
|
||||
"ggml_metal_init: maxTransferRate = built-in GPU\n",
|
||||
"ggml_metal_add_buffer: allocated 'data ' buffer, size = 6984.06 MB, (35852.94 / 21845.34), warning: current allocated size is greater than the recommended max working set size\n",
|
||||
"ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1026.00 MB, (36878.94 / 21845.34), warning: current allocated size is greater than the recommended max working set size\n",
|
||||
"ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1602.00 MB, (38480.94 / 21845.34), warning: current allocated size is greater than the recommended max working set size\n",
|
||||
"ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 298.00 MB, (38778.94 / 21845.34), warning: current allocated size is greater than the recommended max working set size\n",
|
||||
"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | \n",
|
||||
"ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB, (39290.94 / 21845.34), warning: current allocated size is greater than the recommended max working set size\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"n_gpu_layers = 1 # Metal set to 1 is enough.\n",
|
||||
"n_batch = 512 # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.\n",
|
||||
"callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])\n",
|
||||
"\n",
|
||||
"# Make sure the model path is correct for your system!\n",
|
||||
"llm = LlamaCpp(\n",
|
||||
" model_path=\"/Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin\",\n",
|
||||
" n_gpu_layers=n_gpu_layers,\n",
|
||||
" n_batch=n_batch,\n",
|
||||
" n_ctx=2048,\n",
|
||||
" f16_kv=True, # MUST set to True, otherwise you will run into problem after a couple of calls\n",
|
||||
" callback_manager=callback_manager,\n",
|
||||
" verbose=True,\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "3831b16a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Note that these indicate that [Metal was enabled properly](https://python.langchain.com/docs/modules/model_io/models/llms/integrations/llamacpp):\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"ggml_metal_init: allocating\n",
|
||||
"ggml_metal_init: using MPS\n",
|
||||
"```"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 30,
|
||||
"id": "e940de71",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Llama.generate: prefix-match hit\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"Setting: The Late Show with Stephen Colbert. The studio audience is filled with fans of both comedians, and the energy is electric. The two comedians are seated at a table, ready to begin their epic rap battle.\n",
|
||||
"\n",
|
||||
"Stephen Colbert: (smirking) Oh, you think you can take me down, John? You're just a Brit with a funny accent, and I'm the king of comedy!\n",
|
||||
"John Oliver: (grinning) Oh, you think you're tough, Stephen? You're just a has-been from South Carolina, and I'm the future of comedy!\n",
|
||||
"The battle begins, with each comedian delivering clever rhymes and witty insults. Here are a few lines that might be included:\n",
|
||||
"Stephen Colbert: (rapping) You may have a big brain, John, but you can't touch my charm / I've got the audience in stitches, while you're just a blemish on the screen / Your accent is so thick, it's like trying to hear a speech through a mouthful of marshmallows / You may have"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"llama_print_timings: load time = 2201.54 ms\n",
|
||||
"llama_print_timings: sample time = 182.54 ms / 256 runs ( 0.71 ms per token, 1402.41 tokens per second)\n",
|
||||
"llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)\n",
|
||||
"llama_print_timings: eval time = 8484.62 ms / 256 runs ( 33.14 ms per token, 30.17 tokens per second)\n",
|
||||
"llama_print_timings: total time = 9000.62 ms\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"\"\\nSetting: The Late Show with Stephen Colbert. The studio audience is filled with fans of both comedians, and the energy is electric. The two comedians are seated at a table, ready to begin their epic rap battle.\\n\\nStephen Colbert: (smirking) Oh, you think you can take me down, John? You're just a Brit with a funny accent, and I'm the king of comedy!\\nJohn Oliver: (grinning) Oh, you think you're tough, Stephen? You're just a has-been from South Carolina, and I'm the future of comedy!\\nThe battle begins, with each comedian delivering clever rhymes and witty insults. Here are a few lines that might be included:\\nStephen Colbert: (rapping) You may have a big brain, John, but you can't touch my charm / I've got the audience in stitches, while you're just a blemish on the screen / Your accent is so thick, it's like trying to hear a speech through a mouthful of marshmallows / You may have\""
|
||||
]
|
||||
},
|
||||
"execution_count": 30,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"prompt = \"\"\"\n",
|
||||
"Question: A rap battle between Stephen Colbert and John Oliver\n",
|
||||
"\"\"\"\n",
|
||||
"llm(prompt)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0d9579a7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### GPT4All\n",
|
||||
"\n",
|
||||
"Similarly, we can use `GPT4All`.\n",
|
||||
"\n",
|
||||
"[Download the GPT4All model binary](https://python.langchain.com/docs/modules/model_io/models/llms/integrations/gpt4all).\n",
|
||||
"\n",
|
||||
"The Model Explorer on the [GPT4All](https://gpt4all.io/index.html) is a great way to choose and download a model.\n",
|
||||
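The `GPT4All` instantiation itself is elided from this hunk; a minimal sketch, assuming the `langchain` `GPT4All` LLM wrapper. The model path is a placeholder for wherever the Model Explorer saved your binary, not a path from the notebook.

```python
from langchain.llms import GPT4All

# Point `model` at the downloaded GPT4All binary (placeholder path below).
llm = GPT4All(model="/path/to/ggml-gpt4all-model.bin", verbose=True)
```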
@@ -249,26 +483,61 @@
|
||||
"id": "d58838ae",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Run an `LLMChain` (see [here](https://python.langchain.com/docs/modules/chains/foundational/llm_chain)) by passing in the retrieved docs and a simple prompt.\n",
|
||||
"## LLMChain\n",
|
||||
"\n",
|
||||
"It formats the prompt template using the input key values provided and passes the formatted string to `GPT4All`.\n",
|
||||
"Run an `LLMChain` (see [here](https://python.langchain.com/docs/modules/chains/foundational/llm_chain)) with either model by passing in the retrieved docs and a simple prompt.\n",
|
||||
"\n",
|
||||
"In this case, the list of retrieved documents above are pass into `{context}`."
|
||||
"It formats the prompt template using the input key values provided and passes the formatted string to `GPT4All`, `LLama-V2`, or another specified LLM.\n",
|
||||
" \n",
|
||||
"In this case, the list of retrieved documents (`docs`) above are pass into `{context}`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"execution_count": 27,
|
||||
"id": "18a3716d",
|
||||
"metadata": {},
|
||||
"metadata": {
|
||||
"scrolled": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Llama.generate: prefix-match hit\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"Based on the retrieved documents, the main themes are:\n",
|
||||
"1. Task decomposition: The ability to break down complex tasks into smaller subtasks, which can be handled by an LLM or other components of the agent system.\n",
|
||||
"2. LLM as the core controller: The use of a large language model (LLM) as the primary controller of an autonomous agent system, complemented by other key components such as a knowledge graph and a planner.\n",
|
||||
"3. Potentiality of LLM: The idea that LLMs have the potential to be used as powerful general problem solvers, not just for generating well-written copies but also for solving complex tasks and achieving human-like intelligence.\n",
|
||||
"4. Challenges in long-term planning: The challenges in planning over a lengthy history and effectively exploring the solution space, which are important limitations of current LLM-based autonomous agent systems."
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"llama_print_timings: load time = 1191.88 ms\n",
|
||||
"llama_print_timings: sample time = 134.47 ms / 193 runs ( 0.70 ms per token, 1435.25 tokens per second)\n",
|
||||
"llama_print_timings: prompt eval time = 39470.18 ms / 1055 tokens ( 37.41 ms per token, 26.73 tokens per second)\n",
|
||||
"llama_print_timings: eval time = 8090.85 ms / 192 runs ( 42.14 ms per token, 23.73 tokens per second)\n",
|
||||
"llama_print_timings: total time = 47943.12 ms\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'\\nThe main themes in this context are task decomposition and building agents with large language models (LLM) as their core controller. The document summarizes how task decomposition can be done using LLM prompting or human inputs, and the challenges faced by LLMs in long-term planning and task decomposition. Finally, it discusses how expert models execute on specific tasks and log results during instruction execution.'"
|
||||
"'\\nBased on the retrieved documents, the main themes are:\\n1. Task decomposition: The ability to break down complex tasks into smaller subtasks, which can be handled by an LLM or other components of the agent system.\\n2. LLM as the core controller: The use of a large language model (LLM) as the primary controller of an autonomous agent system, complemented by other key components such as a knowledge graph and a planner.\\n3. Potentiality of LLM: The idea that LLMs have the potential to be used as powerful general problem solvers, not just for generating well-written copies but also for solving complex tasks and achieving human-like intelligence.\\n4. Challenges in long-term planning: The challenges in planning over a lengthy history and effectively exploring the solution space, which are important limitations of current LLM-based autonomous agent systems.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 15,
|
||||
"execution_count": 27,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -277,11 +546,18 @@
|
||||
"from langchain import PromptTemplate, LLMChain\n",
|
||||
"\n",
|
||||
"# Prompt\n",
|
||||
"prompt_template = \"Summarize the main themes in this context: {context}?\"\n",
|
||||
"prompt = PromptTemplate.from_template(\n",
|
||||
" \"Summarize the main themes in these retrieved docs: {docs}\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# Chain\n",
|
||||
"llm_chain = LLMChain(llm=llm, prompt=PromptTemplate.from_template(prompt_template))\n",
|
||||
"llm_chain = LLMChain(llm=llm, prompt=prompt)\n",
|
||||
"\n",
|
||||
"# Run\n",
|
||||
"question = \"What are the approaches to Task Decomposition?\"\n",
|
||||
"docs = vectorstore.similarity_search(question)\n",
|
||||
"result = llm_chain(docs)\n",
|
||||
"\n",
|
||||
"# Output\n",
|
||||
"result[\"text\"]"
|
||||
]
|
||||
@@ -291,6 +567,8 @@
|
||||
"id": "ed9cecf8",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## QA Chain\n",
|
||||
"\n",
|
||||
"We can use a `QA chain` to handle our question above.\n",
|
||||
"\n",
|
||||
"`chain_type=\"stuff\"` (see [here](https://python.langchain.com/docs/modules/chains/document/stuff)) means that all the docs will be added (stuffed) into a prompt."
|
||||
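The QA-chain cell is elided here; a minimal sketch, assuming `load_qa_chain` from `langchain` and the `llm` and `vectorstore` objects defined above. The actual cell may also pass a custom prompt (the output below reads like one was used).

```python
from langchain.chains.question_answering import load_qa_chain

# `llm` is whichever local model was instantiated above (LlamaCpp or GPT4All).
chain = load_qa_chain(llm, chain_type="stuff")

# Retrieve relevant docs and stuff them into the prompt.
question = "What are the approaches to Task Decomposition?"
docs = vectorstore.similarity_search(question)
chain({"input_documents": docs, "question": question}, return_only_outputs=True)
```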
@@ -298,17 +576,43 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"execution_count": 20,
|
||||
"id": "c01c1725",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'output_text': ' There are three main approaches to task decomposition: (1) using language model prompts with simple instructions like \"Steps for XYZ.\\\\n1.\", (2) using task-specific instructions, such as \"Write a story outline.\" for writing a novel, or (3) combining human inputs. However, challenges remain in long-term planning and adjusting plans when faced with unexpected errors, making LLMs less robust compared to humans who learn from trial and error during execution.'}"
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Llama.generate: prefix-match hit\n"
|
||||
]
|
||||
},
|
||||
"execution_count": 12,
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
" Hi there! There are three main approaches to task decomposition. One is using LLM with simple prompting like \"Steps for XYZ.\" or \"What are the subgoals for achieving XYZ?\" Another approach is by using task-specific instructions, such as \"Write a story outline\" for writing a novel. Finally, task decomposition can also be done with human inputs. Thanks for asking!"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"llama_print_timings: load time = 1191.88 ms\n",
|
||||
"llama_print_timings: sample time = 61.21 ms / 85 runs ( 0.72 ms per token, 1388.64 tokens per second)\n",
|
||||
"llama_print_timings: prompt eval time = 8014.11 ms / 267 tokens ( 30.02 ms per token, 33.32 tokens per second)\n",
|
||||
"llama_print_timings: eval time = 2908.17 ms / 84 runs ( 34.62 ms per token, 28.88 tokens per second)\n",
|
||||
"llama_print_timings: total time = 11096.23 ms\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'output_text': ' Hi there! There are three main approaches to task decomposition. One is using LLM with simple prompting like \"Steps for XYZ.\" or \"What are the subgoals for achieving XYZ?\" Another approach is by using task-specific instructions, such as \"Write a story outline\" for writing a novel. Finally, task decomposition can also be done with human inputs. Thanks for asking!'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 20,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -341,6 +645,8 @@
|
||||
"id": "821729cb",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## RetrievalQA\n",
|
||||
"\n",
|
||||
"For an even simpler flow, use `RetrievalQA`.\n",
|
||||
"\n",
|
||||
"This will use a QA default prompt (shown [here](https://github.com/hwchase17/langchain/blob/275b926cf745b5668d3ea30236635e20e7866442/langchain/chains/retrieval_qa/prompt.py#L4)) and will retrieve from the vectorDB.\n",
|
||||
@@ -350,7 +656,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"execution_count": 21,
|
||||
"id": "86c7a349",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
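The `RetrievalQA` cell source is elided from this hunk; a minimal sketch, assuming `RetrievalQA.from_chain_type` and the vector store built earlier. The actual cell may also supply a custom prompt via `chain_type_kwargs`.

```python
from langchain.chains import RetrievalQA

# Uses the default QA prompt and retrieves context from the Chroma vector store.
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),
)
qa_chain({"query": "What are the approaches to Task Decomposition?"})
```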
@@ -366,18 +672,45 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"execution_count": 22,
|
||||
"id": "112ca227",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Llama.generate: prefix-match hit\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
" \n",
|
||||
"The three approaches to Task decomposition are LLMs with simple prompting, task-specific instructions, or human inputs. Thanks for asking!"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"llama_print_timings: load time = 1191.88 ms\n",
|
||||
"llama_print_timings: sample time = 22.78 ms / 31 runs ( 0.73 ms per token, 1360.66 tokens per second)\n",
|
||||
"llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)\n",
|
||||
"llama_print_timings: eval time = 1320.23 ms / 31 runs ( 42.59 ms per token, 23.48 tokens per second)\n",
|
||||
"llama_print_timings: total time = 1387.70 ms\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'query': 'What are the approaches to Task Decomposition?',\n",
|
||||
" 'result': ' There are three main approaches to task decomposition: (1) using language model prompts with simple instructions like \"Steps for XYZ.\\\\n1.\", (2) using task-specific instructions, such as \"Write a story outline.\" for writing a novel, or (3) combining human inputs. However, challenges remain in long-term planning and adjusting plans when faced with unexpected errors, making LLMs less robust compared to humans who learn from trial and error during execution.'}"
|
||||
" 'result': ' \\nThe three approaches to Task decomposition are LLMs with simple prompting, task-specific instructions, or human inputs. Thanks for asking!'}"
|
||||
]
|
||||
},
|
||||
"execution_count": 14,
|
||||
"execution_count": 22,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
|