Open source LLM guide (#9266)

Guide for using open source LLMs locally.
{
"cells": [
{
"cell_type": "markdown",
"id": "b8982428",
"metadata": {},
"source": [
"# Private, local, open source LLMs\n",
"\n",
"## Use case\n",
"\n",
"The popularity of projects like [PrivateGPT](https://github.com/imartinez/privateGPT), [llama.cpp](https://github.com/ggerganov/llama.cpp), and [GPT4All](https://github.com/nomic-ai/gpt4all) underscore the demand to run LLMs locally (on your own device).\n",
"\n",
"This has at least two important benefits:\n",
"\n",
"1. `Privacy`: Your data is not sent to a third party, and it is not subject to the terms of service of a commercial service\n",
"2. `Cost`: There is no inference fee, which is important for token-intensive applications (e.g., [long-running simulations](https://twitter.com/RLanceMartin/status/1691097659262820352?s=20), summarization)\n",
"\n",
"## Overview\n",
"\n",
"Running an LLM locally requires a few things:\n",
"\n",
"1. `Open source LLM`: An open source LLM that can be freely modified and shared \n",
"2. `Inference`: Ability to run this LLM on your device w/ acceptable latency\n",
"\n",
"### Open Source LLMs\n",
"\n",
"Users can now gain access to a rapidly growing set of [open source LLMs](https://cameronrwolfe.substack.com/p/the-history-of-open-source-llms-better). \n",
"\n",
"These LLMs can be assessed across at least two dimentions (see figure):\n",
" \n",
"1. `Base model`: What is the base-model and how was it trained?\n",
"2. `Fine-tuning approach`: Was the base-model fine-tuned and, if so, what [set of instructions](https://cameronrwolfe.substack.com/p/beyond-llama-the-power-of-open-llms#%C2%A7alpaca-an-instruction-following-llama-model) was used?\n",
"\n",
"![Image description](/img/OSS_LLM_overview.png)\n",
"\n",
"The relative performance of these models can be assessed using several leaderboards, including:\n",
"\n",
"1. [LmSys](https://chat.lmsys.org/?arena)\n",
"2. [GPT4All](https://gpt4all.io/index.html)\n",
"3. [HuggingFace](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard)\n",
"\n",
"### Inference\n",
"\n",
"A few frameworks for this have emerged to support inference of open source LLMs on various devices:\n",
"\n",
"1. [`llama.cpp`](https://github.com/ggerganov/llama.cpp): C++ implementation of llama inference code with [weight optimization / quantization](https://finbarr.ca/how-is-llama-cpp-possible/)\n",
"2. [`gpt4all`](https://docs.gpt4all.io/index.html): Optimized C backend for inference\n",
"3. [`Ollama`](https://ollama.ai/): Bundles model weights and environment into an app that runs on device and serves the LLM \n",
"\n",
"In general, these frameworks will do a few things:\n",
"\n",
"1. `Quantization`: Reduce the memory footprint of the raw model weights\n",
"2. `Efficient implementation for inference`: Support inference on consumer hardware (e.g., CPU or laptop GPU)\n",
"\n",
"In particular, see [this excellent post](https://finbarr.ca/how-is-llama-cpp-possible/) on the importance of quantization.\n",
"\n",
"![Image description](/img/llama-memory-weights.png)\n",
"\n",
"With less precision, we radically decrease the memory needed to store the LLM in memory.\n",
"\n",
"In addition, we can see the importance of GPU memory bandwidth [sheet](https://docs.google.com/spreadsheets/d/1OehfHHNSn66BP2h3Bxp2NJTVX97icU0GmCXF6pK23H8/edit#gid=0)!\n",
"\n",
"A Mac M2 Max is 5-6x faster than a M1 for inference due to the larger GPU memory bandwidth.\n",
"\n",
"![Image description](/img/llama_t_put.png)\n",
"\n",
"## Quickstart\n",
"\n",
"[`Ollama`](https://ollama.ai/) is one way to easily run inference on macOS.\n",
" \n",
"The instructions [here](docs/integrations/llms/ollama) provide details, which we summarize:\n",
" \n",
"* [Download and run](https://ollama.ai/download) the app\n",
"* From command line, fetch a model from this [list of options](https://github.com/jmorganca/ollama): e.g., `ollama pull llama2`\n",
"* When the app is running, all models are automatically served on `localhost:11434`\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "86178adb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' The first man on the moon was Neil Armstrong, who landed on the moon on July 20, 1969 as part of the Apollo 11 mission. obviously.'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.llms import Ollama\n",
"llm = Ollama(model=\"llama2\")\n",
"llm(\"The first man on the moon was ...\")"
]
},
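{
"cell_type": "markdown",
"id": "f3c9a5d1",
"metadata": {},
"source": [
"Under the hood, the wrapper above calls the local Ollama server. As a sketch (endpoint and payload per the Ollama README; treat the details as assumptions), we can also hit it directly; the server streams newline-delimited JSON chunks:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7d2e4a9",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import requests\n",
"\n",
"# Sketch: call the local Ollama server directly (assumes `ollama pull llama2` has been run)\n",
"with requests.post(\n",
"    \"http://localhost:11434/api/generate\",\n",
"    json={\"model\": \"llama2\", \"prompt\": \"The first man on the moon was ...\"},\n",
"    stream=True,\n",
") as resp:\n",
"    for line in resp.iter_lines():\n",
"        if line:\n",
"            print(json.loads(line).get(\"response\", \"\"), end=\"\")"
]
},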
{
"cell_type": "markdown",
"id": "343ab645",
"metadata": {},
"source": [
"Stream tokens as they are being generated."
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "9cd83603",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" The first man to walk on the moon was Neil Armstrong, an American astronaut who was part of the Apollo 11 mission in 1969. февруари 20, 1969, Armstrong stepped out of the lunar module Eagle and onto the moon's surface, famously declaring \"That's one small step for man, one giant leap for mankind\" as he took his first steps. He was followed by fellow astronaut Edwin \"Buzz\" Aldrin, who also walked on the moon during the mission."
]
},
{
"data": {
"text/plain": [
"' The first man to walk on the moon was Neil Armstrong, an American astronaut who was part of the Apollo 11 mission in 1969. февруари 20, 1969, Armstrong stepped out of the lunar module Eagle and onto the moon\\'s surface, famously declaring \"That\\'s one small step for man, one giant leap for mankind\" as he took his first steps. He was followed by fellow astronaut Edwin \"Buzz\" Aldrin, who also walked on the moon during the mission.'"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.callbacks.manager import CallbackManager\n",
"from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler \n",
"llm = Ollama(model=\"llama2\", \n",
" callback_manager = CallbackManager([StreamingStdOutCallbackHandler()]))\n",
"llm(\"The first man on the moon was ...\")"
]
},
{
"cell_type": "markdown",
"id": "5cb27414",
"metadata": {},
"source": [
"## Environment\n",
"\n",
"Inference speed is a chllenge when running models locally (see above).\n",
"\n",
"To minimize latency, it is desiable to run models locally on GPU, which ships with many consumer laptops [e.g., Apple devices](https://www.apple.com/newsroom/2022/06/apple-unveils-m2-with-breakthrough-performance-and-capabilities/).\n",
"\n",
"And even with GPU, the available GPU memory bandwidth (as noted above) is important.\n",
"\n",
"### Running Apple silicon GPU\n",
"\n",
"`Ollama` will automatically utilize the GPU on Apple devices.\n",
" \n",
"Other frameworks require the user to set up the environment to utilize the Apple GPU.\n",
"\n",
"For example, `llama.cpp` python bindings can be configured to use the GPU via [Metal](https://developer.apple.com/metal/).\n",
"\n",
"Metal is a graphics and compute API created by Apple providing near-direct access to the GPU. \n",
"\n",
"See the [`llama.cpp`](docs/integrations/llms/llamacpp) setup [here](https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md) to enable this.\n",
"\n",
"In particular, ensure that conda is using the correct virtual enviorment that you created (`miniforge3`).\n",
"\n",
"E.g., for me:\n",
"\n",
"```\n",
"conda activate /Users/rlm/miniforge3/envs/llama\n",
"```\n",
"\n",
"With the above confirmed, then:\n",
"\n",
"```\n",
"CMAKE_ARGS=\"-DLLAMA_METAL=on\" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir\n",
"```"
]
},
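{
"cell_type": "markdown",
"id": "a9c1e2f4",
"metadata": {},
"source": [
"As a quick sanity check that the bindings resolved from the intended environment (paths here are illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4d8b6e2",
"metadata": {},
"outputs": [],
"source": [
"import llama_cpp\n",
"\n",
"# The path should point into the conda env built with LLAMA_METAL=on,\n",
"# e.g., .../miniforge3/envs/llama/lib/python3.9/site-packages/llama_cpp/...\n",
"print(llama_cpp.__file__)"
]
},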
{
"cell_type": "markdown",
"id": "c382e79a",
"metadata": {},
"source": [
"## LLMs\n",
"\n",
"There are various ways to gain access to quantized model weights.\n",
"\n",
"1. [`HuggingFace`](https://huggingface.co/TheBloke) - Many quantized model are available for download and can be run with framework such as [`llama.cpp`](https://github.com/ggerganov/llama.cpp)\n",
"2. [`gpt4all`](https://gpt4all.io/index.html) - The model explorer offers a leaderboard of metrics and associated quantized models available for download \n",
"3. [`Ollama`](https://github.com/jmorganca/ollama) - Several models can be accessed directly via `pull`\n",
"\n",
"### Ollama\n",
"\n",
"With [Ollama](docs/integrations/llms/ollama), fetch a model via `ollama pull <model family>:<tag>`:\n",
"\n",
"* E.g., for Llama-7b: `ollama pull llama2` will download the most basic version of the model (e.g., smallest # parameters and 4 bit quantization)\n",
"* We can also specify a particular version from the [model list](https://github.com/jmorganca/ollama), e.g., `ollama pull llama2:13b`\n",
"* See the full set of parameters on the [API reference page](https://api.python.langchain.com/en/latest/llms/langchain.llms.ollama.Ollama.html)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "8ecd2f78",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' Sure! Here\\'s the answer, broken down step by step:\\n\\nThe first man on the moon was... Neil Armstrong.\\n\\nHere\\'s how I arrived at that answer:\\n\\n1. The first manned mission to land on the moon was Apollo 11.\\n2. The mission included three astronauts: Neil Armstrong, Edwin \"Buzz\" Aldrin, and Michael Collins.\\n3. Neil Armstrong was the mission commander and the first person to set foot on the moon.\\n4. On July 20, 1969, Armstrong stepped out of the lunar module Eagle and onto the moon\\'s surface, famously declaring \"That\\'s one small step for man, one giant leap for mankind.\"\\n\\nSo, the first man on the moon was Neil Armstrong!'"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.llms import Ollama\n",
"llm = Ollama(model=\"llama2:13b\")\n",
"llm(\"The first man on the moon was ... think step by step\")"
]
},
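{
"cell_type": "markdown",
"id": "d2e6f8a1",
"metadata": {},
"source": [
"The API reference above lists additional parameters. As a sketch, a few can be set at construction time (the values here are illustrative, not recommendations):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e8f1a3c5",
"metadata": {},
"outputs": [],
"source": [
"from langchain.llms import Ollama\n",
"\n",
"llm = Ollama(\n",
"    model=\"llama2:13b\",\n",
"    temperature=0.1,  # lower temperature for more deterministic completions\n",
"    num_ctx=2048,  # token context window\n",
")"
]
},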
{
"cell_type": "markdown",
"id": "07c8c0d1",
"metadata": {},
"source": [
"### Llama.cpp\n",
"\n",
"Llama.cpp is compatible with a [broad set of models](https://github.com/ggerganov/llama.cpp).\n",
"\n",
"For example, below we run inference on `llama2-13b` with 4 bit quantization downloaded from [HuggingFace](https://huggingface.co/TheBloke/Llama-2-13B-GGML/tree/main).\n",
"\n",
"As noted above, see the [API reference](https://api.python.langchain.com/en/latest/llms/langchain.llms.llamacpp.LlamaCpp.html?highlight=llamacpp#langchain.llms.llamacpp.LlamaCpp) for the full set of parameters. \n",
"\n",
"From the [llama.cpp docs](https://python.langchain.com/docs/integrations/llms/llamacpp), a few are worth commenting on:\n",
"\n",
"`n_gpu_layers`: number of layers to be loaded into GPU memory\n",
"\n",
"* Value: 1\n",
"* Meaning: Only one layer of the model will be loaded into GPU memory (1 is often sufficient).\n",
"\n",
"`n_batch`: number of tokens the model should process in parallel \n",
"* Value: n_batch\n",
"* Meaning: It's recommended to choose a value between 1 and n_ctx (which in this case is set to 2048)\n",
"\n",
"`n_ctx`: Token context window .\n",
"* Value: 2048\n",
"* Meaning: The model will consider a window of 2048 tokens at a time\n",
"\n",
"`f16_kv`: whether the model should use half-precision for the key/value cache\n",
"* Value: True\n",
"* Meaning: The model will use half-precision, which can be more memory efficient; Metal only support True."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5eba38dc",
"metadata": {},
"outputs": [],
"source": [
"pip install llama-cpp-python"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "9d5f94b5",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"objc[10142]: Class GGMLMetalClass is implemented in both /Users/rlm/miniforge3/envs/llama/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libreplit-mainline-metal.dylib (0x2a0c4c208) and /Users/rlm/miniforge3/envs/llama/lib/python3.9/site-packages/llama_cpp/libllama.dylib (0x2c28bc208). One of the two will be used. Which one is undefined.\n",
"llama.cpp: loading model from /Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin\n",
"llama_model_load_internal: format = ggjt v3 (latest)\n",
"llama_model_load_internal: n_vocab = 32000\n",
"llama_model_load_internal: n_ctx = 2048\n",
"llama_model_load_internal: n_embd = 5120\n",
"llama_model_load_internal: n_mult = 256\n",
"llama_model_load_internal: n_head = 40\n",
"llama_model_load_internal: n_layer = 40\n",
"llama_model_load_internal: n_rot = 128\n",
"llama_model_load_internal: freq_base = 10000.0\n",
"llama_model_load_internal: freq_scale = 1\n",
"llama_model_load_internal: ftype = 2 (mostly Q4_0)\n",
"llama_model_load_internal: n_ff = 13824\n",
"llama_model_load_internal: model size = 13B\n",
"llama_model_load_internal: ggml ctx size = 0.09 MB\n",
"llama_model_load_internal: mem required = 8953.71 MB (+ 1608.00 MB per state)\n",
"llama_new_context_with_model: kv self size = 1600.00 MB\n",
"ggml_metal_init: allocating\n",
"ggml_metal_init: using MPS\n",
"ggml_metal_init: loading '/Users/rlm/miniforge3/envs/llama/lib/python3.9/site-packages/llama_cpp/ggml-metal.metal'\n",
"ggml_metal_init: loaded kernel_add 0x47774af60\n",
"ggml_metal_init: loaded kernel_mul 0x47774bc00\n",
"ggml_metal_init: loaded kernel_mul_row 0x47774c230\n",
"ggml_metal_init: loaded kernel_scale 0x47774c890\n",
"ggml_metal_init: loaded kernel_silu 0x47774cef0\n",
"ggml_metal_init: loaded kernel_relu 0x10e33e500\n",
"ggml_metal_init: loaded kernel_gelu 0x47774b2f0\n",
"ggml_metal_init: loaded kernel_soft_max 0x47771a580\n",
"ggml_metal_init: loaded kernel_diag_mask_inf 0x47774dab0\n",
"ggml_metal_init: loaded kernel_get_rows_f16 0x47774e110\n",
"ggml_metal_init: loaded kernel_get_rows_q4_0 0x47774e7d0\n",
"ggml_metal_init: loaded kernel_get_rows_q4_1 0x13efd7170\n",
"ggml_metal_init: loaded kernel_get_rows_q2_K 0x13efd73d0\n",
"ggml_metal_init: loaded kernel_get_rows_q3_K 0x13efd7630\n",
"ggml_metal_init: loaded kernel_get_rows_q4_K 0x13efd7890\n",
"ggml_metal_init: loaded kernel_get_rows_q5_K 0x4744c9740\n",
"ggml_metal_init: loaded kernel_get_rows_q6_K 0x4744ca6b0\n",
"ggml_metal_init: loaded kernel_rms_norm 0x4744cb250\n",
"ggml_metal_init: loaded kernel_norm 0x4744cb970\n",
"ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x10e33f700\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x10e33fcd0\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x4744cc2d0\n",
"ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x4744cc6f0\n",
"ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x4744cd6b0\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x4744cde20\n",
"ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x10e33ff30\n",
"ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x10e340190\n",
"ggml_metal_init: loaded kernel_rope 0x10e3403f0\n",
"ggml_metal_init: loaded kernel_alibi_f32 0x10e340de0\n",
"ggml_metal_init: loaded kernel_cpy_f32_f16 0x10e3416d0\n",
"ggml_metal_init: loaded kernel_cpy_f32_f32 0x10e342080\n",
"ggml_metal_init: loaded kernel_cpy_f16_f16 0x10e342ca0\n",
"ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB\n",
"ggml_metal_init: hasUnifiedMemory = true\n",
"ggml_metal_init: maxTransferRate = built-in GPU\n",
"ggml_metal_add_buffer: allocated 'data ' buffer, size = 6984.06 MB, ( 6986.19 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1032.00 MB, ( 8018.19 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1602.00 MB, ( 9620.19 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 426.00 MB, (10046.19 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB, (10558.19 / 21845.34)\n",
"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | \n"
]
}
],
"source": [
"from langchain.llms import LlamaCpp\n",
"llm = LlamaCpp(\n",
" model_path=\"/Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin\",\n",
" n_gpu_layers=1,\n",
" n_batch=512,\n",
" n_ctx=2048,\n",
" f16_kv=True, \n",
" callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),\n",
" verbose=True,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "f56f5168",
"metadata": {},
"source": [
"The console log will show the the below to indicate Metal was enabled properly from steps above:\n",
"```\n",
"ggml_metal_init: allocating\n",
"ggml_metal_init: using MPS\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "7890a077",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Llama.generate: prefix-match hit\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" and use logical reasoning to figure out who the first man on the moon was.\n",
"\n",
"Here are some clues:\n",
"\n",
"1. The first man on the moon was an American.\n",
"2. He was part of the Apollo 11 mission.\n",
"3. He stepped out of the lunar module and became the first person to set foot on the moon's surface.\n",
"4. His last name is Armstrong.\n",
"\n",
"Now, let's use our reasoning skills to figure out who the first man on the moon was. Based on clue #1, we know that the first man on the moon was an American. Clue #2 tells us that he was part of the Apollo 11 mission. Clue #3 reveals that he was the first person to set foot on the moon's surface. And finally, clue #4 gives us his last name: Armstrong.\n",
"Therefore, the first man on the moon was Neil Armstrong!"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 9623.21 ms\n",
"llama_print_timings: sample time = 143.77 ms / 203 runs ( 0.71 ms per token, 1412.01 tokens per second)\n",
"llama_print_timings: prompt eval time = 485.94 ms / 7 tokens ( 69.42 ms per token, 14.40 tokens per second)\n",
"llama_print_timings: eval time = 6385.16 ms / 202 runs ( 31.61 ms per token, 31.64 tokens per second)\n",
"llama_print_timings: total time = 7279.28 ms\n"
]
},
{
"data": {
"text/plain": [
"\" and use logical reasoning to figure out who the first man on the moon was.\\n\\nHere are some clues:\\n\\n1. The first man on the moon was an American.\\n2. He was part of the Apollo 11 mission.\\n3. He stepped out of the lunar module and became the first person to set foot on the moon's surface.\\n4. His last name is Armstrong.\\n\\nNow, let's use our reasoning skills to figure out who the first man on the moon was. Based on clue #1, we know that the first man on the moon was an American. Clue #2 tells us that he was part of the Apollo 11 mission. Clue #3 reveals that he was the first person to set foot on the moon's surface. And finally, clue #4 gives us his last name: Armstrong.\\nTherefore, the first man on the moon was Neil Armstrong!\""
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llm(\"The first man on the moon was ... Let's think step by step\")"
]
},
{
"cell_type": "markdown",
"id": "831ddf7c",
"metadata": {},
"source": [
"### GPT4All\n",
"\n",
"We can use model weights downloaded from [GPT4All](https://python.langchain.com/docs/integrations/llms/gpt4all) model explorer.\n",
"\n",
"Similar to what is shown above, we can run inference and use [the API reference](https://api.python.langchain.com/en/latest/llms/langchain.llms.gpt4all.GPT4All.html?highlight=gpt4all#langchain.llms.gpt4all.GPT4All) to set parameters of interest."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e27baf6e",
"metadata": {},
"outputs": [],
"source": [
"pip install gpt4all"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "b55a2147",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found model file at /Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin\n",
"llama_new_context_with_model: max tensor size = 87.89 MB\n",
"llama_new_context_with_model: max tensor size = 87.89 MB\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"llama.cpp: using Metal\n",
"llama.cpp: loading model from /Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin\n",
"llama_model_load_internal: format = ggjt v3 (latest)\n",
"llama_model_load_internal: n_vocab = 32001\n",
"llama_model_load_internal: n_ctx = 2048\n",
"llama_model_load_internal: n_embd = 5120\n",
"llama_model_load_internal: n_mult = 256\n",
"llama_model_load_internal: n_head = 40\n",
"llama_model_load_internal: n_layer = 40\n",
"llama_model_load_internal: n_rot = 128\n",
"llama_model_load_internal: ftype = 2 (mostly Q4_0)\n",
"llama_model_load_internal: n_ff = 13824\n",
"llama_model_load_internal: n_parts = 1\n",
"llama_model_load_internal: model size = 13B\n",
"llama_model_load_internal: ggml ctx size = 0.09 MB\n",
"llama_model_load_internal: mem required = 9031.71 MB (+ 1608.00 MB per state)\n",
"llama_new_context_with_model: kv self size = 1600.00 MB\n",
"ggml_metal_init: allocating\n",
"ggml_metal_init: using MPS\n",
"ggml_metal_init: loading '/Users/rlm/miniforge3/envs/llama/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/ggml-metal.metal'\n",
"ggml_metal_init: loaded kernel_add 0x37944d850\n",
"ggml_metal_init: loaded kernel_mul 0x37944f350\n",
"ggml_metal_init: loaded kernel_mul_row 0x37944fdd0\n",
"ggml_metal_init: loaded kernel_scale 0x3794505a0\n",
"ggml_metal_init: loaded kernel_silu 0x379450800\n",
"ggml_metal_init: loaded kernel_relu 0x379450a60\n",
"ggml_metal_init: loaded kernel_gelu 0x379450cc0\n",
"ggml_metal_init: loaded kernel_soft_max 0x379450ff0\n",
"ggml_metal_init: loaded kernel_diag_mask_inf 0x379451250\n",
"ggml_metal_init: loaded kernel_get_rows_f16 0x3794514b0\n",
"ggml_metal_init: loaded kernel_get_rows_q4_0 0x379451710\n",
"ggml_metal_init: loaded kernel_get_rows_q4_1 0x379451970\n",
"ggml_metal_init: loaded kernel_get_rows_q2_k 0x379451bd0\n",
"ggml_metal_init: loaded kernel_get_rows_q3_k 0x379451e30\n",
"ggml_metal_init: loaded kernel_get_rows_q4_k 0x379452090\n",
"ggml_metal_init: loaded kernel_get_rows_q5_k 0x3794522f0\n",
"ggml_metal_init: loaded kernel_get_rows_q6_k 0x379452550\n",
"ggml_metal_init: loaded kernel_rms_norm 0x3794527b0\n",
"ggml_metal_init: loaded kernel_norm 0x379452a10\n",
"ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x379452c70\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x379452ed0\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x379453130\n",
"ggml_metal_init: loaded kernel_mul_mat_q2_k_f32 0x379453390\n",
"ggml_metal_init: loaded kernel_mul_mat_q3_k_f32 0x3794535f0\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_k_f32 0x379453850\n",
"ggml_metal_init: loaded kernel_mul_mat_q5_k_f32 0x379453ab0\n",
"ggml_metal_init: loaded kernel_mul_mat_q6_k_f32 0x379453d10\n",
"ggml_metal_init: loaded kernel_rope 0x379453f70\n",
"ggml_metal_init: loaded kernel_alibi_f32 0x3794541d0\n",
"ggml_metal_init: loaded kernel_cpy_f32_f16 0x379454430\n",
"ggml_metal_init: loaded kernel_cpy_f32_f32 0x379454690\n",
"ggml_metal_init: loaded kernel_cpy_f16_f16 0x3794548f0\n",
"ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB\n",
"ggml_metal_init: hasUnifiedMemory = true\n",
"ggml_metal_init: maxTransferRate = built-in GPU\n",
"ggml_metal_add_buffer: allocated 'data ' buffer, size = 6984.06 MB, (17542.94 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1024.00 MB, (18566.94 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1602.00 MB, (20168.94 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 512.00 MB, (20680.94 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB, (21192.94 / 21845.34)\n",
"ggml_metal_free: deallocating\n"
]
}
],
"source": [
"from langchain.llms import GPT4All\n",
"llm = GPT4All(model=\"/Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin\")"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "e3d4526f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\".\\n1) The United States decides to send a manned mission to the moon.2) They choose their best astronauts and train them for this specific mission.3) They build a spacecraft that can take humans to the moon, called the Lunar Module (LM).4) They also create a larger spacecraft, called the Saturn V rocket, which will launch both the LM and the Command Service Module (CSM), which will carry the astronauts into orbit.5) The mission is planned down to the smallest detail: from the trajectory of the rockets to the exact movements of the astronauts during their moon landing.6) On July 16, 1969, the Saturn V rocket launches from Kennedy Space Center in Florida, carrying the Apollo 11 mission crew into space.7) After one and a half orbits around the Earth, the LM separates from the CSM and begins its descent to the moon's surface.8) On July 20, 1969, at 2:56 pm EDT (GMT-4), Neil Armstrong becomes the first man on the moon. He speaks these\""
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llm(\"The first man on the moon was ... Let's think step by step\")"
]
},
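{
"cell_type": "markdown",
"id": "f1b3d5e7",
"metadata": {},
"source": [
"As with the other wrappers, decoding parameters can be set at construction time. A sketch with illustrative values (parameter names per the API reference above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3c5e7f9",
"metadata": {},
"outputs": [],
"source": [
"llm = GPT4All(\n",
"    model=\"/Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin\",\n",
"    n_predict=256,  # max tokens to generate\n",
"    temp=0.3,  # sampling temperature\n",
")"
]
},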
{
"cell_type": "markdown",
"id": "6b84e543",
"metadata": {},
"source": [
"## Prompts\n",
"\n",
"Some LLMs will benefit from specific prompts.\n",
"\n",
"For example, llama2 can use [special tokens](https://twitter.com/RLanceMartin/status/1681879318493003776?s=20).\n",
"\n",
"We can use `ConditionalPromptSelector` to set prompt based on the model type."
]
},
{
"cell_type": "code",
"execution_count": 57,
"id": "d082b10a",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"llama.cpp: loading model from /Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin\n",
"llama_model_load_internal: format = ggjt v3 (latest)\n",
"llama_model_load_internal: n_vocab = 32000\n",
"llama_model_load_internal: n_ctx = 2048\n",
"llama_model_load_internal: n_embd = 5120\n",
"llama_model_load_internal: n_mult = 256\n",
"llama_model_load_internal: n_head = 40\n",
"llama_model_load_internal: n_layer = 40\n",
"llama_model_load_internal: n_rot = 128\n",
"llama_model_load_internal: freq_base = 10000.0\n",
"llama_model_load_internal: freq_scale = 1\n",
"llama_model_load_internal: ftype = 2 (mostly Q4_0)\n",
"llama_model_load_internal: n_ff = 13824\n",
"llama_model_load_internal: model size = 13B\n",
"llama_model_load_internal: ggml ctx size = 0.09 MB\n",
"llama_model_load_internal: mem required = 8953.71 MB (+ 1608.00 MB per state)\n",
"llama_new_context_with_model: kv self size = 1600.00 MB\n",
"ggml_metal_init: allocating\n",
"ggml_metal_init: using MPS\n",
"ggml_metal_init: loading '/Users/rlm/miniforge3/envs/llama/lib/python3.9/site-packages/llama_cpp/ggml-metal.metal'\n",
"ggml_metal_init: loaded kernel_add 0x4744d09d0\n",
"ggml_metal_init: loaded kernel_mul 0x3781cb3d0\n",
"ggml_metal_init: loaded kernel_mul_row 0x37813bb60\n",
"ggml_metal_init: loaded kernel_scale 0x474481080\n",
"ggml_metal_init: loaded kernel_silu 0x4744d29f0\n",
"ggml_metal_init: loaded kernel_relu 0x3781254c0\n",
"ggml_metal_init: loaded kernel_gelu 0x47447f280\n",
"ggml_metal_init: loaded kernel_soft_max 0x4744cf470\n",
"ggml_metal_init: loaded kernel_diag_mask_inf 0x4744cf6d0\n",
"ggml_metal_init: loaded kernel_get_rows_f16 0x4744cf930\n",
"ggml_metal_init: loaded kernel_get_rows_q4_0 0x4744cfb90\n",
"ggml_metal_init: loaded kernel_get_rows_q4_1 0x4744cfdf0\n",
"ggml_metal_init: loaded kernel_get_rows_q2_K 0x4744d0050\n",
"ggml_metal_init: loaded kernel_get_rows_q3_K 0x4744ce980\n",
"ggml_metal_init: loaded kernel_get_rows_q4_K 0x4744cebe0\n",
"ggml_metal_init: loaded kernel_get_rows_q5_K 0x4744cee40\n",
"ggml_metal_init: loaded kernel_get_rows_q6_K 0x4744cf0a0\n",
"ggml_metal_init: loaded kernel_rms_norm 0x474482450\n",
"ggml_metal_init: loaded kernel_norm 0x4744826b0\n",
"ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x474482910\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x474482b70\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x474482dd0\n",
"ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x474483030\n",
"ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x474483290\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x4744834f0\n",
"ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x474483750\n",
"ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x4744839b0\n",
"ggml_metal_init: loaded kernel_rope 0x474483c10\n",
"ggml_metal_init: loaded kernel_alibi_f32 0x474483e70\n",
"ggml_metal_init: loaded kernel_cpy_f32_f16 0x4744840d0\n",
"ggml_metal_init: loaded kernel_cpy_f32_f32 0x474484330\n",
"ggml_metal_init: loaded kernel_cpy_f16_f16 0x474484590\n",
"ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB\n",
"ggml_metal_init: hasUnifiedMemory = true\n",
"ggml_metal_init: maxTransferRate = built-in GPU\n",
"ggml_metal_add_buffer: allocated 'data ' buffer, size = 6984.06 MB, ( 6986.94 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1032.00 MB, ( 8018.94 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1602.00 MB, ( 9620.94 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 426.00 MB, (10046.94 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB, (10558.94 / 21845.34)\n",
"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | \n"
]
}
],
"source": [
"# Set our LLM\n",
"llm = LlamaCpp(\n",
" model_path=\"/Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin\",\n",
" n_gpu_layers=1,\n",
" n_batch=512,\n",
" n_ctx=2048,\n",
" f16_kv=True, \n",
" callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),\n",
" verbose=True,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "66656084",
"metadata": {},
"source": [
"Set the associated prompt."
]
},
{
"cell_type": "code",
"execution_count": 58,
"id": "8555f5bf",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"PromptTemplate(input_variables=['question'], output_parser=None, partial_variables={}, template='<<SYS>> \\n You are an assistant tasked with improving Google search results. \\n <</SYS>> \\n\\n [INST] Generate THREE Google search queries that are similar to this question. The output should be a numbered list of questions and each should have a question mark at the end: \\n\\n {question} [/INST]', template_format='f-string', validate_template=True)"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain import PromptTemplate, LLMChain\n",
"from langchain.chains.prompt_selector import ConditionalPromptSelector\n",
"\n",
"DEFAULT_LLAMA_SEARCH_PROMPT = PromptTemplate(\n",
" input_variables=[\"question\"],\n",
" template=\"\"\"<<SYS>> \\n You are an assistant tasked with improving Google search \\\n",
"results. \\n <</SYS>> \\n\\n [INST] Generate THREE Google search queries that \\\n",
"are similar to this question. The output should be a numbered list of questions \\\n",
"and each should have a question mark at the end: \\n\\n {question} [/INST]\"\"\",\n",
")\n",
"\n",
"DEFAULT_SEARCH_PROMPT = PromptTemplate(\n",
" input_variables=[\"question\"],\n",
" template=\"\"\"You are an assistant tasked with improving Google search \\\n",
"results. Generate THREE Google search queries that are similar to \\\n",
"this question. The output should be a numbered list of questions and each \\\n",
"should have a question mark at the end: {question}\"\"\",\n",
")\n",
"\n",
"QUESTION_PROMPT_SELECTOR = ConditionalPromptSelector(\n",
" default_prompt=DEFAULT_SEARCH_PROMPT,\n",
" conditionals=[\n",
" (lambda llm: isinstance(llm, LlamaCpp), DEFAULT_LLAMA_SEARCH_PROMPT)\n",
" ],\n",
" )\n",
"\n",
"prompt = QUESTION_PROMPT_SELECTOR.get_prompt(llm)\n",
"prompt"
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "d0aedfd2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Sure! Here are three similar search queries with a question mark at the end:\n",
"\n",
"1. Which NBA team did LeBron James lead to a championship in the year he was drafted?\n",
"2. Who won the Grammy Awards for Best New Artist and Best Female Pop Vocal Performance in the same year that Lady Gaga was born?\n",
"3. What MLB team did Babe Ruth play for when he hit 60 home runs in a single season?"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 14943.19 ms\n",
"llama_print_timings: sample time = 72.93 ms / 101 runs ( 0.72 ms per token, 1384.87 tokens per second)\n",
"llama_print_timings: prompt eval time = 14942.95 ms / 93 tokens ( 160.68 ms per token, 6.22 tokens per second)\n",
"llama_print_timings: eval time = 3430.85 ms / 100 runs ( 34.31 ms per token, 29.15 tokens per second)\n",
"llama_print_timings: total time = 18578.26 ms\n"
]
},
{
"data": {
"text/plain": [
"' Sure! Here are three similar search queries with a question mark at the end:\\n\\n1. Which NBA team did LeBron James lead to a championship in the year he was drafted?\\n2. Who won the Grammy Awards for Best New Artist and Best Female Pop Vocal Performance in the same year that Lady Gaga was born?\\n3. What MLB team did Babe Ruth play for when he hit 60 home runs in a single season?'"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Chain\n",
"llm_chain = LLMChain(prompt=prompt,llm=llm)\n",
"question = \"What NFL team won the Super Bowl in the year that Justin Bieber was born?\"\n",
"llm_chain.run({\"question\":question})"
]
},
{
"cell_type": "markdown",
"id": "6ba66260",
"metadata": {},
"source": [
"## Use cases\n",
"\n",
"Given an `llm` created from one of the models above, you can use it for [many use cases](docs/use_cases).\n",
"\n",
"For example, here is a guide to [RAG](docs/use_cases/question_answering/how_to/local_retrieval_qa) with local LLMs.\n",
"\n",
"In general, use cases for local model can be driven by at least two factors:\n",
"\n",
"* `Privacy`: private data (e.g., journals, etc) that a user does not want to share \n",
"* `Cost`: text preprocessing (extraction/tagging), summarization, and agent simulations are token-use-intensive tasks\n",
"\n",
"There are a few approach to support specific use-cases: \n",
"\n",
"* Fine-tuning (e.g., [gpt-llm-trainer](https://github.com/mshumer/gpt-llm-trainer), [Anyscale](https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehensive-case-study-for-tailoring-models-to-unique-applications)) \n",
"* [Function-calling](https://github.com/MeetKai/functionary/tree/main) for use-cases like extraction or tagging\n",
"\n"
]
}
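,
{
"cell_type": "markdown",
"id": "b5d7f9a2",
"metadata": {},
"source": [
"For instance, a minimal RAG sketch with a local LLM (assumes the `llm` created above and `pip install chromadb gpt4all`; the indexed text is illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c7e9a1b3",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains import RetrievalQA\n",
"from langchain.embeddings import GPT4AllEmbeddings\n",
"from langchain.vectorstores import Chroma\n",
"\n",
"# Index a toy document with local embeddings, then answer over it with the local LLM\n",
"vectorstore = Chroma.from_texts(\n",
"    [\"Ollama serves all pulled models on localhost:11434.\"],\n",
"    embedding=GPT4AllEmbeddings(),\n",
")\n",
"qa_chain = RetrievalQA.from_chain_type(llm, retriever=vectorstore.as_retriever())\n",
"qa_chain.run(\"What port does Ollama serve models on?\")"
]
}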
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}