Add Metal support to llama.cpp doc (#7092)

- Description: Add Metal support to llama.cpp doc - Issue: #7091 - Dependencies: N/A - Twitter handle: gene_wu
2024-11-06 03:20:49 +00:00 · 2023-07-04 03:35:39 +08:00 · 2023-07-04 03:35:39 +08:00 · e49abd1277
commit e49abd1277
parent fad2c7e5e0
1 changed files with 95 additions and 1 deletions
--- a/docs/extras/modules/model_io/models/llms/integrations/llamacpp.ipynb
+++ b/docs/extras/modules/model_io/models/llms/integrations/llamacpp.ipynb
@ -23,6 +23,7 @@
    "There is a banch of options how to install the llama-cpp package: \n",
    "- only CPU usage\n",
    "- CPU + GPU (using one of many BLAS backends)\n",
+    "- Metal GPU (MacOS with Apple Silicon Chip) \n",
    "\n",
    "### CPU only installation"
   ]
@ -73,7 +74,45 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "!CMAKE_ARGS=\"-DLLAMA_CUBLAS=on\" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python"
+    "!CMAKE_ARGS=\"-DLLAMA_CUBLAS=on\" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Installation with Metal\n",
+    "\n",
+    "`lama.cpp` supports Apple silicon first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks. Use the `FORCE_CMAKE=1` environment variable to force the use of cmake and install the pip package for the Metal support ([source](https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md)).\n",
+    "\n",
+    "Example installation with Metal Support:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!CMAKE_ARGS=\"-DLLAMA_METAL=on\" FORCE_CMAKE=1 pip install llama-cpp-python"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**IMPORTANT**: If you have already installed a cpu only version of the package, you need to reinstall it from scratch: consider the following command: "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!CMAKE_ARGS=\"-DLLAMA_METAL=on\" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir"
   ]
  },
  {
@ -325,6 +364,61 @@
   "metadata": {},
   "outputs": [],
   "source": []
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Metal\n",
+    "\n",
+    "If the installation with Metal was correct, you will see an `NEON = 1` indicator in model properties.\n",
+    "\n",
+    "Two of the most important parameters for use with GPU are:\n",
+    "\n",
+    "- `n_gpu_layers` - determines how many layers of the model are offloaded to your Metal GPU, in the most case, set it to `1` is enough for Metal\n",
+    "- `n_batch` - how many tokens are processed in parallel, default is 8, set to bigger number.\n",
+    "- `f16_kv` - for some reason, Metal only support `True`, otherwise you will get error such as `Asserting on type 0\n",
+    "GGML_ASSERT: .../ggml-metal.m:706: false && \"not implemented\"`\n",
+    "\n",
+    "Setting these parameters correctly will dramatically improve the evaluation speed (see [wrapper code](https://github.com/mmagnesium/langchain/blob/master/langchain/llms/llamacpp.py) for more details)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "n_gpu_layers = 1  # Metal set to 1 is enough.\n",
+    "n_batch = 512  # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.\n",
+    "\n",
+    "# Make sure the model path is correct for your system!\n",
+    "llm = LlamaCpp(\n",
+    "    model_path=\"./ggml-model-q4_0.bin\",\n",
+    "    n_gpu_layers=n_gpu_layers,\n",
+    "    n_batch=n_batch,\n",
+    "    f16_kv=True, # MUST set to True, otherwise you will run into problem after a couple of calls\n",
+    "    callback_manager=callback_manager,\n",
+    "    verbose=True,\n",
+    ")"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The rest are almost same as GPU, the console log will show the following log to indicate the Metal was enable properly.\n",
+    "\n",
+    "```\n",
+    "ggml_metal_init: allocating\n",
+    "ggml_metal_init: using MPS\n",
+    "...\n",
+    "```\n",
+    "\n",
+    "You also could check the `Activity Monitor` by watching the % GPU of the process, the % CPU will drop dramatically after turn on `n_gpu_layers=1`. Also for the first time call LLM, the performance might be slow due to the model compilation in Metal GPU."
+   ]
  }
 ],
 "metadata": {