From e49abd12776a2549458a63e03e9b25ced602135d Mon Sep 17 00:00:00 2001
From: genewoo
Date: Tue, 4 Jul 2023 03:35:39 +0800
Subject: [PATCH] Add Metal support to llama.cpp doc (#7092)

- Description: Add Metal support to llama.cpp doc
- Issue: #7091
- Dependencies: N/A
- Twitter handle: gene_wu
---
 .../models/llms/integrations/llamacpp.ipynb | 96 ++++++++++++++++++-
 1 file changed, 95 insertions(+), 1 deletion(-)

diff --git a/docs/extras/modules/model_io/models/llms/integrations/llamacpp.ipynb b/docs/extras/modules/model_io/models/llms/integrations/llamacpp.ipynb
index 1c9c9a1c5f..b8a1e86ab2 100644
--- a/docs/extras/modules/model_io/models/llms/integrations/llamacpp.ipynb
+++ b/docs/extras/modules/model_io/models/llms/integrations/llamacpp.ipynb
@@ -23,6 +23,7 @@
     "There are several options for installing the llama-cpp package: \n",
     "- only CPU usage\n",
     "- CPU + GPU (using one of many BLAS backends)\n",
+    "- Metal GPU (macOS with Apple Silicon chip) \n",
     "\n",
     "### CPU only installation"
@@ -73,7 +74,45 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!CMAKE_ARGS=\"-DLLAMA_CUBLAS=on\" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python"
+    "!CMAKE_ARGS=\"-DLLAMA_CUBLAS=on\" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Installation with Metal\n",
+    "\n",
+    "`llama.cpp` supports Apple silicon as a first-class citizen, optimized via the ARM NEON, Accelerate, and Metal frameworks. Use the `FORCE_CMAKE=1` environment variable to force the use of cmake and install the pip package with Metal support ([source](https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md)).\n",
+    "\n",
+    "Example installation with Metal support:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!CMAKE_ARGS=\"-DLLAMA_METAL=on\" FORCE_CMAKE=1 pip install llama-cpp-python"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**IMPORTANT**: If you have already installed the CPU-only version of the package, you need to reinstall it from scratch with the following command:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!CMAKE_ARGS=\"-DLLAMA_METAL=on\" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir"
    ]
   },
   {
@@ -325,6 +364,61 @@
    "metadata": {},
    "outputs": [],
    "source": []
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Metal\n",
+    "\n",
+    "If the installation with Metal was correct, you will see a `NEON = 1` indicator in the model properties.\n",
+    "\n",
+    "The most important parameters for use with the GPU are:\n",
+    "\n",
+    "- `n_gpu_layers` - determines how many layers of the model are offloaded to your Metal GPU; in most cases, setting it to `1` is enough for Metal.\n",
+    "- `n_batch` - how many tokens are processed in parallel; the default is 8, set it to a larger number.\n",
+    "- `f16_kv` - Metal only supports `True`; otherwise you will get an error such as `Asserting on type 0\n",
+    "GGML_ASSERT: .../ggml-metal.m:706: false && \"not implemented\"`\n",
+    "\n",
+    "Setting these parameters correctly will dramatically improve the evaluation speed (see the [wrapper code](https://github.com/mmagnesium/langchain/blob/master/langchain/llms/llamacpp.py) for more details).\n",
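+    "\n",
+    "The next cell reuses the `callback_manager` defined in the earlier sections of this notebook; if you are running the Metal example on its own, a minimal sketch of that setup (streaming tokens to stdout) is:\n",
+    "\n",
+    "```python\n",
+    "from langchain.callbacks.manager import CallbackManager\n",
+    "from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler\n",
+    "\n",
+    "# Print each token to stdout as the model generates it\n",
+    "callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])\n",
+    "```"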
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "n_gpu_layers = 1 # Metal set to 1 is enough.\n", + "n_batch = 512 # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.\n", + "\n", + "# Make sure the model path is correct for your system!\n", + "llm = LlamaCpp(\n", + " model_path=\"./ggml-model-q4_0.bin\",\n", + " n_gpu_layers=n_gpu_layers,\n", + " n_batch=n_batch,\n", + " f16_kv=True, # MUST set to True, otherwise you will run into problem after a couple of calls\n", + " callback_manager=callback_manager,\n", + " verbose=True,\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The rest are almost same as GPU, the console log will show the following log to indicate the Metal was enable properly.\n", + "\n", + "```\n", + "ggml_metal_init: allocating\n", + "ggml_metal_init: using MPS\n", + "...\n", + "```\n", + "\n", + "You also could check the `Activity Monitor` by watching the % GPU of the process, the % CPU will drop dramatically after turn on `n_gpu_layers=1`. Also for the first time call LLM, the performance might be slow due to the model compilation in Metal GPU." + ] } ], "metadata": {