From e49abd12776a2549458a63e03e9b25ced602135d Mon Sep 17 00:00:00 2001
From: genewoo
Date: Tue, 4 Jul 2023 03:35:39 +0800
Subject: [PATCH] Add Metal support to llama.cpp doc (#7092)

- Description: Add Metal support to llama.cpp doc
- Issue: #7091
- Dependencies: N/A
- Twitter handle: gene_wu
---
 .../models/llms/integrations/llamacpp.ipynb | 96 ++++++++++++++++++-
 1 file changed, 95 insertions(+), 1 deletion(-)

diff --git a/docs/extras/modules/model_io/models/llms/integrations/llamacpp.ipynb b/docs/extras/modules/model_io/models/llms/integrations/llamacpp.ipynb
index 1c9c9a1c5f..b8a1e86ab2 100644
--- a/docs/extras/modules/model_io/models/llms/integrations/llamacpp.ipynb
+++ b/docs/extras/modules/model_io/models/llms/integrations/llamacpp.ipynb
@@ -23,6 +23,7 @@
     "There are several options for installing the llama-cpp package: \n",
     "- only CPU usage\n",
     "- CPU + GPU (using one of many BLAS backends)\n",
+    "- Metal GPU (macOS with Apple Silicon chip) \n",
     "\n",
     "### CPU only installation"
@@ -73,7 +74,45 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!CMAKE_ARGS=\"-DLLAMA_CUBLAS=on\" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python"
+    "!CMAKE_ARGS=\"-DLLAMA_CUBLAS=on\" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Installation with Metal\n",
+    "\n",
+    "`llama.cpp` supports Apple silicon as a first-class citizen, optimized via the ARM NEON, Accelerate, and Metal frameworks. Use the `FORCE_CMAKE=1` environment variable to force the use of cmake and install the pip package with Metal support ([source](https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md)).\n",
+    "\n",
+    "Example installation with Metal support:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!CMAKE_ARGS=\"-DLLAMA_METAL=on\" FORCE_CMAKE=1 pip install llama-cpp-python"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**IMPORTANT**: If you have already installed the CPU-only version of the package, you need to reinstall it from scratch with the following command:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!CMAKE_ARGS=\"-DLLAMA_METAL=on\" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir"
    ]
   },
   {
@@ -325,6 +364,61 @@
    "metadata": {},
    "outputs": [],
    "source": []
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Metal\n",
+    "\n",
+    "If the installation with Metal was correct, you will see a `NEON = 1` indicator in the model properties.\n",
+    "\n",
+    "The most important parameters for use with the GPU are:\n",
+    "\n",
+    "- `n_gpu_layers` - determines how many layers of the model are offloaded to your Metal GPU; in most cases, setting it to `1` is enough for Metal.\n",
+    "- `n_batch` - how many tokens are processed in parallel; the default is 8, set it to a larger number.\n",
+    "- `f16_kv` - Metal only supports `True`; otherwise you will get an error such as `Asserting on type 0\n",
+    "GGML_ASSERT: .../ggml-metal.m:706: false && \"not implemented\"`\n",
+    "\n",
+    "Setting these parameters correctly will dramatically improve the evaluation speed (see the [wrapper code](https://github.com/mmagnesium/langchain/blob/master/langchain/llms/llamacpp.py) for more details).\n",
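+    "\n",
+    "The next cell reuses the `callback_manager` defined in the earlier sections of this notebook; if you are running the Metal example on its own, a minimal sketch of that setup (streaming tokens to stdout) is:\n",
+    "\n",
+    "```python\n",
+    "from langchain.callbacks.manager import CallbackManager\n",
+    "from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler\n",
+    "\n",
+    "# Print each token to stdout as the model generates it\n",
+    "callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])\n",
+    "```"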
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "n_gpu_layers = 1 # Metal set to 1 is enough.\n", + "n_batch = 512 # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.\n", + "\n", + "# Make sure the model path is correct for your system!\n", + "llm = LlamaCpp(\n", + " model_path=\"./ggml-model-q4_0.bin\",\n", + " n_gpu_layers=n_gpu_layers,\n", + " n_batch=n_batch,\n", + " f16_kv=True, # MUST set to True, otherwise you will run into problem after a couple of calls\n", + " callback_manager=callback_manager,\n", + " verbose=True,\n", + ")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The rest are almost same as GPU, the console log will show the following log to indicate the Metal was enable properly.\n", + "\n", + "```\n", + "ggml_metal_init: allocating\n", + "ggml_metal_init: using MPS\n", + "...\n", + "```\n", + "\n", + "You also could check the `Activity Monitor` by watching the % GPU of the process, the % CPU will drop dramatically after turn on `n_gpu_layers=1`. Also for the first time call LLM, the performance might be slow due to the model compilation in Metal GPU." + ] } ], "metadata": {