Add Metal support to llama.cpp doc (#7092)

- Description: Add Metal support to llama.cpp doc
  - Issue: #7091 
  - Dependencies: N/A
  - Twitter handle: gene_wu
genewoo 2023-07-04 03:35:39 +08:00 committed by GitHub
parent fad2c7e5e0
commit e49abd1277


@@ -23,6 +23,7 @@
"There is a banch of options how to install the llama-cpp package: \n", "There is a banch of options how to install the llama-cpp package: \n",
"- only CPU usage\n", "- only CPU usage\n",
"- CPU + GPU (using one of many BLAS backends)\n", "- CPU + GPU (using one of many BLAS backends)\n",
"- Metal GPU (MacOS with Apple Silicon Chip) \n",
"\n", "\n",
"### CPU only installation" "### CPU only installation"
] ]
@@ -73,7 +74,45 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"!CMAKE_ARGS=\"-DLLAMA_CUBLAS=on\" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python" "!CMAKE_ARGS=\"-DLLAMA_CUBLAS=on\" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Installation with Metal\n",
"\n",
"`lama.cpp` supports Apple silicon first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks. Use the `FORCE_CMAKE=1` environment variable to force the use of cmake and install the pip package for the Metal support ([source](https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md)).\n",
"\n",
"Example installation with Metal Support:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!CMAKE_ARGS=\"-DLLAMA_METAL=on\" FORCE_CMAKE=1 pip install llama-cpp-python"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"**IMPORTANT**: If you have already installed a cpu only version of the package, you need to reinstall it from scratch: consider the following command: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!CMAKE_ARGS=\"-DLLAMA_METAL=on\" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir"
]
},
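{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check that the reinstall succeeded, confirm the package imports (a minimal sketch; `Llama` is the main class exposed by `llama-cpp-python`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Verify the freshly built package imports cleanly before loading a model.\n",
"from llama_cpp import Llama"
]
},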
{
@@ -325,6 +364,61 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [] "source": []
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Metal\n",
"\n",
"If the installation with Metal was correct, you will see an `NEON = 1` indicator in model properties.\n",
"\n",
"Two of the most important parameters for use with GPU are:\n",
"\n",
"- `n_gpu_layers` - determines how many layers of the model are offloaded to your Metal GPU, in the most case, set it to `1` is enough for Metal\n",
"- `n_batch` - how many tokens are processed in parallel, default is 8, set to bigger number.\n",
"- `f16_kv` - for some reason, Metal only support `True`, otherwise you will get error such as `Asserting on type 0\n",
"GGML_ASSERT: .../ggml-metal.m:706: false && \"not implemented\"`\n",
"\n",
"Setting these parameters correctly will dramatically improve the evaluation speed (see [wrapper code](https://github.com/mmagnesium/langchain/blob/master/langchain/llms/llamacpp.py) for more details)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n_gpu_layers = 1 # Metal set to 1 is enough.\n",
"n_batch = 512 # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.\n",
"\n",
"# Make sure the model path is correct for your system!\n",
"llm = LlamaCpp(\n",
" model_path=\"./ggml-model-q4_0.bin\",\n",
" n_gpu_layers=n_gpu_layers,\n",
" n_batch=n_batch,\n",
" f16_kv=True, # MUST set to True, otherwise you will run into problem after a couple of calls\n",
" callback_manager=callback_manager,\n",
" verbose=True,\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The rest are almost same as GPU, the console log will show the following log to indicate the Metal was enable properly.\n",
"\n",
"```\n",
"ggml_metal_init: allocating\n",
"ggml_metal_init: using MPS\n",
"...\n",
"```\n",
"\n",
"You also could check the `Activity Monitor` by watching the % GPU of the process, the % CPU will drop dramatically after turn on `n_gpu_layers=1`. Also for the first time call LLM, the performance might be slow due to the model compilation in Metal GPU."
]
},
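{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As an end-to-end check (a minimal sketch reusing the `llm` instance configured above), run a short prompt and watch the process in Activity Monitor while it streams:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The first call may be slow due to model compilation on the Metal GPU;\n",
"# subsequent calls run at full speed.\n",
"llm(\"Question: Name the planets in the solar system. Answer:\")"
]
}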
],
"metadata": {