"`lama.cpp` supports Apple silicon first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks. Use the `FORCE_CMAKE=1` environment variable to force the use of cmake and install the pip package for the Metal support ([source](https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md)).\n",
"**IMPORTANT**: If you have already installed a cpu only version of the package, you need to reinstall it from scratch: consider the following command: "
"Setting these parameters correctly will dramatically improve the evaluation speed (see [wrapper code](https://github.com/mmagnesium/langchain/blob/master/langchain/llms/llamacpp.py) for more details)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n_gpu_layers = 1 # Metal set to 1 is enough.\n",
"n_batch = 512 # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.\n",
"\n",
"# Make sure the model path is correct for your system!\n",
"llm = LlamaCpp(\n",
" model_path=\"./ggml-model-q4_0.bin\",\n",
" n_gpu_layers=n_gpu_layers,\n",
" n_batch=n_batch,\n",
" f16_kv=True, # MUST set to True, otherwise you will run into problem after a couple of calls\n",
" callback_manager=callback_manager,\n",
" verbose=True,\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The rest are almost same as GPU, the console log will show the following log to indicate the Metal was enable properly.\n",
"\n",
"```\n",
"ggml_metal_init: allocating\n",
"ggml_metal_init: using MPS\n",
"...\n",
"```\n",
"\n",
"You also could check the `Activity Monitor` by watching the % GPU of the process, the % CPU will drop dramatically after turn on `n_gpu_layers=1`. Also for the first time call LLM, the performance might be slow due to the model compilation in Metal GPU."