langchain/docs/extras/integrations/llms/llamacpp.ipynb
eryk-dsai 7f5713b80a
feat: grammar-based sampling in llama-cpp (#9712)
## Description 

The following PR enables the [grammar-based
sampling](https://github.com/ggerganov/llama.cpp/tree/master/grammars)
in llama-cpp LLM.

In short, loading a file with a formal grammar definition constrains
model outputs. For instance, one can force the model to generate valid
JSON or to generate only Python lists.

In the follow-up PR we will add:
* docs with some description why it is cool and how it works
* maybe some code sample for some task such as in llama repo

---------

Co-authored-by: Lance Martin <lance@langchain.dev>
Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-08-28 09:52:55 -07:00

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Llama.cpp\n",
"\n",
"[llama-cpp-python](https://github.com/abetlen/llama-cpp-python) is a Python binding for [llama.cpp](https://github.com/ggerganov/llama.cpp). \n",
"It supports [several LLMs](https://github.com/ggerganov/llama.cpp).\n",
"\n",
"This notebook goes over how to run `llama-cpp-python` within LangChain."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installation\n",
"\n",
"There are different options on how to install the llama-cpp package: \n",
"- only CPU usage\n",
"- CPU + GPU (using one of many BLAS backends)\n",
"- Metal GPU (MacOS with Apple Silicon Chip) \n",
"\n",
"### CPU only installation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!pip install llama-cpp-python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Installation with OpenBLAS / cuBLAS / CLBlast\n",
"\n",
"`lama.cpp` supports multiple BLAS backends for faster processing. Use the `FORCE_CMAKE=1` environment variable to force the use of cmake and install the pip package for the desired BLAS backend ([source](https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast)).\n",
"\n",
"Example installation with cuBLAS backend:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!CMAKE_ARGS=\"-DLLAMA_CUBLAS=on\" FORCE_CMAKE=1 pip install llama-cpp-python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**IMPORTANT**: If you have already installed the CPU only version of the package, you need to reinstall it from scratch. Consider the following command: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!CMAKE_ARGS=\"-DLLAMA_CUBLAS=on\" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Installation with Metal\n",
"\n",
"`llama.cpp` supports Apple silicon first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks. Use the `FORCE_CMAKE=1` environment variable to force the use of cmake and install the pip package for the Metal support ([source](https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md)).\n",
"\n",
"Example installation with Metal Support:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!CMAKE_ARGS=\"-DLLAMA_METAL=on\" FORCE_CMAKE=1 pip install llama-cpp-python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**IMPORTANT**: If you have already installed a cpu only version of the package, you need to reinstall it from scratch: consider the following command: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!CMAKE_ARGS=\"-DLLAMA_METAL=on\" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Installation with Windows\n",
"\n",
"It is stable to install the `llama-cpp-python` library by compiling from the source. You can follow most of the instructions in the repository itself but there are some windows specific instructions which might be useful.\n",
"\n",
"Requirements to install the `llama-cpp-python`,\n",
"\n",
"- git\n",
"- python\n",
"- cmake\n",
"- Visual Studio Community (make sure you install this with the following settings)\n",
" - Desktop development with C++\n",
" - Python development\n",
" - Linux embedded development with C++\n",
"\n",
"1. Clone git repository recursively to get `llama.cpp` submodule as well \n",
"\n",
"```\n",
"git clone --recursive -j8 https://github.com/abetlen/llama-cpp-python.git\n",
"```\n",
"\n",
"2. Open up command Prompt (or anaconda prompt if you have it installed), set up environment variables to install. Follow this if you do not have a GPU, you must set both of the following variables.\n",
"\n",
"```\n",
"set FORCE_CMAKE=1\n",
"set CMAKE_ARGS=-DLLAMA_CUBLAS=OFF\n",
"```\n",
"You can ignore the second environment variable if you have an NVIDIA GPU.\n",
"\n",
"#### Compiling and installing\n",
"\n",
"In the same command prompt (anaconda prompt) you set the variables, you can `cd` into `llama-cpp-python` directory and run the following commands.\n",
"\n",
"```\n",
"python setup.py clean\n",
"python setup.py install\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Make sure you are following all instructions to [install all necessary model files](https://github.com/ggerganov/llama.cpp).\n",
"\n",
"You don't need an `API_TOKEN` as you will run the LLM locally.\n",
"\n",
"It is worth understanding which models are suitable to be used on the desired machine."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.llms import LlamaCpp\n",
"from langchain import PromptTemplate, LLMChain\n",
"from langchain.callbacks.manager import CallbackManager\n",
"from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Consider using a template that suits your model! Check the models page on HuggingFace etc. to get a correct prompting template.**"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"template = \"\"\"Question: {question}\n",
"\n",
"Answer: Let's work this out in a step by step way to be sure we have the right answer.\"\"\"\n",
"\n",
"prompt = PromptTemplate(template=template, input_variables=[\"question\"])"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Callbacks support token-wise streaming\n",
"callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])\n",
"# Verbose is required to pass to the callback manager"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### CPU"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Example using a LLaMA 2 7B model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Make sure the model path is correct for your system!\n",
"llm = LlamaCpp(\n",
" model_path=\"/Users/rlm/Desktop/Code/llama/llama-2-7b-ggml/llama-2-7b-chat.ggmlv3.q4_0.bin\",\n",
" temperature=0.75,\n",
" max_tokens=2000,\n",
" top_p=1,\n",
" callback_manager=callback_manager,\n",
" verbose=True,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Stephen Colbert:\n",
"Yo, John, I heard you've been talkin' smack about me on your show.\n",
"Let me tell you somethin', pal, I'm the king of late-night TV\n",
"My satire is sharp as a razor, it cuts deeper than a knife\n",
"While you're just a british bloke tryin' to be funny with your accent and your wit.\n",
"John Oliver:\n",
"Oh Stephen, don't be ridiculous, you may have the ratings but I got the real talk.\n",
"My show is the one that people actually watch and listen to, not just for the laughs but for the facts.\n",
"While you're busy talkin' trash, I'm out here bringing the truth to light.\n",
"Stephen Colbert:\n",
"Truth? Ha! You think your show is about truth? Please, it's all just a joke to you.\n",
"You're just a fancy-pants british guy tryin' to be funny with your news and your jokes.\n",
"While I'm the one who's really makin' a difference, with my sat"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 358.60 ms\n",
"llama_print_timings: sample time = 172.55 ms / 256 runs ( 0.67 ms per token, 1483.59 tokens per second)\n",
"llama_print_timings: prompt eval time = 613.36 ms / 16 tokens ( 38.33 ms per token, 26.09 tokens per second)\n",
"llama_print_timings: eval time = 10151.17 ms / 255 runs ( 39.81 ms per token, 25.12 tokens per second)\n",
"llama_print_timings: total time = 11332.41 ms\n"
]
},
{
"data": {
"text/plain": [
"\"\\nStephen Colbert:\\nYo, John, I heard you've been talkin' smack about me on your show.\\nLet me tell you somethin', pal, I'm the king of late-night TV\\nMy satire is sharp as a razor, it cuts deeper than a knife\\nWhile you're just a british bloke tryin' to be funny with your accent and your wit.\\nJohn Oliver:\\nOh Stephen, don't be ridiculous, you may have the ratings but I got the real talk.\\nMy show is the one that people actually watch and listen to, not just for the laughs but for the facts.\\nWhile you're busy talkin' trash, I'm out here bringing the truth to light.\\nStephen Colbert:\\nTruth? Ha! You think your show is about truth? Please, it's all just a joke to you.\\nYou're just a fancy-pants british guy tryin' to be funny with your news and your jokes.\\nWhile I'm the one who's really makin' a difference, with my sat\""
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prompt = \"\"\"\n",
"Question: A rap battle between Stephen Colbert and John Oliver\n",
"\"\"\"\n",
"llm(prompt)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Example using a LLaMA v1 model"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# Make sure the model path is correct for your system!\n",
"llm = LlamaCpp(\n",
" model_path=\"./ggml-model-q4_0.bin\", callback_manager=callback_manager, verbose=True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"llm_chain = LLMChain(prompt=prompt, llm=llm)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"1. First, find out when Justin Bieber was born.\n",
"2. We know that Justin Bieber was born on March 1, 1994.\n",
"3. Next, we need to look up when the Super Bowl was played in that year.\n",
"4. The Super Bowl was played on January 28, 1995.\n",
"5. Finally, we can use this information to answer the question. The NFL team that won the Super Bowl in the year Justin Bieber was born is the San Francisco 49ers."
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 434.15 ms\n",
"llama_print_timings: sample time = 41.81 ms / 121 runs ( 0.35 ms per token)\n",
"llama_print_timings: prompt eval time = 2523.78 ms / 48 tokens ( 52.58 ms per token)\n",
"llama_print_timings: eval time = 23971.57 ms / 121 runs ( 198.11 ms per token)\n",
"llama_print_timings: total time = 28945.95 ms\n"
]
},
{
"data": {
"text/plain": [
"'\\n\\n1. First, find out when Justin Bieber was born.\\n2. We know that Justin Bieber was born on March 1, 1994.\\n3. Next, we need to look up when the Super Bowl was played in that year.\\n4. The Super Bowl was played on January 28, 1995.\\n5. Finally, we can use this information to answer the question. The NFL team that won the Super Bowl in the year Justin Bieber was born is the San Francisco 49ers.'"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"question = \"What NFL team won the Super Bowl in the year Justin Bieber was born?\"\n",
"\n",
"llm_chain.run(question)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### GPU\n",
"\n",
"If the installation with BLAS backend was correct, you will see a `BLAS = 1` indicator in model properties.\n",
"\n",
"Two of the most important parameters for use with GPU are:\n",
"\n",
"- `n_gpu_layers` - determines how many layers of the model are offloaded to your GPU.\n",
"- `n_batch` - how many tokens are processed in parallel. \n",
"\n",
"Setting these parameters correctly will dramatically improve the evaluation speed (see [wrapper code](https://github.com/mmagnesium/langchain/blob/master/langchain/llms/llamacpp.py) for more details)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"llama.cpp: loading model from /Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin\n",
"llama_model_load_internal: format = ggjt v3 (latest)\n",
"llama_model_load_internal: n_vocab = 32000\n",
"llama_model_load_internal: n_ctx = 512\n",
"llama_model_load_internal: n_embd = 5120\n",
"llama_model_load_internal: n_mult = 256\n",
"llama_model_load_internal: n_head = 40\n",
"llama_model_load_internal: n_head_kv = 40\n",
"llama_model_load_internal: n_layer = 40\n",
"llama_model_load_internal: n_rot = 128\n",
"llama_model_load_internal: n_gqa = 1\n",
"llama_model_load_internal: rnorm_eps = 5.0e-06\n",
"llama_model_load_internal: n_ff = 13824\n",
"llama_model_load_internal: freq_base = 10000.0\n",
"llama_model_load_internal: freq_scale = 1\n",
"llama_model_load_internal: ftype = 2 (mostly Q4_0)\n",
"llama_model_load_internal: model size = 13B\n",
"llama_model_load_internal: ggml ctx size = 0.11 MB\n",
"llama_model_load_internal: mem required = 6983.72 MB (+ 400.00 MB per state)\n",
"llama_new_context_with_model: kv self size = 400.00 MB\n",
"ggml_metal_init: allocating\n",
"ggml_metal_init: loading '/Users/rlm/miniforge3/envs/llama2/lib/python3.9/site-packages/llama_cpp/ggml-metal.metal'\n",
"ggml_metal_init: loaded kernel_add 0x1405ed6b0\n",
"ggml_metal_init: loaded kernel_add_row 0x1405eee00\n",
"ggml_metal_init: loaded kernel_mul 0x1405ee650\n",
"ggml_metal_init: loaded kernel_mul_row 0x1405eda20\n",
"ggml_metal_init: loaded kernel_scale 0x121fc1d80\n",
"ggml_metal_init: loaded kernel_silu 0x121fc1fe0\n",
"ggml_metal_init: loaded kernel_relu 0x121fc2240\n",
"ggml_metal_init: loaded kernel_gelu 0x121fc24e0\n",
"ggml_metal_init: loaded kernel_soft_max 0x121fc2950\n",
"ggml_metal_init: loaded kernel_diag_mask_inf 0x121fc2d60\n",
"ggml_metal_init: loaded kernel_get_rows_f16 0x121fc3160\n",
"ggml_metal_init: loaded kernel_get_rows_q4_0 0x121fc3a20\n",
"ggml_metal_init: loaded kernel_get_rows_q4_1 0x121fc4170\n",
"ggml_metal_init: loaded kernel_get_rows_q2_K 0x121fc4890\n",
"ggml_metal_init: loaded kernel_get_rows_q3_K 0x121fc5010\n",
"ggml_metal_init: loaded kernel_get_rows_q4_K 0x121fc5750\n",
"ggml_metal_init: loaded kernel_get_rows_q5_K 0x121fc5e90\n",
"ggml_metal_init: loaded kernel_get_rows_q6_K 0x121fc65d0\n",
"ggml_metal_init: loaded kernel_rms_norm 0x121fc6d20\n",
"ggml_metal_init: loaded kernel_norm 0x121fc7460\n",
"ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x121fc7dd0\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x121fc8610\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x121fc8e50\n",
"ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x1405edc80\n",
"ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x1405efdc0\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x140306f30\n",
"ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x1403073d0\n",
"ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x140307aa0\n",
"ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x140307f80\n",
"ggml_metal_init: loaded kernel_mul_mm_q4_0_f32 0x140308460\n",
"ggml_metal_init: loaded kernel_mul_mm_q4_1_f32 0x140308940\n",
"ggml_metal_init: loaded kernel_mul_mm_q2_K_f32 0x140308e20\n",
"ggml_metal_init: loaded kernel_mul_mm_q3_K_f32 0x140309300\n",
"ggml_metal_init: loaded kernel_mul_mm_q4_K_f32 0x1403097e0\n",
"ggml_metal_init: loaded kernel_mul_mm_q5_K_f32 0x140309cc0\n",
"ggml_metal_init: loaded kernel_mul_mm_q6_K_f32 0x14030a1a0\n",
"ggml_metal_init: loaded kernel_rope 0x14030a400\n",
"ggml_metal_init: loaded kernel_alibi_f32 0x14030aa00\n",
"ggml_metal_init: loaded kernel_cpy_f32_f16 0x14030afd0\n",
"ggml_metal_init: loaded kernel_cpy_f32_f32 0x14030b5a0\n",
"ggml_metal_init: loaded kernel_cpy_f16_f16 0x14030bb70\n",
"ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB\n",
"ggml_metal_init: hasUnifiedMemory = true\n",
"ggml_metal_init: maxTransferRate = built-in GPU\n",
"llama_new_context_with_model: compute buffer total size = 91.35 MB\n",
"llama_new_context_with_model: max tensor size = 87.89 MB\n",
"ggml_metal_add_buffer: allocated 'data ' buffer, size = 6984.06 MB, ( 6984.50 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1.36 MB, ( 6985.86 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'kv ' buffer, size = 402.00 MB, ( 7387.86 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 90.02 MB, ( 7477.88 / 21845.34)\n",
"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | \n"
]
}
],
"source": [
"n_gpu_layers = 40 # Change this value based on your model and your GPU VRAM pool.\n",
"n_batch = 512 # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.\n",
"\n",
"# Make sure the model path is correct for your system!\n",
"llm = LlamaCpp(\n",
" model_path=\"/Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin\",\n",
" n_gpu_layers=n_gpu_layers,\n",
" n_batch=n_batch,\n",
" callback_manager=callback_manager,\n",
" verbose=True,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Justin Bieber was born on March 1, 1994. The Super Bowl is played at the end of the NFL season which runs from September to February.\n",
"\n",
"In 1994, the NFL season ended with Super Bowl XXVIII which was played on January 28th, 1994.\n",
"\n",
"So, there was no Super Bowl in the year Justin Bieber was born. The Super Bowl has only been around since 1967 and is played annually between the champions of the National Football Conference (NFC) and the American Football Conference (AFC)."
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 427.90 ms\n",
"llama_print_timings: sample time = 98.36 ms / 133 runs ( 0.74 ms per token, 1352.18 tokens per second)\n",
"llama_print_timings: prompt eval time = 427.83 ms / 45 tokens ( 9.51 ms per token, 105.18 tokens per second)\n",
"llama_print_timings: eval time = 3687.12 ms / 132 runs ( 27.93 ms per token, 35.80 tokens per second)\n",
"llama_print_timings: total time = 4401.84 ms\n"
]
},
{
"data": {
"text/plain": [
"'\\n\\nJustin Bieber was born on March 1, 1994. The Super Bowl is played at the end of the NFL season which runs from September to February.\\n\\nIn 1994, the NFL season ended with Super Bowl XXVIII which was played on January 28th, 1994.\\n\\nSo, there was no Super Bowl in the year Justin Bieber was born. The Super Bowl has only been around since 1967 and is played annually between the champions of the National Football Conference (NFC) and the American Football Conference (AFC).'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llm_chain = LLMChain(prompt=prompt, llm=llm)\n",
"\n",
"question = \"What NFL team won the Super Bowl in the year Justin Bieber was born?\"\n",
"\n",
"llm_chain.run(question)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Metal\n",
"\n",
"If the installation with Metal was correct, you will see a `NEON = 1` indicator in model properties.\n",
"\n",
"Two of the most important GPU parameters are:\n",
"\n",
"- `n_gpu_layers` - determines how many layers of the model are offloaded to your Metal GPU, in the most case, set it to `1` is enough for Metal\n",
"- `n_batch` - how many tokens are processed in parallel, default is 8, set to bigger number.\n",
"- `f16_kv` - for some reason, Metal only support `True`, otherwise you will get error such as `Asserting on type 0\n",
"GGML_ASSERT: .../ggml-metal.m:706: false && \"not implemented\"`\n",
"\n",
"Setting these parameters correctly will dramatically improve the evaluation speed (see [wrapper code](https://github.com/mmagnesium/langchain/blob/master/langchain/llms/llamacpp.py) for more details)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"llama.cpp: loading model from /Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin\n",
"llama_model_load_internal: format = ggjt v3 (latest)\n",
"llama_model_load_internal: n_vocab = 32000\n",
"llama_model_load_internal: n_ctx = 512\n",
"llama_model_load_internal: n_embd = 5120\n",
"llama_model_load_internal: n_mult = 256\n",
"llama_model_load_internal: n_head = 40\n",
"llama_model_load_internal: n_head_kv = 40\n",
"llama_model_load_internal: n_layer = 40\n",
"llama_model_load_internal: n_rot = 128\n",
"llama_model_load_internal: n_gqa = 1\n",
"llama_model_load_internal: rnorm_eps = 5.0e-06\n",
"llama_model_load_internal: n_ff = 13824\n",
"llama_model_load_internal: freq_base = 10000.0\n",
"llama_model_load_internal: freq_scale = 1\n",
"llama_model_load_internal: ftype = 2 (mostly Q4_0)\n",
"llama_model_load_internal: model size = 13B\n",
"llama_model_load_internal: ggml ctx size = 0.11 MB\n",
"llama_model_load_internal: mem required = 6983.72 MB (+ 400.00 MB per state)\n",
"llama_new_context_with_model: kv self size = 400.00 MB\n",
"ggml_metal_init: allocating\n",
"ggml_metal_init: loading '/Users/rlm/miniforge3/envs/llama2/lib/python3.9/site-packages/llama_cpp/ggml-metal.metal'\n",
"ggml_metal_init: loaded kernel_add 0x113b42480\n",
"ggml_metal_init: loaded kernel_add_row 0x113b44210\n",
"ggml_metal_init: loaded kernel_mul 0x113b43a80\n",
"ggml_metal_init: loaded kernel_mul_row 0x113b44880\n",
"ggml_metal_init: loaded kernel_scale 0x113b45010\n",
"ggml_metal_init: loaded kernel_silu 0x113b45650\n",
"ggml_metal_init: loaded kernel_relu 0x113b427f0\n",
"ggml_metal_init: loaded kernel_gelu 0x113b46300\n",
"ggml_metal_init: loaded kernel_soft_max 0x113b46980\n",
"ggml_metal_init: loaded kernel_diag_mask_inf 0x113b46e20\n",
"ggml_metal_init: loaded kernel_get_rows_f16 0x113b47860\n",
"ggml_metal_init: loaded kernel_get_rows_q4_0 0x113b48010\n",
"ggml_metal_init: loaded kernel_get_rows_q4_1 0x113b48880\n",
"ggml_metal_init: loaded kernel_get_rows_q2_K 0x113b48f70\n",
"ggml_metal_init: loaded kernel_get_rows_q3_K 0x113b49e00\n",
"ggml_metal_init: loaded kernel_get_rows_q4_K 0x113b4a530\n",
"ggml_metal_init: loaded kernel_get_rows_q5_K 0x113b4ac70\n",
"ggml_metal_init: loaded kernel_get_rows_q6_K 0x113b4b3b0\n",
"ggml_metal_init: loaded kernel_rms_norm 0x113b4bb00\n",
"ggml_metal_init: loaded kernel_norm 0x113b4c1a0\n",
"ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x113b4cba0\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x113b4d360\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x113b4dba0\n",
"ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x113b4e560\n",
"ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x113b4ed10\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x113b4f580\n",
"ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x113b4fdc0\n",
"ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x113b50740\n",
"ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x113b51250\n",
"ggml_metal_init: loaded kernel_mul_mm_q4_0_f32 0x113b51a80\n",
"ggml_metal_init: loaded kernel_mul_mm_q4_1_f32 0x113b522b0\n",
"ggml_metal_init: loaded kernel_mul_mm_q2_K_f32 0x113b52ae0\n",
"ggml_metal_init: loaded kernel_mul_mm_q3_K_f32 0x113b53310\n",
"ggml_metal_init: loaded kernel_mul_mm_q4_K_f32 0x113b53b40\n",
"ggml_metal_init: loaded kernel_mul_mm_q5_K_f32 0x113b54370\n",
"ggml_metal_init: loaded kernel_mul_mm_q6_K_f32 0x113b54ba0\n",
"ggml_metal_init: loaded kernel_rope 0x113b551a0\n",
"ggml_metal_init: loaded kernel_alibi_f32 0x113b55b10\n",
"ggml_metal_init: loaded kernel_cpy_f32_f16 0x113b56450\n",
"ggml_metal_init: loaded kernel_cpy_f32_f32 0x113b56dc0\n",
"ggml_metal_init: loaded kernel_cpy_f16_f16 0x113b576b0\n",
"ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB\n",
"ggml_metal_init: hasUnifiedMemory = true\n",
"ggml_metal_init: maxTransferRate = built-in GPU\n",
"llama_new_context_with_model: compute buffer total size = 91.35 MB\n",
"llama_new_context_with_model: max tensor size = 87.89 MB\n",
"ggml_metal_add_buffer: allocated 'data ' buffer, size = 6984.06 MB, ( 6984.50 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1.36 MB, ( 6985.86 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'kv ' buffer, size = 402.00 MB, ( 7387.86 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 90.02 MB, ( 7477.88 / 21845.34)AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | \n",
"\n"
]
}
],
"source": [
"n_gpu_layers = 1 # Metal set to 1 is enough.\n",
"n_batch = 512 # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.\n",
"\n",
"# Make sure the model path is correct for your system!\n",
"llm = LlamaCpp(\n",
" model_path=\"/Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin\",\n",
" n_gpu_layers=n_gpu_layers,\n",
" n_batch=n_batch,\n",
" f16_kv=True, # MUST set to True, otherwise you will run into problem after a couple of calls\n",
" callback_manager=callback_manager,\n",
" verbose=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The console log will show the following log to indicate Metal was enable properly.\n",
"\n",
"```\n",
"ggml_metal_init: allocating\n",
"ggml_metal_init: using MPS\n",
"...\n",
"```\n",
"\n",
"You also could check `Activity Monitor` by watching the GPU usage of the process, the CPU usage will drop dramatically after turn on `n_gpu_layers=1`. \n",
"\n",
"For the first call to the LLM, the performance may be slow due to the model compilation in Metal GPU."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Grammars\n",
"\n",
"\n",
"We can specify [grammars](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md) to constrain model outputs.\n",
"\n",
"Supply the path to the specifed `json.gbnf` file."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"llama.cpp: loading model from /Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin\n",
"llama_model_load_internal: format = ggjt v3 (latest)\n",
"llama_model_load_internal: n_vocab = 32000\n",
"llama_model_load_internal: n_ctx = 512\n",
"llama_model_load_internal: n_embd = 5120\n",
"llama_model_load_internal: n_mult = 256\n",
"llama_model_load_internal: n_head = 40\n",
"llama_model_load_internal: n_head_kv = 40\n",
"llama_model_load_internal: n_layer = 40\n",
"llama_model_load_internal: n_rot = 128\n",
"llama_model_load_internal: n_gqa = 1\n",
"llama_model_load_internal: rnorm_eps = 5.0e-06\n",
"llama_model_load_internal: n_ff = 13824\n",
"llama_model_load_internal: freq_base = 10000.0\n",
"llama_model_load_internal: freq_scale = 1\n",
"llama_model_load_internal: ftype = 2 (mostly Q4_0)\n",
"llama_model_load_internal: model size = 13B\n",
"llama_model_load_internal: ggml ctx size = 0.11 MB\n",
"llama_model_load_internal: mem required = 6983.72 MB (+ 400.00 MB per state)\n",
"llama_new_context_with_model: kv self size = 400.00 MB\n",
"ggml_metal_init: allocating\n",
"ggml_metal_init: loading '/Users/rlm/miniforge3/envs/llama2/lib/python3.9/site-packages/llama_cpp/ggml-metal.metal'\n",
"ggml_metal_init: loaded kernel_add 0x1516fb530\n",
"ggml_metal_init: loaded kernel_add_row 0x1516fb790\n",
"ggml_metal_init: loaded kernel_mul 0x1516fb9f0\n",
"ggml_metal_init: loaded kernel_mul_row 0x1516fbc50\n",
"ggml_metal_init: loaded kernel_scale 0x1516fbeb0\n",
"ggml_metal_init: loaded kernel_silu 0x1516fc110\n",
"ggml_metal_init: loaded kernel_relu 0x1516fc370\n",
"ggml_metal_init: loaded kernel_gelu 0x1516fc5d0\n",
"ggml_metal_init: loaded kernel_soft_max 0x1516fc830\n",
"ggml_metal_init: loaded kernel_diag_mask_inf 0x1516fca90\n",
"ggml_metal_init: loaded kernel_get_rows_f16 0x1516fccf0\n",
"ggml_metal_init: loaded kernel_get_rows_q4_0 0x1516fcf50\n",
"ggml_metal_init: loaded kernel_get_rows_q4_1 0x1516fd1b0\n",
"ggml_metal_init: loaded kernel_get_rows_q2_K 0x1516fd410\n",
"ggml_metal_init: loaded kernel_get_rows_q3_K 0x1516fd670\n",
"ggml_metal_init: loaded kernel_get_rows_q4_K 0x1516fd8d0\n",
"ggml_metal_init: loaded kernel_get_rows_q5_K 0x1516fdb30\n",
"ggml_metal_init: loaded kernel_get_rows_q6_K 0x1516fdd90\n",
"ggml_metal_init: loaded kernel_rms_norm 0x1516fdff0\n",
"ggml_metal_init: loaded kernel_norm 0x1516fe250\n",
"ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x1516fe4b0\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x1516fe710\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x1516fe970\n",
"ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x1516febd0\n",
"ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x1516fee30\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x1516ff090\n",
"ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x1516ff2f0\n",
"ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x1516ff550\n",
"ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x1516ff7b0\n",
"ggml_metal_init: loaded kernel_mul_mm_q4_0_f32 0x121fce650\n",
"ggml_metal_init: loaded kernel_mul_mm_q4_1_f32 0x121fcdce0\n",
"ggml_metal_init: loaded kernel_mul_mm_q2_K_f32 0x121fceab0\n",
"ggml_metal_init: loaded kernel_mul_mm_q3_K_f32 0x121fced10\n",
"ggml_metal_init: loaded kernel_mul_mm_q4_K_f32 0x121fcef70\n",
"ggml_metal_init: loaded kernel_mul_mm_q5_K_f32 0x121fcf1d0\n",
"ggml_metal_init: loaded kernel_mul_mm_q6_K_f32 0x121fcf430\n",
"ggml_metal_init: loaded kernel_rope 0x121fcf690\n",
"ggml_metal_init: loaded kernel_alibi_f32 0x121fcf8f0\n",
"ggml_metal_init: loaded kernel_cpy_f32_f16 0x121fcfb50\n",
"ggml_metal_init: loaded kernel_cpy_f32_f32 0x121fcfdb0\n",
"ggml_metal_init: loaded kernel_cpy_f16_f16 0x121fd0010\n",
"ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB\n",
"ggml_metal_init: hasUnifiedMemory = true\n",
"ggml_metal_init: maxTransferRate = built-in GPU\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"root ::= object \n",
"object ::= [{] ws object_11 [}] \n",
"value ::= object | array | string | number | boolean | [n] [u] [l] [l] \n",
"array ::= [[] ws array_15 []] \n",
"string ::= [\"] string_18 [\"] ws \n",
"number ::= number_19 number_20 ws \n",
"boolean ::= boolean_21 ws \n",
"ws ::= ws_23 \n",
"object_8 ::= string [:] ws value object_10 \n",
"object_9 ::= [,] ws string [:] ws value \n",
"object_10 ::= object_9 object_10 | \n",
"object_11 ::= object_8 | \n",
"array_12 ::= value array_14 \n",
"array_13 ::= [,] ws value \n",
"array_14 ::= array_13 array_14 | \n",
"array_15 ::= array_12 | \n",
"string_16 ::= [^\"\\] | [\\] string_17 \n",
"string_17 ::= [\"\\/bfnrt] | [u] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] \n",
"string_18 ::= string_16 string_18 | \n",
"number_19 ::= [-] | \n",
"number_20 ::= [0-9] number_20 | [0-9] \n",
"boolean_21 ::= [t] [r] [u] [e] | [f] [a] [l] [s] [e] \n",
"ws_22 ::= [ <U+0009><U+000A>] ws \n",
"ws_23 ::= ws_22 | \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"llama_new_context_with_model: compute buffer total size = 91.35 MB\n",
"llama_new_context_with_model: max tensor size = 87.89 MB\n",
"ggml_metal_add_buffer: allocated 'data ' buffer, size = 6984.06 MB, (14468.72 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1.36 MB, (14470.08 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'kv ' buffer, size = 402.00 MB, (14872.08 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 90.02 MB, (14962.09 / 21845.34)\n",
"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | \n",
"from_string grammar:\n",
"\n"
]
}
],
"source": [
"n_gpu_layers = 1\n",
"n_batch = 512\n",
"llm = LlamaCpp(\n",
" model_path=\"/Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin\",\n",
" n_gpu_layers=n_gpu_layers,\n",
" n_batch=n_batch,\n",
" f16_kv=True, # Must be True; otherwise you will run into problems after a couple of calls\n",
" callback_manager=callback_manager,\n",
" verbose=True,\n",
" grammar_path=\"/Users/rlm/Desktop/Code/langchain-main/langchain/libs/langchain/langchain/llms/grammars/json.gbnf\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Error in LangChainTracer.on_llm_start callback: ctypes objects containing pointers cannot be pickled\n",
"Exception ignored in: <function LlamaGrammar.__del__ at 0x1402b15e0>\n",
"Traceback (most recent call last):\n",
" File \"/Users/rlm/miniforge3/envs/llama2/lib/python3.9/site-packages/llama_cpp/llama_grammar.py\", line 46, in __del__\n",
" if self.grammar is not None:\n",
"AttributeError: 'LlamaGrammar' object has no attribute 'grammar'\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\"name\": \"John Doe\", \"age\": 30, \"gender\": \"male\"}"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 317.62 ms\n",
"llama_print_timings: sample time = 141.83 ms / 22 runs ( 6.45 ms per token, 155.11 tokens per second)\n",
"llama_print_timings: prompt eval time = 316.89 ms / 9 tokens ( 35.21 ms per token, 28.40 tokens per second)\n",
"llama_print_timings: eval time = 575.93 ms / 21 runs ( 27.43 ms per token, 36.46 tokens per second)\n",
"llama_print_timings: total time = 1087.31 ms\n",
"Error in LangChainTracer.on_llm_end callback: ctypes objects containing pointers cannot be pickled\n",
"Exception ignored in: <function LlamaGrammar.__del__ at 0x1402b15e0>\n",
"Traceback (most recent call last):\n",
" File \"/Users/rlm/miniforge3/envs/llama2/lib/python3.9/site-packages/llama_cpp/llama_grammar.py\", line 46, in __del__\n",
" if self.grammar is not None:\n",
"AttributeError: 'LlamaGrammar' object has no attribute 'grammar'\n"
]
}
],
"source": [
"result=llm(\"Describe a person in JSON format:\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'John Doe'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
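],
"source": [
"import json\n",
"\n",
"# Parse with json.loads instead of eval, which would execute whatever text the model returned\n",
"json.loads(result)[\"name\"]"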
]
},
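{
"cell_type": "markdown",
"metadata": {},
"source": [
"The grammar restricts which tokens can be sampled, but the result can still be incomplete JSON if generation stops early, for example at the token limit. The next cell is a minimal sketch of defensive parsing and is not part of LangChain or `llama-cpp-python`: `parse_json_output` is a hypothetical helper built on the standard-library `json.loads`, and `sample` is copied from the output shown above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"from typing import Any, Optional\n",
"\n",
"\n",
"def parse_json_output(text: str) -> Optional[Any]:\n",
"    # Hypothetical helper: return the parsed JSON value, or None if the text is not valid JSON.\n",
"    try:\n",
"        return json.loads(text)\n",
"    except json.JSONDecodeError:\n",
"        # The output may be truncated, e.g. if generation stopped at the token limit.\n",
"        return None\n",
"\n",
"\n",
"# Sample text copied from the model output shown above.\n",
"sample = '{\"name\": \"John Doe\", \"age\": 30, \"gender\": \"male\"}'\n",
"person = parse_json_output(sample)\n",
"print(person[\"name\"])"
]
},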
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also pass `list.gbnf` as the grammar to constrain the output to a list of quoted strings."
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"llama.cpp: loading model from /home/eryk/deepsense/llama-2-7b.ggmlv3.q4_0.bin\n",
"llama_model_load_internal: format = ggjt v3 (latest)\n",
"llama_model_load_internal: n_vocab = 32000\n",
"llama_model_load_internal: n_ctx = 512\n",
"llama_model_load_internal: n_embd = 4096\n",
"llama_model_load_internal: n_mult = 256\n",
"llama_model_load_internal: n_head = 32\n",
"llama_model_load_internal: n_head_kv = 32\n",
"llama_model_load_internal: n_layer = 32\n",
"llama_model_load_internal: n_rot = 128\n",
"llama_model_load_internal: n_gqa = 1\n",
"llama_model_load_internal: rnorm_eps = 5.0e-06\n",
"llama_model_load_internal: n_ff = 11008\n",
"llama_model_load_internal: freq_base = 10000.0\n",
"llama_model_load_internal: freq_scale = 1\n",
"llama_model_load_internal: ftype = 2 (mostly Q4_0)\n",
"llama_model_load_internal: model size = 7B\n",
"llama_model_load_internal: ggml ctx size = 0.08 MB\n",
"llama_model_load_internal: mem required = 3615.73 MB (+ 256.00 MB per state)\n",
"llama_new_context_with_model: kv self size = 256.00 MB\n",
"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | \n",
"llama_new_context_with_model: compute buffer total size = 71.84 MB\n",
"from_string grammar:\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"root ::= [[] items []] EOF \n",
"items ::= item items_7 \n",
"EOF ::= [<U+000A>] \n",
"item ::= string \n",
"items_4 ::= [,] items_6 item \n",
"ws ::= [ ] \n",
"items_6 ::= ws items_6 | \n",
"items_7 ::= items_4 items_7 | \n",
"string ::= [\"] word string_12 [\"] string_13 \n",
"word ::= word_14 \n",
"string_10 ::= string_11 word \n",
"string_11 ::= ws string_11 | ws \n",
"string_12 ::= string_10 string_12 | \n",
"string_13 ::= ws string_13 | \n",
"word_14 ::= [a-zA-Z] word_14 | [a-zA-Z] \n"
]
}
],
"source": [
"n_gpu_layers = 1\n",
"n_batch = 512\n",
"llm = LlamaCpp(\n",
" model_path=\"/home/eryk/deepsense/llama-2-7b.ggmlv3.q4_0.bin\",\n",
" n_gpu_layers=n_gpu_layers,\n",
" n_batch=n_batch,\n",
" f16_kv=True, # Must be True; otherwise you will run into problems after a couple of calls\n",
" callback_manager=callback_manager,\n",
" verbose=True,\n",
" grammar_path=\"/home/eryk/deepsense/langchain/libs/langchain/langchain/llms/grammars/list.gbnf\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[\"Jane Eyre\" , \"Sense and Sensibility\" , \"A Tale of Two Cities\"]\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 1079.21 ms\n",
"llama_print_timings: sample time = 225.57 ms / 29 runs ( 7.78 ms per token, 128.56 tokens per second)\n",
"llama_print_timings: prompt eval time = 1078.34 ms / 11 tokens ( 98.03 ms per token, 10.20 tokens per second)\n",
"llama_print_timings: eval time = 4389.99 ms / 28 runs ( 156.79 ms per token, 6.38 tokens per second)\n",
"llama_print_timings: total time = 5807.84 ms\n"
]
}
],
"source": [
"result=llm(\"List of my top-3 favourite books:\")"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Jane Eyre', 'Sense and Sensibility', 'A Tale of Two Cities']"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
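],
"source": [
"import json\n",
"\n",
"# The list grammar emits a JSON-compatible array of strings, so json.loads is a safer parser than eval\n",
"json.loads(result)"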
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 4
}