"This notebook provides a quick overview for getting started with chat model intergrated with [llama cpp python](https://github.com/abetlen/llama-cpp-python)\n",
"\n",
"An example below demonstrating how to implement with the open-source Llama3 Instruct 8B"
"This notebook provides a quick overview for getting started with chat model intergrated with [llama cpp python](https://github.com/abetlen/llama-cpp-python)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Setup\n",
"\n",
"To get started and use **all** the features show below, we reccomend using a model that has been fine-tuned for tool-calling.\n",
"\n",
"We will use [\n",
"Hermes-2-Pro-Llama-3-8B-GGUF](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF) from NousResearch. \n",
"\n",
"> Hermes 2 Pro is an upgraded version of Nous Hermes 2, consisting of an updated and cleaned version of the OpenHermes 2.5 Dataset, as well as a newly introduced Function Calling and JSON Mode dataset developed in-house. This new version of Hermes maintains its excellent general task and conversation capabilities - but also excels at Function Calling\n",
"* [Using local models with RAG](https://python.langchain.com/v0.1/docs/use_cases/question_answering/local_retrieval_qa/)\n",
"\n",
"### Installation\n",
"\n",
"The LangChain OpenAI integration lives in the `langchain-community` and `llama-cpp-python` packages:"
]
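},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain-community llama-cpp-python"
]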
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/tni5hc/Documents/langchain_llamacpp/SanctumAI-meta-llama-3-8b-instruct.Q8_0.gguf (version GGUF V3 (latest))\n",
"llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.\n",
" n_batch=300, # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.\n",
@ -195,32 +105,9 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 1077.71 ms\n",
"llama_print_timings: sample time = 21.82 ms / 39 runs ( 0.56 ms per token, 1787.35 tokens per second)\n",
"llama_print_timings: prompt eval time = 1077.65 ms / 37 tokens ( 29.13 ms per token, 34.33 tokens per second)\n",
"llama_print_timings: eval time = 8403.75 ms / 38 runs ( 221.15 ms per token, 4.52 tokens per second)\n",
"llama_print_timings: total time = 9689.66 ms / 75 tokens\n"
]
},
{
"data": {
"text/plain": [
"AIMessage(content='Je adore le programmation.\\n\\n(Note: \"programmation\" is used in both formal and informal contexts, but it\\'s generally accepted as equivalent of saying you like computer science or coding.)', response_metadata={'finish_reason': 'stop'}, id='run-e9e03b94-f29f-4c1d-8483-e23a46acb556-0')"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"outputs": [],
"source": [
"messages = [\n",
" (\n",
"        \"system\",\n",
"        \"You are a helpful assistant that translates English to French. Translate the user sentence.\",\n",
"    ),\n",
"    (\"human\", \"I love programming.\"),\n",
"]\n",
"\n",
"ai_msg = llm.invoke(messages)\n",
"ai_msg"
]
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Je adore le programmation.\n",
"J'aime programmer. (In France, \"programming\" is often used in its original sense of scheduling or organizing events.) \n",
"\n",
"(Note: \"programmation\" is used in both formal and informal contexts, but it's generally accepted as equivalent of saying you like computer science or coding.)\n"
"If you meant computer-programming: \n",
"Je suis amoureux de la programmation informatique.\n",
"\n",
"(You might also say simply 'programmation', which would be understood as both meanings - depending on context).\n"
]
}
],
"source": [
"print(ai_msg.content)"
]
},
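{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tool calling\n",
"\n",
"A minimal sketch of tool calling with a tool-calling fine-tune like Hermes-2-Pro (the `get_weather` tool is a hypothetical stub, and forcing the tool via `tool_choice` is an assumption about how llama.cpp-backed models are best driven):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.tools import tool\n",
"\n",
"\n",
"@tool\n",
"def get_weather(location: str) -> str:\n",
"    \"\"\"Get the current weather in a given location.\"\"\"\n",
"    return f\"It is sunny in {location}.\"  # hypothetical stub tool\n",
"\n",
"\n",
"llm_with_tools = llm.bind_tools(\n",
"    [get_weather],\n",
"    # Assumption: force the single tool with an explicit tool_choice.\n",
"    tool_choice={\"type\": \"function\", \"function\": {\"name\": \"get_weather\"}},\n",
")\n",
"\n",
"llm_with_tools.invoke(\"What is the weather like in Paris?\").tool_calls"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Structured output\n",
"\n",
"The cells below use a `structured_llm`. A minimal sketch of how it could be built (the `Joke` schema is an assumption inferred from the dict-shaped result further down; passing a dict schema from `convert_to_openai_tool` is what makes the model return a plain dict rather than a Pydantic object):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.utils.function_calling import convert_to_openai_tool\n",
"from pydantic import BaseModel, Field\n",
"\n",
"\n",
"class Joke(BaseModel):\n",
"    \"\"\"A joke with a setup and a punchline.\"\"\"\n",
"\n",
"    setup: str = Field(description=\"The setup of the joke\")\n",
"    punchline: str = Field(description=\"The punchline of the joke\")\n",
"\n",
"\n",
"dict_schema = convert_to_openai_tool(Joke)\n",
"structured_llm = llm.with_structured_output(dict_schema)"
]
},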
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Llama.generate: prefix-match hit\n",
"\n",
"llama_print_timings: load time = 1077.71 ms\n",
"llama_print_timings: sample time = 29.23 ms / 52 runs ( 0.56 ms per token, 1778.75 tokens per second)\n",
"llama_print_timings: prompt eval time = 869.38 ms / 17 tokens ( 51.14 ms per token, 19.55 tokens per second)\n",
"llama_print_timings: eval time = 6694.18 ms / 51 runs ( 131.26 ms per token, 7.62 tokens per second)\n",
"llama_print_timings: total time = 7830.86 ms / 68 tokens\n"
]
},
{
"data": {
"text/plain": [
"AIMessage(content='Ich liebe auch Programmieren! (Translation: I also like coding!) Do you have any favorite languages or projects? Ich bin hier, um dir zu helfen und über deine Lieblingsprogrammierthemen sprechen können wir gerne weiter machen... !)', response_metadata={'finish_reason': 'stop'}, id='run-922c4cad-368f-41ba-9db9-eacb41d37cb2-0')"
" \"What weighs more a pound of bricks or a pound of feathers ?\"\n",
")"
"result = structured_llm.invoke(\"Tell me a joke about birds\")\n",
"result"
]
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'answer': \"a pound is always the same weight, regardless of what it's made up off. So both options are equal in terms of their mass.\", 'justification': ''}\n"
]
"data": {
"text/plain": [
"{'setup': '- Why did the chicken cross the playground?',\n",
" 'punchline': '\\n\\n- To get to its gilded cage on the other side!'}"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(result)"
"result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Streaming\n",
"\n",
"`ChatLlamaCpp` also supports streaming tokens via `stream`:\n"
]
},
{
"cell_type": "code",
"execution_count": 23,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Llama.generate: prefix-match hit\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"The\n",
" answer\n",
" to\n",
" the\n",
" multiplication\n",
" problem\n",
" \"\n",
"What\n",
"'s\n",
" \n",
"25\n",
" x\n",
" \n",
"5\n",
"?\"\n",
" would\n",
" be\n",
":\n",
"\n",
"\n",
"125\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 1077.71 ms\n",
"llama_print_timings: sample time = 10.60 ms / 20 runs ( 0.53 ms per token, 1886.26 tokens per second)\n",
"llama_print_timings: prompt eval time = 3661.75 ms / 12 tokens ( 305.15 ms per token, 3.28 tokens per second)\n",
"llama_print_timings: eval time = 2468.01 ms / 19 runs ( 129.90 ms per token, 7.70 tokens per second)\n",
"llama_print_timings: total time = 3133.11 ms / 31 tokens\n"