"This notebook provides a quick overview for getting started with chat model intergrated with [llama cpp python](https://github.com/abetlen/llama-cpp-python)\n",
"\n",
"An example below demonstrating how to implement with the open-source Llama3 Instruct 8B"
"This notebook provides a quick overview for getting started with chat model intergrated with [llama cpp python](https://github.com/abetlen/llama-cpp-python)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Setup\n",
"\n",
"To get started and use **all** the features show below, we reccomend using a model that has been fine-tuned for tool-calling.\n",
"\n",
"We will use [\n",
"Hermes-2-Pro-Llama-3-8B-GGUF](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF) from NousResearch. \n",
"\n",
"> Hermes 2 Pro is an upgraded version of Nous Hermes 2, consisting of an updated and cleaned version of the OpenHermes 2.5 Dataset, as well as a newly introduced Function Calling and JSON Mode dataset developed in-house. This new version of Hermes maintains its excellent general task and conversation capabilities - but also excels at Function Calling\n",
"* [Using local models with RAG](https://python.langchain.com/v0.1/docs/use_cases/question_answering/local_retrieval_qa/)\n",
"\n",
"### Installation\n",
"\n",
"The LangChain OpenAI integration lives in the `langchain-community` and `llama-cpp-python` packages:"
]
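},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain-community llama-cpp-python"
]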
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/tni5hc/Documents/langchain_llamacpp/SanctumAI-meta-llama-3-8b-instruct.Q8_0.gguf (version GGUF V3 (latest))\n",
"llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.\n",
" n_batch=300, # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.\n",
@ -195,32 +105,9 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 1077.71 ms\n",
"llama_print_timings: sample time = 21.82 ms / 39 runs ( 0.56 ms per token, 1787.35 tokens per second)\n",
"llama_print_timings: prompt eval time = 1077.65 ms / 37 tokens ( 29.13 ms per token, 34.33 tokens per second)\n",
"llama_print_timings: eval time = 8403.75 ms / 38 runs ( 221.15 ms per token, 4.52 tokens per second)\n",
"llama_print_timings: total time = 9689.66 ms / 75 tokens\n"
]
},
{
"data": {
"text/plain": [
"AIMessage(content='Je adore le programmation.\\n\\n(Note: \"programmation\" is used in both formal and informal contexts, but it\\'s generally accepted as equivalent of saying you like computer science or coding.)', response_metadata={'finish_reason': 'stop'}, id='run-e9e03b94-f29f-4c1d-8483-e23a46acb556-0')"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"outputs": [],
"source": [
"messages = [\n",
" (\n",
"        \"system\",\n",
"        \"You are a helpful assistant that translates English to French. Translate the user sentence.\",\n",
"    ),\n",
"    (\"human\", \"I love programming.\"),\n",
"]\n",
"\n",
"ai_msg = llm.invoke(messages)\n",
"ai_msg"
]
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Je adore le programmation.\n",
"J'aime programmer. (In France, \"programming\" is often used in its original sense of scheduling or organizing events.) \n",
"\n",
"(Note: \"programmation\" is used in both formal and informal contexts, but it's generally accepted as equivalent of saying you like computer science or coding.)\n"
"If you meant computer-programming: \n",
"Je suis amoureux de la programmation informatique.\n",
"\n",
"(You might also say simply 'programmation', which would be understood as both meanings - depending on context).\n"
]
}
],
"source": [
"print(ai_msg.content)"
]
},
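{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tool calling\n",
"\n",
"A minimal sketch of tool calling with a tool-calling fine-tune like Hermes-2-Pro (the `get_weather` tool is a hypothetical stub, and forcing the tool via `tool_choice` is an assumption about how llama.cpp-backed models are best driven):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.tools import tool\n",
"\n",
"\n",
"@tool\n",
"def get_weather(location: str) -> str:\n",
"    \"\"\"Get the current weather in a given location.\"\"\"\n",
"    return f\"It is sunny in {location}.\"  # hypothetical stub tool\n",
"\n",
"\n",
"llm_with_tools = llm.bind_tools(\n",
"    [get_weather],\n",
"    # Assumption: force the single tool with an explicit tool_choice.\n",
"    tool_choice={\"type\": \"function\", \"function\": {\"name\": \"get_weather\"}},\n",
")\n",
"\n",
"llm_with_tools.invoke(\"What is the weather like in Paris?\").tool_calls"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Structured output\n",
"\n",
"The cells below use a `structured_llm`. A minimal sketch of how it could be built (the `Joke` schema is an assumption inferred from the dict-shaped result further down; passing a dict schema from `convert_to_openai_tool` is what makes the model return a plain dict rather than a Pydantic object):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.utils.function_calling import convert_to_openai_tool\n",
"from pydantic import BaseModel, Field\n",
"\n",
"\n",
"class Joke(BaseModel):\n",
"    \"\"\"A joke with a setup and a punchline.\"\"\"\n",
"\n",
"    setup: str = Field(description=\"The setup of the joke\")\n",
"    punchline: str = Field(description=\"The punchline of the joke\")\n",
"\n",
"\n",
"dict_schema = convert_to_openai_tool(Joke)\n",
"structured_llm = llm.with_structured_output(dict_schema)"
]
},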
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Llama.generate: prefix-match hit\n",
"\n",
"llama_print_timings: load time = 1077.71 ms\n",
"llama_print_timings: sample time = 29.23 ms / 52 runs ( 0.56 ms per token, 1778.75 tokens per second)\n",
"llama_print_timings: prompt eval time = 869.38 ms / 17 tokens ( 51.14 ms per token, 19.55 tokens per second)\n",
"llama_print_timings: eval time = 6694.18 ms / 51 runs ( 131.26 ms per token, 7.62 tokens per second)\n",
"llama_print_timings: total time = 7830.86 ms / 68 tokens\n"
]
},
{
"data": {
"text/plain": [
"AIMessage(content='Ich liebe auch Programmieren! (Translation: I also like coding!) Do you have any favorite languages or projects? Ich bin hier, um dir zu helfen und über deine Lieblingsprogrammierthemen sprechen können wir gerne weiter machen... !)', response_metadata={'finish_reason': 'stop'}, id='run-922c4cad-368f-41ba-9db9-eacb41d37cb2-0')"
" \"What weighs more a pound of bricks or a pound of feathers ?\"\n",
")"
"result = structured_llm.invoke(\"Tell me a joke about birds\")\n",
"result"
]
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'answer': \"a pound is always the same weight, regardless of what it's made up off. So both options are equal in terms of their mass.\", 'justification': ''}\n"
]
"data": {
"text/plain": [
"{'setup': '- Why did the chicken cross the playground?',\n",
" 'punchline': '\\n\\n- To get to its gilded cage on the other side!'}"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(result)"
"result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Streaming\n",
"\n",
"`ChatLlamaCpp` also supports streaming tokens via `stream`:\n"
]
},
{
"cell_type": "code",
"execution_count": 23,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Llama.generate: prefix-match hit\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"The\n",
" answer\n",
" to\n",
" the\n",
" multiplication\n",
" problem\n",
" \"\n",
"What\n",
"'s\n",
" \n",
"25\n",
" x\n",
" \n",
"5\n",
"?\"\n",
" would\n",
" be\n",
":\n",
"\n",
"\n",
"125\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 1077.71 ms\n",
"llama_print_timings: sample time = 10.60 ms / 20 runs ( 0.53 ms per token, 1886.26 tokens per second)\n",
"llama_print_timings: prompt eval time = 3661.75 ms / 12 tokens ( 305.15 ms per token, 3.28 tokens per second)\n",
"llama_print_timings: eval time = 2468.01 ms / 19 runs ( 129.90 ms per token, 7.70 tokens per second)\n",
"llama_print_timings: total time = 3133.11 ms / 31 tokens\n"