"objc[10142]: Class GGMLMetalClass is implemented in both /Users/rlm/miniforge3/envs/llama/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libreplit-mainline-metal.dylib (0x2a0c4c208) and /Users/rlm/miniforge3/envs/llama/lib/python3.9/site-packages/llama_cpp/libllama.dylib (0x2c28bc208). One of the two will be used. Which one is undefined.\n",
"llama.cpp: loading model from /Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin\n",
"llama_model_load_internal: format = ggjt v3 (latest)\n",
"* [Function-calling](https://github.com/MeetKai/functionary/tree/main) for use-cases like extraction or tagging\n",
"\n"
"In addition, [here](https://blog.langchain.dev/using-langsmith-to-support-fine-tuning-of-open-source-llms/) is an overview of fine-tuning, which can make use of open-source LLMs."
"LangChain has [integrations](https://integrations.langchain.com/) with many open source LLMs that can be run locally.\n",
"\n",
"For example, here we show how to run `GPT4All` or `Llama-v2` locally (e.g., on your laptop) using local embeddings and a local LLM.\n",
"See [here](docs/guides/local_llms) for setup instructions for these LLMs. \n",
"\n",
"For example, here we show how to run `GPT4All` or `LLaMA2` locally (e.g., on your laptop) using local embeddings and a local LLM.\n",
"\n",
"## Document Loading \n",
"\n",
@ -25,7 +27,7 @@
"metadata": {},
"outputs": [],
"source": [
"pip install gpt4all chromadb"
"pip install gpt4all chromadb langchainhub"
]
},
{
@ -40,7 +42,7 @@
},
{
"cell_type": "code",
"execution_count": 24,
"execution_count": 3,
"id": "f8cf5765",
"metadata": {},
"outputs": [],
@ -66,7 +68,7 @@
},
{
"cell_type": "code",
"execution_count": 25,
"execution_count": 5,
"id": "fdce8923",
"metadata": {},
"outputs": [
@ -76,6 +78,13 @@
"text": [
"Found model file at /Users/rlm/.cache/gpt4all/ggml-all-MiniLM-L6-v2-f16.bin\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"objc[31511]: Class GGMLMetalClass is implemented in both /Users/rlm/miniforge3/envs/llama2/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libreplit-mainline-metal.dylib (0x14f4e8208) and /Users/rlm/miniforge3/envs/llama2/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libllamamodel-mainline-metal.dylib (0x14f5fc208). One of the two will be used. Which one is undefined.\n"
]
}
],
"source": [
@ -95,7 +104,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 6,
"id": "b0c55e98",
"metadata": {},
"outputs": [
@ -105,7 +114,7 @@
"4"
]
},
"execution_count": 10,
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
@ -118,7 +127,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 7,
"id": "32b43339",
"metadata": {},
"outputs": [
@ -128,7 +137,7 @@
"Document(page_content='Task decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\\\\n1.\", \"What are the subgoals for achieving XYZ?\", (2) by using task-specific instructions; e.g. \"Write a story outline.\" for writing a novel, or (3) with human inputs.', metadata={'description': 'Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\\nAgent System Overview In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:', 'language': 'en', 'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'title': \"LLM Powered Autonomous Agents | Lil'Log\"})"
]
},
"execution_count": 11,
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
@ -144,9 +153,15 @@
"source": [
"## Model \n",
"\n",
"### Llama-v2\n",
"### LLaMA2\n",
"\n",
"Download a GGML converted model (e.g., [here](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/tree/main))."
"Note: new versions of `llama-cpp-python` use GGUF model files (see [here](https://github.com/abetlen/llama-cpp-python/pull/633)).\n",
"\n",
"If you have an existing GGML model, see [here](docs/integrations/llms/llamacpp) for instructions on converting it to GGUF.\n",
" \n",
"Alternatively, you can download a GGUF-converted model (e.g., [here](https://huggingface.co/TheBloke)).\n",
"\n",
"Finally, install `llama-cpp-python` as described in detail [here](docs/guides/local_llms)."
"ggml_metal_add_buffer: allocated 'data ' buffer, size = 6984.06 MB, (35852.94 / 21845.34), warning: current allocated size is greater than the recommended max working set size\n",
"ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1026.00 MB, (36878.94 / 21845.34), warning: current allocated size is greater than the recommended max working set size\n",
"ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1602.00 MB, (38480.94 / 21845.34), warning: current allocated size is greater than the recommended max working set size\n",
"ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 298.00 MB, (38778.94 / 21845.34), warning: current allocated size is greater than the recommended max working set size\n",
"ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB, (39290.94 / 21845.34), warning: current allocated size is greater than the recommended max working set size\n"
]
}
],
"outputs": [],
"source": [
"n_gpu_layers = 1  # With Metal, setting this to 1 is enough.\n",
"n_batch = 512  # Should be between 1 and n_ctx; consider the amount of RAM on your Apple Silicon chip.\n",
@ -288,7 +235,7 @@
"\n",
"# Make sure the model path is correct for your system!\n",
"[Stephen Colbert]: Yo, this is Stephen Colbert, known for my comedy show. I'm here to put some sense in your mind, like an enema do-go. Your opponent? A man of laughter and witty quips, John Oliver! Now let's see who gets the most laughs while taking shots at each other\n",
"\n",
"Setting: The Late Show with Stephen Colbert. The studio audience is filled with fans of both comedians, and the energy is electric. The two comedians are seated at a table, ready to begin their epic rap battle.\n",
"[John Oliver]: Yo, this is John Oliver, known for my own comedy show. I'm here to take your mind on an adventure through wit and humor. But first, allow me to you to our contestant: Stephen Colbert! His show has been around since the '90s, but it's time to see who can out-rap whom\n",
"\n",
"Stephen Colbert: (smirking) Oh, you think you can take me down, John? You're just a Brit with a funny accent, and I'm the king of comedy!\n",
"John Oliver: (grinning) Oh, you think you're tough, Stephen? You're just a has-been from South Carolina, and I'm the future of comedy!\n",
"The battle begins, with each comedian delivering clever rhymes and witty insults. Here are a few lines that might be included:\n",
"Stephen Colbert: (rapping) You may have a big brain, John, but you can't touch my charm / I've got the audience in stitches, while you're just a blemish on the screen / Your accent is so thick, it's like trying to hear a speech through a mouthful of marshmallows / You may have"
"[Stephen Colbert]: You claim to be a witty man, John Oliver, with your British charm and clever remarks. But my knows that I'm America's funnyman! Who's the one taking you? Nobody!\n",
"\n",
"[John Oliver]: Hey Stephen Colbert, don't get too cocky. You may"
]
},
{
@ -342,29 +293,26 @@
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 2201.54 ms\n",
"llama_print_timings: sample time = 182.54 ms / 256 runs ( 0.71 ms per token, 1402.41 tokens per second)\n",
"llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)\n",
"llama_print_timings: eval time = 8484.62 ms / 256 runs ( 33.14 ms per token, 30.17 tokens per second)\n",
"llama_print_timings: total time = 9000.62 ms\n"
"llama_print_timings: load time = 4481.74 ms\n",
"llama_print_timings: sample time = 183.05 ms / 256 runs ( 0.72 ms per token, 1398.53 tokens per second)\n",
"llama_print_timings: prompt eval time = 456.05 ms / 13 tokens ( 35.08 ms per token, 28.51 tokens per second)\n",
"llama_print_timings: eval time = 7375.20 ms / 255 runs ( 28.92 ms per token, 34.58 tokens per second)\n",
"llama_print_timings: total time = 8388.92 ms\n"
]
},
{
"data": {
"text/plain": [
"\"\\nSetting: The Late Show with Stephen Colbert. The studio audience is filled with fans of both comedians, and the energy is electric. The two comedians are seated at a table, ready to begin their epic rap battle.\\n\\nStephen Colbert: (smirking) Oh, you think you can take me down, John? You're just a Brit with a funny accent, and I'm the king of comedy!\\nJohn Oliver: (grinning) Oh, you think you're tough, Stephen? You're just a has-been from South Carolina, and I'm the future of comedy!\\nThe battle begins, with each comedian delivering clever rhymes and witty insults. Here are a few lines that might be included:\\nStephen Colbert: (rapping) You may have a big brain, John, but you can't touch my charm / I've got the audience in stitches, while you're just a blemish on the screen / Your accent is so thick, it's like trying to hear a speech through a mouthful of marshmallows / You may have\""
"\"by jonathan \\n\\nHere's the hypothetical rap battle:\\n\\n[Stephen Colbert]: Yo, this is Stephen Colbert, known for my comedy show. I'm here to put some sense in your mind, like an enema do-go. Your opponent? A man of laughter and witty quips, John Oliver! Now let's see who gets the most laughs while taking shots at each other\\n\\n[John Oliver]: Yo, this is John Oliver, known for my own comedy show. I'm here to take your mind on an adventure through wit and humor. But first, allow me to you to our contestant: Stephen Colbert! His show has been around since the '90s, but it's time to see who can out-rap whom\\n\\n[Stephen Colbert]: You claim to be a witty man, John Oliver, with your British charm and clever remarks. But my knows that I'm America's funnyman! Who's the one taking you? Nobody!\\n\\n[John Oliver]: Hey Stephen Colbert, don't get too cocky. You may\""
]
},
"execution_count": 30,
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prompt = \"\"\"\n",
"Question: A rap battle between Stephen Colbert and John Oliver\n",
"\"\"\"\n",
"llm(prompt)"
"llm(\"Simulate a rap battle between Stephen Colbert and John Oliver\")"
]
},
{
@ -389,85 +337,10 @@
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4a24eef1",
"execution_count": null,
"id": "57c1aec0-04c7-479e-b9bf-af3c547ba0a3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found model file at /Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"objc[47842]: Class GGMLMetalClass is implemented in both /Users/rlm/anaconda3/envs/lcn2/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libreplit-mainline-metal.dylib (0x29f48c208) and /Users/rlm/anaconda3/envs/lcn2/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libllamamodel-mainline-metal.dylib (0x29f970208). One of the two will be used. Which one is undefined.\n",
"llama.cpp: using Metal\n",
"llama.cpp: loading model from /Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin\n",
"llama_model_load_internal: format = ggjt v3 (latest)\n",
"`chain_type=\"stuff\"` (see [here](https://python.langchain.com/docs/modules/chains/document/stuff)) means that all the docs will be added (stuffed) into a prompt."
]
},
{
"cell_type": "markdown",
"id": "3cce6977-52e7-4944-89b4-c161d04f6698",
"metadata": {},
"source": [
"We can also use the LangChain Prompt Hub to store and fetch prompts that are model-specific.\n",
"\n",
"This will work with your [LangSmith API key](https://docs.smith.langchain.com/).\n",
"\n",
"Let's try with a default RAG prompt, [here](https://smith.langchain.com/hub/rlm/rag-prompt)."
" Hi there! There are three main approaches to task decomposition. One is using LLM with simple prompting like \"Steps for XYZ.\" or \"What are the subgoals for achieving XYZ?\" Another approach is by using task-specific instructions, such as \"Write a story outline\" for writing a novel. Finally, task decomposition can also be done with human inputs. Thanks for asking!"
"\n",
"Task can be done by down a task into smaller subtasks, using simple prompting like \"Steps for XYZ.\" or task-specific like \"Write a story outline\" for writing a novel."
]
},
{
@ -598,43 +493,114 @@
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 1191.88 ms\n",
"llama_print_timings: sample time = 61.21 ms / 85 runs ( 0.72 ms per token, 1388.64 tokens per second)\n",
"llama_print_timings: prompt eval time = 8014.11 ms / 267 tokens ( 30.02 ms per token, 33.32 tokens per second)\n",
"llama_print_timings: eval time = 2908.17 ms / 84 runs ( 34.62 ms per token, 28.88 tokens per second)\n",
"llama_print_timings: total time = 11096.23 ms\n"
"llama_print_timings: load time = 11326.20 ms\n",
"llama_print_timings: sample time = 33.03 ms / 47 runs ( 0.70 ms per token, 1422.86 tokens per second)\n",
"llama_print_timings: prompt eval time = 1387.31 ms / 242 tokens ( 5.73 ms per token, 174.44 tokens per second)\n",
"llama_print_timings: eval time = 1321.62 ms / 46 runs ( 28.73 ms per token, 34.81 tokens per second)\n",
"llama_print_timings: total time = 2801.08 ms\n"
]
},
{
"data": {
"text/plain": [
"{'output_text': ' Hi there! There are three main approaches to task decomposition. One is using LLM with simple prompting like \"Steps for XYZ.\" or \"What are the subgoals for achieving XYZ?\" Another approach is by using task-specific instructions, such as \"Write a story outline\" for writing a novel. Finally, task decomposition can also be done with human inputs. Thanks for asking!'}"
"{'output_text': '\\nTask can be done by down a task into smaller subtasks, using simple prompting like \"Steps for XYZ.\" or task-specific like \"Write a story outline\" for writing a novel.'}"
"Now, let's try with [a prompt specifically for LLaMA](https://smith.langchain.com/hub/rlm/rag-prompt-llama), which [includes special tokens](https://huggingface.co/blog/llama2#how-to-prompt-llama-2)."
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "78f6862d-b7a6-4e03-84e4-45667185bf9b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ChatPromptTemplate(input_variables=['question', 'context'], output_parser=None, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['question', 'context'], output_parser=None, partial_variables={}, template=\"[INST]<<SYS>> You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.<</SYS>> \\nQuestion: {question} \\nContext: {context} \\nAnswer: [/INST]\", template_format='f-string', validate_template=True), additional_kwargs={})])"
" Sure, I'd be happy to help! Based on the context, here are some to task:\n",
"\n",
"1. LLM with simple prompting: This using a large model (LLM) with simple prompts like \"Steps for XYZ\" or \"What are the subgoals for achieving XYZ?\" to decompose tasks into smaller steps.\n",
"2. Task-specific: Another is to use task-specific, such as \"Write a story outline\" for writing a novel, to guide the of tasks.\n",
"3. Human inputs:, human inputs can be used to supplement the process, in cases where the task a high degree of creativity or expertise.\n",
"\n",
"As fores in long-term and task, one major is that LLMs to adjust plans when faced with errors, making them less robust to humans who learn from trial and error."
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 11326.20 ms\n",
"llama_print_timings: sample time = 144.81 ms / 207 runs ( 0.70 ms per token, 1429.47 tokens per second)\n",
"llama_print_timings: prompt eval time = 1506.13 ms / 258 tokens ( 5.84 ms per token, 171.30 tokens per second)\n",
"llama_print_timings: eval time = 6231.92 ms / 206 runs ( 30.25 ms per token, 33.06 tokens per second)\n",
"llama_print_timings: total time = 8158.41 ms\n"
]
},
{
"data": {
"text/plain": [
"{'output_text': ' Sure, I\\'d be happy to help! Based on the context, here are some to task:\\n\\n1. LLM with simple prompting: This using a large model (LLM) with simple prompts like \"Steps for XYZ\" or \"What are the subgoals for achieving XYZ?\" to decompose tasks into smaller steps.\\n2. Task-specific: Another is to use task-specific, such as \"Write a story outline\" for writing a novel, to guide the of tasks.\\n3. Human inputs:, human inputs can be used to supplement the process, in cases where the task a high degree of creativity or expertise.\\n\\nAs fores in long-term and task, one major is that LLMs to adjust plans when faced with errors, making them less robust to humans who learn from trial and error.'}"
"The three approaches to Task decomposition are LLMs with simple prompting, task-specific instructions, or human inputs. Thanks for asking!"
" Sure! Based on the context, here's my answer to your:\n",
"\n",
"There are several to task,:\n",
"\n",
"1. LLM-based with simple prompting, such as \"Steps for XYZ\" or \"What are the subgoals for achieving XYZ?\"\n",
"2. Task-specific, like \"Write a story outline\" for writing a novel.\n",
"3. Human inputs to guide the process.\n",
"\n",
"These can be used to decompose complex tasks into smaller, more manageable subtasks, which can help improve the and effectiveness of task. However, long-term and task can being due to the need to plan over a lengthy history and explore the space., LLMs may to adjust plans when faced with errors, making them less robust to human learners who can learn from trial and error."
]
},
{
@ -695,21 +668,21 @@
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 1191.88 ms\n",
"llama_print_timings: sample time = 22.78 ms / 31 runs ( 0.73 ms per token, 1360.66 tokens per second)\n",
"llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)\n",
"llama_print_timings: eval time = 1320.23 ms / 31 runs ( 42.59 ms per token, 23.48 tokens per second)\n",
"llama_print_timings: total time = 1387.70 ms\n"
"llama_print_timings: load time = 11326.20 ms\n",
"llama_print_timings: sample time = 139.20 ms / 200 runs ( 0.70 ms per token, 1436.76 tokens per second)\n",
"llama_print_timings: prompt eval time = 1532.26 ms / 258 tokens ( 5.94 ms per token, 168.38 tokens per second)\n",
"llama_print_timings: eval time = 5977.62 ms / 199 runs ( 30.04 ms per token, 33.29 tokens per second)\n",
"llama_print_timings: total time = 7916.21 ms\n"
]
},
{
"data": {
"text/plain": [
"{'query': 'What are the approaches to Task Decomposition?',\n",
" 'result': ' \\nThe three approaches to Task decomposition are LLMs with simple prompting, task-specific instructions, or human inputs. Thanks for asking!'}"
" 'result': ' Sure! Based on the context, here\\'s my answer to your:\\n\\nThere are several to task,:\\n\\n1. LLM-based with simple prompting, such as \"Steps for XYZ\" or \"What are the subgoals for achieving XYZ?\"\\n2. Task-specific, like \"Write a story outline\" for writing a novel.\\n3. Human inputs to guide the process.\\n\\nThese can be used to decompose complex tasks into smaller, more manageable subtasks, which can help improve the and effectiveness of task. However, long-term and task can being due to the need to plan over a lengthy history and explore the space., LLMs may to adjust plans when faced with errors, making them less robust to human learners who can learn from trial and error.'}"