# Ollama

[Ollama](https://ollama.ai/) lets you run open-source large language models, such as Llama 2, locally.

Ollama bundles model weights, configuration, and data into a single package, defined by a Modelfile. 

It optimizes setup and configuration details, including GPU usage.

For a complete list of supported models and model variants, see the [Ollama model library](https://ollama.ai/library).

## Setup

First, follow [these instructions](https://github.com/jmorganca/ollama) to set up and run a local Ollama instance:

* [Download](https://ollama.ai/download)
* Fetch a model via `ollama pull `
* e.g., for `Llama-7b`: `ollama pull llama2`
* This downloads the most basic version of the model (e.g., the smallest parameter count with 4-bit quantization)
* On Mac, it is downloaded to:

`~/.ollama/models/manifests/registry.ollama.ai/library//latest`

* You can also pull a particular version, e.g., `ollama pull vicuna:13b-v1.5-16k-q4_0`
* The file is then stored with the model version in place of `latest`:

`~/.ollama/models/manifests/registry.ollama.ai/library/vicuna/13b-v1.5-16k-q4_0`
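The naming rule described above (model family, then either an explicit version or `latest`) can be sketched in a few lines of Python. This is only an illustration: `manifest_path` and `MANIFEST_ROOT` are hypothetical names, not part of Ollama, and the directory layout is assumed from the two Mac example paths shown here.

```python
from pathlib import PurePosixPath

# Assumed layout, inferred from the two example paths above (Mac).
MANIFEST_ROOT = PurePosixPath("~/.ollama/models/manifests/registry.ollama.ai/library")


def manifest_path(model: str) -> str:
    """Map a name passed to `ollama pull` to the manifest path we expect.

    Hypothetical helper for illustration; Ollama does not ship this function.
    """
    # Split "family:version"; a bare family name falls back to "latest".
    family, _, version = model.partition(":")
    return str(MANIFEST_ROOT / family / (version or "latest"))


print(manifest_path("llama2"))
print(manifest_path("vicuna:13b-v1.5-16k-q4_0"))
```

The second call reproduces the `vicuna` path shown above; the first shows where a pull without an explicit version would be expected to land.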

You can easily access models in a few ways:

1/ If the app is running:
* All of your local models are automatically served on `localhost:11434`
* Select your model when setting `llm = Ollama(..., model=":")`
* If you set `llm = Ollama(..., model="…`

from langchain.prompts import PromptTemplate

# Llama 2 chat prompt: system instructions sit inside <<SYS>> ... <</SYS>>,
# and the whole turn is wrapped in [INST] ... [/INST]
template = """[INST] <<SYS>> Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use three sentences maximum and keep the answer as concise as possible. <</SYS>>
{context}
Question: {question}
Helpful Answer:[/INST]"""
QA_CHAIN_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template=template,
)
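To see what the model actually receives, the template can be rendered by hand. The sketch below uses plain `str.format` instead of `PromptTemplate` and a shortened system message. The `[INST]` / `<<SYS>>` wrapper follows the Llama 2 chat format, and the names `TEMPLATE` and `render_prompt` are illustrative assumptions rather than anything defined by LangChain.

```python
# Shortened stand-in for the QA template; the wrapper tokens are assumed
# from the Llama 2 chat format.
TEMPLATE = """[INST] <<SYS>> Answer using only the context below. <</SYS>>
{context}
Question: {question}
Helpful Answer:[/INST]"""


def render_prompt(context: str, question: str) -> str:
    """Fill both placeholders, as a two-variable prompt template would."""
    return TEMPLATE.format(context=context, question=question)


prompt = render_prompt(
    context="Task decomposition can be done by simple prompting.",
    question="What are the approaches to Task Decomposition?",
)
print(prompt)
```

Printing the rendered string is a quick way to check that the retrieved context and the question end up inside the instruction block before the chain is run against a real model.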

In [13]:
# Chat model
from langchain.chat_models import ChatOllama
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
chat_model = ChatOllama(
    model="llama2:13b",
    verbose=True,
    # Stream tokens to stdout as they are generated
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
)

In [14]:
# QA chain (assumes `vectorstore` was created in an earlier cell)
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(
    chat_model,
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)

In [15]:
question = "What are the various approaches to Task Decomposition for AI Agents?"
result = qa_chain({"query": question})

 Based on the provided context, there are three approaches to task decomposition for AI agents:

1. LLM with simple prompting, such as "Steps for XYZ." or "What are the subgoals for achieving XYZ?"
2. Task-specific instructions, such as "Write a story outline" for writing a novel.
3. Human inputs.
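Conceptually, the chain retrieves a few relevant documents, joins them into the `{context}` slot of the prompt, and passes the filled prompt to the model. The sketch below mimics that flow in plain Python under loud assumptions: `retrieve` is a hypothetical word-overlap ranker standing in for the vector store, `answer` is an illustrative name, and `llm` is any callable that maps a prompt string to a string. It is not how `RetrievalQA` is implemented internally.

```python
from typing import Callable, List

PROMPT = "{context}\nQuestion: {question}\nHelpful Answer:"


def retrieve(question: str, docs: List[str], k: int = 2) -> List[str]:
    """Hypothetical retriever: rank documents by words shared with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def answer(question: str, docs: List[str], llm: Callable[[str], str]) -> str:
    """Stuff the retrieved documents into the prompt and call the model."""
    context = "\n\n".join(retrieve(question, docs))
    return llm(PROMPT.format(context=context, question=question))


docs = [
    "Task decomposition can use simple prompting.",
    "Ollama serves local models.",
    "Task decomposition can use human inputs.",
]
# An echo "model" returns the prompt unchanged, so the stuffed context is visible.
stuffed = answer("What are approaches to task decomposition?", docs, llm=lambda p: p)
print(stuffed)
```

Swapping the echo callable for a real model call would turn this into a very small question-answering loop; the real chain adds embedding-based retrieval and callback handling on top.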

You can also log generation statistics, such as token counts and timings, by adding a custom callback handler.

In [16]:
from langchain.schema import LLMResult
from langchain.callbacks.base import BaseCallbackHandler

class GenerationStatisticsCallback(BaseCallbackHandler):
    def on_llm_end(self, response: LLMResult, **kwargs) -> None:
        # Print the statistics attached to the first generation
        print(response.generations[0][0].generation_info)

callback_manager = CallbackManager([StreamingStdOutCallbackHandler(), GenerationStatisticsCallback()])

chat_model = ChatOllama(
    model="llama2:13b-chat",
    verbose=True,
    callback_manager=callback_manager,
)

qa_chain = RetrievalQA.from_chain_type(
    chat_model,
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)

question = "What are the approaches to Task Decomposition?"
result = qa_chain({"query": question})

 Based on the given context, here is the answer to the question "What are the approaches to Task Decomposition?"

There are three approaches to task decomposition:

1. LLM with simple prompting, such as "Steps for XYZ." or "What are the subgoals for achieving XYZ?"
2. Using task-specific instructions, like "Write a story outline" for writing a novel.
3. With human inputs.{'model': 'llama2:13b-chat', 'created_at': '2023-08-23T15:37:51.469127Z', 'done': True, 'context': [1, 29871, 1, 29961, 25580, 29962, 518, 25580, 29962, 518, 25580, 29962, 3532, 14816, 29903, 6778, 4803, 278, 1494, 12785, 310, 3030, 304, 1234, 278, 1139, 472, 278, 1095, 29889, 29871, 13, 3644, 366, 1016, 29915, 29873, 1073, 278, 1234, 29892, 925, 1827, 393, 366, 1016, 29915, 29873, 1073, 29892, 1016, 29915, 29873, 1018, 304, 1207, 701, 385, 1234, 29889, 29871, 13, 11403, 2211, 25260, 7472, 322, 3013, 278, 1234, 408, 3022, 895, 408, 1950, 29889, 529, 829, 14816, 29903, 6778, 13, 5398, 26227, 508, 367, 2309, 313, 29896, 

`eval_count` / (`eval_duration` / 1e9) gives tokens per second (`tok / s`), because `eval_duration` is reported in nanoseconds.

In [17]:
98 / (3229641000/1000/1000/1000)

30.343929867127645
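The same arithmetic can be wrapped in a small helper that reads the two fields directly from the statistics dictionary. This is a minimal sketch: `tokens_per_second` is a hypothetical name, and it assumes the dictionary carries `eval_count` and `eval_duration` (in nanoseconds), with the values below taken from the calculation above.

```python
def tokens_per_second(generation_info: dict) -> float:
    """Throughput from generation statistics; `eval_duration` is in nanoseconds."""
    seconds = generation_info["eval_duration"] / 1e9
    return generation_info["eval_count"] / seconds


# Values from the calculation above.
stats = {"eval_count": 98, "eval_duration": 3229641000}
rate = tokens_per_second(stats)
print(f"{rate:.2f} tok/s")  # 30.34 tok/s
```

Calling this inside `on_llm_end` with `response.generations[0][0].generation_info` would report throughput for each run instead of printing the raw dictionary.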