Adds Ollama as an LLM (#8829)

Adds Ollama as an LLM. Ollama can run various open-source models locally,
e.g. Llama 2 and Vicuna, and automatically handles their configuration and
GPU usage.
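
A minimal usage sketch (mirroring the notebook added in this PR):

```python
from langchain.llms import Ollama

llm = Ollama(base_url="http://localhost:11434", model="llama2")
llm("Tell me about the history of AI")
```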

@rlancemartin @hwchase17

---------

Co-authored-by: Lance Martin <lance@langchain.dev>

@@ -0,0 +1,340 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Ollama\n",
"\n",
"[Ollama](https://ollama.ai/) allows you to run open-source large language models, such as Llama 2, locally.\n",
"\n",
"Ollama bundles model weights, configuration, and data into a single package, defined by a Modelfile. \n",
"\n",
"It optimizes setup and configuration details, including GPU usage.\n",
"\n",
"For a complete list of supported models and model variants, see the [Ollama model library](https://github.com/jmorganca/ollama#model-library).\n",
"\n",
"## Setup\n",
"\n",
"First, follow [these instructions](https://github.com/jmorganca/ollama) to set up and run a local Ollama instance:\n",
"\n",
"* [Download](https://ollama.ai/download)\n",
"* Fetch a model, e.g., `Llama-7b`: `ollama pull llama2`\n",
"* Run `ollama run llama2`\n",
"\n",
"\n",
"## Usage\n",
"\n",
"You can see a full list of supported parameters on the [API reference page](https://api.python.langchain.com/en/latest/llms/langchain.llms.ollama.Ollama.html)."
]
},
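{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generation options such as `temperature`, `num_ctx`, `top_k`, and `top_p` can also be passed directly to the constructor. A minimal, illustrative sketch (the option names are taken from the API reference above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.llms import Ollama\n",
"\n",
"custom_llm = Ollama(base_url=\"http://localhost:11434\",\n",
"                    model=\"llama2\",\n",
"                    temperature=0.8,\n",
"                    num_ctx=2048,\n",
"                    top_k=40,\n",
"                    top_p=0.9)"
]
},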
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"from langchain.llms import Ollama\n",
"from langchain.callbacks.manager import CallbackManager\n",
"from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler \n",
"llm = Ollama(base_url=\"http://localhost:11434\", \n",
" model=\"llama2\", \n",
" callback_manager = CallbackManager([StreamingStdOutCallbackHandler()]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With `StreamingStdOutCallbackHandler`, you will see tokens streamed."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Great! The history of Artificial Intelligence (AI) is a fascinating and complex topic that spans several decades. Here's a brief overview:\n",
"\n",
"1. Early Years (1950s-1960s): The term \"Artificial Intelligence\" was coined in 1956 by computer scientist John McCarthy. However, the concept of AI dates back to ancient Greece, where mythical creatures like Talos and Hephaestus were created to perform tasks without any human intervention. In the 1950s and 1960s, researchers began exploring ways to replicate human intelligence using computers, leading to the development of simple AI programs like ELIZA (1966) and PARRY (1972).\n",
"2. Rule-Based Systems (1970s-1980s): As computing power increased, researchers developed rule-based systems, such as Mycin (1976), which could diagnose medical conditions based on a set of rules. This period also saw the rise of expert systems, like EDICT (1985), which mimicked human experts in specific domains.\n",
"3. Machine Learning (1990s-2000s): With the advent of big data and machine learning algorithms, AI evolved to include neural networks, decision trees, and other techniques for training models on large datasets. This led to the development of applications like speech recognition (e.g., Siri, Alexa), image recognition (e.g., Google Image Search), and natural language processing (e.g., chatbots).\n",
"4. Deep Learning (2010s-present): The rise of deep learning techniques, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), has enabled AI to perform complex tasks like image and speech recognition, natural language processing, and even autonomous driving. Companies like Google, Facebook, and Baidu have invested heavily in deep learning research, leading to breakthroughs in areas like facial recognition, object detection, and machine translation.\n",
"5. Current Trends (present-future): AI is currently being applied to various industries, including healthcare, finance, education, and entertainment. With the growth of cloud computing, edge AI, and autonomous systems, we can expect to see more sophisticated AI applications in the near future. However, there are also concerns about the ethical implications of AI, such as data privacy, algorithmic bias, and job displacement.\n",
"\n",
"Remember, AI has a long history, and its development is an ongoing process. As technology advances, we can expect to see even more innovative applications of AI in various fields."
]
},
{
"data": {
"text/plain": [
"'\\nGreat! The history of Artificial Intelligence (AI) is a fascinating and complex topic that spans several decades. Here\\'s a brief overview:\\n\\n1. Early Years (1950s-1960s): The term \"Artificial Intelligence\" was coined in 1956 by computer scientist John McCarthy. However, the concept of AI dates back to ancient Greece, where mythical creatures like Talos and Hephaestus were created to perform tasks without any human intervention. In the 1950s and 1960s, researchers began exploring ways to replicate human intelligence using computers, leading to the development of simple AI programs like ELIZA (1966) and PARRY (1972).\\n2. Rule-Based Systems (1970s-1980s): As computing power increased, researchers developed rule-based systems, such as Mycin (1976), which could diagnose medical conditions based on a set of rules. This period also saw the rise of expert systems, like EDICT (1985), which mimicked human experts in specific domains.\\n3. Machine Learning (1990s-2000s): With the advent of big data and machine learning algorithms, AI evolved to include neural networks, decision trees, and other techniques for training models on large datasets. This led to the development of applications like speech recognition (e.g., Siri, Alexa), image recognition (e.g., Google Image Search), and natural language processing (e.g., chatbots).\\n4. Deep Learning (2010s-present): The rise of deep learning techniques, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), has enabled AI to perform complex tasks like image and speech recognition, natural language processing, and even autonomous driving. Companies like Google, Facebook, and Baidu have invested heavily in deep learning research, leading to breakthroughs in areas like facial recognition, object detection, and machine translation.\\n5. Current Trends (present-future): AI is currently being applied to various industries, including healthcare, finance, education, and entertainment. With the growth of cloud computing, edge AI, and autonomous systems, we can expect to see more sophisticated AI applications in the near future. However, there are also concerns about the ethical implications of AI, such as data privacy, algorithmic bias, and job displacement.\\n\\nRemember, AI has a long history, and its development is an ongoing process. As technology advances, we can expect to see even more innovative applications of AI in various fields.'"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llm(\"Tell me about the history of AI\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## RAG\n",
"\n",
"We can use Olama with RAG, [just as shown here](https://python.langchain.com/docs/use_cases/question_answering/how_to/local_retrieval_qa).\n",
"\n",
"Let's use the 13b model:\n",
"\n",
"```\n",
"ollama pull llama2:13b\n",
"ollama run llama2:13b \n",
"```\n",
"\n",
"Let's also use local embeddings from `GPT4AllEmbeddings` and `Chroma`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install gpt4all chromadb"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import WebBaseLoader\n",
"loader = WebBaseLoader(\"https://lilianweng.github.io/posts/2023-06-23-agent/\")\n",
"data = loader.load()\n",
"\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)\n",
"all_splits = text_splitter.split_documents(data)"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found model file at /Users/rlm/.cache/gpt4all/ggml-all-MiniLM-L6-v2-f16.bin\n"
]
}
],
"source": [
"from langchain.vectorstores import Chroma\n",
"from langchain.embeddings import GPT4AllEmbeddings\n",
"\n",
"vectorstore = Chroma.from_documents(documents=all_splits, embedding=GPT4AllEmbeddings())"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"question = \"What are the approaches to Task Decomposition?\"\n",
"docs = vectorstore.similarity_search(question)\n",
"len(docs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain import PromptTemplate\n",
"\n",
"# Prompt\n",
"template = \"\"\"Use the following pieces of context to answer the question at the end. \n",
"If you don't know the answer, just say that you don't know, don't try to make up an answer. \n",
"Use three sentences maximum and keep the answer as concise as possible. \n",
"{context}\n",
"Question: {question}\n",
"Helpful Answer:\"\"\"\n",
"QA_CHAIN_PROMPT = PromptTemplate(\n",
" input_variables=[\"context\", \"question\"],\n",
" template=template,\n",
")\n"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [],
"source": [
"# LLM\n",
"from langchain.llms import Ollama\n",
"from langchain.callbacks.manager import CallbackManager\n",
"from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler\n",
"llm = Ollama(base_url=\"http://localhost:11434\",\n",
" model=\"llama2\",\n",
" verbose=True,\n",
" callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]))"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"# QA chain\n",
"from langchain.chains import RetrievalQA\n",
"qa_chain = RetrievalQA.from_chain_type(\n",
" llm,\n",
" retriever=vectorstore.as_retriever(),\n",
" chain_type_kwargs={\"prompt\": QA_CHAIN_PROMPT},\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Task decomposition can be approached in different ways for AI agents, including:\n",
"\n",
"1. Using simple prompts like \"Steps for XYZ.\" or \"What are the subgoals for achieving XYZ?\" to guide the LLM.\n",
"2. Providing task-specific instructions, such as \"Write a story outline\" for writing a novel.\n",
"3. Utilizing human inputs to help the AI agent understand the task and break it down into smaller steps."
]
}
],
"source": [
"question = \"What are the various approaches to Task Decomposition for AI Agents?\"\n",
"result = qa_chain({\"query\": question})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also get logging for tokens."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Task decomposition can be approached in three ways: (1) using simple prompting like \"Steps for XYZ.\\n1.\", \"What are the subgoals for achieving XYZ?\", (2) by using task-specific instructions, or (3) with human inputs.{'model': 'llama2', 'created_at': '2023-08-08T04:01:09.005367Z', 'done': True, 'context': [1, 29871, 1, 13, 9314, 14816, 29903, 6778, 13, 13, 3492, 526, 263, 8444, 29892, 3390, 1319, 322, 15993, 20255, 29889, 29849, 1234, 408, 1371, 3730, 408, 1950, 29892, 1550, 1641, 9109, 29889, 3575, 6089, 881, 451, 3160, 738, 10311, 1319, 29892, 443, 621, 936, 29892, 11021, 391, 29892, 7916, 391, 29892, 304, 27375, 29892, 18215, 29892, 470, 27302, 2793, 29889, 3529, 9801, 393, 596, 20890, 526, 5374, 635, 443, 5365, 1463, 322, 6374, 297, 5469, 29889, 13, 13, 3644, 263, 1139, 947, 451, 1207, 738, 4060, 29892, 470, 338, 451, 2114, 1474, 16165, 261, 296, 29892, 5649, 2020, 2012, 310, 22862, 1554, 451, 1959, 29889, 960, 366, 1016, 29915, 29873, 1073, 278, 1234, 304, 263, 1139, 29892, 3113, 1016, 29915, 29873, 6232, 2089, 2472, 29889, 13, 13, 29966, 829, 14816, 29903, 6778, 13, 13, 29961, 25580, 29962, 4803, 278, 1494, 12785, 310, 3030, 304, 1234, 278, 1139, 472, 278, 1095, 29889, 29871, 13, 3644, 366, 1016, 29915, 29873, 1073, 278, 1234, 29892, 925, 1827, 393, 366, 1016, 29915, 29873, 1073, 29892, 1016, 29915, 29873, 1018, 304, 1207, 701, 385, 1234, 29889, 29871, 13, 11403, 2211, 25260, 7472, 322, 3013, 278, 1234, 408, 3022, 895, 408, 1950, 29889, 29871, 13, 5398, 26227, 508, 367, 2309, 313, 29896, 29897, 491, 365, 26369, 411, 2560, 9508, 292, 763, 376, 7789, 567, 363, 1060, 29979, 29999, 7790, 29876, 29896, 19602, 376, 5618, 526, 278, 1014, 1484, 1338, 363, 3657, 15387, 1060, 29979, 29999, 29973, 613, 313, 29906, 29897, 491, 773, 3414, 29899, 14940, 11994, 29936, 321, 29889, 29887, 29889, 376, 6113, 263, 5828, 27887, 1213, 363, 5007, 263, 9554, 29892, 470, 313, 29941, 29897, 411, 5199, 10970, 29889, 13, 13, 5398, 26227, 508, 367, 2309, 313, 29896, 29897, 491, 365, 26369, 411, 2560, 9508, 292, 763, 376, 7789, 567, 363, 1060, 29979, 29999, 7790, 29876, 29896, 19602, 376, 5618, 526, 278, 1014, 1484, 1338, 363, 3657, 15387, 1060, 29979, 29999, 29973, 613, 313, 29906, 29897, 491, 773, 3414, 29899, 14940, 11994, 29936, 321, 29889, 29887, 29889, 376, 6113, 263, 5828, 27887, 1213, 363, 5007, 263, 9554, 29892, 470, 313, 29941, 29897, 411, 5199, 10970, 29889, 13, 13, 5398, 26227, 508, 367, 2309, 313, 29896, 29897, 491, 365, 26369, 411, 2560, 9508, 292, 763, 376, 7789, 567, 363, 1060, 29979, 29999, 7790, 29876, 29896, 19602, 376, 5618, 526, 278, 1014, 1484, 1338, 363, 3657, 15387, 1060, 29979, 29999, 29973, 613, 313, 29906, 29897, 491, 773, 3414, 29899, 14940, 11994, 29936, 321, 29889, 29887, 29889, 376, 6113, 263, 5828, 27887, 1213, 363, 5007, 263, 9554, 29892, 470, 313, 29941, 29897, 411, 5199, 10970, 29889, 13, 13, 1451, 16047, 267, 297, 1472, 29899, 8489, 18987, 322, 3414, 26227, 29901, 1858, 9450, 975, 263, 3309, 29891, 4955, 322, 17583, 3902, 8253, 278, 1650, 2913, 3933, 18066, 292, 29889, 365, 26369, 29879, 21117, 304, 10365, 13900, 746, 20050, 411, 15668, 4436, 29892, 3907, 963, 3109, 16424, 9401, 304, 25618, 1058, 5110, 515, 14260, 322, 1059, 29889, 13, 16492, 29901, 1724, 526, 278, 13501, 304, 9330, 897, 510, 3283, 29973, 13, 29648, 1319, 673, 29901, 518, 29914, 25580, 29962, 13, 5398, 26227, 508, 367, 26733, 297, 2211, 5837, 29901, 313, 29896, 29897, 773, 2560, 9508, 292, 763, 376, 7789, 567, 363, 1060, 29979, 29999, 7790, 29876, 29896, 19602, 376, 5618, 526, 278, 1014, 1484, 1338, 363, 
3657, 15387, 1060, 29979, 29999, 29973, 613, 313, 29906, 29897, 491, 773, 3414, 29899, 14940, 11994, 29892, 470, 313, 29941, 29897, 411, 5199, 10970, 29889, 2], 'total_duration': 1364428708, 'load_duration': 1246375, 'sample_count': 62, 'sample_duration': 44859000, 'prompt_eval_count': 1, 'eval_count': 62, 'eval_duration': 1313002000}\n"
]
}
],
"source": [
"from langchain.schema import LLMResult\n",
"from langchain.callbacks.base import BaseCallbackHandler\n",
"\n",
"class GenerationStatisticsCallback(BaseCallbackHandler):\n",
" def on_llm_end(self, response: LLMResult, **kwargs) -> None:\n",
" print(response.generations[0][0].generation_info)\n",
" \n",
"callback_manager = CallbackManager([StreamingStdOutCallbackHandler(), GenerationStatisticsCallback()])\n",
"\n",
"llm = Ollama(base_url=\"http://localhost:11434\",\n",
" model=\"llama2\",\n",
" verbose=True,\n",
" callback_manager=callback_manager)\n",
"\n",
"qa_chain = RetrievalQA.from_chain_type(\n",
" llm,\n",
" retriever=vectorstore.as_retriever(),\n",
" chain_type_kwargs={\"prompt\": QA_CHAIN_PROMPT},\n",
")\n",
"\n",
"question = \"What are the approaches to Task Decomposition?\"\n",
"result = qa_chain({\"query\": question})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`eval_count` / (`eval_duration`/10e9) gets `tok / s`"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"47.22003469910937"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"62 / (1313002000/1000/1000/1000)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@@ -59,6 +59,7 @@ from langchain.llms.modal import Modal
from langchain.llms.mosaicml import MosaicML
from langchain.llms.nlpcloud import NLPCloud
from langchain.llms.octoai_endpoint import OctoAIEndpoint
from langchain.llms.ollama import Ollama
from langchain.llms.openai import AzureOpenAI, OpenAI, OpenAIChat
from langchain.llms.openllm import OpenLLM
from langchain.llms.openlm import OpenLM
@@ -124,6 +125,7 @@ __all__ = [
"MosaicML",
"Nebula",
"NLPCloud",
"Ollama",
"OpenAI",
"OpenAIChat",
"OpenLLM",
@@ -188,6 +190,7 @@ type_to_cls_dict: Dict[str, Type[BaseLLM]] = {
"mosaic": MosaicML,
"nebula": Nebula,
"nlpcloud": NLPCloud,
"ollama": Ollama,
"openai": OpenAI,
"openlm": OpenLM,
"petals": Petals,

@@ -0,0 +1,226 @@
import json
from typing import Any, Dict, Iterator, List, Mapping, Optional
import requests
from pydantic import Extra
from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.llms.base import BaseLLM
from langchain.schema import LLMResult
from langchain.schema.language_model import BaseLanguageModel
from langchain.schema.output import GenerationChunk
def _stream_response_to_generation_chunk(
stream_response: str,
) -> GenerationChunk:
"""Convert a stream response to a generation chunk."""
parsed_response = json.loads(stream_response)
generation_info = parsed_response if parsed_response.get("done") is True else None
return GenerationChunk(
text=parsed_response.get("response", ""), generation_info=generation_info
)
class _OllamaCommon(BaseLanguageModel):
base_url = "http://localhost:11434"
"""Base url the model is hosted under."""
model: str = "llama2"
"""Model name to use."""
mirostat: Optional[int]
"""Enable Mirostat sampling for controlling perplexity.
(default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)"""
mirostat_eta: Optional[float]
"""Influences how quickly the algorithm responds to feedback
from the generated text. A lower learning rate will result in
slower adjustments, while a higher learning rate will make
the algorithm more responsive. (Default: 0.1)"""
mirostat_tau: Optional[float]
"""Controls the balance between coherence and diversity
of the output. A lower value will result in more focused and
coherent text. (Default: 5.0)"""
num_ctx: Optional[int]
"""Sets the size of the context window used to generate the
next token. (Default: 2048) """
num_gpu: Optional[int]
"""The number of GPUs to use. On macOS it defaults to 1 to
enable metal support, 0 to disable."""
num_thread: Optional[int]
"""Sets the number of threads to use during computation.
By default, Ollama will detect this for optimal performance.
It is recommended to set this value to the number of physical
CPU cores your system has (as opposed to the logical number of cores)."""
repeat_last_n: Optional[int]
"""Sets how far back for the model to look back to prevent
repetition. (Default: 64, 0 = disabled, -1 = num_ctx)"""
repeat_penalty: Optional[float]
"""Sets how strongly to penalize repetitions. A higher value (e.g., 1.5)
will penalize repetitions more strongly, while a lower value (e.g., 0.9)
will be more lenient. (Default: 1.1)"""
temperature: Optional[float]
"""The temperature of the model. Increasing the temperature will
make the model answer more creatively. (Default: 0.8)"""
stop: Optional[List[str]]
"""Sets the stop tokens to use."""
tfs_z: Optional[float]
"""Tail free sampling is used to reduce the impact of less probable
tokens from the output. A higher value (e.g., 2.0) will reduce the
impact more, while a value of 1.0 disables this setting. (default: 1)"""
top_k: Optional[int]
"""Reduces the probability of generating nonsense. A higher value (e.g. 100)
will give more diverse answers, while a lower value (e.g. 10)
will be more conservative. (Default: 40)"""
top_p: Optional[float]
"""Works together with top-k. A higher value (e.g., 0.95) will lead
to more diverse text, while a lower value (e.g., 0.5) will
generate more focused and conservative text. (Default: 0.9)"""
@property
def _default_params(self) -> Dict[str, Any]:
"""Get the default parameters for calling Ollama."""
return {
"model": self.model,
"options": {
"mirostat": self.mirostat,
"mirostat_eta": self.mirostat_eta,
"mirostat_tau": self.mirostat_tau,
"num_ctx": self.num_ctx,
"num_gpu": self.num_gpu,
"num_thread": self.num_thread,
"repeat_last_n": self.repeat_last_n,
"repeat_penalty": self.repeat_penalty,
"temperature": self.temperature,
"stop": self.stop,
"tfs_z": self.tfs_z,
"top_k": self.top_k,
"top_p": self.top_p,
},
}
@property
def _identifying_params(self) -> Mapping[str, Any]:
"""Get the identifying parameters."""
return {**{"model": self.model}, **self._default_params}
def _create_stream(
self,
prompt: str,
stop: Optional[List[str]] = None,
**kwargs: Any,
) -> Iterator[str]:
if self.stop is not None and stop is not None:
raise ValueError("`stop` found in both the input and default params.")
elif self.stop is not None:
stop = self.stop
elif stop is None:
stop = []
params = {**self._default_params, "stop": stop, **kwargs}
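# The generate endpoint streams back newline-delimited JSON objects;
# stream=True lets us iterate over them as they arrive instead of
# buffering the whole response.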
response = requests.post(
url=f"{self.base_url}/api/generate/",
headers={"Content-Type": "application/json"},
json={"prompt": prompt, **params},
stream=True,
)
response.encoding = "utf-8"
if response.status_code != 200:
optional_detail = response.json().get("error")
raise ValueError(
f"Ollama call failed with status code {response.status_code}."
f" Details: {optional_detail}"
)
return response.iter_lines(decode_unicode=True)
class Ollama(BaseLLM, _OllamaCommon):
"""Ollama locally run large language models.
To use, follow the instructions at https://ollama.ai/.
Example:
.. code-block:: python
from langchain.llms import Ollama
ollama = Ollama(model="llama2")
"""
class Config:
"""Configuration for this pydantic object."""
extra = Extra.forbid
@property
def _llm_type(self) -> str:
"""Return type of llm."""
return "ollama-llm"
def _generate(
self,
prompts: List[str],
stop: Optional[List[str]] = None,
run_manager: Optional[CallbackManagerForLLMRun] = None,
**kwargs: Any,
) -> LLMResult:
"""Call out to Ollama's generate endpoint.
Args:
prompts: The prompts to pass into the model.
stop: Optional list of stop words to use when generating.
Returns:
An LLMResult containing one generation per prompt.
Example:
.. code-block:: python
response = ollama("Tell me a joke.")
"""
# TODO: add caching here.
generations = []
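# For each prompt, stream the response and fold the chunks into a single
# final chunk; each streamed token is also forwarded to the callback
# manager as it arrives.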
for prompt in prompts:
final_chunk: Optional[GenerationChunk] = None
for stream_resp in self._create_stream(prompt, stop, **kwargs):
if stream_resp:
chunk = _stream_response_to_generation_chunk(stream_resp)
if final_chunk is None:
final_chunk = chunk
else:
final_chunk += chunk
if run_manager:
run_manager.on_llm_new_token(
chunk.text,
verbose=self.verbose,
)
generations.append([final_chunk])
return LLMResult(generations=generations)
def _stream(
self,
prompt: str,
stop: Optional[List[str]] = None,
run_manager: Optional[CallbackManagerForLLMRun] = None,
**kwargs: Any,
) -> Iterator[GenerationChunk]:
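# Yield chunks as they arrive from the streaming endpoint, notifying the
# run manager of each new token.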
for stream_resp in self._create_stream(prompt, stop, **kwargs):
if stream_resp:
chunk = _stream_response_to_generation_chunk(stream_resp)
yield chunk
if run_manager:
run_manager.on_llm_new_token(
chunk.text,
verbose=self.verbose,
)