{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "499c3142-2033-437d-a60a-731988ac6074",
   "metadata": {},
   "source": [
    "# vLLM\n",
    "\n",
    "[vLLM](https://vllm.readthedocs.io/en/latest/index.html) is a fast and easy-to-use library for LLM inference and serving, offering:\n",
    "* State-of-the-art serving throughput \n",
    "* Efficient management of attention key and value memory with PagedAttention\n",
    "* Continuous batching of incoming requests\n",
    "* Optimized CUDA kernels\n",
    "\n",
    "This notebook goes over how to use an LLM with LangChain and vLLM.\n",
    "\n",
    "To use it, you need the `vllm` Python package installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "8a3f2666-5c75-4797-967a-7915a247bf33",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "#!pip install vllm -q"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "84e350f7-21f6-455b-b1f0-8b0116a2fd49",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO 08-06 11:37:33 llm_engine.py:70] Initializing an LLM engine with config: model='mosaicml/mpt-7b', tokenizer='mosaicml/mpt-7b', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.bfloat16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)\n",
      "INFO 08-06 11:37:41 llm_engine.py:196] # GPU blocks: 861, # CPU blocks: 512\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  2.00it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "What is the capital of France ? The capital of France is Paris.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from langchain.llms import VLLM\n",
    "\n",
    "llm = VLLM(model=\"mosaicml/mpt-7b\",\n",
    "           trust_remote_code=True,  # required for Hugging Face models that ship custom code (e.g. MPT)\n",
    "           max_new_tokens=128,\n",
    "           top_k=10,\n",
    "           top_p=0.95,\n",
    "           temperature=0.8,\n",
    ")\n",
    "\n",
    "print(llm(\"What is the capital of France ?\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "94a3b41d-8329-4f8f-94f9-453d7f132214",
   "metadata": {},
   "source": [
    "## Integrate the model in an LLMChain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "5605b7a1-fa63-49c1-934d-8b4ef8d71dd5",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.34s/it]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "1. The first Pokemon game was released in 1996.\n",
      "2. The president was Bill Clinton.\n",
      "3. Clinton was president from 1993 to 2001.\n",
      "4. The answer is Clinton.\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from langchain import PromptTemplate, LLMChain\n",
    "\n",
    "template = \"\"\"Question: {question}\n",
    "\n",
    "Answer: Let's think step by step.\"\"\"\n",
    "prompt = PromptTemplate(template=template, input_variables=[\"question\"])\n",
    "\n",
    "llm_chain = LLMChain(prompt=prompt, llm=llm)\n",
    "\n",
    "question = \"Who was the US president in the year the first Pokemon game was released?\"\n",
    "\n",
    "print(llm_chain.run(question))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56826aba-d08b-4838-8bfa-ca96e463b25d",
   "metadata": {},
   "source": [
    "## Distributed Inference\n",
    "\n",
    "vLLM supports distributed tensor-parallel inference and serving. \n",
    "\n",
    "To run multi-GPU inference with the `VLLM` class, set the `tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f8c25c35-47b5-459d-9985-3cf546e9ac16",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.llms import VLLM\n",
    "\n",
    "llm = VLLM(model=\"mosaicml/mpt-30b\",\n",
    "           tensor_parallel_size=4,\n",
    "           trust_remote_code=True,  # required for Hugging Face models that ship custom code (e.g. MPT)\n",
    ")\n",
    "\n",
    "llm(\"What is the future of AI?\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_pytorch_p310",
   "language": "python",
   "name": "conda_pytorch_p310"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}