langchain/docs/extras/integrations/llms/vllm.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "499c3142-2033-437d-a60a-731988ac6074",
   "metadata": {},
   "source": [
    "# vLLM\n",
    "\n",
    "[vLLM](https://vllm.readthedocs.io/en/latest/index.html) is a fast and easy-to-use library for LLM inference and serving, offering:\n",
    "* State-of-the-art serving throughput\n",
    "* Efficient management of attention key and value memory with PagedAttention\n",
    "* Continuous batching of incoming requests\n",
    "* Optimized CUDA kernels\n",
    "\n",
    "This notebook goes over how to use an LLM with LangChain and vLLM.\n",
    "\n",
    "To use this integration, you need the `vllm` Python package installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "8a3f2666-5c75-4797-967a-7915a247bf33",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "#!pip install vllm -q"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "84e350f7-21f6-455b-b1f0-8b0116a2fd49",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO 08-06 11:37:33 llm_engine.py:70] Initializing an LLM engine with config: model='mosaicml/mpt-7b', tokenizer='mosaicml/mpt-7b', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.bfloat16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)\n",
      "INFO 08-06 11:37:41 llm_engine.py:196] # GPU blocks: 861, # CPU blocks: 512\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  2.00it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "What is the capital of France ? The capital of France is Paris.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from langchain.llms import VLLM\n",
    "\n",
    "llm = VLLM(\n",
    "    model=\"mosaicml/mpt-7b\",\n",
    "    trust_remote_code=True,  # needed for Hugging Face models that ship custom code, such as MPT\n",
    "    max_new_tokens=128,\n",
    "    top_k=10,\n",
    "    top_p=0.95,\n",
    "    temperature=0.8,\n",
    ")\n",
    "\n",
    "print(llm(\"What is the capital of France ?\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "94a3b41d-8329-4f8f-94f9-453d7f132214",
   "metadata": {},
   "source": [
    "## Integrate the model in an LLMChain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "5605b7a1-fa63-49c1-934d-8b4ef8d71dd5",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.34s/it]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "1. The first Pokemon game was released in 1996.\n",
      "2. The president was Bill Clinton.\n",
      "3. Clinton was president from 1993 to 2001.\n",
      "4. The answer is Clinton.\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from langchain import PromptTemplate, LLMChain\n",
    "\n",
    "template = \"\"\"Question: {question}\n",
    "\n",
    "Answer: Let's think step by step.\"\"\"\n",
    "prompt = PromptTemplate(template=template, input_variables=[\"question\"])\n",
    "\n",
    "llm_chain = LLMChain(prompt=prompt, llm=llm)\n",
    "\n",
    "question = \"Who was the US president in the year the first Pokemon game was released?\"\n",
    "\n",
    "print(llm_chain.run(question))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56826aba-d08b-4838-8bfa-ca96e463b25d",
   "metadata": {},
   "source": [
    "## Distributed Inference\n",
    "\n",
    "vLLM supports distributed tensor-parallel inference and serving.\n",
    "\n",
    "To run multi-GPU inference with the `VLLM` class, set the `tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f8c25c35-47b5-459d-9985-3cf546e9ac16",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.llms import VLLM\n",
    "\n",
    "llm = VLLM(\n",
    "    model=\"mosaicml/mpt-30b\",\n",
    "    tensor_parallel_size=4,  # number of GPUs to use\n",
    "    trust_remote_code=True,  # needed for Hugging Face models that ship custom code, such as MPT\n",
    ")\n",
    "\n",
    "llm(\"What is the future of AI?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64e89be0-6ad7-43a8-9dac-1324dcd4e851",
   "metadata": {
    "tags": []
   },
   "source": [
    "## OpenAI-Compatible Server\n",
    "\n",
    "vLLM can be deployed as a server that mimics the OpenAI API protocol, so it can serve as a drop-in replacement for applications that use the OpenAI API.\n",
    "\n",
    "The server can be queried in the same format as the OpenAI API.\n",
    "\n",
    "### OpenAI-Compatible Completion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "c3cbc428-0bb8-422a-913e-1c6fef8b89d4",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " a city that is filled with history, ancient buildings, and art around every corner\n"
     ]
    }
   ],
   "source": [
    "from langchain.llms import VLLMOpenAI\n",
    "\n",
    "\n",
    "llm = VLLMOpenAI(\n",
    "    openai_api_key=\"EMPTY\",\n",
    "    openai_api_base=\"http://localhost:8000/v1\",\n",
    "    model_name=\"tiiuae/falcon-7b\",\n",
    "    model_kwargs={\"stop\": [\".\"]}\n",
    ")\n",
    "print(llm(\"Rome is\"))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_pytorch_p310",
   "language": "python",
   "name": "conda_pytorch_p310"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
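
The "OpenAI-Compatible Server" section of the notebook says the server can be queried in the same format as the OpenAI API. As a rough illustration of what that means on the wire, here is a minimal sketch that builds (but does not send) a Completions-style HTTP request using only the standard library. The helper name `build_completion_request` and the `max_tokens` default are hypothetical, introduced only for this example; the base URL, model name, placeholder key, and stop sequence are taken from the notebook. It assumes a vLLM server has been started separately and is listening on `localhost:8000`.

```python
import json
import urllib.request


def build_completion_request(api_base, model, prompt, stop=None, max_tokens=64):
    """Build (but do not send) an OpenAI-style Completions request.

    Hypothetical helper for illustration; not part of langchain or vLLM.
    """
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    if stop is not None:
        payload["stop"] = stop
    return urllib.request.Request(
        url=api_base.rstrip("/") + "/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The notebook passes the placeholder key "EMPTY".
            "Authorization": "Bearer EMPTY",
        },
        method="POST",
    )


request = build_completion_request(
    api_base="http://localhost:8000/v1",
    model="tiiuae/falcon-7b",
    prompt="Rome is",
    stop=["."],
)
print(request.full_url)
print(request.data.decode("utf-8"))

# Sending the request needs a running server (assumed, not started here):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["text"])
```

`VLLMOpenAI` in the notebook hides these details; the sketch is only meant to show why an existing OpenAI client can be pointed at the vLLM server by changing its base URL.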
feat(llms): add support for vLLM (#8806) Hello langchain maintainers, this PR aims at integrating [vllm](https://vllm.readthedocs.io/en/latest/#) into langchain. This PR closes #8729. This feature clearly depends on `vllm`, but I've seen other models supported here depend on packages that are not included in the pyproject.toml (e.g. `gpt4all`, `text-generation`) so I thought it was the case for this as well. @hwchase17, @baskaryan Co-authored-by: Harrison Chase <hw.chase.17@gmail.com> 2023-08-07 14:32:02 +00:00
feat(llms): support vLLM's OpenAI-compatible server (#9179) This PR aims at supporting [vLLM's OpenAI-compatible server feature](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html#openai-compatible-server), i.e. allowing vLLM's LLMs to be called as if they were OpenAI's. I've also updated the related notebook with an example usage. At the moment, vLLM only supports the `Completion` API. 2023-08-14 06:03:05 +00:00