langchain/libs/partners/nvidia-trt/docs/llms.ipynb

{
 "cells": [
  {
   "cell_type": "raw",
   "id": "67db2992",
   "metadata": {},
   "source": [
    "---\n",
    "sidebar_label: TritonTensorRT\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b56b221d",
   "metadata": {},
   "source": [
    "# Nvidia Triton+TRT-LLM\n",
    "\n",
    "Nvidia's Triton is an inference server that provides an API style access to hosted LLM models. Likewise, Nvidia TensorRT-LLM, often abbreviated as TRT-LLM, is a GPU accelerated SDK for running optimizations and inference on LLM models. This connector allows for Langchain to remotely interact with a Triton inference server over GRPC or HTTP to performance accelerated inference operations.\n",
    "\n",
    "[Triton Inference Server Github](https://github.com/triton-inference-server/server)\n",
    "\n",
    "\n",
    "## TritonTensorRTLLM\n",
    "\n",
    "This example goes over how to use LangChain to interact with `TritonTensorRT` LLMs. To install, run the following command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "59c710c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# install package\n",
    "%pip install -U langchain-nvidia-trt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ee90032",
   "metadata": {},
   "source": [
    "## Create the Triton+TRT-LLM instance\n",
    "\n",
    "Remember that a Triton instance represents a running server instance therefore you should ensure you have a valid server configuration running and change the `localhost:8001` to the correct IP/hostname:port combination for your server.\n",
    "\n",
    "An example of setting up this environment can be found at Nvidia's (GenerativeAIExamples Github Repo)[https://github.com/NVIDIA/GenerativeAIExamples/tree/main/RetrievalAugmentedGeneration]"
   ]
  },
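  {
   "cell_type": "markdown",
   "id": "7c1e4a90",
   "metadata": {},
   "source": [
    "Before connecting, it can help to confirm that the address you plan to pass as `server_url` accepts connections at all. The cell below is a minimal sketch using only the Python standard library; `split_host_port` and `is_reachable` are hypothetical helper names, not part of `langchain-nvidia-trt`. It assumes a scheme-less `host:port` string as in this example, and it only checks that the TCP port is open, not that Triton or the model is healthy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2d9b6f13",
   "metadata": {},
   "outputs": [],
   "source": [
    "import socket\n",
    "\n",
    "\n",
    "def split_host_port(server_url, default_port=8001):\n",
    "    # Hypothetical helper: split 'host:port' into (host, port).\n",
    "    # Assumes no URL scheme and no IPv6 literal; falls back to default_port.\n",
    "    host, sep, port = server_url.rpartition(':')\n",
    "    if not sep:\n",
    "        return server_url, default_port\n",
    "    return host, int(port)\n",
    "\n",
    "\n",
    "def is_reachable(server_url, timeout=2.0):\n",
    "    # Hypothetical helper: True if a TCP connection can be opened.\n",
    "    # This does not verify that the listener is actually Triton.\n",
    "    host, port = split_host_port(server_url)\n",
    "    try:\n",
    "        with socket.create_connection((host, port), timeout=timeout):\n",
    "            return True\n",
    "    except OSError:\n",
    "        return False\n",
    "\n",
    "\n",
    "# Replace with the address of your own Triton server\n",
    "is_reachable('localhost:8001')"
   ]
  },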
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "035dea0f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain_core.prompts import PromptTemplate\n",
    "from langchain_nvidia_trt.llms import TritonTensorRTLLM\n",
    "\n",
    "template = \"\"\"Question: {question}\n",
    "\n",
    "Answer: Let's think step by step.\"\"\"\n",
    "\n",
    "prompt = PromptTemplate.from_template(template)\n",
    "\n",
    "# Connect to the TRT-LLM Llama-2 model running on the Triton server at the url below\n",
    "triton_llm = TritonTensorRTLLM(server_url =\"localhost:8001\", model_name=\"ensemble\", tokens=500)\n",
    "\n",
    "chain = prompt | triton_llm \n",
    "\n",
    "chain.invoke({\"question\": \"What is LangChain?\"})"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  },
  "vscode": {
   "interpreter": {
    "hash": "e971737741ff4ec9aff7dc6155a1060a59a8a6d52c757dbbe66bf8ee389494b1"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
[Partner] NVIDIA TRT Package (#14733) Simplify #13976 and add as a separate package. - [] Add README - [X] Add doc notebook - [X] Add simple LLM integration --------- Co-authored-by: Jeremy Dyer <jdye64@gmail.com> 2023-12-19 03:08:25 +00:00