langchain/libs/partners/nvidia-trt/docs/llms.ipynb

{
 "cells": [
  {
   "cell_type": "raw",
   "id": "67db2992",
   "metadata": {},
   "source": [
    "---\n",
    "sidebar_label: TritonTensorRT\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b56b221d",
   "metadata": {},
   "source": [
    "# Nvidia Triton+TRT-LLM\n",
    "\n",
    "Nvidia's Triton is an inference server that provides API-style access to hosted LLMs. Nvidia TensorRT-LLM, often abbreviated as TRT-LLM, is a GPU-accelerated SDK for optimizing LLMs and running inference on them. This connector allows LangChain to interact remotely with a Triton inference server over gRPC or HTTP to perform accelerated inference operations.\n",
    "\n",
    "[Triton Inference Server GitHub](https://github.com/triton-inference-server/server)\n",
    "\n",
    "\n",
    "## TritonTensorRTLLM\n",
    "\n",
    "This example shows how to use LangChain's `TritonTensorRTLLM` class to interact with TRT-LLM models served by Triton. To install the package, run the following command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "59c710c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# install package\n",
    "%pip install -U langchain-nvidia-trt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ee90032",
   "metadata": {},
   "source": [
    "## Create the Triton+TRT-LLM instance\n",
    "\n",
    "A Triton instance represents a running server, so make sure you have a valid server configuration running, and change `localhost:8001` to the correct IP/hostname:port combination for your server.\n",
    "\n",
    "An example of setting up this environment can be found in Nvidia's [GenerativeAIExamples GitHub repo](https://github.com/NVIDIA/GenerativeAIExamples/tree/main/RetrievalAugmentedGeneration)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "035dea0f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain_core.prompts import PromptTemplate\n",
    "from langchain_nvidia_trt.llms import TritonTensorRTLLM\n",
    "\n",
    "template = \"\"\"Question: {question}\n",
    "\n",
    "Answer: Let's think step by step.\"\"\"\n",
    "\n",
    "prompt = PromptTemplate.from_template(template)\n",
    "\n",
    "# Connect to the TRT-LLM Llama-2 model running on the Triton server at the URL below\n",
    "triton_llm = TritonTensorRTLLM(\n",
    "    server_url=\"localhost:8001\", model_name=\"ensemble\", tokens=500\n",
    ")\n",
    "\n",
    "chain = prompt | triton_llm\n",
    "\n",
    "chain.invoke({\"question\": \"What is LangChain?\"})"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  },
  "vscode": {
   "interpreter": {
    "hash": "e971737741ff4ec9aff7dc6155a1060a59a8a6d52c757dbbe66bf8ee389494b1"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}