# Nvidia Triton+TRT-LLM

Nvidia's Triton is an inference server that provides API-style access to hosted LLMs. Likewise, Nvidia TensorRT-LLM, often abbreviated as TRT-LLM, is a GPU-accelerated SDK for optimizing and running inference on LLMs. This connector allows LangChain to interact remotely with a Triton inference server over gRPC or HTTP to perform accelerated inference operations.

[Triton Inference Server GitHub](https://github.com/triton-inference-server/server)


## TritonTensorRTLLM

This example goes over how to use LangChain to interact with `TritonTensorRT` LLMs. To install, run the following command:

In [None]:
# install package
%pip install -U langchain-nvidia-trt

## Create the Triton+TRT-LLM instance

Remember that a Triton instance represents a running server instance. Make sure you have a valid server configuration running, and change `localhost:8001` to the correct IP/hostname:port combination for your server.

An example of setting up this environment can be found in Nvidia's [GenerativeAIExamples GitHub repo](https://github.com/NVIDIA/GenerativeAIExamples/tree/main/RetrievalAugmentedGeneration).
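Before creating the LLM instance, it can help to confirm that the `hostname:port` value you plan to pass as `server_url` is well formed and that something is listening there. The following is a minimal sketch using only the Python standard library. The helper names `split_server_url` and `is_port_open` are hypothetical and not part of `langchain-nvidia-trt`, and a successful TCP connection only shows that the port accepts connections, not that Triton or the model is healthy.

```python
import socket
from typing import Tuple


def split_server_url(server_url: str) -> Tuple[str, int]:
    """Split a 'host:port' string into (host, port).

    Hypothetical helper for illustration; raises ValueError when the
    string does not look like 'host:port'.
    """
    host, sep, port = server_url.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {server_url!r}")
    return host, int(port)


def is_port_open(server_url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to 'host:port' succeeds.

    This is a reachability check only; it does not speak gRPC or HTTP
    and says nothing about whether a model is loaded.
    """
    host, port = split_server_url(server_url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False


if __name__ == "__main__":
    url = "localhost:8001"  # replace with your server's host:port
    print(f"{url} reachable: {is_port_open(url)}")
```

If the check reports that the port is not reachable, verify the server address and that the Triton container is running before moving on.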

In [None]:
from langchain_core.prompts import PromptTemplate
from langchain_nvidia_trt.llms import TritonTensorRTLLM

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate.from_template(template)

# Connect to the TRT-LLM Llama-2 model running on the Triton server at the url below
triton_llm = TritonTensorRTLLM(server_url="localhost:8001", model_name="ensemble", tokens=500)

chain = prompt | triton_llm

chain.invoke({"question": "What is LangChain?"})