{ "cells": [ { "cell_type": "raw", "id": "67db2992", "metadata": {}, "source": [ "---\n", "sidebar_label: TritonTensorRT\n", "---" ] }, { "cell_type": "markdown", "id": "b56b221d", "metadata": {}, "source": [ "# Nvidia Triton+TRT-LLM\n", "\n", "Nvidia's Triton is an inference server that provides an API style access to hosted LLM models. Likewise, Nvidia TensorRT-LLM, often abbreviated as TRT-LLM, is a GPU accelerated SDK for running optimizations and inference on LLM models. This connector allows for Langchain to remotely interact with a Triton inference server over GRPC or HTTP to performance accelerated inference operations.\n", "\n", "[Triton Inference Server Github](https://github.com/triton-inference-server/server)\n", "\n", "\n", "## TritonTensorRTLLM\n", "\n", "This example goes over how to use LangChain to interact with `TritonTensorRT` LLMs. To install, run the following command:" ] }, { "cell_type": "code", "execution_count": null, "id": "59c710c4", "metadata": {}, "outputs": [], "source": [ "# install package\n", "%pip install -U langchain-nvidia-trt" ] }, { "cell_type": "markdown", "id": "0ee90032", "metadata": {}, "source": [ "## Create the Triton+TRT-LLM instance\n", "\n", "Remember that a Triton instance represents a running server instance therefore you should ensure you have a valid server configuration running and change the `localhost:8001` to the correct IP/hostname:port combination for your server.\n", "\n", "An example of setting up this environment can be found at Nvidia's (GenerativeAIExamples Github Repo)[https://github.com/NVIDIA/GenerativeAIExamples/tree/main/RetrievalAugmentedGeneration]" ] }, { "cell_type": "code", "execution_count": null, "id": "035dea0f", "metadata": { "tags": [] }, "outputs": [], "source": [ "from langchain_core.prompts import PromptTemplate\n", "from langchain_nvidia_trt.llms import TritonTensorRTLLM\n", "\n", "template = \"\"\"Question: {question}\n", "\n", "Answer: Let's think step by step.\"\"\"\n", "\n", "prompt = PromptTemplate.from_template(template)\n", "\n", "# Connect to the TRT-LLM Llama-2 model running on the Triton server at the url below\n", "triton_llm = TritonTensorRTLLM(server_url =\"localhost:8001\", model_name=\"ensemble\", tokens=500)\n", "\n", "chain = prompt | triton_llm \n", "\n", "chain.invoke({\"question\": \"What is LangChain?\"})" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.9" }, "vscode": { "interpreter": { "hash": "e971737741ff4ec9aff7dc6155a1060a59a8a6d52c757dbbe66bf8ee389494b1" } } }, "nbformat": 4, "nbformat_minor": 5 }