{
"cells": [
{
"cell_type": "raw",
"id": "67db2992",
"metadata": {},
"source": [
"---\n",
"sidebar_label: TritonTensorRT\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "b56b221d",
"metadata": {},
"source": [
"# Nvidia Triton+TRT-LLM\n",
"\n",
"Nvidia's Triton is an inference server that provides API-style access to hosted LLMs. Nvidia TensorRT-LLM, often abbreviated as TRT-LLM, is a GPU-accelerated SDK for optimizing LLMs and running inference on them. This connector allows LangChain to interact remotely with a Triton inference server over gRPC or HTTP to perform accelerated inference operations.\n",
"\n",
"[Triton Inference Server GitHub](https://github.com/triton-inference-server/server)\n",
"\n",
"\n",
"## TritonTensorRTLLM\n",
"\n",
"This example shows how to use LangChain's `TritonTensorRTLLM` class to interact with a model served by Triton. To install the package, run the following command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "59c710c4",
"metadata": {},
"outputs": [],
"source": [
"# install package\n",
"%pip install -U langchain-nvidia-trt"
]
},
{
"cell_type": "markdown",
"id": "0ee90032",
"metadata": {},
"source": [
"## Create the Triton+TRT-LLM instance\n",
"\n",
"A Triton instance represents a running server, so make sure you have a valid server configuration running and change `localhost:8001` to the correct IP/hostname:port combination for your server.\n",
"\n",
"An example of setting up this environment can be found in Nvidia's [GenerativeAIExamples GitHub repo](https://github.com/NVIDIA/GenerativeAIExamples/tree/main/RetrievalAugmentedGeneration)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "035dea0f",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain_core.prompts import PromptTemplate\n",
"from langchain_nvidia_trt.llms import TritonTensorRTLLM\n",
"\n",
"template = \"\"\"Question: {question}\n",
"\n",
"Answer: Let's think step by step.\"\"\"\n",
"\n",
"prompt = PromptTemplate.from_template(template)\n",
"\n",
"# Connect to the TRT-LLM Llama-2 model running on the Triton server at the URL below\n",
"triton_llm = TritonTensorRTLLM(\n",
"    server_url=\"localhost:8001\", model_name=\"ensemble\", tokens=500\n",
")\n",
"\n",
"chain = prompt | triton_llm\n",
"\n",
"chain.invoke({\"question\": \"What is LangChain?\"})"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
},
"vscode": {
"interpreter": {
"hash": "e971737741ff4ec9aff7dc6155a1060a59a8a6d52c757dbbe66bf8ee389494b1"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}