mirror of
https://github.com/hwchase17/langchain
synced 2024-11-06 03:20:49 +00:00
{
"cells": [
{
"cell_type": "raw",
"id": "67db2992",
"metadata": {},
"source": [
"---\n",
"sidebar_label: TritonTensorRT\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "b56b221d",
"metadata": {},
"source": [
"# Nvidia Triton+TRT-LLM\n",
"\n",
"Nvidia's Triton is an inference server that provides an API style access to hosted LLM models. Likewise, Nvidia TensorRT-LLM, often abbreviated as TRT-LLM, is a GPU accelerated SDK for running optimizations and inference on LLM models. This connector allows for Langchain to remotely interact with a Triton inference server over GRPC or HTTP to performance accelerated inference operations.\n",
"\n",
"[Triton Inference Server Github](https://github.com/triton-inference-server/server)\n",
"\n",
"\n",
"## TritonTensorRTLLM\n",
"\n",
"This example goes over how to use LangChain to interact with `TritonTensorRT` LLMs. To install, run the following command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "59c710c4",
"metadata": {},
"outputs": [],
"source": [
"# install package\n",
"%pip install -U langchain-nvidia-trt"
]
},
{
"cell_type": "markdown",
"id": "0ee90032",
"metadata": {},
"source": [
"## Create the Triton+TRT-LLM instance\n",
"\n",
"Remember that a Triton instance represents a running server instance therefore you should ensure you have a valid server configuration running and change the `localhost:8001` to the correct IP/hostname:port combination for your server.\n",
"\n",
"An example of setting up this environment can be found at Nvidia's (GenerativeAIExamples Github Repo)[https://github.com/NVIDIA/GenerativeAIExamples/tree/main/RetrievalAugmentedGeneration]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "035dea0f",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain_core.prompts import PromptTemplate\n",
"from langchain_nvidia_trt.llms import TritonTensorRTLLM\n",
"\n",
"template = \"\"\"Question: {question}\n",
"\n",
"Answer: Let's think step by step.\"\"\"\n",
"\n",
"prompt = PromptTemplate.from_template(template)\n",
"\n",
"# Connect to the TRT-LLM Llama-2 model running on the Triton server at the url below\n",
"triton_llm = TritonTensorRTLLM(server_url =\"localhost:8001\", model_name=\"ensemble\", tokens=500)\n",
"\n",
"chain = prompt | triton_llm \n",
"\n",
"chain.invoke({\"question\": \"What is LangChain?\"})"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
},
"vscode": {
"interpreter": {
"hash": "e971737741ff4ec9aff7dc6155a1060a59a8a6d52c757dbbe66bf8ee389494b1"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}