From 14ff1438e65455413848d2ebc92c2e9ac36260fa Mon Sep 17 00:00:00 2001
From: Mikhail Khludnev
Date: Tue, 6 Feb 2024 22:47:07 +0300
Subject: [PATCH] nvidia-trt[patch]: propagate InferenceClientException to the caller. (#16936)

- **Description:**
  1. Propagate `InferenceServerException` to the caller.
  2. Stop the gRPC receiver thread on exception.

  Before the change I got:

  ```
      for token in result_queue:
  >       result_str += token
  E       TypeError: can only concatenate str (not "InferenceServerException") to str

  ../../langchain_nvidia_trt/llms.py:207: TypeError
  ```

  and the stream thread kept running. After the change the request thread stops correctly and the caller gets the root-cause exception:

  ```
  E   tritonclient.utils.InferenceServerException: [request id: 4529729] expected number of inputs between 2 and 3 but got 10 inputs for model 'vllm_model'

  ../../langchain_nvidia_trt/llms.py:205: InferenceServerException
  ```

- **Issue:** the issue # it fixes, if applicable
- **Dependencies:** any dependencies required for this change
- **Twitter handle:** [t.me/mkhl_spb](https://t.me/mkhl_spb)

I'm not sure about test coverage: should I set up deep mocks, or is there some kind of Triton stub available (e.g. via testcontainers)? A caller-side usage sketch is appended after the diff.
---
 libs/partners/nvidia-trt/langchain_nvidia_trt/llms.py | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/libs/partners/nvidia-trt/langchain_nvidia_trt/llms.py b/libs/partners/nvidia-trt/langchain_nvidia_trt/llms.py
index 36e1e6e5ca..0ea1fca1df 100644
--- a/libs/partners/nvidia-trt/langchain_nvidia_trt/llms.py
+++ b/libs/partners/nvidia-trt/langchain_nvidia_trt/llms.py
@@ -199,10 +199,13 @@ class TritonTensorRTLLM(BaseLLM):
         result_queue = self._invoke_triton(self.model_name, inputs, outputs, stop)
 
         result_str = ""
-        for token in result_queue:
-            result_str += token
-
-        self.client.stop_stream()
+        try:
+            for token in result_queue:
+                if isinstance(token, Exception):
+                    raise token
+                result_str += token
+        finally:
+            self.client.stop_stream()
 
         return result_str
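
For reference, a minimal caller-side sketch of the new behavior, assuming a Triton endpoint reachable at `localhost:8001` serving the `vllm_model` named in the traceback above; the URL, model name, and prompt are placeholders and not part of this patch:

```python
from tritonclient.utils import InferenceServerException

from langchain_nvidia_trt.llms import TritonTensorRTLLM

# Placeholder endpoint and model name; any Triton-served model behaves the same.
llm = TritonTensorRTLLM(server_url="localhost:8001", model_name="vllm_model")

try:
    print(llm.invoke("What is the Triton Inference Server?"))
except InferenceServerException as exc:
    # Before this patch a rejected request surfaced inside the token loop as
    # `TypeError: can only concatenate str ...` and the gRPC stream stayed open.
    # Now the original server-side error is re-raised and the stream is closed
    # in the finally block.
    print(f"Triton rejected the request: {exc}")
```

For unit coverage without a live server, the same path could presumably be exercised with deep mocks: patch the gRPC client, make `_invoke_triton` return an iterator that yields an `Exception` instance, and assert that the exception propagates and `stop_stream()` is still called.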