From 14ff1438e65455413848d2ebc92c2e9ac36260fa Mon Sep 17 00:00:00 2001
From: Mikhail Khludnev
Date: Tue, 6 Feb 2024 22:47:07 +0300
Subject: [PATCH] nvidia-trt[patch]: propagate InferenceClientException to the caller. (#16936)

- **Description:**
  1. Propagate `InferenceServerException` to the caller.
  2. Stop the gRPC receiver thread on exception.

  Before the change I got:

  ```
      for token in result_queue:
  >       result_str += token
  E       TypeError: can only concatenate str (not "InferenceServerException") to str

  ../../langchain_nvidia_trt/llms.py:207: TypeError
  ```

  and the stream thread kept running. After the change the request thread stops correctly and the caller gets the root-cause exception:

  ```
  E   tritonclient.utils.InferenceServerException: [request id: 4529729] expected number of inputs between 2 and 3 but got 10 inputs for model 'vllm_model'

  ../../langchain_nvidia_trt/llms.py:205: InferenceServerException
  ```

- **Issue:** the issue # it fixes, if applicable
- **Dependencies:** any dependencies required for this change
- **Twitter handle:** [t.me/mkhl_spb](https://t.me/mkhl_spb)

I'm not sure about test coverage: should I set up deep mocks, or is there some kind of Triton stub available (e.g. via testcontainers)? A caller-side usage sketch is appended after the diff.
---
 libs/partners/nvidia-trt/langchain_nvidia_trt/llms.py | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/libs/partners/nvidia-trt/langchain_nvidia_trt/llms.py b/libs/partners/nvidia-trt/langchain_nvidia_trt/llms.py
index 36e1e6e5ca..0ea1fca1df 100644
--- a/libs/partners/nvidia-trt/langchain_nvidia_trt/llms.py
+++ b/libs/partners/nvidia-trt/langchain_nvidia_trt/llms.py
@@ -199,10 +199,13 @@ class TritonTensorRTLLM(BaseLLM):
         result_queue = self._invoke_triton(self.model_name, inputs, outputs, stop)
 
         result_str = ""
-        for token in result_queue:
-            result_str += token
-
-        self.client.stop_stream()
+        try:
+            for token in result_queue:
+                if isinstance(token, Exception):
+                    raise token
+                result_str += token
+        finally:
+            self.client.stop_stream()
 
         return result_str
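
For reference, a minimal caller-side sketch of the new behavior, assuming a Triton endpoint reachable at `localhost:8001` serving the `vllm_model` named in the traceback above; the URL, model name, and prompt are placeholders and not part of this patch:

```python
from tritonclient.utils import InferenceServerException

from langchain_nvidia_trt.llms import TritonTensorRTLLM

# Placeholder endpoint and model name; any Triton-served model behaves the same.
llm = TritonTensorRTLLM(server_url="localhost:8001", model_name="vllm_model")

try:
    print(llm.invoke("What is the Triton Inference Server?"))
except InferenceServerException as exc:
    # Before this patch a rejected request surfaced inside the token loop as
    # `TypeError: can only concatenate str ...` and the gRPC stream stayed open.
    # Now the original server-side error is re-raised and the stream is closed
    # in the finally block.
    print(f"Triton rejected the request: {exc}")
```

For unit coverage without a live server, the same path could presumably be exercised with deep mocks: patch the gRPC client, make `_invoke_triton` return an iterator that yields an `Exception` instance, and assert that the exception propagates and `stop_stream()` is still called.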