Change method to calculate number of tokens for OpenAIChat (#1457)

Solves https://github.com/hwchase17/langchain/issues/1412 Currently `OpenAIChat` inherits the way it calculates the number of tokens, `get_num_token`, from `BaseLLM`. In the other hand `OpenAI` inherits from `BaseOpenAI`. `BaseOpenAI` and `BaseLLM` uses different methodologies for doing this. The first relies on `tiktoken` while the second on `GPT2TokenizerFast`. The motivation of this PR is: 1. Bring consistency about the way of calculating number of tokens `get_num_token` to the `OpenAI` family, regardless of `Chat` vs `non Chat` scenarios. 2. Give preference to the `tiktoken` method as it's serverless friendly. It doesn't require downloading models which might make it incompatible with `readonly` filesystems.
1 year ago · dec3750875
parent 763f879536
commit dec3750875
1 changed files with 22 additions and 0 deletions
--- a/langchain/llms/openai.py
+++ b/langchain/llms/openai.py
@ -692,3 +692,25 @@ class OpenAIChat(BaseLLM, BaseModel):
    def _llm_type(self) -> str:
        """Return type of llm."""
        return "openai-chat"
+
+    def get_num_tokens(self, text: str) -> int:
+        """Calculate num tokens with tiktoken package."""
+        # tiktoken NOT supported for Python 3.8 or below
+        if sys.version_info[1] <= 8:
+            return super().get_num_tokens(text)
+        try:
+            import tiktoken
+        except ImportError:
+            raise ValueError(
+                "Could not import tiktoken python package. "
+                "This is needed in order to calculate get_num_tokens. "
+                "Please it install it with `pip install tiktoken`."
+            )
+        # create a GPT-3.5-Turbo encoder instance
+        enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
+
+        # encode the text using the GPT-3.5-Turbo encoder
+        tokenized_text = enc.encode(text)
+
+        # calculate the number of tokens in the encoded text
+        return len(tokenized_text)