langchain/tests/integration_tests/test_schema.py

"""Test token ID retrieval and counting with the default GPT-2 tokenizer."""

from langchain.schema.language_model import _get_token_ids_default_method


class TestTokenCountingWithGPT2Tokenizer:
    def test_tokenization(self) -> None:
        # Check that the tokenization is consistent with the GPT-2 tokenizer
        assert _get_token_ids_default_method("This is a test") == [1212, 318, 257, 1332]

    def test_empty_token(self) -> None:
        assert len(_get_token_ids_default_method("")) == 0

    def test_multiple_tokens(self) -> None:
        assert len(_get_token_ids_default_method("a b c")) == 3

    def test_special_tokens(self) -> None:
        # Guard against a silent change of the default tokenizer:
        # the GPT-2 tokenizer splits this string into 6 tokens.
        assert len(_get_token_ids_default_method("a:b_c d")) == 6

[simple][test] Added test case for schema.py (#3692) - added unittest for schema.py covering utility functions and token counting. - fixed a nit. based on huggingface doc, the tokenizer model is gpt-2. [link](https://huggingface.co/transformers/v4.8.2/_modules/transformers/models/gpt2/tokenization_gpt2_fast.html) - make lint && make format, passed on local - screenshot of new test running result 2023-04-29 03:42:24 +00:00
Base language model docstrings (#7104) 2023-07-07 20:09:10 +00:00
Add 'get_token_ids' method (#4784) Let user inspect the token ids in addition to getting the number of tokens --------- Co-authored-by: Zach Schillaci <40636930+zachschillaci27@users.noreply.github.com> 2023-05-22 13:17:26 +00:00