import pytest
import torch

from petals.client import DistributedBloomConfig
from petals.server.throughput import measure_compute_rps
from test_utils import MODEL_NAME


@pytest.mark.forked
@pytest.mark.parametrize("tensor_parallel", [False, True])
def test_compute_throughput(tensor_parallel: bool):
    config = DistributedBloomConfig.from_pretrained(MODEL_NAME)

    # Benchmark on a single CPU device, or split across two CPU "devices"
    # to exercise the tensor-parallel code path.
    tensor_parallel_devices = ("cpu", "cpu") if tensor_parallel else ()

    compute_rps = measure_compute_rps(
        config,
        device=torch.device("cpu"),
        dtype=torch.bfloat16,
        load_in_8bit=False,
        tensor_parallel_devices=tensor_parallel_devices,
        n_steps=10,
    )
    # The measured compute throughput must be a positive float.
    assert isinstance(compute_rps, float) and compute_rps > 0
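# A minimal usage sketch (assumptions: this file lives in the repo's test
# suite, and the pytest-forked plugin required by @pytest.mark.forked is
# installed):
#
#   pytest -k test_compute_throughput
#
# This runs both parametrized cases, tensor_parallel=False and True; the
# latter splits the model across two CPU "devices" to exercise the
# tensor-parallel path.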