#!/usr/bin/env python3
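"""Benchmark the forward-pass speed of a Petals distributed model.

Spawns one or more worker processes; each loads the model, runs timed
forward passes over random token batches, and reports its average speed
(tokens/sec) back to the parent over a pipe.

Example invocation (the model name below is illustrative; use any
Petals-supported model):

    python benchmark_forward.py --model bigscience/bloom-petals --batch_size 16
"""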

import argparse
import multiprocessing as mp
from time import perf_counter

import numpy as np
import torch
from hivemind.utils.logging import get_logger

from petals import AutoDistributedModel
from petals.constants import DTYPE_MAP, PUBLIC_INITIAL_PEERS

logger = get_logger()


def main():
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument("--model", type=str, required=True, help="Model name or path")
    parser.add_argument("--initial_peers", type=str, nargs="+", default=PUBLIC_INITIAL_PEERS, help="Multiaddrs of the initial DHT peers")
    parser.add_argument("--torch_dtype", type=str, default="float32", help="Torch dtype")
    parser.add_argument("--n_processes", type=str, default="1", help='Number of concurrent processes, or "n_gpus"')
    parser.add_argument("--seq_len", type=int, default=128, help="Sequence length")
    parser.add_argument("--n_steps", type=int, default=100, help="Number of benchmark steps")
    parser.add_argument("--batch_size", type=int, required=True, help="Batch size")
    parser.add_argument("--warmup_steps", type=int, default=1, help="Number of warmup steps")
    args = parser.parse_args()
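
    # The special value "n_gpus" spawns one benchmark process per visible CUDA device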
    if args.n_processes == "n_gpus":
        args.n_processes = torch.cuda.device_count()
    else:
        args.n_processes = int(args.n_processes)
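
    # One-way pipe over which each worker sends its measured speed back to the parent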
    pipe_recv, pipe_send = mp.Pipe(duplex=False)
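    # Each worker receives its index, the parsed args, and the pipe's write end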
    processes = [mp.Process(target=benchmark_forward, args=(i, args, pipe_send)) for i in range(args.n_processes)]
    for proc in processes:
        proc.start()
    for proc in processes:
        proc.join()
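
    # Report the mean of the per-process speeds (tokens/sec)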
    speed = np.mean([pipe_recv.recv() for _ in range(args.n_processes)])
    logger.info(f"Final result: {speed=:.2f}")


@torch.inference_mode()
def benchmark_forward(process_idx, args, result_pipe):
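    """Load the model, run timed forward passes, and report the mean tokens/sec via result_pipe."""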
    model = AutoDistributedModel.from_pretrained(
        args.model,
        initial_peers=args.initial_peers,
        torch_dtype=DTYPE_MAP[args.torch_dtype],
    )
    logger.info(f"Created model: {process_idx=} {model.device=}")
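
    # Fixed seed so every process draws identical random inputs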
    torch.manual_seed(42)
    step_times = []
    for step in range(args.warmup_steps + args.n_steps):
        start_time = perf_counter()
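
        # Random token IDs stand in for real text: shape (batch_size, seq_len)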
        input_ids = torch.randint(0, model.config.vocab_size, size=(args.batch_size, args.seq_len))

        logger.info(f"{process_idx=} Fwd begin {input_ids.shape=}")
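        # Forward pass through the distributed model; the transformer blocks run on remote Petals servers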
        h = model(input_ids)
        # We don't use model.lm_head
        logger.info(f"{process_idx=} Fwd end")
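
        # Warmup steps are excluded from the measurement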
        if step >= args.warmup_steps:
            step_times.append(perf_counter() - start_time)
            speed = input_ids.numel() / np.mean(step_times)
            logger.info(f"{process_idx=} {step=} {speed=:.2f}")

    result_pipe.send(speed)


if __name__ == "__main__":
    main()