Add benchmarks to readme (#284)

Alexander Borzunov 1 year ago committed by GitHub
parent 793726b041
commit 8dab37c1a9

@ -1,7 +1,7 @@
<p align="center">
<img src="" width="400"><br>
Run 100B+ language models at home, BitTorrent-style.<br>
Fine-tuning and inference up to 10x faster than offloading<br><br>
Fine-tuning and inference <a href="">up to 10x faster</a> than offloading<br><br>
<a href=""><img src=""></a><br>
@ -83,8 +83,8 @@ Learning more:
## How does it work?
- Petals runs large language models like [BLOOM-176B]( **collaboratively** — you load a small part of the model, then team up with people serving the other parts to run inference or fine-tuning.
- Inference runs at ≈ 1 sec per step (token) — 10x faster than possible with offloading, enough for chatbots and other interactive apps. Parallel inference reaches hundreds of tokens/sec.
- Beyond classic language model APIs — you can employ any fine-tuning and sampling methods by executing custom paths through the model or accessing its hidden states. You get the comforts of an API with the flexibility of PyTorch.
- Single-batch inference runs at ≈ 1 sec per step (token) — [up to 10x faster]( than offloading, enough for [chatbots]( and other interactive apps. Parallel inference reaches hundreds of tokens/sec.
- Beyond classic language model APIs — you can employ any fine-tuning and sampling methods, execute custom paths through the model, or see its hidden states. You get the comforts of an API with the flexibility of PyTorch.
<p align="center">
<img src="" width="800">
@ -98,61 +98,106 @@ Learning more:
## Installation
Here's how to install Petals with conda:
Here's how to install Petals with [Anaconda]( on Linux:
conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -U petals
This script uses Anaconda to install CUDA-enabled PyTorch.
If you don't have anaconda, you can get it from [here](
If you don't want anaconda, you can install PyTorch [any other way](
If you want to run models with 8-bit weights, please install **PyTorch with CUDA 11** or newer for compatibility with [bitsandbytes](
__System requirements:__ Petals only supports Linux for now. If you don't have a Linux machine, consider running Petals in Docker (see our [image]( or, in case of Windows, in WSL2 ([read more]( CPU is enough to run a client, but you probably need a GPU to run a server efficiently.
## 🛠️ Development
Petals uses pytest with a few plugins. To install them, run:
conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
git clone && cd petals
pip install -e .[dev]
To run minimalistic tests, you need to make a local swarm with a small model and some servers. You may find more information about how local swarms work and how to run them in [this tutorial](
export MODEL_NAME=bloom-testing/test-bloomd-560m-main
python -m petals.cli.run_server $MODEL_NAME --block_indices 0:12 \
--identity tests/ --host_maddrs /ip4/ --new_swarm &> server1.log &
sleep 5 # wait for the first server to initialize DHT
python -m petals.cli.run_server $MODEL_NAME --block_indices 12:24 \
--initial_peers SEE_THE_OUTPUT_OF_THE_1ST_PEER &> server2.log &
tail -f server1.log server2.log # view logs for both servers
Then launch pytest:
export MODEL_NAME=bloom-testing/test-bloomd-560m-main REF_NAME=bigscience/bloom-560m
export INITIAL_PEERS=/ip4/
PYTHONPATH=. pytest tests --durations=0 --durations-min=1.0 -v
After you're done, you can terminate the servers and ensure that no zombie processes are left with `pkill -f petals.cli.run_server && pkill -f p2p`.
The automated tests use a more complex server configuration that can be found [here](
### Code style
We use [black]( and [isort]( for all pull requests.
Before committing your code, simply run `black . && isort .` and you will be fine.
If you don't use Anaconda, you can install PyTorch in [any other way]( If you want to run models with 8-bit weights, please install PyTorch with CUDA 11.x or newer for compatibility with [bitsandbytes](
See the instructions for macOS and Windows, the full requirements, and troubleshooting advice in our [FAQ](
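The CUDA requirement for 8-bit weights can be checked before starting a server. The snippet below is a minimal sketch, not part of Petals: `meets_cuda_requirement` is a hypothetical helper that only compares the major version of a CUDA version string against 11. It assumes PyTorch is installed for the final check, where `torch.version.cuda` reports the CUDA version the build was compiled with (`None` for CPU-only builds). It does not verify that bitsandbytes itself works on your GPU.

```python
from typing import Optional


def meets_cuda_requirement(cuda_version: Optional[str], min_major: int = 11) -> bool:
    """Hypothetical helper: True if the CUDA major version is at least min_major."""
    if not cuda_version:  # CPU-only PyTorch builds report None
        return False
    major = int(cuda_version.split(".")[0])
    return major >= min_major


try:
    import torch

    print("CUDA build:", torch.version.cuda,
          "| meets requirement:", meets_cuda_requirement(torch.version.cuda))
except ImportError:
    print("PyTorch is not installed")
```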
## ⏱️ Benchmarks
<table align="center">
<th colspan="2">Network</th>
<th colspan="2">Single-batch inference<br>(steps/s)</th>
<th colspan="2">Parallel forward<br>(tokens/s)</th>
<th rowspan="2">Bandwidth</th>
<th rowspan="2">Round-trip<br>latency</th>
<th colspan="2">Sequence length</th>
<th colspan="2">Batch size</th>
<tr align="center">
<th colspan="6">Offloading, max. possible speed on 1x A100 <sup>1</sup></th>
<tr align="center">
<td>256 Gbit/s</td>
<tr align="center">
<td>128 Gbit/s</td>
<th colspan="6">Petals on 14 heterogeneous servers across Europe and North America <sup>2</sup></th>
<tr align="center">
<td colspan="2">Real world</td>
<th colspan="6">Petals on 3 servers, with one A100 each <sup>3</sup></th>
<tr align="center">
<td>1 Gbit/s</td>
<td>&lt; 5 ms</td>
<tr align="center">
<td>100 Mbit/s</td>
<td>&lt; 5 ms</td>
<tr align="center">
<td>100 Mbit/s</td>
<td>100 ms</td>
<sup>1</sup> **An upper bound for offloading performance.** We base our offloading numbers on the best possible hardware setup for offloading: CPU RAM offloading via PCIe 4.0 with 16 PCIe lanes per GPU and PCIe switches for pairs of GPUs. We assume zero latency for the upper bound estimation. In 8-bit, the model uses 1 GB of memory per billion parameters, so 176B parameters occupy 176 GB. PCIe 4.0 with 16 lanes has a throughput of 256 Gbit/s (32 GB/s), so offloading 176B parameters takes 5.5 seconds. The throughput is halved (128 Gbit/s) if two GPUs share the same PCIe switch.
<sup>2</sup> **A real-world distributed setting** with 14 servers holding 2× RTX 3060, 4× 2080Ti, 2× 3090, 2× A4000, and 4× A5000 GPUs. These are personal servers and servers from university labs, spread across Europe and North America and connected to the Internet at speeds of 100–1000 Mbit/s. Four servers operate from behind firewalls.
<sup>3</sup> **An optimistic setup** that requires the least communication. The client nodes have 8 CPU cores and no GPU.
We provide more evaluations and discuss these results in more detail in **Section 3.3** of our [paper](
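The offloading upper bound in footnote 1 is plain arithmetic, and it can be reproduced in a few lines. This is a sketch under the footnote's own assumptions (8-bit weights at 1 byte per parameter, zero latency, every parameter transferred once per step); the function name `offload_seconds_per_step` is made up for illustration.

```python
def offload_seconds_per_step(params_billion: float,
                             bandwidth_gbit_s: float,
                             bytes_per_param: float = 1.0) -> float:
    """Time to stream all weights over the bus once, ignoring latency."""
    gigabytes = params_billion * bytes_per_param  # 8-bit: 1 GB per billion params
    gigabits = gigabytes * 8
    return gigabits / bandwidth_gbit_s


# 256 Gbit/s: one GPU per PCIe 4.0 x16 link; 128 Gbit/s: two GPUs behind one switch
for bandwidth in (256, 128):
    seconds = offload_seconds_per_step(176, bandwidth)
    print(f"{bandwidth} Gbit/s: {seconds:.1f} s per step, {1 / seconds:.2f} steps/s")
```

Under these assumptions, the 256 Gbit/s case gives 5.5 seconds per step and the 128 Gbit/s case gives 11 seconds per step.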
## 🛠️ Contributing
Please see our [FAQ]( on contributing.
## 📜 Citation
