
# PETALS: Collaborative Inference of Large Models

Run BLOOM-176B, the largest open language model, by collaborating over the Internet.

__[EARLY PROTOTYPE]__ - this project is a work in progress. Stuff breaks and gets fixed every day. Docs are nonexistent.
If you want us to wake you up when it's ready, click Watch -> Custom and tick "Releases".

Roadmap: [__Issue #12__](https://github.com/learning-at-home/bloom-demo/issues/12)

### Installation

```bash
conda install -y -c conda-forge cudatoolkit-dev==11.3.1 cudatoolkit==11.3.1 cudnn==8.2.1.32
pip install torch==1.12.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
pip install -i https://test.pypi.org/simple/ bitsandbytes-cuda113
```
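
To confirm that the pinned packages above are visible in the active environment, a small check like the following may help. It is a sketch, not part of this repo: `check_environment` is a hypothetical helper, and it assumes the `bitsandbytes-cuda113` package is importable under the module name `bitsandbytes`.

```python
import importlib.util


def check_environment():
    """Report which of the packages installed above are visible to Python."""
    report = {"torch": None, "cuda": None, "bitsandbytes": False}
    if importlib.util.find_spec("torch") is not None:
        import torch

        # With the pin above, the version string should start with "1.12.0".
        report["torch"] = torch.__version__
        # CUDA version torch was built against, or None for CPU-only builds.
        report["cuda"] = torch.version.cuda
    # Only look the module up; importing it may fail on a machine without a GPU.
    report["bitsandbytes"] = importlib.util.find_spec("bitsandbytes") is not None
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

A `cuda` value other than `11.3` would suggest that a different torch build was picked up than the one requested above.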


### Basic functionality

All tests are run on localhost.

First, run one or more servers like this:
```bash
# minimal server with untrained BLOOM blocks
python -m cli.run_server --converted_model_name_or_path bigscience/test-bloomd-6b3 \
  --block_indices 3:5 --torch_dtype float32 --identity_path ./server1.id --host_maddrs /ip4/127.0.0.1/tcp/31337
# when running multiple servers:
# - give each server a unique --identity_path (or remove the --identity_path arg when debugging)
# - if running multiple servers on the same machine, give each a unique port (last integer in --host_maddrs, 0 means random port)
# - when running over the internet, change --host_maddrs according to https://learning-at-home.readthedocs.io/en/latest/user/dht.html#running-across-the-internet
# - each server except the first should have --initial_peers pointing to one of the pre-existing servers
```
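
The notes on running multiple servers can be sketched as a small helper that assembles one command per server. This is only an illustration: `server_command` is a hypothetical function, the flags are the ones shown above, and the address passed as `initial_peer` still has to be copied from the first server's output.

```python
import shlex


def server_command(index, port=0, initial_peer=None, block_indices="3:5"):
    """Build the argument list for the index-th local server.

    port=0 asks for a random port, as described in the notes above.
    """
    args = [
        "python", "-m", "cli.run_server",
        "--converted_model_name_or_path", "bigscience/test-bloomd-6b3",
        "--block_indices", block_indices,
        "--torch_dtype", "float32",
        # unique identity file per server
        "--identity_path", f"./server{index}.id",
        # unique port per server when they share a machine
        "--host_maddrs", f"/ip4/127.0.0.1/tcp/{port}",
    ]
    if initial_peer is not None:
        # every server except the first joins through an existing one
        args += ["--initial_peers", initial_peer]
    return args


if __name__ == "__main__":
    first = server_command(1, port=31337)
    second = server_command(
        2, port=31338,
        initial_peer="TODO_COPY_FULL_ADDRESS_FROM_ANY_OF_THE_SERVERS",
    )
    print(shlex.join(first))
    print(shlex.join(second))
```

The printed lines can be pasted into separate shells. Start the first server before the second, since the second needs the first server's full address.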

Then open a Python notebook or console and run:
```python
import torch
import hivemind
from src import DistributedBloomConfig, get_remote_module


dht = hivemind.DHT(
    initial_peers=[TODO_COPY_FULL_ADDRESS_FROM_ANY_OF_THE_SERVERS],  # e.g. /ip4/127.0.0.1/...
    client_mode=True, start=True,
)
config = DistributedBloomConfig.from_pretrained("bigscience/test-bloom-6b3")
layer3, layer4 = get_remote_module(dht, ['bigscience/test-bloomd-6b3.3', 'bigscience/test-bloomd-6b3.4'], config)
assert layer3 is not None and layer4 is not None, "one or both layers were not found in DHT"
# test forward/backward, two blocks
outputs = layer4(layer3(torch.randn(1, 64, 4096)))
loss = (outputs * torch.randn_like(outputs)).norm()
loss.backward()

# test inference, one block
with layer3.inference_session(max_length=10) as sess:
    for i in range(10):
        res = sess.step(torch.ones(1, 1, 4096))
```
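
The block names passed to `get_remote_module` above follow the pattern `<model prefix>.<block index>`. The helper below is a hypothetical sketch of how they relate to the server's `--block_indices 3:5` argument. It assumes the range is half-open like a Python slice, which matches the example requesting blocks 3 and 4 but is not confirmed here.

```python
def block_uids(prefix, block_indices):
    """Expand a "start:end" range into block names of the form "<prefix>.<index>".

    Assumption: the end index is excluded, as in a Python slice.
    """
    start, end = (int(part) for part in block_indices.split(":"))
    return [f"{prefix}.{index}" for index in range(start, end)]


if __name__ == "__main__":
    print(block_uids("bigscience/test-bloomd-6b3", "3:5"))
    # → ['bigscience/test-bloomd-6b3.3', 'bigscience/test-bloomd-6b3.4']
```

Requesting a block that no running server hosts is the case the `assert ... is not None` line in the snippet above guards against.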


### Convert regular BLOOM into distributed
```bash
# convert a model from the HF hub to the distributed format (can take hours depending on your connection!)
MY_WRITE_TOKEN=TODO_WRITE_TOKEN_FROM_https://huggingface.co/settings/token
python -m cli.convert_model --model bigscience/bloom-6b3  \
  --output_path ./converted_model --output_repo bigscience/test-bloomd-6b3 \
  --use_auth_token $MY_WRITE_TOKEN  # ^-- todo replace output repo with something you have access to
```


### Test local vs remote block (allclose)

To test distributed inference, run one or more servers, then open a new shell and run pytest with environment variables:
```bash
# shell A: serve model
python -m cli.run_server --converted_model_name_or_path bigscience/test-bloomd-6b3 \
  --torch_dtype float32 --identity_path ./server1.id --host_maddrs /ip4/127.0.0.1/tcp/31337

# shell B:
export PYTHONPATH=.
export INITIAL_PEERS="/ip4/TODO_COPY_INITIAL_PEERS_FROM_SERVER_OUTPUT"
export MODEL_NAME="bigscience/test-bloomd-6b3"

# test individual random blocks for exact match
pytest tests/test_block_exact_match.py

# test the full model
pytest tests/test_full_model.py
```
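
The "allclose" in this section's title refers to comparing local and remote outputs within a tolerance rather than bit for bit. The sketch below shows the elementwise criterion used by `torch.allclose`, written in plain Python over flat lists. The test files themselves are not shown here, so the tolerances they use are unknown; the defaults below are those of `torch.allclose`.

```python
def allclose(actual, expected, rtol=1e-5, atol=1e-8):
    """Return True if every pair satisfies |a - e| <= atol + rtol * |e|."""
    if len(actual) != len(expected):
        return False
    return all(
        abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )


if __name__ == "__main__":
    local_out = [0.5, -1.25, 3.0]
    remote_out = [0.5, -1.25, 3.0 + 1e-7]  # tiny numerical drift
    print(allclose(remote_out, local_out))     # → True
    print(allclose([0.5, -1.25, 3.1], local_out))  # → False
```

The relative term `rtol * |e|` lets the allowed error scale with the magnitude of the expected value.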