Commit Graph

23 Commits (c4938bc23efe22e3ab6d638261bfd56c6ad807a9)

Author SHA1 Message Date
justheuristic c4938bc23e
Merge inference pools into one to increase inference speed (#225)
It turns out that using a separate pool for each block led to a significant slowdown; see #224 for details.
1 year ago
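
A minimal sketch of the merged-pool idea, using hypothetical names (`MergedPool`, `block_id`) rather than the actual Petals classes: a single pool receives tasks for all blocks and dispatches by block id, instead of running one pool (and one worker thread) per block.

```python
from queue import Queue
from threading import Thread

class MergedPool:
    """One shared pool for all blocks, instead of a pool (and thread) per block."""

    def __init__(self, backends):
        self.backends = backends  # maps block_id -> callable that runs the block
        self.queue = Queue()
        Thread(target=self._run, daemon=True).start()

    def submit(self, block_id, inputs, result_queue):
        self.queue.put((block_id, inputs, result_queue))

    def _run(self):
        while True:
            block_id, inputs, result_queue = self.queue.get()
            result_queue.put(self.backends[block_id](inputs))
```
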
Alexander Borzunov af3da5bb04
Choose --num_blocks automatically for all models (#217) 1 year ago
Alexander Borzunov 6b12b0d050
Report server version and dht.client_mode in rpc_info(), check for updates on startup (#209)
This PR:

1. Shows the current Petals version and checks for updates on startup.
2. Reports the current version and DHT mode in `rpc_info()`, so it can be shown on http://health.petals.ml or used by clients for efficient routing.
1 year ago
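
A hedged sketch of how such a startup update check can work against the standard PyPI JSON API (the actual Petals implementation may differ; `CURRENT_VERSION` is a placeholder):

```python
import requests
from packaging.version import parse

CURRENT_VERSION = "1.0.0"  # placeholder for the installed petals.__version__

def check_for_updates():
    try:
        info = requests.get("https://pypi.org/pypi/petals/json", timeout=5).json()["info"]
        if parse(info["version"]) > parse(CURRENT_VERSION):
            print(f"A newer version {info['version']} is available; consider upgrading.")
    except Exception:
        pass  # never fail startup because the check itself failed
```
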
justheuristic 771ca590e7
Add service checking direct reachability from peers (#195)
Servers joining from behind NATs/firewalls usually take several minutes to connect to a libp2p relay before they become accessible from the outside Internet. Moreover, requests to such servers are slower and more likely to fail (e.g., if the server switches relays at that moment). If such servers host certain DHT keys, the swarm may occasionally lose read/write access to these keys, which results in:

- Clients being unable to find any servers hosting a certain block.
- All servers starting rebalancing to the same place to close the alleged "gap" in the swarm.

This PR modifies servers so that DHT keys are only hosted on **directly reachable** servers (the ones that aren't behind a NAT/firewall). This way, the DHT becomes more stable and works faster. Of course, the servers behind NATs/firewalls still accept requests for running inference/forward/backward for the blocks they hold (it's more acceptable for this kind of request to be slower or to fail).

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
1 year ago
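
The gist, as a hedged sketch (the function and flag names below are hypothetical, and real hivemind DHT calls differ): relayed servers keep serving compute requests but skip announcing DHT keys.

```python
import time

def maybe_announce_blocks(dht, block_keys, server_info, is_directly_reachable):
    """Host DHT keys only on directly reachable servers (illustration only).

    Servers behind NATs/firewalls keep serving inference/forward/backward,
    but skip announcing, so DHT reads/writes stay fast and reliable.
    """
    if not is_directly_reachable:
        return  # clients will find these blocks via reachable replicas instead
    for key in block_keys:
        dht.store(key, server_info, expiration_time=time.time() + 300)
```
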
Alexander Borzunov a617ce3cfa
Fix psutil-related AccessDenied crash, disable --load_in_8bit by default in case of TP (#188)
* Don't count open fds since it leads to AccessDenied crashes on some machines
* Use --load_in_8bit=False by default in case of tensor parallelism
* Install petals from PyPI in fine-tuning tutorials
1 year ago
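
The fd-counting crash and its fix reduce to this sketch: `psutil` raises `AccessDenied` on some machines when inspecting a process's open files, so the check is best treated as optional.

```python
import psutil

def count_open_files():
    try:
        return len(psutil.Process().open_files())
    except psutil.AccessDenied:
        return None  # some systems forbid this introspection; skip the check
```
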
Egiazarian Vage 93bed7da5a
Support libp2p relays for NAT traversal (#186)
- Added relay options to servers
- Enabled relay options by default
- Changed hivemind version to 1.1.5
- Moved reachability check to be performed after blocks are loaded

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
1 year ago
justheuristic ae9e71fe8e
Add local tensor-parallel fwd/bwd (#143)
This pull request adds an option to run a Petals server on multiple local GPUs. It uses https://github.com/BlackSamorez/tensor_parallel

- 8bit approximation error same as in main (mean~=2% q0.9~=5%)
    - TP=1, 2, 3 (see screenshots above)
- forward, grad w.r.t. input and inference exact match with main with TP=1
- `>=`80% GPU utilization with 3x 1080ti, batch = 8 tokens
- throughput measured with and without TP
- TP on 1080Tis has near-linear speedup comparable to the benchmarks (see first message)


Co-authored-by: Iaroslav Lisniak <yalisnyak@nes.ru>
Co-authored-by: Andrei Panferov <andrei@blacksamorez.ru>
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
1 year ago
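
Per the linked library's README, wrapping a model for local tensor parallelism looks roughly like this (a sketch with a small model; Petals applies the same idea to its server-side blocks):

```python
import tensor_parallel as tp
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
# Shard the weights across two local GPUs; fwd/bwd then run on both shards in parallel
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])
```
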
Alexander Borzunov 668b736031
Fix logging: do not duplicate lines, enable colors in Colab (#156) 1 year ago
Alexander Borzunov 041ad20891
Check reachability automatically and give advice how to fix it (#155)
1. If we connect to the **public swarm**, the server now **automatically checks its DHT's reachability** from the outside world using the API at http://health.petals.ml This is important to prevent unreachable servers from proceeding (they create issues for clients, such as repeated retries).

    If http://health.petals.ml is down, the server proceeds without the check (so we don't depend on it). However, if health.petals.ml is up and explicitly tells us that we are unreachable, the server shows the reason and how to fix it.

    The check may be disabled with the `--skip_reachability_check` option (though I can't imagine cases where someone needs to use it).

2. Added `--port` and `--public_ip` as **simplified convenience options** for users not familiar with `--host_maddrs` and `--announce_maddrs`.
1 year ago
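
A hedged sketch of that check; the endpoint path and response fields below are hypothetical, since the real health.petals.ml API isn't shown here:

```python
import requests

def check_reachability(peer_id, skip_reachability_check=False):
    if skip_reachability_check:
        return
    try:
        # Hypothetical endpoint/fields; the actual health.petals.ml API may differ
        resp = requests.get(f"http://health.petals.ml/api/v1/is_reachable/{peer_id}", timeout=10).json()
    except Exception:
        return  # health.petals.ml is down: proceed without the check
    if not resp.get("success"):
        raise RuntimeError(f"Server unreachable: {resp.get('message', 'unknown reason')}")
```
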
Alexander Borzunov 73df69a117
Reset MemoryCache during rebalancings (#154)
Before this PR, if there were open inference sessions right when a rebalancing was triggered, their cache was never properly destroyed.
1 year ago
Alexander Borzunov 701ec7e53e
Clean up disk space (#152) 1 year ago
justheuristic b04982c1a2
Bump transformers to 4.25.1 (#151)
- latest accelerate, transformers, huggingface_hub
- rearrange attention caches to support https://github.com/huggingface/transformers/pull/18344
- remove unused code
- fix edge case where session crashes when receiving seq length 0
- assert transformer version when importing WrappedBloomBlock

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
1 year ago
Alexander Borzunov e4dc938dfe
Fix OOMs during server rebalancing (#150)
The cause of the OOMs was the cyclic reference `TransformerBackend <-> PrioritizedTaskPool`, which could not be garbage collected properly. Still, I've added explicit tensor removal just in case.
1 year ago
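
The failure mode and the "explicit removal" mitigation, as a sketch with hypothetical class names: objects caught in a reference cycle are only freed by the cyclic garbage collector, so tensors they hold can outlive a rebalancing; dropping the tensors explicitly frees the memory immediately instead of waiting for gc.

```python
import gc
import torch

class Pool:
    def __init__(self, backend):
        self.backend = backend      # Pool -> Backend

class Backend:
    def __init__(self):
        self.weights = torch.zeros(1024, 1024)  # stands in for a GPU tensor
        self.pool = Pool(self)      # Backend -> Pool: a reference cycle

def shutdown(backend):
    backend.weights = None  # explicit removal: the tensor is freed right away
    gc.collect()            # also reclaim the cycle itself for good measure
```
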
Alexander Borzunov 83d9493b6c
Improve block size calculations (#149) 1 year ago
Alexander Borzunov e99bf36647
Use common folder for all caches, make it a volume in Dockerfile (#141) 1 year ago
Alexander Borzunov 66f1799d32
Set default --step_timeout to 5 min (#133) 1 year ago
Alexander Borzunov 9dbf5e2e6f
Set dht.num_workers = n_layer, update_period = 150, expiration = 300 (#125) 1 year ago
Max Ryabinin 3ca8b4f082
Fix typos with codespell (#126) 1 year ago
Alexander Borzunov fc6722576b
Choose --num_blocks for bigscience/bloom-petals automatically (#119) 1 year ago
Alexander Borzunov 643a054170
Make server use smart defaults (#115)
Summary:

```python
parser.add_argument('--attn_cache_size', type=str, default=None,
                    help='The size of GPU memory allocated for storing past attention keys/values between inference steps. '
                         'Examples: 500MB, 1.2GB, 1073741824 (bytes). Note that 1KB != 1KiB here. '
                         'Default: 0.5GiB * num_blocks * hidden_size / 14336. '
                         'The latter is the hidden size of the bigscience/bloom-petals model.')

parser.add_argument('--request_timeout', type=float, required=False, default=3 * 60,
                    help='Timeout (in seconds) for the whole rpc_forward/rpc_backward/rpc_forward_stream/rpc_backward_stream request')
parser.add_argument('--session_timeout', type=float, required=False, default=30 * 60,
                    help='Timeout (in seconds) for the whole inference session')
parser.add_argument('--step_timeout', type=float, required=False, default=60,
                    help="Timeout (in seconds) for waiting the next step's inputs inside an inference session")

parser.add_argument('--load_in_8bit', type=bool, default=None,
                    help="Convert the loaded model into mixed-8bit quantized model. Default: True if GPU is available")
```
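For example, a server hosting 10 blocks of `bigscience/bloom-petals` (hidden_size = 14336) gets the default attention cache size of 0.5 GiB × 10 × 14336 / 14336 = 5 GiB.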

Co-authored-by: justheuristic <justheuristic@gmail.com>
1 year ago
Alexander Borzunov 1ea44b0d3c
Measure throughput for different configs, devices, and dtypes separately (#114) 1 year ago
Alexander Borzunov 43ac6016ac
Fix dtypes in backend schemas (#99)
Currently, the schemas use `torch.float32`, so all inputs and outputs are converted to float32 before sending and after receiving, on both servers and clients. This creates a huge slowdown for the system.

* This PR makes the schemas use the server's `--torch_dtype` argument (default is `torch.bfloat16` for BLOOM-176B)
* Adds an option for the client to request a specific output compression. Use case 1: the client sends quantized inputs and expects quantized outputs in return. Use case 2: the client uses quantization for gradients w.r.t. activations, but keeps grads w.r.t. __prompts__ as is for greater precision.
* Adds a comment explaining the purpose of NoSpendingPolicy, since we likely won't have it for the workshop
* Adds a test with custom compression (a janky implementation, for testing purposes only)

Co-authored-by: justheuristic <justheuristic@gmail.com>
1 year ago
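
The size of the slowdown is easy to see: float32 doubles the bytes on the wire relative to bfloat16. A quick illustration (the shapes are arbitrary):

```python
import torch

x = torch.randn(1, 2048, 14336)   # float32 activations: ~117.4 MB
y = x.to(torch.bfloat16)          # bfloat16: ~58.7 MB, half the traffic
print(x.nelement() * x.element_size() / 1e6,
      y.nelement() * y.element_size() / 1e6)
```
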
Alexander Borzunov 7bd5916744
Make Petals a pip-installable package (attempt 2) (#102)
1. Petals can now be installed using `pip install git+https://github.com/bigscience-workshop/petals`
    - If you have already cloned the repo, you can do `pip install .` or `pip install .[dev]`
2. Moved `src` => `src/petals`
    - Replaced `from src.smth import smth` with `from petals.smth import smth`
3. Moved `cli` => `src/petals/cli`
    - Replaced `python -m cli.run_smth` with `python -m petals.cli.run_smth` (all utilities are now available right after pip installation)
4. Moved the `requirements*.txt` contents to `setup.cfg` (`requirements.txt` for packages is not well supported by modern packaging utils)
5. Increased the package version from `0.2` to `1.0alpha1`
1 year ago