This PR avoids OOMs when processing long sequences, which happen due to the huge attention logit matrices.
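A minimal sketch of the general idea, assuming the fix is to process long prompts in fixed-size chunks so the attention logit matrix never covers the full sequence at once (the chunk size and helper names below are illustrative, not the PR's actual code):

```python
import torch

MAX_CHUNK_LEN = 1024  # illustrative limit, not the PR's actual constant

def prefill_in_chunks(block, hidden_states: torch.Tensor):
    """Run an attention block over a long prompt chunk by chunk.

    Attention logits for one chunk have shape [batch, heads, chunk_len, past_len],
    so peak memory grows with chunk_len * seq_len instead of seq_len ** 2.
    `block` is any callable taking (chunk, past_key_values) and returning
    (output_chunk, updated_past_key_values), e.g. a wrapped transformer layer.
    """
    outputs, past = [], None
    for start in range(0, hidden_states.shape[1], MAX_CHUNK_LEN):
        chunk = hidden_states[:, start : start + MAX_CHUNK_LEN]
        out, past = block(chunk, past)  # later chunks attend to earlier ones via `past`
        outputs.append(out)
    return torch.cat(outputs, dim=1)
```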
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
The value is chosen as a safe default below the average speed reported at https://health.petals.dev/
Note that if a server uses relays, the effective throughput will be further divided by 2 (see #399).
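A rough sketch of how these two rules could combine into an effective throughput estimate (function and constant names are illustrative, not the actual server code):

```python
# Safe default below the average speed reported at https://health.petals.dev/;
# used to estimate network_rps when the actual link speed cannot be measured.
DEFAULT_NETWORK_SPEED_MBITS = 25

def effective_throughput(compute_rps: float, network_rps: float, uses_relay: bool) -> float:
    """Throughput is limited by the slower of compute and network; relayed servers get half (see #399)."""
    throughput = min(compute_rps, network_rps)
    if uses_relay:
        throughput /= 2  # traffic is forwarded through a relay node, roughly halving usable bandwidth
    return throughput
```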
* Hide `GeneratorExit` in `_iterate_inference_steps()`
* Update README.md about `--public_name`
* Use `.from_pretrained(..., use_auth_token=token)` instead of `token=token` until the latter is fully supported across HF libs (see the sketch after this list)
* Use default network speed 25 Mbit/s
* Apply relay penalty in max-throughput routing
* Replace RPS with "tokens/sec per block" in logs
* Increase default expiration
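For the token-passing item above, the change amounts to keeping the older keyword that current HF releases understand (the model name and token below are placeholders):

```python
from transformers import AutoModelForCausalLM

token = "hf_..."  # placeholder Hugging Face access token
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",   # placeholder model name
    use_auth_token=token,      # older kwarg, accepted across current HF libs
    # token=token,             # newer kwarg, not yet supported everywhere
)
```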
Since the `petals.ml` DNS record is still unavailable, we're switching everything to https://petals.dev
Co-authored-by: Aleksandr Borzunov <hxrussia@gmail.com>
This PR:
1. **Adds shortest-path routing for inference.** We build a graph with client-server and server-server latencies and compute costs, as well as empirically measured overheads. For client-server latencies, we ping possible first and last servers in a sequence in `SequenceManager.update()`. We penalize servers that may not have enough cache for our request (a toy sketch of the routing step is shown after this list). This uses info added to the DHT in #355, #356, #358.
2. **Makes a server ping neighboring servers in addition to the next ones.** This gives the client an opportunity to switch servers even before it uses all of a server's blocks (e.g., because a neighboring server is faster). This feature is not enabled yet, since it increases the graph size for N servers to O(N^2), but we may enable it if needed.
3. **Fixes a `SequenceManager` bug with the first `update()`.** Previously, this update was likely to produce incorrect information and cause `MissingBlocksError`s until the next update happened.
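A toy sketch of the routing idea from item 1, assuming edge weights that combine latency and per-block compute cost (server names and numbers are made up; this is not the actual `SequenceManager` code):

```python
import heapq

# Edge weight = client<->server or server->server latency plus the estimated
# compute time of the blocks that edge covers (all numbers are made up).
graph = {
    "client_start": {"server_A[0:16]": 0.05, "server_B[0:8]": 0.02},
    "server_A[0:16]": {"client_end": 0.30},
    "server_B[0:8]": {"server_C[8:16]": 0.04},
    "server_C[8:16]": {"client_end": 0.15},
}

def shortest_path(graph, src, dst):
    """Dijkstra over the span graph: returns (total_cost, path)."""
    queue, seen = [(0.0, src, [src])], set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []

print(shortest_path(graph, "client_start", "client_end"))
# -> chooses client_start -> server_B[0:8] -> server_C[8:16] -> client_end (total cost ≈ 0.21)
```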
Inference RPS may be very different from forward RPS. E.g., bnb currently uses a completely different algorithm for NF4 inference. We report detailed RPS info that can then be used for shortest-path routing for inference.
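For illustration, the reported throughput record could look roughly like this (field names and numbers are made up, not the exact DHT schema):

```python
# Illustrative shape of the per-server throughput info (not the exact DHT schema)
throughput_info = {
    "forward_rps": 550.0,    # tokens/sec per block for forward/backward passes
    "inference_rps": 120.0,  # tokens/sec per block for autoregressive inference (e.g., the NF4 path in bnb)
    "network_rps": 900.0,    # limit imposed by network bandwidth
}
```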
Currently, each `TransformerBackend.inference_step` looks for adapters and sets the correct adapter type for each block. This is not very expensive, but it can measurably affect inference time.
This pull request implements faster adapter switching with just one variable assignment, without iterating over `block.modules()`.
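A schematic before/after of the idea (attribute and helper names are illustrative, not the actual Petals internals):

```python
# Before: walk every submodule on each inference step to activate the requested adapter
def set_active_adapter_slow(block, adapter_name):
    for module in block.modules():
        if hasattr(module, "active_adapter"):
            module.active_adapter = adapter_name

# After: flip a single flag that the block's forward pass reads
def set_active_adapter_fast(block, adapter_name):
    block.active_adapter = adapter_name  # one assignment, no iteration over block.modules()
```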
We want to use `3.10.x` since `grpcio-tools` is not compatible with 3.11 yet. However, `python~=3.10` means `python>=3.10, python<4.0`, so we ended up with a broken build because Python 3.11 got installed.
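The difference is easy to check with the `packaging` library:

```python
from packaging.specifiers import SpecifierSet

print("3.11.0" in SpecifierSet("~=3.10"))    # True: ~=3.10 means >=3.10, <4.0
print("3.11.0" in SpecifierSet("==3.10.*"))  # False: pins the 3.10.x series
```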
Implement an option to deploy PEFT adapters to a server. Clients can set `active_adapter=...` to use these adapters.
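A hypothetical client-side usage sketch (the model and adapter names are placeholders, and the exact argument placement may differ in the real API):

```python
from petals import AutoDistributedModelForCausalLM

# Hypothetical: ask servers that host this LoRA adapter to apply it to our requests
model = AutoDistributedModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",          # placeholder model name
    active_adapter="my-org/my-lora",  # placeholder adapter repo
)
```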
---------
Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: justheuristic <justheuristic@gmail.com>