463 Commits (amd-gpus)
 

Author SHA1 Message Date
Alexander Borzunov 1ba721d51e
Merge branch 'main' into amd-gpus 9 months ago
Alexander Borzunov 2a150770a4
Prefer longer servers for fine-tuning, exclude unreachable (#448)
We choose longer servers to minimize the number of hops but leave some randomization to distribute the load. We also exclude servers known to be unreachable.
9 months ago
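A minimal sketch of the selection policy described in #448, not the actual Petals code. It assumes each server announces a contiguous span of blocks and weights candidates by how many blocks they can still serve from the current position. The `Span` class, `choose_span`, `make_route`, and the exact weighting are hypothetical names and choices made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical record: a server holding blocks [start, end)."""
    peer_id: str
    start: int
    end: int
    reachable: bool = True


def choose_span(spans, block_index, rng):
    """Pick a reachable server that holds `block_index`, favoring longer spans."""
    candidates = [
        s for s in spans if s.reachable and s.start <= block_index < s.end
    ]
    if not candidates:
        raise LookupError(f"no reachable server holds block {block_index}")
    # Assumed weighting: the number of blocks the server can serve from here on.
    # Longer servers are more likely, but shorter ones still get some traffic.
    weights = [s.end - block_index for s in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


def make_route(spans, num_blocks, rng=None):
    """Build a chain of (peer_id, start, end) hops covering blocks [0, num_blocks)."""
    rng = rng or random.Random()
    route, block = [], 0
    while block < num_blocks:
        span = choose_span(spans, block, rng)
        end = min(span.end, num_blocks)
        route.append((span.peer_id, block, end))
        block = end
    return route
```

With spans `a` = [0, 4), `b` = [0, 8), and `c` = [4, 8), the route starts at `b` about two thirds of the time (weight 8 versus 4), which yields a single hop, while `a` still receives part of the load.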
Alexander Borzunov 00d48dcbe1
Override float32 in config to bfloat16 (#431) 9 months ago
justheuristic ac9b546706
[Refactor] extract block forward, backward and inference into a separate file (#435)
This PR does not change any functionality. It merely moves stuff around.
List of changes:

- handler.py/_rpc_forward became block_methods/rpc_forward
- handler.py/_rpc_backward became block_methods/rpc_backward
- the math bits of rpc_inference were extracted into block_methods/iterate_rpc_inference

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: artek0chumak <artek.chumak@gmail.com>
Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>
9 months ago
Alexander Borzunov 593d980ad8
Use bitsandbytes 0.41.1 (#442) 9 months ago
Alexander Borzunov 32fbab5192
Remove deprecated comment in fine-tuning notebook (#443) 9 months ago
Aleksandr Borzunov fe3b8d6e66 Append .amd to reported version 9 months ago
Alexander Borzunov b58141ef66
Remove distracting links from readme (#441) 9 months ago
Alexander Borzunov 679397df0c
Update Discord links from channels to forums (#440)
As our Discord community grows, we have found it difficult to look for open and resolved issues in the **#running-a-client** and **#running-a-server** channels, as well as to navigate the interleaved conversations happening there. That's why we recreated these channels as Discord forums, where different discussions are separated into different posts.
9 months ago
Vadim Peretokin d0b5af34cd
Fix typo and make blocks message more informative (#437)
The message really doesn't tell me much as a user, since I never touched update_period to begin with:

```
Aug 06 09:43:07.287 [WARN] [petals.server.server.run:701] Declaring blocs to DHT takes more than --update_period, consider increasing it
```

Made it better and more informative.
9 months ago
Aleksandr Borzunov b6e31c6d0f Fix "import peft" in tests 9 months ago
Aleksandr Borzunov d8298faa00 Remove --adapters from tests 9 months ago
Aleksandr Borzunov 753f8df594 Don't use NF4 default 9 months ago
Aleksandr Borzunov 203a1b3a24 Use bitsandbytes-rocm 9 months ago
Aleksandr Borzunov 6b38bc89ef Remove peft dependency for AMD GPUs 9 months ago
Alexander Borzunov a1f7791d5e
Fix petals.utils.ping for servers with client-mode DHT (#430)
Fix #429.
10 months ago
Alexander Borzunov 351e96bc46
Penalize servers that use relays during rebalancing (#428)
Servers accessible only via relays may introduce issues if they are the only type of servers holding certain blocks. Specifically, a connection to such servers may be unstable or opened after a certain delay.

This PR lowers their self-reported throughput, so that the rebalancing algorithm prefers directly available servers for hosting each block.
10 months ago
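To illustrate why lowering the reported throughput in #428 steers rebalancing, here is a rough sketch. It assumes a simplified rebalancing rule in which a joining server picks the contiguous window of blocks with the lowest total announced throughput. The penalty value `RELAY_PENALTY` and the helper names are hypothetical, and the real algorithm may differ.

```python
RELAY_PENALTY = 0.5  # hypothetical factor, chosen only for this example


def block_throughputs(spans, num_blocks):
    """Sum announced throughput per block; spans are (start, end, throughput)."""
    totals = [0.0] * num_blocks
    for start, end, throughput in spans:
        for block in range(start, end):
            totals[block] += throughput
    return totals


def weakest_start(totals, window):
    """Start of the window with the lowest total throughput (first one on ties)."""
    starts = range(len(totals) - window + 1)
    return min(starts, key=lambda s: sum(totals[s:s + window]))


# Both servers measure 10 units, but the second one is reachable only via relay.
direct = (0, 3, 10.0)
relayed = (3, 6, 10.0 * RELAY_PENALTY)
totals = block_throughputs([direct, relayed], num_blocks=6)
```

Here `weakest_start(totals, 3)` returns 3, so a newly joined, directly available server would take the blocks currently held only by the relayed server. Without the penalty all blocks would look equally well served.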
Alexander Borzunov 6a1b8a6a90
Add Stable Beluga 2 to readme (#424) 10 months ago
Alexander Borzunov 44fefa5e54
Add connect_timeout (#423) 10 months ago
Alexander Borzunov cdc0f70653
Add Discord badge and more Discord links to readme (#422) 10 months ago
Guocheng 8072cd9d1b
Fix stale link (#418) 10 months ago
Alexander Borzunov f3fafd14a4
Bump version to 2.0.1 (#411) 10 months ago
Alexander Borzunov fd19c21859
Update --update_period and --expiration defaults (#410) 10 months ago
Alexander Borzunov ffb20b585c
Update commands for hosting Llama 2 in readme (#409) 10 months ago
Alexander Borzunov 48c6b6d963
Update README.md (#407) 10 months ago
Alexander Borzunov c153cba1fa
Add Llama 2, WSL instructions to readme (#406) 10 months ago
justheuristic 5af04524dd
Split long sequences into chunks (#403)
This PR is designed to avoid OOMs caused by the huge attention logit matrices that arise when processing long sequences.

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
10 months ago
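The idea behind #403 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code from the PR: it uses single-head causal attention where the same vectors serve as queries, keys, and values, and the names `attend`, `chunked_attention`, and `max_chunk_length` are hypothetical.

```python
import math


def attend(queries, keys, values, offset):
    """Causal softmax attention for queries that start at position `offset`.

    In a batched tensor implementation the logits would form a
    len(queries) x len(keys) matrix, which is what grows too large.
    """
    outputs = []
    for i, query in enumerate(queries):
        visible = offset + i + 1  # a position attends to itself and earlier ones
        logits = [sum(a * b for a, b in zip(query, key)) for key in keys[:visible]]
        top = max(logits)
        weights = [math.exp(logit - top) for logit in logits]
        total = sum(weights)
        dim = len(values[0])
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values[:visible])) / total
            for d in range(dim)
        ])
    return outputs


def chunked_attention(xs, max_chunk_length):
    """Process the sequence chunk by chunk against a growing key/value cache."""
    cache, outputs = [], []
    for start in range(0, len(xs), max_chunk_length):
        chunk = xs[start:start + max_chunk_length]
        cache.extend(chunk)  # keys/values seen so far, including this chunk
        # Only max_chunk_length query rows are handled per step.
        outputs.extend(attend(chunk, cache, cache, offset=start))
    return outputs
```

Chunking bounds the number of query rows handled at once by `max_chunk_length`, while the result matches processing the whole sequence in one call.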
Alexander Borzunov 30b94ef18b
If speedtest fails, assume network speed of 100 Mbit/s (#404)
This is a conservative value, chosen to be below the average reported at https://health.petals.dev/

Note that if a server uses relays, the effective throughput will be further divided by 2 (see #399).
10 months ago
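The fallback described in #404 might look roughly like this. It is a sketch only: the function name, the `run_speedtest` callable, and the assumption that the relay division applies to measured speeds as well as to the default are hypothetical. The 100 Mbit/s default and the division by 2 come from the commit message.

```python
DEFAULT_NETWORK_MBPS = 100.0  # fallback from the commit message
RELAY_DIVISOR = 2.0           # relayed servers get half the effective throughput


def effective_network_mbps(run_speedtest, using_relay):
    """Return the network speed to report, in Mbit/s.

    `run_speedtest` is any zero-argument callable that returns Mbit/s
    or raises an exception on failure (hypothetical interface).
    """
    try:
        mbps = float(run_speedtest())
    except Exception:
        # Speedtest failed: assume a safe default instead of aborting startup.
        mbps = DEFAULT_NETWORK_MBPS
    if using_relay:
        mbps /= RELAY_DIVISOR
    return mbps
```

A server whose speedtest fails while it is reachable only through a relay would therefore report 50 Mbit/s under these assumptions.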
Alexander Borzunov 8666653cf5
Fix routing through relay, default network RPS, --token, logging, readme (#399)
* Hide GeneratorExit in _iterate_inference_steps()
* Update README.md about `--public_name`
* Use .from_pretrained(..., use_auth_token=token) instead of token=token until it's fully supported across HF libs
* Use default network speed 25 Mbit/s
* Apply relay penalty in max-throughput routing
* Replace RPS with "tokens/sec per block" in logs
* Increase default expiration
10 months ago
Alexander Borzunov eb0664b993
Support Python 3.11 (#393) 10 months ago
Alexander Borzunov 6e4ebb94d2
Fix deadlocks in MemoryCache (#396)
- Fix deadlocks in MemoryCache
- Set default --alloc_timeout to 1 until the MemoryCache update
10 months ago
Alexander Borzunov b6b3ae964f
Fix --attn_cache_tokens default (#392) 10 months ago
Alexander Borzunov d49d9ad0cf
Bump version to 2.0.0.post3 (#391) 10 months ago
justheuristic e51e84631d
Update to petals.dev (#390)
Since `petals.ml` DNS record is still unavailable, we're switching everything to https://petals.dev

Co-authored-by: Aleksandr Borzunov <hxrussia@gmail.com>
10 months ago
Aleksandr Borzunov ddcda02b06 Hardcode IPs until DNS issues get resolved 10 months ago
Alexander Borzunov b1ff8bdd6c
Bump version to 2.0.0.post1 (#384) 10 months ago
Alexander Borzunov e9a20e7e53
Require accelerate>=0.20.3 as transformers do (#383) 10 months ago
Alexander Borzunov 057a2fb5de
Support Llama 2 (#379) 10 months ago
Alexander Borzunov 3218534745
Fix --token arg (#378) 10 months ago
justheuristic 398a384075
Inherit bitsandbytes compute dtype correctly (override peft quirk) (#377) 10 months ago
justheuristic 5a8de2f1f8
Fix handler memory leak, get rid of mp.Manager (#373)
This PR removes the memory leak from somewhere within handler.py that has something to do with mp.SyncManager.
10 months ago
Alexander Borzunov 895327a0ae
Fix readme code example, require Python < 3.11 until supported (#374)
* Fix readme code example

* Require Python < 3.11 until it's supported
10 months ago
Alexander Borzunov c735dd7ba3
Update transformers to 4.31.0 and peft to 0.4.0 (#371) 10 months ago
justheuristic 1ab35c2826
Typo in inference_session.py 10 months ago
Alexander Borzunov a6fdfc0556
Fix AssertionError on rebalancing (#370) 10 months ago
Alexander Borzunov f97582fb5f
Require transformers < 4.31.0 until we're compatible (#369) 10 months ago
Alexander Borzunov 3b300c32e4
Update readme to show new models (#365) 10 months ago
Alexander Borzunov 62d9ed5ce7
Implement shortest-path routing for inference (#362)
This PR:

1. **Adds shortest path routing for inference.** We build a graph with client-server and server-server latencies and compute costs, as well as empirically measured overheads. For client-server latencies, we ping possible first and last servers in a sequence in `SequenceManager.update()`. We penalize servers who may not have enough cache for our request. This uses info added to DHT in #355, #356, #358.

2. **Makes a server ping neighboring servers in addition to next ones.** This is to get an opportunity to change the server even before we use all its blocks (e.g., because a neighboring server is faster). This feature is not enabled though, since it increases graph size for N servers to O(N^2) - but we may enable it if needed.

3. **Fixes a `SequenceManager` bug with the first `update()`.** Previously, this update was likely to produce incorrect information and cause `MissingBlocksErrors` until the next update.
10 months ago
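The routing idea in item 1 of #362 can be illustrated with a small Dijkstra search. This is a simplified sketch, not the Petals implementation: graph nodes are block indices, each hop costs one server's latency plus per-block compute time, and a flat penalty is added for servers that may lack cache. The `Server` fields, `shortest_route`, and the `cache_penalty` value are hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """Hypothetical server record holding blocks [start, end)."""
    name: str
    start: int
    end: int
    latency: float         # seconds to reach this server (assumed measured by ping)
    time_per_block: float  # seconds of compute per block
    has_cache: bool = True


def shortest_route(servers, num_blocks, cache_penalty=10.0):
    """Return (total_cost, [(server_name, start, end), ...]) for blocks [0, num_blocks)."""
    best = {0: 0.0}
    prev = {}
    heap = [(0.0, 0)]
    while heap:
        cost, block = heapq.heappop(heap)
        if block == num_blocks:
            break
        if cost > best.get(block, float("inf")):
            continue  # stale queue entry
        for server in servers:
            if not (server.start <= block < server.end):
                continue
            hop = server.latency + (0.0 if server.has_cache else cache_penalty)
            # We may leave a server before using all of its blocks.
            for nxt in range(block + 1, min(server.end, num_blocks) + 1):
                new_cost = cost + hop + (nxt - block) * server.time_per_block
                if new_cost < best.get(nxt, float("inf")):
                    best[nxt] = new_cost
                    prev[nxt] = (block, server.name)
                    heapq.heappush(heap, (new_cost, nxt))
    if num_blocks not in best:
        raise LookupError("no chain of servers covers all blocks")
    route, block = [], num_blocks
    while block != 0:
        start, name = prev[block]
        route.append((name, start, block))
        block = start
    route.reverse()
    return best[num_blocks], route
```

In this model, two nearby servers that each hold half of the blocks can beat a single distant server that holds all of them, and a server without enough cache is avoided unless no alternative exists.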
Ikko Eltociear Ashimine fd30f7ce10
Fix typo in generation_algorithms.py (#364) 10 months ago
Alexander Borzunov 11f0d992d7
Report inference, forward, and network RPS separately (#358)
Inference RPS may be very different from forward RPS. E.g., currently bnb uses a completely different algorithm for NF4 inference. We report detailed RPS info that can be then used for shortest-path routing for inference.
10 months ago