petals

Commit Graph

Author	SHA1	Message	Date
Alexander Borzunov	5ce4f1a159	Store (start_block, end_block) in each DHT record for reliability (#510 ) This PR fixes gaps in the DHT server info caused by unavailable DHT keys. Now, one DHT key is enough to get info about all blocks hosted by a server - so we'll see info until all keys are unavailable. Also, this PR refactors `petals.client.routing` and `petals.server.block_selection` modules to use the common `compute_spans()` function (defined in `petals.utils.dht`) and `RemoteSpanInfo` class (defined in `petals.data_structures`).	8 months ago
Alexander Borzunov	063e94b4c8	Move SequenceManagerConfig -> ClientConfig, petals.dht_utils -> petals.utils.dht (#463 )	9 months ago
Artem Chumachenko	568f21dc3b	Add customizable input tensors (#445 )	9 months ago
Alexander Borzunov	329f7d31e8	Add `blocked_servers` argument (#462 ) Should be used as: ```python model = AutoDistributedModelForCausalLM(model_name, blocked_servers=[peer_id1, peer_id2]) ```	9 months ago
Alexander Borzunov	2a150770a4	Prefer longer servers for fine-tuning, exclude unreachable (#448 ) We choose longer servers to minimize the number of hops but leave some randomization to distribute the load. We also exclude servers known to be unreachable.	10 months ago
Alexander Borzunov	351e96bc46	Penalize servers that use relays during rebalancing (#428 ) Servers accessible only via relays may introduce issues if they are the only type of servers holding certain blocks. Specifically, a connection to such servers may be unstable or opened after a certain delay. This PR changes their self-reported throughput, so that the rebalancing algorithm prefers to put directly available servers for hosting each block.	10 months ago
Alexander Borzunov	44fefa5e54	Add connect_timeout (#423 )	10 months ago
Alexander Borzunov	8666653cf5	Fix routing through relay, default network RPS, --token, logging, readme (#399 ) * Hide GeneratorExit in _iterate_inference_steps() * Update README.md about `--public_name` * Use .from_pretrained(..., use_auth_token=token) instead of token=token until it's fully supported across HF libs * Use default network speed 25 Mbit/s * Apply relay penalty in max-throughput routing * Replace RPS with "tokens/sec per block" in logs * Increase default expiration	10 months ago
justheuristic	e51e84631d	Update to petals.dev (#390 ) Since `petals.ml` DNS record is still unavailable, we're switching everything to https://petals.dev Co-authored-by: Aleksandr Borzunov <hxrussia@gmail.com>	10 months ago
justheuristic	398a384075	Inherit bitsandbytes compute dtype correctly (override peft quirk) (#377 )	10 months ago
Alexander Borzunov	62d9ed5ce7	Implement shortest-path routing for inference (#362 ) This PR: 1. Adds shortest path routing for inference. We build a graph with client-server and server-server latencies and compute costs, as well as empirically measured overheads. For client-server latencies, we ping possible first and last servers in a sequence in `SequenceManager.update()`. We penalize servers that may not have enough cache for our request. This uses info added to DHT in #355, #356, #358. 2. Makes a server ping neighboring servers in addition to next ones. This is to get an opportunity to change the server even before we use all its blocks (e.g., because a neighboring server is faster). This feature is not enabled though, since it increases graph size for N servers to O(N^2) - but we may enable it if needed. 3. Fixes a `SequenceManager` bug with the first `update()`. Previously, this update was likely to produce incorrect information and cause `MissingBlocksErrors` until the next update.	10 months ago
Alexander Borzunov	11f0d992d7	Report inference, forward, and network RPS separately (#358 ) Inference RPS may be very different from forward RPS. E.g., currently bnb uses a completely different algorithm for NF4 inference. We report detailed RPS info that can be then used for shortest-path routing for inference.	10 months ago
Artem Chumachenko	b9f0a5467f	Support peft LoRA adapters (#335 ) Implement an option to deploy PEFT adapters to a server. Clients can set active_adapter=... to use these adapters. --------- Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com> Co-authored-by: justheuristic <justheuristic@gmail.com>	10 months ago
Alexander Borzunov	158013a671	Implement direct server-to-server communication (#331 ) Implement #226.	10 months ago
Alexander Borzunov	cb3f018f9f	Add LLaMA support (#323 ) This PR: 1. Abolishes the model conversion procedure. Now, models are downloaded directly from original repositories like https://huggingface.co/bigscience/bloom. Servers download only shards with blocks to be hosted, and clients download only shards with input/output embeddings and layernorms. - BLOOM is loaded from `bigscience/bloom`, but we use the DHT prefix `bigscience/bloom-petals` for backward compatibility. Same with smaller BLOOMs and BLOOMZ. - LLaMA can be loaded from any repo like `username/llama-65b-hf`, but we use the DHT prefix `llama-65b-hf` (without the username) to accommodate blocks from different repos (there are a few of them with minor differences, such as `Llama` vs. `LLaMA` in the class name). 2. Refactors the client to generalize it for multiple models. Now, we have `petals.models` packages that contain model-specific code (e.g. `petals.models.bloom`, `petals.models.llama`). General code (e.g. CPU-efficient LM head, p-tuning) is kept in `petals.client`. 3. Introduces `WrappedLlamaBlock`, `DistributedLlamaConfig`, `DistributedLlamaForCausalLM`, `DistributedLlamaForSequenceClassification`, and `DistributedLlamaModel` compatible with Petals functionality (p-tuning, adapters, etc.). 4. Introduces `AutoDistributedConfig` that automatically chooses the correct config class (`DistributedLlamaConfig` or `DistributedBloomConfig`). The refactored configs contain all model-specific info for both clients and servers. Upgrade instructions: - Remove disk caches for blocks in old (converted) format to save disk space. That is, remove `~/.cache/petals/model--bigscience--bloom-petals` and `~/.cache/petals/model--bigscience--bloomz-petals` directories (if present).	11 months ago
Max Ryabinin	3e7ae5116d	Remove unused imports and attributes (#324 ) * Remove unused imports and attributes	11 months ago
Alexander Borzunov	6137b1b4b0	Replace .make_sequence(..., mode="random") with mode="max_throughput" (#313 ) We need to sample the next server using its throughput as the weight to actually achieve max throughput for fine-tuning. As an example, imagine a situation where we have 3 servers with throughputs [1000, 500, 1] hosting the same blocks, then compare the uniform and weighted sampling strategies.	1 year ago
Alexander Borzunov	8f6342a861	Refactor RemoteSequenceManager (#309 ) This PR: 1. Extracts `SequenceManagerConfig` and `SequenceManagerState` subclasses. The config is provided by the caller and never changed from inside `RemoteSequenceManager`. The state is a part of the `RemoteSequenceManager`'s state shared between the main manager and its slices. We fix some slicing bugs along the way. 2. Removes `dht_prefix` and `p2p` arguments, makes `dht` argument optional. `dht_prefix` can always be overridden using `config.dht_prefix`. `p2p` is actually needed only under the hood of `RemoteSequenceManager`, so it can extract it by itself without exposing this low-level class to callers. If strictly necessary, a caller can provide `p2p` as a part of `SequenceManagerState`. `dht` is also needed only by `RemoteSequenceManager`, so we can make it optional in the parent classes and create it automatically when it's not provided. 3. Simplifies retry logic. Previously, we could have "nested" retry loops: one in `._update()`, another in inference/forward/backward steps. The loop in `._update()` could introduce issues to concurrent inference/forward/backward calls, since it blocks the entire class if its delay period becomes too high. Now this logic is simplified: `._update()` performs only one attempt to fetch the DHT info, any retries are triggered by the inference/forward/backward steps. 4. Removes deprecated `RemoteTransformerBlock`. `RemoteTransformerBlock` was deprecated a long time ago, before Petals 1.0.0. Its removal is long overdue. 5. Removes `dht_utils.get_remote_module()`, `dht_utils.get_remote_sequence()`. These functions duplicate the functionality of the `RemoteSequential` constructor. 6. (minor) Removes `RemoteSequential.is_subsequence` flag. This flag worked incorrectly and was never used. I am removing it for the sake of simplicity.	1 year ago
Alexander Borzunov	21c3526ec1	Start SequenceManager's thread only after first .make_sequence() (#301 ) Why? - We'd like to avoid excess threads for the original sequence manager in case we only use its slices (e.g. when we add adapters or need only a subset of model blocks). - If we create a sequence manager just before a fork (e.g. in a web app backend or a multi-thread benchmark), we'd like to avoid excess threads in the original process and only use this thread in child processes where we actually call `.make_sequence()`.	1 year ago
Alexander Borzunov	a2e7f27a5a	Improve "connect your GPU" message (#266 )	1 year ago
Alexander Borzunov	fee19e9b9b	Use get_logger(__name__) instead of get_logger(__file__) (#265 )	1 year ago
Alexander Borzunov	55e7dc07a0	Limit max delay between retries to 15 min (#264 )	1 year ago
Alexander Borzunov	9954cb84fe	Add `allowed_servers`, `max_retries` options to the client, improve logs (#235 )	1 year ago
Alexander Borzunov	5ff250bee9	Improve errors in case of missing blocks, suggest to join your own server (#212 )	1 year ago
justheuristic	012f840f7e	Use length-weighted sampling in routing for inference (#204 ) This pull-request implements a simple (1) greedy (2) latency-agnostic routing optimization that should speed up both our use cases. Why this exists: our effort to merge full routing (ping-aware, throughput-aware, dijkstra) is in a sorry state between several branches; merging it into main would take many days. Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>	1 year ago
Alexander Borzunov	b4f3224cda	Make client ignore blacklist if all servers holding a block are blacklisted (#197 ) If all servers holding a certain block are blacklisted, we should display errors from them instead of raising `No peers holding blocks`. Indeed, if the error is client-caused, the client should learn its reason from the latest error messages. In turn, if the error is server/network-caused and we only have a few servers, we'd better know the error instead of banning all the servers and making the user think that no servers are available.	1 year ago
Alexander Borzunov	668b736031	Fix logging: do not duplicate lines, enable colors in Colab (#156 )	1 year ago
justheuristic	b04982c1a2	Bump transformers to 4.25.1 (#151 ) - latest accelerate, transformers, huggingface_hub - rearrange attention caches to support https://github.com/huggingface/transformers/pull/18344 - remove unused code - fix edge case where session crashes when receiving seq length 0 - assert transformer version when importing WrappedBloomBlock Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com> Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>	1 year ago
Alexander Borzunov	84fec81543	Suppress asyncio error logs by default (#142 )	1 year ago
Alexander Borzunov	e1d8793f00	Show route on client (#139 )	1 year ago
Alexander Borzunov	1fe3716589	Don't ban servers in case of client-caused handler errors (#134 )	1 year ago
Alexander Borzunov	f56edaa13f	Fix inference and rpc_info() fault tolerance (#131 )	1 year ago
justheuristic	79a4308992	Clear trigger before engaging in update (#130 ) Update sequence_manager.py	1 year ago
justheuristic	68c85e7492	Avoid synchronous updates, ban peers based on request outcome (#127 ) - sequence_manager now takes care of its own updated-ness - no need to manually update it - if a peer fails a request, sequence manager will ban this peer temporarily. Ban times increase with failure streaks Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>	1 year ago
Max Ryabinin	3ca8b4f082	Fix typos with codespell (#126 )	1 year ago
justheuristic	a2066a4096	Optimize RemoteSequenceManager (#106 ) - [x] made RemoteSequenceManager into a background thread that pre-fetches information instead of running just in time - [x] moved routing-related stuff to petals.client.routing - [x] extract remote peer routing information to RemoteSequenceInfo - [x] made sure that the code survives continued use (e.g. one hour) - [x] updated every spot where update_ is called manually - [x] modified get_sequence to check that the thread is alive, warn if not - [x] removed max_retries, switched rpc_info to exponential backoff - [x] fixed a bug that causes RemoteSeq* to lose user-defined hyperparameters (e.g. timeout) upon subsequencing (sequential[3:5]) - [x] moved client-side points strategy to client.routing - [x] ensured that RemoteSequenceManager thread created in get_remote_module properly shuts down when the module is destroyed - [x] resolved minor affected todos - [x] modified tests to no longer use PYTHONPATH - [x] worked around protocol error in rpc_info Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com> Co-authored-by: Artem Chumachenko <artek.chumak@gmail.com>	1 year ago

36 Commits (5ce4f1a1598b1fca9fe6bd30cfbd85aa99bce2c7)
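
Commit 6137b1b4b0 (#313) asks the reader to compare uniform and throughput-weighted sampling for three servers with throughputs [1000, 500, 1] hosting the same blocks. The sketch below works through that comparison under a simplified load model that is an assumption of this example, not something stated in the commit: each server receives a share of the total traffic proportional to its sampling weight, and the system saturates as soon as any one server is overloaded. The helper names `sustainable_rate` and `sample_servers` are hypothetical and are not part of Petals.

```python
import random
from collections import Counter


def sustainable_rate(throughputs, weights):
    """Highest total request rate at which no server is overloaded.

    Assumed model: server i receives the fraction weights[i] / sum(weights)
    of all requests and can process throughputs[i] requests per second.
    """
    total = sum(weights)
    return min(t * total / w for t, w in zip(throughputs, weights) if w > 0)


def sample_servers(weights, k, seed=0):
    """Draw k server indices with the given weights (illustration only)."""
    rng = random.Random(seed)
    return Counter(rng.choices(range(len(weights)), weights=weights, k=k))


throughputs = [1000, 500, 1]  # values taken from the commit message

uniform_rate = sustainable_rate(throughputs, [1, 1, 1])
weighted_rate = sustainable_rate(throughputs, throughputs)

print("uniform sampling: ", uniform_rate)   # 3.0
print("weighted sampling:", weighted_rate)  # 1501.0
print("weighted picks:   ", sample_servers(throughputs, k=10_000))
```

Under this model, uniform sampling sends a third of the traffic to the slowest server, so it becomes the bottleneck at 3 requests per second in total, while weighting by throughput lets the three servers saturate together at 1501.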