This PR adds:
- Support for models based on `transformers.FalconModel` (the in-library format for Falcon). Tested on Falcon-40B.
- CI tests for Falcon-RW-1B.
- `--throughput dry_run` option to evaluate throughput and exit right away (implemented by @mryab).
Limitations:
- Backward pass support is broken for now, will be fixed in #500.
Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
This PR creates an index of models hosted in the swarm - it is useful to know which custom models users run and display them at https://health.petals.dev as "not officially supported" models.
This PR makes both clients and servers work on macOS. Specifically, it:
- Follows https://github.com/learning-at-home/hivemind/pull/586 to run a macOS-compatible `p2pd` binary (both x86-64 and ARM64 are supported)
- Fixes forking issues and tests on macOS, Python 3.10+
- Introduces basic support for serving model blocks on Apple M1/M2 GPUs (torch.mps)
- Increases max number of open files by default (it's not enough on Linux and is really small on macOS)
- rpc_inference: server will now accept allocation timeout from user, defaults to no timeout
- bugfix: inference timeout is now measured from the moment the request is received
- previously, you would have to wait for your timeout plus the time it takes to sort through the queue (other users' timeout)
- now, you get AllocationFailed if you had to wait for over (timeout) seconds - regardless of other users
- a request for inference with no timeout will now fail instantly if there is not enough memory available
- dtype number of bytes is now correctly determined for int, bool & other types
---------
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: Aleksandr Borzunov <hxrussia@gmail.com>
Right now, long inference requests may occupy Runtime for a few seconds without giving it away to process short (most latency-sensitive requests). This PR fixes it by disallowing the merged pool for long requests and prioritizing the short ones.
This PR extends CI to:
1. Test Llama code using [TinyLlama-v0](https://huggingface.co/Maykeye/TinyLLama-v0).
2. Test rebalancing (sets up a situation where the 1st server needs to change its original position).
3. Check if benchmark scripts run (in case someone breaks its code). Note that the benchmark results are meaningless here (since they're measured on a tiny swarm of CPU servers, with low `--n_steps`).
4. Test `petals.cli.run_dht`.
5. Increase swap space and watch free RAM (a common issue is that actions are cancelled without explanation if there's not enough RAM - so it's a useful reminder + debug tool).
6. Fix flapping tests for bloom-560m by increasing tolerance.
Other minor changes: fix `--help` messages to show defaults, fix docs, tune rebalancing constants.
The message really doesn't tell me much as a user, since I never touched update_period to begin with:
```
Aug 06 09:43:07.287 [WARN] [petals.server.server.run:701] Declaring blocs to DHT takes more than --update_period, consider increasing it
```
Made it better and more informative.
Servers accessible only via relays may introduce issues if they are the only type of servers holding certain blocks. Specifically, a connection to such servers may be unstable or opened after a certain delay.
This PR changes their self-reported throughput, so that the rebalancing algorithm prefers to put directly available servers for hosting each block.
This PR is designed to avoid OOMs when processing long sequences that happen due to the huge attention logits matrices.
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
* Hide GeneratorExit in _iterate_inference_steps()
* Update README.md about `--public_name`
* Use .from_pretrained(..., use_auth_token=token) instead of token=token
until it's fully supported across HF libs
* Use default network speed 25 Mbit/s
* Apply relay penalty in max-throughput routing
* Replace RPS with "tokens/sec per block" in logs
* Increase default expiration
This PR:
1. **Adds shortest path routing for inference.** We build a graph with client-server and server-server latencies and compute costs, as well as empirically measured overheads. For client-server latencies, we ping possible first and last servers in a sequence in `SequenceManager.update()`. We penalize servers who may not have enough cache for our request. This uses info added to DHT in #355, #356, #358.
2. **Makes a server ping neighboring servers in addition to next ones.** This is to get an opportunity to change the server even before we use all its blocks (e.g., because a neighboring server is faster). This feature is not enabled though, since it increases graph size for N servers to O(N^2) - but we may enable it if needed.
3. **Fixes a `SequenceManager` bug with the first `update()`.** Previously, this update was likely to produce incorrect information and cause to `MissingBlocksErrors` until the next update happens.
Inference RPS may be very different from forward RPS. E.g., currently bnb uses a completely different algorithm for NF4 inference. We report detailed RPS info that can be then used for shortest-path routing for inference.
Currently, each `TransformerBackend.inference_step` looks for adapters and sets the correct adapter type for each block. This is not very expensive, but it can measurably affect inference time.
This pull request uses faster adapter switching with just one variable assignment, without iterating over block.modules().
Implement an option to deploy PEFT adapters to a server. Clients can set active_adapter=... to use these adapters.
---------
Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: justheuristic <justheuristic@gmail.com>
This PR:
- Adds benchmark scripts for inference, forward pass, and full training step (e.g. used for experiments in our paper).
- Fixes bug with dtypes in `petals.DistributedBloomForSequenceClassification`.
- (minor refactor) Moves `DTYPE_MAP` to `petals.constants` as a useful constant.
This PR:
1. **Abolishes the model conversion procedure.** Now, models are downloaded directly from original repositories like https://huggingface.co/bigscience/bloom. Servers download only shards with blocks to be hosted, and clients download only shards with input/output embeddings and layernorms.
- BLOOM is loaded from `bigscience/bloom`, but we use the DHT prefix `bigscience/bloom-petals` for backward compatibility. Same with smaller BLOOMs and BLOOMZ.
- LLaMA can be loaded from any repo like `username/llama-65b-hf`, but we use the DHT prefix `llama-65b-hf` (without the username) to accomodate blocks from different repos (there're a few of them with minor differences, such as `Llama` vs. `LLaMA` in the class name).
2. **Refactors the client to generalize it for multiple models.** Now, we have `petals.models` packages that contain model-specific code (e.g. `petals.models.bloom`, `petals.models.llama`). General code (e.g. CPU-efficient LM head, p-tuning) is kept in `petals.client`.
3. **Introduces** `WrappedLlamaBlock`, `DistributedLlamaConfig`, `DistributedLlamaForCausalLM`, `DistributedLlamaForSequenceClassification`, and `DistributedLlamaModel` compatible with Petals functionality (p-tuning, adapters, etc.).
4. **Introduces** `AutoDistributedConfig` that automatically chooses the correct config class (`DistributedLlamaConfig` or `DistributedBloomConfig`). The refactored configs contain all model-specific info for both clients and servers.
Upgrade instructions:
- Remove disk caches for blocks in old (converted) format to save disk space. That is, remove `~/.cache/petals/model--bigscience--bloom-petals` and `~/.cache/petals/model--bigscience--bloomz-petals` directories (if present).
**Why?**
- We'd like to avoid excess threads for the original sequence manager in case if we only use its slices (e.g. when we add adapters or need only a subset of model blocks):
- If we create a sequence manager just before a fork (e.g. in a web app backend or a multi-thread benchmark), we'd like to avoid excess threads in the original process and only use this thread in child processes where we actually call `.make_sequence()`.
This PR fixes issues of #290:
- hivemind bfloat16 codec crashed on dummy tensors (with 0 elements), see https://github.com/learning-at-home/hivemind/pull/560 (this PR makes Petals depend on the latest hivemind version from the repo, it's temporary)
- transformers version check mismatched with the version allowed in `setup.cfg`
Also:
- This PR enables 8-bit by default for TP. Even though TP in 8-bit may be slower, we currently prefer to host more blocks to increase the network's stability.
This PR:
1. Shows the current Petals version and checks for updates on startup.
2. Reports the current version and DHT mode in `rpc_info()`, so it can be shown on http://health.petals.ml or used on clients for efficient routing.
Servers joining from behind NATs/firewalls usually take several minutes to join a libp2p relay before they become accessible from the outside Internet. Moreover, requests to such servers are slower and more likely to fail (e.g., if the server switches a relay at the moment). If such servers host certain DHT keys, the swarm may occasionally lose read/write access to these keys, which results in:
- Clients being unable to find any servers hosting a certain block.
- All servers starting rebalancing to the same place to close the alleged "gap" in the swarm.
This PRs modifies servers so that DHT keys are only hosted on **directly reachable** servers (the ones who aren't behind NAT/firewall). This way, DHT becomes more stable and works faster. Of course, trhe servers behind NATs/firewalls still accept requests for running inference/forward/backward for blocks they hold (it's more acceptable for this kind of requests to be slower or fail).
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
* Don't count open fds since it leads to AccessDenied crashes on some machines
* Use --load_in_8bit=False by default in case of tensor parallelism
* Install petals from PyPI in fine-tuning tutorials
- Added relay options to servers
- Enabled relay options by default
- Changed hivemind version to 1.1.5
- Moved reachability check to be performed after blocks are loaded
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>