**Why?**
- We'd like to avoid excess threads for the original sequence manager in case if we only use its slices (e.g. when we add adapters or need only a subset of model blocks):
- If we create a sequence manager just before a fork (e.g. in a web app backend or a multi-thread benchmark), we'd like to avoid excess threads in the original process and only use this thread in child processes where we actually call `.make_sequence()`.
`use_auto_relay=True` makes the libp2p daemon look for relays to become reachable if we are behind NAT/firewall. However, being reachable is not necessary for the Petals client, and we should not spend the relays' capacity on this.
For some reasons, right now 15 sec is not enough to connect to the bootstrap peers in the public swarm, as reported by multiple users and observed by me. Increasing it to 120 sec until we find the root cause of the issue.
This PR increases `request_timeout`, since the previous default of 30 sec is not enough for many use cases.
Previously, we kept the request timeout low since we assumed that the server could freeze on dial if the target peer is behind a firewall. However, apparently, it won't freeze because libp2p has its own [dial timeout](https://github.com/libp2p/go-libp2p/blob/v0.26.0/core/network/context.go#L11).
Before this PR, `model.generate()` returned one excess token when resuming generation with an existing (the last token of the previous session, `session.last_token_id`). This is an unexpected behavior not convenient for the downstream apps, so this PR changes it until it's too late.
This pull-request implements a simple (1) greedy (2) latency-agnostic routing optimization that should speed up both our use cases.
Why this exists: our effort to merge full routing (ping-aware, throughut-aware, dijkstra) is in a sorry state between several branches; merging it into main would take many days.
Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>
If all servers holding a certain block are blacklisted, we should display errors from them instead of raising `No peers holding blocks`.
Indeed, if the error is client-caused, the client should learn its reason from the latest error messages. In turn, if the error is server/network-caused and we only have a few servers, we'd better know the error instead of banning all the servers and making the user think that no servers are available.
- Added relay options to servers
- Enabled relay options by default
- Changed hivemind version to 1.1.5
- Moved reachability check to be performed after blocks are loaded
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
1. Added `from petals.client import *` to `petals/__init__.py`, so you can write just that:
```python
from petals import DistributedBloomForCausalLM
```
I didn't do the same with server, since its classes are supposed to by used by `petals.cli.run_server`, not end-users. Though it's still possible to do `from petals.server.smth import smth` if necessary.
2. Fixed one more logging issue: log lines from hivemind were shown twice due to a bug in #156.
3. Removed unused `runtime.py`, since the server actually uses `hivemind.moe.Runtime`, and `runtime.py` has no significant changes comparing to it.
* Add missing methods for SamplingAlgorithm, fix docstrings
* Add SamplingAlgorithm to _choose_sample_algorithm
* Add test_sampling
* Add a warning if sampling options were passed, but do_sample=False
* Skip the sampling test for now
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
- latest accelerate, transformers, huggingface_hub
- rearrange attention caches to support https://github.com/huggingface/transformers/pull/18344
- remove unused code
- fix edge case where session crashes when receiving seq length 0
- assert transformer version when importing WrappedBloomBlock
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
- sequence_manager now takes care for its own updated-ness - no need to manually update it
- if a peer fails a request, sequence manager will ban this peer temporarily. Ban times increase with failure streaks
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
- [x] made RemoteSequenceManager into a background thread that pre-fetches information instead of running just in time
- [x] moved routing-related stuff to petals.client.routing
- [x] extract remote peer routing information to RemoteSequenceInfo
- [x] made sure that the code survives continued use (e.g. one hour)
- [x] updated every spot where update_ is called manually
- [x] modified get_sequence to check that the thread is alive, warn if not
- [x] removed max_retries, switched rpc_info to exponential backoff
- [x] fixed a bg that causes RemoteSeq* to lose user-defined hyperparameters (e.g. timeout) upon subsequencing (sequential[3:5])
- [x] moved client-side points strategy to client.routing
- [x] ensured that RemoteSequenceManager thread created in get_remote_module properly shuts down when the module is destroyed
- [x] resolved minor affected todos
- [x] modified tests to no longer use PYTHONPATH
- [x] worked around protocol error in rpc_info
Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: Artem Chumachenko <artek.chumak@gmail.com>
Fixes:
- An exception while creating a model with `ptune/deep_ptune` and `low_cpu_mem_usage=True` (which is currently default).
- dtype mismatch between the prompts and the rest of the model in `.forward()`.
Currently, the schemas use `torch.float32`, so all inputs and outputs converted to float32 before sending and after receiving on both servers and clients. This creates a huge slowdown for the system.
* This PR makes the schemas use the server's `--torch_dtype` argument (default is `torch.bloat16` for BLOOM-176B)
* an option for client to request a specific output compression. Use case 1: client sends quantized inputs and expects quantized inputs in return. Use case 2: client uses quantization for gradients w.r.t. activations, but keeps grads w.r.t. __prompts__ as is for greater precision.
* a comment explaining the purpose of NoSpendingPolicy - since we likely won't have it for the workshop
* a test with custom compression (janky implementation for testing purposes)
Co-authored-by: justheuristic <justheuristic@gmail.com>
1. Petals can be now installed using `pip install git+https://github.com/bigscience-workshop/petals`
- In case if you already cloned the repo, you can do `pip install .` or `pip install .[dev]`
2. Moved `src` => `src/petals`
- Replaced `from src.smth import smth` with `from petals.smth import smth`
3. Moved `cli` => `src/petals/cli`
- Replaced `python -m cli.run_smth` with `python -m petals.cli.run_smth` (all utilities are now available right after pip installation)
4. Moved the `requirements*.txt` contents to `setup.cfg` (`requirements.txt` for packages is not supported well by modern packaging utils)
5. Increased the package version from `0.2` to `1.0alpha1`