1. Added `from petals.client import *` to `petals/__init__.py`, so you can write just this:
```python
from petals import DistributedBloomForCausalLM
```
I didn't do the same for the server, since its classes are supposed to be used by `petals.cli.run_server`, not by end users. Though it's still possible to do `from petals.server.smth import smth` if necessary.
2. Fixed one more logging issue: log lines from hivemind were shown twice due to a bug in #156.
3. Removed unused `runtime.py`, since the server actually uses `hivemind.moe.Runtime`, and `runtime.py` has no significant changes compared to it.
This pull request removes the custom speed_test code in favour of the speedtest-cli module.
This is necessary to ensure that random warnings / print-outs do not mess with our outputs.
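For reference, a minimal sketch of measuring throughput with the speedtest-cli module (the exact calls used by the server may differ):
```python
# Minimal sketch of using the speedtest-cli module; the server's actual usage may differ.
import speedtest

s = speedtest.Speedtest()
s.get_best_server()          # pick the closest test server
download_bps = s.download()  # both methods return bits per second
upload_bps = s.upload()
print(f"download: {download_bps / 1e6:.1f} Mbit/s, upload: {upload_bps / 1e6:.1f} Mbit/s")
```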
Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
1. If we connect to the **public swarm**, the server now **automatically checks its DHT's reachability** from the outside world using the API at http://health.petals.ml. This is important to prevent unreachable servers from proceeding (they cause issues for clients, such as repeated retries).
If http://health.petals.ml is down, the server proceeds without the check (so we don't depend on it). However, if health.petals.ml is up and explicitly tells us that we are unreachable, the server reports the reason and how to fix it.
The check may be disabled with the `--skip_reachability_check` option (though I can't imagine cases where someone needs to use it).
2. Added `--port` and `--public_ip` as **simplified convenience options** for users not familiar with `--host_maddrs` and `--announce_maddrs`.
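For intuition, here is a rough sketch of how these convenience options can be thought of as shortcuts for the multiaddrs (the function and exact formats are illustrative, not the actual implementation):
```python
# Illustrative sketch only: how --port / --public_ip roughly translate into
# --host_maddrs / --announce_maddrs. The function name and formats are assumptions.
from typing import List, Optional, Tuple

def convert_convenience_options(port: int, public_ip: Optional[str]) -> Tuple[List[str], Optional[List[str]]]:
    host_maddrs = [f"/ip4/0.0.0.0/tcp/{port}"]  # listen on all interfaces at the given port
    announce_maddrs = [f"/ip4/{public_ip}/tcp/{port}"] if public_ip else None  # advertise the public address
    return host_maddrs, announce_maddrs

print(convert_convenience_options(31337, "203.0.113.5"))
# (['/ip4/0.0.0.0/tcp/31337'], ['/ip4/203.0.113.5/tcp/31337'])
```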
* Add missing methods for SamplingAlgorithm, fix docstrings
* Add SamplingAlgorithm to _choose_sample_algorithm
* Add test_sampling
* Add a warning if sampling options were passed, but do_sample=False (see the sketch after this list)
* Skip the sampling test for now
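A minimal sketch of the warning from the list above (the function name and exact message are assumptions, not the actual client code):
```python
# Minimal sketch of the warning above; the name and message are assumptions.
import logging

logger = logging.getLogger(__name__)

def _warn_if_sampling_options_ignored(do_sample: bool, **sampling_options) -> None:
    passed = [name for name, value in sampling_options.items() if value is not None]
    if passed and not do_sample:
        logger.warning(f"Sampling options {passed} were passed, but do_sample=False, so they will be ignored")

_warn_if_sampling_options_ignored(do_sample=False, temperature=0.9, top_k=40, top_p=None)
```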
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
- update to the latest accelerate, transformers, and huggingface_hub
- rearrange attention caches to support https://github.com/huggingface/transformers/pull/18344
- remove unused code
- fix an edge case where the session crashes when receiving a sequence of length 0
- assert the transformers version when importing WrappedBloomBlock
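A rough sketch of what such an assert could look like; the pinned version and message here are assumptions, not the real code:
```python
# Rough sketch only: the required version and message are assumptions.
import transformers
from packaging import version

assert version.parse(transformers.__version__) >= version.parse("4.22.0"), (
    "WrappedBloomBlock expects a transformers version with the new attention cache layout "
    "(https://github.com/huggingface/transformers/pull/18344)"
)
```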
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
The OOMs were caused by the cyclic references `TransformerBackend <-> PrioritizedTaskPool`, which could not be garbage collected properly. Still, I've added explicit tensor removal just in case.
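A simplified illustration of the cycle and of the explicit cleanup (class internals here are assumptions, not the actual code):
```python
import torch

# Simplified illustration of the cycle and the explicit cleanup; internals are assumptions.
class PrioritizedTaskPool:
    def __init__(self, backend):
        self.backend = backend  # pool -> backend reference

class TransformerBackend:
    def __init__(self, module: torch.nn.Module):
        self.module = module
        self.inference_pool = PrioritizedTaskPool(self)  # backend -> pool closes the cycle

    def shutdown(self):
        # Explicitly free the large tensors instead of waiting for the GC,
        # which may be slow to collect objects caught in reference cycles.
        for param in self.module.parameters():
            param.data = torch.empty(0)
        self.inference_pool = None
```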
- sequence_manager now takes care of keeping itself up to date - no need to update it manually
- if a peer fails a request, the sequence manager will temporarily ban this peer; ban times increase with the failure streak
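A minimal sketch of such a ban policy with growing ban times (constants and structure are assumptions, not the actual sequence manager code):
```python
import time

# Minimal sketch of a ban policy with growing ban times; constants and names are assumptions.
class PeerBanlist:
    def __init__(self, base_ban_time: float = 15.0):
        self.base_ban_time = base_ban_time
        self.failure_streaks = {}  # peer_id -> number of consecutive failures
        self.banned_until = {}     # peer_id -> timestamp when the ban expires

    def report_failure(self, peer_id) -> None:
        streak = self.failure_streaks.get(peer_id, 0) + 1
        self.failure_streaks[peer_id] = streak
        # the ban time doubles with every consecutive failure
        self.banned_until[peer_id] = time.monotonic() + self.base_ban_time * 2 ** (streak - 1)

    def report_success(self, peer_id) -> None:
        self.failure_streaks.pop(peer_id, None)
        self.banned_until.pop(peer_id, None)

    def is_banned(self, peer_id) -> bool:
        return time.monotonic() < self.banned_until.get(peer_id, float("-inf"))
```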
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
Summary:
```python
parser.add_argument('--attn_cache_size', type=str, default=None,
                    help='The size of GPU memory allocated for storing past attention keys/values between inference steps. '
                         'Examples: 500MB, 1.2GB, 1073741824 (bytes). Note that 1KB != 1KiB here. '
                         'Default: 0.5GiB * num_blocks * hidden_size / 14336. '
                         'The latter is the hidden size of the bigscience/bloom-petals model.')
parser.add_argument('--request_timeout', type=float, required=False, default=3 * 60,
                    help='Timeout (in seconds) for the whole rpc_forward/rpc_backward/rpc_forward_stream/rpc_backward_stream request')
parser.add_argument('--session_timeout', type=float, required=False, default=30 * 60,
                    help='Timeout (in seconds) for the whole inference session')
parser.add_argument('--step_timeout', type=float, required=False, default=60,
                    help="Timeout (in seconds) for waiting the next step's inputs inside an inference session")
parser.add_argument('--load_in_8bit', type=bool, default=None,
                    help="Convert the loaded model into mixed-8bit quantized model. Default: True if GPU is available")
```
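To make the `--attn_cache_size` default concrete, here is the formula from the help string above as a small hypothetical helper:
```python
# Hypothetical helper reproducing the default formula from the help string above.
GiB = 1024 ** 3

def default_attn_cache_size(num_blocks: int, hidden_size: int) -> int:
    # 14336 is the hidden size of the bigscience/bloom-petals model, used as the reference point
    return int(0.5 * GiB * num_blocks * hidden_size / 14336)

# e.g. a server hosting 10 blocks of the full model (hidden_size=14336) reserves 5 GiB by default
print(default_attn_cache_size(10, 14336) / GiB)  # 5.0
```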
Co-authored-by: justheuristic <justheuristic@gmail.com>
- Linear8bitLt now supports pre-Turing GPUs by temporarily upcasting the quantized weights (see the sketch below).
- added a test for Linear8bitLt accuracy with the new fallback; the accuracy is similar to the real thing (slightly better due to the non-quantized A)
- performance is roughly halfway between the default mode and memory_efficient_backward
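Roughly, the fallback works as in the sketch below (a hand-wavy illustration of the upcast idea, not the actual bitsandbytes/petals kernel; the LLM.int8() scaling details are an assumption):
```python
import torch

# Hand-wavy sketch of the upcast fallback; the real code also handles weight layouts,
# outliers, and the backward pass, and the exact scaling here is an assumption.
def linear8bit_forward_with_upcast(A: torch.Tensor, CB: torch.Tensor, SCB: torch.Tensor) -> torch.Tensor:
    # CB: int8-quantized weight [out_features, in_features], SCB: per-row scales [out_features]
    weight = (CB.half() * SCB.unsqueeze(1) / 127).to(A.dtype)  # temporarily dequantize the weight
    return A @ weight.t()  # a plain matmul that works on pre-Turing GPUs; A itself stays non-quantized
```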
Alternatives considered:
- cupy - slow, casts to float internally
- triton - fast but extremely unstable: every third matmul attempt is a segfault
- bnb.functional.igemm (no lt) - "CuBLAS Error 8" on old GPUs
Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>