**Why?**
- We'd like to avoid excess threads for the original sequence manager in case we only use its slices (e.g., when we add adapters or need only a subset of model blocks).
- If we create a sequence manager just before a fork (e.g., in a web app backend or a multi-thread benchmark), we'd like to avoid excess threads in the original process and only start this thread in the child processes where we actually call `.make_sequence()` (see the sketch after this list).
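For illustration, here's a minimal sketch of the lazy-start pattern this PR follows (the class and everything except `.make_sequence()` are hypothetical names, not Petals' actual internals):

```python
import threading

# Illustrative sketch: the background thread is created only on the first
# .make_sequence() call, so slices and processes forked beforehand
# don't inherit idle threads.
class LazySequenceManager:
    def __init__(self, block_uids):
        self.block_uids = block_uids
        self._updater = None
        self._lock = threading.Lock()

    def __getitem__(self, span: slice) -> "LazySequenceManager":
        return LazySequenceManager(self.block_uids[span])  # slicing spawns no thread

    def make_sequence(self):
        with self._lock:
            if self._updater is None:  # start the thread on first real use only
                self._updater = threading.Thread(target=self._update_loop, daemon=True)
                self._updater.start()
        return list(self.block_uids)

    def _update_loop(self):
        ...  # the real manager would periodically refresh routing info here
```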
`use_auto_relay=True` makes the libp2p daemon look for relays to become reachable if we are behind a NAT/firewall. However, being reachable is not necessary for the Petals client, so we should not spend the relays' capacity on it.
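For context, a hedged sketch of how a client can opt out (it assumes `hivemind.DHT` forwards this flag to the underlying p2p daemon, which may vary across hivemind versions; `PUBLIC_INITIAL_PEERS` is from `petals.constants`):

```python
import hivemind
from petals.constants import PUBLIC_INITIAL_PEERS

# The client never needs to be dialed back, so it can skip relay discovery
# instead of occupying the relays' limited capacity.
dht = hivemind.DHT(
    initial_peers=PUBLIC_INITIAL_PEERS,
    client_mode=True,      # clients don't store DHT keys
    use_auto_relay=False,  # the flag this PR turns off for clients
    start=True,
)
```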
This PR fixes the issues reported in #290:
- The hivemind bfloat16 codec crashed on dummy tensors (with 0 elements); see https://github.com/learning-at-home/hivemind/pull/560. As a temporary measure, this PR makes Petals depend on the latest hivemind version from the repo.
- The transformers version check did not match the versions allowed in `setup.cfg`.
Also:
- This PR enables 8-bit by default for TP. Even though TP in 8-bit may be slower, we currently prefer to host more blocks to increase the network's stability.
- The new bitsandbytes supports both newer *and* older GPUs.
- The new hivemind ships an improved bfloat16 codec.
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
For some reason, right now 15 sec is not enough to connect to the bootstrap peers in the public swarm, as reported by multiple users and observed by me. This PR increases the timeout to 120 sec until we find the root cause of the issue.
This PR increases `request_timeout`, since the previous default of 30 sec is not enough for many use cases.
Previously, we kept the request timeout low since we assumed that the server could freeze on dial if the target peer is behind a firewall. However, apparently, it won't freeze because libp2p has its own [dial timeout](https://github.com/libp2p/go-libp2p/blob/v0.26.0/core/network/context.go#L11).
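For reference, a hedged sketch of overriding these defaults on the client side (`request_timeout` follows this PR's text; `daemon_startup_timeout`, the model class, and whether `from_pretrained` forwards these parameters exactly like this are assumptions):

```python
from petals import AutoDistributedModelForCausalLM

model = AutoDistributedModelForCausalLM.from_pretrained(
    "bigscience/bloom-petals",
    daemon_startup_timeout=120,  # wait longer for the bootstrap peers
    request_timeout=3 * 60,      # allow slow forward/backward/inference requests
)
```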
Before this PR, `model.generate()` returned one excess token when resuming generation with an existing session: it repeated the last token of the previous session, `session.last_token_id`. This is unexpected behavior that is inconvenient for downstream apps, so this PR changes it before it's too late.
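A hedged sketch of the new behavior (API as in the Petals docs; exact signatures and the model name may differ across versions):

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-petals")
model = AutoDistributedModelForCausalLM.from_pretrained("bigscience/bloom-petals")
inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]

with model.inference_session(max_length=64) as session:
    first = model.generate(inputs, session=session, max_new_tokens=5)
    # Before this PR, the next call also returned session.last_token_id
    # (the final token of `first`) as an extra leading token; after it,
    # only the newly generated tokens come back.
    more = model.generate(session=session, max_new_tokens=5)
```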
Even if the swarm seems to have at least 2 servers for each block, turning off one of these servers could break it. That's because once a server is turned off, others may move to a better position, creating significant downtime along the way. This PR prohibits switching blocks if it would make the swarm disjoint along the way.
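A minimal sketch of this kind of check (illustrative names, not the actual Petals logic):

```python
def can_release_blocks(my_blocks: set, blocks_by_server: dict) -> bool:
    """Allow switching only if every block we serve stays covered without us."""
    coverage = {}
    for blocks in blocks_by_server.values():
        for block in blocks:
            coverage[block] = coverage.get(block, 0) + 1
    # Our own server is included in the counts, so each of our blocks needs
    # at least 2 holders for the swarm to stay connected after we leave.
    return all(coverage.get(block, 0) >= 2 for block in my_blocks)
```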
This PR:
1. Shows the current Petals version and checks for updates on startup (a sketch of such a check is below).
2. Reports the current version and DHT mode in `rpc_info()`, so it can be shown on http://health.petals.ml or used by clients for efficient routing.
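A hedged sketch of such an update check (the PyPI endpoint and wording are illustrative assumptions, not necessarily what Petals does):

```python
import requests
from packaging.version import parse

import petals

def check_for_updates() -> None:
    info = requests.get("https://pypi.org/pypi/petals/json", timeout=5).json()["info"]
    latest, current = parse(info["version"]), parse(petals.__version__)
    print(f"Running Petals {current}")
    if current < latest:
        print(f"A newer version {latest} is available; consider upgrading")
```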
Servers joining from behind NATs/firewalls usually take several minutes to connect to a libp2p relay before they become reachable from the outside Internet. Moreover, requests to such servers are slower and more likely to fail (e.g., if the server switches relays at that moment). If such servers host certain DHT keys, the swarm may occasionally lose read/write access to these keys, which results in:
- Clients being unable to find any servers hosting a certain block.
- All servers starting rebalancing to the same place to close the alleged "gap" in the swarm.
This PR modifies servers so that DHT keys are only hosted on **directly reachable** servers (the ones that aren't behind a NAT/firewall). This way, the DHT becomes more stable and works faster. Of course, the servers behind NATs/firewalls still accept requests for running inference/forward/backward on the blocks they hold (it's more acceptable for this kind of request to be slower or to fail).
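A hedged sketch of the idea (`check_direct_reachability()` is a hypothetical stand-in for the actual probe; a DHT node in client mode stores no keys):

```python
import hivemind
from petals.constants import PUBLIC_INITIAL_PEERS

def check_direct_reachability() -> bool:
    # Hypothetical probe: in reality, the server would ask peers to dial it back
    return False

dht = hivemind.DHT(
    initial_peers=PUBLIC_INITIAL_PEERS,
    client_mode=not check_direct_reachability(),  # unreachable servers host no DHT keys
    start=True,
)
```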
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
This PR makes servers report their free cache (measured in tokens * layers to make it compression-agnostic).
To be used when calling `make_sequence(optimize="inference")`.
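A small sketch of why this unit is compression-agnostic (illustrative names):

```python
# A server reports free cache in token * layer units; a client then converts
# that into tokens for the particular span of blocks it wants to use.
def server_cache_tokens_left(free_bytes: int, bytes_per_token_per_layer: int) -> int:
    # Dividing by the per-token-per-layer footprint cancels out whatever
    # dtype/compression the server uses for its attention cache.
    return free_bytes // bytes_per_token_per_layer

def tokens_for_span(cache_tokens_left: int, span_num_layers: int) -> int:
    return cache_tokens_left // span_num_layers
```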