Commit Graph

337 Commits

Author SHA1 Message Date
Just Heuristic
4748c03ac5 readme 2023-01-19 16:37:03 +03:00
Just Heuristic
c2e3c13241 benchmark 2023-01-19 16:32:30 +03:00
Alexander Borzunov
fa5ac6e3b4
Mention BLOOMZ in readme (#221) 2023-01-18 03:23:21 +04:00
Alexander Borzunov
e651d73f11
Add one more link to the "Getting started" tutorial (#218)
Some people miss the "Try now in Colab" link or don't understand that it leads to the comprehensive tutorial, so I added one more explicit link.
2023-01-16 04:35:06 +04:00
Alexander Borzunov
af3da5bb04
Choose --num_blocks automatically for all models (#217) 2023-01-16 01:53:09 +04:00
Alexander Borzunov
cea83d3356
Bump version to 1.1.1 (#214) 2023-01-14 00:34:46 +04:00
Alexander Borzunov
702bb5a2c2
CI: Update deprecated actions, don't measure network RPS (#215)
* CI: Switch to actions/cache@v3 (v2 is deprecated)
* Don't run measure_network_rps() in tests since it doesn't work well in CI
2023-01-13 20:16:31 +04:00
Alexander Borzunov
825f5dbf2d
CI: Convert model only when convert_model.py or setup.cfg change (#213)
This roughly halves the test running time, unless convert_model.py or setup.cfg are changed.
2023-01-13 19:53:57 +04:00
Alexander Borzunov
5ff250bee9
Improve errors in case of missing blocks, suggest to join your own server (#212) 2023-01-13 17:53:00 +04:00
Alexander Borzunov
6ba63c6cc8
Fix output shape when resuming generation (#211)
Before this PR, `model.generate()` returned one excess token when resuming generation with an existing session (namely, the last token of the previous session, `session.last_token_id`). This behavior is unexpected and inconvenient for downstream apps, so this PR changes it before it's too late.
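For illustration, a minimal sketch of the intended behavior (the model/tokenizer names and the session-based `generate()` call follow the Petals examples of this period and may differ slightly between versions):

```python
# Hedged sketch of the expected post-fix behavior; exact argument names may vary.
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM

model_name = "bigscience/bloom-petals"
tokenizer = BloomTokenizerFast.from_pretrained(model_name)
model = DistributedBloomForCausalLM.from_pretrained(model_name)

prefix = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
with model.inference_session(max_length=32) as session:
    first = model.generate(prefix, max_new_tokens=3, session=session)
    # Resume generation in the same session: after this PR, the output no longer
    # starts with a duplicate of session.last_token_id from the previous call.
    second = model.generate(None, max_new_tokens=3, session=session)
    assert second.shape[1] == 3  # exactly the newly generated tokens
```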
2023-01-13 16:27:10 +04:00
Alexander Borzunov
cc5e5d32c0
Don't switch blocks if it makes swarm disjoint (#210)
Even if the swarm seems to have at least 2 servers for each block, turning off one of the servers could break it. That's because once a server is turned off, others may move to a better position, causing significant downtime along the way. This PR prohibits switching blocks if it would make the swarm disjoint along the way.
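Roughly, the connectivity check can be thought of as follows (an illustrative sketch with hypothetical names, not the actual server code):

```python
def switching_keeps_swarm_connected(spans, my_span, num_blocks):
    """Return True if every block stays covered by at least one *other* server
    while this server is offline during the switch (illustrative sketch).

    spans: {peer_id: (start, end)} half-open block ranges currently served.
    my_span: (start, end) of the server considering the switch."""
    coverage = [0] * num_blocks
    for start, end in spans.values():
        for block in range(start, end):
            coverage[block] += 1
    # Subtract our own contribution: these blocks lose a server while we move.
    for block in range(my_span[0], my_span[1]):
        coverage[block] -= 1
    return all(count >= 1 for count in coverage)

# A server only considers moving to a "better" position if this check passes;
# otherwise it stays put to avoid making the swarm disjoint.
```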
2023-01-13 08:45:53 +04:00
Alexander Borzunov
6b12b0d050
Report server version and dht.client_mode in rpc_info(), check for updates on startup (#209)
This PR:

1. Shows the current Petals version and checks for updates on startup.
2. Reports the current version and DHT mode in `rpc_info()`, so it can be shown on http://health.petals.ml or used by clients for efficient routing.
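Illustratively, a client could consume the new fields like this (field names here are assumptions, not the exact wire format):

```python
# Sketch: what a client might see in a server's rpc_info() payload and how it
# could warn about an outdated server. Field names are illustrative.
from packaging import version

info = {
    "version": "1.1.1",        # Petals version reported by the server
    "dht_client_mode": False,  # whether the server's DHT peer runs in client mode
    # ... existing throughput / cache fields ...
}

if version.parse(info["version"]) < version.parse("1.1.0"):
    print("Warning: this server runs an outdated Petals version")
```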
2023-01-13 07:46:10 +04:00
justheuristic
771ca590e7
Add service checking direct reachability from peers (#195)
Servers joining from behind NATs/firewalls usually take several minutes to join a libp2p relay before they become accessible from the outside Internet. Moreover, requests to such servers are slower and more likely to fail (e.g., if the server switches relays at that moment). If such servers host certain DHT keys, the swarm may occasionally lose read/write access to these keys, which results in:

- Clients being unable to find any servers hosting a certain block.
- All servers starting rebalancing to the same place to close the alleged "gap" in the swarm.

This PR modifies servers so that DHT keys are only hosted on **directly reachable** servers (the ones that aren't behind a NAT/firewall). This way, the DHT becomes more stable and works faster. Of course, the servers behind NATs/firewalls still accept requests for running inference/forward/backward for the blocks they hold (it's more acceptable for this kind of request to be slower or to fail).
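Conceptually, the announcing logic becomes something like this sketch (function names and the storage call are hypothetical; the real code uses hivemind's DHT helpers):

```python
import time

async def maybe_announce_blocks(dht, block_uids, server_info, reachable: bool, ttl: float = 60):
    """Illustrative sketch: only announce block keys if this server is directly reachable."""
    if not reachable:
        # Servers behind NAT/firewall skip hosting DHT records; they still serve
        # inference/forward/backward requests for the blocks they hold.
        return
    for uid in block_uids:
        # Hypothetical storage call, shown only to convey the idea.
        await dht.store(key=uid, value=server_info, expiration_time=time.time() + ttl)
```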

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
2023-01-13 03:05:39 +04:00
justheuristic
5f58f00649
Return available cache size in rpc_info() (#191)
This PR makes servers return their free cache size (measured in tokens × layers to make it compression-agnostic)

To be used when calling make_sequence(optimize="inference")
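The unit conversion works out roughly like this (a back-of-the-envelope sketch with example numbers, not the server's actual accounting):

```python
# Sketch: converting free attention-cache bytes into "token-layers". The byte
# cost per token depends on compression, so reporting token-layers keeps the
# number meaningful for clients regardless of how the server stores its cache.
hidden_size = 14336          # e.g., BLOOM-176B hidden size
bytes_per_value = 2          # fp16; a compressed cache would change this factor
bytes_per_token_per_layer = 2 * hidden_size * bytes_per_value  # keys + values

free_cache_bytes = 8 * 1024**3   # suppose 8 GiB of cache is currently free
free_token_layers = free_cache_bytes // bytes_per_token_per_layer
print(free_token_layers)  # the kind of number rpc_info() would report
```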
2023-01-12 06:49:41 +03:00
Alexander Borzunov
37373a66c3 Update Anaconda installation commands (#205) 2023-01-11 21:45:54 +00:00
justheuristic
012f840f7e
Use length-weighted sampling in routing for inference (#204)
This pull request implements a simple routing optimization that is (1) greedy and (2) latency-agnostic, and should speed up both of our use cases.

Why this exists: our effort to merge full routing (ping-aware, throughput-aware, Dijkstra-based) is in a sorry state, spread across several branches; merging it into main would take many days.
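The sampling itself can be sketched like this (simplified; names are illustrative):

```python
import random

def pick_next_server(current_block: int, servers: dict):
    """servers: {peer_id: (start, end)} spans that include current_block.
    Length-weighted sampling sketch: the probability of picking a server is
    proportional to how many blocks it can serve starting from current_block,
    which tends to produce routes with fewer hops."""
    peers = list(servers)
    weights = [servers[peer][1] - current_block for peer in peers]
    return random.choices(peers, weights=weights, k=1)[0]

# Greedy route construction (sketch): after each hop, jump to the end of the
# chosen span and repeat until all blocks are covered.
```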

Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>
2023-01-11 23:26:09 +03:00
Alexander Borzunov
42d1bbb568
Fix --no_auto_relay help (#199) 2023-01-11 22:27:14 +04:00
justheuristic
c2cb6d19ae
Increase tolerances in test_tp_block (#196)
De-flake the tests by increasing tolerances
2023-01-11 17:54:24 +03:00
Alexander Borzunov
b4f3224cda
Make client ignore blacklist if all servers holding a block are blacklisted (#197)
If all servers holding a certain block are blacklisted, we should display errors from them instead of raising `No peers holding blocks`.

Indeed, if the error is client-caused, the client should learn its reason from the latest error messages. In turn, if the error is server/network-caused and we only have a few servers, we'd better know the error instead of banning all the servers and making the user think that no servers are available.
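In sketch form, the client-side fallback is roughly:

```python
def usable_servers(servers, blacklist):
    """Illustrative sketch: fall back to blacklisted servers if nothing else remains,
    so the user sees the real error instead of 'No peers holding blocks'."""
    alive = [peer for peer in servers if peer not in blacklist]
    return alive if alive else list(servers)
```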
2023-01-11 16:50:24 +04:00
Alexander Borzunov
127cf66bee
Ignore network RPS if we failed to measure it (#198) 2023-01-11 16:37:30 +04:00
Alexander Borzunov
487411e87e
Fix fine-tuning notebooks intros (#194)
The notebook intros were outdated and mentioned the 6B model, while the actual code already runs the 176B model. This led to confusion among our users in Discord.
2023-01-11 02:28:49 +04:00
Alexander Borzunov
82c9f93ce6
Bump version to 1.1.0 (#190) 2023-01-10 15:47:58 +04:00
Alexander Borzunov
a617ce3cfa
Fix psutil-related AccessDenied crash, disable --load_in_8bit by default in case of TP (#188)
* Don't count open fds since it leads to AccessDenied crashes on some machines
* Use --load_in_8bit=False by default in case of tensor parallelism
* Install petals from PyPI in fine-tuning tutorials
2023-01-10 13:04:52 +04:00
Egiazarian Vage
93bed7da5a
Support libp2p relays for NAT traversal (#186)
- Added relay options to servers
- Enabled relay options by default
- Changed hivemind version to 1.1.5
- Moved reachability check to be performed after blocks are loaded

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
2023-01-09 20:41:23 +04:00
Alexander Borzunov
16b69d6050
Fix GiBs in the "insufficient disk space" message (#187) 2023-01-09 09:54:44 +04:00
Alexander Borzunov
391c855208
Add readme subsections (#185) 2023-01-08 09:59:07 +04:00
Alexander Borzunov
f344c7801b
Add link to health.petals.ml to readme (#184) 2023-01-08 08:41:47 +04:00
Alexander Borzunov
27406a9377
Add more links to BLOOM to readme (#183) 2023-01-07 10:48:51 +04:00
Alexander Borzunov
0f6464103d
Remove protobuf from requirements (#182)
A correct protobuf version should be already installed by hivemind.

This also resolves version conflict on Colab, where protobuf versions required by Petals were different from the ones required by pre-installed tensorflow and tensorboard packages.

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
2023-01-07 01:55:40 +04:00
Alexander Borzunov
712f5a330f
Remove backup bootstrap peer 2023-01-06 10:51:12 +04:00
justheuristic
d1fa5eb260
hotfix: add initial peer that did not crash :) (#181)
add hotfix initial peer (@borzunov's peers are down)
2023-01-06 04:57:49 +03:00
Alexander Borzunov
6dd9a938bd
Import bitsandbytes only if it's going to be used (#180) 2023-01-05 10:34:52 +04:00
Alexander Borzunov
e27706358c
Use slightly less memory in .generate() (#177) 2023-01-05 09:34:03 +04:00
Alexander Borzunov
55698381d0
Disable chunked_forward() on AVX512 CPUs (#179) 2023-01-04 23:28:16 +04:00
Alexander Borzunov
6948a0c5ee
Allow to disable chunked forward (#176) 2023-01-04 06:42:03 +04:00
Alexander Borzunov
356e099c3d
Make Docker command more visible (#175) 2023-01-04 02:18:45 +04:00
justheuristic
ae9e71fe8e
Add local tensor-parallel fwd/bwd (#143)
This pull request adds an option to run Petals server on multiple local GPUs. It uses https://github.com/BlackSamorez/tensor_parallel
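For reference, the underlying library is typically used like the following sketch (based on the tensor_parallel project's README; Petals wires it into the server internally and shards individual blocks, so actual usage differs):

```python
# Minimal sketch of the tensor_parallel library this PR builds on.
import torch
import tensor_parallel as tp
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m", torch_dtype=torch.float16)
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])  # split weights across two local GPUs

inputs = torch.randint(0, 1000, (1, 8), device="cuda:0")
with torch.no_grad():
    logits = model(inputs).logits  # forward pass runs on both GPUs
```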

- 8bit approximation error same as in main (mean~=2% q0.9~=5%)
    - TP=1, 2, 3 (see screenshots above)
- forward, grad w.r.t. input and inference exact match with main with TP=1
- `>=`80% GPU utilization with 3x 1080ti, batch = 8 tokens
- throughput measured with and without TP
- TP on 1080Tis has near-linear speedup comparable to the benchmarks (see first message)


Co-authored-by: Iaroslav Lisniak <yalisnyak@nes.ru>
Co-authored-by: Andrei Panferov <andrei@blacksamorez.ru>
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
2023-01-03 18:35:51 +03:00
Alexander Borzunov
779959bc70
Add link to PyPI (#173) 2022-12-31 02:51:52 +04:00
Alexander Borzunov
cdc3b6a25a
Add PyPI badge, update instructions and links in readme (#172) 2022-12-31 02:22:40 +04:00
Aleksandr Borzunov
ff8ade8d3b Bump version to 1.0.0 2022-12-30 21:52:57 +00:00
justheuristic
4014442a0f
Fix instruction for developers (#170) 2022-12-30 23:42:07 +04:00
Alexander Borzunov
26e6120288
Fix code example in readme (#169)
Makes it closer to runnable code, except for imports and defining tokenizer & data loader.
2022-12-30 19:50:38 +04:00
Alexander Borzunov
0b0277ed6f
Add link to chat.petals.ml (#168) 2022-12-30 07:58:33 +04:00
Vadim Peretokin
50fb8205de
Correct grammar in readme (#166) 2022-12-29 16:46:23 +03:00
Alexander Borzunov
714da529e6
Update wording in readme (#165) 2022-12-27 02:40:36 +04:00
Alexander Borzunov
9997ada3bb
Shield alloc & free from cancellation (#163)
A handler's RPC code may be cancelled due to a request timeout or a client closing the connection. Before this PR:

- If `.cancel()` happens while waiting for `hivemind.utils.enter_asynchronously()`, the lock will never be released.
- If `.cancel()` happens after memory has been allocated but before it is freed, the memory will never be freed.

This PR fixes it by deferring the cancellation with [asyncio.shield()](https://docs.python.org/3/library/asyncio-task.html#asyncio.shield). Now, the cancellation will happen only when all locks are released and alloc/free has completed.
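The pattern is standard asyncio; schematically (a simplified sketch with hypothetical backend methods, not the actual handler code):

```python
import asyncio

async def handle_request(backend, alloc_size):
    # Shield allocation/free so that a cancelled RPC (timeout, closed connection)
    # cannot interrupt them mid-way and leak memory or leave a lock held.
    handle = await asyncio.shield(backend.allocate_cache(alloc_size))  # hypothetical API
    try:
        ...  # run inference using the allocated cache
    finally:
        await asyncio.shield(backend.free_cache(handle))  # hypothetical API
```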
2022-12-22 20:05:57 +04:00
Alexander Borzunov
d6992fca63
Hot fix: Increase hivemind.P2P's startup_timeout for Colab, remove absent initial peer (#162) 2022-12-19 19:36:20 +04:00
Artem Chumachenko
0a6b5f31aa
Fix typos in the example notebooks (#161)
Fix typos
2022-12-16 19:42:18 +04:00
Alexander Borzunov
7cdc57a04b
Alloc inference cache as one contiguous buffer (#160) 2022-12-16 10:38:20 +04:00
Alexander Borzunov
523a7cad33
Fix issues related to petals as a module (#159)
1. Added `from petals.client import *` to `petals/__init__.py`, so you can now write just:

    ```python
    from petals import DistributedBloomForCausalLM
    ```

    I didn't do the same with the server, since its classes are supposed to be used by `petals.cli.run_server`, not end users. Though it's still possible to do `from petals.server.smth import smth` if necessary.

2. Fixed one more logging issue: log lines from hivemind were shown twice due to a bug in #156.

3. Removed unused `runtime.py`, since the server actually uses `hivemind.moe.Runtime`, and `runtime.py` has no significant changes compared to it.
2022-12-16 09:09:06 +04:00