Commit Graph

541 Commits

Alexander Borzunov
fd19c21859
Update --update_period and --expiration defaults (#410) 2023-07-23 17:22:04 +04:00
Alexander Borzunov
ffb20b585c
Update commands for hosting Llama 2 in readme (#409) 2023-07-23 13:08:07 +04:00
Alexander Borzunov
48c6b6d963
Update README.md (#407) 2023-07-23 00:41:41 +04:00
Alexander Borzunov
c153cba1fa
Add Llama 2, WSL instructions to readme (#406) 2023-07-23 00:35:19 +04:00
justheuristic
5af04524dd
Split long sequences into chunks (#403)
This PR is designed to avoid OOMs when processing long sequences; such OOMs happen due to the huge attention logit matrices.

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
2023-07-22 23:10:46 +04:00
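A minimal sketch of the chunking idea, assuming a hypothetical `forward_chunk` callable that runs one pass over a slice of the sequence (the names and chunk size are illustrative, not the actual Petals API):

```python
import torch

def forward_in_chunks(forward_chunk, hidden_states: torch.Tensor, max_chunk_tokens: int = 1024):
    # Process the sequence in slices so the attention logits materialized by any
    # single call stay bounded, avoiding OOM on very long inputs.
    outputs = []
    for start in range(0, hidden_states.shape[1], max_chunk_tokens):
        outputs.append(forward_chunk(hidden_states[:, start : start + max_chunk_tokens]))
    return torch.cat(outputs, dim=1)
```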
Alexander Borzunov
30b94ef18b
If speedtest fails, assume network speed of 100 Mbit/s (#404)
The value is chosen as a safe estimate below the average speed reported at https://health.petals.dev/.

Note that if a server uses relays, the effective throughput will be further divided by 2 (see #399).
2023-07-22 18:49:37 +04:00
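A hedged sketch of the fallback and the relay penalty described above (constants mirror the commit message; function names are illustrative):

```python
DEFAULT_SPEED_MBPS = 100  # a safe value below the average at https://health.petals.dev/

def measure_network_speed(run_speedtest):
    # Fall back to the default if the speedtest raises or times out.
    try:
        return run_speedtest()
    except Exception:
        return DEFAULT_SPEED_MBPS

def effective_throughput(speed_mbps, uses_relay):
    # Relayed traffic passes through an intermediary, halving throughput (see #399).
    return speed_mbps / 2 if uses_relay else speed_mbps
```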
Alexander Borzunov
8666653cf5
Fix routing through relay, default network RPS, --token, logging, readme (#399)
* Hide GeneratorExit in _iterate_inference_steps()
* Update README.md about `--public_name`
* Use `.from_pretrained(..., use_auth_token=token)` instead of `token=token` until the latter is fully supported across HF libs (see the example below)
* Use default network speed 25 Mbit/s
* Apply relay penalty in max-throughput routing
* Replace RPS with "tokens/sec per block" in logs
* Increase default expiration
2023-07-22 18:27:58 +04:00
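For reference, the token-passing workaround from the list above uses the standard `transformers` API (the model name and token are placeholders):

```python
from transformers import AutoTokenizer

token = "hf_..."  # your Hugging Face access token
# Pass use_auth_token=... until token=... is fully supported across HF libs:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf", use_auth_token=token)
```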
Alexander Borzunov
eb0664b993
Support Python 3.11 (#393) 2023-07-22 13:07:43 +04:00
Alexander Borzunov
6e4ebb94d2
Fix deadlocks in MemoryCache (#396)
- Fix deadlocks in MemoryCache
- Set default --alloc_timeout to 1 until the MemoryCache update
2023-07-21 11:09:24 +04:00
Alexander Borzunov
b6b3ae964f
Fix --attn_cache_tokens default (#392) 2023-07-20 23:20:15 +04:00
Alexander Borzunov
d49d9ad0cf
Bump version to 2.0.0.post3 (#391) 2023-07-20 21:07:00 +04:00
justheuristic
e51e84631d
Update to petals.dev (#390)
Since the `petals.ml` DNS record is still unavailable, we're switching everything to https://petals.dev

Co-authored-by: Aleksandr Borzunov <hxrussia@gmail.com>
2023-07-20 20:59:28 +04:00
Aleksandr Borzunov
ddcda02b06
Hardcode IPs until DNS issues get resolved 2023-07-20 08:55:35 +00:00
Alexander Borzunov
b1ff8bdd6c
Bump version to 2.0.0.post1 (#384) 2023-07-19 21:13:24 +04:00
Alexander Borzunov
e9a20e7e53
Require accelerate>=0.20.3 as transformers does (#383) 2023-07-19 20:28:23 +04:00
Alexander Borzunov
057a2fb5de
Support Llama 2 (#379) 2023-07-19 19:15:53 +04:00
Alexander Borzunov
3218534745
Fix --token arg (#378) 2023-07-19 15:25:34 +04:00
justheuristic
398a384075
Inherit bitsandbytes compute dtype correctly (override peft quirk) (#377) 2023-07-19 14:08:52 +04:00
justheuristic
5a8de2f1f8
Fix handler memory leak, get rid of mp.Manager (#373)
This PR removes a memory leak in handler.py that was related to mp.SyncManager.
2023-07-19 13:31:47 +04:00
Alexander Borzunov
895327a0ae
Fix readme code example, require Python < 3.11 until supported (#374)
* Fix readme code example

* Require Python < 3.11 until it's supported
2023-07-19 12:45:14 +04:00
Alexander Borzunov
c735dd7ba3
Update transformers to 4.31.0 and peft to 0.4.0 (#371) 2023-07-19 05:15:30 +04:00
justheuristic
1ab35c2826
Typo in inference_session.py 2023-07-19 02:22:40 +03:00
Alexander Borzunov
a6fdfc0556
Fix AssertionError on rebalancing (#370) 2023-07-19 03:22:19 +04:00
Alexander Borzunov
f97582fb5f
Require transformers < 4.31.0 until we're compatible (#369) 2023-07-19 02:35:47 +04:00
Alexander Borzunov
3b300c32e4
Update readme to show new models (#365) 2023-07-18 19:57:39 +04:00
Alexander Borzunov
62d9ed5ce7
Implement shortest-path routing for inference (#362)
This PR:

1. **Adds shortest path routing for inference.** We build a graph with client-server and server-server latencies and compute costs, as well as empirically measured overheads. For client-server latencies, we ping possible first and last servers in a sequence in `SequenceManager.update()`. We penalize servers that may not have enough cache for our request. This uses info added to the DHT in #355, #356, #358 (a simplified sketch follows this entry).

2. **Makes a server ping neighboring servers in addition to next ones.** This gives us an opportunity to change the server even before we use all its blocks (e.g., because a neighboring server is faster). This feature is not enabled yet, since it increases the graph size for N servers to O(N^2), but we may enable it if needed.

3. **Fixes a `SequenceManager` bug with the first `update()`.** Previously, this update was likely to produce incorrect information and cause `MissingBlocksError`s until the next update happened.
2023-07-18 08:46:36 +04:00
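For illustration, the routing idea in item 1 reduces to a shortest-path search over a graph whose edge weights combine latency and compute cost. A minimal sketch with Dijkstra's algorithm (server names and costs are placeholders, not the actual `SequenceManager` logic):

```python
import heapq

def shortest_route(edges, start, goal):
    # Dijkstra over a graph whose edge weights combine latency and compute cost.
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in edges.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return []

edges = {
    "client": {"server_a": 0.05, "server_b": 0.02},  # measured client-server latencies
    "server_a": {"exit": 0.01},                      # per-hop latency + compute cost
    "server_b": {"exit": 0.08},
}
print(shortest_route(edges, "client", "exit"))  # ['client', 'server_a', 'exit']
```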
Ikko Eltociear Ashimine
fd30f7ce10
Fix typo in generation_algorithms.py (#364) 2023-07-18 05:44:41 +04:00
Alexander Borzunov
11f0d992d7
Report inference, forward, and network RPS separately (#358)
Inference RPS may be very different from forward RPS. E.g., currently bnb uses a completely different algorithm for NF4 inference. We report detailed RPS info that can then be used for shortest-path routing for inference.
2023-07-17 13:45:59 +04:00
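A hypothetical shape of the per-server throughput record (field names are illustrative, not the exact DHT schema):

```python
from dataclasses import dataclass

@dataclass
class ServerThroughputInfo:
    forward_rps: float    # forward/backward passes per second per block
    inference_rps: float  # autoregressive steps per second per block (NF4 path differs)
    network_rps: float    # cap implied by the measured network speed
```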
Alexander Borzunov
9517dd1e3d
Update readme and "Getting started" link (#360)
This updates the readme with the latest changes and fixes an old Colab link, as pointed out in #359.
2023-07-17 05:02:08 +04:00
Alexander Borzunov
3f733a96e3
Use bitsandbytes 0.40.1.post1 (#357) 2023-07-16 03:07:21 +04:00
Alexander Borzunov
81c4a45ca2
Make a server ping next servers (#356)
This PR makes a server ping potential next servers in a chain and report the RTTs to the DHT. This will be used for shortest-path routing.
2023-07-15 20:16:21 +04:00
Alexander Borzunov
2c8959e713
Share more info about a server in DHT (#355) 2023-07-15 03:36:31 +04:00
justheuristic
37fdcb3fe0
Switch adapters slightly faster (#353)
Currently, each `TransformerBackend.inference_step` looks for adapters and sets the correct adapter type for each block. This is not very expensive, but it can measurably affect inference time.

This pull request uses faster adapter switching with just one variable assignment, without iterating over block.modules().
2023-07-14 23:04:55 +04:00
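The before/after idea, sketched with illustrative names (not the actual `TransformerBackend` code):

```python
# Before: every inference step walks all submodules to activate the adapter.
def set_adapter_slow(block, adapter_name):
    for module in block.modules():
        if hasattr(module, "active_adapter"):
            module.active_adapter = adapter_name

# After: submodules read one shared variable, so switching is a single assignment.
class AdapterState:
    def __init__(self):
        self.active_adapter = None

    def switch(self, adapter_name):
        self.active_adapter = adapter_name  # no module traversal
```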
Alexander Borzunov
9703358df0
Fix bugs in _choose_num_blocks() added in #346 (#354) 2023-07-14 22:33:48 +04:00
Alexander Borzunov
1a78638c02
Test that bitsandbytes is not imported when it's not used (#351)
We avoid importing bitsandbytes when it's not used, since bitsandbytes doesn't always find the correct CUDA libs and may raise exceptions as a result.
2023-07-14 18:40:47 +04:00
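A test in this spirit can run the import in a fresh interpreter and inspect `sys.modules` (a sketch, not necessarily the repository's actual test):

```python
import subprocess
import sys

def test_bitsandbytes_not_imported():
    code = "import petals, sys; assert 'bitsandbytes' not in sys.modules"
    subprocess.check_call([sys.executable, "-c", code])
```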
justheuristic
c511990236
Remove unused import os (#352) 2023-07-14 18:05:21 +04:00
Alexander Borzunov
e12d4c666b
Spam less in server logs (#350) 2023-07-14 02:52:52 +04:00
justheuristic
010857a834
Estimate adapter memory overhead in choose_num_blocks() (#346)
* Estimate adapter memory overhead
* Reduce the number of blocks based on that

---------

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
2023-07-14 01:03:42 +03:00
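The estimate can be as simple as adding the per-block adapter overhead to the block size before dividing the memory budget (illustrative numbers and names, not the actual `choose_num_blocks()` code):

```python
def choose_num_blocks(free_memory_bytes, block_size_bytes, adapter_overhead_bytes):
    # Each hosted block must fit together with its share of adapter weights.
    return free_memory_bytes // (block_size_bytes + adapter_overhead_bytes)

print(choose_num_blocks(40 * 2**30, 3 * 2**30, 256 * 2**20))  # -> 12
```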
Alexander Borzunov
f605f093f7
Support LLaMA repos without "-hf" suffix (#349) 2023-07-14 00:43:28 +04:00
Alexander Borzunov
90fbaab61e
Fix Docker build by avoiding Python 3.11 (#348)
We want to use `3.10.x` since `grpcio-tools` is not compatible with 3.11 yet. However, `python~=3.10` meant `python>=3.10, python<4.0`, so we ended up with a broken build due to Python 3.11 being installed.
2023-07-13 19:34:17 +04:00
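The specifier semantics can be verified with the `packaging` library (an illustration of the pitfall, not the repository's fix):

```python
from packaging.specifiers import SpecifierSet

# `~=3.10` pins only the major version, so Python 3.11 slips through;
# `~=3.10.0` pins the minor version as intended.
assert "3.11.0" in SpecifierSet("~=3.10")        # >=3.10, <4.0
assert "3.11.0" not in SpecifierSet("~=3.10.0")  # >=3.10.0, <3.11
```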
Alexander Borzunov
43acfe52a7
Import petals.utils.peft only when needed to avoid unnecessary import of bitsandbytes (#345)
The motivation is the same as in #180.
2023-07-12 23:15:16 +04:00
Alexander Borzunov
294970fe18
Update Colab link 2023-07-12 17:00:15 +04:00
Alexander Borzunov
515a5120cb
Mention LLaMA in readme (#344) 2023-07-12 16:58:58 +04:00
Max Ryabinin
13f4e3a88a
Fix convergence issues and switch to LLaMA in the SST-2 example (#343)
* Fix convergence issues and switch to LLaMA in the SST-2 example
2023-07-12 15:50:54 +03:00
Artem Chumachenko
b9f0a5467f
Support peft LoRA adapters (#335)
Implement an option to deploy PEFT adapters to a server. Clients can set `active_adapter=...` to use these adapters (see the usage example below).

---------

Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: justheuristic <justheuristic@gmail.com>
2023-07-12 15:22:28 +03:00
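Client-side usage as described, assuming the `AutoDistributedModelForCausalLM` entry point; the model and adapter repos are placeholders:

```python
from petals import AutoDistributedModelForCausalLM

model = AutoDistributedModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    active_adapter="my-org/my-lora-adapter",  # a LoRA repo served by participating servers
)
```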
Alexander Borzunov
dfc6578c8e
Use bitsandbytes 0.40.0.post4 with bias hotfix (#342)
This PR includes a bnb hotfix: 90b0ac57b0
2023-07-12 15:29:59 +04:00
Alexander Borzunov
b28f5016ea
Delete deprecated petals.cli scripts (#336) 2023-07-11 21:42:35 +04:00
Alexander Borzunov
fa095f6461
Use 4-bit for llama by default, use bitsandbytes 0.40.0.post3 (#340)
NF4 inference with bitsandbytes 0.40.0.post3 is ~2x faster than int8 inference, though training is still ~3x slower, see:

- [bitsandbytes 0.40.0 Release notes](https://github.com/TimDettmers/bitsandbytes/releases/tag/0.40.0)
- [RPS benchmarks](https://github.com/bigscience-workshop/petals/pull/333#issuecomment-1614040385)

We've decided to use NF4 by default for LLaMA.
2023-07-11 18:53:17 +04:00
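For reference, loading a model in NF4 through `transformers` + `bitsandbytes` looks like this (a generic example, not the Petals server code):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=nf4_config
)
```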
Alexander Borzunov
158013a671
Implement direct server-to-server communication (#331)
Implement #226.
2023-07-11 17:29:34 +04:00
Alexander Borzunov
4d9c26fe5c
Allow free_disk_space_for() remove arbitrary files from Petals cache (#339)
Before this PR, `free_disk_space_for()` was able to remove **(a)** only entire cached revisions (= git commits/branches) and **(b)** only from the repository we're loading right now.

This PR allows this function to remove arbitrary files individually, from any repository.

This is useful for the transition to Petals 1.2.0+, since it now uses original repos instead of the ones with converted models (see #323). In particular, the cache for `bigscience/bloom-petals` is now deprecated and should be removed in favor of `bigscience/bloom`. This is also useful as a way to free space before loading LoRA adapters (#335).
2023-07-05 14:57:59 +04:00
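A sketch of what file-level cleanup can look like with `huggingface_hub`'s cache scanner (illustrative, not the exact implementation):

```python
import os
from huggingface_hub import scan_cache_dir

def free_disk_space(bytes_needed, cache_dir=None):
    # Collect every cached file across all repos, least recently accessed first.
    cache = scan_cache_dir(cache_dir)
    files = [f for repo in cache.repos for rev in repo.revisions for f in rev.files]
    files.sort(key=lambda f: f.blob_last_accessed)
    freed = 0
    for f in files:
        if freed >= bytes_needed:
            break
        if f.blob_path.exists():  # several revisions may share one blob
            freed += f.size_on_disk
            os.remove(f.blob_path)  # remove the blob itself to actually free space
```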