Commit Graph

45 Commits

Author SHA1 Message Date
Alexander Borzunov
abd547735f
Force use_cache=True (#496) 2023-09-02 22:57:18 +04:00
Alexander Borzunov
26ebbfe8f0
Support macOS (#477)
This PR makes both clients and servers work on macOS. Specifically, it:

- Follows https://github.com/learning-at-home/hivemind/pull/586 to run a macOS-compatible `p2pd` binary (both x86-64 and ARM64 are supported)
- Fixes forking issues and tests on macOS, Python 3.10+
- Introduces basic support for serving model blocks on Apple M1/M2 GPUs (torch.mps)
- Increases the max number of open files by default (the system default is not enough on Linux and is really small on macOS)
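
For reference, a minimal sketch of how a process can raise its own soft file-descriptor limit (illustrative only; the target value and the exact logic used by Petals may differ):

```python
import resource

# Raise the soft RLIMIT_NOFILE toward the hard limit; 32768 is a hypothetical target.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 32768
new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (max(soft, new_soft), hard))
print(f"RLIMIT_NOFILE soft limit: {soft} -> {max(soft, new_soft)}")
```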
2023-08-29 07:49:27 +04:00
Alexander Borzunov
915b357740
Require transformers>=4.32.0 (#479)
This is necessary to load https://huggingface.co/petals-team/StableBeluga2, since that repo doesn't include the deprecated `inv_freq` weights.
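
For context, a client loads this repo roughly as in the project README (a sketch; `AutoDistributedModelForCausalLM` is the Petals auto class, and nothing here is specific to this PR):

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

MODEL_NAME = "petals-team/StableBeluga2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoDistributedModelForCausalLM.from_pretrained(MODEL_NAME)
```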
2023-08-25 01:37:30 +04:00
Alexander Borzunov
18e93afc73
Don't install cpufeature on non-x86_64 machines (#478)
Necessary since cpufeature crashes when installing on ARM.
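
A platform-conditional dependency like this is typically expressed with a PEP 508 environment marker; a sketch of what the declaration could look like (the exact version pin in Petals' `setup.cfg` may differ):

```python
# Hypothetical install_requires entry: only pull in cpufeature on x86-64 machines.
install_requires = [
    'cpufeature>=2.0.0; platform_machine == "x86_64"',
]
```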
2023-08-24 19:57:15 +04:00
Artem Chumachenko
a14ae7334d
Update peft to 0.5.0 version (#475)
Update peft to 0.5.0
2023-08-23 20:21:28 +04:00
justheuristic
4f850996bb
Change transformers version assert (#472) 2023-08-22 21:53:14 +04:00
justheuristic
9250025140
Support transformers 4.32.x (#471) 2023-08-22 20:10:29 +03:00
justheuristic
adda5f8c20
Temporarily require peft<0.5.0, transformers<4.32.0 (#470)
Peft 0.5 was recently released and broke some compatibilities. This PR temporarily requires Petals to use the previous stable version of peft while we work on 0.5.0 support.
2023-08-22 19:45:37 +03:00
Alexander Borzunov
593d980ad8
Use bitsandbytes 0.41.1 (#442) 2023-08-07 02:33:42 +04:00
Alexander Borzunov
f3fafd14a4
Bump version to 2.0.1 (#411) 2023-07-23 18:45:19 +04:00
Alexander Borzunov
eb0664b993
Support Python 3.11 (#393) 2023-07-22 13:07:43 +04:00
Alexander Borzunov
e9a20e7e53
Require accelerate>=0.20.3 as transformers does (#383) 2023-07-19 20:28:23 +04:00
Alexander Borzunov
895327a0ae
Fix readme code example, require Python < 3.11 until supported (#374)
* Fix readme code example

* Require Python < 3.11 until it's supported
2023-07-19 12:45:14 +04:00
Alexander Borzunov
c735dd7ba3
Update transformers to 4.31.0 and peft to 0.4.0 (#371) 2023-07-19 05:15:30 +04:00
Alexander Borzunov
f97582fb5f
Require transformers < 4.31.0 until we're compatible (#369) 2023-07-19 02:35:47 +04:00
Alexander Borzunov
62d9ed5ce7
Implement shortest-path routing for inference (#362)
This PR:

1. **Adds shortest path routing for inference.** We build a graph with client-server and server-server latencies and compute costs, as well as empirically measured overheads. For client-server latencies, we ping possible first and last servers in a sequence in `SequenceManager.update()`. We penalize servers that may not have enough cache for our request. This uses info added to DHT in #355, #356, #358. A toy sketch of the idea is shown after this list.

2. **Makes a server ping neighboring servers in addition to next ones.** This gives an opportunity to change the server even before we use all of its blocks (e.g., because a neighboring server is faster). This feature is not enabled yet, since it increases the graph size for N servers to O(N^2), but we may enable it if needed.

3. **Fixes a `SequenceManager` bug with the first `update()`.** Previously, this update was likely to produce incorrect information and cause `MissingBlocksError`s until the next update happened.
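
A minimal sketch of the routing idea with toy data (not the actual `SequenceManager` code): treat the client entry point, candidate (server, block span) pairs, and the client exit point as graph nodes, use measured latencies plus per-block compute costs as edge weights, and take the cheapest path.

```python
import heapq
from itertools import count

def cheapest_route(edges, start, goal):
    """Dijkstra over a dict {node: [(neighbor, cost), ...]}; returns (total_cost, path)."""
    order = count()  # tie-breaker so the heap never has to compare node objects
    queue = [(0.0, next(order), start, [start])]
    visited = set()
    while queue:
        cost, _, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in edges.get(node, []):
            heapq.heappush(queue, (cost + weight, next(order), neighbor, path + [neighbor]))
    return float("inf"), []

# Toy graph: edge weights combine ping latency and the compute cost of a block span.
edges = {
    "client_in": [(("server_A", "blocks 0:4"), 0.07), (("server_B", "blocks 0:8"), 0.12)],
    ("server_A", "blocks 0:4"): [(("server_C", "blocks 4:8"), 0.03)],
    ("server_B", "blocks 0:8"): [("client_out", 0.02)],
    ("server_C", "blocks 4:8"): [("client_out", 0.02)],
}
print(cheapest_route(edges, "client_in", "client_out"))  # the route via servers A and C wins
```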
2023-07-18 08:46:36 +04:00
Alexander Borzunov
3f733a96e3
Use bitsandbytes 0.40.1.post1 (#357) 2023-07-16 03:07:21 +04:00
Alexander Borzunov
2c8959e713
Share more info about a server in DHT (#355) 2023-07-15 03:36:31 +04:00
Alexander Borzunov
1a78638c02
Test that bitsandbytes is not imported when it's not used (#351)
We avoid importing bitsandbytes when it's not used, since bitsandbytes doesn't always find the correct CUDA libraries and may raise exceptions because of that.
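
A sketch of how such a test can be written (illustrative, not the repo's actual test): import `petals` in a fresh interpreter and assert that `bitsandbytes` never appears in `sys.modules`.

```python
import subprocess
import sys

def test_bitsandbytes_not_imported_eagerly():
    # Run in a separate interpreter so modules imported by other tests don't leak in.
    code = "import sys, petals; assert 'bitsandbytes' not in sys.modules"
    subprocess.check_call([sys.executable, "-c", code])
```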
2023-07-14 18:40:47 +04:00
Artem Chumachenko
b9f0a5467f
Support peft LoRA adapters (#335)
Implement an option to deploy PEFT adapters to a server. Clients can set `active_adapter=...` to use these adapters.
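
A sketch of the client-side usage (the entry-point class, the model repo, and the adapter name below are all placeholders for illustration; the exact signature may differ):

```python
from petals import AutoDistributedModelForCausalLM  # class name assumed, see project README

# Both repo names below are hypothetical examples.
model = AutoDistributedModelForCausalLM.from_pretrained(
    "bigscience/bloom-7b1-petals",
    active_adapter="username/my-lora-adapter",
)
```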

---------

Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: justheuristic <justheuristic@gmail.com>
2023-07-12 15:22:28 +03:00
Alexander Borzunov
dfc6578c8e
Use bitsandbytes 0.40.0.post4 with bias hotfix (#342)
This PR includes a bnb hotfix: 90b0ac57b0
2023-07-12 15:29:59 +04:00
Alexander Borzunov
fa095f6461
Use 4-bit for llama by default, use bitsandbytes 0.40.0.post3 (#340)
NF4 inference with bitsandbytes 0.40.0.post3 is ~2x faster than int8 inference, though training is still ~3x slower, see:

- [bitsandbytes 0.40.0 Release notes](https://github.com/TimDettmers/bitsandbytes/releases/tag/0.40.0)
- [RPS benchmarks](https://github.com/bigscience-workshop/petals/pull/333#issuecomment-1614040385)

We've decided to use NF4 by default for LLaMA.
2023-07-11 18:53:17 +04:00
Alexander Borzunov
de930918a0
Support loading blocks in 4-bit (QLoRA NF4 format, disabled by default) (#333) 2023-07-03 20:13:04 +04:00
Alexander Borzunov
66a47c763e
Require pydantic < 2.0 (2.0 is incompatible with hivemind 1.1.8) (#337)
See https://github.com/learning-at-home/hivemind/pull/573.
2023-07-02 03:32:51 +04:00
Alexander Borzunov
cb3f018f9f
Add LLaMA support (#323)
This PR:

1. **Abolishes the model conversion procedure.** Now, models are downloaded directly from original repositories like https://huggingface.co/bigscience/bloom. Servers download only shards with blocks to be hosted, and clients download only shards with input/output embeddings and layernorms.

    - BLOOM is loaded from `bigscience/bloom`, but we use the DHT prefix `bigscience/bloom-petals` for backward compatibility. Same with smaller BLOOMs and BLOOMZ.
    - LLaMA can be loaded from any repo like `username/llama-65b-hf`, but we use the DHT prefix `llama-65b-hf` (without the username) to accommodate blocks from different repos (there are a few of them with minor differences, such as `Llama` vs. `LLaMA` in the class name).

2. **Refactors the client to generalize it for multiple models.** Model-specific code now lives in the `petals.models` package (e.g. `petals.models.bloom`, `petals.models.llama`), while general code (e.g. the CPU-efficient LM head, p-tuning) is kept in `petals.client`.

3. **Introduces** `WrappedLlamaBlock`, `DistributedLlamaConfig`, `DistributedLlamaForCausalLM`, `DistributedLlamaForSequenceClassification`, and `DistributedLlamaModel` compatible with Petals functionality (p-tuning, adapters, etc.).

4. **Introduces** `AutoDistributedConfig` that automatically chooses the correct config class (`DistributedLlamaConfig` or `DistributedBloomConfig`). The refactored configs contain all model-specific info for both clients and servers.
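
A minimal sketch of using the auto class (assuming it is exported from the top-level `petals` package; the repo names are just examples):

```python
from petals import AutoDistributedConfig  # assumed top-level export

# Picks DistributedLlamaConfig or DistributedBloomConfig based on the checkpoint's model type.
llama_config = AutoDistributedConfig.from_pretrained("huggyllama/llama-65b")
bloom_config = AutoDistributedConfig.from_pretrained("bigscience/bloom")
print(type(llama_config).__name__, type(bloom_config).__name__)
```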

Upgrade instructions:

- Remove disk caches for blocks in the old (converted) format to save disk space. That is, remove the `~/.cache/petals/model--bigscience--bloom-petals` and `~/.cache/petals/model--bigscience--bloomz-petals` directories (if present).
2023-06-23 15:46:10 +04:00
Alexander Borzunov
0a313bf6c5
Update hivemind to 1.1.8, enable efficient bfloat16 encoding (#311)
This PR:

1. Updates hivemind to 1.1.8 (includes https://github.com/learning-at-home/hivemind/pull/565)
2. Enables efficient bfloat16 serialization by default (`USE_LEGACY_BFLOAT16 = False`)
3. Removes logging code that was moved into hivemind in https://github.com/learning-at-home/hivemind/pull/542
2023-05-07 14:57:05 +04:00
Alexander Borzunov
454c193863
Fix OOMs happening in case of accelerate >= 0.16.0 (#310)
- After #285, `load_pretrained_block()` uses `accelerate.utils.set_module_tensor_to_device()`
- In accelerate>=0.16.0, it saves the tensor in the dtype previously used by the model instead of the dtype of the loaded weights (https://github.com/huggingface/accelerate/pull/920)
- Because of that, blocks and attention caches used float32, which caused OOMs
- This PR makes `load_pretrained_block()` respect `torch_dtype` (default: `"auto"`, which means reading `torch_dtype` from `config.json`)
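
A rough sketch of the fix's idea (a hypothetical helper, not the actual `load_pretrained_block()` code): cast floating-point weights to the requested `torch_dtype` before handing them to accelerate.

```python
import torch
from accelerate.utils import set_module_tensor_to_device

def load_weights_into_block(block: torch.nn.Module, state_dict: dict,
                            torch_dtype: torch.dtype, device: str) -> None:
    """Hypothetical helper: place weights into `block`, forcing the requested dtype."""
    for name, tensor in state_dict.items():
        if tensor.is_floating_point():
            tensor = tensor.to(torch_dtype)  # avoid silently keeping float32 and blowing up memory
        set_module_tensor_to_device(block, name, device, value=tensor)
```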
2023-04-25 17:20:19 +04:00
Alexander Borzunov
98be9ffe4c
Relax the rest of Hugging Face dependencies (#305) 2023-04-13 01:05:35 +04:00
Alexander Borzunov
35662b4a16
Require bitsandbytes == 0.38.0.post2, hivemind == 1.1.7 (#302)
In particular, this PR fixes 8-bit support on NVIDIA 16-series GPUs (such as the GTX 1660) by including https://github.com/TimDettmers/bitsandbytes/pull/292. This support was requested multiple times on Discord.
2023-04-12 23:07:29 +04:00
Alexander Borzunov
2116df08bc
Fix deps, enable 8-bit by default for TP (#298)
This PR fixes issues of #290:

- The hivemind bfloat16 codec crashed on dummy tensors (with 0 elements), see https://github.com/learning-at-home/hivemind/pull/560 (this PR temporarily makes Petals depend on the latest hivemind version from the repo)
- The transformers version check did not match the version allowed in `setup.cfg`

Also:

- This PR enables 8-bit by default for TP. Even though TP in 8-bit may be slower, we currently prefer to host more blocks to increase the network's stability.
2023-03-29 04:21:37 +04:00
justheuristic
987f4d2b2f
Update bitsandbytes, hivemind, transformers (#290)
- new bitsandbytes supports newer *and* older GPUs
- new hivemind supports a better bfloat16 codec

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
2023-03-29 01:20:29 +04:00
Alexander Borzunov
a7d3d02194
Fix invalid author email in setup.cfg (#287) 2023-03-13 06:21:09 +04:00
Alexander Borzunov
6ba63c6cc8
Fix output shape when resuming generation (#211)
Before this PR, `model.generate()` returned one excess token when resuming generation with an existing session (the last token of the previous session, `session.last_token_id`). This is unexpected behavior that is inconvenient for downstream apps, so this PR changes it before it's too late.
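
A sketch of the affected usage pattern (the repo name is an example and the exact resuming call signature is an assumption; the session API follows the project's examples of that era):

```python
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM  # 1.x-era client class

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-petals")
model = DistributedBloomForCausalLM.from_pretrained("bigscience/bloom-petals")

prompt = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
with model.inference_session(max_length=32) as session:
    part1 = model.generate(prompt, max_new_tokens=4)  # first chunk
    part2 = model.generate(None, max_new_tokens=4)    # resume within the same session
    # Before this fix, part2 began with an extra copy of session.last_token_id.
```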
2023-01-13 16:27:10 +04:00
Alexander Borzunov
6b12b0d050
Report server version and dht.client_mode in rpc_info(), check for updates on startup (#209)
This PR:

1. Shows the current Petals version and checks for updates on startup.
2. Reports the current version and DHT mode in `rpc_info()`, so it can be shown on http://health.petals.ml or used on clients for efficient routing.
2023-01-13 07:46:10 +04:00
Alexander Borzunov
82c9f93ce6
Bump version to 1.1.0 (#190) 2023-01-10 15:47:58 +04:00
Egiazarian Vage
93bed7da5a
Support libp2p relays for NAT traversal (#186)
- Added relay options to servers
- Enabled relay options by default
- Changed hivemind version to 1.1.5
- Moved reachability check to be performed after blocks are loaded

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
2023-01-09 20:41:23 +04:00
Alexander Borzunov
0f6464103d
Remove protobuf from requirements (#182)
A correct protobuf version should already be installed by hivemind.

This also resolves a version conflict on Colab, where the protobuf versions required by Petals differed from the ones required by the pre-installed tensorflow and tensorboard packages.

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
2023-01-07 01:55:40 +04:00
Alexander Borzunov
55698381d0
Disable chunked_forward() on AVX512 CPUs (#179) 2023-01-04 23:28:16 +04:00
justheuristic
ae9e71fe8e
Add local tensor-parallel fwd/bwd (#143)
This pull request adds an option to run a Petals server on multiple local GPUs. It uses https://github.com/BlackSamorez/tensor_parallel
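
For reference, a minimal sketch of the underlying library's standalone usage (not the Petals integration code; the model name is an example):

```python
import torch
from transformers import AutoModelForCausalLM
import tensor_parallel as tp

# Load an example model, then shard its layers across two local GPUs.
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m", torch_dtype=torch.float16)
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])
```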

- 8bit approximation error same as in main (mean~=2% q0.9~=5%)
    - TP=1, 2, 3 (see screenshots above)
- forward, grad w.r.t. input and inference exact match with main with TP=1
- `>=`80% GPU utilization with 3x 1080ti, batch = 8 tokens
- throughput measured with and without TP
- TP on 1080Tis has near-linear speedup comparable to the benchmarks (see first message)


Co-authored-by: Iaroslav Lisniak <yalisnyak@nes.ru>
Co-authored-by: Andrei Panferov <andrei@blacksamorez.ru>
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
2023-01-03 18:35:51 +03:00
Aleksandr Borzunov
ff8ade8d3b
Bump version to 1.0.0 2022-12-30 21:52:57 +00:00
justheuristic
91898c3c90
Switch to speedtest-cli (#157)
This pull request removes the custom speed_test code in favour of the speedtest-cli module.
This is necessary to ensure that random warnings / print-outs do not mess with our outputs.
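
A minimal sketch of measuring throughput with that module (how Petals aggregates and reports the numbers may differ):

```python
import speedtest

st = speedtest.Speedtest()
st.get_best_server()
download_bps = st.download()  # bits per second
upload_bps = st.upload()
print(f"down: {download_bps / 1e6:.1f} Mbit/s, up: {upload_bps / 1e6:.1f} Mbit/s")
```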

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
2022-12-15 15:21:33 +03:00
justheuristic
b04982c1a2
Bump transformers to 4.25.1 (#151)
- latest accelerate, transformers, huggingface_hub
- rearrange attention caches to support https://github.com/huggingface/transformers/pull/18344
- remove unused code
- fix an edge case where a session crashes when receiving sequence length 0
- assert the transformers version when importing WrappedBloomBlock

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
2022-12-13 11:03:49 +03:00
Alexander Borzunov
b8e1c1b7f5
Revert to hivemind==1.1.3 for stability (#129) 2022-12-03 17:36:05 +04:00
Alexander Borzunov
893987ebf8
Require hivemind==1.1.4 with p2pd v0.3.13 (#121) 2022-12-03 00:16:14 +04:00
Alexander Borzunov
7bd5916744
Make Petals a pip-installable package (attempt 2) (#102)
1. Petals can now be installed using `pip install git+https://github.com/bigscience-workshop/petals`
    - If you have already cloned the repo, you can do `pip install .` or `pip install .[dev]`
2. Moved `src` => `src/petals`
    - Replaced `from src.smth import smth` with `from petals.smth import smth`
3. Moved `cli` => `src/petals/cli`
    - Replaced `python -m cli.run_smth` with `python -m petals.cli.run_smth` (all utilities are now available right after pip installation)
4. Moved the `requirements*.txt` contents to `setup.cfg` (`requirements.txt` for packages is not supported well by modern packaging utils)
5. Increased the package version from `0.2` to `1.0alpha1`
2022-11-30 10:41:13 +04:00