Commit Graph

53 Commits (19be29e89e2fe15a68e225d8c44986b66a058b7e)

Author SHA1 Message Date
justheuristic 6477cb85e7
Bump transformers to 4.43.1 (#596)
* Update setup.cfg to transformers 4.43.1
* Update __init__.py to transformers 4.43.1
* add cache_position check for Mixtral

Co-authored-by: xtinkt <ant.sinitsin@gmail.com>
Co-authored-by: Anton Sinitsin <30695750+xtinkt@users.noreply.github.com>
3 months ago
Artem Chumachenko f1e1b051d0
Update peft dependency, fix initialization and inference with new peft (#557)
* Make fixes

* lib number

* Fix inference without adapter

* Fix trainability

* Fix versions

* style

* Update comments

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>

* Remove unnecessary todo

---------

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
Co-authored-by: justheuristic <justheuristic@gmail.com>
3 months ago
Anton Sinitsin 68585864ae
Update transformers to 4.41.2 (#583)
* updated transformers lib to 4.41.2

* fix all versions ranges

* fix _seen_tokens

* downgrade numpy

* seq_len fix
4 months ago
Priyanshupareek e268c99a6b
Restrict PyTorch version to <2.3.0 to resolve import error (#577)
* Pin PyTorch version to 2.2.2 to resolve import error

Addresses the import error encountered with PyTorch 2.3.0, as detailed in issue #576.
Fixes #576

* Update setup.cfg

Modified the version constraint for PyTorch in setup.cfg to `torch>=1.12,<2.3.0` to avoid the import errors introduced in version 2.3.0 while still supporting earlier compatible versions. This change follows feedback from @mryab to allow flexibility for users on different versions.
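A minimal sketch of checking versions against the constraint described above, using the `packaging` library; the specifier mirrors the setup.cfg change, and the snippet itself is illustrative:

```python
# Illustrative only: verify versions against the constraint from setup.cfg.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

torch_spec = SpecifierSet(">=1.12,<2.3.0")
assert Version("2.2.2") in torch_spec      # last version still allowed
assert Version("2.3.0") not in torch_spec  # excluded due to the import error
```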
6 months ago
justheuristic 2ad0b2b936
Fix p2p pushing in rpc_inference (by @miaoqijun), support transformers 4.38.2 (#563)
This pull request solves #560 using a solution proposed by @miaoqijun.
It also bumps transformers to the latest version to test with the latest code.

---------

Co-authored-by: Yingtong Dou <ytongdou@gmail.com>
7 months ago
Denis Mazur 0d91bbdac3
Bump transformers and accelerate versions (#554)
Bump versions for transformers and accelerate, remove falcon-rw-1b CI tests
8 months ago
justheuristic 25a0796b39
Hotfix: require peft version 0.5.0 (#539)
Peft: strict version check for now

Co-authored-by: horik <hr.mail.qaq@gmail.com>
12 months ago
justheuristic dcce43670f
Hotfix: set transformers version <=4.34 temporarily (#538)
* fix transformers version for now


Co-authored-by: horik <hr.mail.qaq@gmail.com>
12 months ago
Alexander Borzunov abd547735f
Force use_cache=True (#496) 1 year ago
Alexander Borzunov 26ebbfe8f0
Support macOS (#477)
This PR makes both clients and servers work on macOS. Specifically, it:

- Follows https://github.com/learning-at-home/hivemind/pull/586 to run a macOS-compatible `p2pd` binary (both x86-64 and ARM64 are supported)
- Fixes forking issues and tests on macOS, Python 3.10+
- Introduces basic support for serving model blocks on Apple M1/M2 GPUs (torch.mps)
- Increases the max number of open files by default (the default limit is not enough on Linux and is really small on macOS)
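A hedged sketch of raising the open-file limit from Python, in the spirit of the last bullet; the target value is an arbitrary example, not the value Petals uses:

```python
# Illustrative: raise the soft RLIMIT_NOFILE toward the hard limit (Unix only).
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 32768 if hard == resource.RLIM_INFINITY else min(hard, 32768)
if soft < target:
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```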
1 year ago
Alexander Borzunov 915b357740
Require transformers>=4.32.0 (#479)
This is necessary to load https://huggingface.co/petals-team/StableBeluga2, since it doesn't have the deprecated `inv_freq` weights.
1 year ago
Alexander Borzunov 18e93afc73
Don't install cpufeature on non-x86_64 machines (#478)
Necessary since cpufeature crashes when installing on ARM.
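A hedged sketch of a platform-conditional dependency using a standard PEP 508 environment marker, shown setup.py-style for illustration (the real change lives in setup.cfg, and the package layout here is hypothetical):

```python
# Illustrative setup.py-style declaration of a conditional dependency.
from setuptools import setup

setup(
    name="example-package",  # hypothetical name
    install_requires=[
        # Install cpufeature only on x86_64: it crashes when installed on ARM.
        'cpufeature; platform_machine == "x86_64"',
    ],
)
```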
1 year ago
Artem Chumachenko a14ae7334d
Update peft to 0.5.0 version (#475)
Update peft to 0.5.0
1 year ago
justheuristic 4f850996bb
Change transformers version assert (#472) 1 year ago
justheuristic 9250025140
Support transformers 4.32.x (#471) 1 year ago
justheuristic adda5f8c20
Temporarily require peft<0.5.0, transformers<4.32.0 (#470)
Peft 0.5 was recently released and broke some compatibilities. This PR temporarily requires Petals to use the previous stable version of peft while we work on 0.5.0 support.
1 year ago
Alexander Borzunov 593d980ad8
Use bitsandbytes 0.41.1 (#442) 1 year ago
Alexander Borzunov f3fafd14a4
Bump version to 2.0.1 (#411) 1 year ago
Alexander Borzunov eb0664b993
Support Python 3.11 (#393) 1 year ago
Alexander Borzunov e9a20e7e53
Require accelerate>=0.20.3 as transformers do (#383) 1 year ago
Alexander Borzunov 895327a0ae
Fix readme code example, require Python < 3.11 until supported (#374)
* Fix readme code example

* Require Python < 3.11 until it's supported
1 year ago
Alexander Borzunov c735dd7ba3
Update transformers to 4.31.0 and peft to 0.4.0 (#371) 1 year ago
Alexander Borzunov f97582fb5f
Require transformers < 4.31.0 until we're compatible (#369) 1 year ago
Alexander Borzunov 62d9ed5ce7
Implement shortest-path routing for inference (#362)
This PR:

1. **Adds shortest-path routing for inference.** We build a graph with client-server and server-server latencies and compute costs, as well as empirically measured overheads (a toy sketch of the search follows this list). For client-server latencies, we ping possible first and last servers in a sequence in `SequenceManager.update()`. We penalize servers that may not have enough cache for our request. This uses info added to the DHT in #355, #356, #358.

2. **Makes a server ping neighboring servers in addition to the next ones.** This gives the client an opportunity to switch servers even before it uses all of a server's blocks (e.g., because a neighboring server is faster). This feature is not enabled yet, though, since it increases the graph size for N servers to O(N^2) - but we may enable it if needed.

3. **Fixes a `SequenceManager` bug with the first `update()`.** Previously, this update was likely to produce incorrect information and cause `MissingBlocksError`s until the next update happened.
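A toy sketch of the shortest-path search from item 1, assuming a precomputed graph of latency/compute costs; this is illustrative only, not Petals' actual `SequenceManager` code:

```python
# Toy Dijkstra over edges[node] -> list of (neighbor, cost); illustrative only.
# A node could be a (server, block index) pair, with edge costs mixing measured
# network latency and per-block compute time.
import heapq

def shortest_route(start, goal, edges):
    queue, seen = [(0.0, start, [start])], set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, edge_cost in edges.get(node, ()):
            if neighbor not in seen:
                heapq.heappush(queue, (cost + edge_cost, neighbor, path + [neighbor]))
    return float("inf"), []
```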
1 year ago
Alexander Borzunov 3f733a96e3
Use bitsandbytes 0.40.1.post1 (#357) 1 year ago
Alexander Borzunov 2c8959e713
Share more info about a server in DHT (#355) 1 year ago
Alexander Borzunov 1a78638c02
Test that bitsandbytes is not imported when it's not used (#351)
We avoid importing bitsandbytes when it's not used, since bitsandbytes doesn't always find the correct CUDA libs and may raise exceptions because of that.
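A hedged sketch of such a check: import the package in a fresh interpreter and assert bitsandbytes was never pulled in. This is illustrative, not the repo's actual test:

```python
# Illustrative: run the import in a clean subprocess so sys.modules starts fresh.
import subprocess
import sys

code = "import sys, petals; assert 'bitsandbytes' not in sys.modules"
subprocess.run([sys.executable, "-c", code], check=True)
```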
1 year ago
Artem Chumachenko b9f0a5467f
Support peft LoRA adapters (#335)
Implement an option to deploy PEFT adapters to a server. Clients can set `active_adapter=...` to use these adapters.
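A hedged usage sketch of the client side, assuming the `DistributedLlamaForCausalLM` class introduced in #323 below; both repo names are placeholders:

```python
# Illustrative client usage; model and adapter repo names are hypothetical.
from petals import DistributedLlamaForCausalLM

model = DistributedLlamaForCausalLM.from_pretrained(
    "username/llama-65b-hf",
    active_adapter="username/my-lora-adapter",  # a PEFT LoRA repo served by peers
)
```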

---------

Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: justheuristic <justheuristic@gmail.com>
1 year ago
Alexander Borzunov dfc6578c8e
Use bitsandbytes 0.40.0.post4 with bias hotfix (#342)
This PR includes a bnb hotfix: 90b0ac57b0
1 year ago
Alexander Borzunov fa095f6461
Use 4-bit for llama by default, use bitsandbytes 0.40.0.post3 (#340)
NF4 inference with bitsandbytes 0.40.0.post3 is ~2x faster than int8 inference, though training is still ~3x slower, see:

- [bitsandbytes 0.40.0 Release notes](https://github.com/TimDettmers/bitsandbytes/releases/tag/0.40.0)
- [RPS benchmarks](https://github.com/bigscience-workshop/petals/pull/333#issuecomment-1614040385)

We've decided to use NF4 by default for LLaMA.
1 year ago
Alexander Borzunov de930918a0
Support loading blocks in 4-bit (QLoRA NF4 format, disabled by default) (#333) 1 year ago
Alexander Borzunov 66a47c763e
Require pydantic < 2.0 (2.0 is incompatible with hivemind 1.1.8) (#337)
See https://github.com/learning-at-home/hivemind/pull/573.
1 year ago
Alexander Borzunov cb3f018f9f
Add LLaMA support (#323)
This PR:

1. **Abolishes the model conversion procedure.** Now, models are downloaded directly from original repositories like https://huggingface.co/bigscience/bloom. Servers download only shards with blocks to be hosted, and clients download only shards with input/output embeddings and layernorms.

    - BLOOM is loaded from `bigscience/bloom`, but we use the DHT prefix `bigscience/bloom-petals` for backward compatibility. Same with smaller BLOOMs and BLOOMZ.
    - LLaMA can be loaded from any repo like `username/llama-65b-hf`, but we use the DHT prefix `llama-65b-hf` (without the username) to accommodate blocks from different repos (there are a few of them with minor differences, such as `Llama` vs. `LLaMA` in the class name).

2. **Refactors the client to generalize it for multiple models.** Now, we have `petals.models` packages that contain model-specific code (e.g. `petals.models.bloom`, `petals.models.llama`). General code (e.g. CPU-efficient LM head, p-tuning) is kept in `petals.client`.

3. **Introduces** `WrappedLlamaBlock`, `DistributedLlamaConfig`, `DistributedLlamaForCausalLM`, `DistributedLlamaForSequenceClassification`, and `DistributedLlamaModel` compatible with Petals functionality (p-tuning, adapters, etc.).

4. **Introduces** `AutoDistributedConfig` that automatically chooses the correct config class (`DistributedLlamaConfig` or `DistributedBloomConfig`). The refactored configs contain all model-specific info for both clients and servers. A usage sketch follows the upgrade instructions below.

Upgrade instructions:

- Remove disk caches for blocks in the old (converted) format to save disk space. That is, remove the `~/.cache/petals/model--bigscience--bloom-petals` and `~/.cache/petals/model--bigscience--bloomz-petals` directories (if present).
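A hedged usage sketch of the `AutoDistributedConfig` dispatch from item 4; the repo name comes from the description above, and the expected class name follows it:

```python
# Illustrative: AutoDistributedConfig picks the model-specific config class.
from petals import AutoDistributedConfig

config = AutoDistributedConfig.from_pretrained("bigscience/bloom")
print(type(config).__name__)  # expected: DistributedBloomConfig
```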
1 year ago
Alexander Borzunov 0a313bf6c5
Update hivemind to 1.1.8, enable efficient bfloat16 encoding (#311)
This PR:

1. Updates hivemind to 1.1.8 (includes https://github.com/learning-at-home/hivemind/pull/565)
2. Enables efficient bfloat16 serialization by default (`USE_LEGACY_BFLOAT16 = False`)
3. Removes logging code that was included to hivemind in https://github.com/learning-at-home/hivemind/pull/542
1 year ago
Alexander Borzunov 454c193863
Fix OOMs happening in case of accelerate >= 0.16.0 (#310)
- After #285, `load_pretrained_block()` uses `accelerate.utils.set_module_tensor_to_device()`
- In accelerate>=0.16.0, it saves the tensor in the dtype previously used by the model instead of the dtype of the weights (https://github.com/huggingface/accelerate/pull/920)
- Because of that, blocks and attention caches used float32, which caused OOMs
- This PR makes `load_pretrained_block()` respect `torch_dtype` (default: `"auto"`, which means reading `torch_dtype` from `config.json`)
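A hedged sketch of the idea behind the fix: convert weights to the intended dtype before handing them to accelerate, so the module's previous dtype is not silently kept. Names are illustrative, not the actual `load_pretrained_block()` code:

```python
# Illustrative: pass an explicitly converted tensor to accelerate's setter.
import torch
from accelerate.utils import set_module_tensor_to_device

block = torch.nn.Linear(4, 4)  # stand-in for a transformer block
new_weight = torch.randn(4, 4, dtype=torch.float32)
torch_dtype = torch.bfloat16   # e.g., read from config.json when "auto"
set_module_tensor_to_device(block, "weight", "cpu", value=new_weight.to(torch_dtype))
```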
2 years ago
Alexander Borzunov 98be9ffe4c
Relax the rest of Hugging Face dependencies (#305) 2 years ago
Alexander Borzunov 35662b4a16
Require bitsandbytes == 0.38.0.post2, hivemind == 1.1.7 (#302)
In particular, this PR fixes 8-bit support on NVIDIA 16-series GPUs (such as the GTX 1660) by including https://github.com/TimDettmers/bitsandbytes/pull/292. This support was requested multiple times on Discord.
2 years ago
Alexander Borzunov 2116df08bc
Fix deps, enable 8-bit by default for TP (#298)
This PR fixes issues of #290:

- the hivemind bfloat16 codec crashed on dummy tensors (with 0 elements), see https://github.com/learning-at-home/hivemind/pull/560 (this PR makes Petals depend on the latest hivemind version from the repo; this is temporary)
- the transformers version check mismatched the version range allowed in `setup.cfg`

Also:

- This PR enables 8-bit by default for TP. Even though TP in 8-bit may be slower, we currently prefer to host more blocks to increase the network's stability.
2 years ago
justheuristic 987f4d2b2f
Update bitsandbytes, hivemind, transformers (#290)
- new bitsandbytes supports newer *and* older GPUs
- new hivemind supports a better bfloat16 codec

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
2 years ago
Alexander Borzunov a7d3d02194
Fix invalid author email in setup.cfg (#287) 2 years ago
Alexander Borzunov 6ba63c6cc8
Fix output shape when resuming generation (#211)
Before this PR, `model.generate()` returned one excess token (the last token of the previous session, `session.last_token_id`) when resuming generation with an existing session. This is unexpected behavior that is inconvenient for downstream apps, so this PR changes it before it's too late.
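A hedged sketch of resuming generation in one session, assuming the inference-session API implied above (`session.last_token_id`); the arguments are illustrative:

```python
# Illustrative: the resumed call should not re-emit session.last_token_id.
with model.inference_session(max_length=128) as session:
    outputs = model.generate(input_ids, max_new_tokens=8, session=session)
    # Resume in the same session: generation continues from the stored state.
    more_outputs = model.generate(None, max_new_tokens=8, session=session)
```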
2 years ago
Alexander Borzunov 6b12b0d050
Report server version and dht.client_mode in rpc_info(), check for updates on startup (#209)
This PR:

1. Shows the current Petals version and checks for updates on startup.
2. Reports the current version and DHT mode in `rpc_info()`, so it can be shown on http://health.petals.ml or used on clients for efficient routing.
2 years ago
Alexander Borzunov 82c9f93ce6
Bump version to 1.1.0 (#190) 2 years ago
Egiazarian Vage 93bed7da5a
Support libp2p relays for NAT traversal (#186)
- Added relay options to servers
- Enabled relay options by default
- Changed hivemind version to 1.1.5
- Moved reachability check to be performed after blocks are loaded

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
2 years ago
Alexander Borzunov 0f6464103d
Remove protobuf from requirements (#182)
A correct protobuf version should already be installed by hivemind.

This also resolves version conflict on Colab, where protobuf versions required by Petals were different from the ones required by pre-installed tensorflow and tensorboard packages.

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
2 years ago
Alexander Borzunov 55698381d0
Disable chunked_forward() on AVX512 CPUs (#179) 2 years ago
justheuristic ae9e71fe8e
Add local tensor-parallel fwd/bwd (#143)
This pull request adds an option to run a Petals server on multiple local GPUs. It uses https://github.com/BlackSamorez/tensor_parallel (see the sketch after the list below).

- 8bit approximation error same as in main (mean~=2% q0.9~=5%)
    - TP=1, 2, 3 (see screenshots above)
- forward, grad w.r.t. input and inference exact match with main with TP=1
- `>=`80% GPU utilization with 3x 1080Ti, batch = 8 tokens
- throughput measured with and without TP
- TP on 1080Tis has near-linear speedup comparable to the benchmarks (see first message)
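A hedged sketch of the multi-GPU option described above, using the linked tensor_parallel package; the model name and device list are illustrative:

```python
# Illustrative: shard a Hugging Face model across two local GPUs.
import tensor_parallel as tp
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])
```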


Co-authored-by: Iaroslav Lisniak <yalisnyak@nes.ru>
Co-authored-by: Andrei Panferov <andrei@blacksamorez.ru>
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
2 years ago
Aleksandr Borzunov ff8ade8d3b Bump version to 1.0.0 2 years ago
justheuristic 91898c3c90
Switch to speedtest-cli (#157)
This pull request removes the custom speed_test code in favour of the speedtest-cli module.
This is necessary to ensure that random warnings / print-outs do not mess with our outputs.
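A hedged sketch of the speedtest-cli module's API (package `speedtest-cli`, module `speedtest`); the methods return throughput in bits per second:

```python
# Illustrative: measure network throughput with the speedtest module.
import speedtest

st = speedtest.Speedtest()
st.get_best_server()
down_bps, up_bps = st.download(), st.upload()
print(f"down: {down_bps / 1e6:.1f} Mbit/s, up: {up_bps / 1e6:.1f} Mbit/s")
```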

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
2 years ago
justheuristic b04982c1a2
Bump transformers to 4.25.1 (#151)
- latest accelerate, transformers, huggingface_hub
- rearrange attention caches to support https://github.com/huggingface/transformers/pull/18344
- remove unused code
- fix edge case where session crashes when receiving seq length 0
- assert transformer version when importing WrappedBloomBlock

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
2 years ago