Commit Graph

26 Commits (b9f0a5467fc67fe6e93d2901484dd5f36d60a316)

Author SHA1 Message Date
Artem Chumachenko b9f0a5467f
Support peft LoRA adapters (#335)
Implement an option to deploy PEFT adapters to a server. Clients can set active_adapter=... to use these adapters.

---------

Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: justheuristic <justheuristic@gmail.com>
11 months ago
Alexander Borzunov dfc6578c8e
Use bitsandbytes 0.40.0.post4 with bias hotfix (#342)
This PR includes a bnb hotfix: 90b0ac57b0
11 months ago
Alexander Borzunov fa095f6461
Use 4-bit for llama by default, use bitsandbytes 0.40.0.post3 (#340)
NF4 inference with bitsandbytes 0.40.0.post3 is ~2x faster than int8 inference, though training is still ~3x slower, see:

- [bitsandbytes 0.40.0 Release notes](https://github.com/TimDettmers/bitsandbytes/releases/tag/0.40.0)
- [RPS benchmarks](https://github.com/bigscience-workshop/petals/pull/333#issuecomment-1614040385)

We've decided to use NF4 by default for LLaMA.
11 months ago
Alexander Borzunov de930918a0
Support loading blocks in 4-bit (QLoRA NF4 format, disabled by default) (#333) 11 months ago
Alexander Borzunov 66a47c763e
Require pydantic < 2.0 (2.0 is incompatible with hivemind 1.1.8) (#337)
See https://github.com/learning-at-home/hivemind/pull/573.
11 months ago
Alexander Borzunov cb3f018f9f
Add LLaMA support (#323)
This PR:

1. **Abolishes the model conversion procedure.** Now, models are downloaded directly from original repositories like https://huggingface.co/bigscience/bloom. Servers download only shards with blocks to be hosted, and clients download only shards with input/output embeddings and layernorms.

    - BLOOM is loaded from `bigscience/bloom`, but we use the DHT prefix `bigscience/bloom-petals` for backward compatibility. Same with smaller BLOOMs and BLOOMZ.
    - LLaMA can be loaded from any repo like `username/llama-65b-hf`, but we use the DHT prefix `llama-65b-hf` (without the username) to accomodate blocks from different repos (there're a few of them with minor differences, such as `Llama` vs. `LLaMA` in the class name).

2. **Refactors the client to generalize it for multiple models.** Now, we have `petals.models` packages that contain model-specific code (e.g. `petals.models.bloom`, `petals.models.llama`). General code (e.g. CPU-efficient LM head, p-tuning) is kept in `petals.client`.

3. **Introduces** `WrappedLlamaBlock`, `DistributedLlamaConfig`, `DistributedLlamaForCausalLM`, `DistributedLlamaForSequenceClassification`, and `DistributedLlamaModel` compatible with Petals functionality (p-tuning, adapters, etc.).

4. **Introduces** `AutoDistributedConfig` that automatically chooses the correct config class (`DistributedLlamaConfig` or `DistributedBloomConfig`). The refactored configs contain all model-specific info for both clients and servers.

Upgrade instructions:

- Remove disk caches for blocks in old (converted) format to save disk space. That is, remove `~/.cache/petals/model--bigscience--bloom-petals` and  `~/.cache/petals/model--bigscience--bloomz-petals` directories (if present).
12 months ago
Alexander Borzunov 0a313bf6c5
Update hivemind to 1.1.8, enable efficient bfloat16 encoding (#311)
This PR:

1. Updates hivemind to 1.1.8 (includes https://github.com/learning-at-home/hivemind/pull/565)
2. Enables efficient bfloat16 serialization by default (`USE_LEGACY_BFLOAT16 = False`)
3. Removes logging code that was included to hivemind in https://github.com/learning-at-home/hivemind/pull/542
1 year ago
Alexander Borzunov 454c193863
Fix OOMs happening in case of accelerate >= 0.16.0 (#310)
- After #285, `load_pretrained_block()` uses `accelerate.utils.set_module_tensor_to_device()`
- In accelerate>=0.16.0, it saves the tensor in the dtype previously used by the model instead of dtype of the weights (https://github.com/huggingface/accelerate/pull/920)
- Because of that, blocks and attention caches used float32, which caused OOMs
- This PR makes `load_pretrained_block()` respect `torch_dtype` (default: `"auto"`, which means reading `torch_dtype` from `config.json`)
1 year ago
Alexander Borzunov 98be9ffe4c
Relax the rest of Hugging Face dependencies (#305) 1 year ago
Alexander Borzunov 35662b4a16
Require bitsandbytes == 0.38.0.post2, hivemind == 1.1.7 (#302)
In particular, this PR fixes 8-bit support on nvidia16 GPUs (such as 1660) by including https://github.com/TimDettmers/bitsandbytes/pull/292. This support was requested multiple times on Discord.
1 year ago
Alexander Borzunov 2116df08bc
Fix deps, enable 8-bit by default for TP (#298)
This PR fixes issues of #290:

- hivemind bfloat16 codec crashed on dummy tensors (with 0 elements), see https://github.com/learning-at-home/hivemind/pull/560 (this PR makes Petals depend on the latest hivemind version from the repo, it's temporary)
- transformers version check mismatched with the version allowed in `setup.cfg`

Also:

- This PR enables 8-bit by default for TP. Even though TP in 8-bit may be slower, we currently prefer to host more blocks to increase the network's stability.
1 year ago
justheuristic 987f4d2b2f
Update bitsandbytes, hivemind, transformers (#290)
- new bitsandbytes supports newer *and* older GPUs
- new hivemind supports a better bfloat16 codec

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
1 year ago
Alexander Borzunov a7d3d02194
Fix invalid author email in setup.cfg (#287) 1 year ago
Alexander Borzunov 6ba63c6cc8
Fix output shape when resuming generation (#211)
Before this PR, `model.generate()` returned one excess token when resuming generation with an existing (the last token of the previous session, `session.last_token_id`). This is an unexpected behavior not convenient for the downstream apps, so this PR changes it until it's too late.
1 year ago
Alexander Borzunov 6b12b0d050
Report server version and dht.client_mode in rpc_info(), check for updates on startup (#209)
This PR:

1. Shows the current Petals version and checks for updates on startup.
2. Reports the current version and DHT mode in `rpc_info()`, so it can be shown on http://health.petals.ml or used on clients for efficient routing.
1 year ago
Alexander Borzunov 82c9f93ce6
Bump version to 1.1.0 (#190) 1 year ago
Egiazarian Vage 93bed7da5a
Support libp2p relays for NAT traversal (#186)
- Added relay options to servers
- Enabled relay options by default
- Changed hivemind version to 1.1.5
- Moved reachability check to be performed after blocks are loaded

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
1 year ago
Alexander Borzunov 0f6464103d
Remove protobuf from requirements (#182)
A correct protobuf version should be already installed by hivemind.

This also resolves version conflict on Colab, where protobuf versions required by Petals were different from the ones required by pre-installed tensorflow and tensorboard packages.

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
1 year ago
Alexander Borzunov 55698381d0
Disable chunked_forward() on AVX512 CPUs (#179) 1 year ago
justheuristic ae9e71fe8e
Add local tensor-parallel fwd/bwd (#143)
This pull request adds an option to run Petals server on multiple local GPUs. It uses https://github.com/BlackSamorez/tensor_parallel

- 8bit approximation error same as in main (mean~=2% q0.9~=5%)
    - TP=1, 2, 3 (see screenshots above)
- forward, grad w.r.t. input and inference exact match with main with TP=1
- `>=`80% GPU utilization with 3x 1080ti, batch = 8 tokens
- throughput measured with and without TP
- TP on 1080Tis has near-linear speedup comparable to the benchmarks (see first message)


Co-authored-by: Iaroslav Lisniak <yalisnyak@nes.ru>
Co-authored-by: Andrei Panferov <andrei@blacksamorez.ru>
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
1 year ago
Aleksandr Borzunov ff8ade8d3b Bump version to 1.0.0 1 year ago
justheuristic 91898c3c90
Switch to speedtest-cli (#157)
This pullrequest removes custom speed_test code in favour of speedtest-cli module.
This is necessary to ensure that random warnings / print-outs do not mess with our outputs.

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
2 years ago
justheuristic b04982c1a2
Bump transformers to 4.25.1 (#151)
- latest accelerate, transformers, huggingface_hub
- rearrange attention caches to support https://github.com/huggingface/transformers/pull/18344
- remove unused code
- fix edge case where session crashes when receiving seq length 0
- assert transformer version when importing WrappedBloomBlock

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
2 years ago
Alexander Borzunov b8e1c1b7f5
Revert to hivemind==1.1.3 for stability (#129) 2 years ago
Alexander Borzunov 893987ebf8
Require hivemind==1.1.4 with p2pd v0.3.13 (#121) 2 years ago
Alexander Borzunov 7bd5916744
Make Petals a pip-installable package (attempt 2) (#102)
1. Petals can be now installed using `pip install git+https://github.com/bigscience-workshop/petals`
    - In case if you already cloned the repo, you can do `pip install .` or `pip install .[dev]`
2. Moved `src` => `src/petals`
    - Replaced `from src.smth import smth` with `from petals.smth import smth`
3. Moved `cli` => `src/petals/cli`
    - Replaced `python -m cli.run_smth` with `python -m petals.cli.run_smth` (all utilities are now available right after pip installation)
4. Moved the `requirements*.txt` contents to `setup.cfg` (`requirements.txt` for packages is not supported well by modern packaging utils)
5. Increased the package version from `0.2` to `1.0alpha1`
2 years ago