petals

Commit Graph

Author	SHA1	Message	Date
Artem Chumachenko	b9f0a5467f	Support peft LoRA adapters (#335 ) Implement an option to deploy PEFT adapters to a server. Clients can set active_adapter=... to use these adapters. --------- Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com> Co-authored-by: justheuristic <justheuristic@gmail.com>	11 months ago
Alexander Borzunov	dfc6578c8e	Use bitsandbytes 0.40.0.post4 with bias hotfix (#342 ) This PR includes a bnb hotfix: `90b0ac57b0`	11 months ago
Alexander Borzunov	fa095f6461	Use 4-bit for llama by default, use bitsandbytes 0.40.0.post3 (#340 ) NF4 inference with bitsandbytes 0.40.0.post3 is ~2x faster than int8 inference, though training is still ~3x slower, see: - [bitsandbytes 0.40.0 Release notes](https://github.com/TimDettmers/bitsandbytes/releases/tag/0.40.0) - [RPS benchmarks](https://github.com/bigscience-workshop/petals/pull/333#issuecomment-1614040385) We've decided to use NF4 by default for LLaMA.	11 months ago
Alexander Borzunov	de930918a0	Support loading blocks in 4-bit (QLoRA NF4 format, disabled by default) (#333 )	11 months ago
Alexander Borzunov	66a47c763e	Require pydantic < 2.0 (2.0 is incompatible with hivemind 1.1.8) (#337 ) See https://github.com/learning-at-home/hivemind/pull/573.	11 months ago
Alexander Borzunov	cb3f018f9f	Add LLaMA support (#323 ) This PR: 1. Abolishes the model conversion procedure. Now, models are downloaded directly from original repositories like https://huggingface.co/bigscience/bloom. Servers download only shards with blocks to be hosted, and clients download only shards with input/output embeddings and layernorms. - BLOOM is loaded from `bigscience/bloom`, but we use the DHT prefix `bigscience/bloom-petals` for backward compatibility. Same with smaller BLOOMs and BLOOMZ. - LLaMA can be loaded from any repo like `username/llama-65b-hf`, but we use the DHT prefix `llama-65b-hf` (without the username) to accommodate blocks from different repos (there are a few of them with minor differences, such as `Llama` vs. `LLaMA` in the class name). 2. Refactors the client to generalize it for multiple models. Now, we have `petals.models` packages that contain model-specific code (e.g. `petals.models.bloom`, `petals.models.llama`). General code (e.g. CPU-efficient LM head, p-tuning) is kept in `petals.client`. 3. Introduces `WrappedLlamaBlock`, `DistributedLlamaConfig`, `DistributedLlamaForCausalLM`, `DistributedLlamaForSequenceClassification`, and `DistributedLlamaModel` compatible with Petals functionality (p-tuning, adapters, etc.). 4. Introduces `AutoDistributedConfig` that automatically chooses the correct config class (`DistributedLlamaConfig` or `DistributedBloomConfig`). The refactored configs contain all model-specific info for both clients and servers. Upgrade instructions: - Remove disk caches for blocks in old (converted) format to save disk space. That is, remove `~/.cache/petals/model--bigscience--bloom-petals` and `~/.cache/petals/model--bigscience--bloomz-petals` directories (if present).	12 months ago
Alexander Borzunov	0a313bf6c5	Update hivemind to 1.1.8, enable efficient bfloat16 encoding (#311 ) This PR: 1. Updates hivemind to 1.1.8 (includes https://github.com/learning-at-home/hivemind/pull/565) 2. Enables efficient bfloat16 serialization by default (`USE_LEGACY_BFLOAT16 = False`) 3. Removes logging code that was included in hivemind in https://github.com/learning-at-home/hivemind/pull/542	1 year ago
Alexander Borzunov	454c193863	Fix OOMs happening in case of accelerate >= 0.16.0 (#310 ) - After #285, `load_pretrained_block()` uses `accelerate.utils.set_module_tensor_to_device()` - In accelerate>=0.16.0, it saves the tensor in the dtype previously used by the model instead of dtype of the weights (https://github.com/huggingface/accelerate/pull/920) - Because of that, blocks and attention caches used float32, which caused OOMs - This PR makes `load_pretrained_block()` respect `torch_dtype` (default: `"auto"`, which means reading `torch_dtype` from `config.json`)	1 year ago
Alexander Borzunov	98be9ffe4c	Relax the rest of Hugging Face dependencies (#305 )	1 year ago
Alexander Borzunov	35662b4a16	Require bitsandbytes == 0.38.0.post2, hivemind == 1.1.7 (#302 ) In particular, this PR fixes 8-bit support on NVIDIA 16-series GPUs (such as the 1660) by including https://github.com/TimDettmers/bitsandbytes/pull/292. This support was requested multiple times on Discord.	1 year ago
Alexander Borzunov	2116df08bc	Fix deps, enable 8-bit by default for TP (#298 ) This PR fixes issues of #290: - hivemind bfloat16 codec crashed on dummy tensors (with 0 elements), see https://github.com/learning-at-home/hivemind/pull/560 (this PR makes Petals depend on the latest hivemind version from the repo, it's temporary) - transformers version check mismatched with the version allowed in `setup.cfg` Also: - This PR enables 8-bit by default for TP. Even though TP in 8-bit may be slower, we currently prefer to host more blocks to increase the network's stability.	1 year ago
justheuristic	987f4d2b2f	Update bitsandbytes, hivemind, transformers (#290 ) - new bitsandbytes supports newer and older GPUs - new hivemind supports a better bfloat16 codec Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>	1 year ago
Alexander Borzunov	a7d3d02194	Fix invalid author email in setup.cfg (#287 )	1 year ago
Alexander Borzunov	6ba63c6cc8	Fix output shape when resuming generation (#211 ) Before this PR, `model.generate()` returned one excess token when resuming generation with an existing session (the last token of the previous session, `session.last_token_id`). This is unexpected behavior that is inconvenient for downstream apps, so this PR changes it before it's too late.	1 year ago
Alexander Borzunov	6b12b0d050	Report server version and dht.client_mode in rpc_info(), check for updates on startup (#209 ) This PR: 1. Shows the current Petals version and checks for updates on startup. 2. Reports the current version and DHT mode in `rpc_info()`, so it can be shown on http://health.petals.ml or used on clients for efficient routing.	1 year ago
Alexander Borzunov	82c9f93ce6	Bump version to 1.1.0 (#190 )	1 year ago
Egiazarian Vage	93bed7da5a	Support libp2p relays for NAT traversal (#186 ) - Added relay options to servers - Enabled relay options by default - Changed hivemind version to 1.1.5 - Moved reachability check to be performed after blocks are loaded Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>	1 year ago
Alexander Borzunov	0f6464103d	Remove protobuf from requirements (#182 ) A correct protobuf version should be already installed by hivemind. This also resolves version conflict on Colab, where protobuf versions required by Petals were different from the ones required by pre-installed tensorflow and tensorboard packages. Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>	1 year ago
Alexander Borzunov	55698381d0	Disable chunked_forward() on AVX512 CPUs (#179 )	1 year ago
justheuristic	ae9e71fe8e	Add local tensor-parallel fwd/bwd (#143 ) This pull request adds an option to run Petals server on multiple local GPUs. It uses https://github.com/BlackSamorez/tensor_parallel - 8bit approximation error same as in main (mean~=2% q0.9~=5%) - TP=1, 2, 3 (see screenshots above) - forward, grad w.r.t. input and inference exact match with main with TP=1 - `>=`80% GPU utilization with 3x 1080ti, batch = 8 tokens - throughput measured with and without TP - TP on 1080Tis has near-linear speedup comparable to the benchmarks (see first message) Co-authored-by: Iaroslav Lisniak <yalisnyak@nes.ru> Co-authored-by: Andrei Panferov <andrei@blacksamorez.ru> Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>	1 year ago
Aleksandr Borzunov	ff8ade8d3b	Bump version to 1.0.0	1 year ago
justheuristic	91898c3c90	Switch to speedtest-cli (#157 ) This pull request removes custom speed_test code in favour of the speedtest-cli module. This is necessary to ensure that random warnings / print-outs do not mess with our outputs. Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>	2 years ago
justheuristic	b04982c1a2	Bump transformers to 4.25.1 (#151 ) - latest accelerate, transformers, huggingface_hub - rearrange attention caches to support https://github.com/huggingface/transformers/pull/18344 - remove unused code - fix edge case where session crashes when receiving seq length 0 - assert transformer version when importing WrappedBloomBlock Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com> Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>	2 years ago
Alexander Borzunov	b8e1c1b7f5	Revert to hivemind==1.1.3 for stability (#129 )	2 years ago
Alexander Borzunov	893987ebf8	Require hivemind==1.1.4 with p2pd v0.3.13 (#121 )	2 years ago
Alexander Borzunov	7bd5916744	Make Petals a pip-installable package (attempt 2) (#102 ) 1. Petals can now be installed using `pip install git+https://github.com/bigscience-workshop/petals` - If you have already cloned the repo, you can do `pip install .` or `pip install .[dev]` 2. Moved `src` => `src/petals` - Replaced `from src.smth import smth` with `from petals.smth import smth` 3. Moved `cli` => `src/petals/cli` - Replaced `python -m cli.run_smth` with `python -m petals.cli.run_smth` (all utilities are now available right after pip installation) 4. Moved the `requirements*.txt` contents to `setup.cfg` (`requirements.txt` for packages is not supported well by modern packaging utils) 5. Increased the package version from `0.2` to `1.0alpha1`	2 years ago

26 Commits (b9f0a5467fc67fe6e93d2901484dd5f36d60a316)
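
The upgrade instructions in commit cb3f018f9f ask operators to remove the old converted-format block caches. A minimal sketch of that cleanup, using only the Python standard library, might look like the following. The two directory names and the `~/.cache/petals` location come from the commit message; the function name `remove_old_block_caches` and the constant `OLD_BLOCK_CACHES` are hypothetical helpers introduced here for illustration and are not part of Petals.

```python
import shutil
from pathlib import Path

# Directory names are taken from the upgrade instructions in commit cb3f018f9f.
OLD_BLOCK_CACHES = (
    "model--bigscience--bloom-petals",
    "model--bigscience--bloomz-petals",
)


def remove_old_block_caches(cache_root: Path) -> list[Path]:
    """Delete old converted-format block caches under cache_root, if present.

    Hypothetical helper (not a Petals API). Returns the directories removed.
    """
    removed = []
    for name in OLD_BLOCK_CACHES:
        path = cache_root / name
        if path.is_dir():
            shutil.rmtree(path)  # recursively delete the cache directory
            removed.append(path)
    return removed


if __name__ == "__main__":
    # Assumption: the cache lives at ~/.cache/petals, as the commit message states.
    for path in remove_old_block_caches(Path.home() / ".cache" / "petals"):
        print(f"Removed {path}")
```

Directories that are not present are skipped, so running the script a second time removes nothing and other entries in the cache root are left untouched.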