Commit Graph

53 Commits (19be29e89e2fe15a68e225d8c44986b66a058b7e)

Author SHA1 Message Date
justheuristic 6477cb85e7
Bump transformers to 4.43.1 (#596)
* Update setup.cfg to transformers 4.43.1
* Update __init__.py to transformers 4.43.1
* add cache_position check for Mixtral

Co-authored-by: xtinkt <ant.sinitsin@gmail.com>
Co-authored-by: Anton Sinitsin <30695750+xtinkt@users.noreply.github.com>
3 months ago
Artem Chumachenko f1e1b051d0
Update peft dependency, fix initialization and inference with new peft (#557)
* Make fixes

* lib number

* Fix inference without adapter

* Fix trainability

* Fix versions

* style

* Update comments

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>

* Remove unnecessary todo

---------

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
Co-authored-by: justheuristic <justheuristic@gmail.com>
3 months ago
Anton Sinitsin 68585864ae
Update transformers to 4.41.2 (#583)
* updated transformers lib to 4.41.2

* fix all versions ranges

* fix _seen_tokens

* downgrade numpy

* seq_len fix
4 months ago
Priyanshupareek e268c99a6b
Restrict PyTorch version to <2.3.0 to resolve import error (#577)
* Pin PyTorch version to 2.2.2 to resolve import error

Addresses the import error encountered with PyTorch 2.3.0, as detailed in issue #576.
Fixes #576

* Update setup.cfg

Modified the version constraint for PyTorch in setup.cfg to `torch>=1.12,<2.3.0` to avoid the import errors introduced in version 2.3.0 while still supporting earlier compatible versions. This change follows feedback from @mryab to allow flexibility for users on different versions.
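A minimal sketch of checking versions against the constraint described above, using the `packaging` library; the specifier mirrors the setup.cfg change, and the snippet itself is illustrative:

```python
# Illustrative only: verify versions against the constraint from setup.cfg.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

torch_spec = SpecifierSet(">=1.12,<2.3.0")
assert Version("2.2.2") in torch_spec      # last version still allowed
assert Version("2.3.0") not in torch_spec  # excluded due to the import error
```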
6 months ago
justheuristic 2ad0b2b936
Fix p2p pushing in rpc_inference (by @miaoqijun), support transformers 4.38.2 (#563)
This pull request solves #560 using a solution proposed by @miaoqijun.
It also bumps transformers to the latest version to test with the latest code.

---------

Co-authored-by: Yingtong Dou <ytongdou@gmail.com>
7 months ago
Denis Mazur 0d91bbdac3
Bump transformers and accelerate versions (#554)
Bump versions for transformers and accelerate, remove falcon-rw-1b CI tests
8 months ago
justheuristic 25a0796b39
Hotfix: require peft version 0.5.0 (#539)
Peft: strict version check for now

Co-authored-by: horik <hr.mail.qaq@gmail.com>
12 months ago
justheuristic dcce43670f
Hotfix: set transformers version <=4.34 temporarily (#538)
* fix transformers version for now


Co-authored-by: horik <hr.mail.qaq@gmail.com>
12 months ago
Alexander Borzunov abd547735f
Force use_cache=True (#496) 1 year ago
Alexander Borzunov 26ebbfe8f0
Support macOS (#477)
This PR makes both clients and servers work on macOS. Specifically, it:

- Follows https://github.com/learning-at-home/hivemind/pull/586 to run a macOS-compatible `p2pd` binary (both x86-64 and ARM64 are supported)
- Fixes forking issues and tests on macOS, Python 3.10+
- Introduces basic support for serving model blocks on Apple M1/M2 GPUs (torch.mps)
- Increases the max number of open files by default (the default limit is not enough on Linux and is really small on macOS)
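A hedged sketch of raising the open-file limit from Python, in the spirit of the last bullet; the target value is an arbitrary example, not the value Petals uses:

```python
# Illustrative: raise the soft RLIMIT_NOFILE toward the hard limit (Unix only).
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 32768 if hard == resource.RLIM_INFINITY else min(hard, 32768)
if soft < target:
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```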
1 year ago
Alexander Borzunov 915b357740
Require transformers>=4.32.0 (#479)
This is necessary to load https://huggingface.co/petals-team/StableBeluga2, since it doesn't have the deprecated `inv_freq` weights.
1 year ago
Alexander Borzunov 18e93afc73
Don't install cpufeature on non-x86_64 machines (#478)
Necessary since cpufeature crashes when installing on ARM.
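A hedged sketch of a platform-conditional dependency using a standard PEP 508 environment marker, shown setup.py-style for illustration (the real change lives in setup.cfg, and the package layout here is hypothetical):

```python
# Illustrative setup.py-style declaration of a conditional dependency.
from setuptools import setup

setup(
    name="example-package",  # hypothetical name
    install_requires=[
        # Install cpufeature only on x86_64: it crashes when installed on ARM.
        'cpufeature; platform_machine == "x86_64"',
    ],
)
```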
1 year ago
Artem Chumachenko a14ae7334d
Update peft to 0.5.0 version (#475)
Update peft to 0.5.0
1 year ago
justheuristic 4f850996bb
Change transformers version assert (#472) 1 year ago
justheuristic 9250025140
Support transformers 4.32.x (#471) 1 year ago
justheuristic adda5f8c20
Temporarily require peft<0.5.0, transformers<4.32.0 (#470)
Peft 0.5 was recently released and broke some compatibilities. This PR temporarily requires Petals to use the previous stable version of peft while we work on 0.5.0 support.
1 year ago
Alexander Borzunov 593d980ad8
Use bitsandbytes 0.41.1 (#442) 1 year ago
Alexander Borzunov f3fafd14a4
Bump version to 2.0.1 (#411) 1 year ago
Alexander Borzunov eb0664b993
Support Python 3.11 (#393) 1 year ago
Alexander Borzunov e9a20e7e53
Require accelerate>=0.20.3 as transformers do (#383) 1 year ago
Alexander Borzunov 895327a0ae
Fix readme code example, require Python < 3.11 until supported (#374)
* Fix readme code example

* Require Python < 3.11 until it's supported
1 year ago
Alexander Borzunov c735dd7ba3
Update transformers to 4.31.0 and peft to 0.4.0 (#371) 1 year ago
Alexander Borzunov f97582fb5f
Require transformers < 4.31.0 until we're compatible (#369) 1 year ago
Alexander Borzunov 62d9ed5ce7
Implement shortest-path routing for inference (#362)
This PR:

1. **Adds shortest-path routing for inference.** We build a graph with client-server and server-server latencies and compute costs, as well as empirically measured overheads (a toy sketch of the search follows this list). For client-server latencies, we ping possible first and last servers in a sequence in `SequenceManager.update()`. We penalize servers that may not have enough cache for our request. This uses info added to the DHT in #355, #356, #358.

2. **Makes a server ping neighboring servers in addition to the next ones.** This gives the client an opportunity to switch servers even before it uses all of a server's blocks (e.g., because a neighboring server is faster). This feature is not enabled yet, though, since it increases the graph size for N servers to O(N^2) - but we may enable it if needed.

3. **Fixes a `SequenceManager` bug with the first `update()`.** Previously, this update was likely to produce incorrect information and cause `MissingBlocksError`s until the next update happened.
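A toy sketch of the shortest-path search from item 1, assuming a precomputed graph of latency/compute costs; this is illustrative only, not Petals' actual `SequenceManager` code:

```python
# Toy Dijkstra over edges[node] -> list of (neighbor, cost); illustrative only.
# A node could be a (server, block index) pair, with edge costs mixing measured
# network latency and per-block compute time.
import heapq

def shortest_route(start, goal, edges):
    queue, seen = [(0.0, start, [start])], set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, edge_cost in edges.get(node, ()):
            if neighbor not in seen:
                heapq.heappush(queue, (cost + edge_cost, neighbor, path + [neighbor]))
    return float("inf"), []
```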
1 year ago
Alexander Borzunov 3f733a96e3
Use bitsandbytes 0.40.1.post1 (#357) 1 year ago
Alexander Borzunov 2c8959e713
Share more info about a server in DHT (#355) 1 year ago
Alexander Borzunov 1a78638c02
Test that bitsandbytes is not imported when it's not used (#351)
We avoid importing bitsandbytes when it's not used, since bitsandbytes doesn't always find the correct CUDA libs and may raise exceptions because of that.
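A hedged sketch of such a check: import the package in a fresh interpreter and assert bitsandbytes was never pulled in. This is illustrative, not the repo's actual test:

```python
# Illustrative: run the import in a clean subprocess so sys.modules starts fresh.
import subprocess
import sys

code = "import sys, petals; assert 'bitsandbytes' not in sys.modules"
subprocess.run([sys.executable, "-c", code], check=True)
```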
1 year ago
Artem Chumachenko b9f0a5467f
Support peft LoRA adapters (#335)
Implement an option to deploy PEFT adapters to a server. Clients can set `active_adapter=...` to use these adapters.
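A hedged usage sketch of the client side, assuming the `DistributedLlamaForCausalLM` class introduced in #323 below; both repo names are placeholders:

```python
# Illustrative client usage; model and adapter repo names are hypothetical.
from petals import DistributedLlamaForCausalLM

model = DistributedLlamaForCausalLM.from_pretrained(
    "username/llama-65b-hf",
    active_adapter="username/my-lora-adapter",  # a PEFT LoRA repo served by peers
)
```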

---------

Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: justheuristic <justheuristic@gmail.com>
1 year ago
Alexander Borzunov dfc6578c8e
Use bitsandbytes 0.40.0.post4 with bias hotfix (#342)
This PR includes a bnb hotfix: 90b0ac57b0
1 year ago
Alexander Borzunov fa095f6461
Use 4-bit for llama by default, use bitsandbytes 0.40.0.post3 (#340)
NF4 inference with bitsandbytes 0.40.0.post3 is ~2x faster than int8 inference, though training is still ~3x slower, see:

- [bitsandbytes 0.40.0 Release notes](https://github.com/TimDettmers/bitsandbytes/releases/tag/0.40.0)
- [RPS benchmarks](https://github.com/bigscience-workshop/petals/pull/333#issuecomment-1614040385)

We've decided to use NF4 by default for LLaMA.
1 year ago
Alexander Borzunov de930918a0
Support loading blocks in 4-bit (QLoRA NF4 format, disabled by default) (#333) 1 year ago
Alexander Borzunov 66a47c763e
Require pydantic < 2.0 (2.0 is incompatible with hivemind 1.1.8) (#337)
See https://github.com/learning-at-home/hivemind/pull/573.
1 year ago
Alexander Borzunov cb3f018f9f
Add LLaMA support (#323)
This PR:

1. **Abolishes the model conversion procedure.** Now, models are downloaded directly from original repositories like https://huggingface.co/bigscience/bloom. Servers download only shards with blocks to be hosted, and clients download only shards with input/output embeddings and layernorms.

    - BLOOM is loaded from `bigscience/bloom`, but we use the DHT prefix `bigscience/bloom-petals` for backward compatibility. Same with smaller BLOOMs and BLOOMZ.
    - LLaMA can be loaded from any repo like `username/llama-65b-hf`, but we use the DHT prefix `llama-65b-hf` (without the username) to accommodate blocks from different repos (there are a few of them with minor differences, such as `Llama` vs. `LLaMA` in the class name).

2. **Refactors the client to generalize it for multiple models.** Now, we have `petals.models` packages that contain model-specific code (e.g. `petals.models.bloom`, `petals.models.llama`). General code (e.g. CPU-efficient LM head, p-tuning) is kept in `petals.client`.

3. **Introduces** `WrappedLlamaBlock`, `DistributedLlamaConfig`, `DistributedLlamaForCausalLM`, `DistributedLlamaForSequenceClassification`, and `DistributedLlamaModel` compatible with Petals functionality (p-tuning, adapters, etc.).

4. **Introduces** `AutoDistributedConfig` that automatically chooses the correct config class (`DistributedLlamaConfig` or `DistributedBloomConfig`). The refactored configs contain all model-specific info for both clients and servers. A usage sketch follows the upgrade instructions below.

Upgrade instructions:

- Remove disk caches for blocks in the old (converted) format to save disk space. That is, remove the `~/.cache/petals/model--bigscience--bloom-petals` and `~/.cache/petals/model--bigscience--bloomz-petals` directories (if present).
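A hedged usage sketch of the `AutoDistributedConfig` dispatch from item 4; the repo name comes from the description above, and the expected class name follows it:

```python
# Illustrative: AutoDistributedConfig picks the model-specific config class.
from petals import AutoDistributedConfig

config = AutoDistributedConfig.from_pretrained("bigscience/bloom")
print(type(config).__name__)  # expected: DistributedBloomConfig
```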
1 year ago
Alexander Borzunov 0a313bf6c5
Update hivemind to 1.1.8, enable efficient bfloat16 encoding (#311)
This PR:

1. Updates hivemind to 1.1.8 (includes https://github.com/learning-at-home/hivemind/pull/565)
2. Enables efficient bfloat16 serialization by default (`USE_LEGACY_BFLOAT16 = False`)
3. Removes logging code that was included to hivemind in https://github.com/learning-at-home/hivemind/pull/542
1 year ago
Alexander Borzunov 454c193863
Fix OOMs happening in case of accelerate >= 0.16.0 (#310)
- After #285, `load_pretrained_block()` uses `accelerate.utils.set_module_tensor_to_device()`
- In accelerate>=0.16.0, it saves the tensor in the dtype previously used by the model instead of the dtype of the weights (https://github.com/huggingface/accelerate/pull/920)
- Because of that, blocks and attention caches used float32, which caused OOMs
- This PR makes `load_pretrained_block()` respect `torch_dtype` (default: `"auto"`, which means reading `torch_dtype` from `config.json`)
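A hedged sketch of the idea behind the fix: convert weights to the intended dtype before handing them to accelerate, so the module's previous dtype is not silently kept. Names are illustrative, not the actual `load_pretrained_block()` code:

```python
# Illustrative: pass an explicitly converted tensor to accelerate's setter.
import torch
from accelerate.utils import set_module_tensor_to_device

block = torch.nn.Linear(4, 4)  # stand-in for a transformer block
new_weight = torch.randn(4, 4, dtype=torch.float32)
torch_dtype = torch.bfloat16   # e.g., read from config.json when "auto"
set_module_tensor_to_device(block, "weight", "cpu", value=new_weight.to(torch_dtype))
```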
2 years ago
Alexander Borzunov 98be9ffe4c
Relax the rest of Hugging Face dependencies (#305) 2 years ago
Alexander Borzunov 35662b4a16
Require bitsandbytes == 0.38.0.post2, hivemind == 1.1.7 (#302)
In particular, this PR fixes 8-bit support on NVIDIA 16-series GPUs (such as the GTX 1660) by including https://github.com/TimDettmers/bitsandbytes/pull/292. This support was requested multiple times on Discord.
2 years ago
Alexander Borzunov 2116df08bc
Fix deps, enable 8-bit by default for TP (#298)
This PR fixes issues of #290:

- the hivemind bfloat16 codec crashed on dummy tensors (with 0 elements), see https://github.com/learning-at-home/hivemind/pull/560 (this PR makes Petals depend on the latest hivemind version from the repo; this is temporary)
- the transformers version check mismatched the version range allowed in `setup.cfg`

Also:

- This PR enables 8-bit by default for TP. Even though TP in 8-bit may be slower, we currently prefer to host more blocks to increase the network's stability.
2 years ago
justheuristic 987f4d2b2f
Update bitsandbytes, hivemind, transformers (#290)
- new bitsandbytes supports newer *and* older GPUs
- new hivemind supports a better bfloat16 codec

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
2 years ago
Alexander Borzunov a7d3d02194
Fix invalid author email in setup.cfg (#287) 2 years ago
Alexander Borzunov 6ba63c6cc8
Fix output shape when resuming generation (#211)
Before this PR, `model.generate()` returned one excess token (the last token of the previous session, `session.last_token_id`) when resuming generation with an existing session. This is unexpected behavior that is inconvenient for downstream apps, so this PR changes it before it's too late.
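A hedged sketch of resuming generation in one session, assuming the inference-session API implied above (`session.last_token_id`); the arguments are illustrative:

```python
# Illustrative: the resumed call should not re-emit session.last_token_id.
with model.inference_session(max_length=128) as session:
    outputs = model.generate(input_ids, max_new_tokens=8, session=session)
    # Resume in the same session: generation continues from the stored state.
    more_outputs = model.generate(None, max_new_tokens=8, session=session)
```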
2 years ago
Alexander Borzunov 6b12b0d050
Report server version and dht.client_mode in rpc_info(), check for updates on startup (#209)
This PR:

1. Shows the current Petals version and checks for updates on startup.
2. Reports the current version and DHT mode in `rpc_info()`, so it can be shown on http://health.petals.ml or used on clients for efficient routing.
2 years ago
Alexander Borzunov 82c9f93ce6
Bump version to 1.1.0 (#190) 2 years ago
Egiazarian Vage 93bed7da5a
Support libp2p relays for NAT traversal (#186)
- Added relay options to servers
- Enabled relay options by default
- Changed hivemind version to 1.1.5
- Moved reachability check to be performed after blocks are loaded

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
2 years ago
Alexander Borzunov 0f6464103d
Remove protobuf from requirements (#182)
A correct protobuf version should already be installed by hivemind.

This also resolves version conflict on Colab, where protobuf versions required by Petals were different from the ones required by pre-installed tensorflow and tensorboard packages.

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
2 years ago
Alexander Borzunov 55698381d0
Disable chunked_forward() on AVX512 CPUs (#179) 2 years ago
justheuristic ae9e71fe8e
Add local tensor-parallel fwd/bwd (#143)
This pull request adds an option to run a Petals server on multiple local GPUs. It uses https://github.com/BlackSamorez/tensor_parallel (see the sketch after the list below).

- 8bit approximation error same as in main (mean~=2% q0.9~=5%)
    - TP=1, 2, 3 (see screenshots above)
- forward, grad w.r.t. input and inference exact match with main with TP=1
- `>=`80% GPU utilization with 3x 1080Ti, batch = 8 tokens
- throughput measured with and without TP
- TP on 1080Tis has near-linear speedup comparable to the benchmarks (see first message)
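A hedged sketch of the multi-GPU option described above, using the linked tensor_parallel package; the model name and device list are illustrative:

```python
# Illustrative: shard a Hugging Face model across two local GPUs.
import tensor_parallel as tp
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])
```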


Co-authored-by: Iaroslav Lisniak <yalisnyak@nes.ru>
Co-authored-by: Andrei Panferov <andrei@blacksamorez.ru>
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
2 years ago
Aleksandr Borzunov ff8ade8d3b Bump version to 1.0.0 2 years ago
justheuristic 91898c3c90
Switch to speedtest-cli (#157)
This pull request removes the custom speed_test code in favour of the speedtest-cli module.
This is necessary to ensure that random warnings / print-outs do not mess with our outputs.
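A hedged sketch of the speedtest-cli module's API (package `speedtest-cli`, module `speedtest`); the methods return throughput in bits per second:

```python
# Illustrative: measure network throughput with the speedtest module.
import speedtest

st = speedtest.Speedtest()
st.get_best_server()
down_bps, up_bps = st.download(), st.upload()
print(f"down: {down_bps / 1e6:.1f} Mbit/s, up: {up_bps / 1e6:.1f} Mbit/s")
```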

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
2 years ago
justheuristic b04982c1a2
Bump transformers to 4.25.1 (#151)
- latest accelerate, transformers, huggingface_hub
- rearrange attention caches to support https://github.com/huggingface/transformers/pull/18344
- remove unused code
- fix edge case where session crashes when receiving seq length 0
- assert transformer version when importing WrappedBloomBlock

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
2 years ago