Commit Graph

423 Commits (measurements)

Author SHA1 Message Date
Your Name e9e506711e debug version 10 months ago
Your Name 7dc1aa5151 seconds 10 months ago
Alexander Borzunov c735dd7ba3
Update transformers to 4.31.0 and peft to 0.4.0 (#371) 11 months ago
justheuristic 1ab35c2826
Typo in inference_session.py 11 months ago
Alexander Borzunov a6fdfc0556
Fix AssertionError on rebalancing (#370) 11 months ago
Alexander Borzunov f97582fb5f
Require transformers < 4.31.0 until we're compatible (#369) 11 months ago
Alexander Borzunov 3b300c32e4
Update readme to show new models (#365) 11 months ago
Alexander Borzunov 62d9ed5ce7
Implement shortest-path routing for inference (#362)
This PR:

1. **Adds shortest-path routing for inference.** We build a graph with client-server and server-server latencies and compute costs, as well as empirically measured overheads. For client-server latencies, we ping possible first and last servers in a sequence in `SequenceManager.update()`. We penalize servers that may not have enough cache for our request. This uses the info added to the DHT in #355, #356, #358 (see the routing sketch after this entry).

2. **Makes a server ping neighboring servers in addition to next ones.** This is to get an opportunity to change the server even before we use all its blocks (e.g., because a neighboring server is faster). This feature is not enabled though, since it increases graph size for N servers to O(N^2) - but we may enable it if needed.

3. **Fixes a `SequenceManager` bug with the first `update()`.** Previously, this update was likely to produce incorrect information and cause `MissingBlocksError`s until the next update happened.
11 months ago
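
A minimal sketch of the shortest-path idea from #362, assuming hypothetical `spans` and `client_rtt` inputs (names are illustrative, not the actual Petals API; server-to-server RTTs between consecutive spans are omitted for brevity):

```python
import heapq

def find_route_cost(num_blocks, spans, client_rtt):
    """Cost of the fastest chain of servers covering blocks [0, num_blocks).

    Nodes are block boundaries 0..num_blocks, edges are server spans weighted
    by estimated latency + compute time; Dijkstra finds the cheapest path.

    spans:      list of (server_id, start_block, end_block, span_cost_seconds)
    client_rtt: dict server_id -> client<->server RTT, charged only on the
                first and last hop of the sequence
    """
    best = [float("inf")] * (num_blocks + 1)
    best[0] = 0.0
    heap = [(0.0, 0)]
    while heap:
        cost, block = heapq.heappop(heap)
        if block == num_blocks:
            return cost
        if cost > best[block]:
            continue  # stale heap entry
        for server_id, start, end, span_cost in spans:
            if start != block:
                continue
            hop = span_cost
            if block == 0 or end == num_blocks:
                hop += client_rtt.get(server_id, 0.0)
            if cost + hop < best[end]:
                best[end] = cost + hop
                heapq.heappush(heap, (best[end], end))
    return float("inf")  # no chain covers all blocks
```
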
Ikko Eltociear Ashimine fd30f7ce10
Fix typo in generation_algorithms.py (#364) 11 months ago
Alexander Borzunov 11f0d992d7
Report inference, forward, and network RPS separately (#358)
Inference RPS may be very different from forward RPS. E.g., bnb currently uses a completely different algorithm for NF4 inference. We report detailed RPS info that can then be used for shortest-path routing for inference.
11 months ago
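
Conceptually, the detailed throughput info might look like the following sketch (field names are illustrative, not the actual DHT schema):

```python
from dataclasses import dataclass

@dataclass
class ServerThroughputInfo:
    inference_rps: float  # autoregressive steps/sec (e.g., bnb's NF4 kernels)
    forward_rps: float    # batched forward-pass tokens/sec
    network_rps: float    # tokens/sec achievable over the network alone

    @property
    def effective_forward_rps(self) -> float:
        # A pipeline is bounded by the slower of compute and network
        return min(self.forward_rps, self.network_rps)
```
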
Alexander Borzunov 9517dd1e3d
Update readme and "Getting started" link (#360)
This updates readme with the latest updates and fixes an old Colab link, as pointed out in #359.
11 months ago
Alexander Borzunov 3f733a96e3
Use bitsandbytes 0.40.1.post1 (#357) 11 months ago
Alexander Borzunov 81c4a45ca2
Make a server ping next servers (#356)
This PR makes a server ping potential next servers in a chain and report the RTTs to DHT. This will be used for shortest-path routing.
11 months ago
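
A minimal sketch of the pinging step, assuming a hypothetical `ping_fn` coroutine that performs one request-response with a server:

```python
import asyncio, time

async def measure_rtts(next_servers, ping_fn, timeout=5.0):
    async def ping_one(server_id):
        start = time.perf_counter()
        try:
            await asyncio.wait_for(ping_fn(server_id), timeout)
            return server_id, time.perf_counter() - start
        except asyncio.TimeoutError:
            return server_id, float("inf")  # unreachable servers get infinite RTT

    # Ping all candidates concurrently; the resulting dict can then be
    # announced to the DHT next to the rest of the server's info
    results = await asyncio.gather(*(ping_one(s) for s in next_servers))
    return dict(results)
```
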
Alexander Borzunov 2c8959e713
Share more info about a server in DHT (#355) 11 months ago
justheuristic 37fdcb3fe0
Switch adapters slightly faster (#353)
Currently, each `TransformerBackend.inference_step` looks for adapters and sets the correct adapter type for each block. This is not very expensive, but it can measurably affect inference time.

This pull request uses faster adapter switching with just one variable assignment, without iterating over block.modules().
11 months ago
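
A before/after sketch of the change described above (simplified; class and method names only loosely follow the real `TransformerBackend`):

```python
# Before (simplified): every inference step walked all submodules
def set_adapter_slow(block, adapter_name):
    for module in block.modules():  # measurable overhead on each step
        if hasattr(module, "active_adapter"):
            module.active_adapter = adapter_name

# After (simplified): the backend tracks the active adapter itself,
# so switching is a single variable assignment
class TransformerBackendSketch:
    def __init__(self, block):
        self.block = block
        self.active_adapter = None

    def inference_step(self, hidden_states, adapter_name):
        if adapter_name != self.active_adapter:
            self.active_adapter = adapter_name  # one assignment, no iteration
        return self.block(hidden_states)
```
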
Alexander Borzunov 9703358df0
Fix bugs in _choose_num_blocks() added in #346 (#354) 11 months ago
Alexander Borzunov 1a78638c02
Test that bitsandbytes is not imported when it's not used (#351)
We avoid importing bitsandbytes when it's not used, since bitsandbytes doesn't always find correct CUDA libs and may raise exceptions because of that.
11 months ago
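
Such a test can be a one-liner run in a fresh interpreter, roughly like this sketch:

```python
import subprocess, sys

def test_bitsandbytes_not_imported_by_default():
    # Use a fresh interpreter so that modules already imported by the test
    # runner cannot mask a regression
    code = "import petals, sys; assert 'bitsandbytes' not in sys.modules"
    subprocess.check_call([sys.executable, "-c", code])
```
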
justheuristic c511990236
Remove unused import os (#352) 11 months ago
Alexander Borzunov e12d4c666b
Spam less in server logs (#350) 11 months ago
justheuristic 010857a834
Estimate adapter memory overhead in choose_num_blocks() (#346)
* estimate adapter memory overhead
* reduce number of heads based on that

---------

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
11 months ago
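
The estimate boils down to simple arithmetic, sketched here with made-up names (all sizes in bytes):

```python
def choose_num_blocks(free_memory, block_nbytes, adapter_nbytes_per_block):
    # Each hosted block now costs its own weights plus the adapter weights
    # loaded for it, so fewer blocks may fit on the same GPU
    return free_memory // (block_nbytes + adapter_nbytes_per_block)
```
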
Alexander Borzunov f605f093f7
Support LLaMA repos without "-hf" suffix (#349) 11 months ago
Alexander Borzunov 90fbaab61e
Fix Docker build by avoiding Python 3.11 (#348)
We want to use `3.10.x` since `grpcio-tools` is not compatible with 3.11 yet. However, `python~=3.10` meant `python>=3.10, python<4.0`, so we ended up with a broken build due to Python 3.11 being installed.
11 months ago
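
The version-specifier semantics behind the bug, demonstrated with the `packaging` library (PEP 440 compatible-release rules):

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

assert Version("3.11.0") in SpecifierSet("~=3.10")        # ~=3.10 means >=3.10, <4.0
assert Version("3.11.0") not in SpecifierSet("~=3.10.0")  # ~=3.10.0 means >=3.10.0, <3.11
```
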
Alexander Borzunov 43acfe52a7
Import petals.utils.peft only when needed to avoid unnecessary import of bitsandbytes (#345)
The motivation is the same as in #180.
11 months ago
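
The pattern is a deferred (function-local) import, sketched below; the helper name and body are illustrative:

```python
def maybe_load_adapters(active_adapters):
    if not active_adapters:
        return None  # fast path: bitsandbytes is never imported
    # Deferred import: petals.utils.peft (which transitively imports
    # bitsandbytes) is pulled in only when adapters are actually used
    import petals.utils.peft as peft_utils
    ...  # load `active_adapters` via peft_utils (details omitted)
```
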
Alexander Borzunov 294970fe18
Update Colab link 11 months ago
Alexander Borzunov 515a5120cb
Mention LLaMA in readme (#344) 11 months ago
Max Ryabinin 13f4e3a88a
Fix convergence issues and switch to LLaMA in the SST-2 example (#343)
* Fix convergence issues and switch to LLaMA in the SST-2 example
11 months ago
Artem Chumachenko b9f0a5467f
Support peft LoRA adapters (#335)
Implement an option to deploy PEFT adapters to a server. Clients can set active_adapter=... to use these adapters.

---------

Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: justheuristic <justheuristic@gmail.com>
11 months ago
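
A hypothetical client-side usage, following the description above (the adapter repo name is an example, not a real one):

```python
from petals import AutoDistributedModelForCausalLM

model = AutoDistributedModelForCausalLM.from_pretrained(
    "bigscience/bloom", active_adapter="username/my-lora-adapter"
)
# Servers hosting this adapter apply it during forward/backward passes
```
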
Alexander Borzunov dfc6578c8e
Use bitsandbytes 0.40.0.post4 with bias hotfix (#342)
This PR includes a bnb hotfix: 90b0ac57b0
11 months ago
Alexander Borzunov b28f5016ea
Delete deprecated petals.cli scripts (#336) 11 months ago
Alexander Borzunov fa095f6461
Use 4-bit for llama by default, use bitsandbytes 0.40.0.post3 (#340)
NF4 inference with bitsandbytes 0.40.0.post3 is ~2x faster than int8 inference, though training is still ~3x slower, see:

- [bitsandbytes 0.40.0 Release notes](https://github.com/TimDettmers/bitsandbytes/releases/tag/0.40.0)
- [RPS benchmarks](https://github.com/bigscience-workshop/petals/pull/333#issuecomment-1614040385)

We've decided to use NF4 by default for LLaMA.
11 months ago
Alexander Borzunov 158013a671
Implement direct server-to-server communication (#331)
Implement #226.
11 months ago
Alexander Borzunov 4d9c26fe5c
Allow free_disk_space_for() remove arbitrary files from Petals cache (#339)
Before this PR, `free_disk_space_for()` was able to remove **(a)** only entire cached revisions (= git commits/branches) and **(b)** only from the repository we're loading right now.

This PR allows this function to remove arbitrary individual files from any repository.

This is useful for the transition to Petals 1.2.0+, since it now uses original repos instead of the ones with converted models (see #323). In particular, the cache for `bigscience/bloom-petals` is now deprecated and should be removed in favor of `bigscience/bloom`. This is also useful as a way to free space before loading LoRA adapters (#335).
11 months ago
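
A sketch of the generalized eviction, assuming a simple LRU-by-access-time policy over the whole cache directory (not the exact implementation):

```python
from pathlib import Path

def free_disk_space_for(needed_bytes: int, cache_dir: str = "~/.cache/petals"):
    files = [p for p in Path(cache_dir).expanduser().rglob("*") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_atime)  # least recently used first
    freed = 0
    for path in files:
        if freed >= needed_bytes:
            break
        freed += path.stat().st_size
        path.unlink()  # any file from any repo may be evicted now
```
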
Alexander Borzunov de930918a0
Support loading blocks in 4-bit (QLoRA NF4 format, disabled by default) (#333) 11 months ago
Alexander Borzunov 66a47c763e
Require pydantic < 2.0 (2.0 is incompatible with hivemind 1.1.8) (#337)
See https://github.com/learning-at-home/hivemind/pull/573.
11 months ago
Alexander Borzunov 10c72acdf4
Fix warmup steps and minor issues in benchmarks (#334)
The previous code was incorrect for the case of `warmup_steps != 1` (this mode was never used, but may be used in the future).
11 months ago
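
A simplified reconstruction of the corrected timing logic (not the exact benchmark code):

```python
import time

def benchmark(step_fn, n_steps=100, warmup_steps=10):
    assert 0 <= warmup_steps < n_steps
    for i in range(n_steps):
        if i == warmup_steps:
            start = time.perf_counter()  # restart the clock after warmup
        step_fn()
    elapsed = time.perf_counter() - start
    return (n_steps - warmup_steps) / elapsed  # steady-state steps per second
```
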
Alexander Borzunov d126ee3053
Add benchmark scripts (#319)
This PR:

- Adds benchmark scripts for inference, forward pass, and full training step (e.g. used for experiments in our paper).
- Fixes bug with dtypes in `petals.DistributedBloomForSequenceClassification`.
- (minor refactor) Moves `DTYPE_MAP` to `petals.constants` as a useful constant.
11 months ago
Alexander Borzunov fecee8c4dc
Show license links when loading models (#332) 11 months ago
Alexander Borzunov 47a2b1ee65
Fix llama's lm_head.weight.requires_grad (#330)
By default, LLaMA's `lm_head.weight.requires_grad` was `True`, but we expect it to be `False`.
11 months ago
Alexander Borzunov 7a37513f77
Add AutoDistributed{Model, ModelForCausalLM, ModelForSequenceClassification} (#329)
This PR adds `petals.AutoDistributed{Model, ModelForCausalLM, ModelForSequenceClassification}` classes, similar to their `transformers.Auto{Model, ModelForCausalLM, ModelForSequenceClassification}` counterparts.
11 months ago
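
Usage mirrors the `transformers` auto classes, e.g. (the model name is just an example):

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "bigscience/bloom"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
```
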
Alexander Borzunov cb3f018f9f
Add LLaMA support (#323)
This PR:

1. **Abolishes the model conversion procedure.** Now, models are downloaded directly from original repositories like https://huggingface.co/bigscience/bloom. Servers download only shards with blocks to be hosted, and clients download only shards with input/output embeddings and layernorms.

    - BLOOM is loaded from `bigscience/bloom`, but we use the DHT prefix `bigscience/bloom-petals` for backward compatibility. Same with smaller BLOOMs and BLOOMZ.
    - LLaMA can be loaded from any repo like `username/llama-65b-hf`, but we use the DHT prefix `llama-65b-hf` (without the username) to accommodate blocks from different repos (there are a few of them with minor differences, such as `Llama` vs. `LLaMA` in the class name).

2. **Refactors the client to generalize it for multiple models.** Now, model-specific code lives in `petals.models` subpackages (e.g. `petals.models.bloom`, `petals.models.llama`), while general code (e.g. the CPU-efficient LM head, p-tuning) is kept in `petals.client`.

3. **Introduces** `WrappedLlamaBlock`, `DistributedLlamaConfig`, `DistributedLlamaForCausalLM`, `DistributedLlamaForSequenceClassification`, and `DistributedLlamaModel` compatible with Petals functionality (p-tuning, adapters, etc.).

4. **Introduces** `AutoDistributedConfig` that automatically chooses the correct config class (`DistributedLlamaConfig` or `DistributedBloomConfig`). The refactored configs contain all model-specific info for both clients and servers.

Upgrade instructions:

- Remove disk caches for blocks in the old (converted) format to save disk space. That is, remove the `~/.cache/petals/model--bigscience--bloom-petals` and `~/.cache/petals/model--bigscience--bloomz-petals` directories (if present).
11 months ago
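
A sketch of the DHT-prefix rule described in item 1 (illustrative, not the exact Petals code):

```python
def dht_prefix_for(repo_id: str) -> str:
    name = repo_id.split("/")[-1]   # drop the username so that equivalent
                                    # repos share one swarm of blocks
    if name.startswith("bloom"):
        name += "-petals"           # backward compatibility with old prefixes
    return name

assert dht_prefix_for("username/llama-65b-hf") == "llama-65b-hf"
assert dht_prefix_for("bigscience/bloom") == "bloom-petals"
```
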
Max Ryabinin 5c0733711a
Use number of tokens for attn_cache_size (#286)
* Use number of tokens for attn_cache_size

* Fix cache_bytes_per_block

* Rename attn_cache_size to attn_cache_tokens
12 months ago
Max Ryabinin c839173e57
Determine block dtype in a unified manner (#325)
* Extract backend_dtype, remove duplicate DTYPE_MAP

* Use bfloat16 as the default dtype, resolve dtype in load_pretrained_block
12 months ago
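
A sketch of the unified rule (names are illustrative): an explicitly requested dtype wins, while "auto" falls back to the model config and finally to bfloat16:

```python
import torch

DTYPE_MAP = {"bfloat16": torch.bfloat16, "float16": torch.float16,
             "float32": torch.float32, "auto": "auto"}

def resolve_block_dtype(config, dtype):
    if dtype not in ("auto", None):
        return dtype  # an explicitly requested dtype wins
    config_dtype = getattr(config, "torch_dtype", None)
    if isinstance(config_dtype, torch.dtype):
        return config_dtype
    return torch.bfloat16  # the default
```
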
Max Ryabinin 3e7ae5116d
Remove unused imports and attributes (#324)
* Remove unused imports and attributes
12 months ago
Alexander Borzunov 675bacb592
Bump version to 1.1.5 (#312) 1 year ago
Alexander Borzunov e026952338
Abort speedtest if it runs too long (#316)
Addresses #192 and, specifically, #280.
1 year ago
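
A sketch of the abort-on-timeout idea using a child process (the timeout value and fallback behavior are illustrative):

```python
import multiprocessing as mp

def run_speedtest_with_timeout(speedtest_fn, timeout=90.0):
    with mp.Pool(1) as pool:
        result = pool.apply_async(speedtest_fn)
        try:
            return result.get(timeout)
        except mp.TimeoutError:
            return None  # caller falls back to a default throughput estimate
```
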
Alexander Borzunov 6eb306a605
Raise error for unexpected .generate() kwargs (#315)
Previously, if a user passed unexpected kwargs to `.generate()`, they were __ignored__, and the code continued working as if the argument was correctly supported. For example, people often tried passing `repetition_penalty` and didn't notice that it had no effect. This PR fixes this problem.
1 year ago
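
The check can be as simple as this sketch (the allowed set is illustrative):

```python
ALLOWED_KWARGS = {"max_length", "max_new_tokens", "do_sample",
                  "temperature", "top_k", "top_p"}

def validate_generate_kwargs(kwargs: dict) -> None:
    unexpected = set(kwargs) - ALLOWED_KWARGS
    if unexpected:
        # Fail loudly instead of silently ignoring, e.g., repetition_penalty
        raise ValueError(f"Unexpected .generate() kwargs: {unexpected}")
```
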
Alexander Borzunov d9e7bfc949
Divide compute throughput by average no. of used blocks (#314)
See #192.
1 year ago
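
Roughly: a server offering 1000 tokens/sec per block pass whose clients traverse 20 of its blocks on average contributes about 1000 / 20 = 50 tokens/sec of end-to-end throughput, since each request occupies it for 20 block computations.
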
Alexander Borzunov 6137b1b4b0
Replace .make_sequence(..., mode="random") with mode="max_throughput" (#313)
We need to sample the next server using its throughput as the weight to actually achieve max throughput for fine-tuning.

As an example, imagine a situation where we have 3 servers with throughputs [1000, 500, 1] hosting the same blocks, then compare the uniform and weighted sampling strategies.
1 year ago
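
The weighted strategy in miniature, using the numbers from the example above:

```python
import random

servers = ["A", "B", "C"]
throughputs = [1000, 500, 1]  # requests/sec

# Uniform sampling would send a third of all requests to the nearly-dead
# server "C"; weighted sampling routes load in proportion to capacity
choice = random.choices(servers, weights=throughputs, k=1)[0]
```
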
Alexander Borzunov 0a313bf6c5
Update hivemind to 1.1.8, enable efficient bfloat16 encoding (#311)
This PR:

1. Updates hivemind to 1.1.8 (includes https://github.com/learning-at-home/hivemind/pull/565)
2. Enables efficient bfloat16 serialization by default (`USE_LEGACY_BFLOAT16 = False`)
3. Removes logging code that was moved into hivemind in https://github.com/learning-at-home/hivemind/pull/542
1 year ago
Alexander Borzunov 8f6342a861
Refactor RemoteSequenceManager (#309)
This PR:

1. **Extracts `SequenceManagerConfig` and `SequenceManagerState` subclasses.**

    The config is provided by the caller and never changed from inside `RemoteSequenceManager`. The state is the part of `RemoteSequenceManager`'s state that is shared between the main manager and its slices. We fix some slicing bugs along the way.

2. **Removes `dht_prefix` and `p2p` arguments, makes `dht` argument optional.**

    `dht_prefix` can always be overridden using `config.dht_prefix`. `p2p` is actually needed only under the hood of `RemoteSequenceManager`, so it can create it by itself without exposing this low-level object to callers. If strictly necessary, a caller can provide `p2p` as a part of `SequenceManagerState`. `dht` is also needed only by `RemoteSequenceManager`, so we can make it optional in the parent classes and create it automatically when it's not provided.

3. **Simplifies retry logic.**

    Previously, we could have "nested" retry loops: one in `._update()`, another in inference/forward/backward steps. The loop in `._update()` could introduce issues to concurrent inference/forward/backward calls, since it blocks the entire class if its delay period becomes too high. Now this logic is simplified: `._update()` performs only one attempt to fetch the DHT info, any retries are triggered by the inference/forward/backward steps.

4. **Removes deprecated `RemoteTransformerBlock`.**

    `RemoteTransformerBlock` was deprecated a long time ago, before Petals 1.0.0. Its removal is long overdue.

5. **Removes `dht_utils.get_remote_module()`, `dht_utils.get_remote_sequence()`.**

    These functions duplicate the functionality of the `RemoteSequential` constructor.

6. (minor) **Removes `RemoteSequential.is_subsequence` flag.**

    This flag worked incorrectly and was never used. I am removing it for the sake of simplicity.
1 year ago
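
A sketch of the extracted classes from item 1 (field names are illustrative, not the real attribute set):

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class SequenceManagerConfig:
    # Provided by the caller; never mutated inside RemoteSequenceManager
    dht_prefix: Optional[str] = None
    request_timeout: float = 30.0

@dataclass
class SequenceManagerState:
    # Shared between the main manager and the slices it hands out
    p2p: Optional[Any] = None  # created internally when not supplied
    banned_peers: set = field(default_factory=set)
```
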