petals

Commit Graph

Author	SHA1	Message	Date
Artem Chumachenko	d6f4f80f3f	Fix Mixtral-related issues (#570 ) This PR fixes problems related to #569: - block initialization - throughput calculation and cache usage - mixtral in tests Beam search is removed for Mixtral and Llama for now. Those models use DynamicCache, which requires special function to change: (see https://github.com/huggingface/transformers/blob/main/src/transformers/cache_utils.py#L161) --------- Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>	1 month ago
Denis Mazur	0d91bbdac3	Bump transformers and accelerate versions (#554 ) Bump versions for transformers and accelerate, remove falcon-rw-1b CI tests	3 months ago
Max Ryabinin	03cbe90234	Optimize LLaMA for inference (#513 ) * Optimize LLaMa for inference * Fix model type detection in tests	6 months ago
FYY	a2484b3053	Fix file locks in NFS-mounted directories (#517 ) Fix #515.	8 months ago
Alexander Borzunov	5ce4f1a159	Store (start_block, end_block) in each DHT record for reliability (#510 ) This PR fixes gaps in the DHT server info caused by unavailable DHT keys. Now, one DHT key is enough to get info about all blocks hosted by a server - so we'll see info until all keys are unavailable. Also, this PR refactors `petals.client.routing` and `petals.server.block_selection` modules to use the common `compute_spans()` function (defined in `petals.utils.dht`) and `RemoteSpanInfo` class (defined in `petals.data_structures`).	8 months ago
Alexander Borzunov	dd4a3230bc	Add Falcon support (#499 ) This PR adds: - Support for models based on `transformers.FalconModel` (the in-library format for Falcon). Tested on Falcon-40B. - CI tests for Falcon-RW-1B. - `--throughput dry_run` option to evaluate throughput and exit right away (implemented by @mryab). Limitations: - Backward pass support is broken for now, will be fixed in #500. Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>	8 months ago
justheuristic	c08d09c4d3	Rewrite MemoryCache alloc_timeout logic (#434 ) - rpc_inference: server will now accept allocation timeout from user, defaults to no timeout - bugfix: inference timeout is now measured from the moment the request is received - previously, you would have to wait for your timeout plus the time it takes to sort through the queue (other users' timeout) - now, you get AllocationFailed if you had to wait for over (timeout) seconds - regardless of other users - a request for inference with no timeout will now fail instantly if there is not enough memory available - dtype number of bytes is now correctly determined for int, bool & other types --------- Co-authored-by: Your Name <you@example.com> Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com> Co-authored-by: Aleksandr Borzunov <hxrussia@gmail.com>	9 months ago
Artem Chumachenko	a14ae7334d	Update peft to 0.5.0 version (#475 ) Update peft to 0.5.0	9 months ago
Alexander Borzunov	de2475f31c	Make client compatible with transformers' GenerationMixin (#464 ) This PR drops the custom generation code and introduces compatibility with `transformers.GenerationMixin` instead. This includes support for more sampling options (`top_p`, `top_k`, `repetition_penalty` requested in #460) and beam search - all that is now identical to running the model with transformers locally. Most features (excluding beam search and other rarely used stuff) are also compatible with resuming existing sessions. ### Breaking changes If `.generate()` or forward passes are being run inside an `.inference_session()` context, they now use the opened session by default. So, these snippets are now equivalent: ```python # Using default session with model.inference_session(max_length=100): output_ids = model.generate(input_ids, max_new_tokens=3) # Explicitly specifying a session with model.inference_session(max_length=100) as sess: output_ids = model.generate(input_ids, max_new_tokens=3, session=sess) ``` Earlier, the 1st snippet was creating a new session, which is not what most people expected (= such code was most likely to introduce a bug, which is now fixed).	9 months ago
Alexander Borzunov	063e94b4c8	Move SequenceManagerConfig -> ClientConfig, petals.dht_utils -> petals.utils.dht (#463 )	9 months ago
Artem Chumachenko	568f21dc3b	Add customizable input tensors (#445 )	9 months ago
Alexander Borzunov	a1f7791d5e	Fix petals.utils.ping for servers with client-mode DHT (#430 ) Fix #429.	9 months ago
Alexander Borzunov	8666653cf5	Fix routing through relay, default network RPS, --token, logging, readme (#399 ) * Hide GeneratorExit in _iterate_inference_steps() * Update README.md about `--public_name` * Use .from_pretrained(..., use_auth_token=token) instead of token=token until it's fully supported across HF libs * Use default network speed 25 Mbit/s * Apply relay penalty in max-throughput routing * Replace RPS with "tokens/sec per block" in logs * Increase default expiration	10 months ago
Alexander Borzunov	057a2fb5de	Support Llama 2 (#379 )	10 months ago
Alexander Borzunov	3218534745	Fix --token arg (#378 )	10 months ago
justheuristic	398a384075	Inherit bitsandbytes compute dtype correctly (override peft quirk) (#377 )	10 months ago
Alexander Borzunov	c735dd7ba3	Update transformers to 4.31.0 and peft to 0.4.0 (#371 )	10 months ago
Alexander Borzunov	62d9ed5ce7	Implement shortest-path routing for inference (#362 ) This PR: 1. Adds shortest path routing for inference. We build a graph with client-server and server-server latencies and compute costs, as well as empirically measured overheads. For client-server latencies, we ping possible first and last servers in a sequence in `SequenceManager.update()`. We penalize servers that may not have enough cache for our request. This uses info added to DHT in #355, #356, #358. 2. Makes a server ping neighboring servers in addition to next ones. This is to get an opportunity to change the server even before we use all its blocks (e.g., because a neighboring server is faster). This feature is not enabled though, since it increases graph size for N servers to O(N^2) - but we may enable it if needed. 3. Fixes a `SequenceManager` bug with the first `update()`. Previously, this update was likely to produce incorrect information and cause `MissingBlocksErrors` until the next update happened.	10 months ago
Ikko Eltociear Ashimine	fd30f7ce10	Fix typo in generation_algorithms.py (#364 )	10 months ago
Alexander Borzunov	11f0d992d7	Report inference, forward, and network RPS separately (#358 ) Inference RPS may be very different from forward RPS. E.g., currently bnb uses a completely different algorithm for NF4 inference. We report detailed RPS info that can be then used for shortest-path routing for inference.	10 months ago
Alexander Borzunov	81c4a45ca2	Make a server ping next servers (#356 ) This PR makes a server ping potential next servers in a chain and report the RTTs to DHT. This will be used for shortest-path routing.	10 months ago
Alexander Borzunov	2c8959e713	Share more info about a server in DHT (#355 )	10 months ago
justheuristic	37fdcb3fe0	Switch adapters slightly faster (#353 ) Currently, each `TransformerBackend.inference_step` looks for adapters and sets the correct adapter type for each block. This is not very expensive, but it can measurably affect inference time. This pull request uses faster adapter switching with just one variable assignment, without iterating over block.modules().	10 months ago
Alexander Borzunov	9703358df0	Fix bugs in _choose_num_blocks() added in #346 (#354 )	10 months ago
justheuristic	c511990236	Remove unused import os (#352 )	10 months ago
Alexander Borzunov	e12d4c666b	Spam less in server logs (#350 )	10 months ago
justheuristic	010857a834	Estimate adapter memory overhead in choose_num_blocks() (#346 ) * estimate adapter memory overhead * reduce number of heads based on that --------- Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>	10 months ago
Alexander Borzunov	43acfe52a7	Import petals.utils.peft only when needed to avoid unnecessary import of bitsandbytes (#345 ) The motivation is the same as in #180.	10 months ago
Artem Chumachenko	b9f0a5467f	Support peft LoRA adapters (#335 ) Implement an option to deploy PEFT adapters to a server. Clients can set active_adapter=... to use these adapters. --------- Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com> Co-authored-by: justheuristic <justheuristic@gmail.com>	10 months ago
Alexander Borzunov	4d9c26fe5c	Allow free_disk_space_for() remove arbitrary files from Petals cache (#339 ) Before this PR, `free_disk_space_for()` was able to remove (a) only entire cached revisions (= git commits/branches) and (b) only from the repository we're loading right now. This PR allows this function to remove arbitrary files separately from any repositories. This is useful for transition to Petals 1.2.0+, since it now uses original repos instead of the ones with converted models (see #323). In particular, the cache for `bigscience/bloom-petals` is now deprecated and should be removed in favor of `bigscience/bloom`. This is also useful as a way to free space before loading LoRA adapters (#335).	10 months ago
Alexander Borzunov	de930918a0	Support loading blocks in 4-bit (QLoRA NF4 format, disabled by default) (#333 )	10 months ago
Alexander Borzunov	7a37513f77	Add AutoDistributed{Model, ModelForCausalLM, ModelForSequenceClassification} (#329 ) This PR adds `petals.AutoDistributed{Model, ModelForCausalLM, ModelForSequenceClassification}` classes, similar to their `transformers.Auto{Model, ModelForCausalLM, ModelForSequenceClassification}` counterparts.	11 months ago
Alexander Borzunov	cb3f018f9f	Add LLaMA support (#323 ) This PR: 1. Abolishes the model conversion procedure. Now, models are downloaded directly from original repositories like https://huggingface.co/bigscience/bloom. Servers download only shards with blocks to be hosted, and clients download only shards with input/output embeddings and layernorms. - BLOOM is loaded from `bigscience/bloom`, but we use the DHT prefix `bigscience/bloom-petals` for backward compatibility. Same with smaller BLOOMs and BLOOMZ. - LLaMA can be loaded from any repo like `username/llama-65b-hf`, but we use the DHT prefix `llama-65b-hf` (without the username) to accommodate blocks from different repos (there're a few of them with minor differences, such as `Llama` vs. `LLaMA` in the class name). 2. Refactors the client to generalize it for multiple models. Now, we have `petals.models` packages that contain model-specific code (e.g. `petals.models.bloom`, `petals.models.llama`). General code (e.g. CPU-efficient LM head, p-tuning) is kept in `petals.client`. 3. Introduces `WrappedLlamaBlock`, `DistributedLlamaConfig`, `DistributedLlamaForCausalLM`, `DistributedLlamaForSequenceClassification`, and `DistributedLlamaModel` compatible with Petals functionality (p-tuning, adapters, etc.). 4. Introduces `AutoDistributedConfig` that automatically chooses the correct config class (`DistributedLlamaConfig` or `DistributedBloomConfig`). The refactored configs contain all model-specific info for both clients and servers. Upgrade instructions: - Remove disk caches for blocks in old (converted) format to save disk space. That is, remove `~/.cache/petals/model--bigscience--bloom-petals` and `~/.cache/petals/model--bigscience--bloomz-petals` directories (if present).	11 months ago
Max Ryabinin	3e7ae5116d	Remove unused imports and attributes (#324 ) * Remove unused imports and attributes	11 months ago
Alexander Borzunov	0a313bf6c5	Update hivemind to 1.1.8, enable efficient bfloat16 encoding (#311 ) This PR: 1. Updates hivemind to 1.1.8 (includes https://github.com/learning-at-home/hivemind/pull/565) 2. Enables efficient bfloat16 serialization by default (`USE_LEGACY_BFLOAT16 = False`) 3. Removes logging code that was included in hivemind in https://github.com/learning-at-home/hivemind/pull/542	1 year ago
Alexander Borzunov	892fa2386a	Remove CustomLinear8bitLt (#297 ) This became a part of https://github.com/TimDettmers/bitsandbytes/releases/tag/0.37.0.	1 year ago
Alexander Borzunov	fee19e9b9b	Use get_logger(__name__) instead of get_logger(__file__) (#265 )	1 year ago
Alexander Borzunov	6b12b0d050	Report server version and dht.client_mode in rpc_info(), check for updates on startup (#209 ) This PR: 1. Shows the current Petals version and checks for updates on startup. 2. Reports the current version and DHT mode in `rpc_info()`, so it can be shown on http://health.petals.ml or used on clients for efficient routing.	1 year ago
Alexander Borzunov	16b69d6050	Fix GiBs in the "insufficient disk space" message (#187 )	1 year ago
Alexander Borzunov	6dd9a938bd	Import bitsandbytes only if it's going to be used (#180 )	1 year ago
justheuristic	ae9e71fe8e	Add local tensor-parallel fwd/bwd (#143 ) This pull request adds an option to run Petals server on multiple local GPUs. It uses https://github.com/BlackSamorez/tensor_parallel - 8bit approximation error same as in main (mean~=2% q0.9~=5%) - TP=1, 2, 3 (see screenshots above) - forward, grad w.r.t. input and inference exact match with main with TP=1 - `>=`80% GPU utilization with 3x 1080ti, batch = 8 tokens - throughput measured with and without TP - TP on 1080Tis has near-linear speedup comparable to the benchmarks (see first message) Co-authored-by: Iaroslav Lisniak <yalisnyak@nes.ru> Co-authored-by: Andrei Panferov <andrei@blacksamorez.ru> Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>	1 year ago
Alexander Borzunov	9997ada3bb	Shield alloc & free from cancellation (#163 ) A handler's RPC code may be cancelled due to a request timeout or a client closing the connection. Before this PR: - If `.cancel()` happens while waiting for `hivemind.utils.enter_asynchronously()`, the lock will never be released. - If `.cancel()` happens while doing that before freeing memory, the memory will never be freed. This PR fixes it by deferring the cancellation with [asyncio.shield()](https://docs.python.org/3/library/asyncio-task.html#asyncio.shield). Now, the cancellation will happen only when all locks are released and alloc/free has completed.	1 year ago
Alexander Borzunov	523a7cad33	Fix issues related to `petals` as a module (#159 ) 1. Added `from petals.client import *` to `petals/__init__.py`, so you can write just that: ```python from petals import DistributedBloomForCausalLM ``` I didn't do the same with server, since its classes are supposed to be used by `petals.cli.run_server`, not end-users. Though it's still possible to do `from petals.server.smth import smth` if necessary. 2. Fixed one more logging issue: log lines from hivemind were shown twice due to a bug in #156. 3. Removed unused `runtime.py`, since the server actually uses `hivemind.moe.Runtime`, and `runtime.py` has no significant changes compared to it.	1 year ago
Alexander Borzunov	668b736031	Fix logging: do not duplicate lines, enable colors in Colab (#156 )	1 year ago
Max Ryabinin	bd91be27ea	Add missing methods for SamplingAlgorithm, fix docstrings (#107 ) * Add missing methods for SamplingAlgorithm, fix docstrings * Add SamplingAlgorithm to _choose_sample_algorithm * Add test_sampling * Add a warning if sampling options were passed, but do_sample=False * Skip the sampling test for now Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>	1 year ago
Alexander Borzunov	701ec7e53e	Clean up disk space (#152 )	1 year ago
Alexander Borzunov	e99bf36647	Use common folder for all caches, make it a volume in Dockerfile (#141 )	1 year ago
Max Ryabinin	3ca8b4f082	Fix typos with codespell (#126 )	1 year ago
Alexander Borzunov	f72c220404	Suppress quantization warning and fix dtype defaults in compute benchmark (#117 )	1 year ago
justheuristic	9e11f73242	Fix tile size on ampere (#116 ) Fix tile size on ampere Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>	1 year ago
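
The "Shield alloc & free from cancellation (#163)" entry above describes deferring cancellation with `asyncio.shield()` so that cleanup always completes. The sketch below illustrates that general pattern using only the standard library. It is not the Petals implementation: `free_memory`, `handler`, and `run_and_cancel` are hypothetical names introduced here for illustration, and the sleep durations are arbitrary.

```python
import asyncio

freed = []  # records which cleanup steps ran to completion


async def free_memory(tag: str) -> None:
    """Hypothetical stand-in for a server's memory-freeing step."""
    await asyncio.sleep(0.05)  # pretend that freeing takes some time
    freed.append(tag)


async def handler(tag: str, use_shield: bool) -> None:
    """Hypothetical RPC handler that must free memory before it exits."""
    if not use_shield:
        await free_memory(tag)  # a cancellation here interrupts the cleanup
        return
    cleanup = asyncio.ensure_future(free_memory(tag))
    try:
        # shield() keeps the inner task running if this handler is cancelled
        await asyncio.shield(cleanup)
    except asyncio.CancelledError:
        await cleanup  # let the cleanup finish first ...
        raise          # ... and only then propagate the cancellation


async def run_and_cancel(tag: str, use_shield: bool) -> None:
    """Start a handler, cancel it mid-cleanup, and wait for it to stop."""
    task = asyncio.create_task(handler(tag, use_shield))
    await asyncio.sleep(0.01)  # give the handler time to start its cleanup
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass


async def main() -> list:
    await run_and_cancel("unshielded", use_shield=False)
    await run_and_cancel("shielded", use_shield=True)
    return list(freed)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In this sketch only the shielded handler's cleanup is recorded in `freed`. The unshielded handler is interrupted mid-cleanup, which corresponds to the "memory will never be freed" case the commit message describes.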


54 Commits (main)