petals

Commit Graph

Author	SHA1	Message	Date
Artem Chumachenko	d6f4f80f3f	Fix Mixtral-related issues (#570) This PR fixes problems related to #569: - block initialization - throughput calculation and cache usage - mixtral in tests Beam search is removed for Mixtral and Llama for now. Those models use DynamicCache, which requires special function to change: (see https://github.com/huggingface/transformers/blob/main/src/transformers/cache_utils.py#L161) --------- Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>	2 months ago
Denis Mazur	0d91bbdac3	Bump transformers and accelerate versions (#554) Bump versions for transformers and accelerate, remove falcon-rw-1b CI tests	4 months ago
justheuristic	c08d09c4d3	Rewrite MemoryCache alloc_timeout logic (#434) - rpc_inference: server will now accept allocation timeout from user, defaults to no timeout - bugfix: inference timeout is now measured from the moment the request is received - previously, you would have to wait for your timeout plus the time it takes to sort through the queue (other users' timeout) - now, you get AllocationFailed if you had to wait for over (timeout) seconds - regardless of other users - a request for inference with no timeout will now fail instantly if there is not enough memory available - dtype number of bytes is now correctly determined for int, bool & other types --------- Co-authored-by: Your Name <you@example.com> Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com> Co-authored-by: Aleksandr Borzunov <hxrussia@gmail.com>	9 months ago
Artem Chumachenko	a14ae7334d	Update peft to 0.5.0 version (#475) Update peft to 0.5.0	9 months ago
Alexander Borzunov	057a2fb5de	Support Llama 2 (#379)	11 months ago
Alexander Borzunov	3218534745	Fix --token arg (#378)	11 months ago
justheuristic	398a384075	Inherit bitsandbytes compute dtype correctly (override peft quirk) (#377)	11 months ago
Alexander Borzunov	c735dd7ba3	Update transformers to 4.31.0 and peft to 0.4.0 (#371)	11 months ago
justheuristic	37fdcb3fe0	Switch adapters slightly faster (#353) Currently, each `TransformerBackend.inference_step` looks for adapters and sets the correct adapter type for each block. This is not very expensive, but it can measurably affect inference time. This pull request uses faster adapter switching with just one variable assignment, without iterating over block.modules().	11 months ago
Alexander Borzunov	9703358df0	Fix bugs in _choose_num_blocks() added in #346 (#354)	11 months ago
Alexander Borzunov	e12d4c666b	Spam less in server logs (#350)	11 months ago
justheuristic	010857a834	Estimate adapter memory overhead in choose_num_blocks() (#346) * estimate adapter memory overhead * reduce number of heads based on that --------- Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>	11 months ago
Artem Chumachenko	b9f0a5467f	Support peft LoRA adapters (#335) Implement an option to deploy PEFT adapters to a server. Clients can set active_adapter=... to use these adapters. --------- Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com> Co-authored-by: justheuristic <justheuristic@gmail.com>	11 months ago

13 Commits (e268c99a6b53eb4ab30ac208c8bc149ba374a013)
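
The timeout semantics described in commit c08d09c4d3 (#434) can be illustrated with a minimal sketch. This is not the actual MemoryCache code: the class `MemoryBudget`, its methods, and its thread-based locking are hypothetical simplifications chosen for illustration. Only the exception name `AllocationFailed` and the described behavior (the deadline is measured from the moment the request is received, and a request with no timeout fails immediately when memory is short) come from the commit message.

```python
import threading
import time
from typing import Optional


class AllocationFailed(Exception):
    """Raised when memory cannot be reserved within the caller's own timeout."""


class MemoryBudget:
    """Hypothetical, simplified stand-in for a server-side cache allocator."""

    def __init__(self, max_size_bytes: int):
        self.max_size_bytes = max_size_bytes
        self.used_bytes = 0
        self._cond = threading.Condition()

    def allocate(self, size_bytes: int, timeout: Optional[float] = None) -> None:
        # The deadline is fixed when the request arrives, so time spent waiting
        # behind other callers counts against this caller's timeout only.
        # No timeout (None) means the deadline is "now": fail instantly if full.
        deadline = time.monotonic() + (timeout or 0.0)
        with self._cond:
            while self.used_bytes + size_bytes > self.max_size_bytes:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    raise AllocationFailed(
                        f"could not reserve {size_bytes} bytes within {timeout} seconds"
                    )
                # Sleep until memory is freed or this caller's deadline passes.
                self._cond.wait(remaining)
            self.used_bytes += size_bytes

    def free(self, size_bytes: int) -> None:
        with self._cond:
            self.used_bytes -= size_bytes
            # Wake all waiters so each can re-check against its own deadline.
            self._cond.notify_all()
```

In this sketch, a caller that passes `timeout=5.0` gets `AllocationFailed` after about five seconds of waiting regardless of how long other callers are willing to wait, while a caller that passes no timeout either succeeds at once or fails at once.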