Commit Graph

13 Commits (e268c99a6b53eb4ab30ac208c8bc149ba374a013)

Author SHA1 Message Date
Artem Chumachenko d6f4f80f3f
Fix Mixtral-related issues (#570)
This PR fixes problems related to #569:
- block initialization
- throughput calculation and cache usage
- mixtral in tests

Beam search is removed for Mixtral and Llama for now. Those models use DynamicCache, which requires a special function to change it (see https://github.com/huggingface/transformers/blob/main/src/transformers/cache_utils.py#L161).

---------

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
2 months ago
Denis Mazur 0d91bbdac3
Bump transformers and accelerate versions (#554)
Bump versions for transformers and accelerate, remove falcon-rw-1b CI tests
4 months ago
justheuristic c08d09c4d3
Rewrite MemoryCache alloc_timeout logic (#434)
- rpc_inference: the server now accepts an allocation timeout from the user; the default is no timeout
- bugfix: the inference timeout is now measured from the moment the request is received
    - previously, you had to wait for your own timeout plus the time it took to work through the queue (other users' timeouts)
    - now, you get AllocationFailed if you have waited for more than (timeout) seconds, regardless of other users
- an inference request with no timeout now fails instantly if not enough memory is available
- the number of bytes per element is now determined correctly for int, bool, and other dtypes
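
The timeout behaviour described in this list can be sketched roughly as follows. This is a minimal illustration, not the actual MemoryCache code: `ToyMemoryCache`, its `allocate`/`free` methods, and the `received_at` parameter are hypothetical names, and only the `AllocationFailed` name comes from the commit message.

```python
import asyncio
import time
from typing import Optional


class AllocationFailed(Exception):
    """Raised when memory cannot be reserved before the deadline."""


class ToyMemoryCache:
    """Hypothetical stand-in for illustration; not the actual MemoryCache."""

    def __init__(self, max_size_bytes: int):
        self.max_size_bytes = max_size_bytes
        self.used_bytes = 0
        self._memory_freed = asyncio.Condition()

    async def allocate(self, nbytes: int, timeout: Optional[float], received_at: float) -> None:
        # Anchor the deadline to the moment the request was received, so time
        # spent waiting behind other users counts against this request's timeout.
        deadline = None if timeout is None else received_at + timeout
        async with self._memory_freed:
            while self.used_bytes + nbytes > self.max_size_bytes:
                remaining = 0.0 if deadline is None else deadline - time.monotonic()
                if remaining <= 0:
                    # No timeout given, or the deadline has passed: fail right away.
                    raise AllocationFailed(f"could not allocate {nbytes} bytes")
                try:
                    # Sleep until some memory is freed or the deadline expires.
                    await asyncio.wait_for(self._memory_freed.wait(), timeout=remaining)
                except asyncio.TimeoutError:
                    raise AllocationFailed(f"timed out allocating {nbytes} bytes") from None
            self.used_bytes += nbytes

    async def free(self, nbytes: int) -> None:
        async with self._memory_freed:
            self.used_bytes -= nbytes
            self._memory_freed.notify_all()  # wake up waiting allocations
```

In this sketch the caller records `time.monotonic()` when the request arrives and passes it as `received_at`, so a request that has already spent its whole timeout in the queue fails immediately instead of waiting again.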


---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: Aleksandr Borzunov <hxrussia@gmail.com>
9 months ago
Artem Chumachenko a14ae7334d
Update peft to 0.5.0 version (#475)
Update peft to 0.5.0
9 months ago
Alexander Borzunov 057a2fb5de
Support Llama 2 (#379) 11 months ago
Alexander Borzunov 3218534745
Fix --token arg (#378) 11 months ago
justheuristic 398a384075
Inherit bitsandbytes compute dtype correctly (override peft quirk) (#377) 11 months ago
Alexander Borzunov c735dd7ba3
Update transformers to 4.31.0 and peft to 0.4.0 (#371) 11 months ago
justheuristic 37fdcb3fe0
Switch adapters slightly faster (#353)
Currently, each `TransformerBackend.inference_step` looks for adapters and sets the correct adapter type for each block. This is not very expensive, but it can measurably affect inference time.

This pull request uses faster adapter switching with just one variable assignment, without iterating over block.modules().
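
As a rough illustration of the idea (not the code from this pull request), adapter switching by a single assignment could look like the sketch below. `AdapterContext`, `using_adapter`, and `ToyAdapterLayer` are hypothetical names; the point is only that every layer reads one shared variable instead of being visited and reconfigured one by one.

```python
import contextlib
from typing import Dict, Iterator, Optional


class AdapterContext:
    """Hypothetical holder for the name of the currently active adapter."""

    active_adapter: Optional[str] = None  # shared by all layers below


@contextlib.contextmanager
def using_adapter(name: Optional[str]) -> Iterator[None]:
    previous = AdapterContext.active_adapter
    AdapterContext.active_adapter = name  # the single assignment
    try:
        yield
    finally:
        AdapterContext.active_adapter = previous  # restore on exit


class ToyAdapterLayer:
    """Toy layer: identity plus an optional per-adapter scaled term."""

    def __init__(self, adapter_scales: Dict[str, float]):
        self.adapter_scales = adapter_scales

    def forward(self, x: float) -> float:
        # Look up the active adapter at call time; nothing is stored per layer.
        scale = self.adapter_scales.get(AdapterContext.active_adapter, 0.0)
        return x + scale * x


layers = [ToyAdapterLayer({"my_adapter": 2.0}) for _ in range(3)]
with using_adapter("my_adapter"):
    outputs_with_adapter = [layer.forward(1.0) for layer in layers]
outputs_without_adapter = [layer.forward(1.0) for layer in layers]
```

The cost of switching is then independent of the number of blocks and modules, which is the effect the message above describes.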
11 months ago
Alexander Borzunov 9703358df0
Fix bugs in _choose_num_blocks() added in #346 (#354) 11 months ago
Alexander Borzunov e12d4c666b
Spam less in server logs (#350) 11 months ago
justheuristic 010857a834
Estimate adapter memory overhead in choose_num_blocks() (#346)
* estimate adapter memory overhead
* reduce number of heads based on that
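
The arithmetic behind such an estimate might look like the sketch below. It assumes LoRA-style adapters and that the estimate lowers how many blocks fit in memory, as the function name in the title suggests; the function names, parameters, and numbers are illustrative and are not taken from `choose_num_blocks()`.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    # LoRA adds two low-rank matrices per adapted linear layer:
    # A with shape (rank, in_features) and B with shape (out_features, rank).
    return rank * (in_features + out_features)


def blocks_that_fit(free_bytes: int, block_bytes: int, adapter_bytes_per_block: int) -> int:
    # Each served block costs its own weights plus the weights of its adapters.
    return free_bytes // (block_bytes + adapter_bytes_per_block)


# Hypothetical numbers: one 4096x4096 layer, rank 8, 2 bytes per parameter.
adapter_bytes = lora_param_count(4096, 4096, rank=8) * 2
```

With adapters taken into account, the per-block cost grows, so the same amount of free memory holds the same number of blocks or fewer.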

---------

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
11 months ago
Artem Chumachenko b9f0a5467f
Support peft LoRA adapters (#335)
Implement an option to deploy PEFT adapters to a server. Clients can set active_adapter=... to use these adapters.

---------

Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: justheuristic <justheuristic@gmail.com>
11 months ago