Commit Graph

4 Commits (main)

Author SHA1 Message Date
Artem Chumachenko d6f4f80f3f
Fix Mixtral-related issues (#570)
This PR fixes problems related to #569:
- block initialization
- throughput calculation and cache usage
- mixtral in tests

Beam search is removed for Mixtral and Llama for now. Those models use DynamicCache, which requires special function to change: (see https://github.com/huggingface/transformers/blob/main/src/transformers/cache_utils.py#L161)

---------

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
2 months ago
Denis Mazur 0d91bbdac3
Bump transformers and accelerate versions (#554)
Bump versions for transformers and accelerate, remove falcon-rw-1b CI tests
3 months ago
Max Ryabinin 03cbe90234
Optimize LLaMA for inference (#513)
* Optimize LLaMa for inference
* Fix model type detection in tests
6 months ago
Max Ryabinin 1ebd88ae7b
Optimize the Falcon block for inference (#500)
This PR attempts to optimize the inference of Falcon models in the single-token setup by reducing the majority of Python overhead and making several assumptions about the setup. Specifically,

* Layer normalization, QKV projection (with splitting) and rotary embeddings are executed through CUDA graphs, which reduces most overhead related to small kernel launche
* If no sin/cos tensors are cached by the rotary embedding layer, we cache them for 8192 tokens (INFERENCE_MAX_LENGTH) during the first forward pass. In general, it should be beneficial to always run a max-length sequence before starting a block, but this is a question for another PR

The PR also adds a small test to ensure that the results (without quantization) of the block before and after quantization indeed match.

Lastly, the pull request makes the backward pass work (as discussed in https://github.com/bigscience-workshop/petals/pull/499) by making cached sin/cos for RotaryEmbedding into buffers and disabling the inference mode during their creation.
9 months ago