petals/src/petals/models
Max Ryabinin 1ebd88ae7b
Optimize the Falcon block for inference (#500)
This PR attempts to optimize the inference of Falcon models in the single-token setup by removing most of the Python overhead and making several assumptions about the setup. Specifically:

* Layer normalization, QKV projection (with splitting) and rotary embeddings are executed through CUDA graphs, which removes most of the overhead related to small kernel launches (see the sketch after this list).
* If no sin/cos tensors are cached by the rotary embedding layer, we cache them for 8192 tokens (INFERENCE_MAX_LENGTH) during the first forward pass. In general, it should be beneficial to always run a max-length sequence before starting a block, but this is a question for another PR.

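As a rough illustration of the first point, here is a minimal sketch of capturing the layer norm → QKV projection (with splitting) prefix into a CUDA graph with PyTorch. The dimensions are Falcon-7B-like assumptions, and the names (`ln`, `qkv_proj`, `prefix`, `run_prefix`) are illustrative, not the actual Petals code:

```python
import torch

# Assumed Falcon-7B-like dimensions; illustrative sketch, not Petals' implementation.
hidden_size, num_heads, head_dim = 4544, 71, 64
ln = torch.nn.LayerNorm(hidden_size).cuda()
qkv_proj = torch.nn.Linear(hidden_size, (num_heads + 2) * head_dim, bias=False).cuda()

def prefix(hidden_states):
    # layer norm -> fused QKV projection -> split into q, k, v
    # (rotary embeddings would also be applied here in the real block)
    fused = qkv_proj(ln(hidden_states))
    return fused.split([num_heads * head_dim, head_dim, head_dim], dim=-1)

# CUDA graphs replay fixed memory addresses, so inputs go through a static buffer.
static_input = torch.zeros(1, 1, hidden_size, device="cuda")

with torch.no_grad():
    # Warm up on a side stream (required before capture), then capture once.
    side_stream = torch.cuda.Stream()
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        for _ in range(3):
            prefix(static_input)
    torch.cuda.current_stream().wait_stream(side_stream)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_q, static_k, static_v = prefix(static_input)

def run_prefix(hidden_states):
    """Replay the captured kernels as a single launch for each new token."""
    static_input.copy_(hidden_states)
    graph.replay()
    return static_q, static_k, static_v
```

Each decoding step then copies the new token's activations into `static_input` and replays the graph, so the many small kernel launches of this prefix collapse into a single launch.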
The PR also adds a small test to ensure that the outputs of the block (without quantization) before and after the optimization indeed match.
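A hedged sketch of what such an equivalence test could look like (the block constructors, output type, and tolerances here are assumptions, not the actual test in Petals):

```python
import torch

def test_block_outputs_match(make_reference_block, make_optimized_block, hidden_size=4544):
    """Run the same activations through both block variants and compare outputs."""
    torch.manual_seed(0)
    ref_block = make_reference_block().cuda().eval()
    opt_block = make_optimized_block().cuda().eval()
    opt_block.load_state_dict(ref_block.state_dict())  # identical weights in both

    hidden_states = torch.randn(1, 16, hidden_size, device="cuda")
    with torch.inference_mode():
        # Assumes each block returns a hidden-states tensor of the same shape.
        ref_out = ref_block(hidden_states)
        opt_out = opt_block(hidden_states)
    assert torch.allclose(ref_out, opt_out, atol=1e-5, rtol=1e-3)
```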

Lastly, the pull request makes the backward pass work (as discussed in https://github.com/bigscience-workshop/petals/pull/499) by turning the cached sin/cos tensors of RotaryEmbedding into buffers and disabling inference mode during their creation.
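A minimal sketch of that idea, assuming a simplified rotary-embedding module (`RotaryEmbeddingSketch`, its cache layout, and the lazy `_build_cache` helper are illustrative, not the actual Petals/Falcon code):

```python
import torch

INFERENCE_MAX_LENGTH = 8192  # mirrors the constant mentioned above

class RotaryEmbeddingSketch(torch.nn.Module):
    """Illustrative rotary-embedding cache, not the actual Petals/Falcon class."""

    def __init__(self, head_dim: int, base: float = 10000.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self.register_buffer("cos_cached", None, persistent=False)
        self.register_buffer("sin_cached", None, persistent=False)

    def _build_cache(self, device):
        # Disable inference mode so the cached tensors are ordinary tensors that
        # autograd may save for backward; registering them as buffers also lets
        # them follow the module across .to() / .cuda() calls.
        with torch.inference_mode(False):
            t = torch.arange(INFERENCE_MAX_LENGTH, device=device).float()
            freqs = torch.outer(t, self.inv_freq.to(device))
            emb = torch.cat([freqs, freqs], dim=-1)
            self.register_buffer("cos_cached", emb.cos(), persistent=False)
            self.register_buffer("sin_cached", emb.sin(), persistent=False)

    def forward(self, seq_len: int, device="cuda"):
        if self.cos_cached is None:  # build the full-length cache on first use
            self._build_cache(device)
        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]
```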
| Path | Last commit | Age |
| --- | --- | --- |
| bloom | Fix prompt tuning after #464 (#501) | 9 months ago |
| falcon | Optimize the Falcon block for inference (#500) | 9 months ago |
| llama | Fix prompt tuning after #464 (#501) | 9 months ago |
| __init__.py | Add Falcon support (#499) | 9 months ago |