Commit Graph

5 Commits (main)

Author SHA1 Message Date
Max Ryabinin ae19b65095
Add position_ids argument to DistributedFalconModel (#525) 8 months ago
Alexander Borzunov 158621677b
Bump version to 2.2.0 (#502) 9 months ago
Max Ryabinin 1ebd88ae7b
Optimize the Falcon block for inference (#500)
This PR attempts to optimize the inference of Falcon models in the single-token setup by reducing the majority of Python overhead and making several assumptions about the setup. Specifically,

* Layer normalization, QKV projection (with splitting) and rotary embeddings are executed through CUDA graphs, which reduces most overhead related to small kernel launche
* If no sin/cos tensors are cached by the rotary embedding layer, we cache them for 8192 tokens (INFERENCE_MAX_LENGTH) during the first forward pass. In general, it should be beneficial to always run a max-length sequence before starting a block, but this is a question for another PR

The PR also adds a small test to ensure that the results (without quantization) of the block before and after quantization indeed match.

Lastly, the pull request makes the backward pass work (as discussed in https://github.com/bigscience-workshop/petals/pull/499) by making cached sin/cos for RotaryEmbedding into buffers and disabling the inference mode during their creation.
9 months ago
Alexander Borzunov d40eb6c701
Fix prompt tuning after #464 (#501)
Unfortunately, running inference in models with `"ptune" in config.tuning_mode` was broken after #464.
9 months ago
Alexander Borzunov dd4a3230bc
Add Falcon support (#499)
This PR adds:

- Support for models based on `transformers.FalconModel` (the in-library format for Falcon). Tested on Falcon-40B.
- CI tests for Falcon-RW-1B.
- `--throughput dry_run` option to evaluate throughput and exit right away (implemented by @mryab).

Limitations:

- Backward pass support is broken for now, will be fixed in #500.

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
9 months ago