petals/tests
Max Ryabinin 1ebd88ae7b
Optimize the Falcon block for inference (#500)
This PR optimizes the inference of Falcon models in the single-token setup by removing most of the Python overhead, under several assumptions about the setup. Specifically:

* Layer normalization, QKV projection (with splitting), and rotary embeddings are executed through CUDA graphs, which removes most of the overhead related to small kernel launches (a minimal sketch of this pattern follows the list)
* If no sin/cos tensors are cached by the rotary embedding layer, we cache them for 8192 tokens (INFERENCE_MAX_LENGTH) during the first forward pass. In general, it should be beneficial to always run a max-length sequence before starting a block, but this is a question for another PR.
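
For illustration only, here is a minimal sketch of the CUDA-graph pattern described above. The shapes, module names, and the `optimized_step` helper are assumptions for the example, not the actual Falcon block code; the point is capturing a short chain of small kernels once and replaying it to avoid per-kernel launch overhead.

```python
import torch

# Hypothetical shapes and modules, only to illustrate the CUDA-graph pattern;
# the real code captures the Falcon block's layer norm / QKV / rotary kernels.
hidden_size = 4544
static_input = torch.zeros(1, 1, hidden_size, device="cuda", dtype=torch.bfloat16)
ln = torch.nn.LayerNorm(hidden_size, device="cuda", dtype=torch.bfloat16)
qkv = torch.nn.Linear(hidden_size, 3 * hidden_size, device="cuda", dtype=torch.bfloat16)

# Warm up on a side stream before capture, as required by CUDA graphs.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream), torch.no_grad():
    for _ in range(3):
        _ = qkv(ln(static_input))
torch.cuda.current_stream().wait_stream(side_stream)

# Capture the sequence of small kernels once; replaying the graph launches them
# as a single unit instead of many Python-dispatched kernel launches.
graph = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(graph):
    static_output = qkv(ln(static_input))

def optimized_step(hidden_states: torch.Tensor) -> torch.Tensor:
    static_input.copy_(hidden_states)  # write into the captured input buffer
    graph.replay()
    return static_output.clone()
```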

The PR also adds a small test to ensure that the results of the block (without quantization) before and after the optimization indeed match.
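
A hypothetical sketch of such a check follows; the block objects, shapes, and tolerances here are placeholders, not the exact fixtures used in test_optimized_layers.py.

```python
import torch

@torch.inference_mode()
def assert_block_outputs_match(reference_block, optimized_block, hidden_size=4544):
    # Placeholder check: feed identical activations through the unoptimized and
    # optimized blocks and require near-identical outputs (no quantization).
    hidden_states = torch.randn(1, 8, hidden_size, dtype=torch.bfloat16, device="cuda")
    ref_out = reference_block(hidden_states)
    opt_out = optimized_block(hidden_states)
    torch.testing.assert_close(opt_out, ref_out, rtol=0, atol=1e-3)
```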

Lastly, the pull request makes the backward pass work (as discussed in https://github.com/bigscience-workshop/petals/pull/499) by turning the cached sin/cos tensors of RotaryEmbedding into buffers and disabling inference mode during their creation; a sketch of this pattern is shown below.
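
As an illustration of that change, a minimal rotary-embedding cache with sin/cos registered as buffers might look like the following. The class name, the base of 10000, and the 8192-token default are assumptions for the sketch, not the exact Falcon implementation.

```python
import torch


class RotaryCache(torch.nn.Module):
    """Illustrative sketch (not the actual Falcon code): cached sin/cos kept as buffers."""

    def __init__(self, head_dim: int, max_length: int = 8192, base: float = 10000.0):
        super().__init__()
        # Disable inference mode while building the cache so the resulting tensors
        # can still participate in a later backward pass (see the discussion in #499).
        with torch.inference_mode(False):
            inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
            positions = torch.arange(max_length).float()
            freqs = torch.outer(positions, inv_freq)
            emb = torch.cat([freqs, freqs], dim=-1)
            cos, sin = emb.cos(), emb.sin()
        # Buffers (rather than plain attributes) follow the module across devices and dtypes.
        self.register_buffer("cos_cached", cos, persistent=False)
        self.register_buffer("sin_cached", sin, persistent=False)

    def forward(self, seq_len: int):
        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]
```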
| File | Latest commit | Last updated |
|---|---|---|
| bootstrap.id | Test Llama, rebalancing, throughput eval, and all CLI scripts (#452) | 9 months ago |
| conftest.py | Fix logging: do not duplicate lines, enable colors in Colab (#156) | 1 year ago |
| server2.id | Test Llama, rebalancing, throughput eval, and all CLI scripts (#452) | 9 months ago |
| test_aux_functions.py | Add customizable input tensors (#445) | 9 months ago |
| test_block_exact_match.py | Prioritize short inference, unmerge pools for long inference (#458) | 9 months ago |
| test_cache.py | Support macOS (#477) | 9 months ago |
| test_chained_calls.py | Test Llama, rebalancing, throughput eval, and all CLI scripts (#452) | 9 months ago |
| test_dtype.py | Add LLaMA support (#323) | 11 months ago |
| test_full_model.py | Fix `.generate(input_ids=...)` (#485) | 9 months ago |
| test_optimized_layers.py | Optimize the Falcon block for inference (#500) | 9 months ago |
| test_peft.py | Support peft LoRA adapters (#335) | 10 months ago |
| test_priority_pool.py | Support macOS (#477) | 9 months ago |
| test_remote_sequential.py | Fix `.generate(input_ids=...)` (#485) | 9 months ago |
| test_sequence_manager.py | Test Llama, rebalancing, throughput eval, and all CLI scripts (#452) | 9 months ago |
| test_server_stats.py | Test Llama, rebalancing, throughput eval, and all CLI scripts (#452) | 9 months ago |
| test_tensor_parallel.py | Test Llama, rebalancing, throughput eval, and all CLI scripts (#452) | 9 months ago |
| test_utils.py | Support peft LoRA adapters (#335) | 10 months ago |