43ac6016ac
Currently, the schemas use `torch.float32`, so all inputs and outputs are converted to float32 before sending and after receiving, on both servers and clients. This creates a huge slowdown for the system. This PR adds:

* Schemas that use the server's `--torch_dtype` argument (default: `torch.bfloat16` for BLOOM-176B)
* An option for the client to request a specific output compression. Use case 1: the client sends quantized inputs and expects quantized outputs in return. Use case 2: the client uses quantization for gradients w.r.t. activations, but keeps gradients w.r.t. __prompts__ as-is for greater precision.
* A comment explaining the purpose of NoSpendingPolicy, since we likely won't have it for the workshop
* A test with custom compression (janky implementation, for testing purposes only)

Co-authored-by: justheuristic <justheuristic@gmail.com>
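The core idea above can be sketched as follows. This is a minimal illustration, not the project's real API: the helper names (`resolve_dtype`, `cast_for_rpc`) and the dtype map are assumptions; only the behavior described in the PR (casting to the server's serving dtype instead of forcing everything through float32) is taken from the source.

```python
import torch

# Hypothetical mapping from a --torch_dtype CLI string to a torch dtype
# (names and defaults here are assumptions for illustration).
DTYPE_MAP = {
    "float32": torch.float32,
    "float16": torch.float16,
    "bfloat16": torch.bfloat16,
}

def resolve_dtype(arg: str = "bfloat16") -> torch.dtype:
    """Resolve the server's --torch_dtype argument into a torch dtype."""
    return DTYPE_MAP[arg]

def cast_for_rpc(tensor: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    """Cast a tensor to the serving dtype only when needed, avoiding the
    unconditional round-trip through float32 that the PR removes."""
    return tensor if tensor.dtype == dtype else tensor.to(dtype)

x = torch.randn(2, 3)  # float32 activations on the client side
y = cast_for_rpc(x, resolve_dtype("bfloat16"))
# y is bfloat16: each element now takes 2 bytes instead of 4,
# halving the payload sent over the wire.
```

A usage note: with this scheme, a client that already holds tensors in the serving dtype pays no conversion cost at all, since `cast_for_rpc` is a no-op when the dtypes match.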