Summary (committed 2 years ago):

```python
parser.add_argument('--attn_cache_size', type=str, default=None,
                    help='The size of GPU memory allocated for storing past attention keys/values between inference steps. '
                         'Examples: 500MB, 1.2GB, 1073741824 (bytes). Note that 1KB != 1KiB here. '
                         'Default: 0.5GiB * num_blocks * hidden_size / 14336. '
                         'The latter is the hidden size of the bigscience/bloom-petals model.')
parser.add_argument('--request_timeout', type=float, required=False, default=3 * 60,
                    help='Timeout (in seconds) for the whole rpc_forward/rpc_backward/rpc_forward_stream/rpc_backward_stream request')
parser.add_argument('--session_timeout', type=float, required=False, default=30 * 60,
                    help='Timeout (in seconds) for the whole inference session')
parser.add_argument('--step_timeout', type=float, required=False, default=60,
                    help="Timeout (in seconds) for waiting the next step's inputs inside an inference session")
parser.add_argument('--load_in_8bit', type=bool, default=None,
                    help="Convert the loaded model into mixed-8bit quantized model. Default: True if GPU is available")
```

Co-authored-by: justheuristic <justheuristic@gmail.com>

Name | Last updated | |
---|---|---|
__init__.py | 2 years ago | |
backend.py | 2 years ago | |
block_selection.py | 2 years ago | |
cache.py | 2 years ago | |
handler.py | 2 years ago | |
runtime.py | 2 years ago | |
server.py | 2 years ago | |
task_pool.py | 2 years ago | |
task_prioritizer.py | 2 years ago | |
throughput.py | 2 years ago |
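
The `--attn_cache_size` help text in the summary distinguishes decimal units (1KB = 1000 bytes) from binary units (1KiB = 1024 bytes) and gives a default of `0.5GiB * num_blocks * hidden_size / 14336`. The sketch below illustrates one way such strings and that default could be computed. It is not the server's actual parser: the names `parse_size` and `default_attn_cache_size` are hypothetical, and the case-insensitive unit matching and rounding down to whole bytes are assumptions made for this example.

```python
import re
from decimal import Decimal

# Decimal (SI) and binary (IEC) multipliers, keyed by upper-cased unit.
# Note that 1KB = 1000 bytes while 1KiB = 1024 bytes.
_UNITS = {
    "": 1, "B": 1,
    "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "KIB": 2**10, "MIB": 2**20, "GIB": 2**30, "TIB": 2**40,
}

# A non-negative number (optionally fractional) followed by an optional unit.
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]*)\s*$")


def parse_size(text: str) -> int:
    """Convert a string such as '500MB' or '1.2GB' to a byte count (hypothetical helper)."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"Cannot parse size: {text!r}")
    number, unit = match.groups()
    unit = unit.upper()  # assumption: units are matched case-insensitively
    if unit not in _UNITS:
        raise ValueError(f"Unknown unit in size: {text!r}")
    # Decimal avoids binary floating-point error for inputs like '1.2GB'.
    return int(Decimal(number) * _UNITS[unit])


def default_attn_cache_size(num_blocks: int, hidden_size: int) -> int:
    """Default from the help text: 0.5GiB * num_blocks * hidden_size / 14336."""
    half_gib = 2**29  # 0.5 GiB in bytes
    return half_gib * num_blocks * hidden_size // 14336


if __name__ == "__main__":
    for example in ("500MB", "1.2GB", "1073741824", "1KB", "1KiB"):
        print(example, "->", parse_size(example), "bytes")
    print("default:", default_attn_cache_size(num_blocks=4, hidden_size=14336), "bytes")
```

With a hidden size of 14336, the formula reduces to 0.5GiB per block, so the default scales linearly with the number of blocks a server hosts.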