petals

Alexander Borzunov 62d9ed5ce7 Implement shortest-path routing for inference (#362)	11 months ago

This PR:

1. Adds shortest-path routing for inference. We build a graph with client-server and server-server latencies and compute costs, as well as empirically measured overheads. For client-server latencies, we ping possible first and last servers in a sequence in `SequenceManager.update()`. We penalize servers that may not have enough cache for our request. This uses info added to the DHT in #355, #356, #358.
2. Makes a server ping neighboring servers in addition to next ones. This gives an opportunity to switch servers even before we use all of a server's blocks (e.g., because a neighboring server is faster). This feature is not enabled, though, since it increases the graph size for N servers to O(N^2), but we may enable it if needed.
3. Fixes a `SequenceManager` bug with the first `update()`. Previously, this update was likely to produce incorrect information and cause `MissingBlocksError`s until the next update.
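
The routing idea in this commit message can be illustrated with a small, self-contained sketch. This is not the code from the PR: `Span`, `shortest_chain`, the `client_latency` and `server_latency` dictionaries, and the `cache_penalty` field are hypothetical names chosen for illustration, and the cost model (one-way latency per hop plus a fixed compute cost per block) is an assumption. The sketch runs Dijkstra's algorithm over states of the form (blocks processed so far, current peer). Each chosen server is used through the end of its span, matching the note that switching servers mid-span is not enabled.

```python
import heapq
from dataclasses import dataclass

CLIENT = "client"  # hypothetical label for the client node


@dataclass(frozen=True)
class Span:
    """Hypothetical record: `peer` serves blocks [start, end)."""
    peer: str
    start: int
    end: int
    block_cost: float            # assumed compute cost per block
    cache_penalty: float = 0.0   # assumed extra cost if cache may be insufficient


def shortest_chain(spans, n_blocks, client_latency, server_latency, default_latency=1.0):
    """Return (total_cost, [(peer, first_block, end_block), ...]) or None if blocks are missing."""
    source, target = (0, CLIENT), (n_blocks, "done")
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        if node == target:
            break
        block, peer = node
        edges = []
        if block == n_blocks:
            # All blocks processed: the last server sends the result back to the client.
            edges.append((target, client_latency.get(peer, default_latency)))
        else:
            for s in spans:
                if s.start <= block < s.end:
                    if peer == CLIENT:
                        hop = client_latency.get(s.peer, default_latency)
                    else:
                        hop = server_latency.get((peer, s.peer), default_latency)
                    stop = min(s.end, n_blocks)
                    cost = hop + s.cache_penalty + s.block_cost * (stop - block)
                    edges.append(((stop, s.peer), cost))
        for nxt, cost in edges:
            if d + cost < dist.get(nxt, float("inf")):
                dist[nxt] = d + cost
                prev[nxt] = node
                heapq.heappush(heap, (d + cost, nxt))
    if target not in dist:
        return None  # some block is not covered by any server
    # Walk predecessor links back from the last server to the client.
    chain, node = [], prev[target]
    while node != source:
        before = prev[node]
        chain.append((node[1], before[0], node[0]))
        node = before
    return dist[target], chain[::-1]
```

For example, with spans `A` covering blocks 0 to 4 and `B`, `C` covering 0 to 2 and 2 to 4, a slow client link to `A` makes the two-server chain `[("B", 0, 2), ("C", 2, 4)]` cheaper than the single-server one. When no server covers some block, the function returns `None`, which loosely corresponds to the missing-blocks condition mentioned in item 3.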
routing	Implement shortest-path routing for inference (#362)	11 months ago
__init__.py	Add LLaMA support (#323)	12 months ago
from_pretrained.py	Add LLaMA support (#323)	12 months ago
inference_session.py	Implement shortest-path routing for inference (#362)	11 months ago
lm_head.py	Fix llama's lm_head.weight.requires_grad (#330)	12 months ago
ptune.py	Fix llama's lm_head.weight.requires_grad (#330)	12 months ago
remote_forward_backward.py	Lower payload size threshold for stream handlers (#251)	1 year ago
remote_generation.py	Raise error for unexpected .generate() kwargs (#315)	1 year ago
remote_sequential.py	Support peft LoRA adapters (#335)	11 months ago
sequential_autograd.py	Remove unused imports and attributes (#324)	1 year ago