petals/src/petals/client
Alexander Borzunov 62d9ed5ce7
Implement shortest-path routing for inference (#362)
This PR:

1. **Adds shortest-path routing for inference.** We build a graph with client-server and server-server latencies, compute costs, and empirically measured overheads. To get client-server latencies, we ping the possible first and last servers of a sequence in `SequenceManager.update()`. We penalize servers that may not have enough cache for our request. This uses the info added to the DHT in #355, #356, and #358.

2. **Makes a server ping neighboring servers in addition to next ones.** This makes it possible to switch servers before we have used all of a server's blocks (e.g., because a neighboring server is faster). This feature is not enabled yet, since it increases the graph size for N servers to O(N^2), but we may enable it if needed.

3. **Fixes a `SequenceManager` bug with the first `update()`.** Previously, this update was likely to produce incorrect information and cause `MissingBlocksError`s until the next update.
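
For readers unfamiliar with the idea in item 1, here is a minimal, self-contained sketch of shortest-path routing over a chain of blocks. It is not the code from this PR: the `Server` class, `find_route()`, `CACHE_PENALTY_MS`, and all latency and compute numbers are hypothetical and chosen only for illustration. It assumes each server is used from the current block to the end of its span, and it runs Dijkstra's algorithm over states of the form (next block, node currently holding the activations).

```python
import heapq
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    start: int             # first block served (inclusive)
    end: int               # last block served (exclusive)
    ms_per_block: float    # assumed compute cost per block, in ms
    enough_cache: bool = True


CACHE_PENALTY_MS = 2000.0  # hypothetical penalty for servers short on cache


def find_route(servers, latency_ms, n_blocks, client="client"):
    """Return (total_cost_ms, [(server, from_block, to_block), ...]) or None."""

    def ping(a, b):
        # Latencies are treated as symmetric; unknown pairs are unreachable.
        return latency_ms.get((a, b), latency_ms.get((b, a), math.inf))

    start, goal = (0, client), (n_blocks, client)
    best = {start: 0.0}
    came_from = {}
    heap = [(0.0, start)]
    while heap:
        cost, state = heapq.heappop(heap)
        if cost > best[state]:
            continue  # stale heap entry
        if state == goal:
            break
        pos, holder = state
        edges = []
        if pos == n_blocks:
            # All blocks are done: send the result back to the client.
            edges.append((goal, ping(holder, client), None))
        else:
            for s in servers:
                if s.start <= pos < s.end:
                    new_pos = min(s.end, n_blocks)
                    step = ping(holder, s.name) + (new_pos - pos) * s.ms_per_block
                    if not s.enough_cache:
                        step += CACHE_PENALTY_MS
                    edges.append(((new_pos, s.name), step, (s.name, pos, new_pos)))
        for nxt, step, hop in edges:
            new_cost = cost + step
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                came_from[nxt] = (state, hop)
                heapq.heappush(heap, (new_cost, nxt))
    if goal not in best:
        return None
    # Walk back from the goal to recover the chain of servers.
    route, state = [], goal
    while state != start:
        state, hop = came_from[state]
        if hop is not None:
            route.append(hop)
    return best[goal], route[::-1]


# Made-up swarm: A and B each hold half of the blocks, C holds all of them but is slow.
example_servers = [
    Server("A", 0, 4, 100),
    Server("B", 4, 8, 100),
    Server("C", 0, 8, 300),
]
example_latency_ms = {
    ("client", "A"): 50,
    ("A", "B"): 50,
    ("B", "client"): 50,
    ("client", "C"): 20,
}
example_result = find_route(example_servers, example_latency_ms, n_blocks=8)
```

With these made-up numbers, the route through A and B (950 ms in total) beats the single slow server C (2440 ms). Marking A as `enough_cache=False` adds the penalty to every route through A and flips the choice to C, which is the effect the cache penalty in item 1 is meant to have.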
11 months ago
routing Implement shortest-path routing for inference (#362) 11 months ago
__init__.py Add LLaMA support (#323) 12 months ago
from_pretrained.py Add LLaMA support (#323) 12 months ago
inference_session.py Implement shortest-path routing for inference (#362) 11 months ago
lm_head.py Fix llama's lm_head.weight.requires_grad (#330) 12 months ago
ptune.py Fix llama's lm_head.weight.requires_grad (#330) 12 months ago
remote_forward_backward.py Lower payload size threshold for stream handlers (#251) 1 year ago
remote_generation.py Raise error for unexpected .generate() kwargs (#315) 1 year ago
remote_sequential.py Support peft LoRA adapters (#335) 11 months ago
sequential_autograd.py Remove unused imports and attributes (#324) 1 year ago