62d9ed5ce7
This PR: 1. **Adds shortest path routing for inference.** We build a graph whose edge weights combine client-server and server-server latencies, compute costs, and empirically measured overheads. For client-server latencies, we ping the possible first and last servers of a sequence in `SequenceManager.update()`. We penalize servers that may not have enough cache for our request. This uses info added to the DHT in #355, #356, #358. 2. **Makes a server ping neighboring servers in addition to next ones.** This gives us an opportunity to switch servers before we have used all of a server's blocks (e.g., because a neighboring server is faster). This feature is not enabled yet, since it increases the graph size for N servers to O(N^2), but we may enable it if needed. 3. **Fixes a `SequenceManager` bug with the first `update()`.** Previously, this update was likely to produce incorrect information and cause `MissingBlocksError`s until the next update. |
11 months ago | |
---|---|---|
.. | ||
routing | 11 months ago | |
__init__.py | 12 months ago | |
from_pretrained.py | 12 months ago | |
inference_session.py | 11 months ago | |
lm_head.py | 12 months ago | |
ptune.py | 12 months ago | |
remote_forward_backward.py | 1 year ago | |
remote_generation.py | 1 year ago | |
remote_sequential.py | 11 months ago | |
sequential_autograd.py | 1 year ago |