main
peft_update
forward_backward
fix-docker
forward_kwargs
bump
test_main
fix-inference-retry
lora_from_hub
payload-size
partial_rollback
qkv_merge
no_qkv_merge
wip_triton
hivemind-dht-fork-process
repetition-penalty
amd-gpus
bnb-0-41-1
lru
beat-docker-into-submission
measurements
debug-leak
fix-nf4-and-dtypes
declare_adapters
empty-weights
download_8bit_weights
no-cpufeature
versions
test_opt_serving
borzunov-patch-2
borzunov-patch-1
processing_attention
yozh-dev-branch
server-increase-startup-timeout
vectorized_beam_search
friendly-timeout-errors
hivemind-1.1.4
fix3
hotfix_bnb
fix-ptune
server-dtypes
pip-installable-v2
pip-installable
diff-compression
client-convenience
server-timeouts
server-logging
beamsearch
fix-protobuf
fix-requirements
fix-joining-announce
bootstrap-peers
fault-tolerant-inference
examples_fix_hivemind
forward-backward-timeouts
fix-rebalancing-issues
add-sst2-example
enable-rebalancing
update_example_1
fix-too-many-open-files
update-hivemind
extract-module-container
instruction-readability-style
readme-clarifications
justheuristic-patch-5
fix-readme
ptune-example-personachat
rtfd
fix-pb2
investigate-segfault
upd-deps
priority-tasks
justheuristic-patch-4
cache
justheuristic-patch-3
generation-inference
deep_prompt_inference
warn-about-6b-instructions
update-readme-disclaimers-faq
justheuristic-patch-2
update-bullet-points
update-readme-pics
readme-release
remove-remote-block
prompt-inference
fix-cache
optimize_seq
fix-seq-backward-recovery
fix-distr-seq-cls
justheuristic-patch-1
fix-convert-8bit
memory_savings
distributed-deep-ptune
ptune-wip
pytest-verbose
rename-test-model
8bit_backward
8bit-model
8bit_model_inference
petals-readme-title
support-backend-dtypes
deep-prompt-tuning
mockup
efficient-forward-backward
fix-branch-name
dbaranchuk-patch-1
get_sequence
generation
fix-ci
fix-master-ci
test-push
facelift
CI
prompt-tuning
client-attempt2
measure-throughput
lm_head
load-balancing
sequence
demo-1
standardize
diff
rpc
update-model
client
fix-auth-token
multiple-experts
8bit_blocks
inference_chain
main_fix
v1.0.0
v1.1.0
v1.1.1
v1.1.2
v1.1.3
v1.1.4
v1.1.5
v2.0.0.post1
v2.0.0.post2
v2.0.0.post3
v2.0.1
v2.0.1.post1
v2.0.1.post2
v2.1.0
v2.2.0
2 Commits (47d50e1e2938f8a0174caf670b25dea5345c6830)
Author | SHA1 | Message | Date
---|---|---|---
Alexander Borzunov | 47d50e1e29 | Improve default arguments for clients and servers (#530) | 7 months ago
Alexander Borzunov | 063e94b4c8 | Move SequenceManagerConfig -> ClientConfig, petals.dht_utils -> petals.utils.dht (#463) | 9 months ago

The full commit message of 47d50e1e29 (#530):
This PR updates multiple default arguments in clients and servers:

1. **The client defaults to `torch_dtype=torch.float32` instead of `torch_dtype="auto"`.**

   The old default was to load weights in the dtype they are saved in (usually bfloat16/float16), which caused issues when the client was run on CPU (the default unless you call `.cuda()`). Specifically, bfloat16 is slow on most CPUs (unless a CPU supports AVX512), and float16 can't be run natively on CPU and raises an exception. This default was a legacy of the earliest Petals versions designed to run BLOOM: its embeddings were so big that they didn't fit into RAM in float32 (e.g., in Colab). The newer models don't have this issue.

   In contrast, the new default gives good speed on all CPUs and is consistent with PyTorch and HF Transformers. Also, the client now shows the "bfloat16 on a non-AVX512 CPU" warning in all cases (previously, this warning was shown only if the machine had enough RAM to fit the weights in float32, which could hide the actual reason inference was slow).

   **Note:** This change is backward-incompatible, so we have to increase at least the minor package version (2.2.0 -> 2.3.0.dev0).

2. **The server uses a 2x smaller `--attn_cache_tokens` value by default.**

   The old default led to loading 39 (out of 80) or 78 (out of 80) blocks for popular models on some GPU types, which visibly slowed down inference due to an excess network hop. It also reserved too much cache, so inference slowed down well before the cache was fully used. The new default leads to more efficient block layouts and makes the inference routing algorithm choose alternative paths through other servers when a particular server already has enough active inference sessions (i.e., its cache is full).

3. **The client's maximum number of retries can be limited by the `PETALS_MAX_RETRIES` env var.**

   This is used to limit `ClientConfig.max_retries` in tests, so that errors produce tracebacks instead of being retried indefinitely.
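As a minimal sketch of what change 1 means in practice (the model repo name below is just the usual Petals example, not something this commit mandates), a client that doesn't want the new float32 default can still pin a dtype explicitly:

```python
import torch
from petals import AutoDistributedModelForCausalLM

# With this change, omitting torch_dtype means float32 (fast and safe on CPU).
# Passing torch_dtype="auto" would restore the old "load in the saved dtype" behavior.
model = AutoDistributedModelForCausalLM.from_pretrained(
    "petals-team/StableBeluga2",  # example model repo
    torch_dtype=torch.bfloat16,   # e.g., when running on a GPU or an AVX512-capable CPU
)
```

On GPU, bfloat16/float16 halves memory use relative to float32, which is why the old default existed for BLOOM's huge embeddings in the first place.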
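And a hedged sketch of how a test might use the `PETALS_MAX_RETRIES` variable from change 3 (exactly when Petals reads the variable is an assumption here; setting it before the client is constructed should be the safe order):

```python
import os

# Cap retries so a failing server produces a traceback instead of an endless
# retry loop; "3" is an arbitrary example value, not a recommended setting.
os.environ["PETALS_MAX_RETRIES"] = "3"

from petals import AutoDistributedModelForCausalLM  # create the client only after setting the var
```

The server-side counterpart in change 2 needs no code at all: operators who relied on the old, larger cache can pass their own `--attn_cache_tokens` value when launching `python -m petals.cli.run_server`.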