Commit Graph

380 Commits (675bacb592bac7145d38ded2ea746da2b9b6c391)
 

Author SHA1 Message Date
Max Ryabinin a0e8bbd28d
Fix arguments in remove_old_models.py (#153)
* Fix arguments in remove_old_models.py

* Remove unnecessary args.author

* Fix the GitHub Action as well
2 years ago
Alexander Borzunov 701ec7e53e
Clean up disk space (#152) 2 years ago
justheuristic b04982c1a2
Bump transformers to 4.25.1 (#151)
- latest accelerate, transformers, huggingface_hub
- rearrange attention caches to support https://github.com/huggingface/transformers/pull/18344
- remove unused code
- fix edge case where session crashes when receiving seq length 0
- assert transformer version when importing WrappedBloomBlock

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
2 years ago
Alexander Borzunov e4dc938dfe
Fix OOMs during server rebalancing (#150)
The cause of OOMs were the cyclic references `TransformerBackend <-> PrioritizedTaskPool` that could not have been garbage collected properly. Still, I've added explicit tensor removal just in case.
2 years ago
Alexander Borzunov 83d9493b6c
Improve block size calculations (#149) 2 years ago
Aleksandr Borzunov f42e559c77 Update README.md 2 years ago
Alexander Borzunov 6beb686909
Add link to privacy & security Wiki (#144) 2 years ago
Alexander Borzunov 84fec81543
Suppress asyncio error logs by default (#142) 2 years ago
Alexander Borzunov e99bf36647
Use common folder for all caches, make it a volume in Dockerfile (#141) 2 years ago
Alexander Borzunov 5f50ea9c79
Update Anaconda instructions (#140) 2 years ago
Alexander Borzunov e1d8793f00
Show route on client (#139) 2 years ago
Alexander Borzunov 4cb0ac4718
Update texts in "Terms of use" and "Privacy and security" sections (#138) 2 years ago
Alexander Borzunov a94c91d870
Add Docker commands, use permanent Discord links (#137) 2 years ago
Alexander Borzunov 77a00e17f0
Fix "could not unlink the shared memory file" during rebalancing (#135) 2 years ago
Alexander Borzunov 318d690a5c
Fix waiting until free memory is available (#136) 2 years ago
Alexander Borzunov e8fac92e59
Allow .generate() to reuse existing inference session (#132) 2 years ago
Alexander Borzunov 1fe3716589
Don't ban servers in case of client-caused handler errors (#134) 2 years ago
Alexander Borzunov 66f1799d32
Set default --step_timeout to 5 min (#133) 2 years ago
Alexander Borzunov b873d92ffa
Update README.md 2 years ago
Alexander Borzunov 5d5d2666b8
Mention parallel inference 2 years ago
Alexander Borzunov 955eae30b3
Mention 1 sec/token explicitly 2 years ago
Alexander Borzunov 33c210b973
Update Colab notebook 2 years ago
Alexander Borzunov f56edaa13f
Fix inference and rpc_info() fault tolerance (#131) 2 years ago
justheuristic 79a4308992
Clear trigger before engaging in update (#130)
Update sequence_manager.py
2 years ago
Alexander Borzunov b8e1c1b7f5
Revert to hivemind==1.1.3 for stability (#129) 2 years ago
justheuristic 68c85e7492
Avoid synchronous updates, ban peers based on request outcome (#127)
- sequence_manager now takes care for its own updated-ness - no need to manually update it
- if a peer fails a request, sequence manager will ban this peer temporarily. Ban times increase with failure streaks



Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
2 years ago
Alexander Borzunov 9dbf5e2e6f
Set dht.num_workers = n_layer, update_period = 150, expiration = 300 (#125) 2 years ago
Max Ryabinin 3ca8b4f082
Fix typos with codespell (#126) 2 years ago
justheuristic 8491ed2bd3
Add checks for forward() inputs on the client side (#123) 2 years ago
Max Ryabinin 055f85b83e
Call block.load_state_dict only once (#124) 2 years ago
Artem Chumachenko 0855aa7347
Update notebooks to use full BLOOM-176B (#104)
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
2 years ago
Max Ryabinin 4ffb4d83c7
Remove "-r" when installing Petals in examples (#122) 2 years ago
Alexander Borzunov d29ef70c85
Update README.md 2 years ago
Alexander Borzunov 1d9aa77697
Update README.md 2 years ago
Alexander Borzunov da36470a4b
Update README.md 2 years ago
Alexander Borzunov 81b94df14b
Rework readme, move code example to the top, link draft of Colab (#118) 2 years ago
Alexander Borzunov 893987ebf8
Require hivemind==1.1.4 with p2pd v0.3.13 (#121) 2 years ago
Alexander Borzunov fc6722576b
Choose --num_blocks for bigscience/bloom-petals automatically (#119) 2 years ago
Alexander Borzunov f72c220404
Suppress quantization warning and fix dtype defaults in compute benchmark (#117) 2 years ago
Alexander Borzunov 643a054170
Make server use smart defaults (#115)
Summary:

```python
parser.add_argument('--attn_cache_size', type=str, default=None,
                    help='The size of GPU memory allocated for storing past attention keys/values between inference steps. '
                         'Examples: 500MB, 1.2GB, 1073741824 (bytes). Note that 1KB != 1KiB here. '
                         'Default: 0.5GiB * num_blocks * hidden_size / 14336. '
                         'The latter is the hidden size of the bigscience/bloom-petals model.')

parser.add_argument('--request_timeout', type=float, required=False, default=3 * 60,
                    help='Timeout (in seconds) for the whole rpc_forward/rpc_backward/rpc_forward_stream/rpc_backward_stream request')
parser.add_argument('--session_timeout', type=float, required=False, default=30 * 60,
                    help='Timeout (in seconds) for the whole inference session')
parser.add_argument('--step_timeout', type=float, required=False, default=60,
                    help="Timeout (in seconds) for waiting the next step's inputs inside an inference session")

parser.add_argument('--load_in_8bit', type=bool, default=None,
                    help="Convert the loaded model into mixed-8bit quantized model. Default: True if GPU is available")
```

Co-authored-by: justheuristic <justheuristic@gmail.com>
2 years ago
justheuristic 9e11f73242
Fix tile size on ampere (#116)
Fix tile size on ampere

Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>
2 years ago
justheuristic 617d70f7dc
Support --load_in_8bit on pre-Turing GPUs (#113)
- Linear8bitLt now supports for pre-turing GPUs by temporarily upcasting quantized weights.
- added a test for linear8bitlt accuracy with the new fallback, the accuracy is similar than the real thing, (slightly better due to non-quantized A)
- performance is roughly halfway between the default mode and memory_efficient_backward

Alternatives considered:
- cupy - slow, casting to float internally
- triton - fast but unstable af. every 3rd attempt to matmul is a segfault
- bnb.functional.igemm (no lt) - "CuBLAS Error 8" on old GPUs

Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>
2 years ago
Alexander Borzunov 1ea44b0d3c
Measure throughput for different configs, devices, and dtypes separately (#114) 2 years ago
justheuristic 01838f9a99
Fix Linear8bitlt state config, update tests (#112)
* fix state initializer
* update tests to actually use new code
* keep bias during quantization
2 years ago
Aleksandr Borzunov 96033de921 Fix script for running servers robustly 2 years ago
Aleksandr Borzunov 85cf32d2a4 Add script to run servers robustly 2 years ago
justheuristic 088713912d
Patch Linear8bit to enable CxB backward (#111)
A patch to bitsandbytes 0.34.0 that introduces an option to run backward pass in default (fast) matrix layout.
Authors: cxb inversion by @borzunov, original 8bit code by @timdettmers

* optimized layout inversion code by @borzunov ([original code](https://colab.research.google.com/drive/1EJ0MKifajXSSVq7O2_QGwtb0l6gRAGrh?usp=sharing)) to use less forward calls
* implemented CustomLinear8bitLt, a child of Linear8bitLt that can do backward without CB
* added exact match tests for layouts and linear layers: see tests/test_linear8bitlt.py
* switched petals to the new layer type

Core idea: layouts apply the same permutation to every tile in the matrix. We can treat this as (batched) gather ops.
  Reshape input tensor so that ij-th gather operation op will apply to ij-th elements in each tile.

Prototype: 
Layout info: https://github.com/TimDettmers/bitsandbytes/blob/main/csrc/kernels.cu#L2130-L2136


Co-authored-by: Alexander Borzunov <hxrussia@gmail.com>
Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: Tim Dettmers <tim.dettmers@gmail.com>
2 years ago
justheuristic 8dc0f513ba
Hotfix span selection (#110)
Fix an issue in span selection that was introduced in #106
2 years ago
justheuristic a2066a4096
Optimize RemoteSequenceManager (#106)
- [x] made RemoteSequenceManager into a background thread that pre-fetches information instead of running just in time
- [x] moved routing-related stuff to petals.client.routing
- [x] extract remote peer routing information to RemoteSequenceInfo
- [x] made sure that the code survives continued use (e.g. one hour)
- [x] updated every spot where update_ is called manually
- [x] modified get_sequence to check that the thread is alive, warn if not
- [x] removed max_retries, switched rpc_info to exponential backoff
- [x] fixed a bg that causes RemoteSeq* to lose user-defined hyperparameters (e.g. timeout) upon subsequencing (sequential[3:5])
- [x] moved client-side points strategy to client.routing
- [x] ensured that RemoteSequenceManager thread created in get_remote_module properly shuts down when the module is destroyed
- [x] resolved minor affected todos
- [x] modified tests to no longer use PYTHONPATH
- [x] worked around protocol error in rpc_info


Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: Artem Chumachenko <artek.chumak@gmail.com>
2 years ago
Artem Chumachenko 7d859a947b
Expose request_timeout to DistributedBloomConfig (#105)
Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
2 years ago