petals

Commit Graph

Author	SHA1	Message	Date
Max Ryabinin	a0e8bbd28d	Fix arguments in remove_old_models.py (#153 ) * Fix arguments in remove_old_models.py * Remove unnecessary args.author * Fix the GitHub Action as well	2 years ago
Alexander Borzunov	701ec7e53e	Clean up disk space (#152 )	2 years ago
justheuristic	b04982c1a2	Bump transformers to 4.25.1 (#151 ) - latest accelerate, transformers, huggingface_hub - rearrange attention caches to support https://github.com/huggingface/transformers/pull/18344 - remove unused code - fix edge case where session crashes when receiving seq length 0 - assert transformer version when importing WrappedBloomBlock Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com> Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>	2 years ago
Alexander Borzunov	e4dc938dfe	Fix OOMs during server rebalancing (#150 ) The OOMs were caused by the cyclic references `TransformerBackend <-> PrioritizedTaskPool`, which could not be garbage collected properly. Still, I've added explicit tensor removal just in case.	2 years ago
Alexander Borzunov	83d9493b6c	Improve block size calculations (#149 )	2 years ago
Aleksandr Borzunov	f42e559c77	Update README.md	2 years ago
Alexander Borzunov	6beb686909	Add link to privacy & security Wiki (#144 )	2 years ago
Alexander Borzunov	84fec81543	Suppress asyncio error logs by default (#142 )	2 years ago
Alexander Borzunov	e99bf36647	Use common folder for all caches, make it a volume in Dockerfile (#141 )	2 years ago
Alexander Borzunov	5f50ea9c79	Update Anaconda instructions (#140 )	2 years ago
Alexander Borzunov	e1d8793f00	Show route on client (#139 )	2 years ago
Alexander Borzunov	4cb0ac4718	Update texts in "Terms of use" and "Privacy and security" sections (#138 )	2 years ago
Alexander Borzunov	a94c91d870	Add Docker commands, use permanent Discord links (#137 )	2 years ago
Alexander Borzunov	77a00e17f0	Fix "could not unlink the shared memory file" during rebalancing (#135 )	2 years ago
Alexander Borzunov	318d690a5c	Fix waiting until free memory is available (#136 )	2 years ago
Alexander Borzunov	e8fac92e59	Allow .generate() to reuse existing inference session (#132 )	2 years ago
Alexander Borzunov	1fe3716589	Don't ban servers in case of client-caused handler errors (#134 )	2 years ago
Alexander Borzunov	66f1799d32	Set default --step_timeout to 5 min (#133 )	2 years ago
Alexander Borzunov	b873d92ffa	Update README.md	2 years ago
Alexander Borzunov	5d5d2666b8	Mention parallel inference	2 years ago
Alexander Borzunov	955eae30b3	Mention 1 sec/token explicitly	2 years ago
Alexander Borzunov	33c210b973	Update Colab notebook	2 years ago
Alexander Borzunov	f56edaa13f	Fix inference and rpc_info() fault tolerance (#131 )	2 years ago
justheuristic	79a4308992	Clear trigger before engaging in update (#130 ) Update sequence_manager.py	2 years ago
Alexander Borzunov	b8e1c1b7f5	Revert to hivemind==1.1.3 for stability (#129 )	2 years ago
justheuristic	68c85e7492	Avoid synchronous updates, ban peers based on request outcome (#127 ) - sequence_manager now takes care of keeping itself up to date - no need to update it manually - if a peer fails a request, the sequence manager bans that peer temporarily; ban times increase with failure streaks Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>	2 years ago
Alexander Borzunov	9dbf5e2e6f	Set dht.num_workers = n_layer, update_period = 150, expiration = 300 (#125 )	2 years ago
Max Ryabinin	3ca8b4f082	Fix typos with codespell (#126 )	2 years ago
justheuristic	8491ed2bd3	Add checks for forward() inputs on the client side (#123 )	2 years ago
Max Ryabinin	055f85b83e	Call block.load_state_dict only once (#124 )	2 years ago
Artem Chumachenko	0855aa7347	Update notebooks to use full BLOOM-176B (#104 ) Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>	2 years ago
Max Ryabinin	4ffb4d83c7	Remove "-r" when installing Petals in examples (#122 )	2 years ago
Alexander Borzunov	d29ef70c85	Update README.md	2 years ago
Alexander Borzunov	1d9aa77697	Update README.md	2 years ago
Alexander Borzunov	da36470a4b	Update README.md	2 years ago
Alexander Borzunov	81b94df14b	Rework readme, move code example to the top, link draft of Colab (#118 )	2 years ago
Alexander Borzunov	893987ebf8	Require hivemind==1.1.4 with p2pd v0.3.13 (#121 )	2 years ago
Alexander Borzunov	fc6722576b	Choose --num_blocks for bigscience/bloom-petals automatically (#119 )	2 years ago
Alexander Borzunov	f72c220404	Suppress quantization warning and fix dtype defaults in compute benchmark (#117 )	2 years ago
Alexander Borzunov	643a054170	Make server use smart defaults (#115 ) Summary: ```python parser.add_argument('--attn_cache_size', type=str, default=None, help='The size of GPU memory allocated for storing past attention keys/values between inference steps. ' 'Examples: 500MB, 1.2GB, 1073741824 (bytes). Note that 1KB != 1KiB here. ' 'Default: 0.5GiB * num_blocks * hidden_size / 14336. ' 'The latter is the hidden size of the bigscience/bloom-petals model.') parser.add_argument('--request_timeout', type=float, required=False, default=3 * 60, help='Timeout (in seconds) for the whole rpc_forward/rpc_backward/rpc_forward_stream/rpc_backward_stream request') parser.add_argument('--session_timeout', type=float, required=False, default=30 * 60, help='Timeout (in seconds) for the whole inference session') parser.add_argument('--step_timeout', type=float, required=False, default=60, help="Timeout (in seconds) for waiting the next step's inputs inside an inference session") parser.add_argument('--load_in_8bit', type=bool, default=None, help="Convert the loaded model into mixed-8bit quantized model. Default: True if GPU is available") ``` Co-authored-by: justheuristic <justheuristic@gmail.com>	2 years ago
justheuristic	9e11f73242	Fix tile size on Ampere (#116 ) Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>	2 years ago
justheuristic	617d70f7dc	Support --load_in_8bit on pre-Turing GPUs (#113 ) - Linear8bitLt now supports pre-Turing GPUs by temporarily upcasting quantized weights. - added a test for Linear8bitLt accuracy with the new fallback; the accuracy is similar to the real thing (slightly better due to non-quantized A) - performance is roughly halfway between the default mode and memory_efficient_backward Alternatives considered: - cupy - slow, casting to float internally - triton - fast but unstable af; every 3rd attempt to matmul is a segfault - bnb.functional.igemm (no lt) - "CuBLAS Error 8" on old GPUs Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>	2 years ago
Alexander Borzunov	1ea44b0d3c	Measure throughput for different configs, devices, and dtypes separately (#114 )	2 years ago
justheuristic	01838f9a99	Fix Linear8bitlt state config, update tests (#112 ) * fix state initializer * update tests to actually use new code * keep bias during quantization	2 years ago
Aleksandr Borzunov	96033de921	Fix script for running servers robustly	2 years ago
Aleksandr Borzunov	85cf32d2a4	Add script to run servers robustly	2 years ago
justheuristic	088713912d	Patch Linear8bit to enable CxB backward (#111 ) A patch to bitsandbytes 0.34.0 that introduces an option to run backward pass in default (fast) matrix layout. Authors: cxb inversion by @borzunov, original 8bit code by @timdettmers * optimized layout inversion code by @borzunov ([original code](https://colab.research.google.com/drive/1EJ0MKifajXSSVq7O2_QGwtb0l6gRAGrh?usp=sharing)) to use less forward calls * implemented CustomLinear8bitLt, a child of Linear8bitLt that can do backward without CB * added exact match tests for layouts and linear layers: see tests/test_linear8bitlt.py * switched petals to the new layer type Core idea: layouts apply the same permutation to every tile in the matrix. We can treat this as (batched) gather ops. Reshape input tensor so that ij-th gather operation op will apply to ij-th elements in each tile. Prototype: Layout info: https://github.com/TimDettmers/bitsandbytes/blob/main/csrc/kernels.cu#L2130-L2136 Co-authored-by: Alexander Borzunov <hxrussia@gmail.com> Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com> Co-authored-by: Tim Dettmers <tim.dettmers@gmail.com>	2 years ago
justheuristic	8dc0f513ba	Hotfix span selection (#110 ) Fix an issue in span selection that was introduced in #106	2 years ago
justheuristic	a2066a4096	Optimize RemoteSequenceManager (#106 ) - [x] made RemoteSequenceManager into a background thread that pre-fetches information instead of running just in time - [x] moved routing-related stuff to petals.client.routing - [x] extract remote peer routing information to RemoteSequenceInfo - [x] made sure that the code survives continued use (e.g. one hour) - [x] updated every spot where update_ is called manually - [x] modified get_sequence to check that the thread is alive, warn if not - [x] removed max_retries, switched rpc_info to exponential backoff - [x] fixed a bug that causes RemoteSeq* to lose user-defined hyperparameters (e.g. timeout) upon subsequencing (sequential[3:5]) - [x] moved client-side points strategy to client.routing - [x] ensured that RemoteSequenceManager thread created in get_remote_module properly shuts down when the module is destroyed - [x] resolved minor affected todos - [x] modified tests to no longer use PYTHONPATH - [x] worked around protocol error in rpc_info Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com> Co-authored-by: Artem Chumachenko <artek.chumak@gmail.com>	2 years ago
Artem Chumachenko	7d859a947b	Expose request_timeout to DistributedBloomConfig (#105 ) Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>	2 years ago
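
The help text quoted in commit 643a054170 gives the default for --attn_cache_size as 0.5GiB * num_blocks * hidden_size / 14336, where 14336 is the hidden size of bigscience/bloom-petals. A minimal sketch of that arithmetic follows; the function and constant names are hypothetical, and rounding down to whole bytes is an assumption, not something the commit message states.

```python
GIB = 2**30  # bytes in one GiB (binary unit, as in the quoted help text)
BLOOM_HIDDEN_SIZE = 14336  # hidden size of bigscience/bloom-petals, per the commit message


def default_attn_cache_bytes(num_blocks: int, hidden_size: int) -> int:
    """Return 0.5 GiB * num_blocks * hidden_size / 14336 in bytes.

    Hypothetical helper: integer arithmetic is used so the result is
    rounded down to a whole number of bytes (an assumption).
    """
    return GIB * num_blocks * hidden_size // (2 * BLOOM_HIDDEN_SIZE)


if __name__ == "__main__":
    # One full-width block gets exactly 0.5 GiB under this formula.
    print(default_attn_cache_bytes(num_blocks=1, hidden_size=14336))
```

Under this formula the cache grows linearly with both the number of blocks served and the model width, so a server hosting a model with half the hidden size would get half the cache per block.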


380 Commits (675bacb592bac7145d38ded2ea746da2b9b6c391)