petals

Commit Graph

Author	SHA1	Message	Date
Artem Chumachenko	d6f4f80f3f	Fix Mixtral-related issues (#570 ) This PR fixes problems related to #569: - block initialization - throughput calculation and cache usage - mixtral in tests Beam search is removed for Mixtral and Llama for now. Those models use DynamicCache, which requires special function to change: (see https://github.com/huggingface/transformers/blob/main/src/transformers/cache_utils.py#L161) --------- Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>	1 month ago
Artem Chumachenko	d2fcbbc72e	Add Mixtral models (#553 ) * Add somehow workable version * Fix generation * Fixes * Choose right attn * style * fix bloom * remove unnes * Update src/petals/models/mixtral/model.py Co-authored-by: Max Ryabinin <mryabinin0@gmail.com> * fix order of init --------- Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>	2 months ago
justheuristic	2ad0b2b936	Fix p2p pushing in rpc_inference (by @miaoqijun ) , support transformers 4.38.2 (#563 ) This pull request solves #560 using a solution proposed by @miaoqijun . It also bumps transformers to the latest version to test with the latest code. --------- Co-authored-by: Yingtong Dou <ytongdou@gmail.com>	2 months ago
Denis Mazur	0d91bbdac3	Bump transformers and accelerate versions (#554 ) Bump versions for transformers and accelerate, remove falcon-rw-1b CI tests	3 months ago
Max Ryabinin	03cbe90234	Optimize LLaMA for inference (#513 ) * Optimize LLaMa for inference * Fix model type detection in tests	6 months ago
Max Ryabinin	ae19b65095	Add position_ids argument to DistributedFalconModel (#525 )	8 months ago
Alexander Borzunov	158621677b	Bump version to 2.2.0 (#502 )	9 months ago
Max Ryabinin	1ebd88ae7b	Optimize the Falcon block for inference (#500 ) This PR attempts to optimize the inference of Falcon models in the single-token setup by reducing the majority of Python overhead and making several assumptions about the setup. Specifically, * Layer normalization, QKV projection (with splitting) and rotary embeddings are executed through CUDA graphs, which reduces most overhead related to small kernel launche * If no sin/cos tensors are cached by the rotary embedding layer, we cache them for 8192 tokens (INFERENCE_MAX_LENGTH) during the first forward pass. In general, it should be beneficial to always run a max-length sequence before starting a block, but this is a question for another PR The PR also adds a small test to ensure that the results (without quantization) of the block before and after quantization indeed match. Lastly, the pull request makes the backward pass work (as discussed in https://github.com/bigscience-workshop/petals/pull/499) by making cached sin/cos for RotaryEmbedding into buffers and disabling the inference mode during their creation.	9 months ago
Alexander Borzunov	d40eb6c701	Fix prompt tuning after #464 (#501 ) Unfortunately, running inference in models with `"ptune" in config.tuning_mode` was broken after #464.	9 months ago
Alexander Borzunov	dd4a3230bc	Add Falcon support (#499 ) This PR adds: - Support for models based on `transformers.FalconModel` (the in-library format for Falcon). Tested on Falcon-40B. - CI tests for Falcon-RW-1B. - `--throughput dry_run` option to evaluate throughput and exit right away (implemented by @mryab). Limitations: - Backward pass support is broken for now, will be fixed in #500. Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>	9 months ago
Alexander Borzunov	b4d822afb2	Force use_cache=True in config only (#497 ) This reverts a part of #496 and instead overrides `use_cache` in `LlamaConfig`s only (so the correct value is visible by HF `.generate()` as well).	9 months ago
Alexander Borzunov	abd547735f	Force use_cache=True (#496 )	9 months ago
Alexander Borzunov	6bb3f54e39	Replace dots in repo names when building DHT prefixes (#489 )	9 months ago
Alexander Borzunov	de2475f31c	Make client compatible with transformers' GenerationMixin (#464 ) This PR drops custom generation codes and introduces compatibility with `transformers.GenerationMixin` instead. This includes support for more sampling options (`top_p`, `top_k`, `repetition_penalty` requested in #460) and beam search - all that is now identical to running model with transformers locally. Most features (excluding beam search and other rarely used stuff) are also compatible with resuming existing sessions. ### Breaking changes If `.generate()` or forward passes are being run inside an `.inference_session()` context, they now use the opened session by default. So, these snippets are now equivalent: ```python # Using default session with model.inference_session(max_length=100): output_ids = model.generate(input_ids, max_new_tokens=3) # Explicitly specifying a session with model.inference_session(max_length=100) as sess: output_ids = model.generate(input_ids, max_new_tokens=3, session=sess) ``` Earlier, the 1st snippet was creating a new session, which is not what most people expected (= such code was most likely to introduce a bug, which is now fixed).	9 months ago
Alexander Borzunov	063e94b4c8	Move SequenceManagerConfig -> ClientConfig, petals.dht_utils -> petals.utils.dht (#463 )	9 months ago
Alexander Borzunov	057a2fb5de	Support Llama 2 (#379 )	10 months ago
Alexander Borzunov	c735dd7ba3	Update transformers to 4.31.0 and peft to 0.4.0 (#371 )	10 months ago
Alexander Borzunov	2c8959e713	Share more info about a server in DHT (#355 )	10 months ago
Alexander Borzunov	f605f093f7	Support LLaMA repos without "-hf" suffix (#349 )	10 months ago
Alexander Borzunov	d126ee3053	Add benchmark scripts (#319 ) This PR: - Adds benchmark scripts for inference, forward pass, and full training step (e.g. used for experiments in our paper). - Fixes bug with dtypes in `petals.DistributedBloomForSequenceClassification`. - (minor refactor) Moves `DTYPE_MAP` to `petals.constants` as a useful constant.	11 months ago
Alexander Borzunov	fecee8c4dc	Show license links when loading models (#332 )	11 months ago
Alexander Borzunov	47a2b1ee65	Fix llama's lm_head.weight.requires_grad (#330 ) By default, `llama's lm_head.weight.requires_grad` was True, but we expect it to be False.	11 months ago
Alexander Borzunov	7a37513f77	Add AutoDistributed{Model, ModelForCausalLM, ModelForSequenceClassification} (#329 ) This PR adds `petals.AutoDistributed{Model, ModelForCausalLM, ModelForSequenceClassification}` classes, similar to their `transformers.Auto{Model, ModelForCausalLM, ModelForSequenceClassification}` counterparts.	11 months ago
Alexander Borzunov	cb3f018f9f	Add LLaMA support (#323 ) This PR: 1. Abolishes the model conversion procedure. Now, models are downloaded directly from original repositories like https://huggingface.co/bigscience/bloom. Servers download only shards with blocks to be hosted, and clients download only shards with input/output embeddings and layernorms. - BLOOM is loaded from `bigscience/bloom`, but we use the DHT prefix `bigscience/bloom-petals` for backward compatibility. Same with smaller BLOOMs and BLOOMZ. - LLaMA can be loaded from any repo like `username/llama-65b-hf`, but we use the DHT prefix `llama-65b-hf` (without the username) to accomodate blocks from different repos (there're a few of them with minor differences, such as `Llama` vs. `LLaMA` in the class name). 2. Refactors the client to generalize it for multiple models. Now, we have `petals.models` packages that contain model-specific code (e.g. `petals.models.bloom`, `petals.models.llama`). General code (e.g. CPU-efficient LM head, p-tuning) is kept in `petals.client`. 3. Introduces `WrappedLlamaBlock`, `DistributedLlamaConfig`, `DistributedLlamaForCausalLM`, `DistributedLlamaForSequenceClassification`, and `DistributedLlamaModel` compatible with Petals functionality (p-tuning, adapters, etc.). 4. Introduces `AutoDistributedConfig` that automatically chooses the correct config class (`DistributedLlamaConfig` or `DistributedBloomConfig`). The refactored configs contain all model-specific info for both clients and servers. Upgrade instructions: - Remove disk caches for blocks in old (converted) format to save disk space. That is, remove `~/.cache/petals/model--bigscience--bloom-petals` and `~/.cache/petals/model--bigscience--bloomz-petals` directories (if present).	11 months ago

24 Commits (main)