Run large language models at home, BitTorrent-style.
Fine-tuning and inference up to 10x faster than offloading


Generate text with distributed Llama 2 (70B), Stable Beluga 2, Guanaco-65B or BLOOM-176B and fine-tune them for your own tasks, right from your desktop computer or Google Colab:

from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Choose any model available at https://health.petals.dev
model_name = "petals-team/StableBeluga2"

# Connect to a distributed network hosting model layers
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Run the model as if it were on your computer
inputs = tokenizer("A cat sat", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))  # A cat sat on a mat...

🚀  Try now in Colab

🦙 Want to run Llama 2? Request access to its weights at the ♾️ Meta AI website and 🤗 Model Hub, then run huggingface-cli login in the terminal before loading the model. Or just try it in our chatbot app.

🔏 Privacy. Your data will be processed by other people in the public swarm. Learn more about privacy here. For sensitive data, you can set up a private swarm among people you trust.

💬 Any questions? Ping us in our Discord!

Connect your GPU and increase Petals capacity

Petals is a community-run system: we rely on people sharing their GPUs. You can check out the available models and help serve one of them. As an example, here is how to host a part of Stable Beluga 2 on your GPU:

🐧 Linux + Anaconda. Run these commands for NVIDIA GPUs (or follow this for AMD):

conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install git+https://github.com/bigscience-workshop/petals
python -m petals.cli.run_server petals-team/StableBeluga2

🪟 Windows + WSL. Follow this guide on our Wiki.

🐋 Docker. Run our Docker image for NVIDIA GPUs (or follow this for AMD):

sudo docker run -p 31330:31330 --ipc host --gpus all --volume petals-cache:/cache --rm \
    learningathome/petals:main \
    python -m petals.cli.run_server --port 31330 petals-team/StableBeluga2

🍏 macOS + Apple M1/M2 GPU. Install Homebrew, then run these commands:

brew install python
python3 -m pip install git+https://github.com/bigscience-workshop/petals
python3 -m petals.cli.run_server petals-team/StableBeluga2

📚  Learn more (how to use multiple GPUs, start the server on boot, etc.)

🦙 Want to host Llama 2? Request access to its weights at the ♾️ Meta AI website and 🤗 Model Hub, generate an 🔑 access token, then add --token YOUR_TOKEN_HERE to the python -m petals.cli.run_server command.

🔒 Security. Hosting a server does not allow others to run custom code on your computer. Learn more here.

🏆 Thank you! Once you load and host 10+ blocks, we can show your name or link on the swarm monitor as a way to say thanks. You can specify them with --public_name YOUR_NAME.
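
The server options mentioned above (--port, --token, --public_name) can be combined in one command. Below is a minimal sketch of a helper that assembles that command line in Python. The helper name build_server_command is hypothetical and not part of Petals; it only uses the flags shown in this README and assumes the model name comes last, as in the commands above.

```python
import shlex
from typing import List, Optional


def build_server_command(
    model_name: str,
    port: Optional[int] = None,
    token: Optional[str] = None,
    public_name: Optional[str] = None,
) -> List[str]:
    """Assemble the run_server command from the flags shown in this README.

    Hypothetical helper for illustration; it does not start anything itself.
    """
    cmd = ["python", "-m", "petals.cli.run_server"]
    if port is not None:
        cmd += ["--port", str(port)]
    if token:
        cmd += ["--token", token]  # needed for gated models such as Llama 2
    if public_name:
        cmd += ["--public_name", public_name]  # shown on the swarm monitor
    cmd.append(model_name)
    return cmd


command = build_server_command("petals-team/StableBeluga2", public_name="YOUR_NAME")
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command) if you prefer to launch the server from a script rather than from the terminal.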

How does it work?

  • Petals runs large language models like Llama and BLOOM collaboratively — you load a small part of the model, then join people serving the other parts to run inference or fine-tuning.
  • Single-batch inference runs at up to 6 steps/sec for Llama 2 (70B) and ≈ 1 step/sec for BLOOM-176B. This is up to 10x faster than offloading, enough to build chatbots and other interactive apps. Parallel inference reaches hundreds of tokens/sec.
  • Beyond classic language model APIs — you can employ any fine-tuning and sampling methods, execute custom paths through the model, or see its hidden states. You get the comforts of an API with the flexibility of PyTorch.
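
To make the idea of "each participant loads a part of the model" more concrete, here is a toy sketch in plain Python. It is not Petals' actual routing logic or API: ToyServer, plan_route, and run_through_swarm are hypothetical names, the "blocks" are simple integer functions instead of transformer layers, and there is no networking. It only illustrates that chaining servers that together cover all blocks gives the same result as running every block locally.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Block = Callable[[int], int]


@dataclass
class ToyServer:
    """Hypothetical stand-in for one participant holding blocks [start, end)."""
    name: str
    start: int  # first block held (inclusive)
    end: int    # last block held (exclusive)

    def forward(self, hidden: int, blocks: List[Block], first_block: int) -> int:
        # In this toy every server sees the full list; a real server would
        # only hold the weights of its own slice.
        for block in blocks[first_block:self.end]:
            hidden = block(hidden)
        return hidden


def plan_route(servers: List[ToyServer], num_blocks: int) -> List[Tuple[ToyServer, int]]:
    """Greedily pick servers whose ranges cover blocks 0..num_blocks in order."""
    route, position = [], 0
    while position < num_blocks:
        candidates = [s for s in servers if s.start <= position < s.end]
        if not candidates:
            raise RuntimeError(f"no server holds block {position}")
        best = max(candidates, key=lambda s: s.end)  # the one that reaches furthest
        route.append((best, position))
        position = best.end
    return route


def run_through_swarm(hidden: int, servers: List[ToyServer], blocks: List[Block]) -> int:
    for server, first_block in plan_route(servers, len(blocks)):
        hidden = server.forward(hidden, blocks, first_block)
    return hidden


# Eight toy "layers"; block i maps h to 2 * h + i, so the order of blocks matters.
blocks = [lambda h, i=i: 2 * h + i for i in range(8)]
servers = [ToyServer("A", 0, 3), ToyServer("B", 2, 6), ToyServer("C", 5, 8)]

local = 1
for block in blocks:
    local = block(local)

distributed = run_through_swarm(1, servers, blocks)
route_names = [server.name for server, _ in plan_route(servers, len(blocks))]
print(route_names, distributed == local)
```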

📜  Read paper            📚  See FAQ

📚 Tutorials, examples, and more

Basic tutorials:

  • Getting started: tutorial
  • Prompt-tune Llama-65B for text semantic classification: tutorial
  • Prompt-tune BLOOM to create a personified chatbot: tutorial

Useful tools:

Advanced guides:

  • Launch a private swarm: guide
  • Run a custom model: guide

Benchmarks

The benchmarks below are for BLOOM-176B:

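| Setup | Bandwidth | Round-trip latency | Single-batch inference (steps/s), sequence length 128 | Single-batch inference (steps/s), sequence length 2048 | Parallel forward (tokens/s), batch size 1 | Parallel forward (tokens/s), batch size 64 |
| --- | --- | --- | --- | --- | --- | --- |
| Offloading, max. possible speed on 1x A100 ¹ | 256 Gbit/s | | 0.18 | 0.18 | 2.7 | 170.3 |
| Offloading, max. possible speed on 1x A100 ¹ | 128 Gbit/s | | 0.09 | 0.09 | 2.4 | 152.8 |
| Petals on 14 heterogeneous servers across Europe and North America ² | Real world | Real world | 0.83 | 0.79 | 32.6 | 179.4 |
| Petals on 3 servers, with one A100 each ³ | 1 Gbit/s | < 5 ms | 1.71 | 1.54 | 70.0 | 253.6 |
| Petals on 3 servers, with one A100 each ³ | 100 Mbit/s | < 5 ms | 1.66 | 1.49 | 56.4 | 182.0 |
| Petals on 3 servers, with one A100 each ³ | 100 Mbit/s | 100 ms | 1.23 | 1.11 | 19.7 | 112.2 |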

1 An upper bound for offloading performance. We base our offloading numbers on the best possible hardware setup for offloading: CPU RAM offloading via PCIe 4.0 with 16 PCIe lanes per GPU and PCIe switches for pairs of GPUs. We assume zero latency for the upper bound estimation. In 8-bit, the model uses 1 GB of memory per billion parameters. PCIe 4.0 with 16 lanes has a throughput of 256 Gbit/s, so offloading 176B parameters takes 5.5 seconds. The throughput is halved (128 Gbit/s) if two GPUs sit behind the same PCIe switch.
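
The arithmetic behind this upper bound can be checked in a few lines. The sketch below uses only the figures stated in the footnote (176B parameters, 1 byte per parameter in 8-bit, 256 or 128 Gbit/s, zero latency); the function names are ours, chosen for illustration.

```python
def offloading_seconds_per_step(params_billion: float,
                                bytes_per_param: float,
                                link_gbit_per_s: float) -> float:
    """Time to stream all weights over the link once, ignoring latency."""
    model_gbyte = params_billion * bytes_per_param  # 1 GB per billion params in 8-bit
    model_gbit = model_gbyte * 8                    # gigabytes -> gigabits
    return model_gbit / link_gbit_per_s


def offloading_steps_per_second(params_billion: float,
                                bytes_per_param: float,
                                link_gbit_per_s: float) -> float:
    return 1.0 / offloading_seconds_per_step(params_billion, bytes_per_param, link_gbit_per_s)


for bandwidth in (256, 128):
    seconds = offloading_seconds_per_step(176, 1, bandwidth)
    steps = offloading_steps_per_second(176, 1, bandwidth)
    print(f"{bandwidth} Gbit/s: {seconds:.1f} s per step, {steps:.2f} steps/s")
```

This gives 5.5 s per step (about 0.18 steps/s) at 256 Gbit/s and 11 s per step (about 0.09 steps/s) at 128 Gbit/s, matching the single-batch offloading figures in the benchmark.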

2 A real-world distributed setting with 14 servers holding 2× RTX 3060, 4× 2080Ti, 2× 3090, 2× A4000, and 4× A5000 GPUs. These are personal servers and servers from university labs, spread across Europe and North America and connected to the Internet at speeds of 100–1000 Mbit/s. Four servers operate behind firewalls.

3 An optimistic setup that requires the least communication. The client nodes have 8 CPU cores and no GPU.

We provide more evaluations and discuss these results in more detail in Section 3.3 of our paper.
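
As a quick sanity check on the "up to 10x faster than offloading" claim, the single-batch inference figures above (sequence length 128) can be compared against the best offloading number. This is a small sketch using only values from the benchmark; the dictionary labels are shorthand of ours.

```python
# Single-batch inference speed in steps/s at sequence length 128, from the benchmark above.
OFFLOADING_BEST = 0.18  # offloading upper bound at 256 Gbit/s

petals_steps_per_s = {
    "14 heterogeneous servers, real world": 0.83,
    "3x A100, 1 Gbit/s, < 5 ms": 1.71,
    "3x A100, 100 Mbit/s, < 5 ms": 1.66,
    "3x A100, 100 Mbit/s, 100 ms": 1.23,
}

speedups = {setup: speed / OFFLOADING_BEST for setup, speed in petals_steps_per_s.items()}

for setup, factor in speedups.items():
    print(f"{setup}: {factor:.1f}x faster than offloading")
```

The best case (3 servers on a 1 Gbit/s link) comes out at about 9.5x, and the real-world swarm at about 4.6x.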

🛠️ Contributing

Please see our FAQ on contributing.

📜 Citation

Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Max Ryabinin, Younes Belkada, Artem Chumachenko, Pavel Samygin, and Colin Raffel. Petals: Collaborative Inference and Fine-tuning of Large Models. arXiv preprint arXiv:2209.01188, 2022.

@article{borzunov2022petals,
  title = {Petals: Collaborative Inference and Fine-tuning of Large Models},
  author = {Borzunov, Alexander and Baranchuk, Dmitry and Dettmers, Tim and Ryabinin, Max and Belkada, Younes and Chumachenko, Artem and Samygin, Pavel and Raffel, Colin},
  journal = {arXiv preprint arXiv:2209.01188},
  year = {2022},
  url = {https://arxiv.org/abs/2209.01188}
}

This project is a part of the BigScience research workshop.