petals




Run large language models at home, BitTorrent-style.
Fine-tuning and inference up to 10x faster than offloading

Generate text with distributed Llama 2 (70B), Stable Beluga 2, Guanaco-65B or BLOOM-176B and fine‑tune them for your own tasks — right from your desktop computer or Google Colab:

from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Choose any model available at https://health.petals.dev
model_name = "petals-team/StableBeluga2"

# Connect to a distributed network hosting model layers
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Run the model as if it were on your computer
inputs = tokenizer("A cat sat", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))  # A cat sat on a mat...

🚀 Try now in Colab

🦙 Want to run Llama 2? Request access to its weights at the ♾️ Meta AI website and 🤗 Model Hub, then run huggingface-cli login in the terminal before loading the model. Or just try it in our chatbot app.

🔏 Privacy. Your data will be processed by other people in the public swarm. Learn more about privacy here. For sensitive data, you can set up a private swarm among people you trust.

💬 Any questions? Ping us in our Discord!

Connect your GPU and increase Petals capacity

Petals is a community-run system: we rely on people sharing their GPUs. You can check out the available models and help serve one of them! As an example, here is how to host a part of Stable Beluga 2 on your GPU:

🐧 Linux + Anaconda. Run these commands for NVIDIA GPUs (or follow this for AMD):

conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install git+https://github.com/bigscience-workshop/petals
python -m petals.cli.run_server petals-team/StableBeluga2

🪟 Windows + WSL. Follow this guide on our Wiki.

🐋 Docker. Run our Docker image for NVIDIA GPUs (or follow this for AMD):

sudo docker run -p 31330:31330 --ipc host --gpus all --volume petals-cache:/cache --rm \
    learningathome/petals:main \
    python -m petals.cli.run_server --port 31330 petals-team/StableBeluga2

🍏 macOS + Apple M1/M2 GPU. Install Homebrew, then run these commands:

brew install python
python3 -m pip install git+https://github.com/bigscience-workshop/petals
python3 -m petals.cli.run_server petals-team/StableBeluga2

📚 Learn more (how to use multiple GPUs, start the server on boot, etc.)

💬 Any questions? Ping us in our Discord!

🦙 Want to host Llama 2? Request access to its weights at the ♾️ Meta AI website and 🤗 Model Hub, generate an 🔑 access token, then add --token YOUR_TOKEN_HERE to the python -m petals.cli.run_server command.

🔒 Security. Hosting a server does not allow others to run custom code on your computer. Learn more here.

🏆 Thank you! Once you load and host 10+ blocks, we can show your name or link on the swarm monitor as a way to say thanks. You can specify them with --public_name YOUR_NAME.

How does it work?

Petals runs large language models like Llama and BLOOM collaboratively — you load a small part of the model, then join people serving the other parts to run inference or fine-tuning.
Single-batch inference runs at up to 6 steps/sec for Llama 2 (70B) and ≈ 1 step/sec for BLOOM-176B. This is up to 10x faster than offloading, enough to build chatbots and other interactive apps. Parallel inference reaches hundreds of tokens/sec.
Petals goes beyond classic language model APIs: you can employ any fine-tuning and sampling methods, execute custom paths through the model, or inspect its hidden states. You get the comforts of an API with the flexibility of PyTorch.
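The idea of each participant holding a small part of the model can be sketched with a toy example. This is not how Petals is implemented: `ToyServer`, `split_blocks`, and `run_chain` are hypothetical names, plain Python functions stand in for transformer blocks, and everything runs in one process with no networking. It only shows that chaining contiguous slices of blocks gives the same result as running all blocks locally.

```python
# Toy illustration only: plain Python functions stand in for transformer blocks,
# and "servers" are just objects in one process. No networking happens here.
from dataclasses import dataclass
from typing import Callable, List

Block = Callable[[float], float]


@dataclass
class ToyServer:
    """Hypothetical stand-in for a peer hosting a contiguous slice of blocks."""
    name: str
    start: int            # index of the first hosted block
    blocks: List[Block]

    @property
    def end(self) -> int:
        return self.start + len(self.blocks)

    def forward(self, hidden: float) -> float:
        # Apply the hosted blocks in order to the incoming activation.
        for block in self.blocks:
            hidden = block(hidden)
        return hidden


def split_blocks(blocks: List[Block], names: List[str]) -> List[ToyServer]:
    """Assign contiguous, nearly equal slices of blocks to the named servers."""
    per_server, extra = divmod(len(blocks), len(names))
    servers, start = [], 0
    for i, name in enumerate(names):
        size = per_server + (1 if i < extra else 0)
        servers.append(ToyServer(name, start, blocks[start:start + size]))
        start += size
    return servers


def run_chain(servers: List[ToyServer], hidden: float) -> float:
    """Client side: pass the activation through the servers in block order."""
    for server in sorted(servers, key=lambda s: s.start):
        hidden = server.forward(hidden)
    return hidden


# Eight toy "blocks": block i adds i to its input.
toy_blocks = [lambda x, i=i: x + i for i in range(8)]
toy_servers = split_blocks(toy_blocks, ["alice", "bob", "carol"])

# Reference result: run every block locally.
local_result = 0.0
for block in toy_blocks:
    local_result = block(local_result)

distributed_result = run_chain(toy_servers, 0.0)
print([(s.name, s.start, s.end) for s in toy_servers])
print(local_result, distributed_result)
```

In the real system the slices live on different machines, so each hop between servers adds network latency. That is why the benchmarks report results for different bandwidths and round-trip latencies.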

📜 Read paper 📚 See FAQ

📚 Tutorials, examples, and more

Basic tutorials:

Getting started: tutorial
Prompt-tune Llama-65B for text semantic classification: tutorial
Prompt-tune BLOOM to create a personified chatbot: tutorial

Useful tools:

Chatbot web app (connects to Petals via an HTTP/WebSocket endpoint): source code
Monitor for the public swarm: source code

Advanced guides:

Launch a private swarm: guide
Run a custom model: guide

Benchmarks

The benchmarks below are for BLOOM-176B:

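| Setup | Bandwidth | Round-trip latency | Single-batch inference (steps/s), sequence length 128 | Single-batch inference (steps/s), sequence length 2048 | Parallel forward (tokens/s), batch size 1 | Parallel forward (tokens/s), batch size 64 |
|---|---|---|---|---|---|---|
| Offloading, max. possible speed on 1x A100 ¹ | 256 Gbit/s | | 0.18 | 0.18 | 2.7 | 170.3 |
| Offloading, max. possible speed on 1x A100 ¹ | 128 Gbit/s | | 0.09 | 0.09 | 2.4 | 152.8 |
| Petals on 14 heterogeneous servers across Europe and North America ² | Real world | | 0.83 | 0.79 | 32.6 | 179.4 |
| Petals on 3 servers, with one A100 each ³ | 1 Gbit/s | < 5 ms | 1.71 | 1.54 | 70.0 | 253.6 |
| Petals on 3 servers, with one A100 each ³ | 100 Mbit/s | < 5 ms | 1.66 | 1.49 | 56.4 | 182.0 |
| Petals on 3 servers, with one A100 each ³ | 100 Mbit/s | 100 ms | 1.23 | 1.11 | 19.7 | 112.2 |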

¹ An upper bound for offloading performance. We base our offloading numbers on the best possible hardware setup for offloading: CPU RAM offloading via PCIe 4.0 with 16 PCIe lanes per GPU and PCIe switches for pairs of GPUs. We assume zero latency for the upper bound estimation. In 8-bit, the model uses 1 GB of memory per billion parameters, so BLOOM-176B takes 176 GB (1408 Gbit). PCIe 4.0 with 16 lanes has a throughput of 256 Gbit/s, so offloading 176B parameters takes 5.5 seconds. The throughput is halved (128 Gbit/s) if two GPUs sit behind the same PCIe switch.
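The arithmetic in this footnote can be reproduced in a few lines. This is a minimal sketch that assumes every inference step transfers all weights over PCIe exactly once and that nothing else limits speed. The function name `offloading_upper_bound` is ours, not part of Petals.

```python
def offloading_upper_bound(params_billion: float,
                           bytes_per_param: float,
                           bandwidth_gbit_s: float) -> tuple:
    """Return (seconds per step, steps per second) for weight offloading.

    Assumption: each step moves the whole model over the link once,
    with zero latency, as in the footnote's upper-bound estimate.
    """
    model_gbyte = params_billion * bytes_per_param   # 1 GB per billion params in 8-bit
    model_gbit = model_gbyte * 8                      # gigabytes -> gigabits
    seconds_per_step = model_gbit / bandwidth_gbit_s
    return seconds_per_step, 1.0 / seconds_per_step


# BLOOM-176B in 8-bit over PCIe 4.0 x16, alone or sharing a PCIe switch.
for bandwidth in (256, 128):
    seconds, steps = offloading_upper_bound(176, 1, bandwidth)
    print(f"{bandwidth} Gbit/s: {seconds:.1f} s/step, {steps:.2f} steps/s")
```

The two results round to 0.18 and 0.09 steps/s, matching the offloading rows for single-batch inference in the table.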

² A real-world distributed setting with 14 servers holding 2× RTX 3060, 4× 2080Ti, 2× 3090, 2× A4000, and 4× A5000 GPUs. These are personal servers and servers from university labs, spread across Europe and North America and connected to the Internet at speeds of 100–1000 Mbit/s. 4 servers operate behind firewalls.

³ An optimistic setup that requires the least communication. The client nodes have 8 CPU cores and no GPU.

We provide more evaluations and discuss these results in more detail in Section 3.3 of our paper.
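To relate the table to the "up to 10x faster than offloading" claim, the single-batch speedups can be computed directly from the reported numbers. The sketch below uses only the sequence-length-128 values from the table and takes the faster offloading row (256 Gbit/s) as the baseline. The dictionary keys are our own labels.

```python
# Single-batch inference, sequence length 128, steps/s (values copied from the table).
offloading_steps_per_s = 0.18  # offloading upper bound at 256 Gbit/s

petals_steps_per_s = {
    "14 real-world servers": 0.83,
    "3 servers, 1 Gbit/s, < 5 ms": 1.71,
    "3 servers, 100 Mbit/s, < 5 ms": 1.66,
    "3 servers, 100 Mbit/s, 100 ms": 1.23,
}

# Speedup of each Petals setup over the offloading upper bound.
speedups = {
    setup: steps / offloading_steps_per_s
    for setup, steps in petals_steps_per_s.items()
}

for setup, speedup in speedups.items():
    print(f"{setup}: {speedup:.1f}x faster than offloading")
```

Under these numbers the speedup ranges from roughly 4.6x in the real-world swarm to about 9.5x in the 3-server setup with 1 Gbit/s links.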

🛠️ Contributing

Please see our FAQ on contributing.

📜 Citation

Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Max Ryabinin, Younes Belkada, Artem Chumachenko, Pavel Samygin, and Colin Raffel. Petals: Collaborative Inference and Fine-tuning of Large Models. arXiv preprint arXiv:2209.01188, 2022.

@article{borzunov2022petals,
  title = {Petals: Collaborative Inference and Fine-tuning of Large Models},
  author = {Borzunov, Alexander and Baranchuk, Dmitry and Dettmers, Tim and Ryabinin, Max and Belkada, Younes and Chumachenko, Artem and Samygin, Pavel and Raffel, Colin},
  journal = {arXiv preprint arXiv:2209.01188},
  year = {2022},
  url = {https://arxiv.org/abs/2209.01188}
}

This project is a part of the BigScience research workshop.
