Commit Graph

29 Commits (158621677bac37572c2cf256c419472d507d451c)

Author SHA1 Message Date
Alexander Borzunov 158621677b
Bump version to 2.2.0 (#502) 9 months ago
Alexander Borzunov 26ebbfe8f0
Support macOS (#477)
This PR makes both clients and servers work on macOS. Specifically, it:

- Follows https://github.com/learning-at-home/hivemind/pull/586 to run a macOS-compatible `p2pd` binary (both x86-64 and ARM64 are supported)
- Fixes forking issues and tests on macOS, Python 3.10+
- Introduces basic support for serving model blocks on Apple M1/M2 GPUs (torch.mps)
- Increases the maximum number of open files by default (the default limit is not high enough on Linux and is particularly small on macOS)
9 months ago
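A minimal sketch of two of the changes above: raising the open-file limit and preferring Apple's MPS backend. The function names and the default limit are illustrative assumptions, not the actual Petals code:

```python
import resource

import torch


def raise_open_file_limit(soft_target: int = 4096) -> None:
    # Raise the soft limit on open file descriptors, never exceeding the hard limit
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard == resource.RLIM_INFINITY:
        new_soft = max(soft, soft_target)
    else:
        new_soft = min(max(soft, soft_target), hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))


def pick_device() -> torch.device:
    # Prefer CUDA, then Apple M1/M2 GPUs via the MPS backend, then CPU
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```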
Alexander Borzunov 90840dfea2
Fix requiring transformers>=4.32.0 (#480) 9 months ago
Alexander Borzunov 915b357740
Require transformers>=4.32.0 (#479)
This is necessary to load https://huggingface.co/petals-team/StableBeluga2, since that repo doesn't have the deprecated `inv_freq` weights.
9 months ago
Alexander Borzunov 6967904590
Bump version to 2.1.0 (#474)
* Bump version to 2.1.0
* Suggest using resharded repo
* LLaMA -> Llama in readme
9 months ago
Alexander Borzunov 722c4dc496
Bump version to 2.0.1.post2 (#459) 10 months ago
Alexander Borzunov a1f7791d5e
Fix petals.utils.ping for servers with client-mode DHT (#430)
Fix #429.
10 months ago
Alexander Borzunov f3fafd14a4
Bump version to 2.0.1 (#411) 11 months ago
Alexander Borzunov d49d9ad0cf
Bump version to 2.0.0.post3 (#391) 11 months ago
Aleksandr Borzunov ddcda02b06 Hardcode IPs until DNS issues get resolved 11 months ago
Alexander Borzunov b1ff8bdd6c
Bump version to 2.0.0.post1 (#384) 11 months ago
Alexander Borzunov 057a2fb5de
Support Llama 2 (#379) 11 months ago
Alexander Borzunov c735dd7ba3
Update transformers to 4.31.0 and peft to 0.4.0 (#371) 11 months ago
Alexander Borzunov 62d9ed5ce7
Implement shortest-path routing for inference (#362)
This PR:

1. **Adds shortest path routing for inference.** We build a graph with client-server and server-server latencies and compute costs, as well as empirically measured overheads. For client-server latencies, we ping possible first and last servers in a sequence in `SequenceManager.update()`. We penalize servers that may not have enough cache for our request. This uses info added to DHT in #355, #356, #358.

2. **Makes a server ping neighboring servers in addition to the next ones.** This gives the client an opportunity to switch servers even before using all of a server's blocks (e.g., because a neighboring server is faster). This feature is not enabled yet, since it increases the graph size for N servers to O(N^2), but we may enable it if needed.

3. **Fixes a `SequenceManager` bug with the first `update()`.** Previously, this update was likely to produce incorrect information and cause `MissingBlocksError` exceptions until the next update happened.
11 months ago
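The routing in item 1 of the PR above reduces to a single-source shortest-path search over a weighted graph whose edge weights combine latencies and compute costs. A self-contained sketch using Dijkstra's algorithm (the node names and weights are made up; this is not the actual `SequenceManager` code):

```python
import heapq
from itertools import count
from typing import Dict, List, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # node -> [(neighbor, edge_cost)]


def shortest_path(graph: Graph, source: str, target: str) -> List[str]:
    # Dijkstra's algorithm: returns the minimum-cost sequence of nodes
    dist = {source: 0.0}
    prev: Dict[str, str] = {}
    tie = count()  # tie-breaker so heap entries never compare node names
    queue = [(0.0, next(tie), source)]
    visited = set()
    while queue:
        cost, _, node = heapq.heappop(queue)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                prev[neighbor] = node
                heapq.heappush(queue, (new_cost, next(tie), neighbor))
    path, node = [target], target
    while node != source:
        node = prev[node]  # raises KeyError if the target is unreachable
        path.append(node)
    return path[::-1]


# Toy example: two candidate servers hosting the same blocks, reachable from the
# client with different ping + compute costs
graph = {
    "client": [("server_A", 0.05), ("server_B", 0.02)],
    "server_A": [("end", 0.0)],
    "server_B": [("end", 0.0)],
}
print(shortest_path(graph, "client", "end"))  # ['client', 'server_B', 'end']
```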
Alexander Borzunov 2c8959e713
Share more info about a server in DHT (#355) 11 months ago
Alexander Borzunov e12d4c666b
Spam less in server logs (#350) 11 months ago
Alexander Borzunov 158013a671
Implement direct server-to-server communication (#331)
Implement #226.
11 months ago
Alexander Borzunov cb3f018f9f
Add LLaMA support (#323)
This PR:

1. **Abolishes the model conversion procedure.** Now, models are downloaded directly from original repositories like https://huggingface.co/bigscience/bloom. Servers download only shards with blocks to be hosted, and clients download only shards with input/output embeddings and layernorms.

    - BLOOM is loaded from `bigscience/bloom`, but we use the DHT prefix `bigscience/bloom-petals` for backward compatibility. Same with smaller BLOOMs and BLOOMZ.
    - LLaMA can be loaded from any repo like `username/llama-65b-hf`, but we use the DHT prefix `llama-65b-hf` (without the username) to accommodate blocks from different repos (there are a few of them with minor differences, such as `Llama` vs. `LLaMA` in the class name).

2. **Refactors the client to generalize it for multiple models.** Now, we have model-specific subpackages under `petals.models` (e.g., `petals.models.bloom`, `petals.models.llama`), while general code (e.g., the CPU-efficient LM head, p-tuning) stays in `petals.client`.

3. **Introduces** `WrappedLlamaBlock`, `DistributedLlamaConfig`, `DistributedLlamaForCausalLM`, `DistributedLlamaForSequenceClassification`, and `DistributedLlamaModel` compatible with Petals functionality (p-tuning, adapters, etc.).

4. **Introduces** `AutoDistributedConfig` that automatically chooses the correct config class (`DistributedLlamaConfig` or `DistributedBloomConfig`). The refactored configs contain all model-specific info for both clients and servers.

Upgrade instructions:

- Remove disk caches for blocks in the old (converted) format to save disk space. That is, remove the `~/.cache/petals/model--bigscience--bloom-petals` and `~/.cache/petals/model--bigscience--bloomz-petals` directories (if present).
12 months ago
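Item 4 of the PR above describes an auto-config class that inspects a checkpoint and picks the matching model-specific config. A rough sketch of that dispatch pattern using stub classes (the stubs and the mapping below are illustrative assumptions; the real `AutoDistributedConfig` dispatches to the actual `DistributedBloomConfig`/`DistributedLlamaConfig`):

```python
from transformers import AutoConfig, PretrainedConfig


class StubDistributedBloomConfig(PretrainedConfig):
    model_type = "bloom"


class StubDistributedLlamaConfig(PretrainedConfig):
    model_type = "llama"


# model_type (read from the repo's config.json) -> config class to instantiate
_CONFIG_MAPPING = {
    "bloom": StubDistributedBloomConfig,
    "llama": StubDistributedLlamaConfig,
}


class AutoDistributedConfigSketch:
    @classmethod
    def from_pretrained(cls, model_name_or_path: str, **kwargs) -> PretrainedConfig:
        # Load the base config only to learn which model family the repo holds
        base = AutoConfig.from_pretrained(model_name_or_path, **kwargs)
        try:
            config_cls = _CONFIG_MAPPING[base.model_type]
        except KeyError:
            raise ValueError(f"Unsupported model type: {base.model_type!r}") from None
        return config_cls.from_pretrained(model_name_or_path, **kwargs)
```

With this pattern, `AutoDistributedConfigSketch.from_pretrained("username/llama-65b-hf")` would return a Llama config while a BLOOM repo returns a BLOOM one, so clients and servers don't need to know the model family in advance.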
Alexander Borzunov 675bacb592
Bump version to 1.1.5 (#312) 1 year ago
Alexander Borzunov 0a313bf6c5
Update hivemind to 1.1.8, enable efficient bfloat16 encoding (#311)
This PR:

1. Updates hivemind to 1.1.8 (includes https://github.com/learning-at-home/hivemind/pull/565)
2. Enables efficient bfloat16 serialization by default (`USE_LEGACY_BFLOAT16 = False`)
3. Removes logging code that has since been upstreamed to hivemind in https://github.com/learning-at-home/hivemind/pull/542
1 year ago
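Point 2 above replaces the legacy behavior of upcasting bfloat16 tensors to float32 before sending with shipping the raw 16-bit payload. A toy illustration of the size difference (this mimics the idea behind `USE_LEGACY_BFLOAT16`, not hivemind's actual wire format):

```python
import numpy as np
import torch


def serialize_bf16(tensor: torch.Tensor, use_legacy: bool) -> bytes:
    assert tensor.dtype == torch.bfloat16
    if use_legacy:
        # Legacy path: upcast to float32, doubling the payload size
        return tensor.to(torch.float32).numpy().tobytes()
    # Efficient path: reinterpret the bfloat16 bits as int16 and send them as-is
    return tensor.contiguous().view(torch.int16).numpy().tobytes()


def deserialize_bf16(data: bytes, use_legacy: bool) -> torch.Tensor:
    if use_legacy:
        return torch.from_numpy(np.frombuffer(data, dtype=np.float32).copy()).to(torch.bfloat16)
    return torch.from_numpy(np.frombuffer(data, dtype=np.int16).copy()).view(torch.bfloat16)


x = torch.randn(1024, dtype=torch.bfloat16)
assert 2 * len(serialize_bf16(x, use_legacy=False)) == len(serialize_bf16(x, use_legacy=True))
assert torch.equal(deserialize_bf16(serialize_bf16(x, use_legacy=False), use_legacy=False), x)
```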
Alexander Borzunov 93c4eba5d1
Bump version to 1.1.4 (#306) 1 year ago
Alexander Borzunov c519bffc59
Bump version to 1.1.3 (#278) 1 year ago
Alexander Borzunov b03efb1ef5
Bump version to 1.1.2 (#244) 1 year ago
Alexander Borzunov cea83d3356
Bump version to 1.1.1 (#214) 1 year ago
Alexander Borzunov 82c9f93ce6
Bump version to 1.1.0 (#190) 1 year ago
Aleksandr Borzunov ff8ade8d3b Bump version to 1.0.0 1 year ago
Alexander Borzunov 523a7cad33
Fix issues related to `petals` as a module (#159)
1. Added `from petals.client import *` to `petals/__init__.py`, so you can now write just:

    ```python
    from petals import DistributedBloomForCausalLM
    ```

    I didn't do the same with the server, since its classes are supposed to be used by `petals.cli.run_server`, not by end users. Though it's still possible to do `from petals.server.smth import smth` if necessary.

2. Fixed one more logging issue: log lines from hivemind were shown twice due to a bug in #156.

3. Removed unused `runtime.py`, since the server actually uses `hivemind.moe.Runtime`, and `runtime.py` has no significant changes compared to it.
1 year ago
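Item 1 of the commit above relies on the standard package re-export pattern: the top-level `__init__.py` pulls names from the client subpackage so that they become importable from the package root. A minimal sketch with hypothetical module names (not the actual Petals files):

```python
# mypkg/client/__init__.py  (defines what `import *` exposes)
__all__ = ["DistributedModel"]


class DistributedModel:
    """Stand-in for a client-side class such as DistributedBloomForCausalLM."""
```

```python
# mypkg/__init__.py  (re-exports the client API at the package root)
from mypkg.client import *  # noqa: F401,F403
```

After this, `from mypkg import DistributedModel` works, while server-side classes stay out of the root namespace and remain importable only via their full `mypkg.server.*` paths.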
Alexander Borzunov 668b736031
Fix logging: do not duplicate lines, enable colors in Colab (#156) 1 year ago
Alexander Borzunov 7bd5916744
Make Petals a pip-installable package (attempt 2) (#102)
1. Petals can now be installed using `pip install git+https://github.com/bigscience-workshop/petals`
    - If you have already cloned the repo, you can do `pip install .` or `pip install .[dev]`
2. Moved `src` => `src/petals`
    - Replaced `from src.smth import smth` with `from petals.smth import smth`
3. Moved `cli` => `src/petals/cli`
    - Replaced `python -m cli.run_smth` with `python -m petals.cli.run_smth` (all utilities are now available right after pip installation)
4. Moved the `requirements*.txt` contents to `setup.cfg` (a `requirements.txt` is not well supported by modern packaging utilities for distributing packages)
5. Increased the package version from `0.2` to `1.0alpha1`
2 years ago
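A rough sketch of the `src`-layout packaging described in items 2-5 above, written as an equivalent `setup.py` for brevity (the PR itself uses `setup.cfg`, and the dependency list here is an illustrative subset):

```python
from setuptools import find_packages, setup

setup(
    name="petals",
    version="1.0a1",                      # PEP 440 form of 1.0alpha1
    package_dir={"": "src"},              # the code now lives under src/petals
    packages=find_packages(where="src"),  # discovers petals, petals.cli, ...
    install_requires=[
        # dependencies moved here from requirements.txt (illustrative subset)
        "torch",
        "hivemind",
    ],
)
```

One benefit of the `src/` layout is that the working directory itself is not importable, so tests and CLI runs exercise the installed `petals` package rather than the raw checkout.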