mirror of https://github.com/rhasspy/piper
# Training Guide

Check out a [video training guide by Thorsten Müller](https://www.youtube.com/watch?v=b_we_jma220)

For Windows, see [ssamjh's guide using WSL](https://ssamjh.nz/create-custom-piper-tts-voice/)

---

Training a voice for Piper involves 3 main steps:

1. Preparing the dataset
2. Training the voice model
3. Exporting the voice model

Choices must be made at each step, including:

* The model "quality"
    * low = 16,000 Hz sample rate, [smaller voice model](https://github.com/rhasspy/piper/blob/master/src/python/piper_train/vits/config.py#L30)
    * medium = 22,050 Hz sample rate, [smaller voice model](https://github.com/rhasspy/piper/blob/master/src/python/piper_train/vits/config.py#L30)
    * high = 22,050 Hz sample rate, [larger voice model](https://github.com/rhasspy/piper/blob/master/src/python/piper_train/vits/config.py#L45)
* Single or multiple speakers
* Fine-tuning an [existing model](https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main) or training from scratch
* Exporting to [onnx](https://github.com/microsoft/onnxruntime/) or PyTorch

## Getting Started

Start by installing system dependencies:

``` sh
sudo apt-get install python3-dev
```

Then create a Python virtual environment:

``` sh
cd piper/src/python
python3 -m venv .venv
source .venv/bin/activate
pip3 install --upgrade pip
pip3 install --upgrade wheel setuptools
pip3 install -e .
```

Run the `build_monotonic_align.sh` script in the `src/python` directory to build the extension.

Ensure you have [espeak-ng](https://github.com/espeak-ng/espeak-ng/) installed (`sudo apt-get install espeak-ng`).


## Preparing a Dataset

The Piper training scripts expect two files that can be generated by `python3 -m piper_train.preprocess`:

* A `config.json` file with the voice settings
    * `audio` (required)
        * `sample_rate` - audio rate in hertz
    * `espeak` (required)
        * `language` - espeak-ng voice or [alphabet](https://github.com/rhasspy/piper-phonemize/blob/master/src/phoneme_ids.cpp)
    * `num_symbols` (required)
        * Number of phonemes in the model (typically 256)
    * `num_speakers` (required)
        * Number of speakers in the dataset
    * `phoneme_id_map` (required)
        * Map from a phoneme (UTF-8 codepoint) to a list of ids
        * Id 0 ("_") is padding (pad)
        * Id 1 ("^") is the beginning of an utterance (bos)
        * Id 2 ("$") is the end of an utterance (eos)
        * Id 3 (" ") is a word separator (whitespace)
    * `phoneme_type`
        * "espeak" or "text"
        * "espeak" phonemes use [espeak-ng](https://github.com/rhasspy/espeak-ng)
        * "text" phonemes use a pre-defined [alphabet](https://github.com/rhasspy/piper-phonemize/blob/master/src/phoneme_ids.cpp)
    * `speaker_id_map`
        * Map from a speaker name to id
    * `phoneme_map`
        * Map from a phoneme (UTF-8 codepoint) to a list of phonemes
    * `inference`
        * `noise_scale` - noise added to the generator (default: 0.667)
        * `length_scale` - speaking speed (default: 1.0)
        * `noise_w` - phoneme width variation (default: 0.8)
* A `dataset.jsonl` file with one line per utterance (JSON objects)
    * `phoneme_ids` (required)
        * List of ids for each utterance phoneme (0 <= id < `num_symbols`)
    * `audio_norm_path` (required)
        * Absolute path to [normalized audio](https://github.com/rhasspy/piper/tree/master/src/python/piper_train/norm_audio) file (`.pt`)
    * `audio_spec_path` (required)
        * Absolute path to [audio spectrogram](https://github.com/rhasspy/piper/blob/fda64e7a5104810a24eb102b880fc5c2ac596a38/src/python/piper_train/vits/mel_processing.py#L40) file (`.pt`)
    * `speaker_id` (required for multi-speaker)
        * Id of the utterance's speaker (0 <= id < `num_speakers`)
    * `audio_path`
        * Absolute path to original audio file
    * `text`
        * Original text of utterance before phonemization
    * `phonemes`
        * Phonemes from utterance text before converting to ids
    * `speaker`
        * Name of utterance speaker (from `speaker_id_map`)


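To make the two file layouts concrete, here is a minimal sketch of one `config.json` object and one `dataset.jsonl` entry, plus a small consistency check based on the constraints listed above. The sample values, the placeholder paths, and the `check_utterance` helper are illustrative assumptions, not part of `piper_train`; a real `phoneme_id_map` contains far more entries.

```python
import json

# Hypothetical minimal single-speaker config (values chosen for illustration).
config = {
    "audio": {"sample_rate": 22050},
    "espeak": {"language": "en-us"},
    "num_symbols": 256,
    "num_speakers": 1,
    "phoneme_id_map": {"_": [0], "^": [1], "$": [2], " ": [3]},
}

# Hypothetical utterance entry; the paths are placeholders.
utterance = {
    "phoneme_ids": [1, 0, 3, 0, 2],
    "audio_norm_path": "/path/to/1234.pt",
    "audio_spec_path": "/path/to/1234.spec.pt",
}


def check_utterance(utt, cfg):
    """Return a list of problems found in one dataset.jsonl entry."""
    problems = []
    for key in ("phoneme_ids", "audio_norm_path", "audio_spec_path"):
        if key not in utt:
            problems.append(f"missing required field: {key}")
    # Every phoneme id must satisfy 0 <= id < num_symbols.
    for pid in utt.get("phoneme_ids", []):
        if not 0 <= pid < cfg["num_symbols"]:
            problems.append(f"phoneme id out of range: {pid}")
    # speaker_id is only required for multi-speaker datasets.
    if cfg["num_speakers"] > 1:
        sid = utt.get("speaker_id")
        if sid is None or not 0 <= sid < cfg["num_speakers"]:
            problems.append("speaker_id missing or out of range")
    return problems


print(json.dumps(utterance))  # one line of dataset.jsonl
print(check_utterance(utterance, config))
```

Running a check like this over every line of `dataset.jsonl` before training can catch a truncated or hand-edited file early.
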
### Dataset Format

The pre-processing script expects data to be a directory with:

* `metadata.csv` - CSV file with text, audio filenames, and speaker names
* `wav/` - directory with audio files

The `metadata.csv` file uses `|` as a delimiter, and has 2 or 3 columns depending on whether the dataset has a single speaker or multiple speakers.
There is no header row.

For single speaker datasets:

```csv
id|text
```

where `id` is the name of the WAV file in the `wav` directory, without the `.wav` extension. For example, an `id` of `1234` means that `wav/1234.wav` should exist.

For multi-speaker datasets:

```csv
id|speaker|text
```

where `speaker` is the name of the utterance's speaker. Speaker ids will automatically be assigned based on the number of utterances per speaker (speaker id 0 has the most utterances).


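The format above can be sketched in a few lines of Python. This is an illustration of the file layout and the "most utterances gets id 0" rule, not the code that `piper_train.preprocess` actually runs; `read_metadata` and `assign_speaker_ids` are hypothetical helpers, and breaking ties by first appearance is an assumption.

```python
from collections import Counter


def read_metadata(lines, single_speaker):
    """Parse metadata.csv lines into (id, speaker, text) tuples."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if single_speaker:
            # Two columns: id|text
            utt_id, text = line.split("|", 1)
            speaker = None
        else:
            # Three columns: id|speaker|text
            utt_id, speaker, text = line.split("|", 2)
        rows.append((utt_id, speaker, text))
    return rows


def assign_speaker_ids(rows):
    """Give id 0 to the speaker with the most utterances, then 1, 2, ..."""
    counts = Counter(speaker for _, speaker, _ in rows)
    return {name: i for i, (name, _) in enumerate(counts.most_common())}


sample = [
    "a1|alice|Hello there.",
    "b1|bob|Good morning.",
    "a2|alice|How are you?",
]
rows = read_metadata(sample, single_speaker=False)
print(assign_speaker_ids(rows))
print([f"wav/{utt_id}.wav" for utt_id, _, _ in rows])  # files that should exist
```
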
### Pre-processing

An example of pre-processing a single speaker dataset:

``` sh
python3 -m piper_train.preprocess \
  --language en-us \
  --input-dir /path/to/dataset_dir/ \
  --output-dir /path/to/training_dir/ \
  --dataset-format ljspeech \
  --single-speaker \
  --sample-rate 22050
```

The `--language` argument refers to an [espeak-ng voice](https://github.com/espeak-ng/espeak-ng/) by default, such as `de` for German.

To pre-process a multi-speaker dataset, remove the `--single-speaker` flag and ensure that your dataset has the 3 columns: `id|speaker|text`.
Verify the number of speakers in the generated `config.json` file before proceeding.


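One way to do that verification is a short script that reads `num_speakers` and `speaker_id_map` from the generated file. The sketch below assumes only the `config.json` fields described earlier; `load_config` and `speaker_summary` are hypothetical helper names, and the inline dictionary stands in for a real file.

```python
import json
from pathlib import Path


def load_config(path):
    """Read a generated config.json from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def speaker_summary(config, expected_speakers):
    """Compare the speaker count in a config with the count you expect."""
    found = config["num_speakers"]
    names = sorted(config.get("speaker_id_map", {}))
    return {"ok": found == expected_speakers, "found": found, "names": names}


# Stand-in for load_config("/path/to/training_dir/config.json")
example = {"num_speakers": 2, "speaker_id_map": {"alice": 0, "bob": 1}}
print(speaker_summary(example, expected_speakers=2))
```
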
## Training a Model

Once you have a `config.json`, `dataset.jsonl`, and audio files (`.pt`) from pre-processing, you can begin the training process with `python3 -m piper_train`.

For most cases, you should fine-tune from [an existing model](https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main). The model must have the same audio quality and sample rate, but does not necessarily need to be in the same language.

It is **highly recommended** to train with the following `Dockerfile`:

``` dockerfile
FROM nvcr.io/nvidia/pytorch:22.03-py3

RUN pip3 install \
    'pytorch-lightning'

ENV NUMBA_CACHE_DIR=.numba_cache
```

As an example, we will fine-tune the [medium quality lessac voice](https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main/en/en_US/lessac/medium). Download the `.ckpt` file and run the following command in your training environment:

``` sh
python3 -m piper_train \
    --dataset-dir /path/to/training_dir/ \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 32 \
    --validation-split 0.0 \
    --num-test-examples 0 \
    --max_epochs 10000 \
    --resume_from_checkpoint /path/to/lessac/epoch=2164-step=1355540.ckpt \
    --checkpoint-epochs 1 \
    --precision 32
```

Use `--quality high` to train a [larger voice model](https://github.com/rhasspy/piper/blob/master/src/python/piper_train/vits/config.py#L45) (sounds better, but is much slower).

You can adjust the validation split (5% = 0.05) and number of test examples for your specific dataset. For fine-tuning, they are often set to 0 because the target dataset is very small.

Batch size can be tricky to get right. It depends on the size of your GPU's vRAM, the model's quality/size, and the length of the longest sentence in your dataset. The `--max-phoneme-ids <N>` argument to `piper_train` will drop sentences that have more than `N` phoneme ids. In practice, using `--batch-size 32` and `--max-phoneme-ids 400` will work for 24 GB of vRAM (RTX 3090/4090).


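Before choosing a value for `--max-phoneme-ids`, it can help to see how many utterances a given limit would drop. The sketch below is a hypothetical helper, not part of `piper_train`; it assumes only that each line of `dataset.jsonl` is a JSON object with a `phoneme_ids` list, and it uses a few inline lines in place of a real file.

```python
import json


def count_dropped(jsonl_lines, max_phoneme_ids):
    """Return (kept, dropped, longest) for a given phoneme id limit."""
    kept = dropped = longest = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        n = len(json.loads(line)["phoneme_ids"])
        longest = max(longest, n)
        # Sentences with more than N phoneme ids are the ones dropped.
        if n > max_phoneme_ids:
            dropped += 1
        else:
            kept += 1
    return kept, dropped, longest


# Stand-in for open("/path/to/training_dir/dataset.jsonl", encoding="utf-8")
sample_lines = [
    json.dumps({"phoneme_ids": [1] * 120}),
    json.dumps({"phoneme_ids": [1] * 400}),
    json.dumps({"phoneme_ids": [1] * 650}),
]
print(count_dropped(sample_lines, max_phoneme_ids=400))
```

If a large share of the dataset would be dropped, splitting long recordings into shorter utterances may be preferable to raising the limit.
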
### Multi-Speaker Fine-Tuning

If you're training a multi-speaker model, use `--resume_from_single_speaker_checkpoint` instead of `--resume_from_checkpoint`. This will be *much* faster than training your multi-speaker model from scratch.


### Testing

To test your voice during training, you can use [these test sentences](https://github.com/rhasspy/piper/tree/master/etc/test_sentences) or generate your own with [piper-phonemize](https://github.com/rhasspy/piper-phonemize/). Run the following command to generate audio files:

```sh
cat test_en-us.jsonl | \
    python3 -m piper_train.infer \
        --sample-rate 22050 \
        --checkpoint /path/to/training_dir/lightning_logs/version_0/checkpoints/*.ckpt \
        --output-dir /path/to/training_dir/output
```

The input format to `piper_train.infer` is the same as `dataset.jsonl`: one line of JSON per utterance with `phoneme_ids` and `speaker_id` (multi-speaker only). Generate your own test file with [piper-phonemize](https://github.com/rhasspy/piper-phonemize/):

```sh
lib/piper_phonemize -l en-us --espeak-data lib/espeak-ng-data/ < my_test_sentences.txt > my_test_phonemes.jsonl
```


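For a multi-speaker model, each test line also needs a `speaker_id`. A minimal sketch for adding one to an existing test file follows; `with_speaker_id` is a hypothetical helper, and it assumes each input line is a JSON object that already contains `phoneme_ids`.

```python
import json


def with_speaker_id(jsonl_lines, speaker_id):
    """Yield each JSON line with `speaker_id` set to the given value."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        utterance = json.loads(line)
        utterance["speaker_id"] = speaker_id
        yield json.dumps(utterance)


# Stand-in for the lines of my_test_phonemes.jsonl
test_lines = ['{"phoneme_ids": [1, 0, 3, 0, 2]}']
for out_line in with_speaker_id(test_lines, speaker_id=0):
    print(out_line)
```

The printed lines can be redirected to a new file and piped into `piper_train.infer` in place of `test_en-us.jsonl`.
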
### Tensorboard

Check on your model's progress with tensorboard:

```sh
tensorboard --logdir /path/to/training_dir/lightning_logs
```

Click on the scalars tab and look at both `loss_disc_all` and `loss_gen_all`. In general, the model is "done" when `loss_disc_all` levels off. We've found that 2000 epochs is usually good for models trained from scratch, and an additional 1000 epochs when fine-tuning.


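"Levels off" is a judgment call usually made by eye. If you prefer a rough numeric check on values exported from the scalars tab, one possible sketch compares the mean of the most recent values with the mean of the values just before them. The `has_leveled_off` helper, the window size, and the tolerance are all arbitrary assumptions for illustration, not a rule from Piper.

```python
def has_leveled_off(losses, window=3, tolerance=0.01):
    """Return True if the last `window` values barely differ from the previous `window`.

    The comparison is relative: |recent - previous| / |previous| < tolerance.
    """
    if len(losses) < 2 * window:
        return False  # not enough history to compare two windows
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    if previous == 0:
        return recent == 0
    return abs(recent - previous) / abs(previous) < tolerance


still_falling = [3.0, 2.5, 2.0, 1.5, 1.2, 1.0]
flat = [2.5, 2.5, 2.5, 2.5, 2.5, 2.5]
print(has_leveled_off(still_falling), has_leveled_off(flat))
```

GAN losses are noisy, so a check like this is best treated as a hint alongside listening to test audio.
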
## Exporting a Model

When your model is finished training, export it to onnx with:

```sh
python3 -m piper_train.export_onnx \
    /path/to/model.ckpt \
    /path/to/model.onnx

cp /path/to/training_dir/config.json \
   /path/to/model.onnx.json
```

The [export script](https://github.com/rhasspy/piper-samples/blob/master/_script/export.sh) does additional optimization of the model with [onnx-simplifier](https://github.com/daquexian/onnx-simplifier).

If the export is successful, you can now use your voice with Piper:

```sh
echo 'This is a test.' | \
  piper -m /path/to/model.onnx --output_file test.wav
```
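
If you script the export, the config copy can be done in Python as well. The sketch below follows the naming used in the `cp` command above (the model filename with `.json` appended); `config_path_for` and `copy_config` are hypothetical helpers, not part of Piper.

```python
import shutil
from pathlib import Path


def config_path_for(model_path):
    """Return the config path next to a model: model.onnx -> model.onnx.json."""
    return Path(str(model_path) + ".json")


def copy_config(training_dir, model_path):
    """Copy config.json from the training directory next to the exported model."""
    source = Path(training_dir) / "config.json"
    target = config_path_for(model_path)
    shutil.copyfile(source, target)
    return target


print(config_path_for("/path/to/model.onnx").name)
```
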