Merge branch 'rhasspy:master' into master

pull/88/head
Mateo Cedillo 12 months ago committed by GitHub
commit 4a7f37c4f6
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@@ -24,17 +24,7 @@ RUN curl -L "https://github.com/gabime/spdlog/archive/refs/tags/v${SPDLOG_VERSIO
RUN mkdir -p "lib/Linux-$(uname -m)"
ARG ONNXRUNTIME_VERSION='1.14.1'
RUN if [ "${TARGETARCH}${TARGETVARIANT}" = 'amd64' ]; then \
ONNXRUNTIME_ARCH='x64'; \
else \
ONNXRUNTIME_ARCH="$(uname -m)"; \
fi && \
curl -L "https://github.com/microsoft/onnxruntime/releases/download/v${ONNXRUNTIME_VERSION}/onnxruntime-linux-${ONNXRUNTIME_ARCH}-${ONNXRUNTIME_VERSION}.tgz" | \
tar -C "lib/Linux-$(uname -m)" -xzvf - && \
mv "lib/Linux-$(uname -m)"/onnxruntime-* \
"lib/Linux-$(uname -m)/onnxruntime"
# Use pre-compiled Piper phonemization library (includes onnxruntime)
ARG PIPER_PHONEMIZE_VERSION='1.0.0'
RUN mkdir -p "lib/Linux-$(uname -m)/piper_phonemize" && \
curl -L "https://github.com/rhasspy/piper-phonemize/releases/download/v${PIPER_PHONEMIZE_VERSION}/libpiper_phonemize-${TARGETARCH}${TARGETVARIANT}.tar.gz" | \
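
Judging by the hunk sizes and the new comment, this hunk appears to drop the separate ONNX Runtime download in favor of the pre-compiled piper-phonemize archive. A minimal sketch of how the two archive URLs are derived from the build variables follows. The function names are hypothetical, and `machine` stands in for the output of `uname -m`.

```python
def onnxruntime_url(target_arch, target_variant, machine, version="1.14.1"):
    """Mirror the shell conditional: amd64 maps to 'x64', anything else uses `uname -m`."""
    arch = "x64" if f"{target_arch}{target_variant}" == "amd64" else machine
    return (
        "https://github.com/microsoft/onnxruntime/releases/download/"
        f"v{version}/onnxruntime-linux-{arch}-{version}.tgz"
    )


def piper_phonemize_url(target_arch, target_variant, version="1.0.0"):
    """Build the archive URL from TARGETARCH and TARGETVARIANT, as the RUN step does."""
    return (
        "https://github.com/rhasspy/piper-phonemize/releases/download/"
        f"v{version}/libpiper_phonemize-{target_arch}{target_variant}.tar.gz"
    )
```

Note that the piper-phonemize archive name uses the Docker platform name directly, so no `x64` special case is needed there.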

@@ -18,9 +18,7 @@ Voices are trained with [VITS](https://github.com/jaywalnut310/vits/) and export
Our goal is to support Home Assistant and the [Year of Voice](https://www.home-assistant.io/blog/2022/12/20/year-of-voice/).
Download voices from [the release](https://github.com/rhasspy/piper/releases/tag/v0.0.2).
Supported languages:
[Download voices](https://github.com/rhasspy/piper/releases/tag/v0.0.2) for the supported languages:
* Catalan (ca)
* Danish (da)
@@ -50,13 +48,12 @@ Supported languages:
Download a release:
* [amd64](https://github.com/rhasspy/piper/releases/download/v0.0.2/piper_amd64.tar.gz) (desktop Linux)
* [arm64](https://github.com/rhasspy/piper/releases/download/v0.0.2/piper_arm64.tar.gz) (Raspberry Pi 4)
If you want to build from source, see the [Makefile](Makefile) and [C++ source](src/cpp). Piper depends on a patched `espeak-ng` in [lib](lib), which includes a way to get access to the "terminator" used to end each clause/sentence.
* [amd64](https://github.com/rhasspy/piper/releases/download/v1.0.0/piper_amd64.tar.gz) (desktop Linux)
* [arm64](https://github.com/rhasspy/piper/releases/download/v1.0.0/piper_arm64.tar.gz) (Raspberry Pi 4)
The ONNX runtime is expected in `lib/Linux-$(uname -m)`, so `lib/Linux-x86_64`, etc. You can change this path in `src/cpp/CMakeLists.txt` if necessary.
Last tested with [onnxruntime](https://github.com/microsoft/onnxruntime) 1.14.1.
If you want to build from source, see the [Makefile](Makefile) and [C++ source](src/cpp).
You must download and extract [piper-phonemize](https://github.com/rhasspy/piper-phonemize) to `lib/Linux-$(uname -m)/piper_phonemize` before building.
For example, `lib/Linux-x86_64/piper_phonemize/lib/libpiper_phonemize.so` should exist for AMD/Intel machines (as well as everything else from `libpiper_phonemize-amd64.tar.gz`).
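
A pre-build check for the path described above could look like the following sketch. The helper name is hypothetical, and it assumes `platform.machine()` returns the same string as `uname -m`, which is normally the case on Linux.

```python
import platform
from pathlib import Path


def expected_phonemize_lib(repo_root, machine=None):
    """Return the path where the build expects libpiper_phonemize.so."""
    machine = machine or platform.machine()
    return (
        Path(repo_root)
        / "lib"
        / f"Linux-{machine}"
        / "piper_phonemize"
        / "lib"
        / "libpiper_phonemize.so"
    )


if __name__ == "__main__":
    lib_path = expected_phonemize_lib(".")
    status = "found" if lib_path.exists() else "missing"
    print(f"{lib_path}: {status}")
```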
## Usage
@@ -125,7 +122,7 @@ python3 -m piper_train.preprocess \
--sample-rate 22050
```
Datasets must either be in the [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) format or from [Mimic Recording Studio](https://github.com/MycroftAI/mimic-recording-studio) (`--dataset-format mycroft`).
Datasets must either be in the [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) format (with only id/text columns or id/speaker/text) or from [Mimic Recording Studio](https://github.com/MycroftAI/mimic-recording-studio) (`--dataset-format mycroft`).
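
To illustrate the two accepted column layouts, here is a small sketch of a line parser. It assumes the `|` delimiter used by LJSpeech-style `metadata.csv` files. The function name is hypothetical and this is not the preprocessing code itself.

```python
def parse_metadata_line(line):
    """Split one metadata line into (utterance_id, speaker, text).

    speaker is None for the two-column id|text layout.
    """
    parts = line.rstrip("\n").split("|")
    if len(parts) == 2:
        utt_id, text = parts
        return utt_id, None, text
    if len(parts) == 3:
        utt_id, speaker, text = parts
        return utt_id, speaker, text
    raise ValueError(f"Expected 2 or 3 columns, got {len(parts)}: {line!r}")
```

For example, `parse_metadata_line("utt1|alice|Hello.")` yields the speaker `"alice"`, while `parse_metadata_line("utt1|Hello.")` yields `None` for the speaker.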
Finally, you can train:

@@ -0,0 +1,7 @@
Sateenkaari on spektrin väreissä esiintyvä ilmakehän optinen ilmiö.
Se syntyy, kun valo taittuu pisaran etupinnasta, heijastuu pisaran takapinnasta ja taittuu jälleen pisaran etupinnasta.
Koska vesipisara on dispersiivinen, valkoinen valo hajoaa väreiksi muodostaen sateenkaaren.
Prisman tuottama spektri on valon eri aallonpituuksien tasainen jatkumo ilman kaistoja.
Ihmissilmä kykenee erottamaan spektristä erikseen joitain satoja eri värejä.
Tämän mukaisesti Munsellin värisysteemi erottaa 100 eri värisävyä.
Päävärien näennäinen erillisyys on ihmisen näköjärjestelmän ominaisuus ja päävärien tarkka lukumäärä on jossain määrin vapaavalintainen.

@@ -0,0 +1,4 @@
इन्द्रेणी वा इन्द्रधनुष प्रकाश र रंगबाट उत्पन्न भएको यस्तो घटना हो जसमा रंगीन प्रकाशको एउटा अर्धवृत आकाशमा देखिन्छ। जब सूर्यको प्रकाश पृथ्वीको वायुमण्डलमा भएको पानीको थोपा माथि पर्छ, पानीको थोपाले प्रकाशलाई परावर्तन, आवर्तन र डिस्पर्सन गर्दछ। फलस्वरुप आकाशमा एउटा सप्तरङ्गी अर्धवृताकार प्रकाशीय आकृति उत्पन्न हुन्छ। यो आकृतिलाई नै इन्द्रेणी भनिन्छ। इन्द्रेणी देखिनुको कारण वायुमण्डलमा पानीका कणहरु हुनु नै हो। वर्षा, झरनाबाट उछिट्टिएको पानी, शीत, कुहिरो आदिको इन्द्रेणी देखिने प्रक्रियामा महत्त्वपूर्ण भूमिका हुन्छ। इन्द्रेणीमा सात रंगहरु रातो, सुन्तला, पहेंलो, हरियो, आकाशे निलो, गाढा निलो र बैजनी रंग क्रमैसँग देखिन्छ। यसमा सबैभन्दा माथिल्लो छेउमा रातो रंग र अर्को छेउमा बैजनी रंग देखिन्छ। इन्द्रेणी पूर्ण वृत्ताकार समेत हुन सक्ने भए पनि साधरण अवलोकनकर्ताले जमिन माथि बनेको आधा भाग मात्र देख्न सकिन्छ ।
इन्द्रेणी देखिनुको कारण वायुमण्डलमा पानीका कणहरु हुनु नै हो। वर्षा, झरनाबाट उछिट्टिएको पानी, शीत, कुहिरो आदिको इन्द्रेणी देखिने प्रक्रियामा महत्त्वपूर्ण भूमिका हुन्छ।
इन्द्रेणीमा सात रंगहरु रातो, सुन्तला, पहेंलो, हरियो, आकाशे निलो, गाढा निलो र बैजनी रंग क्रमैसँग देखिन्छ। यसमा सबैभन्दा माथिल्लो छेउमा रातो रंग र अर्को छेउमा बैजनी रंग देखिन्छ।
इन्द्रेणी पूर्ण वृत्ताकार समेत हुन सक्ने भए पनि साधरण अवलोकनकर्ताले जमिन माथि बनेको आधा भाग मात्र देख्न सकिन्छ ।

@@ -0,0 +1,7 @@
{"phoneme_ids":[1,0,31,0,120,0,14,0,32,0,18,0,122,0,44,0,23,0,121,0,14,0,122,0,30,0,74,0,3,0,27,0,26,0,3,0,31,0,28,0,120,0,18,0,23,0,32,0,30,0,74,0,26,0,3,0,34,0,120,0,39,0,30,0,18,0,21,0,31,0,31,0,39,0,3,0,120,0,18,0,31,0,74,0,122,0,26,0,32,0,121,0,37,0,34,0,39,0,3,0,120,0,21,0,24,0,25,0,14,0,23,0,18,0,20,0,39,0,26,0,3,0,120,0,27,0,28,0,32,0,74,0,26,0,18,0,26,0,3,0,120,0,21,0,24,0,25,0,74,0,42,0,10,0,2],"phonemes":["s","ˈ","a","t","e","ː","ŋ","k","ˌ","a","ː","r","ɪ"," ","o","n"," ","s","p","ˈ","e","k","t","r","ɪ","n"," ","v","ˈ","æ","r","e","i","s","s","æ"," ","ˈ","e","s","ɪ","ː","n","t","ˌ","y","v","æ"," ","ˈ","i","l","m","a","k","e","h","æ","n"," ","ˈ","o","p","t","ɪ","n","e","n"," ","ˈ","i","l","m","ɪ","ø","."],"processed_text":"Sateenkaari on spektrin väreissä esiintyvä ilmakehän optinen ilmiö.","text":"Sateenkaari on spektrin väreissä esiintyvä ilmakehän optinen ilmiö."}
{"phoneme_ids":[1,0,31,0,18,0,3,0,31,0,120,0,37,0,26,0,32,0,37,0,122,0,8,0,23,0,120,0,33,0,26,0,3,0,34,0,120,0,14,0,24,0,27,0,3,0,32,0,120,0,14,0,21,0,32,0,122,0,33,0,122,0,3,0,28,0,120,0,21,0,31,0,14,0,30,0,14,0,26,0,3,0,120,0,18,0,32,0,33,0,28,0,121,0,21,0,26,0,26,0,14,0,31,0,32,0,14,0,8,0,20,0,120,0,18,0,21,0,22,0,14,0,31,0,32,0,33,0,122,0,3,0,28,0,120,0,21,0,31,0,14,0,30,0,14,0,26,0,3,0,32,0,120,0,14,0,23,0,14,0,28,0,121,0,21,0,26,0,26,0,14,0,31,0,32,0,14,0,3,0,22,0,14,0,3,0,32,0,120,0,14,0,21,0,32,0,122,0,33,0,122,0,3,0,22,0,120,0,39,0,24,0,24,0,18,0,122,0,25,0,3,0,28,0,120,0,21,0,31,0,14,0,30,0,14,0,26,0,3,0,120,0,18,0,32,0,33,0,28,0,121,0,21,0,26,0,26,0,14,0,31,0,32,0,14,0,10,0,2],"phonemes":["s","e"," ","s","ˈ","y","n","t","y","ː",",","k","ˈ","u","n"," ","v","ˈ","a","l","o"," ","t","ˈ","a","i","t","ː","u","ː"," ","p","ˈ","i","s","a","r","a","n"," ","ˈ","e","t","u","p","ˌ","i","n","n","a","s","t","a",",","h","ˈ","e","i","j","a","s","t","u","ː"," ","p","ˈ","i","s","a","r","a","n"," ","t","ˈ","a","k","a","p","ˌ","i","n","n","a","s","t","a"," ","j","a"," ","t","ˈ","a","i","t","ː","u","ː"," ","j","ˈ","æ","l","l","e","ː","m"," ","p","ˈ","i","s","a","r","a","n"," ","ˈ","e","t","u","p","ˌ","i","n","n","a","s","t","a","."],"processed_text":"Se syntyy, kun valo taittuu pisaran etupinnasta, heijastuu pisaran takapinnasta ja taittuu jälleen pisaran etupinnasta.","text":"Se syntyy, kun valo taittuu pisaran etupinnasta, heijastuu pisaran takapinnasta ja taittuu jälleen pisaran etupinnasta."}
{"phoneme_ids":[1,0,23,0,120,0,27,0,31,0,23,0,14,0,3,0,34,0,120,0,18,0,31,0,74,0,28,0,121,0,21,0,31,0,14,0,30,0,14,0,3,0,27,0,26,0,3,0,17,0,120,0,21,0,31,0,28,0,18,0,30,0,31,0,121,0,21,0,122,0,34,0,74,0,26,0,18,0,26,0,8,0,34,0,120,0,14,0,24,0,23,0,27,0,21,0,26,0,18,0,26,0,3,0,34,0,120,0,14,0,24,0,27,0,3,0,20,0,120,0,14,0,22,0,27,0,14,0,122,0,3,0,34,0,120,0,39,0,30,0,18,0,21,0,23,0,31,0,74,0,3,0,25,0,120,0,33,0,27,0,17,0,27,0,31,0,32,0,14,0,18,0,26,0,3,0,31,0,120,0,14,0,32,0,18,0,122,0,44,0,23,0,121,0,14,0,122,0,30,0,18,0,26,0,10,0,2],"phonemes":["k","ˈ","o","s","k","a"," ","v","ˈ","e","s","ɪ","p","ˌ","i","s","a","r","a"," ","o","n"," ","d","ˈ","i","s","p","e","r","s","ˌ","i","ː","v","ɪ","n","e","n",",","v","ˈ","a","l","k","o","i","n","e","n"," ","v","ˈ","a","l","o"," ","h","ˈ","a","j","o","a","ː"," ","v","ˈ","æ","r","e","i","k","s","ɪ"," ","m","ˈ","u","o","d","o","s","t","a","e","n"," ","s","ˈ","a","t","e","ː","ŋ","k","ˌ","a","ː","r","e","n","."],"processed_text":"Koska vesipisara on dispersiivinen, valkoinen valo hajoaa väreiksi muodostaen sateenkaaren.","text":"Koska vesipisara on dispersiivinen, valkoinen valo hajoaa väreiksi muodostaen sateenkaaren."}
{"phoneme_ids":[1,0,28,0,30,0,120,0,21,0,31,0,25,0,14,0,26,0,3,0,32,0,120,0,33,0,27,0,32,0,122,0,14,0,25,0,14,0,3,0,31,0,28,0,120,0,18,0,23,0,32,0,30,0,74,0,3,0,27,0,26,0,3,0,34,0,120,0,14,0,24,0,27,0,26,0,3,0,120,0,18,0,30,0,74,0,3,0,120,0,14,0,122,0,24,0,24,0,27,0,25,0,28,0,74,0,32,0,121,0,33,0,122,0,23,0,31,0,21,0,18,0,26,0,3,0,32,0,120,0,14,0,31,0,14,0,21,0,26,0,18,0,26,0,3,0,22,0,120,0,14,0,32,0,23,0,33,0,25,0,27,0,3,0,120,0,21,0,24,0,25,0,14,0,44,0,3,0,23,0,120,0,14,0,21,0,31,0,32,0,27,0,22,0,14,0,10,0,2],"phonemes":["p","r","ˈ","i","s","m","a","n"," ","t","ˈ","u","o","t","ː","a","m","a"," ","s","p","ˈ","e","k","t","r","ɪ"," ","o","n"," ","v","ˈ","a","l","o","n"," ","ˈ","e","r","ɪ"," ","ˈ","a","ː","l","l","o","m","p","ɪ","t","ˌ","u","ː","k","s","i","e","n"," ","t","ˈ","a","s","a","i","n","e","n"," ","j","ˈ","a","t","k","u","m","o"," ","ˈ","i","l","m","a","ŋ"," ","k","ˈ","a","i","s","t","o","j","a","."],"processed_text":"Prisman tuottama spektri on valon eri aallonpituuksien tasainen jatkumo ilman kaistoja.","text":"Prisman tuottama spektri on valon eri aallonpituuksien tasainen jatkumo ilman kaistoja."}
{"phoneme_ids":[1,0,120,0,21,0,20,0,25,0,74,0,31,0,31,0,121,0,21,0,24,0,25,0,39,0,3,0,23,0,120,0,37,0,23,0,18,0,26,0,18,0,122,0,3,0,120,0,18,0,30,0,27,0,32,0,122,0,14,0,25,0,14,0,122,0,26,0,3,0,31,0,28,0,120,0,18,0,23,0,32,0,30,0,74,0,31,0,32,0,39,0,3,0,120,0,18,0,30,0,74,0,23,0,31,0,18,0,122,0,26,0,3,0,22,0,120,0,27,0,21,0,32,0,14,0,21,0,26,0,3,0,31,0,120,0,14,0,32,0,27,0,22,0,14,0,3,0,120,0,18,0,30,0,74,0,3,0,34,0,120,0,39,0,30,0,18,0,22,0,39,0,10,0,2],"phonemes":["ˈ","i","h","m","ɪ","s","s","ˌ","i","l","m","æ"," ","k","ˈ","y","k","e","n","e","ː"," ","ˈ","e","r","o","t","ː","a","m","a","ː","n"," ","s","p","ˈ","e","k","t","r","ɪ","s","t","æ"," ","ˈ","e","r","ɪ","k","s","e","ː","n"," ","j","ˈ","o","i","t","a","i","n"," ","s","ˈ","a","t","o","j","a"," ","ˈ","e","r","ɪ"," ","v","ˈ","æ","r","e","j","æ","."],"processed_text":"Ihmissilmä kykenee erottamaan spektristä erikseen joitain satoja eri värejä.","text":"Ihmissilmä kykenee erottamaan spektristä erikseen joitain satoja eri värejä."}
{"phoneme_ids":[1,0,32,0,121,0,39,0,25,0,39,0,26,0,3,0,25,0,120,0,33,0,23,0,14,0,21,0,31,0,121,0,18,0,31,0,32,0,74,0,25,0,3,0,25,0,120,0,33,0,26,0,31,0,18,0,24,0,24,0,74,0,26,0,3,0,34,0,120,0,39,0,30,0,74,0,31,0,121,0,37,0,31,0,32,0,18,0,122,0,25,0,74,0,3,0,120,0,18,0,30,0,27,0,32,0,122,0,14,0,122,0,3,0,31,0,120,0,14,0,32,0,14,0,3,0,120,0,18,0,30,0,74,0,3,0,34,0,120,0,39,0,30,0,74,0,31,0,121,0,39,0,34,0,37,0,39,0,10,0,2],"phonemes":["t","ˌ","æ","m","æ","n"," ","m","ˈ","u","k","a","i","s","ˌ","e","s","t","ɪ","m"," ","m","ˈ","u","n","s","e","l","l","ɪ","n"," ","v","ˈ","æ","r","ɪ","s","ˌ","y","s","t","e","ː","m","ɪ"," ","ˈ","e","r","o","t","ː","a","ː"," ","s","ˈ","a","t","a"," ","ˈ","e","r","ɪ"," ","v","ˈ","æ","r","ɪ","s","ˌ","æ","v","y","æ","."],"processed_text":"Tämän mukaisesti Munsellin värisysteemi erottaa 100 eri värisävyä.","text":"Tämän mukaisesti Munsellin värisysteemi erottaa 100 eri värisävyä."}
{"phoneme_ids":[1,0,28,0,120,0,39,0,122,0,34,0,39,0,30,0,21,0,18,0,26,0,3,0,26,0,120,0,39,0,18,0,26,0,26,0,121,0,39,0,21,0,26,0,18,0,26,0,3,0,120,0,18,0,30,0,74,0,24,0,24,0,74,0,31,0,37,0,122,0,31,0,3,0,27,0,26,0,3,0,120,0,21,0,20,0,25,0,74,0,31,0,18,0,26,0,3,0,26,0,120,0,39,0,23,0,42,0,22,0,121,0,39,0,30,0,22,0,18,0,31,0,32,0,121,0,18,0,24,0,25,0,39,0,26,0,3,0,120,0,27,0,25,0,74,0,26,0,121,0,14,0,21,0,31,0,33,0,122,0,31,0,3,0,22,0,14,0,3,0,28,0,120,0,39,0,122,0,34,0,39,0,30,0,21,0,18,0,26,0,3,0,32,0,120,0,14,0,30,0,23,0,122,0,14,0,3,0,24,0,120,0,33,0,23,0,33,0,25,0,121,0,39,0,122,0,30,0,39,0,3,0,27,0,26,0,3,0,22,0,120,0,27,0,31,0,31,0,14,0,21,0,26,0,3,0,25,0,120,0,39,0,122,0,30,0,74,0,26,0,3,0,34,0,120,0,14,0,28,0,14,0,122,0,34,0,14,0,24,0,121,0,21,0,26,0,32,0,14,0,21,0,26,0,18,0,26,0,10,0,2],"phonemes":["p","ˈ","æ","ː","v","æ","r","i","e","n"," ","n","ˈ","æ","e","n","n","ˌ","æ","i","n","e","n"," ","ˈ","e","r","ɪ","l","l","ɪ","s","y","ː","s"," ","o","n"," ","ˈ","i","h","m","ɪ","s","e","n"," ","n","ˈ","æ","k","ø","j","ˌ","æ","r","j","e","s","t","ˌ","e","l","m","æ","n"," ","ˈ","o","m","ɪ","n","ˌ","a","i","s","u","ː","s"," ","j","a"," ","p","ˈ","æ","ː","v","æ","r","i","e","n"," ","t","ˈ","a","r","k","ː","a"," ","l","ˈ","u","k","u","m","ˌ","æ","ː","r","æ"," ","o","n"," ","j","ˈ","o","s","s","a","i","n"," ","m","ˈ","æ","ː","r","ɪ","n"," ","v","ˈ","a","p","a","ː","v","a","l","ˌ","i","n","t","a","i","n","e","n","."],"processed_text":"Päävärien näennäinen erillisyys on ihmisen näköjärjestelmän ominaisuus ja päävärien tarkka lukumäärä on jossain määrin vapaavalintainen.","text":"Päävärien näennäinen erillisyys on ihmisen näköjärjestelmän ominaisuus ja päävärien tarkka lukumäärä on jossain määrin vapaavalintainen."}
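
Each record above pairs `phonemes` with `phoneme_ids`. Judging only from these records, the id sequence appears to start with 1, end with 2, and carry a 0 between every id. The sketch below reproduces that framing. The constant and function names are hypothetical, and the ids 31, 18, and 10 for "s", "e", and "." are read off the records rather than taken from a phoneme map.

```python
PAD_ID = 0  # appears between every id in the records above
BOS_ID = 1  # first id of every record
EOS_ID = 2  # last id of every record


def add_framing(ids):
    """Interleave pad ids and wrap the sequence in begin/end markers."""
    framed = [BOS_ID]
    for phoneme_id in ids:
        framed.extend([PAD_ID, phoneme_id])
    framed.extend([PAD_ID, EOS_ID])
    return framed


def strip_framing(framed):
    """Recover the bare phoneme ids from a framed sequence."""
    if framed[0] != BOS_ID or framed[-1] != EOS_ID:
        raise ValueError("sequence is not framed by begin/end ids")
    return framed[2:-2:2]
```

Under this reading, a record with N phonemes has 2N + 3 ids, which gives a quick consistency check when comparing the two fields.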

File diff suppressed because one or more lines are too long

@@ -30,6 +30,10 @@ def main():
        choices=("x-low", "medium", "high"),
        help="Quality/size of model (default: medium)",
    )
    parser.add_argument(
        "--resume_from_single_speaker_checkpoint",
        help="For multi-speaker models only. Converts a single-speaker checkpoint to multi-speaker and resumes training",
    )
    Trainer.add_argparse_args(parser)
    VitsModel.add_model_specific_args(parser)
    parser.add_argument("--seed", type=int, default=1234)
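
The new option has no default, so it is `None` unless given. A stand-alone sketch of the option and of the multi-speaker guard that the next hunk adds is shown here. `build_parser` and `check_resume_args` are hypothetical names, and the sketch raises `ValueError` where the diff uses `assert`.

```python
import argparse


def build_parser():
    """Parser with only the option added in this hunk."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--resume_from_single_speaker_checkpoint",
        help="For multi-speaker models only.",
    )
    return parser


def check_resume_args(args, num_speakers):
    """Reject the option when the target model is single-speaker."""
    if args.resume_from_single_speaker_checkpoint and num_speakers <= 1:
        raise ValueError(
            "--resume_from_single_speaker_checkpoint is only for multi-speaker models"
        )
```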
@@ -82,12 +86,60 @@ def main():
        num_speakers=num_speakers,
        sample_rate=sample_rate,
        dataset=[dataset_path],
        **dict_args
        **dict_args,
    )

    if args.resume_from_single_speaker_checkpoint:
        assert (
            num_speakers > 1
        ), "--resume_from_single_speaker_checkpoint is only for multi-speaker models. Use --resume_from_checkpoint for single-speaker models."

        # Load single-speaker checkpoint
        _LOGGER.debug(
            "Resuming from single-speaker checkpoint: %s",
            args.resume_from_single_speaker_checkpoint,
        )
        model_single = VitsModel.load_from_checkpoint(
            args.resume_from_single_speaker_checkpoint,
            dataset=None,
        )

        g_dict = model_single.model_g.state_dict()
        for key in list(g_dict.keys()):
            # Remove keys that can't be copied over due to missing speaker embedding
            if (
                key.startswith("dec.cond")
                or key.startswith("dp.cond")
                or ("enc.cond_layer" in key)
            ):
                g_dict.pop(key, None)

        # Copy over the multi-speaker model, excluding keys related to the
        # speaker embedding (which is missing from the single-speaker model).
        load_state_dict(model.model_g, g_dict)
        load_state_dict(model.model_d, model_single.model_d.state_dict())
        _LOGGER.info(
            "Successfully converted single-speaker checkpoint to multi-speaker"
        )

    trainer.fit(model)


def load_state_dict(model, saved_state_dict):
    state_dict = model.state_dict()
    new_state_dict = {}

    for k, v in state_dict.items():
        if k in saved_state_dict:
            # Use saved value
            new_state_dict[k] = saved_state_dict[k]
        else:
            # Use initialized value
            _LOGGER.debug("%s is not in the checkpoint", k)
            new_state_dict[k] = v

    model.load_state_dict(new_state_dict)
# -----------------------------------------------------------------------------
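
The conversion above has two steps: drop the speaker-conditioned keys from the single-speaker generator state, then overlay what remains onto the freshly initialized multi-speaker state. The sketch below reproduces that logic with plain dictionaries so it runs without PyTorch. The function names and the example keys and values are hypothetical stand-ins for real tensors.

```python
def drop_speaker_conditioned_keys(g_dict):
    """Return a copy without keys tied to the missing speaker embedding."""
    return {
        key: value
        for key, value in g_dict.items()
        if not (
            key.startswith("dec.cond")
            or key.startswith("dp.cond")
            or "enc.cond_layer" in key
        )
    }


def merge_state(current_state, saved_state):
    """Prefer saved values; keep initialized values for keys the checkpoint lacks."""
    return {key: saved_state.get(key, value) for key, value in current_state.items()}


if __name__ == "__main__":
    # Hypothetical keys; strings stand in for weight tensors.
    single = {"enc_p.emb.weight": "trained", "dec.cond.weight": "trained"}
    multi = {
        "enc_p.emb.weight": "init",
        "dec.cond.weight": "init",
        "emb_g.weight": "init",
    }
    print(merge_state(multi, drop_speaker_conditioned_keys(single)))
```

Only keys present in the target model are written, so any extra keys in the checkpoint are ignored, and speaker-related weights keep their initial values until training updates them.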
