Advanced
In reply to @giu
Katsuya@kn
8/6/2023

Yeah, I don’t have good intuition for the dataset size required, but my guess is a lot less than TTS, maybe similar to ASR given that it's a one-to-many problem. So I was thinking there are enough public speech datasets (>10k hrs), plus non-speech datasets, that can be mixed together to synthesize training data.
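
A rough sketch of what "mixed together to synthesize" could look like, assuming the goal is to pair a clean speech clip with a speech-plus-noise mixture at a chosen signal-to-noise ratio. The function names, the 16 kHz rate, and the toy sine/white-noise signals are illustrative assumptions, not taken from any specific dataset or pipeline:

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Return (mixture, target), with noise scaled so the mix has snr_db."""
    if rms(speech) == 0 or rms(noise) == 0:
        raise ValueError("speech and noise must both be non-silent")
    # Loop the noise clip if it is shorter than the speech clip.
    reps = -(-len(speech) // len(noise))  # ceiling division
    noise = (noise * reps)[:len(speech)]
    # Pick gain so that 20*log10(rms(speech) / rms(gain*noise)) == snr_db.
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    mixture = [s + gain * n for s, n in zip(speech, noise)]
    return mixture, speech


# Toy stand-ins for real recordings (hypothetical data, 16 kHz assumed).
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(4000)]

mixture, target = mix_at_snr(speech, noise, snr_db=5.0)
```

In practice the SNR would probably be sampled randomly per example, so the same speech and non-speech clips can be reused to produce many distinct training pairs.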

In reply to @kn
Katsuya@kn
8/6/2023

*one-to-one problem