Yeah, I don’t have a good intuition for the dataset size required, but my guess is a lot less than TTS, maybe similar to ASR given the one-to-many problem. So I was thinking there is enough public speech data (>10k hrs), plus non-speech datasets, and the two can be mixed together to synthesize training data.
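
Rough sketch of the mixing step, just to make it concrete. It assumes mono clips already loaded as NumPy float arrays at the same sample rate. The name `mix_at_snr` and the choice to scale the non-speech clip to a target SNR are illustrative assumptions, not a specific pipeline, and the random arrays only stand in for real audio.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a non-speech clip into a speech clip at a target SNR (in dB).

    Hypothetical helper for illustration. Returns the mixture and the
    scaled non-speech signal that was added.
    """
    # Loop the non-speech clip if it is shorter, then trim to the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against silent clips

    # Choose the scale so that speech_power / scaled_noise_power = 10^(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    scaled_noise = scale * noise
    return speech + scaled_noise, scaled_noise

# Stand-in signals: 1 s of "speech" and 0.25 s of "non-speech" at 16 kHz.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(4000)

mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=5.0)
```

Sampling `snr_db` randomly per example (and picking a random non-speech clip each time) would let one speech corpus yield many distinct training mixtures.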