P7-01: A unified model for zero-shot singing voice conversion and synthesis

Wu, Jui-Te, Wang, Jun-You, Jang, Jyh-Shing Roger, Su, Li*

Subjects (starting with primary): MIR tasks -> music synthesis and transformation ; MIR tasks -> music generation ; Musical features and properties -> timbre, instrumentation, and singing voice

Presented Virtually: 4-minute short-format presentation

Abstract:

Recent advances in deep learning not only facilitate the implementation of zero-shot singing voice synthesis (SVS) and singing voice conversion (SVC) tasks but also provide the opportunity to unify these two tasks into one generalized model. In this paper, we propose such a model, which generates the singing voice of any target singer from any source singing content in either text or audio format. The model incorporates self-supervised joint training of the phonetic encoder and the acoustic encoder, with an audio-to-phoneme alignment process in each training step, so that these encoders map the audio and text data, respectively, into a shared, temporally aligned, and singer-agnostic latent space. The target singer’s latent representations, encoded at different granularity levels, are trained to match the source latent representations sequentially via attention mechanisms in the decoding stage. This enables the model to generate an unseen target singer’s voice with fine-grained resolution from either text or audio sources. Both objective and subjective experiments confirmed that the proposed model is competitive with state-of-the-art SVC and SVS methods.
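
To illustrate the unified design described above, the sketch below shows, in PyTorch, how a phonetic encoder (text path, SVS) and an acoustic encoder (audio path, SVC) can project into one shared latent space that a single attention-based decoder consumes together with reference embeddings of the target singer. This is a minimal, hypothetical skeleton, not the authors' implementation: all module names, dimensions, and the toy forward pass are assumptions, and the paper's alignment procedure, granularity levels, and training losses are not reproduced.

    # Hypothetical sketch (not the authors' code): two encoders share one latent space,
    # and one attention decoder serves both the SVS (text) and SVC (audio) paths.
    import torch
    import torch.nn as nn

    class PhoneticEncoder(nn.Module):
        """Maps a phoneme-ID sequence (text path) to the shared latent space."""
        def __init__(self, n_phonemes=100, d_model=256):
            super().__init__()
            self.embed = nn.Embedding(n_phonemes, d_model)
            self.rnn = nn.GRU(d_model, d_model, batch_first=True, bidirectional=True)
            self.proj = nn.Linear(2 * d_model, d_model)

        def forward(self, phoneme_ids):                    # (B, T_text)
            h, _ = self.rnn(self.embed(phoneme_ids))
            return self.proj(h)                            # (B, T_text, d_model)

    class AcousticEncoder(nn.Module):
        """Maps source-audio features (audio path) to the same latent space,
        ideally stripped of singer identity (singer-agnostic)."""
        def __init__(self, n_mels=80, d_model=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(n_mels, d_model, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(d_model, d_model, kernel_size=5, padding=2),
            )

        def forward(self, mel):                            # (B, T_audio, n_mels)
            return self.net(mel.transpose(1, 2)).transpose(1, 2)  # (B, T_audio, d_model)

    class AttentionDecoder(nn.Module):
        """Cross-attends from source latents to target-singer reference embeddings,
        then predicts mel frames of the converted/synthesized voice."""
        def __init__(self, d_model=256, n_mels=80):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
            self.out = nn.Linear(d_model, n_mels)

        def forward(self, source_latents, target_ref):     # both (B, T, d_model)
            mixed, _ = self.attn(query=source_latents, key=target_ref, value=target_ref)
            return self.out(mixed)                         # (B, T_source, n_mels)

    # Toy forward pass: either encoder can feed the same decoder, so one model
    # covers both SVS (from text) and SVC (from audio).
    phon_enc, acou_enc, dec = PhoneticEncoder(), AcousticEncoder(), AttentionDecoder()
    text_latents = phon_enc(torch.randint(0, 100, (1, 50)))
    audio_latents = acou_enc(torch.randn(1, 200, 80))
    target_ref = acou_enc(torch.randn(1, 120, 80))          # reference clips of the unseen target singer
    mel_from_text = dec(text_latents, target_ref)           # SVS path
    mel_from_audio = dec(audio_latents, target_ref)          # SVC path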
