On Prosody Modeling for ASR+TTS based Voice Conversion

Paper: [arXiv]
Authors: Wen-Chin Huang, Tomoki Hayashi, Xinjian Li, Shinji Watanabe, Tomoki Toda
Comments: Accepted to ASRU 2021.

Abstract: A promising approach to voice conversion (VC) is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents, which are then taken as input by a text-to-speech (TTS) system to generate the converted speech. Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity. While a few studies have considered source prosody transfer (SPT), which transfers prosodic clues from the source speech, such clues suffer from a speaker mismatch between training and conversion. To address this mismatch, in this work we propose target text prediction (TTP), which directly predicts such clues from the linguistic representation in a target-speaker-dependent manner. We evaluate both methods on the Voice Conversion Challenge 2020 benchmark and consider different linguistic representations. Results demonstrate the effectiveness of the proposed TTP method in both objective and subjective evaluations.

Methods (Models)

Source prosody transfer (SPT)

Target text prediction (TTP)
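The two prosody-modeling options in the ASR+TTS pipeline can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: all function names, the dictionary-based speech representation, and the trivial prosody predictor are hypothetical, chosen only to show where SPT and TTP differ in the conversion flow.

```python
def asr(source_speech):
    """Transcribe source speech into its linguistic representation
    (e.g. text or phones). Toy stand-in for a real ASR model."""
    return source_speech["linguistic"]


def extract_prosody(source_speech):
    """SPT: transfer prosodic clues (e.g. F0, energy) directly from the
    source speech. These clues belong to the source speaker, which causes
    a mismatch with the target speaker seen during TTS training."""
    return source_speech["prosody"]


def predict_prosody(linguistic, target_speaker):
    """TTP: predict prosodic clues from the linguistic representation in a
    target-speaker-dependent manner. Here a trivial constant predictor;
    a real system would use a learned model."""
    return [target_speaker["mean_f0"] for _ in linguistic]


def tts(linguistic, prosody, target_speaker):
    """Synthesize converted speech for the target speaker.
    Toy stand-in returning the inputs it would condition on."""
    return {"speaker": target_speaker["name"],
            "linguistic": linguistic,
            "prosody": prosody}


def convert(source_speech, target_speaker, mode="TTP"):
    """ASR+TTS conversion: ASR extracts linguistic content, prosody comes
    either from the source (SPT) or from a target-dependent predictor (TTP)."""
    linguistic = asr(source_speech)
    if mode == "SPT":
        prosody = extract_prosody(source_speech)
    else:
        prosody = predict_prosody(linguistic, target_speaker)
    return tts(linguistic, prosody, target_speaker)
```

With a toy source utterance, `mode="SPT"` carries the source speaker's prosody into synthesis, while `mode="TTP"` replaces it with target-dependent predictions; only the prosody path changes, the linguistic content is identical in both modes.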


We evaluated our proposed framework on the Voice Conversion Challenge 2020 (VCC 2020) dataset. [Paper] [Datasets]


We compared the two prosody modeling methods on three kinds of linguistic representations.

Speech Samples

Ground Truth Samples

Converted Samples

