A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion

Paper: [arXiv]

Authors: Wen-Chin Huang, Kazuhiro Kobayashi, Yu-Huai Peng, Yu Tsao, Hsin-Min Wang, Tomoki Toda

Comments: Accepted to Interspeech 2021.

Abstract: We propose a new paradigm for maintaining speaker identity in dysarthric voice conversion (DVC). The poor quality of dysarthric speech can be greatly improved by statistical VC, but because normal speech utterances of dysarthria patients are nearly impossible to collect, previous work failed to recover the individuality of the patient. In light of this, we suggest a novel, two-stage approach for DVC, which is highly flexible in that no normal speech of the patient is required. First, a powerful parallel sequence-to-sequence model converts the input dysarthric speech into normal speech of a reference speaker as an intermediate product. A nonparallel, frame-wise VC model realized with a variational autoencoder then converts the speaker identity of the reference speech back to that of the patient, and is assumed to be capable of preserving the enhanced quality. We investigate several design choices and report experimental evaluation results, demonstrating the potential of our approach to improve the quality of dysarthric speech while maintaining the speaker identity.
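
The two-stage flow described in the abstract can be sketched as a simple composition of two converters. This is only an illustrative sketch, not the authors' implementation: the names `two_stage_convert`, `toy_stage1`, and `toy_stage2` are hypothetical, and the toy stages are placeholders standing in for the trained sequence-to-sequence model and the variational-autoencoder VC model. It shows one structural point only: the first stage maps a whole utterance and may change its length, while the second stage works frame by frame and so keeps the length of the intermediate speech.

```python
from typing import Callable, List

Frame = List[float]        # one acoustic feature frame (hypothetical representation)
Utterance = List[Frame]    # a sequence of frames


def two_stage_convert(
    dysarthric: Utterance,
    stage1: Callable[[Utterance], Utterance],
    stage2: Callable[[Frame], Frame],
) -> Utterance:
    """Chain an utterance-level converter with a frame-wise converter."""
    # Stage 1: dysarthric speech -> normal speech of a reference speaker.
    # A sequence-to-sequence model may output a different number of frames.
    intermediate = stage1(dysarthric)
    # Stage 2: reference speaker -> patient identity, applied to each frame
    # independently, so the frame count of the intermediate is preserved.
    return [stage2(frame) for frame in intermediate]


def toy_stage1(utterance: Utterance) -> Utterance:
    # Placeholder for the seq2seq model: keep every other frame to mimic
    # a change in duration. Not a real conversion.
    return utterance[::2]


def toy_stage2(frame: Frame) -> Frame:
    # Placeholder for the frame-wise VAE model: shift every feature by 1.0.
    # Not a real conversion.
    return [x + 1.0 for x in frame]


dysarthric = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
converted = two_stage_convert(dysarthric, toy_stage1, toy_stage2)
print(converted)
```

In a real system the two callables would wrap trained models; the composition itself is the only part taken from the abstract.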