Authors: KOBAYASHI, Kazuhiro and OGITA, Kenichi (Nagoya Univ./TARVO, Inc.) and DING, Ma and VIOLETA, Lester and HUANG, WenChin and TODA, Tomoki (Nagoya Univ.)
Comments: Submitted to ASJ2024 Autumn (Acoustical Society of Japan, 2024 Autumn Meeting).
Brief explanation (not the abstract of the paper):
This paper presents PESC (Pseudo-electrolaryngeal Speech Corpus), a dataset that consists of Japanese transcripts and speech data from multiple speakers, including both natural speech and electrolarynx (EL) speech. The corpus includes parallel data of 200 utterances per speaker from 14 healthy individuals, comprising both natural and pseudo-EL speech, as well as evaluation data of 50 utterances from 14 laryngectomees.
On this demo page, we present voice conversion (VC) samples. The experiments were conducted in a one-to-one setting using AAS-VC, a non-autoregressive sequence-to-sequence VC model whose implementation is open-sourced (see link above). For each pseudo-speaker pair, we used the 200 parallel utterances, split into 140/10/50 for training/validation/test. For simplicity, a single Parallel WaveGAN vocoder was trained on the training utterances from all 20 speakers.
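The data setup described above can be sketched as follows. This is a minimal illustration, not the authors' recipe: the function names (`split_utterances`, `vocoder_training_set`), the utterance-ID and speaker-ID formats, and the assumption that the 140/10/50 split follows sorted utterance order are all hypothetical. It shows how each pair's 200 parallel utterances are partitioned and how only the training portion of every speaker is pooled for the shared vocoder.

```python
from typing import Dict, List, Tuple

# Split sizes stated in the text: 140 train / 10 validation / 50 test (200 total).
SPLIT_SIZES = (140, 10, 50)


def split_utterances(utt_ids: List[str]) -> Tuple[List[str], List[str], List[str]]:
    """Split one speaker pair's parallel utterance IDs into train/val/test.

    ASSUMPTION (hypothetical): utterances are assigned by sorted ID order,
    first 140 for training, next 10 for validation, last 50 for test.
    The actual assignment used in the paper may differ.
    """
    n_train, n_val, n_test = SPLIT_SIZES
    total = n_train + n_val + n_test
    if len(utt_ids) != total:
        raise ValueError(f"expected {total} utterances, got {len(utt_ids)}")
    ordered = sorted(utt_ids)
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test


def vocoder_training_set(speaker_utts: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Pool only the training utterances of every speaker for one shared vocoder.

    Validation and test utterances are excluded so the vocoder never sees
    the evaluation material.
    """
    pooled: List[Tuple[str, str]] = []
    for speaker, utt_ids in sorted(speaker_utts.items()):
        train, _, _ = split_utterances(utt_ids)
        pooled.extend((speaker, utt) for utt in train)
    return pooled


if __name__ == "__main__":
    # Hypothetical zero-padded IDs; not the corpus's real naming scheme.
    ids = [f"utt{i:03d}" for i in range(1, 201)]
    train, val, test = split_utterances(ids)
    print(len(train), len(val), len(test))  # 140 10 50
```

With 20 speakers contributing 140 training utterances each, the pooled vocoder set in this sketch would contain 2,800 utterances.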