AAS-VC: On the Generalization Ability of Automatic Alignment Search based Non-autoregressive Sequence-to-sequence Voice Conversion

Paper: [arXiv]
Code: https://github.com/unilight/seq2seq-vc
Authors: Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda.
Comments: Submitted to ICASSP 2024.

Abstract: Non-autoregressive (NAR) sequence-to-sequence (seq2seq) models for voice conversion (VC) are attractive for their ability to effectively model the temporal structure while enjoying boosted intelligibility and fast inference thanks to NAR modeling. However, the dependency of NAR seq2seq VC models on ground truth durations extracted from an autoregressive (AR) model greatly limits their generalization ability to smaller training datasets. In this work, we first show the existence of the above-mentioned problem by varying the training data size. Then, we present AAS-VC, an NAR seq2seq VC model based on automatic alignment search (AAS), which serves as a proper inductive bias to provide the required generalization ability for low-resource settings. Experimental results show that AAS-VC can generalize well to a training dataset of only 5 minutes. We also conducted ablation studies to justify several model design choices. The audio samples and implementation are available online.

Proposed method

Dataset

We conducted all our experiments on the CMU Arctic database.
A male speaker (bdl) and a female speaker (clb) were chosen as source speakers, and a male speaker (rms) and a female speaker (slt) were chosen as the target speakers.
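
The two source speakers and two target speakers above combine into four conversion pairs, which are the column labels used in the sample tables on this page. As a minimal sketch of that enumeration (the names `SOURCE_SPEAKERS`, `TARGET_SPEAKERS`, and `conversion_pairs` are hypothetical and not taken from the seq2seq-vc repository):

```python
from itertools import product

# Speaker IDs and genders as listed in the dataset description above.
# These dictionaries are illustrative; they are not part of the released code.
SOURCE_SPEAKERS = {"clb": "F", "bdl": "M"}
TARGET_SPEAKERS = {"slt": "F", "rms": "M"}


def conversion_pairs():
    """Return labels such as 'clb(F)-slt(F)' for every source-target pair."""
    return [
        f"{src}({SOURCE_SPEAKERS[src]})-{tgt}({TARGET_SPEAKERS[tgt]})"
        # Iterate targets in the outer position so the order matches the table columns.
        for tgt, src in product(TARGET_SPEAKERS, SOURCE_SPEAKERS)
    ]


if __name__ == "__main__":
    for pair in conversion_pairs():
        print(pair)
```

The pairs cover both intra-gender (F-F, M-M) and cross-gender (M-F, F-M) conversion.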

Compared systems

Speech Samples

Transcription: And there was Ethel Baird, whom also you must remember.


Model | Number of training utterances (duration) | clb(F)-slt(F) | bdl(M)-slt(F) | clb(F)-rms(M) | bdl(M)-rms(M)
Source -
Target -
Analysis-Synthesis -
FS2-VC (No PT) 932
FS2-VC (PT) 932
AAS-VC 932
FS2-VC (No PT) 80
FS2-VC (PT) 80
AAS-VC 80

Transcription: It was introduced by Representative Dick of Ohio.


Model | Number of training utterances (duration) | clb(F)-slt(F) | bdl(M)-slt(F) | clb(F)-rms(M) | bdl(M)-rms(M)
Source -
Target -
Analysis-Synthesis -
FS2-VC (No PT) 932
FS2-VC (PT) 932
AAS-VC 932
FS2-VC (No PT) 80
FS2-VC (PT) 80
AAS-VC 80

Transcription: But why continue the tirade, for tirade it was.


Model | Number of training utterances (duration) | clb(F)-slt(F) | bdl(M)-slt(F) | clb(F)-rms(M) | bdl(M)-rms(M)
Source -
Target -
Analysis-Synthesis -
FS2-VC (No PT) 932
FS2-VC (PT) 932
AAS-VC 932
FS2-VC (No PT) 80
FS2-VC (PT) 80
AAS-VC 80
