Voice Transformer Network: A Transformer for Voice Conversion Using Text-to-Speech-Based Pre-training

Paper link: [arXiv]
Authors: Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda
Comment: Accepted to Interspeech 2020

Abstract: We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pre-training. Seq2seq VC models are attractive owing to their ability to convert prosody. While recurrent- and convolutional-based seq2seq models have been successfully applied to VC, the use of the Transformer network, which has shown promising results in various speech processing tasks, has not yet been investigated. Nonetheless, their data-hungry nature and the mispronunciations in the converted speech make seq2seq models far from practical. To this end, we propose a simple yet effective pre-training technique to transfer knowledge from learned TTS models, which benefit from large-scale, easily accessible TTS corpora. VC models initialized with such pre-trained model parameters are able to generate effective hidden representations for high-fidelity, highly intelligible converted speech. Experimental results show that such a pre-training scheme can facilitate data-efficient training while outperforming an RNN-based seq2seq VC model in terms of intelligibility, naturalness, and similarity.
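The pre-training scheme transfers parameters from TTS models trained on a large corpus into the VC model before fine-tuning on the small parallel VC dataset. Below is a minimal sketch of that parameter transfer, assuming a PyTorch model with `encoder` and `decoder` submodules; the attribute names and the checkpoint layout are hypothetical, not the authors' implementation.

```python
# Minimal sketch of TTS-to-VC parameter transfer (hypothetical names).
import torch
import torch.nn as nn

def init_vc_from_tts(vc_model: nn.Module, tts_ckpt_path: str) -> nn.Module:
    """Initialize a seq2seq VC model from pre-trained TTS parameters.

    Layers that exist only in the VC model (e.g. a prenet that consumes
    acoustic features instead of text embeddings) keep their random
    initialization; strict=False skips keys that do not match.
    """
    # Assumes the checkpoint stores encoder/decoder state dicts under
    # these keys -- an illustrative convention, not a fixed format.
    tts_state = torch.load(tts_ckpt_path, map_location="cpu")
    vc_model.encoder.load_state_dict(tts_state["encoder"], strict=False)
    vc_model.decoder.load_state_dict(tts_state["decoder"], strict=False)
    return vc_model
```

After such an initialization, the whole model would be fine-tuned on the small parallel corpus of the source-target speaker pair.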

Proposed method

Data

The experiments in this study use the CMU Arctic database, with one male source speaker (bdl), one female source speaker (clb), and one female target speaker (slt).
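As a small, hypothetical configuration sketch (not from the paper's code), the two conversion settings evaluated on this page could be declared as:

```python
# Hypothetical declaration of the evaluated conversion pairs.
CONVERSION_PAIRS = [
    {"source": "clb", "target": "slt"},  # female-to-female (intra-gender)
    {"source": "bdl", "target": "slt"},  # male-to-female (cross-gender)
]
```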

Model

The proposed model, the Voice Transformer Network (VTN), is a Transformer-based seq2seq VC model whose encoder and decoder are initialized with parameters transferred from pre-trained TTS models.
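As an illustration only (this page does not specify the architecture details), a Transformer-based seq2seq VC model mapping source mel-spectrogram frames to target frames might look like the following sketch; the layer sizes are placeholders, not the paper's settings.

```python
# Illustrative Transformer seq2seq VC model; sizes are placeholders.
import torch
import torch.nn as nn

class VoiceTransformer(nn.Module):
    """Maps source mel-spectrogram frames to target mel-spectrogram frames."""

    def __init__(self, n_mels: int = 80, d_model: int = 512):
        super().__init__()
        # Prenets project acoustic frames into the model dimension.
        self.src_prenet = nn.Linear(n_mels, d_model)
        self.tgt_prenet = nn.Linear(n_mels, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=6, num_decoder_layers=6,
            batch_first=True,
        )
        # Output projection back to mel-spectrogram frames.
        self.out_proj = nn.Linear(d_model, n_mels)

    def forward(self, src_mel: torch.Tensor, tgt_mel: torch.Tensor) -> torch.Tensor:
        # src_mel: (batch, src_len, n_mels);
        # tgt_mel: teacher-forced, right-shifted target frames.
        causal = self.transformer.generate_square_subsequent_mask(tgt_mel.size(1))
        h = self.transformer(self.src_prenet(src_mel),
                             self.tgt_prenet(tgt_mel),
                             tgt_mask=causal)
        return self.out_proj(h)
```

At conversion time the decoder would run autoregressively, one frame at a time; the teacher-forced formulation above is the standard training setup for such seq2seq models.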

Audio samples

Numbers in parentheses denote the number of training utterances. VTN is the proposed Voice Transformer Network; TTS adaptation and ATTS2S are baseline systems, ATTS2S being the RNN-based seq2seq VC model referred to in the abstract.

Transcription: There were stir and bustle, new faces, and fresh facts.


Model                | clb(F)-slt(F) | bdl(M)-slt(F)
Source               | (audio)       | (audio)
Target               | (audio)       | (audio)
TTS adaptation (932) | (audio)       | (audio)
ATTS2S (932)         | (audio)       | (audio)
VTN (932)            | (audio)       | (audio)
VTN (80)             | (audio)       | (audio)

Transcription: And there was Ethel Baird, whom also you must remember.


Model                | clb(F)-slt(F) | bdl(M)-slt(F)
Source               | (audio)       | (audio)
Target               | (audio)       | (audio)
TTS adaptation (932) | (audio)       | (audio)
ATTS2S (932)         | (audio)       | (audio)
VTN (932)            | (audio)       | (audio)
VTN (80)             | (audio)       | (audio)

Transcription: He had become a man very early in life.


Model                | clb(F)-slt(F) | bdl(M)-slt(F)
Source               | (audio)       | (audio)
Target               | (audio)       | (audio)
TTS adaptation (932) | (audio)       | (audio)
ATTS2S (932)         | (audio)       | (audio)
VTN (932)            | (audio)       | (audio)
VTN (80)             | (audio)       | (audio)
