English abstract / 英語要旨: This paper presents JATTS, our initiative to build an open-source toolkit that implements a comprehensive set of representative, modern text-to-speech (TTS) methods, together with benchmark results on Japanese datasets. We analyze how design choices, including alignment strategies, model architectures, and training objectives, affect synthesis quality. We also explore the prevalent in-context-learning-based approaches to large-scale TTS and discuss practical challenges in training and evaluation. Our findings provide insights into building more expressive and robust Japanese TTS systems and highlight the need for better datasets and benchmarks for future research.
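Japanese abstract / 和文要旨: This paper describes the development of JATTS, an open-source toolkit that comprehensively implements representative Japanese text-to-speech methods proposed since 2021. Through comparative experiments on Japanese datasets, we analyze how alignment methods, model architectures, and objective functions affect speech synthesis performance.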