SHEET / MOS-Bench
- MOS-Bench is a benchmark collection designed to evaluate the generalization abilities of subjective speech quality assessment (SSQA) models.
- SHEET stands for the Speech Human Evaluation Estimation Toolkit. It was designed to conduct research experiments with MOS-Bench.
Key Features
- MOS-Bench is the first large-scale collection of training and testing datasets for SSQA, covering a wide range of domains, including synthetic speech from text-to-speech (TTS), voice conversion (VC), and singing voice synthesis (SVS) systems, as well as speech distorted by artificial and real noise, clipping, transmission, reverb, etc. Researchers can use the testing sets to benchmark their SSQA models.
- This repository aims to provide training recipes. While there are many off-the-shelf speech quality evaluators like DNSMOS, SpeechMOS and speechmetrics, most of them do not provide training recipes and are thus not research-oriented. Newcomers may utilize this repo as a starting point for SSQA research.
MOS-Bench
As of September 2025, MOS-Bench has 8 training sets and 17 test sets. See the MOS-Bench page for more.
Installation
Full installation is needed if your goal is to do training.
Editable installation with virtualenv
You don't need to prepare an environment (using conda, etc.) first. The following commands will automatically construct a virtual environment in tools/. When you run the recipes, the scripts will automatically activate the virtual environment.
git clone https://github.com/unilight/sheet.git
cd sheet/tools
make
Usage
I just want to use your trained MOS predictor!
We utilize torch.hub to provide a convenient way to load pre-trained SSQA models and predict scores of wav files or torch tensors.
You can use the _id argument to specify which pre-trained model to use. If not specified, the default model is used. See the list of pre-trained models page for the complete table.
Note
Since SHEET is an ongoing project, if you use our pre-trained models in your paper, it is suggested to specify the version, for instance: SHEET SSL-MOS v0.1.0, SHEET SSL-MOS v0.2.5, etc.
Tip
You don't need to install sheet by following the installation instructions above. However, you might need to install the following:
- sheet-sqa
- huggingface_hub
# load default pre-trained model
>>> predictor = torch.hub.load("unilight/sheet:v0.2.4post3", "sheet_ssqa", trust_repo=True, force_reload=True)
# use `_id` to specify which pre-trained model to use
>>> predictor = torch.hub.load("unilight/sheet:v0.2.4post3", "sheet_ssqa", trust_repo=True, force_reload=True, _id="bvcc/sslmos-wavlm_large/1337")
# if you want to use cuda, use either of the following
>>> predictor = torch.hub.load("unilight/sheet:v0.2.4post3", "sheet_ssqa", trust_repo=True, force_reload=True, cpu=False)
>>> predictor.model.cuda()
# you can either provide a path to your wav file
>>> predictor.predict(wav_path="/path/to/wav/file.wav")
3.6066928
# or provide a torch tensor with shape [num_samples]
>>> predictor.predict(wav=torch.rand(16000))
1.5806346
# if you put the model on cuda...
>>> predictor.predict(wav=torch.rand(16000).cuda())
1.5806346
I am new to MOS prediction research. I want to train models!
You are in the right place! This is the main purpose of SHEET. We provide complete experiment recipes, i.e., sets of scripts to download and process the dataset, and to train and evaluate models.
Please follow the installation instructions first, then see the training guide for how to start.
I already have my MOS predictor. I just want to do benchmarking!
We provide scripts to collect the test sets conveniently. These scripts can be run on Linux-like platforms with basic Python requirements, such that you do not need to install all the heavy packages, like PyTorch.
Please see the benchmarking guide for detailed instructions.
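Once you have predicted scores, SSQA benchmarking typically reports error and correlation metrics at both the utterance and system level. Below is a minimal sketch of how such metrics can be computed with scipy; the function name ssqa_metrics and the input format are illustrative assumptions, not part of SHEET's API.

import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

def ssqa_metrics(pred, true):
    # pred, true: predicted and ground-truth MOS, aligned element-wise
    pred, true = np.asarray(pred, dtype=float), np.asarray(true, dtype=float)
    return {
        "MSE": float(np.mean((pred - true) ** 2)),
        "LCC": float(pearsonr(pred, true)[0]),     # linear correlation
        "SRCC": float(spearmanr(pred, true)[0]),   # rank correlation
        "KTAU": float(kendalltau(pred, true)[0]),  # rank correlation
    }

# utterance level: compare scores of individual files directly
print(ssqa_metrics([3.2, 4.1, 2.5, 3.8], [3.0, 4.5, 2.0, 3.5]))

# system level: average the scores of each system first, then compare the means
per_system = {"sysA": ([3.2, 3.4], [3.0, 3.5]),
              "sysB": ([2.1, 2.3], [2.0, 2.6]),
              "sysC": ([4.0, 4.2], [4.1, 4.4])}
sys_pred = [np.mean(p) for p, _ in per_system.values()]
sys_true = [np.mean(t) for _, t in per_system.values()]
print(ssqa_metrics(sys_pred, sys_true))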
Supported models
LDNet
- Original repo link: https://github.com/unilight/LDNet
- Paper link: https://arxiv.org/abs/2110.09103
- Example config: egs/bvcc/conf/ldnet-ml.yaml
SSL-MOS
- Original repo link: https://github.com/nii-yamagishilab/mos-finetune-ssl
- Paper link: https://arxiv.org/abs/2110.02635
- Example config: egs/bvcc/conf/ssl-mos-wav2vec2.yaml
- Notes: We made some modifications to the original implementation. Please see the paper for more details.
UTMOS (Strong learner)
- Original repo link: https://github.com/sarulab-speech/UTMOS22/tree/master/strong
- Paper link: https://arxiv.org/abs/2204.02152
- Example config: egs/bvcc/conf/utmos-strong.yaml
Note
After discussion with the first author of UTMOS, Takaaki, we feel that UTMOS = SSL-MOS + listener modeling + contrastive loss + several model architecture and training differences. Takaaki also felt that the phoneme and reference inputs are not really helpful for UTMOS strong alone. Therefore we did not implement every component of UTMOS strong; for instance, we did not use domain ID and data augmentation.
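For intuition, here is a hedged sketch of a pairwise contrastive loss of the kind used in UTMOS: it penalizes pairs whose predicted score difference deviates from the ground-truth difference by more than a margin. The exact formulation and margin value in SHEET may differ; treat this as illustrative.

import torch

def pairwise_contrastive_loss(pred, target, margin=0.1):
    # pred, target: (B,) predicted and ground-truth MOS for a minibatch
    dp = pred.unsqueeze(0) - pred.unsqueeze(1)      # pairwise predicted differences
    dt = target.unsqueeze(0) - target.unsqueeze(1)  # pairwise ground-truth differences
    # hinge on the deviation between predicted and true pairwise differences
    return torch.clamp((dp - dt).abs() - margin, min=0.0).mean()

loss = pairwise_contrastive_loss(torch.tensor([3.1, 4.0, 2.2]),
                                 torch.tensor([3.0, 4.5, 2.0]))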
AlignNet
- Original repo link: https://github.com/NTIA/alignnet
- Paper link: https://arxiv.org/abs/2406.10205
- Example config: egs/bvcc+nisqa+pstn+singmos+somos+tencent+tmhint-qi/conf/alignnet-wav2vec2.yaml
Supported features
Modeling
- Listener modeling
- Self-supervised learning (SSL) based encoder, supported by S3PRL (a combined sketch follows the note below)
Note
Find the complete list of supported SSL models here
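To illustrate how these two features fit together, here is a minimal, hypothetical sketch of an SSL-encoder-based predictor with a learned listener embedding, using an S3PRL upstream. The class name, layer sizes, and pooling choice are illustrative assumptions, not SHEET's actual implementation.

import torch
import torch.nn as nn
import s3prl.hub as hub

class ListenerConditionedMOS(nn.Module):
    # hypothetical module for illustration only
    def __init__(self, upstream="wav2vec2", feat_dim=768, num_listeners=300, hidden=256):
        super().__init__()
        self.upstream = getattr(hub, upstream)()          # S3PRL SSL encoder
        self.listener_emb = nn.Embedding(num_listeners, hidden)
        self.head = nn.Sequential(
            nn.Linear(feat_dim + hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, wavs, listener_ids):
        # wavs: list of 1-D waveform tensors (S3PRL pads variable lengths internally)
        feats = self.upstream(wavs)["hidden_states"][-1]  # (B, T, feat_dim)
        pooled = feats.mean(dim=1)                        # utterance-level pooling
        cond = torch.cat([pooled, self.listener_emb(listener_ids)], dim=-1)
        return self.head(cond).squeeze(-1)                # one MOS per utterance

model = ListenerConditionedMOS()
scores = model([torch.randn(16000), torch.randn(24000)], torch.tensor([0, 5]))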
Training
- Automatic best-n model saving and early stopping based on a given validation criterion
- Visualization, including predicted score distributions and scatter plots of utterance-level and system-level scores
- Model averaging (see the sketch after this list)
- Model ensembling by stacking
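As a quick illustration of checkpoint averaging, the sketch below averages the parameters of several saved checkpoints into a single model. It assumes each file holds a raw state_dict; the file names are illustrative and not SHEET's actual checkpoint layout.

import torch

paths = ["checkpoint-1.pkl", "checkpoint-2.pkl", "checkpoint-3.pkl"]
avg = None
for path in paths:
    state = torch.load(path, map_location="cpu")
    if avg is None:
        avg = {k: v.clone().float() for k, v in state.items()}
    else:
        for k in avg:
            avg[k] += state[k].float()
# divide the accumulated parameters by the number of checkpoints
avg = {k: v / len(paths) for k, v in avg.items()}
torch.save(avg, "checkpoint-averaged.pkl")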