SHEET / MOS-Bench
- MOS-Bench is a benchmark collection designed to evaluate the generalization abilities of subjective speech quality assessment (SSQA) models.
- SHEET stands for the Speech Human Evaluation Estimation Toolkit. It was designed to conduct research experiments with MOS-Bench.
Key Features
- MOS-Bench is the first large-scale collection of training and testing datasets for SSQA, covering a wide range of domains, including synthetic speech from text-to-speech (TTS), voice conversion (VC), and singing voice synthesis (SVS) systems, as well as speech distorted by artificial and real noise, clipping, transmission, reverb, etc. Researchers can use the testing sets to benchmark their SSQA models.
- This repository aims to provide training recipes. While there are many off-the-shelf speech quality evaluators like DNSMOS, SpeechMOS and speechmetrics, most of them do not provide training recipes and are thus not research-oriented. Newcomers may utilize this repo as a starting point for SSQA research.
MOS-Bench
As of September 2025, MOS-Bench has 8 training sets and 17 test sets. See the MOS-Bench page for more.
Installation
Full installation is needed if your goal is to do training.
Editable installation with virtualenv
You don't need to prepare an environment (using conda, etc.) first. The following commands will automatically construct a virtual environment in tools/. When you run the recipes, the scripts will automatically activate the virtual environment.
git clone https://github.com/unilight/sheet.git
cd sheet/tools
make
Usage
I just want to use your trained MOS predictor!
We utilize torch.hub to provide a convenient way to load pre-trained SSQA models and predict scores of wav files or torch tensors.
You can use the _id argument to specify which pre-trained model to use. If not specified, the default model is used. See the list of pre-trained models page for the complete table.
Note
Since SHEET is an ongoing project, if you use our pre-trained models in your paper, it is suggested to specify the version, for instance: SHEET SSL-MOS v0.1.0, SHEET SSL-MOS v0.2.5, etc.
Tip
You don't need to install sheet by following the installation instructions above. However, you might need to install the following:
- sheet-sqa
- huggingface_hub
# load default pre-trained model
>>> predictor = torch.hub.load("unilight/sheet:v0.2.4post3", "sheet_ssqa", trust_repo=True, force_reload=True)
# use `_id` to specify which pre-trained model to use
>>> predictor = torch.hub.load("unilight/sheet:v0.2.4post3", "sheet_ssqa", trust_repo=True, force_reload=True, _id="bvcc/sslmos-wavlm_large/1337")
# if you want to use cuda, use either of the following
>>> predictor = torch.hub.load("unilight/sheet:v0.2.4post3", "sheet_ssqa", trust_repo=True, force_reload=True, cpu=False)
>>> predictor.model.cuda()
# you can either provide a path to your wav file
>>> predictor.predict(wav_path="/path/to/wav/file.wav")
3.6066928
# or provide a torch tensor with shape [num_samples]
>>> predictor.predict(wav=torch.rand(16000))
1.5806346
# if you put the model on cuda...
>>> predictor.predict(wav=torch.rand(16000).cuda())
1.5806346
I am new to MOS prediction research. I want to train models!
You are in the right place! This is the main purpose of SHEET. We provide complete experiment recipes, i.e., sets of scripts to download and process the dataset, and to train and evaluate models.
Please follow the installation instructions first, then see the training guide for how to start.
I already have my MOS predictor. I just want to do benchmarking!
We provide scripts to collect the test sets conveniently. These scripts can be run on Linux-like platforms with basic Python requirements, such that you do not need to install all the heavy packages, like PyTorch.
Please see the benchmarking guide for detailed instructions.
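Once you have predicted scores, SSQA benchmarking typically reports error and correlation metrics at both the utterance and system level. Below is a minimal sketch of how such metrics can be computed with scipy; the function name ssqa_metrics and the input format are illustrative assumptions, not part of SHEET's API.

import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

def ssqa_metrics(pred, true):
    # pred, true: predicted and ground-truth MOS, aligned element-wise
    pred, true = np.asarray(pred, dtype=float), np.asarray(true, dtype=float)
    return {
        "MSE": float(np.mean((pred - true) ** 2)),
        "LCC": float(pearsonr(pred, true)[0]),     # linear correlation
        "SRCC": float(spearmanr(pred, true)[0]),   # rank correlation
        "KTAU": float(kendalltau(pred, true)[0]),  # rank correlation
    }

# utterance level: compare scores of individual files directly
print(ssqa_metrics([3.2, 4.1, 2.5, 3.8], [3.0, 4.5, 2.0, 3.5]))

# system level: average the scores of each system first, then compare the means
per_system = {"sysA": ([3.2, 3.4], [3.0, 3.5]),
              "sysB": ([2.1, 2.3], [2.0, 2.6]),
              "sysC": ([4.0, 4.2], [4.1, 4.4])}
sys_pred = [np.mean(p) for p, _ in per_system.values()]
sys_true = [np.mean(t) for _, t in per_system.values()]
print(ssqa_metrics(sys_pred, sys_true))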
Supported models
LDNet
- Original repo link: https://github.com/unilight/LDNet
- Paper link: https://arxiv.org/abs/2110.09103
- Example config: egs/bvcc/conf/ldnet-ml.yaml
SSL-MOS
- Original repo link: https://github.com/nii-yamagishilab/mos-finetune-ssl
- Paper link: https://arxiv.org/abs/2110.02635
- Example config: egs/bvcc/conf/ssl-mos-wav2vec2.yaml
- Notes: We made some modifications to the original implementation. Please see the paper for more details.
UTMOS (Strong learner)
- Original repo link: https://github.com/sarulab-speech/UTMOS22/tree/master/strong
- Paper link: https://arxiv.org/abs/2204.02152
- Example config: egs/bvcc/conf/utmos-strong.yaml
Note
After discussion with the first author of UTMOS, Takaaki, we feel that UTMOS = SSL-MOS + listener modeling + contrastive loss + several model architecture and training differences. Takaaki also felt that the phoneme and reference inputs are not really helpful for UTMOS strong alone. Therefore we did not implement every component of UTMOS strong; for instance, we did not use domain ID and data augmentation.
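For intuition, here is a hedged sketch of a pairwise contrastive loss of the kind used in UTMOS: it penalizes pairs whose predicted score difference deviates from the ground-truth difference by more than a margin. The exact formulation and margin value in SHEET may differ; treat this as illustrative.

import torch

def pairwise_contrastive_loss(pred, target, margin=0.1):
    # pred, target: (B,) predicted and ground-truth MOS for a minibatch
    dp = pred.unsqueeze(0) - pred.unsqueeze(1)      # pairwise predicted differences
    dt = target.unsqueeze(0) - target.unsqueeze(1)  # pairwise ground-truth differences
    # hinge on the deviation between predicted and true pairwise differences
    return torch.clamp((dp - dt).abs() - margin, min=0.0).mean()

loss = pairwise_contrastive_loss(torch.tensor([3.1, 4.0, 2.2]),
                                 torch.tensor([3.0, 4.5, 2.0]))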
AlignNet
- Original repo link: https://github.com/NTIA/alignnet
- Paper link: https://arxiv.org/abs/2406.10205
- Example config: egs/bvcc+nisqa+pstn+singmos+somos+tencent+tmhint-qi/conf/alignnet-wav2vec2.yaml
Supported features
Modeling
- Listener modeling
- Self-supervised learning (SSL) based encoder, supported by S3PRL (a combined sketch follows the note below)
Note
Find the complete list of supported SSL models here
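To illustrate how these two features fit together, here is a minimal, hypothetical sketch of an SSL-encoder-based predictor with a learned listener embedding, using an S3PRL upstream. The class name, layer sizes, and pooling choice are illustrative assumptions, not SHEET's actual implementation.

import torch
import torch.nn as nn
import s3prl.hub as hub

class ListenerConditionedMOS(nn.Module):
    # hypothetical module for illustration only
    def __init__(self, upstream="wav2vec2", feat_dim=768, num_listeners=300, hidden=256):
        super().__init__()
        self.upstream = getattr(hub, upstream)()          # S3PRL SSL encoder
        self.listener_emb = nn.Embedding(num_listeners, hidden)
        self.head = nn.Sequential(
            nn.Linear(feat_dim + hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, wavs, listener_ids):
        # wavs: list of 1-D waveform tensors (S3PRL pads variable lengths internally)
        feats = self.upstream(wavs)["hidden_states"][-1]  # (B, T, feat_dim)
        pooled = feats.mean(dim=1)                        # utterance-level pooling
        cond = torch.cat([pooled, self.listener_emb(listener_ids)], dim=-1)
        return self.head(cond).squeeze(-1)                # one MOS per utterance

model = ListenerConditionedMOS()
scores = model([torch.randn(16000), torch.randn(24000)], torch.tensor([0, 5]))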
Training
- Automatic best-n model saving and early stopping based on a given validation criterion
- Visualization, including predicted score distributions and scatter plots of utterance-level and system-level scores
- Model averaging (see the sketch after this list)
- Model ensembling by stacking
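As a quick illustration of checkpoint averaging, the sketch below averages the parameters of several saved checkpoints into a single model. It assumes each file holds a raw state_dict; the file names are illustrative and not SHEET's actual checkpoint layout.

import torch

paths = ["checkpoint-1.pkl", "checkpoint-2.pkl", "checkpoint-3.pkl"]
avg = None
for path in paths:
    state = torch.load(path, map_location="cpu")
    if avg is None:
        avg = {k: v.clone().float() for k, v in state.items()}
    else:
        for k in avg:
            avg[k] += state[k].float()
# divide the accumulated parameters by the number of checkpoints
avg = {k: v / len(paths) for k, v in avg.items()}
torch.save(avg, "checkpoint-averaged.pkl")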