End-to-end Binaural Speech Synthesis

In this work, we aim to realize a binaural communication system that is capable of (a) encoding the transmitter's audio into low-bitrate neural codes and (b) synthesizing binaural audio from these codes, including environmental factors such as room reverb and the noise floor.
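
To give a feel for what "low-bitrate" can mean in step (a), the following is a minimal sketch that compares the bitrate of a discrete code stream with that of raw mono PCM. It assumes the neural code is a sequence of quantized codebook indices, which this page does not state. The function names and all numbers (frame rate, number of codebooks, codebook size, sample rate) are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def code_bitrate_bps(frames_per_second: float,
                     codebooks_per_frame: int,
                     codebook_size: int) -> float:
    """Bitrate (bits/s) of a stream of discrete codebook indices.

    Assumption: each frame carries `codebooks_per_frame` indices, and each
    index needs log2(codebook_size) bits. This is generic arithmetic for
    quantized codes, not the paper's actual configuration.
    """
    bits_per_frame = codebooks_per_frame * math.log2(codebook_size)
    return frames_per_second * bits_per_frame


def pcm_bitrate_bps(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Bitrate (bits/s) of uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample * channels


# Hypothetical example values (NOT from the paper).
code_bps = code_bitrate_bps(frames_per_second=50, codebooks_per_frame=8, codebook_size=1024)
pcm_bps = pcm_bitrate_bps(sample_rate_hz=48_000, bits_per_sample=16, channels=1)

print(code_bps)            # 4000.0
print(pcm_bps)             # 768000
print(pcm_bps / code_bps)  # 192.0
```

Under these made-up settings the code stream is 192 times smaller than the raw mono input, which is the kind of gap that makes transmitting codes instead of waveforms attractive. The receiver in step (b) would then expand the codes back into two channels.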

Model illustration.

Audio samples

Please use headphones to listen to the following audio samples.
Sample 1: Pay attention to how the "baseline" and the "decoder-only" lack the environmental sounds, which are present in "ground truth" and "proposed".
Input mono speech | Baseline | Decoder only | Proposed | Ground truth
Sample 2: Pay attention to the faint sounds of footsteps and rustling clothes, which are missing from the "baseline" and the "decoder-only" but present in "ground truth" and "proposed".
Input mono speech | Baseline | Decoder only | Proposed | Ground truth
Sample 3: Although we focused on speech, we also tested the system with singing input. Here the "proposed" suffers from data mismatch, and its quality is greatly degraded. However, it still captures environmental sounds such as room reverb and noise.
Input mono speech | Baseline | Decoder only | Proposed | Ground truth