
Efficient and Adaptable Speech Enhancement via Pre-Trained Generative Audio Encoders and Vocoders

Recent advances in speech enhancement (SE) have moved beyond traditional mask- or signal-prediction methods, turning instead to pre-trained audio models for richer, more transferable features. These models, such as WavLM, extract meaningful audio embeddings that boost SE performance. Some methods use these embeddings to predict masks or combine them with spectral data for better accuracy, while others explore generative techniques, using neural vocoders to reconstruct clean speech directly from noisy embeddings. Although effective, these approaches often involve freezing the pre-trained model or extensive fine-tuning, which limits adaptability, raises computational costs, and makes transfer to other tasks more difficult.
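To make the "frozen pre-trained encoder" idea concrete, here is a minimal sketch of extracting audio embeddings with WavLM via Hugging Face transformers; the specific checkpoint is an illustrative choice, not one prescribed by the paper.

```python
# Sketch: extract embeddings from noisy audio with a frozen pre-trained encoder.
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
encoder = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
encoder.eval()                        # frozen: no gradient updates
for p in encoder.parameters():
    p.requires_grad = False

waveform = torch.randn(16000)         # stand-in for 1 s of 16 kHz noisy speech
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    embeddings = encoder(**inputs).last_hidden_state   # (1, frames, hidden_dim)
print(embeddings.shape)
```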

Researchers at Xiaomi Inc.'s MiLM Plus have proposed a lightweight and flexible SE method built on pre-trained models. First, a frozen audio encoder extracts audio embeddings from noisy speech. These are then denoised by a small denoise encoder and passed to a vocoder, which generates clean speech. Unlike task-specific models, both the audio encoder and the vocoder are pre-trained separately, allowing the system to be adapted to tasks such as dereverberation or separation. Experiments show that generative audio encoders outperform discriminative ones in terms of speech quality and speaker fidelity. Despite its simplicity, the system is efficient, even surpassing a leading SE model in listening tests.

The proposed speech enhancement system consists of three main components. First, noisy speech passes through a pre-trained audio encoder, which produces noisy audio embeddings. A denoise encoder then refines these embeddings into a cleaner version, which a vocoder finally converts back into speech. Although the denoise encoder and the vocoder are trained separately, both rely on the same frozen pre-trained audio encoder. During training, the denoise encoder minimizes the difference between noisy and clean embeddings, both generated in parallel from paired speech samples, using a mean squared error (MSE) loss. The denoise encoder is built on a ViT architecture with standard activation and normalization layers.
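The sketch below illustrates this training objective, assuming ViT-style Transformer blocks over embedding frames and an MSE loss between denoised and clean embeddings; the module names, depth, and dimensions are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: train a small denoise encoder to map noisy embeddings to clean ones.
import torch
import torch.nn as nn

class DenoiseEncoder(nn.Module):
    def __init__(self, dim=768, depth=3, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            activation="gelu", norm_first=True, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, noisy_emb):      # (batch, frames, dim)
        return self.blocks(noisy_emb)

denoiser = DenoiseEncoder()
optimizer = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)
mse = nn.MSELoss()

# stand-ins for embeddings produced in parallel by the frozen audio encoder
noisy_emb = torch.randn(4, 100, 768)
clean_emb = torch.randn(4, 100, 768)

pred = denoiser(noisy_emb)
loss = mse(pred, clean_emb)            # pull denoised output toward clean embeddings
loss.backward()
optimizer.step()
```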

The vocoder is trained in a self-supervised manner using only clean speech data. It reconstructs speech waveforms from audio embeddings by predicting Fourier spectral coefficients, which are converted back to audio via the inverse short-time Fourier transform (iSTFT). The design is a slightly modified version of the Vocos framework, adapted to accommodate a variety of audio encoders. Training follows a generative adversarial network (GAN) setup: the generator is based on ConvNeXt, the discriminators include multi-period and multi-resolution types, and the loss combines adversarial, reconstruction, and feature-matching terms. Importantly, the audio encoder is a publicly available model whose weights remain frozen throughout the process.
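A hedged sketch of the iSTFT-based reconstruction step: a head predicts complex STFT coefficients (here parameterized as log-magnitude and phase) from embeddings, and torch.istft converts them back to a waveform. The single linear projection and all shapes are simplifying assumptions, not the Vocos-style generator itself.

```python
# Sketch: predict Fourier coefficients from embeddings, invert with iSTFT.
import torch
import torch.nn as nn

n_fft, hop = 1024, 256
freq_bins = n_fft // 2 + 1

head = nn.Linear(768, 2 * freq_bins)         # -> (log-magnitude, phase) per frame

emb = torch.randn(1, 100, 768)               # denoised embeddings (B, frames, dim)
logmag, phase = head(emb).chunk(2, dim=-1)   # each (B, frames, freq_bins)

# assemble a complex spectrogram of shape (B, freq_bins, frames)
spec = torch.polar(logmag.exp(), phase).transpose(1, 2)
wave = torch.istft(spec, n_fft=n_fft, hop_length=hop,
                   window=torch.hann_window(n_fft))
print(wave.shape)                             # reconstructed waveform
```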

Evaluations show that generative audio encoders such as Dasheng consistently outperform discriminative audio encoders. On the DNS1 dataset, Dasheng achieved a speaker similarity score of 0.881, while WavLM and Whisper scored 0.486 and 0.489, respectively. For speech quality, non-intrusive metrics such as DNSMOS and NISQA v2 also show significant improvements, even with a smaller denoise encoder: for example, ViT3 reached a DNSMOS of 4.03 and a NISQA v2 score of 4.41. Subjective listening tests involving 17 participants gave Dasheng a mean opinion score (MOS) of 3.87, ahead of a competing system at 3.11 and LMS at 2.98, highlighting its strong perceptual performance.
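Speaker similarity scores of the kind cited above are conventionally computed as the cosine similarity between speaker embeddings of the enhanced and reference speech. The sketch below assumes such embeddings are already available from some speaker-verification model; the metric itself is just cosine similarity.

```python
# Sketch: speaker similarity as cosine similarity between speaker embeddings.
import torch
import torch.nn.functional as F

def speaker_similarity(emb_enhanced: torch.Tensor,
                       emb_reference: torch.Tensor) -> float:
    """Cosine similarity between two speaker-embedding vectors."""
    return F.cosine_similarity(emb_enhanced.unsqueeze(0),
                               emb_reference.unsqueeze(0)).item()

# stand-in vectors; in practice these come from a speaker-verification model
e_enhanced, e_reference = torch.randn(192), torch.randn(192)
print(speaker_similarity(e_enhanced, e_reference))
```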


In summary, the study proposes a practical and adaptable speech enhancement system built on a pre-trained generative audio encoder and vocoder, avoiding the need to fine-tune the full models. By using a lightweight denoise encoder and reconstructing speech with a pre-trained vocoder, the system achieves both computational efficiency and strong performance. The evaluation showed that generative audio encoders perform significantly better than discriminative ones in terms of speech quality and speaker fidelity, and the compact denoise encoder maintains high perceptual quality even with fewer parameters. Subjective listening tests further confirm that the method delivers better perceptual clarity than existing state-of-the-art models, underscoring its effectiveness and versatility.


Check out the paper and GitHub page. All credit for this research goes to the researchers of this project.


Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.