Create Speech Enhancement and Automatic Speech Recognition (ASR) Pipelines in Python Using SpeechBrain
In this tutorial, we walk through an advanced yet practical SpeechBrain workflow. We first use gTTS to generate our own clean speech samples, deliberately add noise to simulate real-world conditions, and then apply SpeechBrain's MetricGAN+ model to enhance the audio. Once the audio is enhanced, we run automatic speech recognition with a CRDNN system that includes an RNN language model and compare word error rates (WER) before and after enhancement. By adopting this step-by-step approach, we experience firsthand how SpeechBrain lets us build a complete speech enhancement and recognition pipeline in just a few lines of code. Check out the full code here.
!pip -q install -U speechbrain gTTS jiwer pydub librosa soundfile torchaudio
!apt -qq install -y ffmpeg >/dev/null
import os, time, math, random, warnings, shutil, glob
warnings.filterwarnings("ignore")
import torch, torchaudio, numpy as np, librosa, soundfile as sf
from gtts import gTTS
from pydub import AudioSegment
from jiwer import wer
from pathlib import Path
from dataclasses import dataclass
from typing import List, Tuple
from IPython.display import Audio, display
from speechbrain.pretrained import EncoderDecoderASR, SpectralMaskEnhancement
root = Path("sb_demo"); root.mkdir(exist_ok=True)
sr = 16000
device = "cuda" if torch.cuda.is_available() else "cpu"
We first set up the Colab environment with all the necessary libraries and tools. We install SpeechBrain and the audio-processing packages, define basic paths and parameters, and select the device so that we are ready to build the speech pipeline. Check out the full code here.
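As an optional sanity check (a small sketch of our own, not part of the original walkthrough), we can confirm the installed package versions and the selected device before moving on:
import torch, torchaudio
from importlib.metadata import version
# Print the versions we are actually running against, plus the compute device.
print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("speechbrain:", version("speechbrain"))
print("device:", device)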
def tts_to_wav(text: str, out_wav: str, lang="en"):
    # Synthesize speech with gTTS, then convert the MP3 to a mono 16 kHz WAV.
    mp3 = out_wav.replace(".wav", ".mp3")
    gTTS(text=text, lang=lang).save(mp3)
    a = AudioSegment.from_file(mp3, format="mp3").set_channels(1).set_frame_rate(sr)
    a.export(out_wav, format="wav")
    os.remove(mp3)

def add_noise(in_wav: str, snr_db: float, out_wav: str):
    # Inject white Gaussian noise scaled to hit the requested SNR (in dB).
    y, _ = librosa.load(in_wav, sr=sr, mono=True)
    rms = np.sqrt(np.mean(y**2) + 1e-12)
    n = np.random.normal(0, 1, len(y))
    n = n / (np.sqrt(np.mean(n**2) + 1e-12))
    target_n_rms = rms / (10**(snr_db/20))
    y_noisy = np.clip(y + n * target_n_rms, -1.0, 1.0)
    sf.write(out_wav, y_noisy, sr)

def play(title, path):
    # Preview an audio file inline in the notebook.
    print(f"▶ {title}: {path}")
    display(Audio(path, rate=sr))

def clean_txt(s: str) -> str:
    # Lowercase, strip punctuation, and collapse whitespace for fair WER scoring.
    return " ".join("".join(ch.lower() if ch.isalnum() or ch.isspace() else " " for ch in s).split())

@dataclass
class Sample:
    # Tracks the reference text plus the three audio variants of one utterance.
    text: str
    clean_wav: str
    noisy_wav: str
    enhanced_wav: str
We define the small utilities that power the pipeline from start to finish. We synthesize speech with gTTS and convert it to WAV, inject controlled Gaussian noise at a target SNR, and add helpers to preview audio and normalize text. We also create a Sample dataclass so we can neatly track the clean, noisy, and enhanced paths of each utterance. Check out the full code here.
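To check that the noise injection really lands near the requested signal-to-noise ratio, here is a minimal sketch (our own addition; the helper name measured_snr_db is not part of the original pipeline) that measures the achieved SNR from a clean/noisy pair:
def measured_snr_db(clean_wav: str, noisy_wav: str) -> float:
    # Load both files at the pipeline sample rate and compare signal vs. residual noise power.
    c, _ = librosa.load(clean_wav, sr=sr, mono=True)
    n, _ = librosa.load(noisy_wav, sr=sr, mono=True)
    m = min(len(c), len(n))
    noise = n[:m] - c[:m]
    return 10 * np.log10((np.mean(c[:m]**2) + 1e-12) / (np.mean(noise**2) + 1e-12))
Calling this on any pair produced by add_noise should come out close to the requested 3 dB or 0 dB; small deviations are expected because the noisy waveform is clipped to [-1, 1].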
sentences = [
    "Artificial intelligence is transforming everyday life.",
    "Open source tools enable rapid research and innovation.",
    "SpeechBrain brings flexible speech pipelines to Python."
]
samples: List[Sample] = []
print("🗣️ Synthesizing short utterances with gTTS...")
for i, s in enumerate(sentences, 1):
    cw = str(root/f"clean_{i}.wav")
    nw = str(root/f"noisy_{i}.wav")
    ew = str(root/f"enhanced_{i}.wav")
    tts_to_wav(s, cw)
    add_noise(cw, snr_db=3.0 if i % 2 else 0.0, out_wav=nw)
    samples.append(Sample(text=s, clean_wav=cw, noisy_wav=nw, enhanced_wav=ew))
play("Clean #1", samples[0].clean_wav)
play("Noisy #1", samples[0].noisy_wav)
print("⬇️ Loading pretrained models (this downloads once) ...")
asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    run_opts={"device": device},
    savedir=str(root/"pretrained_asr"),
)
enhancer = SpectralMaskEnhancement.from_hparams(
    source="speechbrain/metricgan-plus-voicebank",
    run_opts={"device": device},
    savedir=str(root/"pretrained_enh"),
)
In this step, we use gTTS to generate three spoken utterances, save clean and noisy versions of each, and organize them into our Sample objects. We then load SpeechBrain's pretrained ASR and MetricGAN+ enhancement models, which gives us everything we need to turn noisy audio into enhanced speech and transcripts. Check out the full code here.
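Before running the full comparison, a quick smoke test like the sketch below (our own optional addition) confirms that the ASR model loads and decodes a clean file end to end:
# Optional smoke test: transcribe the first clean utterance with the pretrained ASR model.
print("Reference :", samples[0].text)
print("Hypothesis:", asr.transcribe_file(samples[0].clean_wav))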
def enhance_file(in_wav: str, out_wav: str):
    # Run MetricGAN+ on a noisy file and save the enhanced waveform.
    sig = enhancer.enhance_file(in_wav)
    if sig.dim() == 1:
        sig = sig.unsqueeze(0)
    torchaudio.save(out_wav, sig.cpu(), sr)

def transcribe(path: str) -> str:
    # Decode a file with the CRDNN+RNNLM ASR model and normalize the text.
    hyp = asr.transcribe_file(path)
    return clean_txt(hyp)

def eval_pair(ref_text: str, wav_path: str) -> Tuple[str, float]:
    # Return the hypothesis and its WER against the normalized reference.
    hyp = transcribe(wav_path)
    return hyp, wer(clean_txt(ref_text), hyp)
print("n🔬 Transcribing noisy vs enhanced (MetricGAN+)...")
rows = []
t0 = time.time()
for smp in samples:
enhance_file(smp.noisy_wav, smp.enhanced_wav)
hyp_noisy, wer_noisy = eval_pair(smp.text, smp.noisy_wav)
hyp_enh, wer_enh = eval_pair(smp.text, smp.enhanced_wav)
rows.append((smp.text, hyp_noisy, wer_noisy, hyp_enh, wer_enh))
t1 = time.time()
We create helper functions to enhance noisy audio, transcribe speech, and evaluate it against the reference text. We then run these steps over all samples, comparing the noisy and enhanced versions and recording the transcriptions, error rates, and processing time. Check out the full code here.
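If we also want a character-level view of recognition quality, jiwer exposes a cer function alongside wer; the small sketch below (an optional addition, not part of the original pipeline) reuses the existing helpers:
from jiwer import cer

def eval_pair_cer(ref_text: str, wav_path: str) -> Tuple[str, float, float]:
    # Return the hypothesis plus word- and character-error rates against the reference.
    hyp = transcribe(wav_path)
    ref = clean_txt(ref_text)
    return hyp, wer(ref, hyp), cer(ref, hyp)
CER is often more forgiving on short utterances, so it can be a useful second signal when comparing the noisy and enhanced outputs.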
def fmt(x): return f"{x:.3f}" if isinstance(x, float) else x
print(f"\n⏱️ Inference time: {t1 - t0:.2f}s on {device.upper()}")
print("\n# ---- Results (Noisy → Enhanced) ----")
for i, (ref, hN, wN, hE, wE) in enumerate(rows, 1):
    print(f"\nUtterance {i}")
    print("Ref: ", ref)
    print("Noisy ASR:", hN)
    print("WER noisy:", fmt(wN))
    print("Enh ASR: ", hE)
    print("WER enh: ", fmt(wE))
print("n🧵 Batch decoding (looping API):")
batch_files = [s.clean_wav for s in samples] + [s.noisy_wav for s in samples]
bt0 = time.time()
batch_hyps = [transcribe(p) for p in batch_files]
bt1 = time.time()
for p, h in zip(batch_files, batch_hyps):
print(os.path.basename(p), "->", h[:80] + ("..." if len(h) > 80 else ""))
print(f"⏱️ Batch elapsed: {bt1 - bt0:.2f}s")
play("Enhanced #1 (MetricGAN+)", samples[0].enhanced_wav)
avg_wn = sum(wN for _,_,wN,_,_ in rows) / len(rows)
avg_we = sum(wE for _,_,_,_,wE in rows) / len(rows)
print("n📈 Summary:")
print(f"Avg WER (Noisy): {avg_wn:.3f}")
print(f"Avg WER (Enhanced): {avg_we:.3f}")
print("Tip: Try different SNRs or longer texts, and switch device to GPU if available.")
We wrap up the experiment by timing inference and printing each utterance's transcription and WER before and after enhancement. We also batch-decode multiple files, listen to an enhanced sample, and report the average WERs, so we can clearly see the benefit of MetricGAN+ in the pipeline.
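To keep the numbers around for later analysis, one simple option (a sketch of our own; the filename results.csv is hypothetical) is to dump the rows list to a CSV file:
import csv

# Persist the per-utterance results so they can be inspected or plotted later.
with open(root / "results.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["reference", "noisy_hyp", "wer_noisy", "enhanced_hyp", "wer_enhanced"])
    w.writerows(rows)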
All in all, we clearly see the power of integrating speech enhancement and ASR into a unified pipeline with SpeechBrain. By generating audio, corrupting it with noise, enhancing it, and finally transcribing it, we gain first-hand insight into how these models improve recognition accuracy in noisy environments. The results highlight the practical benefits of open-source speech technology, and we end up with a working framework that can easily scale to larger datasets, different enhancement models, or custom ASR tasks.