How to build a high-level voice AI pipeline using WhisperX for transcription, alignment, analysis, and export?

In this tutorial, we walk through an advanced WhisperX implementation, exploring transcription, alignment, and word-level timestamps in detail. We set up the environment, load and preprocess the audio, and then run the complete pipeline from transcription to alignment and analysis while keeping memory usage low and supporting batch processing. Along the way, we visualize the results, export them in multiple formats, and extract keywords to gain deeper insights from the audio content. Check out the full code here.

!pip install -q git+
!pip install -q pandas matplotlib seaborn


import whisperx
import torch
import gc
import os
import json
import pandas as pd
from pathlib import Path
from IPython.display import Audio, display, HTML
import warnings
warnings.filterwarnings('ignore')


CONFIG = {
   "device": "cuda" if torch.cuda.is_available() else "cpu",
   "compute_type": "float16" if torch.cuda.is_available() else "int8",
   "batch_size": 16, 
   "model_size": "base", 
   "language": None, 
}


print(f"πŸš€ Running on: {CONFIG['device']}")
print(f"πŸ“Š Compute type: {CONFIG['compute_type']}")
print(f"🎯 Model: {CONFIG['model_size']}")

We first install WhisperX and the required libraries, and then configure our settings. We detect whether a CUDA GPU is available, select the compute type accordingly, and set parameters such as batch size, model size, and language to prepare for transcription. Check out the full code here.
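
If we already know the audio's language or want to trade speed for accuracy, we can override these settings before running anything else. The values below are purely illustrative:

# Illustrative overrides (example values, not requirements)
CONFIG["model_size"] = "small"   # tiny / base / small / medium / large-v2
CONFIG["language"] = "en"        # skip automatic language detection
CONFIG["batch_size"] = 8         # reduce if you hit GPU out-of-memory errors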

def download_sample_audio():
   """Download a sample audio file for testing"""
   !wget -q -O sample.mp3 
   print("βœ… Sample audio downloaded")
   return "sample.mp3"


def load_and_analyze_audio(audio_path):
   """Load audio and display basic info"""
   audio = whisperx.load_audio(audio_path)
   duration = len(audio) / 16000  # whisperx.load_audio returns 16 kHz mono audio
   print(f"πŸ“ Audio: {Path(audio_path).name}")
   print(f"⏱️  Duration: {duration:.2f} seconds")
   print(f"🎡 Sample rate: 16000 Hz")
   display(Audio(audio_path))
   return audio, duration


def transcribe_audio(audio, model_size=CONFIG["model_size"], language=None):
   """Transcribe audio using WhisperX (batched inference)"""
   print("n🎀 STEP 1: Transcribing audio...")
  
   model = whisperx.load_model(
       model_size,
       CONFIG["device"],
       compute_type=CONFIG["compute_type"]
   )
  
   transcribe_kwargs = {
       "batch_size": CONFIG["batch_size"]
   }
   if language:
       transcribe_kwargs["language"] = language
  
   result = model.transcribe(audio, **transcribe_kwargs)
  
   total_segments = len(result["segments"])
   total_words = sum(len(seg.get("words", [])) for seg in result["segments"])
  
   del model
   gc.collect()
   if CONFIG["device"] == "cuda":
       torch.cuda.empty_cache()
  
   print(f"βœ… Transcription complete!")
   print(f"   Language: {result['language']}")
   print(f"   Segments: {total_segments}")
   print(f"   Total text length: {sum(len(seg['text']) for seg in result['segments'])} characters")
  
   return result

We download the sample audio file, load it for analysis, and then transcribe it with WhisperX. We set up batched inference using the selected model size and configuration, and print key details such as the detected language, segment count, and total text length. Check out the full code here.
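
To make the returned structure concrete, here is a minimal usage sketch. It assumes a local file named sample.mp3 and simply prints the first segment produced by transcribe_audio:

audio = whisperx.load_audio("sample.mp3")  # assumes this file exists locally
result = transcribe_audio(audio, model_size=CONFIG["model_size"])
first_seg = result["segments"][0]
print(first_seg["start"], first_seg["end"], first_seg["text"])  # segment-level timing and text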


def align_transcription(segments, audio, language_code):
   """Align transcription for accurate word-level timestamps"""
   print("n🎯 STEP 2: Aligning for word-level timestamps...")
  
   try:
       model_a, metadata = whisperx.load_align_model(
           language_code=language_code,
           device=CONFIG["device"]
       )
      
       result = whisperx.align(
           segments,
           model_a,
           metadata,
           audio,
           CONFIG["device"],
           return_char_alignments=False
       )
      
       total_words = sum(len(seg.get("words", [])) for seg in result["segments"])
      
       del model_a
       gc.collect()
       if CONFIG["device"] == "cuda":
           torch.cuda.empty_cache()
      
       print(f"βœ… Alignment complete!")
       print(f"   Aligned words: {total_words}")
      
       return result
   except Exception as e:
       print(f"⚠️  Alignment failed: {str(e)}")
       print("   Continuing with segment-level timestamps only...")
       return {"segments": segments, "word_segments": []}

We align the transcription to generate precise word-level timestamps. By loading the alignment model and applying it to the audio, we refine timing accuracy, report the total number of aligned words, and clear memory to keep processing efficient. Check out the full code here.
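
Once alignment succeeds, each segment carries a words list with per-word timing and a confidence score. Here is a minimal sketch of how we might inspect it, assuming result and audio come from the previous step:

aligned = align_transcription(result["segments"], audio, result["language"])
for seg in aligned["segments"][:2]:   # look at the first two segments only
    for word in seg.get("words", []):
        # each aligned word has its own start/end time plus an alignment score
        print(f"{word['word']:>15}  {word['start']:.2f}s -> {word['end']:.2f}s  (score={word.get('score', 0):.2f})")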

def analyze_transcription(result):
   """Generate statistics about the transcription"""
   print("nπŸ“Š TRANSCRIPTION STATISTICS")
   print("="*70)
  
   segments = result["segments"]
  
   total_duration = max(seg["end"] for seg in segments) if segments else 0
   total_words = sum(len(seg.get("words", [])) for seg in segments)
   total_chars = sum(len(seg["text"].strip()) for seg in segments)
  
   print(f"Total duration: {total_duration:.2f} seconds")
   print(f"Total segments: {len(segments)}")
   print(f"Total words: {total_words}")
   print(f"Total characters: {total_chars}")
  
   if total_duration > 0:
       print(f"Words per minute: {(total_words / total_duration * 60):.1f}")
  
   pauses = []
   for i in range(len(segments) - 1):
       pause = segments[i+1]["start"] - segments[i]["end"]
       if pause > 0:
           pauses.append(pause)
  
   if pauses:
       print(f"Average pause between segments: {sum(pauses)/len(pauses):.2f}s")
       print(f"Longest pause: {max(pauses):.2f}s")
  
   word_durations = []
   for seg in segments:
       if "words" in seg:
           for word in seg["words"]:
               duration = word["end"] - word["start"]
               word_durations.append(duration)
  
   if word_durations:
       print(f"Average word duration: {sum(word_durations)/len(word_durations):.3f}s")
  
   print("="*70)

We analyze the transcription by generating detailed statistics such as total duration, segment count, word count, and character count. We also calculate words per minute, the pauses between segments, and average word durations to better understand the pacing and flow of the audio. Check out the full code here.

def display_results(result, show_words=False, max_rows=50):
   """Display transcription results in formatted table"""
   data = []
  
   for seg in result["segments"]:
       text = seg["text"].strip()
       start = f"{seg['start']:.2f}s"
       end = f"{seg['end']:.2f}s"
       duration = f"{seg['end'] - seg['start']:.2f}s"
      
       if show_words and "words" in seg:
           for word in seg["words"]:
               data.append({
                   "Start": f"{word['start']:.2f}s",
                   "End": f"{word['end']:.2f}s",
                   "Duration": f"{word['end'] - word['start']:.3f}s",
                   "Text": word["word"],
                   "Score": f"{word.get('score', 0):.2f}"
               })
       else:
           data.append({
               "Start": start,
               "End": end,
               "Duration": duration,
               "Text": text
           })
  
   df = pd.DataFrame(data)
  
   if len(df) > max_rows:
       print(f"Showing first {max_rows} rows of {len(df)} total...")
       display(HTML(df.head(max_rows).to_html(index=False)))
   else:
       display(HTML(df.to_html(index=False)))
  
   return df


def export_results(result, output_dir="output", filename="transcript"):
   """Export results in multiple formats"""
   os.makedirs(output_dir, exist_ok=True)
  
   json_path = f"{output_dir}/{filename}.json"
   with open(json_path, "w", encoding="utf-8") as f:
       json.dump(result, f, indent=2, ensure_ascii=False)
  
   srt_path = f"{output_dir}/{filename}.srt"
   with open(srt_path, "w", encoding="utf-8") as f:
       for i, seg in enumerate(result["segments"], 1):
           start = format_timestamp(seg["start"])
           end = format_timestamp(seg["end"])
           f.write(f"{i}n{start} --> {end}n{seg['text'].strip()}nn")
  
   vtt_path = f"{output_dir}/{filename}.vtt"
   with open(vtt_path, "w", encoding="utf-8") as f:
       f.write("WEBVTTnn")
       for i, seg in enumerate(result["segments"], 1):
           start = format_timestamp_vtt(seg["start"])
           end = format_timestamp_vtt(seg["end"])
           f.write(f"{start} --> {end}n{seg['text'].strip()}nn")
  
   txt_path = f"{output_dir}/{filename}.txt"
   with open(txt_path, "w", encoding="utf-8") as f:
       for seg in result["segments"]:
           f.write(f"{seg['text'].strip()}n")
  
   csv_path = f"{output_dir}/{filename}.csv"
   df_data = []
   for seg in result["segments"]:
       df_data.append({
           "start": seg["start"],
           "end": seg["end"],
           "text": seg["text"].strip()
       })
   pd.DataFrame(df_data).to_csv(csv_path, index=False)
  
   print(f"nπŸ’Ύ Results exported to '{output_dir}/' directory:")
   print(f"   βœ“ {filename}.json (full structured data)")
   print(f"   βœ“ {filename}.srt (subtitles)")
   print(f"   βœ“ {filename}.vtt (web video subtitles)")
   print(f"   βœ“ {filename}.txt (plain text)")
   print(f"   βœ“ {filename}.csv (timestamps + text)")


def format_timestamp(seconds):
   """Convert seconds to SRT timestamp format"""
   hours = int(seconds // 3600)
   minutes = int((seconds % 3600) // 60)
   secs = int(seconds % 60)
   millis = int((seconds % 1) * 1000)
   return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def format_timestamp_vtt(seconds):
   """Convert seconds to VTT timestamp format"""
   hours = int(seconds // 3600)
   minutes = int((seconds % 3600) // 60)
   secs = int(seconds % 60)
   millis = int((seconds % 1) * 1000)
   return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def batch_process_files(audio_files, output_dir="batch_output"):
   """Process multiple audio files in batch"""
   print(f"nπŸ“¦ Batch processing {len(audio_files)} files...")
   results = {}
  
   for i, audio_path in enumerate(audio_files, 1):
       print(f"n[{i}/{len(audio_files)}] Processing: {Path(audio_path).name}")
       try:
           result, _ = process_audio_file(audio_path, show_output=False)
           results[audio_path] = result
          
           filename = Path(audio_path).stem
           export_results(result, output_dir, filename)
       except Exception as e:
           print(f"❌ Error processing {audio_path}: {str(e)}")
           results[audio_path] = None
  
   print(f"nβœ… Batch processing complete! Processed {len(results)} files.")
   return results


def extract_keywords(result, top_n=10):
   """Extract most common words from transcription"""
   from collections import Counter
   import re
  
   text = " ".join(seg["text"] for seg in result["segments"])
  
   words = re.findall(r'\b\w+\b', text.lower())
  
   stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
                 'of', 'with', 'is', 'was', 'are', 'were', 'be', 'been', 'being',
                 'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could',
                 'should', 'may', 'might', 'must', 'can', 'this', 'that', 'these', 'those'}
  
   filtered_words = [w for w in words if w not in stop_words and len(w) > 2]
  
   word_counts = Counter(filtered_words).most_common(top_n)
  
   print(f"nπŸ”‘ Top {top_n} Keywords:")
   for word, count in word_counts:
       print(f"   {word}: {count}")
  
   return word_counts

We format the results into clean tables, export transcripts to JSON, SRT, VTT, TXT, and CSV, and rely on the timestamp helper functions to keep SRT and VTT timings accurate. We also batch-process multiple audio files end-to-end and extract the top keywords, letting us quickly turn raw transcripts into artifacts we can analyze. Check out the full code here.
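
Because the CSV export keeps one row per segment with start, end, and text columns, we can also reload it for downstream analysis. A short sketch, assuming the default output path used by export_results:

df = pd.read_csv("output/transcript.csv")           # columns: start, end, text
df["duration"] = df["end"] - df["start"]            # per-segment duration in seconds
print(df.sort_values("duration", ascending=False).head())  # longest segments first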

def process_audio_file(audio_path, show_output=True, analyze=True):
   """Complete WhisperX pipeline"""
   if show_output:
       print("="*70)
       print("🎡 WhisperX Advanced Tutorial")
       print("="*70)
  
   audio, duration = load_and_analyze_audio(audio_path)
  
   result = transcribe_audio(audio, CONFIG["model_size"], CONFIG["language"])
  
   aligned_result = align_transcription(
       result["segments"],
       audio,
       result["language"]
   )
  
   if analyze and show_output:
       analyze_transcription(aligned_result)
       extract_keywords(aligned_result)
  
   if show_output:
       print("n" + "="*70)
       print("πŸ“‹ TRANSCRIPTION RESULTS")
       print("="*70)
       df = display_results(aligned_result, show_words=False)
      
       export_results(aligned_result)
   else:
       df = None
  
   return aligned_result, df


# Example 1: Process sample audio
# audio_path = download_sample_audio()
# result, df = process_audio_file(audio_path)


# Example 2: Show word-level details
# result, df = process_audio_file(audio_path)
# word_df = display_results(result, show_words=True)


# Example 3: Process your own audio
# audio_path = "your_audio.wav"  # or .mp3, .m4a, etc.
# result, df = process_audio_file(audio_path)


# Example 4: Batch process multiple files
# audio_files = ["audio1.mp3", "audio2.wav", "audio3.m4a"]
# results = batch_process_files(audio_files)


# Example 5: Use a larger model for better accuracy
# CONFIG["model_size"] = "large-v2"
# result, df = process_audio_file("audio.mp3")


print("n✨ Setup complete! Uncomment examples above to run.")

We run the full WhisperX pipeline end-to-end: we load the audio, transcribe it, and align it to word-level timestamps. When enabled, we analyze statistics, extract keywords, render a clean results table, and export everything to multiple formats for practical use.

In short, we built a complete WhisperX pipeline that not only transcribes audio but also aligns it with precise word-level timestamps. We export results in multiple formats, batch-process files, and analyze patterns to make the output more meaningful. With this, we now have a flexible, ready-to-use workflow for transcription and audio analysis in Colab that we can extend further for real-world use.


Check out the full code here.



