
How do you build an advanced end-to-end voice AI agent using Hugging Face pipelines?

In this tutorial, we build an advanced voice AI agent using free Hugging Face models, and we keep the entire pipeline simple enough to run smoothly on Google Colab. We use Whisper for speech recognition, Flan-T5 for natural-language reasoning, and Bark for speech synthesis, all connected through transformers pipelines. By doing so, we avoid heavy dependencies, API keys, and complex setup, and we focus on showing how voice input can be turned into a meaningful conversation and returned as a natural-sounding voice response in near real time. Check out the full code here.

!pip -q install "transformers>=4.42.0" accelerate torchaudio sentencepiece gradio soundfile


import os, torch, tempfile, numpy as np
import gradio as gr
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM


DEVICE = 0 if torch.cuda.is_available() else -1


asr = pipeline(
   "automatic-speech-recognition",
   model="openai/whisper-small.en",
   device=DEVICE,
   chunk_length_s=30,
   return_timestamps=False
)


LLM_MODEL = "google/flan-t5-base"
tok = AutoTokenizer.from_pretrained(LLM_MODEL)
llm = AutoModelForSeq2SeqLM.from_pretrained(LLM_MODEL, device_map="auto")


tts = pipeline("text-to-speech", model="suno/bark-small")

We install the necessary libraries and load three Hugging Face pipelines: Whisper for speech-to-text, Flan-T5 for response generation, and Bark for text-to-speech. We set the device automatically so a GPU is used if one is available. Check out the full code here.
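Before building anything on top, a quick sanity check (a minimal sketch, assuming the three pipelines above loaded without errors) confirms that Bark returns audio in the format the rest of the code expects:

# Smoke test for the TTS pipeline: the text-to-speech pipeline returns a dict
# with a float waveform under "audio" and an integer "sampling_rate".
probe = tts("Hello! The voice agent is ready.")
print(probe["sampling_rate"], np.asarray(probe["audio"]).shape)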

SYSTEM_PROMPT = (
   "You are a helpful, concise voice assistant. "
   "Prefer direct, structured answers. "
   "If the user asks for steps or code, use short bullet points."
)


def format_dialog(history, user_text):
   turns = []
   for u, a in history:
       if u: turns.append(f"User: {u}")
       if a: turns.append(f"Assistant: {a}")
   turns.append(f"User: {user_text}")
   prompt = (
       "Instruction:n"
       f"{SYSTEM_PROMPT}nn"
       "Dialog so far:n" + "n".join(turns) + "nn"
       "Assistant:"
   )
   return prompt

We define a system prompt that guides the agent to stay concise and structured, and we implement a formatting function that combines the past conversation history with the new user input to build the prompt string the model uses to generate the assistant's reply. Check out the full code here.
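For instance, a short exchange produces a prompt like the one below; the history contents here are purely illustrative:

# Hypothetical one-turn history; format_dialog appends the new user turn and
# the trailing "Assistant:" cue that Flan-T5 completes.
history = [("What is Flan-T5?", "An instruction-tuned T5 model from Google.")]
print(format_dialog(history, "Can it run on a free Colab GPU?"))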

def transcribe(filepath):
   out = asr(filepath)
   text = out["text"].strip()
   return text


def generate_reply(history, user_text, max_new_tokens=256):
   prompt = format_dialog(history, user_text)
   inputs = tok(prompt, return_tensors="pt", truncation=True).to(llm.device)
   with torch.no_grad():
       ids = llm.generate(
           **inputs,
           max_new_tokens=max_new_tokens,
           temperature=0.7,
           do_sample=True,
           top_p=0.9,
           repetition_penalty=1.05,
       )
   reply = tok.decode(ids[0], skip_special_tokens=True).strip()
   return reply


def synthesize_speech(text):
   out = tts(text)
   audio = out["audio"]
   sr = out["sampling_rate"]
   audio = np.asarray(audio, dtype=np.float32)
   return (sr, audio)

We create the three core functions of our voice agent: transcribe converts speech to text with Whisper, generate_reply builds a context-aware response with Flan-T5, and synthesize_speech turns the reply into spoken audio with Bark. Check out the full code here.
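A text-only round trip is an easy way to verify the two downstream stages before recording anything; the question here is just an example, and ASR is skipped because transcribe expects a path to an audio file:

# Text-only round trip: Flan-T5 reply -> Bark audio.
reply = generate_reply([], "Give me two tips for recording clean voice audio.")
sr, wav = synthesize_speech(reply)
print(reply)
print(f"{wav.size / sr:.1f}s of audio at {sr} Hz")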

def clear_history():
   return [], []


def voice_to_voice(mic_file, history):
   history = history or []
   if not mic_file:
       return history, None, "Please record something!"
   try:
       user_text = transcribe(mic_file)
   except Exception as e:
       return history, None, f"ASR error: {e}"


   if not user_text:
       return history, None, "Didn't catch that. Try again?"


   try:
       reply = generate_reply(history, user_text)
   except Exception as e:
       return history, None, f"LLM error: {e}"


   try:
       sr, wav = synthesize_speech(reply)
   except Exception as e:
       return history + [(user_text, reply)], None, f"TTS error: {e}"


   return history + [(user_text, reply)], (sr, wav), f"User: {user_text}\nAssistant: {reply}"


def text_to_voice(user_text, history):
   history = history or []
   user_text = (user_text or "").strip()
   if not user_text:
       return history, None, "Type a message first."
   try:
       reply = generate_reply(history, user_text)
       sr, wav = synthesize_speech(reply)
   except Exception as e:
       return history, None, f"Error: {e}"
   return history + [(user_text, reply)], (sr, wav), f"User: {user_text}\nAssistant: {reply}"


def export_chat(history):
   lines = []
   for u, a in history or []:
       lines += [f"User: {u}", f"Assistant: {a}", ""]
   text = "n".join(lines).strip() or "No conversation yet."
   with tempfile.NamedTemporaryFile(delete=False, suffix=".txt", mode="w") as f:
       f.write(text)
       path = f.name
   return path

We add the interactive features of the agent: clear_history resets the conversation, voice_to_voice processes recorded speech and returns a spoken reply, text_to_voice handles typed input the same way, and export_chat saves the entire dialog to a downloadable text file. Check out the full code here.
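We can exercise these callbacks from a plain Python cell before launching the UI; the typed question is just an example:

# Drive the callbacks without the Gradio UI: one typed turn, then export.
state = []
state, audio, transcript = text_to_voice("What does ASR stand for?", state)
print(transcript)
print("Saved transcript to:", export_chat(state))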

with gr.Blocks(title="Advanced Voice AI Agent (HF Pipelines)") as demo:
   gr.Markdown(
       "## 🎙️ Advanced Voice AI Agent (Hugging Face Pipelines Only)n"
       "- **ASR**: openai/whisper-small.enn"
       "- **LLM**: google/flan-t5-basen"
       "- **TTS**: suno/bark-smalln"
       "Speak or type; the agent replies with voice + text."
   )


   with gr.Row():
       with gr.Column(scale=1):
           mic = gr.Audio(sources=["microphone"], type="filepath", label="Record")
           say_btn = gr.Button("🎤 Speak")
           text_in = gr.Textbox(label="Or type instead", placeholder="Ask me anything…")
           text_btn = gr.Button("💬 Send")
           export_btn = gr.Button("⬇️ Export Chat (.txt)")
           reset_btn = gr.Button("♻️ Reset")
       with gr.Column(scale=1):
           audio_out = gr.Audio(label="Assistant Voice", autoplay=True)
           transcript = gr.Textbox(label="Transcript", lines=6)
           chat = gr.Chatbot(height=360)
   state = gr.State([])


   def update_chat(history):
       return [(u, a) for u, a in (history or [])]


   say_btn.click(voice_to_voice, [mic, state], [state, audio_out, transcript]).then(
       update_chat, inputs=state, outputs=chat
   )
   text_btn.click(text_to_voice, [text_in, state], [state, audio_out, transcript]).then(
       update_chat, inputs=state, outputs=chat
   )
   reset_btn.click(clear_history, None, [chat, state])
   export_btn.click(export_chat, state, gr.File(label="Download chat.txt"))


demo.launch(debug=False)

We build a clean Gradio UI that lets us speak or type and then hear the agent respond. We wire the buttons to the callbacks, maintain the chat state, and stream results to the chatbot, transcript box, and audio player, launching everything as an app inside Colab.
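If the inline preview does not render in Colab, one option (standard Gradio behavior, not specific to this tutorial) is to relaunch with a temporary public link:

# Alternative launch: share=True serves the app through a temporary public
# *.gradio.live URL, which helps when Colab blocks the inline iframe.
demo.launch(share=True, debug=False)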

In short, we see how seamlessly Hugging Face pipelines let us create a voice-driven conversational agent that listens, thinks, and responds. We now have a working demo that captures audio, transcribes it, generates an intelligent response, and returns voice output, all inside Colab. With this foundation, we can try larger models, add multilingual support, or extend the system with custom logic. Still, the core idea remains the same: we can compose ASR, an LLM, and TTS into one smooth workflow for an interactive voice AI experience.


Check out the full code here. Feel free to explore our tutorials, code, and notebooks on our GitHub page. Also, follow us on Twitter, and don't forget to join our 100K+ ML SubReddit and subscribe to our newsletter.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
