How to build an agentic voice AI assistant that understands, reasons, plans, and responds with autonomous multi-step intelligence
In this tutorial, we explore how to build an agentic voice AI assistant that can understand, reason, and respond with natural speech in real time. We build a standalone speech-intelligence pipeline that integrates speech recognition, intent detection, multi-step reasoning, and text-to-speech synthesis. Along the way, we design an agent that can listen to commands, identify goals, plan appropriate actions, and reply with spoken responses using models such as Whisper and SpeechT5. We study the entire system from a practical perspective, showing how perception, reasoning, and execution interact seamlessly to create autonomous conversational experiences.
import subprocess
import sys
import json
import re
from datetime import datetime
from typing import Dict, List, Tuple, Any
def install_packages():
    packages = ['transformers', 'torch', 'torchaudio', 'datasets', 'soundfile',
                'librosa', 'IPython', 'numpy']
    for pkg in packages:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', pkg])

print("🤖 Initializing Agentic Voice AI...")
install_packages()
import torch
import soundfile as sf
import numpy as np
from transformers import (AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline,
                          SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan)
from IPython.display import Audio, display, HTML
import warnings
warnings.filterwarnings('ignore')
We first install all the necessary libraries, including Transformers, Torch, and SoundFile, to support speech recognition and synthesis. We also configure the environment to suppress warnings and ensure smooth execution of the entire voice AI setup.
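Before loading the heavier models, a quick, optional sanity check (a minimal sketch, not part of the original listing) confirms that the key libraries imported correctly and reports whether a GPU is available:

# Optional sanity check: verify imports and GPU availability before downloading models
print(f"PyTorch {torch.__version__} | CUDA available: {torch.cuda.is_available()}")
print(f"NumPy {np.__version__} | SoundFile {sf.__version__}")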
class VoiceAgent:
    def __init__(self):
        self.memory = []
        self.context = {}
        self.tools = {}
        self.goals = []

    def perceive(self, audio_input: str) -> Dict[str, Any]:
        intent = self._extract_intent(audio_input)
        entities = self._extract_entities(audio_input)
        sentiment = self._analyze_sentiment(audio_input)
        perception = {
            'text': audio_input,
            'intent': intent,
            'entities': entities,
            'sentiment': sentiment,
            'timestamp': datetime.now().isoformat()
        }
        self.memory.append(perception)
        return perception

    def _extract_intent(self, text: str) -> str:
        text_lower = text.lower()
        intent_patterns = {
            'create': ['create', 'make', 'generate', 'write'],
            'search': ['search', 'find', 'look for', 'show me'],
            'analyze': ['analyze', 'explain', 'understand', 'what is'],
            'calculate': ['calculate', 'compute', 'how much', 'sum'],
            'schedule': ['schedule', 'plan', 'set reminder', 'meeting'],
            'translate': ['translate', 'say in', 'convert to'],
            'summarize': ['summarize', 'brief', 'tldr', 'overview']
        }
        for intent, keywords in intent_patterns.items():
            if any(kw in text_lower for kw in keywords):
                return intent
        return 'conversation'
    def _extract_entities(self, text: str) -> Dict[str, List[str]]:
        # Simple regex-based extraction of numbers, dates, times, and email addresses
        entities = {
            'numbers': re.findall(r'\d+', text),
            'dates': re.findall(r'\b\d{1,2}/\d{1,2}/\d{2,4}\b', text),
            'times': re.findall(r'\b\d{1,2}:\d{2}\s*(?:am|pm)?\b', text.lower()),
            'emails': re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b', text)
        }
        return {k: v for k, v in entities.items() if v}
    def _analyze_sentiment(self, text: str) -> str:
        positive = ['good', 'great', 'excellent', 'happy', 'love', 'thank']
        negative = ['bad', 'terrible', 'sad', 'hate', 'angry', 'problem']
        text_lower = text.lower()
        pos_count = sum(1 for word in positive if word in text_lower)
        neg_count = sum(1 for word in negative if word in text_lower)
        if pos_count > neg_count:
            return 'positive'
        elif neg_count > pos_count:
            return 'negative'
        return 'neutral'
Here we implement the agent's perception layer. We design methods that extract intent, entities, and sentiment from the transcribed speech, enabling the agent to understand user input in context.
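To see the perception layer on its own, a short illustrative sketch (assuming the VoiceAgent methods above have been defined) runs perceive on a sample transcript and inspects the structured result:

agent = VoiceAgent()
sample = agent.perceive("Calculate the sum of 25 and 37")
print(sample['intent'])     # 'calculate'
print(sample['entities'])   # {'numbers': ['25', '37']}
print(sample['sentiment'])  # 'neutral'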
    def reason(self, perception: Dict) -> Dict[str, Any]:
        intent = perception['intent']
        reasoning = {
            'goal': self._identify_goal(intent),
            'prerequisites': self._check_prerequisites(intent),
            'plan': self._create_plan(intent, perception['entities']),
            'confidence': self._calculate_confidence(perception)
        }
        return reasoning

    def act(self, reasoning: Dict) -> str:
        plan = reasoning['plan']
        results = []
        for step in plan['steps']:
            result = self._execute_step(step)
            results.append(result)
        response = self._generate_response(results, reasoning)
        return response

    def _identify_goal(self, intent: str) -> str:
        goal_mapping = {
            'create': 'Generate new content',
            'search': 'Retrieve information',
            'analyze': 'Understand and explain',
            'calculate': 'Perform computation',
            'schedule': 'Organize time-based tasks',
            'translate': 'Convert between languages',
            'summarize': 'Condense information'
        }
        return goal_mapping.get(intent, 'Assist user')

    def _check_prerequisites(self, intent: str) -> List[str]:
        prereqs = {
            'search': ['internet access', 'search tool'],
            'calculate': ['math processor'],
            'translate': ['translation model'],
            'schedule': ['calendar access']
        }
        return prereqs.get(intent, ['language understanding'])

    def _create_plan(self, intent: str, entities: Dict) -> Dict:
        plans = {
            'create': {'steps': ['understand_requirements', 'generate_content', 'validate_output'], 'estimated_time': '10s'},
            'analyze': {'steps': ['parse_input', 'analyze_components', 'synthesize_explanation'], 'estimated_time': '5s'},
            'calculate': {'steps': ['extract_numbers', 'determine_operation', 'compute_result'], 'estimated_time': '2s'}
        }
        default_plan = {'steps': ['understand_query', 'process_information', 'formulate_response'], 'estimated_time': '3s'}
        return plans.get(intent, default_plan)
We now focus on reasoning and planning. We teach the agent how to identify goals, check prerequisites, and generate structured multi-step plans so it can execute user commands logically.
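As an illustrative follow-up (it relies on the confidence helper defined in the next block, so the full class must be in place), passing the perception from the earlier sketch through reason exposes the goal, prerequisites, plan, and confidence the agent will act on:

reasoning = agent.reason(sample)
print(reasoning['goal'])            # 'Perform computation'
print(reasoning['prerequisites'])   # ['math processor']
print(reasoning['plan']['steps'])   # ['extract_numbers', 'determine_operation', 'compute_result']
print(round(reasoning['confidence'], 2))  # 0.9 = 0.7 base + 0.15 entities + 0.05 length bonus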
    def _calculate_confidence(self, perception: Dict) -> float:
        base_confidence = 0.7
        if perception['entities']:
            base_confidence += 0.15
        if perception['sentiment'] != 'neutral':
            base_confidence += 0.1
        if len(perception['text'].split()) > 5:
            base_confidence += 0.05
        return min(base_confidence, 1.0)

    def _execute_step(self, step: str) -> Dict:
        return {'step': step, 'status': 'completed', 'output': f'Executed {step}'}

    def _generate_response(self, results: List, reasoning: Dict) -> str:
        intent = reasoning['goal']
        confidence = reasoning['confidence']
        prefix = "I understand you want to" if confidence > 0.8 else "I think you're asking me to"
        response = f"{prefix} {intent.lower()}. "
        if len(self.memory) > 1:
            response += "Based on our conversation, "
        response += f"I've analyzed your request and completed {len(results)} steps. "
        return response
In this section, we implement helper functions that calculate confidence, execute each planned step, and generate a meaningful natural-language response for the user.
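With these helpers in place, the complete text-only loop can be exercised without any audio, which is a convenient way to test the reasoning path before the speech models are wired in (a minimal sketch using only the classes above; the example prompt is arbitrary):

agent = VoiceAgent()
p = agent.perceive("Analyze the benefits of renewable energy")
r = agent.reason(p)
print(agent.act(r))
# -> "I think you're asking me to understand and explain. I've analyzed your request and completed 3 steps."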
class VoiceIO:
    def __init__(self):
        print("Loading voice models...")
        device = "cuda:0" if torch.cuda.is_available() else "cpu"
        self.stt_pipe = pipeline("automatic-speech-recognition", model="openai/whisper-base", device=device)
        self.tts_processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
        self.tts_model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
        self.vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
        self.speaker_embeddings = torch.randn(1, 512) * 0.1
        print("✅ Voice I/O ready")

    def listen(self, audio_path: str) -> str:
        result = self.stt_pipe(audio_path)
        return result['text']

    def speak(self, text: str, output_path: str = "response.wav") -> Tuple[str, np.ndarray]:
        inputs = self.tts_processor(text=text, return_tensors="pt")
        speech = self.tts_model.generate_speech(inputs["input_ids"], self.speaker_embeddings, vocoder=self.vocoder)
        sf.write(output_path, speech.numpy(), samplerate=16000)
        return output_path, speech.numpy()

class AgenticVoiceAssistant:
    def __init__(self):
        self.agent = VoiceAgent()
        self.voice_io = VoiceIO()
        self.interaction_count = 0

    def process_voice_input(self, audio_path: str) -> Dict:
        text_input = self.voice_io.listen(audio_path)
        perception = self.agent.perceive(text_input)
        reasoning = self.agent.reason(perception)
        response_text = self.agent.act(reasoning)
        audio_path, audio_array = self.voice_io.speak(response_text)
        self.interaction_count += 1
        return {
            'input_text': text_input,
            'perception': perception,
            'reasoning': reasoning,
            'response_text': response_text,
            'audio_path': audio_path,
            'audio_array': audio_array
        }
We set up the core speech input and output pipeline, using Whisper for transcription and SpeechT5 for speech synthesis, and integrate it with the agent's reasoning engine to form a complete interactive assistant.
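To run the assistant on a recording of your own instead of a synthesized prompt, a hypothetical snippet like the following works end to end; my_command.wav is an assumed file name for a short speech clip (the Whisper pipeline handles decoding and resampling, provided ffmpeg is available):

assistant = AgenticVoiceAssistant()
result = assistant.process_voice_input("my_command.wav")   # hypothetical recording
print("Heard:  ", result['input_text'])
print("Replied:", result['response_text'])
display(Audio(result['audio_array'], rate=16000))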
    def display_reasoning(self, result: Dict):
        # Render the reasoning trace as preformatted HTML so the line breaks are preserved
        html = f"""<pre>
🤖 Agent Reasoning Process
📥 INPUT: {result['input_text']}
🧠 PERCEPTION:
   - Intent: {result['perception']['intent']}
   - Entities: {result['perception']['entities']}
   - Sentiment: {result['perception']['sentiment']}
📋 REASONING:
   - Goal: {result['reasoning']['goal']}
   - Plan: {len(result['reasoning']['plan']['steps'])} steps
   - Confidence: {result['reasoning']['confidence']:.2%}
💬 RESPONSE: {result['response_text']}
</pre>"""
        display(HTML(html))
def run_agentic_demo():
    print("\n" + "="*70)
    print("🤖 AGENTIC VOICE AI ASSISTANT")
    print("="*70 + "\n")
    assistant = AgenticVoiceAssistant()
    scenarios = [
        "Create a summary of machine learning concepts",
        "Calculate the sum of twenty five and thirty seven",
        "Analyze the benefits of renewable energy"
    ]
    for i, scenario_text in enumerate(scenarios, 1):
        print(f"\n--- Scenario {i} ---")
        print(f"Simulated Input: '{scenario_text}'")
        # Synthesize the scenario text to a wav file so it can serve as simulated voice input
        audio_path, _ = assistant.voice_io.speak(scenario_text, f"input_{i}.wav")
        result = assistant.process_voice_input(audio_path)
        assistant.display_reasoning(result)
        print("\n🔊 Playing agent's voice response...")
        display(Audio(result['audio_array'], rate=16000))
        print("\n" + "-"*70)
    print(f"\n✅ Completed {assistant.interaction_count} agentic interactions")
    print("\n🎯 Key Agentic Capabilities Demonstrated:")
    print("   • Autonomous perception and understanding")
    print("   • Intent recognition and entity extraction")
    print("   • Multi-step reasoning and planning")
    print("   • Goal-driven action execution")
    print("   • Natural language response generation")
    print("   • Memory and context management")

if __name__ == "__main__":
    run_agentic_demo()
Finally, we run a demo that visualizes the agent's complete reasoning process and lets us hear its spoken responses. We test multiple scenarios to demonstrate how perception, reasoning, and speech output coordinate seamlessly.
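To try a prompt beyond the three built-in scenarios, the same synthesize-then-process pattern from run_agentic_demo can be reused (a small sketch; the prompt text and output file name are arbitrary):

assistant = AgenticVoiceAssistant()
audio_path, _ = assistant.voice_io.speak("Summarize the key ideas of this tutorial", "custom_input.wav")
result = assistant.process_voice_input(audio_path)
assistant.display_reasoning(result)
display(Audio(result['audio_array'], rate=16000))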
In summary, we built an intelligent voice assistant that understands what we say and can reason, plan, and speak like a real agent. We saw how perception, reasoning, and action work in harmony to create natural and adaptive voice interfaces. With this implementation, we aim to bridge the gap between passive voice commands and autonomous decision-making, demonstrating how agentic intelligence can enrich voice interactions between humans and AI.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for the benefit of society. His most recent endeavor is the launch of Marktechpost, an AI media platform that stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easy to understand for a broad audience. The platform draws more than 2 million monthly views, a testament to its popularity with readers.