Have you heard of Artificial General Intelligence (AGI)? Meet its auditory counterpart: Audio General Intelligence. With Audio Flamingo 3 (AF3), NVIDIA has introduced a significant leap in how machines understand and reason about sound. While past models could transcribe speech or classify audio clips, they lacked the ability to interpret audio in a context-rich, human-like way, spanning speech, ambient sounds, and music, and over extended durations. AF3 changes that.
With Audio Flamingo 3, NVIDIA introduces a fully open-source large audio-language model (LALM) that not only hears but also understands and reasons. Built on a five-stage training curriculum and powered by the AF-Whisper encoder, AF3 supports long audio inputs (up to 10 minutes), multi-turn multi-audio chat, on-demand thinking, and even voice-to-voice interaction. It sets a new standard for how AI systems engage with sound, bringing us a step closer to AGI.


Audio Flamingo 3’s core innovations
- AF-Whisper: a unified audio encoder. AF3 uses AF-Whisper, a novel encoder adapted from Whisper-v3. It processes speech, ambient sounds, and music with a single architecture, addressing a major limitation of earlier LALMs, which relied on separate encoders and suffered from inconsistent representations. AF-Whisper draws on audio-caption datasets and synthetic metadata, producing a dense 1280-dimensional embedding space aligned with text representations (a minimal encoder sketch follows this list).
- Chain-of-thought reasoning: thinking on demand. Unlike static question-answering systems, AF3 is equipped with a “thinking” capability. Trained on the AF-Think dataset (250K examples), the model can perform chain-of-thought reasoning when prompted, laying out its inference steps before committing to an answer, a critical step toward transparent audio AI (a toy prompt sketch also follows this list).
- Multi-turn, multi-audio conversations. With the AF-Chat dataset (75K dialogues), AF3 can hold contextual conversations involving multiple audio inputs across turns. This mirrors real-world interaction, where people refer back to earlier audio cues. A streaming text-to-speech module additionally enables voice-to-voice dialogue.
- Long audio reasoning. AF3 is the first fully open model capable of reasoning over audio inputs up to 10 minutes long. Trained on LongAudio-XL (1.25 million examples), it supports tasks such as meeting summarization, podcast understanding, sarcasm detection, and temporal grounding (see the windowing sketch below).
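To make the unified-encoder idea concrete, here is a minimal sketch that pulls dense 1280-dimensional frame embeddings from the stock Whisper-large-v3 encoder via Hugging Face transformers. Treat it as a stand-in for AF-Whisper: only the Whisper-v3 lineage and the 1280-dimensional embedding space come from the article; the model ID and preprocessing here are ordinary Whisper defaults.

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

# Stand-in for AF-Whisper: the stock Whisper-large-v3 encoder, which also
# emits 1280-dimensional hidden states.
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
model = WhisperModel.from_pretrained("openai/whisper-large-v3")

# Any 16 kHz mono waveform works: speech, an ambient recording, or music.
waveform = torch.randn(16_000 * 30).numpy()  # 30 s of placeholder audio

inputs = feature_extractor(waveform, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    frames = model.encoder(inputs.input_features).last_hidden_state

print(frames.shape)  # torch.Size([1, 1500, 1280]): dense audio frames
```

The same call works regardless of content type, which is exactly the point of a unified encoder: one embedding space for speech, sounds, and music.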
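Next, a toy illustration of thinking on demand: the same question asked with and without a reasoning instruction. The real AF3 prompt template ships with NVIDIA's inference code; the `<sound>` tag and wording below are placeholders.

```python
# Hypothetical prompts: AF3's actual chat template is defined in NVIDIA's
# released inference code, so treat the tag and phrasing as illustrative.
question = "Which instrument carries the melody?"

direct_prompt = f"<sound>\n{question}"

thinking_prompt = (
    f"<sound>\n{question}\n"
    "Think step by step about the timbre, pitch range, and rhythm you hear, "
    "then state your final answer."
)

print(thinking_prompt)
```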
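Finally, a generic sliding-window sketch for long inputs. The paper describes AF3's exact long-audio handling; this only shows the common pattern of splitting a 10-minute waveform into fixed 30-second windows before encoding each one.

```python
import numpy as np

SR = 16_000      # sample rate in Hz
WINDOW_S = 30    # Whisper-style 30 s windows

def window_audio(waveform: np.ndarray, sr: int = SR, window_s: int = WINDOW_S):
    """Split a long mono waveform into fixed-length windows, padding the last."""
    hop = sr * window_s
    windows = []
    for start in range(0, len(waveform), hop):
        chunk = waveform[start:start + hop]
        if len(chunk) < hop:  # zero-pad the final partial window
            chunk = np.pad(chunk, (0, hop - len(chunk)))
        windows.append(chunk)
    return windows

ten_minutes = np.zeros(SR * 600, dtype=np.float32)  # placeholder 10 min clip
print(len(window_audio(ten_minutes)))  # 20 windows of 30 s each
```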


State-of-the-art benchmarks and real-world capabilities
AF3 outperforms both open and closed models on more than 20 benchmarks, including:
- MMAU (avg): 73.14% (+2.14% over Qwen2.5-Omni)
- LongAudioBench: 68.6 (GPT-4o evaluation), beating Gemini 2.5 Pro
- LibriSpeech (ASR): 1.57% WER, outperforming Phi-4-mm
- ClothoAQA: 91.1% (vs. 89.2% for Qwen2.5-Omni)
These improvements are more than marginal; they redefine what audio-language systems can be expected to do. AF3 also introduces benchmarking for voice chat and voice generation, achieving a generation latency of 5.94 s (vs. 14.62 s for Qwen2.5-Omni) along with better similarity scores.
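For readers unfamiliar with the LibriSpeech metric: 1.57% is a word error rate (WER), the fraction of words a transcription gets wrong. A toy computation with the jiwer library shows how it is counted; the sentences are made up for illustration.

```python
import jiwer  # pip install jiwer

reference  = "the model reasons over speech sounds and music"
hypothesis = "the model reason over speech sounds and music"

# One substitution out of eight reference words: WER = 1/8 = 0.125
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
```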
Data pipeline: Datasets that teach audio reasoning
NVIDIA didn't just scale compute; it rethought the data:
- AudioSkills-XL: 8M examples combining ambient-sound, music, and speech reasoning.
- LongAudio-XL: long-form speech covering audiobooks, podcasts, and meetings.
- AF-Think: promotes short chain-of-thought-style inference.
- AF-Chat: designed for multi-turn, multi-audio dialogue.
Each dataset, along with the training code and recipes, is fully open source, enabling reproducibility and future research.
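Since everything is open, loading a dataset should take only a few lines. The snippet below is a hypothetical sketch using the Hugging Face datasets library; the hub ID is a placeholder, so consult NVIDIA's release page for the actual paths.

```python
from datasets import load_dataset

# Hypothetical hub ID; the real path comes from NVIDIA's release notes.
af_chat = load_dataset("nvidia/AF-Chat", split="train")

print(af_chat[0].keys())  # e.g. audio references plus multi-turn dialogue text
```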
Open source
AF3 is not just a model drop. NVIDIA has released:
- Model weights
- Training recipes
- Inference code
- Four open datasets
This transparency makes AF3 the most accessible state-of-the-art audio-language model available. It opens new research directions in auditory reasoning, low-latency audio agents, music understanding, and multimodal interaction.
Conclusion: Toward general audio intelligence
Audio Flamingo 3 shows that deep audio understanding is not just possible but reproducible and open. By combining scale, novel training strategies, and diverse data, NVIDIA delivers a model that listens, understands, and reasons in ways previous LALMs could not.
Check out the Paper, Code, and Models on Hugging Face. All credit for this research goes to the researchers of this project.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.