Stepfun AI releases Step-Audio 2 Mini: an open source 8B voice-to-speech AI model that exceeds GPT-4O-Audio

by admin · September 1, 2025

Stepfun AI Team has been released Step-Audio 2 Minian 8B parameter speech-to-speech large audio language model (LALM), which provides expression, grounding and real-time audio interaction. exist Apache 2.0 LicenseThis open source model achieves state-of-the-art performance in speech recognition, audio comprehension and voice conversation benchmarks, which can be achieved through commercial systems such as GPT-4O-Audio.

Key Features

1. Unified audio text symbolization

Unlike the cascading ASR+LLM+TTS pipeline, Step-Audio 2 integration Multi-mode discrete token modelingWhere Text and audio tokens share a single modeling stream.

This can:

Seamless reasoning across text and audio.
immediate Voice style switching During reasoning.
Consistency of semantics, rhythms, and emotional output.

2. Expressive and emotional

This model not only transcribes the speech, but also explains Paralinguistic characteristics Like tone, rhythm, emotion, tone and style. This allows for conversations with the emotional tones of reality, such as whispering, sadness or excitement. Benchmark on stepval-audio-parlalanical Show Step-Adio 2 implementation 83.1% accuracyfar beyond GPT-4O audio (43.5%) and Qwen-Omni (44.2%).

3. Search speech generation

Step-Audio 2 merge Multi-mode rag (retrieval machine):

Web Search Integration Used for factual basis.
Audio search– A new ability to retrieve real sounds from large libraries and blend them into responses, enabling them to Voice/style imitation When reasoning.

4. Tool calls and multimodal reasoning

This system goes beyond voice synthesis through support Tool calls. Benchmarks show that Step-Audio 2 matches text LLM Tool selection and parameter accuracyalthough Audio search tool call– Features not available in text LLM only.

Training and data scales

Text + Audio Corpus: 1.356T token
Audio Hours: 8m+reality and synthesis time
Diversity of speakers: ~ 50k sounds in language and dialect
Preprocessing pipeline: The multi-stage course covers ASR, TT, voice-to-voice translation and a dialogue synthesis of emotion-marking.

This massive training allows the Step-Audio 2 Mini to retain powerful text reasoning (through its Qwen2-Audio and Cosyvoice Foundation) while mastering fine-grained audio modeling.

Performance Benchmark

Automatic speech recognition (ASR)

English: Average 3.14% (average transcription that beat GPT-4O was 4.5%).
Chinese: Average CER was 3.08% (significantly lower than GPT-4O and QWEN-OMNI).
Strong across dialects and accents.

Audio Understanding (MMAU Benchmark)

Step-Audio 2: 78.0 average, exceeding Omni-R1 (77.0) and audio Flamingo 3 (73.1).
Strongest Voice and voice reasoning tasks.

Voice translation

COVOST 2 (S2TT): BLEU 39.26 (highest in open and closed models).
CVSS (S2ST): BLEU 30.87, before GPT-4O (23.68).

Dialogue Bench (URO BENCH)

China Dialogue: The best overall 83.3 (Basic) and 68.2 (Pro).
English dialogue: Competition with the GPT-4O (83.9 vs. 84.5) is far ahead of other open models.

in conclusion

Step-Audio 2 Mini Give developers and research communities access to advanced multimodal voice intelligence. By combining qwen2-audioThe reasoning ability Cosyvoice’s tokenized pipelineand enhancement Search-based groundingStepfun delivers the most capable Turn on the audio LLM.

Check Paper and Model embracing face. Check out ours anytime Tutorials, codes and notebooks for github pages. Also, please stay tuned for us twitter And don’t forget to join us 100K+ ml reddit And subscribe Our newsletter.

Asif Razzaq is CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, ASIF is committed to harnessing the potential of artificial intelligence to achieve social benefits. His recent effort is to launch Marktechpost, an artificial intelligence media platform that has an in-depth coverage of machine learning and deep learning news that can sound both technically, both through technical voices and be understood by a wide audience. The platform has over 2 million views per month, demonstrating its popularity among its audience.

Stepfun AI releases Step-Audio 2 Mini: an open source 8B voice-to-speech AI model that exceeds GPT-4O-Audio

Key Features

1. Unified audio text symbolization

2. Expressive and emotional

3. Search speech generation

4. Tool calls and multimodal reasoning

Training and data scales

Performance Benchmark

Automatic speech recognition (ASR)

Audio Understanding (MMAU Benchmark)

Voice translation

Dialogue Bench (URO BENCH)

in conclusion

You may also like...

live chat

Recent Posts

Stepfun AI releases Step-Audio 2 Mini: an open source 8B voice-to-speech AI model that exceeds GPT-4O-Audio

Key Features

1. Unified audio text symbolization

2. Expressive and emotional

3. Search speech generation

4. Tool calls and multimodal reasoning

Training and data scales

Performance Benchmark

Automatic speech recognition (ASR)

Audio Understanding (MMAU Benchmark)

Voice translation

Dialogue Bench (URO BENCH)

in conclusion

You may also like...

Stanford researchers introduce Biomni: a biomedical AI agent for automation across different tasks and data types

Features, Benefits, Alternatives and Reviews • AI Parabellum

Details revealed over time – Jon Loomer Digital

live chat

Recent Posts