Omni-R1: Answering audio questions with text-driven reinforcement learning and automatic data generation

Recent developments have shown that RL can significantly enhance LLMs' reasoning capabilities. Building on this progress, the study aims to improve audio LLMs, i.e., models that process audio and text to perform tasks such as question answering. The MMAU benchmark is a widely used dataset designed to evaluate these models; it contains multiple-choice questions about sounds, speech, and music, some of which require external knowledge. A previous method, R1-AQA, used GRPO (Group Relative Policy Optimization) to fine-tune the Qwen2-Audio model on the AVQA dataset, achieving state-of-the-art (SOTA) results on MMAU. Inspired by this, the authors applied GRPO to fine-tune Qwen2.5-Omni-7B, a newer multimodal model, further improving performance. They also introduced a method to automatically generate audio question-answering (QA) data, which yields even better results.
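To make GRPO's correctness-only training signal concrete, below is a minimal sketch (not the authors' code) of the group-relative advantage at the heart of the method; the group size and exact reward function are illustrative assumptions.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # Group-relative advantage: normalize each sampled answer's reward
    # against the mean/std of its own group, so no learned value
    # function (critic) is needed.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

# Hypothetical example: for one multiple-choice question, the model
# samples G = 4 candidate answers; each earns reward 1.0 only if it
# matches the answer key ("B" here).
sampled = ["B", "A", "B", "C"]
rewards = torch.tensor([1.0 if a == "B" else 0.0 for a in sampled])
print(grpo_advantages(rewards))  # correct samples get positive advantage
```

Correct samples within a group get positive advantages and incorrect ones negative, which is what steers the policy toward the right answer without a separate value network.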
Compared to methods such as SARI, which uses a more complex combination of supervised fine-tuning with structured reasoning and RL, the authors' approach is simpler, relying solely on RL without explicit reasoning steps. They also ran experiments with text-only input to study GRPO's role in the performance gains. Surprisingly, fine-tuning the model on text-only data performed almost as well as training with audio and text. This finding suggests that GRPO enhances the model's reasoning ability primarily through text, which in turn drives much of the improvement on audio QA tasks.
Researchers from MIT CSAIL, Goethe University, IBM Research, and elsewhere introduced Omni-R1, a fine-tuned version of the multimodal LLM Qwen2.5-Omni trained with the GRPO reinforcement learning method. Trained on the AVQA dataset, Omni-R1 sets new state-of-the-art results on the MMAU benchmark across all audio categories. Surprisingly, much of the improvement stems from enhanced text-based reasoning rather than audio input: fine-tuning on text-only data also yields significant performance gains. In addition, the team used ChatGPT to generate a large-scale audio QA dataset, which further improved accuracy. Their work highlights the outsized role of text reasoning in audio LLM performance, and they promise to publicly release all resources.
The Omni-R1 model fine-tunes Qwen2.5-Omni with the GRPO reinforcement learning method, using a simple prompt format that allows direct answer selection and keeps training memory-efficient enough to fit on a 48GB GPU. GRPO avoids a value function by comparing grouped outputs scored solely on answer correctness. To expand the training data, the researchers took audio captions from Qwen2-Audio and prompted ChatGPT to generate new question-answer pairs. This produced two datasets, AVQA-GPT and VGGS-GPT, covering 40k and 182k audio clips, respectively. Training on these automatically generated datasets improves performance, and VGGS-GPT helps Omni-R1 achieve state-of-the-art accuracy on the MMAU benchmark.
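As an illustration of this caption-then-question pipeline, here is a hedged sketch: `caption_audio` is a hypothetical stand-in for running a Qwen2-Audio-style captioner, and the OpenAI model name and prompt wording are assumptions rather than the paper's actual settings.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def caption_audio(audio_path: str) -> str:
    """Hypothetical stand-in: run an audio captioning model such as
    Qwen2-Audio over the clip and return a text description."""
    raise NotImplementedError

def generate_mcq(audio_path: str) -> str:
    # Step 1: describe the audio in text, since the question generator
    # (ChatGPT) only sees the caption, not the waveform.
    caption = caption_audio(audio_path)
    # Step 2: ask ChatGPT to turn the caption into a multiple-choice
    # question with a marked answer key.
    prompt = (
        "Here is a description of an audio clip:\n"
        f"{caption}\n\n"
        "Write one multiple-choice question about the clip with four "
        "options (A-D) and mark the correct answer."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice, not the paper's setting
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Running a generator like this over the AVQA and VGGSound audio collections is what yields question-answer pairs at the 40k and 182k clip scales described above.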
The researchers fine-tuned Qwen2.5-Omni with GRPO on the AVQA, AVQA-GPT, and VGGS-GPT datasets. The results showed significant gains, with the best average score of 71.3% on the MMAU Test-mini coming from VGGS-GPT training. Qwen2.5-Omni outperformed baselines including SARI and showed strong reasoning even without audio, suggesting robust text-based comprehension. GRPO fine-tuning improved Qwen2-Audio to a greater extent because of its weaker initial text reasoning. Surprisingly, fine-tuning without audio still improved audio performance, and text-only datasets such as ARC-Easy produced comparable results. The improvements derive primarily from enhanced text reasoning, although audio-based fine-tuning retained a slight edge for peak performance.
In summary, Omni-R1 is an audio LLM developed by fine-tuning Qwen2.5-Omni with the GRPO reinforcement learning method. Omni-R1 achieves new state-of-the-art results on the MMAU benchmark for sounds, speech, music, and overall performance. Two new large-scale datasets, AVQA-GPT and VGGS-GPT, were created with automatically generated questions and further improved the model's accuracy. Experiments show that GRPO mainly enhances text-based reasoning, which contributes significantly to the performance gains. Surprisingly, fine-tuning with text only (no audio) improves audio-based performance, highlighting the value of strong base language understanding. These findings offer cost-effective strategies for developing audio-capable language models.
Check out the paper. All credit for this research goes to the researchers of this project.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.