Uni-MoE-2.0-Omni: Omni-modal MoE based on open Qwen2.5-7B for text, image, audio and video understanding

How do you build an open model that reliably understands text, images, audio and video while still running efficiently? A research team from Harbin Institute of Technology, Shenzhen has introduced Uni-MoE-2.0-Omni, a fully open, omni-modal large model that pushes the Lychee Uni-MoE family toward language-centric multimodal reasoning. The system is built on the Qwen2.5-7B dense backbone, scaled up to a Mixture of Experts (MoE) architecture and trained on roughly 75B tokens, with dynamic-capacity routing, progressive supervised and reinforcement learning recipes, and carefully matched multimodal data. It takes text, images, audio and video as input for understanding and can generate text, images and speech.

Architecture: unified modality encoding around a language core

At the heart of Uni-MoE-2.0-Omni is a Qwen2.5-7B-style transformer that acts as a language-centric hub. Around this core, the research team attaches a unified speech encoder that maps diverse audio, including ambient sound, speech, and music, into a common representation space. On the vision side, a pre-trained visual encoder processes images and video frames and feeds the resulting token sequences into the same transformer. For generation, a context-aware MoE-based text-to-speech (TTS) module and a task-aware diffusion transformer handle speech and image synthesis.

All modalities are converted into token sequences that share a unified interface with the language model. The same self-attention layers therefore see text, visual and audio tokens together, which simplifies cross-modal fusion and makes the language model the central controller for both understanding and generation. The architecture supports 10 cross-modal input configurations, such as image plus text, video plus speech, and tri-modal combinations.
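A minimal sketch of this language-centric token interface is shown below. The class name, projection layers, and dimensions (e.g. a 3584-dim LLM hidden size, 1280-dim audio features, 1152-dim vision features) are illustrative assumptions, not the released implementation; the point is simply that every modality is projected into the LLM's embedding space and concatenated into one sequence.

```python
# Sketch only: modality encoders project into the LLM token space and share one sequence.
import torch
import torch.nn as nn

class OmniTokenInterface(nn.Module):
    def __init__(self, d_model=3584, d_audio=1280, d_vision=1152, vocab=152064):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, d_model)   # LLM token embeddings
        self.audio_proj = nn.Linear(d_audio, d_model)    # unified speech encoder -> LLM space
        self.vision_proj = nn.Linear(d_vision, d_model)  # visual encoder -> LLM space

    def forward(self, text_ids, audio_feats=None, vision_feats=None):
        parts = [self.text_embed(text_ids)]              # (B, T_text, d_model)
        if audio_feats is not None:
            parts.append(self.audio_proj(audio_feats))   # (B, T_audio, d_model)
        if vision_feats is not None:
            parts.append(self.vision_proj(vision_feats)) # (B, T_vision, d_model)
        # One shared sequence: the LLM's self-attention sees all modalities together.
        return torch.cat(parts, dim=1)

tokens = OmniTokenInterface()(
    text_ids=torch.randint(0, 152064, (1, 16)),
    audio_feats=torch.randn(1, 50, 1280),
    vision_feats=torch.randn(1, 256, 1152),
)
print(tokens.shape)  # torch.Size([1, 322, 3584])
```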

Omni-modal 3D RoPE and MoE-driven fusion

Cross-modal alignment is handled by an omni-modal 3D RoPE mechanism that encodes temporal and spatial structure directly into the rotary position embeddings. Rather than using only the one-dimensional positions of text, the system assigns each token three coordinates: time, height and width for visual streams, and time for speech and audio. This gives the transformer an explicit notion of when and where each token occurred, which matters for video understanding and audio-visual reasoning tasks.
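The sketch below illustrates the idea of three-axis position ids; the exact coordinate scheme used by Uni-MoE-2.0-Omni is an assumption inferred from the description above, but it shows how video tokens get spatio-temporal coordinates while text and audio tokens advance only along the temporal axis.

```python
# Illustrative construction of (time, height, width) position ids for a mixed sequence.
import torch

def build_3d_positions(n_text, video_frames, video_h, video_w, n_audio):
    pos = []
    t = 0
    # Text tokens: advance only the temporal axis.
    for _ in range(n_text):
        pos.append((t, 0, 0)); t += 1
    # Video tokens: one temporal step per frame, a spatial grid within each frame.
    for f in range(video_frames):
        for h in range(video_h):
            for w in range(video_w):
                pos.append((t + f, h, w))
    t += video_frames
    # Audio tokens: temporal axis only, so they can line up with video frames in time.
    for a in range(n_audio):
        pos.append((t + a, 0, 0))
    return torch.tensor(pos)  # (num_tokens, 3), one rotary embedding per axis

print(build_3d_positions(n_text=4, video_frames=2, video_h=2, video_w=2, n_audio=3).shape)
```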

The Mixture of Experts layer replaces the standard MLP block with an MoE stack containing three expert types. Null experts act as no-op functions, allowing computation to be skipped during inference. Routed experts are modality-specific and store domain knowledge for audio, vision, or text. Shared experts are small and always active, providing a communication path for general information across modalities. A routing network selects which experts to activate for each input token, providing specialization without the full cost of a dense model in which every expert is always active. A minimal sketch of this routing structure is shown below.
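The following is a hedged sketch of a dynamic-capacity MoE block with shared, routed, and null experts; the layer sizes, expert counts, and class names are my assumptions, and the released model's routing details may differ. The key mechanics are that shared experts always run, top-k routed experts run per token, and tokens routed to a null expert skip the extra computation entirely.

```python
# Sketch of shared + routed + null experts with per-token top-k routing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OmniMoE(nn.Module):
    def __init__(self, d_model=3584, d_ff=2048, n_routed=8, n_null=2, n_shared=1, top_k=2):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.routed = nn.ModuleList(ffn() for _ in range(n_routed))  # modality-specific experts
        self.shared = nn.ModuleList(ffn() for _ in range(n_shared))  # small, always-active experts
        self.n_routed = n_routed                                     # ids >= n_routed are null experts
        self.router = nn.Linear(d_model, n_routed + n_null)
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        out = sum(e(x) for e in self.shared)   # shared experts always contribute
        scores = F.softmax(self.router(x), dim=-1)
        topv, topi = scores.topk(self.top_k, dim=-1)
        for k in range(self.top_k):
            idx, w = topi[:, k], topv[:, k:k + 1]
            for e_id in idx.unique():
                if int(e_id) >= self.n_routed:     # null expert chosen: skip computation
                    continue
                mask = idx == e_id
                out[mask] += w[mask] * self.routed[int(e_id)](x[mask])
        return out

y = OmniMoE()(torch.randn(5, 3584))
print(y.shape)  # torch.Size([5, 3584])
```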

Training recipe: from cross-modal pre-training to GSPO-DPO

The training pipeline is organized as a sequence of data-matched recipes. First, a language-centric cross-modal pre-training stage uses paired image-text, audio-text, and video-text corpora. This step teaches the model to project each modality into a shared semantic space aligned with language. The base model is trained on roughly 75B open-source multimodal tokens and is equipped with special speech- and image-generation tokens so that generative behavior can be learned by conditioning on language cues.

Next, a progressive supervised fine-tuning (SFT) stage activates modality-specific experts for audio, vision, and text. At this stage, the research team introduces special control tokens so that the model can perform tasks such as text-conditioned speech synthesis and image generation within the same language interface. After large-scale SFT, a data-balanced annealing stage reweights the dataset mixture across modalities and tasks and trains with a lower learning rate. This avoids overfitting to any single modality and stabilizes the final omni-modal behavior.
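As a rough illustration of the data-balanced annealing idea, the snippet below rebalances a dataset mixture so that no single modality dominates and anneals with a lower learning rate. The dataset sizes, the temperature-smoothing formula, and the learning-rate value are illustrative assumptions, not numbers from the paper.

```python
# Hedged sketch: temperature-smoothed mixture weights across modality datasets.
dataset_sizes = {"image_text": 40e6, "audio_text": 15e6, "video_text": 10e6, "text_only": 60e6}

def balanced_weights(sizes, temperature=0.5):
    # temperature < 1 flattens the mixture so smaller modalities are not drowned out
    smoothed = {k: v ** temperature for k, v in sizes.items()}
    total = sum(smoothed.values())
    return {k: v / total for k, v in smoothed.items()}

print(balanced_weights(dataset_sizes))
annealing_lr = 1e-5  # lower than the main SFT learning rate (illustrative value)
```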

To unlock long-form reasoning, Uni-MoE-2.0-Omni adds an iterative policy-optimization stage based on GSPO and DPO. GSPO uses the model itself or another LLM as a judge to score responses and construct preference signals, while DPO turns these preferences into direct policy-update targets that are more stable than standard reinforcement learning from human feedback. The research team applies this GSPO-DPO cycle over multiple rounds to produce the Uni-MoE-2.0-Thinking variant, which inherits the omni-modal foundation and adds stronger step-by-step reasoning.
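For reference, the standard DPO objective that turns such preference pairs into a direct policy-update target is sketched below; this is the generic DPO loss, and the exact way Uni-MoE-2.0 couples it with GSPO-derived preferences is not shown here.

```python
# Generic DPO loss over sequence log-probabilities (policy vs. frozen reference model).
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    chosen_margin = logp_chosen - ref_logp_chosen        # how much the policy prefers the chosen answer
    rejected_margin = logp_rejected - ref_logp_rejected  # ...and the rejected one, relative to the reference
    # Maximize the gap between chosen and rejected responses.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                torch.tensor([-13.0]), torch.tensor([-14.9]))
print(loss)
```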

Generation: MoE-TTS and task-aware diffusion

For speech generation, Uni-MoE-2.0-Omni uses a context-aware MoE-TTS module that sits on top of the language model. The LLM emits control tokens describing timbre, style, and language, along with the textual content. The MoE-TTS module consumes this sequence and generates discrete audio tokens, which an external codec model then decodes into waveforms, mirroring the unified speech encoder on the input side. This makes speech generation a first-class, controllable generation task rather than a separate pipeline.
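The stub below is a purely hypothetical illustration of that hand-off: control tokens plus text go to the TTS module, discrete audio tokens come out, and a codec decodes them to a waveform. All class names, token formats, and values are invented for illustration and do not correspond to the released interface.

```python
# Hypothetical speech-generation flow: LLM control sequence -> MoE-TTS -> codec -> waveform.
import numpy as np

class StubMoETTS:
    def generate(self, control_sequence: str) -> list:
        return [17, 243, 96, 512]                       # placeholder discrete audio tokens

class StubCodec:
    def decode(self, audio_tokens: list) -> np.ndarray:
        return np.zeros(24000, dtype=np.float32)        # placeholder 1-second waveform

control_sequence = "<speech_gen><timbre:warm><style:narration><lang:en>Hello there.</speech_gen>"
waveform = StubCodec().decode(StubMoETTS().generate(control_sequence))
print(waveform.shape)
```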

On the visual side, a task-aware diffusion transformer is conditioned on task tokens and image tokens. Task tokens encode whether the system should perform text-to-image generation, editing, or low-level enhancement. Image tokens capture semantics from the omni-modal backbone, for example from a text-plus-image dialogue. A lightweight projector maps these tokens into the diffusion transformer's conditioning space, enabling instruction-guided image generation and editing while the main omni-modal model stays frozen during the final visual fine-tuning stage.
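A minimal sketch of such task-aware conditioning follows; the task list, dimensions, and projector design are assumptions chosen to illustrate how a frozen LLM's hidden states plus a task token could be mapped into a diffusion transformer's conditioning space.

```python
# Sketch: task embedding + projected LLM hidden states form the DiT conditioning sequence.
import torch
import torch.nn as nn

TASKS = {"text_to_image": 0, "editing": 1, "low_level_enhancement": 2}

class TaskAwareConditioner(nn.Module):
    def __init__(self, d_llm=3584, d_dit=1024, n_tasks=len(TASKS)):
        super().__init__()
        self.task_embed = nn.Embedding(n_tasks, d_dit)  # task token -> conditioning vector
        self.projector = nn.Linear(d_llm, d_dit)        # lightweight projector (the trained part)

    def forward(self, llm_hidden, task_name):
        task_vec = self.task_embed(torch.tensor([TASKS[task_name]]))  # (1, d_dit)
        cond_tokens = self.projector(llm_hidden)                      # (T, d_dit)
        return torch.cat([task_vec, cond_tokens], dim=0)              # conditioning for the DiT

cond = TaskAwareConditioner()(torch.randn(64, 3584), "editing")
print(cond.shape)  # torch.Size([65, 1024])
```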

Benchmarks and open checkpoints

Uni-MoE-2.0-Omni is evaluated on 85 multimodal benchmarks covering image, text, video, audio, and cross-modal or tri-modal reasoning. The model outperforms Qwen2.5-Omni, which was trained on roughly 1.2T tokens, on more than 50 of 76 shared benchmarks. Gains include an average improvement of about 7% in video understanding across 8 tasks, an average improvement of about 7% in omni-modal understanding across 4 benchmarks including OmniVideoBench and WorldSense, and roughly 4% higher audio-visual reasoning.

For long-form speech processing, Uni-MoE-2.0-Omni reduces word error rate by up to 4.2% (relative) on long LibriSpeech segments and brings about a 1% WER improvement on TinyStories-en text-to-speech. Image generation and editing results are competitive with specialized vision models. The research team reports a small but consistent gain of about 0.5% on GEdit-Bench over Ming-Lite-Omni, while also outperforming Qwen-Image and PixWizard on several low-level image processing metrics.

Key takeaways

  1. Uni-MoE-2.0-Omni is a fully open, omni-modal large model built on the Qwen2.5-7B dense backbone and upgraded to a Mixture of Experts architecture, supporting 10 cross-modal input types and joint understanding of text, images, audio and video.
  2. The model introduces a dynamic-capacity MoE with shared, routed and null experts plus omni-modal 3D RoPE, balancing computation and capability through per-token expert routing while maintaining spatio-temporal alignment between modalities in the self-attention layers.
  3. Uni-MoE-2.0-Omni uses a staged training pipeline of cross-modal pre-training, progressive supervised fine-tuning with modality-specific experts, data-balanced annealing, and GSPO plus DPO based reinforcement learning, producing the Uni-MoE-2.0-Thinking variant with stronger long-form reasoning.
  4. The system supports omni-modal understanding and the generation of text, images and speech through a unified language-centric interface, with dedicated Uni-MoE-TTS and Uni-MoE-2.0-Image heads derived from the same foundation for controllable speech and image synthesis.
  5. Across 85 benchmarks, Uni-MoE-2.0-Omni surpasses Qwen2.5-Omni on more than 50 of 76 shared tasks, with video understanding and omni-modal understanding improved by roughly 7%, audio-visual reasoning improved by about 4%, and a relative WER reduction of up to 4.2% on long-form speech.

Check out the paper, model weights, and project page. Tutorials, code, and notebooks are available on the GitHub page.

