New from the Chinese Academy of Sciences: Stream-Omni, an LLM for Cross-Modal Real-Time AI

Understanding the limitations of current omni-modal architectures
Large multimodal models (LMMs) have demonstrated impressive capabilities across text, vision, and speech modalities, creating significant potential for a wide range of applications. While vision-oriented LMMs have achieved notable success, supporting speech interaction grounded in visual information remains challenging because of the inherent representational differences across modalities. Recent omni-modal LMMs aim to unify text, vision, and speech by concatenating the representations from individual modality encoders along the sequence dimension. However, they rely on learning modality alignment in a data-driven way from large-scale data, which is a poor match for the limited publicly available tri-modal datasets, and they lack the flexibility to produce intermediate text results, such as transcriptions, during speech interaction.
Classifying existing LMMs by modality focus
Current LMMs fall into three categories: vision-oriented, speech-oriented, and omni-modal. Vision-oriented LMMs such as LLaVA use a visual encoder to extract visual features, combine them with text inputs, and pass them to the LLM to generate text. Speech-oriented LMMs either use continuous representations, as in Mini-Omni and LLaMA-Omni, which project speech features into the LLM embedding space, or discrete speech units, as in SpeechGPT and Moshi, which convert speech into discrete units for direct LLM processing. Omni-modal LMMs such as VITA-1.5, MiniCPM-o 2.6, and Qwen2.5-Omni extract representations with separate encoders, concatenate them for multimodal understanding, and synthesize speech with a speech decoder.
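To make the distinction between the two speech-oriented designs concrete, the sketch below contrasts the continuous and discrete integration styles in PyTorch. The module names, dimensions, and the unit tokenizer mentioned in the comments are illustrative assumptions, not the actual implementations of Mini-Omni, LLaMA-Omni, SpeechGPT, or Moshi.

```python
import torch
import torch.nn as nn

class ContinuousSpeechAdapter(nn.Module):
    """Continuous style (Mini-Omni / LLaMA-Omni like): project speech encoder
    features into the LLM embedding space and concatenate with text embeddings."""
    def __init__(self, speech_dim=1280, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(speech_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, speech_feats, text_embeds):
        # speech_feats: (B, T_s, speech_dim), text_embeds: (B, T_t, llm_dim)
        speech_embeds = self.proj(speech_feats)
        # Sequence-dimension concatenation: the LLM sees [speech tokens | text tokens]
        return torch.cat([speech_embeds, text_embeds], dim=1)


class DiscreteSpeechAdapter(nn.Module):
    """Discrete style (SpeechGPT / Moshi like): speech is first quantized into
    unit ids (e.g. by a k-means tokenizer, assumed here) and embedded like
    ordinary vocabulary tokens."""
    def __init__(self, num_units=1024, llm_dim=4096):
        super().__init__()
        self.unit_embedding = nn.Embedding(num_units, llm_dim)

    def forward(self, unit_ids, text_embeds):
        # unit_ids: (B, T_s) integer speech units, text_embeds: (B, T_t, llm_dim)
        speech_embeds = self.unit_embedding(unit_ids)
        return torch.cat([speech_embeds, text_embeds], dim=1)
```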
Introducing Stream-Omni: A Text-Centric Alignment Approach
Researchers from the Chinese Academy of Sciences have proposed Stream-Omni, a large language-vision-speech model designed to address the modality alignment challenges of omni-modal systems. It builds on an LLM backbone and aligns the vision and speech modalities to text according to their semantic relationships with text, rather than through simple concatenation. For vision, the method applies sequence-dimension concatenation to align vision and text. For speech, it introduces a CTC-based layer-dimension mapping for speech-text alignment. This targeted alignment design allows Stream-Omni to overcome the limitations of concatenation-based approaches.
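As a rough illustration of how a CTC objective can supervise speech-to-text alignment at the layer level, consider the minimal sketch below. The class name, the vocabulary projection, and all dimensions are assumptions made for illustration; they do not reproduce Stream-Omni's actual mapping layer.

```python
import torch
import torch.nn as nn

class CTCSpeechTextMapping(nn.Module):
    """Illustrative sketch of a CTC-based speech-to-text mapping: hidden states
    at speech positions are projected onto the text vocabulary (plus a blank
    symbol) and trained with CTC, so the model learns which text token each
    stretch of speech corresponds to."""
    def __init__(self, hidden_dim=4096, vocab_size=32000):
        super().__init__()
        self.blank_id = vocab_size                      # extra class for the CTC blank
        self.to_vocab = nn.Linear(hidden_dim, vocab_size + 1)
        self.ctc = nn.CTCLoss(blank=self.blank_id, zero_infinity=True)

    def forward(self, speech_hidden, speech_lens, text_ids, text_lens):
        # speech_hidden: (B, T_s, hidden_dim) hidden states from the speech layers
        # text_ids:      (B, T_t) ground-truth text token ids (no blank symbol)
        log_probs = self.to_vocab(speech_hidden).log_softmax(dim=-1)
        # nn.CTCLoss expects log-probabilities of shape (T_s, B, C)
        return self.ctc(log_probs.transpose(0, 1), text_ids, speech_lens, text_lens)
```

The intuition behind such a design is that speech is largely semantically consistent with its transcription, so a frame-level alignment objective like CTC can pull speech representations toward the corresponding text tokens without requiring massive tri-modal training data.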
Architecture Overview: Dual Speech Layers and Visual Encoding
The Stream-Omni architecture adopts an LLM backbone with a progressive modality alignment strategy. For vision-text alignment, Stream-Omni applies a visual encoder and a projection layer to extract visual representations. For speech-text alignment, it introduces special speech layers at both the bottom and the top of the LLM backbone, enabling bidirectional mapping between the speech and text modalities. Stream-Omni constructs its training corpus through automated pipelines, leveraging LLaVA datasets for vision-text pairs, LibriSpeech and WenetSpeech for speech-text data, and generating speech-paired instruction data by applying text-to-speech synthesis to existing text instruction datasets.
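Below is a self-contained structural sketch of such a pipeline, using placeholder Transformer layers in place of a real vision encoder and pretrained LLM backbone. Every class name, layer count, and dimension here is an illustrative assumption rather than the released Stream-Omni code.

```python
import torch
import torch.nn as nn

class StreamOmniSketch(nn.Module):
    """Rough structural sketch (not the official implementation): a projected
    visual stream is concatenated along the sequence dimension, while bottom
    and top speech layers wrap the LLM backbone for speech input and output."""
    def __init__(self, llm_dim=512, vision_dim=256, speech_dim=128,
                 n_llm_layers=4, n_speech_layers=2, nhead=8):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, llm_dim)   # projection after the visual encoder
        self.speech_proj = nn.Linear(speech_dim, llm_dim)
        def stack(n):
            return nn.ModuleList(
                nn.TransformerEncoderLayer(llm_dim, nhead=nhead, batch_first=True)
                for _ in range(n))
        self.bottom_speech_layers = stack(n_speech_layers)  # speech -> text-like space
        self.llm_layers = stack(n_llm_layers)               # stand-in for the LLM backbone
        self.top_speech_layers = stack(n_speech_layers)     # text-like space -> speech

    def forward(self, vision_feats, speech_feats, text_embeds):
        # vision_feats: (B, T_v, vision_dim), speech_feats: (B, T_s, speech_dim),
        # text_embeds: (B, T_t, llm_dim)
        vis = self.vision_proj(vision_feats)
        sp = self.speech_proj(speech_feats)
        for layer in self.bottom_speech_layers:   # speech-text alignment (CTC-supervised in training)
            sp = layer(sp)
        # Vision-text alignment: concatenate along the sequence dimension.
        hidden = torch.cat([vis, sp, text_embeds], dim=1)
        for layer in self.llm_layers:
            hidden = layer(hidden)
        out = hidden
        for layer in self.top_speech_layers:      # map text-level states back toward speech
            out = layer(out)
        return hidden, out

# Example usage with random features standing in for encoder outputs:
# model = StreamOmniSketch()
# text_hidden, speech_out = model(torch.randn(1, 16, 256),
#                                 torch.randn(1, 40, 128),
#                                 torch.randn(1, 8, 512))
```

In the actual system the backbone would be a pretrained LLM, the bottom speech layers would be trained with a CTC-based mapping like the one sketched earlier, and the top speech layers would drive speech generation, which is what allows the model to emit intermediate text results while it speaks.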
Benchmarking multimodal capabilities across domains
On visual understanding tasks, Stream-Omni performs comparably to advanced vision-oriented LMMs and outperforms VITA-1.5, reducing modality interference while maintaining strong visual capabilities. For speech interaction, Stream-Omni delivers excellent knowledge-based performance while using far less speech data (23K hours) than speech-unit-based models such as SpeechGPT, Moshi, and GLM-4-Voice. On benchmark evaluations of vision-grounded speech interaction, Stream-Omni also outperforms VITA-1.5 in real-world visual understanding. In addition, the quality of its speech-text mapping yields excellent ASR performance on the LibriSpeech benchmark, in both accuracy and inference time.
Conclusion: A Paradigm Shift in Multimodal Alignment
In summary, the researchers introduced Stream-Omni as a solution to the modality alignment challenge in omni-modal systems. The approach demonstrates that effective modality alignment can be achieved through sequence-dimension concatenation for vision-text pairs and layer-dimension mapping for speech-text integration, eliminating the need for extensive tri-modal training data. Furthermore, the study establishes a new paradigm for omni-modal LMMs, showing that targeted alignment strategies based on semantic relationships can overcome the limitations of traditional concatenation-based approaches in multimodal AI systems.
Check out the paper and the model on Hugging Face. All credit for this research goes to the researchers of this project.

Sajjad Ansari is a final-year undergraduate student at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI with a focus on understanding AI technologies and their real-world impact. He aims to explain complex AI concepts in a clear and accessible way.
