
This AI Paper Introduces LLaDA-V: A Purely Diffusion-Based Multimodal Large Language Model for Visual Instruction Tuning and Multimodal Reasoning

Multimodal Large Language Models (MLLMs) are designed to process and generate content across multiple modalities, including text, images, audio, and video. These models aim to understand and integrate information from different sources, enabling applications such as visual question answering, image captioning, and multimodal dialogue systems. The development of MLLMs represents an important step toward AI systems that can interpret and interact with the world in a more human-like way.

The central challenge in developing effective MLLMs is integrating diverse input types, particularly visual data, into language models while maintaining high performance across tasks. Existing models often struggle to balance strong language understanding with effective visual reasoning, especially when scaling to complex data. In addition, many models require large datasets to perform well, making them difficult to adapt to specific tasks or domains. These challenges highlight the need for more efficient and scalable approaches to multimodal learning.

Current MLLMs mainly rely on autoregressive methods, predicting tokens one at a time from left to right. Although effective, this approach has limitations in handling complex multimodal contexts. Alternative methods such as diffusion models have been explored, but they often exhibit weaker language understanding due to restricted architectures or insufficient training strategies. These limitations leave open the question of whether a well-designed, purely diffusion-based model can deliver competitive multimodal reasoning capabilities.

Researchers from Renmin University of China and Ant Group have introduced LLaDA-V, a purely diffusion-based multimodal large language model (MLLM) that integrates visual instruction tuning with masked diffusion modeling. LLaDA-V is built on LLaDA, a large language diffusion model, and incorporates a vision encoder and an MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. This design departs from the autoregressive paradigm that dominates current multimodal approaches, aiming to overcome its limitations while maintaining data efficiency and scalability.
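The connector described above follows the familiar pattern of projecting vision-encoder features into the language model's embedding space. The sketch below illustrates that idea in PyTorch; the class name, the two-layer GELU projector, and the feature dimensions are assumptions chosen for illustration, not details confirmed by the paper.

```python
# Minimal sketch (not the official implementation) of wiring a vision encoder
# to a diffusion language tower via an MLP connector. Dimensions and the
# two-layer GELU design are assumptions based on common LLaVA-style setups.
import torch
import torch.nn as nn

class VisionToLanguageProjector(nn.Module):
    """Projects vision-encoder patch features into the language embedding space."""
    def __init__(self, vision_dim: int = 1152, lm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.mlp(patch_features)  # (batch, num_patches, lm_dim)

# Usage sketch: projected visual tokens are concatenated with the text
# embeddings of the prompt before being fed to the diffusion language model.
projector = VisionToLanguageProjector()
image_feats = torch.randn(1, 729, 1152)   # hypothetical SigLIP-style patch features
visual_tokens = projector(image_feats)    # now lives in the LM embedding space
```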

LLaDA-V adopts a masked diffusion process in which the text response is progressively refined by iteratively predicting masked tokens. Unlike autoregressive models that predict tokens sequentially, LLaDA-V generates output by reversing the masked diffusion process. The model is trained in three stages: the first stage aligns vision and language embeddings by mapping visual features from SigLIP 2 into LLaDA's language space; the second stage performs fine-tuning with 10 million single-image samples and 2 million MAmmoTH-VL multimodal samples; the third stage focuses on reasoning with 900K QA pairs from VisualWebInstruct and a mixed-dataset strategy. Bidirectional attention improves context comprehension, enabling strong multimodal understanding.
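To make the reverse masked-diffusion process concrete, the sketch below shows one common way such decoding can be organized: start from a fully masked response, predict all masked positions in parallel with bidirectional attention, commit the most confident predictions, and keep the rest masked for the next step. The function, mask-token id, step count, and low-confidence remasking schedule are illustrative assumptions rather than LLaDA-V's exact procedure.

```python
# Minimal sketch of reverse masked-diffusion decoding, assuming a model that
# returns per-position vocabulary logits for the full sequence. All constants
# and the remasking schedule are assumptions for illustration.
import torch

MASK_ID = 126336      # hypothetical mask-token id
RESPONSE_LEN = 128    # hypothetical generation length
NUM_STEPS = 64        # number of denoising steps (assumption)

@torch.no_grad()
def masked_diffusion_decode(model, prompt_ids: torch.Tensor) -> torch.Tensor:
    # Start from a fully masked response appended to the (visual + text) prompt.
    response = torch.full((1, RESPONSE_LEN), MASK_ID, dtype=torch.long)
    tokens = torch.cat([prompt_ids, response], dim=1)
    resp_slice = slice(prompt_ids.shape[1], tokens.shape[1])

    for _ in range(NUM_STEPS):
        logits = model(tokens)                  # (1, total_len, vocab), bidirectional attention
        probs = logits[:, resp_slice].softmax(-1)
        conf, pred = probs.max(-1)              # per-position confidence and prediction

        # Only still-masked positions are candidates for being committed.
        still_masked = tokens[:, resp_slice] == MASK_ID
        conf = torch.where(still_masked, conf, torch.tensor(float("-inf")))

        # Commit the most confident predictions this step; the rest stay masked.
        num_to_unmask = max(1, RESPONSE_LEN // NUM_STEPS)
        top = conf.topk(num_to_unmask, dim=-1).indices
        tokens[:, resp_slice].scatter_(1, top, pred.gather(1, top))

    return tokens[:, resp_slice]
```

Because every position is predicted in parallel at each step, the number of model calls is governed by the step schedule rather than the response length, which is one of the practical appeals of diffusion-style decoding.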

In evaluations across 18 multimodal tasks, LLaDA-V performed strongly compared with hybrid autoregressive-diffusion and purely diffusion-based models. It outperformed LLaMA3-V on most multidisciplinary knowledge and mathematical reasoning benchmarks such as MMMU, MMMU-Pro, and MMStar, scoring 60.1 on MMStar despite relying on a weaker language tower. LLaDA-V also showed strong data efficiency, outperforming an LLaMA3-V model trained on 9 million samples on MMMU-Pro. Although it lags behind on chart and document understanding benchmarks such as AI2D and on real-world scene tasks such as RealWorldQA, the results highlight LLaDA-V's promise for multimodal tasks.

In summary, LLaDA-V addresses the challenge of building effective multimodal models by introducing a purely diffusion-based architecture that combines visual instruction tuning with masked diffusion. The approach delivers strong multimodal reasoning capabilities while maintaining data efficiency. This work demonstrates the potential of diffusion models in multimodal AI and paves the way for further exploration of probabilistic approaches to complex AI tasks.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, join our 95k+ ML SubReddit, and subscribe to our Newsletter.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who researches applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he explores new advancements and creates opportunities to contribute.
