ReVisual-R1: An Open-Source 7B Multimodal Large Language Model (MLLM) That Achieves Long, Accurate, and Thoughtful Reasoning

The challenge of multimodal reasoning
Recent breakthroughs in text-based language models such as DeepSeek-R1 show that reinforcement learning (RL) can foster strong reasoning capabilities. Motivated by this, researchers have tried to apply the same RL techniques to MLLMs to improve their ability to reason over visual and textual inputs. However, these attempts have not been entirely successful: MLLMs still struggle with complex reasoning tasks. This suggests that simply reusing RL recipes from text-only models may not work well in multimodal settings, where interactions between different data types introduce new challenges that require a more tailored approach.
Evolution of multimodal models
Recent MLLM research builds on the progress of LLMs by combining visual inputs with language understanding. Early models such as CLIP and MiniGPT-4 laid the groundwork, followed by instruction-tuned models such as LLaVA. While closed-source models demonstrate strong reasoning through long chain-of-thought (CoT) outputs, open-source models have focused primarily on fine-tuning and CoT adaptation, which often produces short answers that limit deeper reasoning. RL techniques, including RLHF and GRPO, have shown promise for enhancing reasoning in LLMs. Inspired by this, recent work aims to apply RL to MLLMs to improve visual reasoning and support richer, longer outputs.
Introduction to ReVisual-R1
Researchers from Tsinghua University, Shanghai Jiao Tong University, and the Shanghai Artificial Intelligence Laboratory introduced ReVisual-R1, a 7B-parameter open-source MLLM that sets a new standard for multimodal reasoning. Their study surfaces three key insights: (1) careful text-only cold-start training alone provides a strong foundation, outperforming many existing MLLMs even before any RL; (2) the commonly used GRPO algorithm suffers from gradient stagnation, which they address with a new method called Prioritized Advantage Distillation (PAD); and (3) adding a final text-only RL phase after multimodal RL further strengthens reasoning. Their three-stage approach, text cold-start training, multimodal RL, and final text-only RL, strikes an effective balance between visual grounding and deep cognitive reasoning.
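To make the gradient-stagnation point concrete, here is a minimal NumPy sketch of how GRPO-style group-relative advantages are computed, and why a group in which every rollout receives the same reward contributes no gradient. The `pad_filter` helper and its `min_abs_adv` threshold are illustrative assumptions about how prioritizing informative rollouts could look, not the paper's exact PAD formulation.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for a group of rollouts on one prompt.

    Each rollout's advantage is its reward standardized against the group
    mean/std. If every rollout gets the same reward (all correct or all
    wrong), all advantages are zero and the policy gradient vanishes --
    the "gradient stagnation" described above.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def pad_filter(reward_groups, min_abs_adv=1e-3):
    """Hypothetical prioritization step: keep only groups whose advantages
    still carry signal, so uninformative groups do not dominate the batch."""
    informative = []
    for rewards in reward_groups:
        adv = group_relative_advantages(rewards)
        if np.abs(adv).max() > min_abs_adv:  # group still contributes gradient
            informative.append((rewards, adv))
    return informative

print(group_relative_advantages([1, 1, 1, 1]))  # all zeros -> stalled update
print(group_relative_advantages([1, 0, 1, 0]))  # nonzero -> useful gradient
```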
Developing the GRAMMAR dataset
The GRAMMAR dataset was developed after the researchers found that existing multimodal cold-start datasets lack the depth needed to train strong reasoning models. Text-only datasets such as DeepMath yield better gains on both text and multimodal tasks, suggesting that textual complexity is more effective at stimulating reasoning. To address this, GRAMMAR uses a multi-stage curation process that combines diverse text-only and multimodal samples. This data feeds a Staged Reinforcement Optimization (SRO) framework: the model is first trained with multimodal RL, augmented by Prioritized Advantage Distillation to avoid stalled learning and an efficient length reward to curb verbosity, and then with a text-only RL phase to improve reasoning and linguistic fluency.
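The length reward is described only at a high level, so the snippet below is a hypothetical sketch of how a verbosity-curbing reward shaping term might look: the task reward is kept, and a mild penalty grows once a response overshoots a target length. `target_len` and `alpha` are made-up hyperparameters, not values from the paper.

```python
def length_shaped_reward(task_reward: float, response_len: int,
                         target_len: int = 2048, alpha: float = 0.2) -> float:
    """Hypothetical length-aware reward: keep the task reward but subtract a
    penalty proportional to how far the response exceeds a target length,
    discouraging verbosity without punishing genuinely long reasoning."""
    overflow = max(0, response_len - target_len)
    return task_reward - alpha * overflow / target_len

# Example: a correct but overly long answer earns slightly less than a concise one.
print(length_shaped_reward(1.0, 1500))   # 1.0 (under target, no penalty)
print(length_shaped_reward(1.0, 4096))   # 0.8 (penalized for overshooting)
```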
Three-stage training pipeline
The ReVisual-R1 experiments follow a structured three-stage training process: starting with text-only data to build a language and reasoning foundation, then applying multimodal reinforcement learning for visual-text reasoning, and finally fine-tuning with text-only RL to refine reasoning and fluency. Evaluated across a wide range of benchmarks, the model outperforms both open-source and some commercial models on multimodal and mathematical reasoning tasks, achieving the best results on nine of ten benchmarks. Ablation studies confirm the importance of the training order and of the Prioritized Advantage Distillation method, which focuses learning on high-quality responses and thereby significantly improves overall performance.
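For readers who prefer to see the curriculum as code, here is a minimal, hypothetical outline of the three stages described above. The function names and bodies are placeholders for the actual SFT and RL training loops, not the authors' implementation.

```python
def cold_start_sft(model, text_data):
    # Stage 1: supervised fine-tuning on high-quality text-only reasoning data.
    return model  # placeholder for the SFT loop

def multimodal_grpo_with_pad(model, multimodal_data):
    # Stage 2: multimodal GRPO with Prioritized Advantage Distillation
    # and a length-shaped reward to curb verbosity.
    return model  # placeholder for the multimodal RL loop

def text_only_rl(model, text_data):
    # Stage 3: a final text-only RL phase to sharpen reasoning and fluency.
    return model  # placeholder for the text-only RL loop

def train_revisual_r1(base_model, text_sft_data, multimodal_rl_data, text_rl_data):
    model = cold_start_sft(base_model, text_sft_data)
    model = multimodal_grpo_with_pad(model, multimodal_rl_data)
    return text_only_rl(model, text_rl_data)
```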
Summary and contributions
In short, ReVisual-R1 is a 7B open-source MLLM designed to tackle the challenges of complex multimodal reasoning. Rather than relying on scale alone, it uses a carefully designed three-stage training process: starting with high-quality text data to build a reasoning foundation, then a multimodal RL phase stabilized by the new PAD technique, and ending with a final text-only RL refinement. This thoughtful curriculum substantially improves performance. ReVisual-R1 sets a new benchmark among 7B models, excelling on tasks such as MathVerse and AIME. The work highlights how structured training can unlock deeper reasoning in MLLMs.
Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
