UCLA Researchers Release OpenVLThinker-7B: A Reinforcement Learning-Driven Model for Enhancing Complex Visual Reasoning and Step-by-Step Problem Solving in Multimodal Systems
Large vision-language models (LVLMs) integrate large language models with image-processing capabilities, allowing them to interpret images and generate coherent textual responses. While they excel at recognizing visual content and responding to prompts, they often falter on problems that require multi-step reasoning. Vision-language tasks such as understanding charts, solving visual math problems, or interpreting diagrams demand more than recognition; they require the ability to follow logical steps grounded in visual cues. Despite advances in model architecture, current systems still struggle to produce accurate, explainable answers in these complex settings.
A major limitation of current LVLMs is their inability to perform complex reasoning that involves multiple steps of logical inference, especially when interpreting images in combination with textual queries. These models typically cannot verify or correct their reasoning internally, resulting in incorrect or shallow outputs. Moreover, their reasoning chains are usually opaque or unverifiable, making it difficult to ensure the robustness of their conclusions. The challenge is to bridge this reasoning gap: reinforcement learning techniques have proven effective for text-only models, but they have not yet been fully adapted to vision-language models.
Prior to this study, efforts to strengthen reasoning in such systems relied mainly on standard fine-tuning or prompting techniques. While helpful for basic tasks, these methods often produce verbose or repetitive outputs with limited depth. Vision-language models such as Qwen2.5-VL-7B show promise in following visual instructions, but lack multi-step reasoning comparable to text-only peers such as DeepSeek-R1. Even with structured prompts, these models struggle to reflect on their own outputs or verify intermediate reasoning steps. This is a critical bottleneck for use cases that require structured decision-making, such as visual problem solving or educational support tools.
UCLA researchers have introduced a model called OpenVLThinker-7B, developed through a novel training method that combines supervised fine-tuning (SFT) and reinforcement learning (RL) in an iterative loop. The process first generates image captions with Qwen2.5-VL-3B and feeds them into a distilled version of DeepSeek-R1 to produce structured reasoning chains. These outputs form the training data for the first round of SFT, guiding the model toward a basic reasoning structure. A reinforcement learning phase using Group Relative Policy Optimization (GRPO) is then applied to refine the model's reasoning based on reward feedback. This combination lets the model improve itself progressively, taking each iteration's refined outputs as new training data for the next cycle.
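To make the loop concrete, here is a minimal structural sketch of that pipeline in Python. Every callable in it (captioner, reasoner, extract_answer, run_sft, run_grpo) is a hypothetical placeholder standing in for the components described above, not the authors' actual code:

```python
# Structural sketch of the captioning -> distillation -> SFT -> GRPO loop
# described above. Every callable (captioner, reasoner, extract_answer,
# run_sft, run_grpo) is a hypothetical placeholder, not the authors' API.

def build_sft_data(samples, captioner, reasoner, extract_answer):
    """Caption each image with a small VLM, have a text-only R1-distilled
    model write a reasoning chain from caption + question, and keep only
    chains whose final answer matches the ground truth."""
    data = []
    for image, question, answer in samples:
        caption = captioner(image)                        # e.g. Qwen2.5-VL-3B
        trace = reasoner(f"{caption}\n\nQuestion: {question}")
        if extract_answer(trace) == answer:               # verified chains only
            data.append({"image": image, "question": question, "target": trace})
    return data

def self_improve(model, samples, rl_samples, helpers, rounds=2):
    captioner, reasoner, extract_answer, run_sft, run_grpo = helpers
    for _ in range(rounds):
        sft_data = build_sft_data(samples, captioner, reasoner, extract_answer)
        model = run_sft(model, sft_data)      # learn the reasoning format
        model = run_grpo(model, rl_samples)   # refine with reward feedback
        reasoner = model                      # the refined model writes the next
                                              # round's traces (the caption detour
                                              # is only needed in the first round)
    return model
```

Keeping only chains whose final answer can be checked against ground truth gives the loop a quality signal that requires no human annotation of the reasoning itself.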
This method involves careful data curation across multiple training phases. In the first iteration, 25,000 examples drawn from datasets such as FigureQA, Geometry3K, TabMWP, and VizWiz were used for SFT. These examples were filtered to remove overly verbose or redundant reflections, improving training quality. GRPO was then applied to a smaller, more difficult set of 5,000 samples, raising accuracy on the MathVista benchmark from 62.5% to 65.6%. In the second iteration, another 5,000 high-quality examples were used for SFT, increasing accuracy to 66.1%, and a second round of GRPO improved performance to 69.4%. Across these phases, the model was evaluated on multiple benchmarks, MathVista, MathVerse, and MathVision, showing steady performance gains with each iteration.
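The filtering step can be illustrated with a simple heuristic. The sketch below is runnable, but the length cap and reflection markers are illustrative assumptions; the article does not specify the exact rules used:

```python
# Illustrative filter for SFT reasoning traces. MAX_WORDS and the
# reflection markers are assumptions for demonstration only.
MAX_WORDS = 600
REFLECTION_MARKERS = ("wait,", "let me re-check", "on second thought")

def keep_trace(trace: str) -> bool:
    """Drop reasoning chains that are excessively long or that loop
    through repeated self-reflection instead of making progress."""
    if len(trace.split()) > MAX_WORDS:
        return False
    lowered = trace.lower()
    reflections = sum(lowered.count(marker) for marker in REFLECTION_MARKERS)
    return reflections <= 2

traces = [
    "Read the bar heights from the chart... so the total is 42.",
    "Wait, let me re-check. Wait, let me re-check. Wait, let me re-check.",
]
print([keep_trace(t) for t in traces])   # [True, False]
```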
Quantitatively, OpenVLThinker-7B outperforms its base model, Qwen2.5-VL-7B. On MathVista, it reaches 70.2% accuracy versus 50.2% for the base model. On MathVerse, accuracy improved from 46.8% to 68.5%. On the MathVision full test set, accuracy rose from 24.0% to 29.6%, while MathVision testmini increased from 25.3% to 30.4%. These improvements suggest that the model does not merely memorize reasoning patterns but generalizes better to unseen multimodal tasks. Each training iteration contributed measurable gains, demonstrating the strength of combining fine-tuning with reward-based learning in a looped structure.
The core strength of this model lies in its iterative structure. Rather than relying on sheer dataset size, it focuses on quality and structure. Each cycle of SFT and RL improves the model's ability to connect images, questions, and answers. Self-verification and self-correction behaviors, which standard LVLMs initially lack, emerged as a by-product of reinforcement learning with a verifiable reward signal. This enables OpenVLThinker-7B to produce reasoning trajectories that are both logically consistent and interpretable. Even subtle improvements, such as reduced redundant self-reflection and higher accuracy with shorter reasoning chains, contribute to its overall performance gains.
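GRPO's reward mechanics can be sketched compactly. For each prompt, the policy samples a group of candidate responses; each response receives a verifiable reward (e.g., 1.0 if the extracted answer matches the ground truth, 0.0 otherwise), and advantages are computed relative to the group rather than via a learned value function. The snippet below is a minimal, self-contained illustration of that normalization, not the authors' implementation:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: rewards has shape (num_prompts, group_size),
    one row per prompt, one column per sampled response. Each reward is
    normalized against its own group's mean and standard deviation."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 prompts, 4 sampled responses each.
# Reward 1.0 = extracted answer matches the ground truth, 0.0 otherwise.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(grpo_advantages(rewards))
# Correct responses get positive advantage, incorrect ones negative,
# so the policy update pushes probability mass toward verified answers.
```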
Some key takeaways from the study:
- UCLA researchers developed OpenVLThinker-7B from the Qwen2.5-VL-7B base model using a combined SFT and RL method.
- Training proceeds in iterative cycles of caption generation, reasoning distillation, and alternating SFT and GRPO reinforcement learning.
- The initial SFT used 25,000 filtered examples, while the RL phase used a smaller set of 5,000 samples from datasets such as Geometry3K and Super-CLEVR.
- On MathVista, accuracy increased from 50.2% (base model) to 70.2%. MathVerse accuracy jumped from 46.8% to 68.5%, and other benchmarks also saw significant gains.
- GRPO effectively improves reasoning behavior by rewarding correct answers, reducing verbosity, and improving logical consistency.
- Each training iteration yields incremental improvement, confirming the effectiveness of the self-improvement strategy.
- The work establishes a viable path for bringing R1-style multi-step reasoning into multimodal models, with applications in education, visual analytics, and assistive technology.
Check out the Paper, Hugging Face model, and GitHub page. All credit for this research goes to the researchers of this project.