ByteDance researchers introduce VGR: a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception

Why multimodal reasoning matters for vision-language tasks
Multimodal reasoning enables models to make informed decisions and answer questions by combining visual and textual information. It plays a central role in interpreting charts, answering image-based questions, and understanding complex visual documents. The goal is to let machines use vision the way humans do: not merely registering what is in an image, but connecting it with language-based reasoning.
Challenges of visual reasoning and language bias
A central challenge in this area is that many models over-rely on linguistic information, even for tasks that require visual interpretation. This dependency leads to performance drops in perception-heavy applications. When a question requires identifying a specific object in an image or interpreting numerical data in a chart, these models often fail because they try to answer from learned language priors rather than analyzing the visual content. This creates a bottleneck for tasks that demand detailed visual understanding for accurate reasoning and decision-making.
Limitations of existing visual models
Various tools and methods have been introduced to improve performance on these tasks, but most still fall short when asked to analyze detailed visual cues. Some methods use pre-generated image captions or annotated regions to assist the model, while others rely on structured multi-step prompts to encourage reasoning. Despite these attempts, many models remain limited by static visual references or rigid pipelines. Models that use only text-based chains of thought, for example, often miss visual nuances, while those that rely on rigid prompts adapt poorly to diverse, open-ended queries. These limitations have slowed progress toward models that truly integrate vision and reasoning.
Introducing VGR: a Visual Grounded Reasoning framework
Researchers from ByteDance Inc. and the Chinese Academy of Sciences introduced a new model called Visual Grounded Reasoning (VGR). The study presents a method that enables the model to interact dynamically with visual elements during reasoning. Rather than processing the image and text streams separately, VGR identifies important image regions while thinking through a problem and uses those regions as part of the answering process. Alongside the model, the researchers created a new dataset, VGR-SFT, which lets the system learn visual reasoning with embedded image cues. This approach removes the need for manual annotation and enables flexible visual focus.
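To make the idea of reasoning with embedded image cues concrete, here is a minimal sketch of what such a training sample could look like. The field names, the replay tag, and the bounding-box format are illustrative assumptions, not the actual VGR-SFT schema.

```python
# Hypothetical illustration of a reasoning sample with an embedded region cue.
# Field names and the <replay_region> tag are assumptions, not the VGR-SFT format.
sample = {
    "image": "chart_0421.png",
    "question": "Which month had the highest sales?",
    "reasoning": (
        "The question asks about the bars in the chart. "
        "<replay_region bbox=[0.62, 0.10, 0.88, 0.55]> "  # model re-inspects this crop
        "The tallest bar corresponds to March, at roughly 540 units."
    ),
    "answer": "March",
}
```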
How selective visual replay enables effective image reasoning
At the heart of VGR is a technique called selective visual replay, which lets the model retrieve specific parts of the image when it needs them. A visual encoder extracts tokens from image regions and stores them in a visual memory pool. During inference, when the model reaches a point where visual information is required, it signals a replay, and the relevant image tokens are reintroduced into the reasoning stream. The system adopts an AnyRes strategy to expand resolution support while reducing token usage. Compared with the baseline method, VGR uses only 144 tokens for image snapshots and 720 tokens for high-resolution regions, a 70% reduction in total tokens. To train this capability, the model is guided by standard supervised learning together with an auxiliary loss function, strengthening its ability to select and interpret regions effectively.
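The sketch below shows, in simplified form, how such a replay loop might be wired together: region tokens are encoded once into a memory pool and spliced back into the token stream whenever the model signals that it wants another look. Everything here (the `VisualMemoryPool`, the dummy encoder and LLM, the replay signal) is a hypothetical stand-in under assumed interfaces, not the actual VGR implementation.

```python
# Minimal, self-contained sketch of a selective visual replay loop.
# All names (VisualMemoryPool, DummyEncoder, DummyLLM, the replay signal)
# are illustrative stand-ins, not the VGR codebase.
import torch


class VisualMemoryPool:
    """Caches region-level visual tokens so they can be replayed on demand."""

    def __init__(self):
        self._regions: dict[str, torch.Tensor] = {}

    def add(self, region_id: str, tokens: torch.Tensor) -> None:
        self._regions[region_id] = tokens

    def replay(self, region_id: str) -> torch.Tensor:
        # Re-fetch the cached tokens for the requested image region.
        return self._regions[region_id]


class DummyEncoder:
    """Stand-in vision encoder: maps an image crop to a few visual tokens."""

    def __call__(self, crop: torch.Tensor, n_tokens: int = 4, dim: int = 8) -> torch.Tensor:
        return torch.randn(n_tokens, dim)


class DummyLLM:
    """Stand-in language model that asks for one replay, then stops."""

    def __init__(self):
        self._step = 0

    def step(self, stream: torch.Tensor) -> dict:
        self._step += 1
        return {
            "requests_replay": self._step == 1,  # pretend the model wants to re-inspect a region
            "region_id": "tallest_bar",
            "token_embedding": torch.randn(stream.shape[-1]),
            "is_eos": self._step >= 3,
        }


def generate_with_replay(llm, prompt_embeds, pool, max_steps=16):
    """Autoregressive loop that splices cached region tokens back into the
    stream whenever the model signals that it needs to look at the image again."""
    stream = prompt_embeds
    for _ in range(max_steps):
        step = llm.step(stream)
        if step["requests_replay"]:
            stream = torch.cat([stream, pool.replay(step["region_id"])], dim=0)
        else:
            stream = torch.cat([stream, step["token_embedding"].unsqueeze(0)], dim=0)
        if step["is_eos"]:
            break
    return stream


if __name__ == "__main__":
    encoder = DummyEncoder()
    pool = VisualMemoryPool()
    pool.add("tallest_bar", encoder(torch.zeros(3, 32, 32), dim=8))
    prompt = torch.randn(5, 8)  # pretend prompt embeddings (5 tokens, dim 8)
    out = generate_with_replay(DummyLLM(), prompt, pool)
    print(out.shape)  # the stream grows as region tokens are replayed
```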
Benchmark results: higher accuracy with fewer visual tokens
The model was tested with LLaVA-NeXT-7B as the baseline and showed strong results. On the MMStar benchmark, VGR achieved a +4.1 improvement. It also outperformed the baseline by +7.1 on the AI2D benchmark and by an impressive +12.9 on ChartQA. These results were achieved while using only 30% of the visual tokens required by the baseline. In another comparison, VGR improved performance by 6.4 points on MMStar and 14.1 points on ChartQA, showing that it delivers both accuracy and efficiency with fewer resources. This performance demonstrates the effectiveness of the selective replay mechanism in enhancing multimodal reasoning through targeted visual engagement.
Final thoughts: going beyond text-centric reasoning
In summary, this work shows that integrating visual signals into the reasoning process can overcome the limitations of text-only inference. The researchers identified a clear problem, developed a precise method to address it, and demonstrated its usefulness with measurable results. The solution is practical and efficient, and it could redefine how visual cues are incorporated into intelligent reasoning systems.
Check out the Paper and Model. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.
Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.
