This AI Paper Introduces GRIT: A Method for Teaching MLLMs to Reason About Images by Interleaving Text and Visual Grounding

The core idea behind multimodal large language models (MLLMs) is to create models that combine visual content with language reasoning. However, despite advances in the field, many models still struggle to connect these two domains effectively, resulting in limited performance on complex reasoning tasks that involve visual components.

The main challenge in building such models is their limited ability to combine visual understanding with logical reasoning. Current systems often produce textual outputs that describe reasoning but cannot refer to specific parts of an image. This creates a gap: a model may arrive at an answer without clearly showing how the visual evidence supported its decision. It is also difficult to ensure that a model generates visual reasoning steps directly connected to its answer. The fundamental question is how to train a model to naturally interleave text and image reasoning without large datasets annotated with visual references, which are scarce and expensive to produce.

Existing methods attempt to address this problem through reinforcement learning or prompting strategies. Some systems generate bounding box coordinates as answers, while others produce step-by-step textual reasoning chains. Both have limitations: models that output only bounding boxes lack explanation, while those that output only text risk ignoring visual evidence. Previous methods often separate visual grounding from reasoning, making it difficult for the model to explain why specific visual elements lead to certain conclusions. Other approaches rely on dense supervision or external tools, which typically require extensive annotation and do not scale well. This makes it difficult for developers to build models that transparently explain their reasoning and handle diverse visual tasks with minimal data.

Researchers from UC Santa Cruz and eBay have introduced a new approach called Grounded Reasoning with Images and Texts (GRIT), which enables MLLMs such as Qwen2.5-VL and InternVL 3 to generate reasoning chains that interleave natural language with explicit bounding box coordinates pointing to relevant image regions. This unified approach lets the model reason about and visually ground its answers without dense annotations or labeled reasoning chains. GRIT also uses a lightweight reinforcement learning algorithm called GRPO-GR, whose reward optimizes both final-answer accuracy and the structure of the reasoning, encouraging the model to include designated reasoning tokens and well-formed bounding boxes. This design eliminates the need for expensive annotated data while ensuring that the model learns to reference visual content meaningfully in its reasoning steps.
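To make the idea of a structure-aware reward concrete, here is a minimal sketch of how a reward in the spirit of GRPO-GR might score a generation: one term checks answer accuracy, another checks that the output interleaves text with well-formed bounding boxes. The regex, reward weights, and function names are illustrative assumptions, not the paper's actual implementation.

```python
import re

# Illustrative sketch only: the exact special tokens and reward weights of
# GRPO-GR are not reproduced here; BOX_PATTERN and the weights are assumptions.
BOX_PATTERN = re.compile(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]")  # [x1, y1, x2, y2]

def format_reward(generation: str) -> float:
    """Return 1.0 if the generation interleaves text reasoning with at least one
    well-formed bounding box, else 0.0 (a stand-in for the structural reward)."""
    boxes = BOX_PATTERN.findall(generation)
    has_valid_box = any(int(x1) < int(x2) and int(y1) < int(y2)
                        for x1, y1, x2, y2 in boxes)
    has_text = len(BOX_PATTERN.sub("", generation).strip()) > 0
    return 1.0 if (has_valid_box and has_text) else 0.0

def total_reward(generation: str, predicted_answer: str, gold_answer: str,
                 w_answer: float = 1.0, w_format: float = 0.5) -> float:
    """Combine answer-accuracy and format rewards, mirroring at a high level
    the two signals the article attributes to GRPO-GR."""
    correct = predicted_answer.strip().lower() == gold_answer.strip().lower()
    return w_answer * (1.0 if correct else 0.0) + w_format * format_reward(generation)
```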

The GRIT method focuses on generating outputs that combine textual reasoning with visual grounding. Instead of requiring the model to process cropped images or additional visual inputs after generating bounding boxes, GRIT teaches the model to rely on its internal understanding of the image: bounding boxes are produced during inference, and the model learns to reflect on those coordinates within its reasoning. The reinforcement learning framework rewards correct bounding box formatting and reasoning structure, guiding the model toward coherent, grounded reasoning chains. GRIT uses only 20 image–question–answer triplets, drawn from the Visual Spatial Reasoning (VSR) and TallyQA datasets, demonstrating strong data efficiency. Training was performed on an NVIDIA A100 GPU with the AdamW optimizer and a cosine learning-rate scheduler applied over 200 training steps, showing that the method remains practical despite the limited data.
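Below is a minimal sketch of the optimization setup mentioned above (AdamW with a cosine learning-rate schedule over 200 steps). The model, learning rate, and loss are placeholders; the real GRIT training loop applies GRPO-GR policy-gradient updates to an MLLM, which are not reproduced here.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Sketch of the reported optimizer/scheduler setup; hyperparameters are assumed.
model = torch.nn.Linear(8, 2)                   # stand-in for the MLLM policy
optimizer = AdamW(model.parameters(), lr=1e-6)  # lr is an assumed value
scheduler = CosineAnnealingLR(optimizer, T_max=200)

for step in range(200):
    optimizer.zero_grad()
    dummy_input = torch.randn(4, 8)
    loss = model(dummy_input).pow(2).mean()     # placeholder objective, not GRPO-GR
    loss.backward()
    optimizer.step()
    scheduler.step()                            # cosine decay over the 200 steps
```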

Performance evaluations show that GRIT-trained models outperform several baselines in both reasoning and grounding accuracy. For example, the GRIT-trained Qwen2.5-VL reached 72.9% answer accuracy on Visual Spatial Reasoning (VSR), 47.8% on TallyQA, and 62.8% on GQA, with grounding scores of 0.325 on VSR and 0.447 on TallyQA. By contrast, baselines such as direct-query or chain-of-thought prompting scored markedly lower, showing a limited ability to unify reasoning with visual grounding. GRIT-trained models exhibit a strong correlation between the grounded visual regions and the textual reasoning, producing outputs that reflect a meaningful connection between image evidence and logical thought. GRIT also shows improvements on out-of-domain benchmarks, although the gains are more pronounced on in-domain data, highlighting the importance of training data diversity for broader generalization.
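For readers wondering what a grounding score might measure, the sketch below computes intersection-over-union (IoU) between a predicted and a reference bounding box, a common way to quantify how well generated regions match the visual evidence. This is only an illustration; the paper's exact grounding metric may differ.

```python
def iou(box_a, box_b):
    """IoU of two [x1, y1, x2, y2] boxes; one plausible ingredient of a
    grounding score, not necessarily the metric used in the paper."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: two partially overlapping boxes
print(iou([10, 10, 50, 50], [30, 30, 70, 70]))  # ~0.14
```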

In short, this work addresses the problem of disconnected reasoning and visual grounding in MLLMs by introducing GRIT. The method enables models to reason with images through a simple, efficient approach that requires minimal data. GRIT successfully teaches MLLMs to combine visual evidence with logical reasoning in a unified output, achieving strong performance across multiple benchmarks and representing a promising step toward more interpretable AI systems.


Check out the paper, project page, and GitHub repository. All credit for this research goes to the researchers of this project.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who researches applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.
