Although VLMs are strong at understanding text and images, they typically reason purely in text, which limits their ability to solve tasks that demand visual thinking, such as spatial puzzles. Humans naturally imagine solutions rather than verbalizing every detail, but this is difficult for VLMs. Although some recent models can generate both text and images, training them for image generation often degrades their reasoning ability, and generating full images does not support step-by-step visual reasoning. As a result, unlocking VLMs' full potential for complex, visually grounded thinking remains a key open challenge in the field.
Chain-of-thought (CoT) prompting encourages a model to reason step by step through a problem by providing examples with intermediate explanations. The idea has been extended to multimodal tasks, where visual information is woven into the reasoning flow. Methods such as ICoT embed image regions within text sequences, while Visual CoT uses visual annotations to train models for improved spatial understanding. Some recent models can generate text and images simultaneously; however, they demand heavy supervision and incur high computational cost. Separately, researchers are exploring ways to embed reasoning inside models by steering their hidden states, using special tokens or latent representations rather than explicit reasoning steps.
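To make the CoT idea concrete, here is a minimal sketch of a text-only chain-of-thought prompt. The question and worked exemplar are invented for illustration and are not drawn from any of the papers mentioned above.

```python
# A minimal chain-of-thought prompt: a worked example with intermediate
# steps is prepended so the model imitates the step-by-step format.
# The exemplar below is illustrative only.
cot_prompt = """Q: A maze has 3 corridors; only the corridor with an even
number of doors leads out. Corridor A has 3 doors, B has 4, C has 5.
Which corridor leads out?
A: Let's think step by step. A has 3 doors (odd), B has 4 doors (even),
C has 5 doors (odd). Only B has an even number of doors.
The answer is B.

Q: {new_question}
A: Let's think step by step."""

print(cot_prompt.format(new_question="Which of 7, 10, and 13 is even?"))
```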
Researchers at the University of Massachusetts Amherst and MIT have proposed an approach inspired by human mental imagery, the ability to form simple, task-relevant visual representations internally while thinking. They introduce Mirage, a framework that lets a VLM interleave visual reasoning directly into its text output without generating complete images. Instead, the model inserts compact visual cues derived from its own hidden states. It is trained in two phases, first with joint text and visual supervision, then with text-only guidance, and reinforcement learning further sharpens its reasoning skills. Mirage enables VLMs to think more like humans, improving their performance on complex multimodal tasks.
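The sketch below illustrates the interleaving idea at decode time: at designated steps, the current hidden state is compressed into a latent "visual" token and fed back as the next input embedding, rather than being decoded into text or pixels. All module names, dimensions, and the greedy decoding choice are assumptions for illustration, not Mirage's actual architecture.

```python
import torch
import torch.nn as nn

# Schematic of Mirage-style interleaved decoding (dimensions and modules
# are illustrative assumptions). At chosen steps the model emits a compact
# latent visual token derived from its hidden state instead of a text
# token -- no image is ever decoded.

HIDDEN, LATENT, VOCAB = 512, 64, 1000

to_latent = nn.Linear(HIDDEN, LATENT)     # compress hidden state -> visual thought
from_latent = nn.Linear(LATENT, HIDDEN)   # re-embed it for the next decode step
lm_head = nn.Linear(HIDDEN, VOCAB)        # ordinary text-token head

def decode_step(hidden_state: torch.Tensor, emit_latent: bool):
    if emit_latent:
        z = to_latent(hidden_state)           # compact latent visual token
        return from_latent(z), None           # next input embedding, no text token
    logits = lm_head(hidden_state)
    return None, int(logits.argmax(dim=-1))   # greedy text token for simplicity

h = torch.randn(HIDDEN)
emb, _ = decode_step(h, emit_latent=True)    # a "visual thought" step
_, tok = decode_step(h, emit_latent=False)   # an ordinary text step
```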
Mirage is a framework inspired by human mental imagery that allows a VLM to reason with compact visual cues rather than generating complete images. It adopts two training stages: first, the compressed visual features the model inserts during reasoning (called latent tokens) are grounded in auxiliary helper images under joint text and visual supervision. It then relaxes this constraint, letting the model generate its own latent tokens and use them to steer subsequent reasoning. This setup enables interleaved multimodal reasoning. A final reinforcement learning phase fine-tunes the model with accuracy and formatting rewards, encouraging correct answers and a structured thinking process.
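A hedged sketch of the two training objectives and the RL reward described above follows. The loss weighting, helper names, and the 0.5 format bonus are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def stage1_loss(pred_latents, target_image_feats, text_logits, text_targets,
                alpha=1.0):
    # Stage 1: latent tokens are anchored to compressed features of the
    # helper image, jointly with the usual next-token text loss.
    latent_loss = F.mse_loss(pred_latents, target_image_feats)
    text_loss = F.cross_entropy(text_logits, text_targets)
    return text_loss + alpha * latent_loss

def stage2_loss(text_logits, text_targets):
    # Stage 2: the grounding constraint is dropped; latent tokens evolve
    # freely, and only the text that follows them is supervised.
    return F.cross_entropy(text_logits, text_targets)

def rl_reward(answer: str, gold: str, well_formatted: bool) -> float:
    # Final RL phase: reward correct answers, plus a bonus for keeping
    # the structured think-then-answer format.
    return float(answer == gold) + 0.5 * float(well_formatted)

# Toy usage with random tensors (8 token positions, vocab of 1000).
logits = torch.randn(8, 1000)
targets = torch.randint(0, 1000, (8,))
latents, feats = torch.randn(8, 64), torch.randn(8, 64)
print(stage1_loss(latents, feats, logits, targets).item())
print(rl_reward("B", "B", well_formatted=True))
```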
The study evaluated the approach on four spatial reasoning tasks, such as visual puzzles and geometry problems, using a small dataset of 1,000 training samples. To support reasoning, it generates synthetic helper images and intermediate thought steps, mimicking how humans use sketches and cues to aid thinking. The model consistently outperformed both text-only and multimodal baselines, even on tasks that require extensive planning, such as maze solving. A smaller variant of the model also achieved strong results, indicating that the method is robust. Ablation studies confirmed that grounding the latent visual tokens first and only then training them flexibly is essential. Overall, interleaving visual and textual reasoning without real images improves both understanding and accuracy.
In summary, this study, inspired by how humans use mental imagery for reasoning, introduces a lightweight approach that lets VLMs think visually without producing actual images. By interleaving compact visual cues with text during decoding, the model performs multimodal reasoning through a two-stage training process: the cues are first anchored to real image features, then allowed to evolve freely in support of reasoning. A final reinforcement learning step further improves performance. Tested on spatial reasoning tasks, the method consistently outperforms traditional text-only models. Challenges remain, however, in scaling to other tasks and in improving the quality of synthetic training data.
Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.