VeBrain: A unified multimodal AI framework for visual reasoning and real-world robot control

Bridging perception and action in robotics
Multimodal large language models (MLLMs) give machines such as robotic arms and legged robots the potential to perceive their surroundings, interpret scenarios, and take meaningful actions. Integrating this intelligence into physical systems is advancing the field of robotics, pushing it toward autonomous machines that not only see and describe their environment but also plan and move within it based on contextual understanding.
Despite the growing power of MLLMs, a persistent problem is their inability to combine vision, reasoning, and physical interaction into a single cohesive system. Models trained to understand images or text typically fall short when asked to control a robot in the real world. The core issue is that understanding a scene is fundamentally different from acting within it. Multimodal understanding focuses on perception and analysis, while physical control requires precise, grounded decision-making based on that perception. This disconnect creates bottlenecks when trying to build agents that must simultaneously observe, reason, and act in varied environments.
Limitations of previous VLA models
Previous tools designed for robotic control rely heavily on vision-language-action (VLA) models. These models are trained on extensive robotic datasets to convert visual observations into control signals. Although some solutions attempt to preserve the reasoning ability of MLLMs by converting commands into text-based actions, they struggle to maintain accuracy and adaptability during control tasks. For example, VLAs often degrade in performance when applied to multi-step or long-horizon robot operations. Furthermore, due to the gap between image-based understanding and motion control, these tools often fail to generalize across different environments or robot types.
Introducing VeBrain: A unified multimodal framework
Researchers from Shanghai AI Laboratory, Tsinghua University, and SenseTime Research, in collaboration with several other institutions, have introduced a unified framework called Visual Embodied Brain (VeBrain). VeBrain reformulates robot control as text-based tasks within 2D visual space, aligning it more closely with how MLLMs already operate. The framework integrates multimodal understanding, spatial reasoning, and robotic control into one structure. A specially designed robotic adapter converts the MLLM's output into executable movement policies, enabling a single model to manage perception, reasoning, and control. VeBrain is also supported by a high-quality instruction dataset called VeBrain-600K, which combines over 600,000 multimodal task samples, including robot motion and reasoning steps.
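To make the idea of text-based control more concrete, the following minimal Python sketch shows how an adapter could parse an MLLM answer that expresses an action as a 2D keypoint plus a named skill. The JSON response format, the field names ("skill", "point"), and the skill vocabulary are illustrative assumptions for this example, not VeBrain's actual interface.

```python
# Hypothetical sketch: treating robot control as a text task over 2D image space.
# The response format below is an assumption made for illustration only.
import json
import re
from dataclasses import dataclass


@dataclass
class TextAction:
    """An action expressed as text: a 2D keypoint in the image plus a named skill."""
    keypoint: tuple  # (x, y) pixel coordinates in the camera frame
    skill: str       # e.g. "move_to", "grasp", "turn"


def parse_mllm_response(response: str) -> TextAction:
    """Parse a hypothetical JSON-style answer emitted by the MLLM.

    Example response: 'Target located. {"skill": "grasp", "point": [412, 233]}'
    """
    payload = json.loads(re.search(r"\{.*\}", response, re.S).group(0))
    x, y = payload["point"]
    return TextAction(keypoint=(int(x), int(y)), skill=payload["skill"])


if __name__ == "__main__":
    # A made-up model reply for the instruction "pick up the red cup".
    reply = 'Target located. {"skill": "grasp", "point": [412, 233]}'
    print(parse_mllm_response(reply))  # TextAction(keypoint=(412, 233), skill='grasp')
```

Because the model's output stays in ordinary text and 2D image coordinates, no new action vocabulary has to be bolted onto the MLLM; the downstream adapter is responsible for turning that text into motion.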
Technical components: Architecture and the robotic adapter
To perform its functions, VeBrain builds on a Qwen2.5-VL architecture, enhanced with components that enable real-world control. The robotic adapter contains four key modules. The point tracker updates 2D keypoints as the robot's view changes, ensuring accurate targeting. The movement controller converts 2D keypoints into 3D movements by combining image data with depth maps. The skill executor maps predicted actions, such as "turn" or "grasp", to pre-trained robot skills. Finally, the dynamic takeover module monitors for failures or anomalies and hands control back to the MLLM when needed. These modules form a closed loop that decides, acts, and self-corrects, allowing the robot to operate effectively in diverse situations.
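The sketch below illustrates how these four modules could fit together in one closed-loop step. The class names, method signatures, and the depth-based back-projection are assumptions made for exposition; they do not reproduce VeBrain's implementation.

```python
# Illustrative sketch of a four-module robotic adapter (assumed structure).
import numpy as np


class PointTracker:
    """Keep 2D keypoints aligned with the current camera view as the robot moves."""
    def update(self, keypoints_2d, prev_frame, cur_frame):
        # Placeholder: a real tracker would use optical flow or feature matching.
        return keypoints_2d


class MovementController:
    """Lift tracked 2D keypoints into 3D targets using a depth map and camera intrinsics."""
    def __init__(self, intrinsics: np.ndarray):
        self.K_inv = np.linalg.inv(intrinsics)

    def to_3d(self, keypoint_2d, depth_map: np.ndarray) -> np.ndarray:
        u, v = int(keypoint_2d[0]), int(keypoint_2d[1])
        z = depth_map[v, u]
        # Back-project the pixel into the camera frame at the measured depth.
        return z * (self.K_inv @ np.array([u, v, 1.0]))


class SkillExecutor:
    """Map a skill name predicted by the MLLM to a pre-trained low-level policy."""
    def __init__(self, policies: dict):
        self.policies = policies

    def execute(self, skill: str, target_3d: np.ndarray) -> bool:
        return self.policies[skill](target_3d)  # returns True on success


class DynamicTakeover:
    """Decide when to hand control back to the MLLM for re-planning."""
    def should_replan(self, succeeded: bool, tracking_lost: bool) -> bool:
        return (not succeeded) or tracking_lost


def control_step(tracker, controller, executor, takeover,
                 keypoints_2d, prev_frame, cur_frame, depth_map, skill):
    """One closed-loop step: track -> lift to 3D -> execute -> check for takeover."""
    tracked = tracker.update(keypoints_2d, prev_frame, cur_frame)
    target_3d = controller.to_3d(tracked[0], depth_map)
    succeeded = executor.execute(skill, target_3d)
    return takeover.should_replan(succeeded, tracking_lost=False)
```

In this reading, the MLLM only plans in text and 2D space, while the adapter handles tracking, 3D grounding, and execution, and escalates back to the MLLM whenever a step fails.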
Performance evaluation across multimodal and robot benchmarks
VeBrain was evaluated on 13 multimodal benchmarks and 5 spatial benchmarks. On MMVet, it achieved a 5.6% improvement over Qwen2.5-VL. It scored 101.5 on the CIDEr metric for ScanQA and 83.7 on MMBench. On the VSI benchmark, it averaged 39.9, outperforming Qwen2.5-VL's 35.9. In robotic evaluations, VeBrain achieved an 86.4% success rate across seven legged-robot tasks, significantly surpassing VLA and π0, which scored 32.1% and 31.4%, respectively. On robotic arm tasks, it reached a 74.3% success rate, outperforming the alternatives by up to 80%. These results show that VeBrain can handle long-horizon and spatially complex control challenges with high reliability.
In conclusion
This research offers a compelling direction for embodied AI. The researchers successfully reformulated robot control as a language task, allowing high-level reasoning and low-level action to coexist. The method bridges the gap between image understanding and robot execution in a functional and scalable way. With its thoughtful design and strong performance, VeBrain signals a shift toward more unified, intelligent robotics systems capable of operating autonomously across diverse tasks and environments.
Check out the paper and GitHub page. All credit for this research goes to the researchers on the project. Also, feel free to follow us on Twitter, join our 99K+ ML SubReddit, and subscribe to our newsletter.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.
