This AI Paper from NVIDIA Introduces Cosmos-Reason1: A Multimodal Model for Physical Common Sense and Embodied Reasoning

An artificial intelligence system designed for physical settings needs more than perception: it must also reason about objects, actions, and consequences in dynamic, real-world environments. Such systems must understand spatial arrangements, cause-and-effect relationships, and how events unfold over time. In applications such as robotics, autonomous vehicles, or assistive technologies, the AI must understand the physical constraints of its surroundings in order to make sound and safe decisions. This fusion of perception and structured reasoning about physical dynamics forms the backbone of physical AI.

The core problem for such systems is that they struggle to draw conclusions about the physical environment from integrated visual and contextual information. Despite significant progress in vision-language models, they still have difficulty determining whether a task has been completed, what action should be taken next, or whether a proposed action is feasible. This gap between perception and decision making becomes especially critical when an AI must operate independently and interpret tasks in complex visual scenes. Without a mechanism to validate their reasoning, these systems remain unreliable in high-stakes or rapidly changing environments.

Existing models such as LLaVA, GPT-4o, and Gemini 2.0 Flash are proficient with text and visual data but perform poorly on physical reasoning. Tasks such as identifying temporal order, spatial continuity, or object permanence are rarely handled well. Popular benchmarks often fail to evaluate these scenarios, so they offer limited insight into a model's ability to reason about physical events or agent actions. Furthermore, current systems often rely on textual cues rather than grounding their decisions in visual evidence, which leads to inconsistent or incorrect conclusions when they are applied to the physical world.

Researchers from NVIDIA introduced Cosmos-Reason1, a family of vision-language models that specifically targets reasoning about physical environments. The models come in two sizes: 8 billion and 56 billion parameters. They were built using a structured approach that includes defining ontologies for physical common sense, constructing specialized training data, and designing a comprehensive suite of evaluation benchmarks. These benchmarks test capabilities such as action prediction, task verification, and judgment of physical feasibility. The research team curated datasets including BridgeData V2, RoboVQA, RoboFail, AgiBot, HoloAssist, and AV to rigorously evaluate the models.
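To make the benchmark setup concrete, here is a minimal sketch of how multiple-choice items for capabilities like task verification or action prediction could be scored per benchmark. The `BenchmarkItem` fields, the function names, and the unweighted (macro) average are illustrative assumptions, not the evaluation code or averaging scheme used by the paper.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class BenchmarkItem:
    """One multiple-choice question (hypothetical schema, for illustration)."""
    benchmark: str              # e.g. "RoboVQA"
    task: str                   # e.g. "task_verification"
    question: str
    choices: Tuple[str, ...]
    correct_index: int          # index into `choices`


def accuracy_by_benchmark(
    items: Sequence[BenchmarkItem], predictions: Sequence[int]
) -> Dict[str, float]:
    """Return the fraction of correctly answered items for each benchmark."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    outcomes: Dict[str, List[bool]] = {}
    for item, predicted_index in zip(items, predictions):
        outcomes.setdefault(item.benchmark, []).append(
            predicted_index == item.correct_index
        )
    return {name: sum(hits) / len(hits) for name, hits in outcomes.items()}


def macro_average(per_benchmark: Dict[str, float]) -> float:
    """Unweighted mean across benchmarks (an assumed averaging scheme)."""
    return mean(per_benchmark.values())
```

Grouping by benchmark before averaging keeps a large dataset from dominating the headline number, which is one common reason to report per-benchmark scores alongside an average.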

Cosmos-Reason1 uses a hybrid Mamba-MLP-Transformer architecture that integrates vision and language components. Training takes place in multiple stages. First, a vision encoder and a language model are pretrained and fine-tuned using general supervised data. A physical-AI-specific supervised fine-tuning (SFT) phase then introduces datasets focused on space, time, and object interactions. The final reinforcement learning (RL) phase applies rule-based rewards to improve performance in areas such as arrow-of-time detection, spatial puzzles, and object permanence. The RL setup uses a modular framework that relies on distributed computing to scale training efficiently. Model responses are structured using tags, which lets the reward system evaluate both correctness and reasoning structure. Up to nine model-generated responses are sampled for each question, and RL training runs for 500 iterations with batches of 128 questions.
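A rule-based reward of the kind described above can be sketched in a few lines. The tag names (`<think>` and `<answer>`), the multiple-choice answer format, and the `format_weight` value are assumptions made for illustration; the article only states that responses are structured with tags and scored for correctness and structure.

```python
import re

# Assumed layout: reasoning inside <think>, final choice inside <answer>.
RESPONSE_PATTERN = re.compile(
    r"\s*<think>(?P<think>.+?)</think>\s*<answer>(?P<answer>[^<]+)</answer>\s*",
    re.DOTALL,
)


def format_reward(response: str) -> float:
    """1.0 if the whole response follows the assumed tag layout, else 0.0."""
    return 1.0 if RESPONSE_PATTERN.fullmatch(response) else 0.0


def accuracy_reward(response: str, correct_choice: str) -> float:
    """1.0 if the tagged answer equals the reference choice (case-insensitive)."""
    match = RESPONSE_PATTERN.fullmatch(response)
    if match is None:
        return 0.0  # an unparseable response earns no accuracy credit
    predicted = match.group("answer").strip().upper()
    return 1.0 if predicted == correct_choice.strip().upper() else 0.0


def total_reward(response: str, correct_choice: str, format_weight: float = 0.5) -> float:
    """Combine both rules; the 0.5 weighting is a hypothetical choice."""
    return accuracy_reward(response, correct_choice) + format_weight * format_reward(response)


def score_group(responses: list, correct_choice: str) -> list:
    """Score every sampled response for one question (e.g. up to nine)."""
    return [total_reward(r, correct_choice) for r in responses]
```

Because both rules are deterministic checks on the output text, no learned reward model is needed, which is what makes this style of reward practical for verifiable tasks such as multiple-choice questions.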

Evaluation of Cosmos-Reason1 shows substantial performance gains over other models. On the physical common sense benchmark, Cosmos-Reason1-56B reaches an average accuracy of 60.2%, edging out OpenAI o1, which scores 59.9%. The 8B variant also improved, to 52.3%. On embodied reasoning tasks, Cosmos-Reason1-56B averaged 63.7%, up from a 53.5% baseline. Benchmarks such as RoboVQA and HoloAssist showed strong gains, with the 56B model scoring 80.0% and 57.8%, respectively. Cosmos-Reason1-8B improved to 68.7% on intuitive physics tasks, with strong gains in object permanence and spatial puzzle reasoning. However, the models still struggle on datasets such as RoboFail because of a lack of sufficiently diverse training examples.
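The size of these margins is easier to compare when expressed in percentage points. The short sketch below uses only the scores quoted above; the dictionary layout and the `point_gain` helper are illustrative names, not part of any released tooling.

```python
# Average accuracies quoted in the article, in percent.
REPORTED = {
    "physical_common_sense": {"Cosmos-Reason1-56B": 60.2, "OpenAI o1": 59.9},
    "embodied_reasoning": {"Cosmos-Reason1-56B": 63.7, "baseline": 53.5},
}


def point_gain(score: float, reference: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(score - reference, 1)


common_sense = REPORTED["physical_common_sense"]
embodied = REPORTED["embodied_reasoning"]

# Margin over OpenAI o1 on physical common sense.
margin_over_o1 = point_gain(common_sense["Cosmos-Reason1-56B"], common_sense["OpenAI o1"])

# Gain over the quoted baseline on embodied reasoning.
gain_over_baseline = point_gain(embodied["Cosmos-Reason1-56B"], embodied["baseline"])
```

Seen this way, the lead over OpenAI o1 on physical common sense is narrow, while the embodied reasoning improvement over the baseline is much larger.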

In summary, this study introduces a targeted, layered strategy for advancing AI systems that reason about physical interactions. The NVIDIA researchers created a scalable training method and paired it with comprehensive evaluation to address long-standing gaps in embodied reasoning. Cosmos-Reason1 demonstrates how structured fine-tuning and reinforcement learning can build AI systems that are better aligned with real-world physical logic and agent behavior.


Check out the paper and GitHub page. All credit for this research goes to the researchers of this project.

The post "This AI Paper from NVIDIA Introduces Cosmos-Reason1: A Multimodal Model for Physical Common Sense and Embodied Reasoning" first appeared on MarkTechPost.
