Multimodal Foundation Models Fall Short on Physical Reasoning: The PhyX Benchmark Highlights Key Limitations in Visual and Symbolic Integration

State-of-the-art models achieve human-competitive accuracy on AIME, GPQA, MATH-500, and OlympiadBench, solving Olympiad-level problems. Recent multimodal foundation models have also advanced benchmarks for subject knowledge and mathematical reasoning. However, these evaluations miss a key aspect of machine intelligence: physical reasoning, which requires integrating disciplinary knowledge, symbolic manipulation, and real-world constraints. Solving physics problems differs fundamentally from pure mathematical reasoning because the model must decode implicit conditions in the problem, for example interpreting “smooth surface” as a zero coefficient of friction, and must maintain physical consistency across reasoning chains, since physical laws remain constant regardless of the reasoning trajectory.
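As a concrete illustration of what decoding an implicit condition looks like, consider a block released on an inclined plane (an example of ours, not one drawn from the benchmark): the phrase “smooth surface” has to be translated into a zero friction coefficient before the equations are even written down.

```latex
\[
  ma \;=\; mg\sin\theta \;-\; \mu\, m g\cos\theta
  \qquad\xrightarrow{\ \text{``smooth surface''}\ \Rightarrow\ \mu = 0\ }\qquad
  a \;=\; g\sin\theta
\]
```

A model that misses the implicit condition either leaves the friction coefficient as an unknown or invents a value for it, and every downstream step inherits that error.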
MLLMs demonstrate strong visual understanding by integrating visual and textual data across diverse tasks, which has motivated exploration of their reasoning abilities. Yet it remains uncertain whether these models possess genuinely advanced reasoning capabilities, especially in physics settings close to real-world scenarios. Several LLM benchmarks have emerged to evaluate reasoning capabilities, with PHYBench being the most relevant to physical reasoning. Multimodal scientific benchmarks such as PhysReason and EMMA contain multimodal physics problems, but they include only small sets of physics questions, which is insufficient to evaluate MLLMs’ ability to reason about and solve advanced physics problems.
Researchers from the University of Hong Kong, the University of Michigan, the University of Toronto, the University of Waterloo, and The Ohio State University have proposed PhyX, a new benchmark for evaluating the physical reasoning capabilities of foundation models. It contains 3,000 visually grounded physics problems spanning six distinct domains: mechanics, electromagnetism, thermodynamics, waves/acoustics, optics, and modern physics. It targets physics-grounded reasoning through three core design choices: (a) 3,000 newly collected problems with realistic physics scenarios that require integrated visual analysis and causal reasoning, (b) expert-validated data design covering six fundamental physics domains, and (c) a strict, unified three-step evaluation protocol.
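To make the benchmark’s composition easier to picture, here is a minimal sketch of what a single PhyX-style item could look like as a data record. The class name, field names, and the example problem are illustrative assumptions of ours, not the dataset’s actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical record for one visually grounded physics problem.
# Field names are illustrative assumptions, not PhyX's real format.
@dataclass
class PhysicsProblem:
    question: str                  # textual problem statement
    image_path: str                # diagram the model must interpret
    domain: str                    # one of the six physics domains
    question_type: str             # "multiple_choice" or "open_ended"
    choices: Optional[List[str]]   # present only for multiple-choice items
    answer: str                    # ground-truth answer used for scoring

DOMAINS = [
    "mechanics", "electromagnetism", "thermodynamics",
    "waves/acoustics", "optics", "modern physics",
]

example = PhysicsProblem(
    question=("A block rests on a smooth inclined plane of angle 30 degrees. "
              "Find its acceleration down the slope."),
    image_path="incline_diagram.png",
    domain="mechanics",
    question_type="open_ended",
    choices=None,
    answer="4.9 m/s^2",
)
```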
The researchers designed a four-stage data collection process to ensure high-quality data. It begins with an in-depth survey of core physics disciplines to determine coverage across domains and subfields, after which STEM graduate students are recruited as expert annotators. The team complies with copyright restrictions and avoids data contamination by selecting questions without readily available answers. Quality control then involves a three-stage cleaning process: duplicate detection through lexical overlap analysis, manual review by physics PhD students, and filtering out the shortest 10% of questions by text length, yielding 3,000 high-quality questions from the initial set.
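The article only names the cleaning stages, so the following is a minimal sketch of how the automatable parts (lexical-overlap deduplication and the shortest-10% length filter) could be implemented; the similarity threshold and helper names are assumptions of ours, not the authors’ actual pipeline.

```python
import math

def token_set(text: str) -> set:
    """Lowercased word tokens used for lexical-overlap comparison."""
    return set(text.lower().split())

def jaccard_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two questions."""
    ta, tb = token_set(a), token_set(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def clean_questions(questions: list[str], overlap_threshold: float = 0.8) -> list[str]:
    """Drop near-duplicates, then filter out the shortest 10% by text length."""
    # Stage 1: duplicate detection via pairwise lexical overlap (O(n^2), kept simple for clarity).
    kept: list[str] = []
    for q in questions:
        if all(jaccard_overlap(q, k) < overlap_threshold for k in kept):
            kept.append(q)

    # Stage 2 in the paper is manual review by physics PhD students, which cannot be automated here.

    # Stage 3: remove the shortest 10% of the remaining questions by character length.
    kept.sort(key=len)
    cutoff = math.ceil(len(kept) * 0.10)
    return kept[cutoff:]
```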
PhyX poses a significant challenge to current models: even the lowest-scoring human experts reach 75.6% accuracy, outperforming every evaluated model and exposing the gap between human expertise and current model capabilities. The benchmark also shows that multiple-choice formats narrow performance gaps by letting weaker models rely on surface-level cues, whereas open-ended questions demand genuine reasoning and precise answers. Comparing GPT-4o’s performance on PhyX with its previously reported results on MathVista and Math-V (both 63.8%) reveals lower accuracy on physical reasoning tasks, underscoring that physical reasoning requires a deeper integration of abstract concepts and real-world knowledge than purely mathematical settings and therefore poses a greater challenge.
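A rough sketch helps show why the multiple-choice format flatters weaker models while open-ended grading does not. The two scoring functions below are simplified stand-ins of our own, not PhyX’s actual three-step evaluation protocol: a random guesser already clears roughly 25% on four-option multiple choice, but scores nothing on open-ended questions unless it actually computes the answer.

```python
import random
import re

def score_multiple_choice(predicted: str, correct: str) -> bool:
    """A letter match is enough; random guessing already yields ~25% over four options."""
    return predicted.strip().upper() == correct.strip().upper()

def score_open_ended(predicted: str, correct: str, rel_tol: float = 0.01) -> bool:
    """Compare the final numeric value; there are no surface cues to fall back on."""
    nums_pred = re.findall(r"-?\d+\.?\d*", predicted)
    nums_gold = re.findall(r"-?\d+\.?\d*", correct)
    if not nums_pred or not nums_gold:
        return predicted.strip().lower() == correct.strip().lower()
    p, g = float(nums_pred[-1]), float(nums_gold[-1])
    return abs(p - g) <= rel_tol * max(abs(g), 1e-9)

# A "model" that guesses uniformly at random still lands near 25% on multiple choice...
guesses = [score_multiple_choice(random.choice("ABCD"), "C") for _ in range(10_000)]
print(f"random-guess MC accuracy: {sum(guesses) / len(guesses):.2%}")

# ...but gets open-ended questions right only by producing the correct value.
print(score_open_ended("4.9", "4.9"))   # True
print(score_open_ended("9.81", "4.9"))  # False
```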
In summary, the researchers introduced PhyX, the first large-scale benchmark for evaluating physical reasoning in multimodal, visually grounded scenarios. Rigorous evaluation shows that state-of-the-art models exhibit clear limitations in physical reasoning, relying primarily on memorized knowledge, mathematical formulas, and superficial visual patterns rather than a genuine understanding of physical principles. The benchmark focuses exclusively on English prompts and annotations, limiting the assessment of multilingual reasoning abilities. Likewise, while the images depict physically realistic scenarios, they are often schematic or textbook-style diagrams rather than real-world photographs, and so may not fully capture the perceptual complexity of natural environments.
Check out the Paper, Code, and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don’t forget to join our 95k+ ML SubReddit and subscribe to our Newsletter.

Sajjad Ansari is a final year undergraduate student from IIT Kharagpur. As a technology enthusiast, he delves into the practical application of AI, focusing on understanding AI technology and its real-world impact. He aims to express complex AI concepts in a clear and easy way.
