
Why Apple’s Criticism of AI Reasoning Is Premature

Recently, two prominent but conflicting papers have sparked a debate about the reasoning abilities of Large Reasoning Models (LRMs): Apple’s “The Illusion of Thinking” and Anthropic’s rebuttal, “The Illusion of the Illusion of Thinking.” Apple’s paper claims that LRMs have fundamental limitations in their reasoning capabilities, while Anthropic argues that these claims stem from evaluation shortcomings rather than model failures.

Apple’s study systematically tested LRMs in controlled puzzle environments and observed an “accuracy collapse” beyond a specific complexity threshold. Models such as Claude 3.7 Sonnet and DeepSeek-R1 reportedly failed on puzzles like Tower of Hanoi and River Crossing as complexity increased, even reducing their reasoning effort (token usage) on the hardest instances. Apple identified three complexity regimes: standard LLMs outperform LRMs at low complexity, LRMs excel at medium complexity, and both collapse at high complexity. Crucially, Apple’s evaluation concluded that these limitations reflect the models’ inability to apply exact computation and consistent algorithmic reasoning across the puzzles.
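
Concretely, a “controlled puzzle environment” amounts to a programmatic checker that replays a model’s emitted move list against the puzzle rules. Below is a minimal sketch of such a checker for Tower of Hanoi, written purely as an illustration of the setup (not Apple’s actual harness); it also shows why the required output grows exponentially, since the shortest valid solution uses 2^n − 1 moves.

```python
# Minimal sketch of a Tower of Hanoi move checker, illustrating the kind of
# "controlled puzzle environment" described in Apple's paper (a reconstruction
# for illustration, not Apple's actual evaluation harness).

def check_hanoi(n_disks, moves):
    """Verify that `moves` (a list of (from_peg, to_peg) pairs over pegs 0, 1, 2)
    legally transfers all n_disks from peg 0 to peg 2."""
    pegs = [list(range(n_disks, 0, -1)), [], []]      # disk n at the bottom, disk 1 on top
    for src, dst in moves:
        if not pegs[src]:
            return False                              # moving from an empty peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False                              # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return len(pegs[2]) == n_disks                    # everything must end on peg 2

# The shortest solution needs 2**n - 1 moves, so a fully enumerated answer
# grows exponentially with n (7 moves for 3 disks, 1023 for 10, 32767 for 15).
if __name__ == "__main__":
    three_disk_solution = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
    print(check_hanoi(3, three_disk_solution))        # True
    print([2 ** n - 1 for n in (3, 10, 15)])          # [7, 1023, 32767]
```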

However, Anthropic challenges Apple’s conclusions, identifying critical flaws in the experimental design rather than in the models themselves. They highlight three main issues:

  1. Token limits versus logical failure: Anthropic stresses that the failures observed in Apple’s Tower of Hanoi experiments were driven largely by output token limits rather than by reasoning flaws. The models explicitly noted their token constraints and deliberately truncated their outputs, so the apparent “reasoning collapse” was essentially a practical limitation, not a cognitive failure.
  2. Misclassified reasoning breakdowns: Anthropic shows that Apple’s automated evaluation framework misinterpreted deliberate truncation as reasoning failure. The rigid scoring method made no allowance for the models’ awareness of, and decisions about, output length, unfairly penalizing LRMs.
  3. Unsolvable problems treated as failures: Perhaps most significantly, Anthropic demonstrates that some instances of Apple’s River Crossing benchmark are mathematically unsolvable (for example, six or more actor/agent pairs with a boat that holds only three people). Scoring these unsolvable instances as failures sharply biased the results, making the models appear incapable of solving puzzles that cannot be solved at all (see the search sketch after this list).
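
The solvability point can be checked mechanically. The following is a hedged sketch of a brute-force breadth-first search over a River Crossing formulation of the kind Apple used; the exact rules here are my own reconstruction, not Apple’s published environment. Under these assumptions, the search reports instances with six or more pairs and a three-person boat as having no solution at all.

```python
# Hedged sketch: exhaustive search over a River Crossing puzzle of the kind used
# in Apple's benchmark, to check whether an instance is solvable at all.  The
# rules below are my own reconstruction (not Apple's published environment):
# n actor/agent pairs must cross; the boat carries 1..capacity people; an actor
# may never be in a group containing another pair's agent unless their own
# agent is also present (checked on both banks and in the boat).

from collections import deque
from itertools import combinations

def safe(group):
    actors = {i for kind, i in group if kind == "actor"}
    agents = {i for kind, i in group if kind == "agent"}
    # Each actor needs their own agent present whenever any agent is present.
    return all(i in agents or not agents for i in actors)

def solvable(n, capacity):
    people = frozenset(("actor", i) for i in range(n)) | frozenset(("agent", i) for i in range(n))
    start = (people, "left")                       # everyone on the left bank, boat on the left
    seen, queue = {start}, deque([start])
    while queue:
        left, boat = queue.popleft()
        bank = left if boat == "left" else people - left
        for k in range(1, capacity + 1):
            for movers in combinations(bank, k):
                movers = frozenset(movers)
                if not safe(movers):
                    continue                       # unsafe grouping in the boat
                new_left = left - movers if boat == "left" else left | movers
                if not (safe(new_left) and safe(people - new_left)):
                    continue                       # unsafe grouping on a bank
                if not new_left:
                    return True                    # everyone reached the right bank
                state = (new_left, "right" if boat == "left" else "left")
                if state not in seen:
                    seen.add(state)
                    queue.append(state)
    return False

if __name__ == "__main__":
    for n in (2, 3, 4, 5, 6):
        print(f"{n} pairs, boat capacity 3:", "solvable" if solvable(n, 3) else "UNSOLVABLE")
```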

Anthropic further tested an alternative representation approach, asking the models to provide compact solutions (such as Lua functions), and found high accuracy even on complex puzzles previously marked as failures. This result strongly suggests that the problem lies with the evaluation methods rather than with the models’ reasoning ability.
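
As a rough illustration of why representation matters: a complete Tower of Hanoi solution can be expressed as a few-line generating function instead of an exponentially long move list, so a model judged on the function never hits the output budget that doomed the enumerated answers. The Python sketch below stands in for the Lua functions mentioned above; the point is the format, not the language.

```python
# A compact representation of a complete Tower of Hanoi solution: instead of
# enumerating all 2**n - 1 moves (32,767 lines for n = 15), the answer is a
# short generating function.  Python is used here for consistency with the
# other sketches; the article describes Lua functions in Anthropic's retest.

def hanoi(n, src="A", aux="B", dst="C"):
    """Yield the full optimal move sequence for n disks as (from, to) pairs."""
    if n == 0:
        return
    yield from hanoi(n - 1, src, dst, aux)   # park the n-1 smaller disks on aux
    yield (src, dst)                         # move the largest disk
    yield from hanoi(n - 1, aux, src, dst)   # re-stack the smaller disks on dst

if __name__ == "__main__":
    moves = list(hanoi(15))
    print(len(moves))   # 32767 == 2**15 - 1, generated by roughly six lines of code
```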

Another key point raised by Anthropic concerns the complexity metric Apple used: the number of moves required. They argue this metric conflates mechanical execution with genuine cognitive difficulty. For example, Tower of Hanoi puzzles require exponentially many moves, but each individual decision step is trivial, whereas puzzles such as River Crossing involve far fewer moves yet carry higher cognitive complexity because of their constraint-satisfaction and search requirements.
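
To make the “long but trivial” contrast concrete, here is a hedged sketch (my own illustration, not code from either paper) of the classic iterative Tower of Hanoi procedure: a fixed rule that cycles through the same three peg pairs and always makes the only legal move between them reproduces the entire exponential-length solution with no search or planning at any step.

```python
# The classic iterative Tower of Hanoi procedure: each step applies one fixed,
# local rule, so the solution is exponentially long yet every decision is trivial.
# (My own illustration of the point above, not code from either paper.)

def legal_move(pegs, a, b):
    """Make the only legal move between pegs a and b (the smaller top disk moves)."""
    if pegs[a] and (not pegs[b] or pegs[a][-1] < pegs[b][-1]):
        pegs[b].append(pegs[a].pop())
        return (a, b)
    pegs[a].append(pegs[b].pop())
    return (b, a)

def iterative_hanoi(n):
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    # Cycle through three fixed peg pairs; the ordering depends only on the parity of n.
    pairs = [("A", "C"), ("A", "B"), ("B", "C")] if n % 2 else [("A", "B"), ("A", "C"), ("B", "C")]
    moves = []
    while len(pegs["C"]) < n:
        moves.append(legal_move(pegs, *pairs[len(moves) % 3]))
    return moves

if __name__ == "__main__":
    print(len(iterative_hanoi(10)))   # 1023 moves, each chosen by the same simple rule
```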

Both papers contribute significantly to our understanding of LRMs, but the tension between their findings exposes a key gap in current AI evaluation practice. Apple’s conclusion that LRMs inherently lack robust, generalizable reasoning is substantially weakened by Anthropic’s critique. Anthropic’s findings instead suggest that LRMs are constrained by their testing environments and evaluation frameworks rather than by their intrinsic reasoning capabilities.

In light of these insights, future research and practical evaluation of LRMs should:

  • Clearly distinguish reasoning from practical constraints: Tests should account for the realities of token limits and deliberate model decisions about output length.
  • Verify that problems are solvable: Ensuring that puzzles or test problems can actually be solved is essential for fair assessment.
  • Refine complexity metrics: Metrics must reflect genuine cognitive difficulty, not merely the number of mechanically executed steps.
  • Explore diverse solution formats: Evaluating LRMs across different solution representations can better reveal their underlying reasoning strengths.

Ultimately, it seems premature for Apple to claim that LRMs cannot “really reason.” Anthropic’s rebuttal indicates that LRMs do possess sophisticated reasoning capabilities that can handle substantial cognitive tasks when evaluated appropriately. At the same time, it underscores how much careful, nuanced evaluation methods matter for truly understanding the capabilities and limitations of emerging AI models.


Check out the Apple paper and the Anthropic paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, join our 100K+ ML SubReddit, and subscribe to our newsletter.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform draws over 2 million monthly views, illustrating its popularity among readers.
