Apple researchers use puzzle-based evaluation to reveal structural failures in large reasoning models

Artificial intelligence has undergone a major transition from standard language models to systems built for reasoning tasks. These newer systems, known as large reasoning models (LRMs), are designed to emulate human-like thinking by producing intermediate reasoning steps before reaching a conclusion. The focus has shifted from generating accurate outputs to understanding the process that leads to those answers. This shift raises questions about how these models handle tasks of increasing complexity, and whether they genuinely reason or simply exploit patterns memorized during training to guess the result.
Redefining Evaluation: Beyond Final-Answer Accuracy
A recurring problem in evaluating machine reasoning is that traditional benchmarks mainly score the final answer without checking the steps that produced it. Final-answer accuracy alone does not reveal the quality of internal reasoning, and many benchmarks are contaminated by data the models may have seen during training, which paints a misleading picture of what the models can actually do. To probe reasoning properly, researchers need environments where problem difficulty can be controlled precisely and intermediate steps can be analyzed. Without such a setup, it is hard to tell whether these models generalize a solution procedure or merely recall familiar patterns.
To evaluate reasoning more reliably, Apple’s research team designed a setup around four puzzle environments: Tower of Hanoi, River Crossing, Checker Jumping, and Blocks World. These puzzles allow complexity to be manipulated precisely by changing elements such as the number of disks, checkers, or agents. Each task demands different reasoning abilities, such as constraint satisfaction and sequential planning. Crucially, these environments are free of the data contamination that plagues standard benchmarks, and they make it possible to examine both the final results and the intermediate reasoning steps. This approach enables a detailed study of how the models behave under varying task demands.
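The paper’s actual simulators are not reproduced here, but the idea is easy to sketch. Below is a minimal, hypothetical Tower of Hanoi environment whose difficulty is set by a single knob (the disk count) and which can check every intermediate move rather than only the final state; names such as HanoiEnv and score_trace are illustrative, not taken from the paper.

```python
from typing import List, Tuple

Move = Tuple[int, int]  # (from_peg, to_peg), pegs indexed 0..2

class HanoiEnv:
    """Toy Tower of Hanoi simulator; difficulty is controlled by num_disks."""

    def __init__(self, num_disks: int):
        self.num_disks = num_disks
        # Peg 0 holds all disks, largest (num_disks) at the bottom, smallest (1) on top.
        self.pegs = [list(range(num_disks, 0, -1)), [], []]

    def apply(self, move: Move) -> bool:
        """Apply one move; return False if it violates the puzzle rules."""
        src, dst = move
        if not self.pegs[src]:
            return False  # nothing to move from the source peg
        disk = self.pegs[src][-1]
        if self.pegs[dst] and self.pegs[dst][-1] < disk:
            return False  # cannot place a larger disk on a smaller one
        self.pegs[dst].append(self.pegs[src].pop())
        return True

    def solved(self) -> bool:
        return len(self.pegs[2]) == self.num_disks


def score_trace(num_disks: int, moves: List[Move]) -> Tuple[int, bool]:
    """Replay a proposed move sequence; return (length of valid prefix, solved?)."""
    env = HanoiEnv(num_disks)
    for i, move in enumerate(moves):
        if not env.apply(move):
            return i, False
    return len(moves), env.solved()
```

Because every move is validated, this kind of environment can score not just whether a model reached the goal but exactly where its proposed solution first breaks the rules.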
The study sets up a comparison between two model families, Claude 3.7 Sonnet and DeepSeek, pairing the “thinking” variants (Claude 3.7 Sonnet with extended thinking and DeepSeek-R1) against their standard LLM counterparts. Under the same token budget, the models were tested to measure both accuracy and reasoning efficiency, revealing how performance varies across low-, medium-, and high-complexity tasks. One of the most revealing observations is the emergence of three performance regimes: on simple tasks, non-thinking models outperform the reasoning variants; at moderate complexity, the reasoning models gain an edge; and at high complexity, both types collapse.
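A hedged sketch of the kind of sweep described above, reusing Move and score_trace from the environment sketch: query each model at every complexity level under one shared token budget and record the solve rate. ModelFn and sweep_complexity are placeholder names, and the actual model-querying code depends on each provider’s API, which is not shown here.

```python
from typing import Callable, Dict, List

# Hypothetical model interface: (prompt, token_budget) -> proposed move list.
ModelFn = Callable[[str, int], List[Move]]

def sweep_complexity(
    models: Dict[str, ModelFn],
    disk_counts: List[int],
    token_budget: int,
    trials: int = 25,
) -> Dict[str, Dict[int, float]]:
    """Accuracy of each model at each complexity level under a shared token budget."""
    results: Dict[str, Dict[int, float]] = {name: {} for name in models}
    for n in disk_counts:
        prompt = f"Solve the Tower of Hanoi with {n} disks. List every move."
        for name, ask in models.items():
            # Count how many of the sampled attempts fully solve the puzzle.
            solved = sum(score_trace(n, ask(prompt, token_budget))[1] for _ in range(trials))
            results[name][n] = solved / trials
    return results
```

Plotting the resulting accuracy curves against the complexity knob is what surfaces the three regimes described above.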
Comparative Insights: Thinking and Non-Thinking Models Under Pressure
In-depth analysis shows that reasoning effort increases with task difficulty up to a point, but then declines even though ample token budget remains available. In the Tower of Hanoi, for example, Claude 3.7 Sonnet (thinking) maintains high accuracy until complexity crosses a threshold, after which performance collapses to zero. Even when an explicit solution algorithm is provided in the prompt, the models cannot execute its steps beyond a certain level of complexity. In one case, Claude 3.7 Sonnet produces roughly 100 correct moves on the Tower of Hanoi, yet fails to complete a River Crossing instance with N = 3 that requires far fewer moves. This inconsistency exposes serious limitations in symbolic manipulation and exact computation.
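For reference, the “explicit solution algorithm” for the Tower of Hanoi is the classic recursion sketched below; the paper supplied it in the prompt, and this Python version is only illustrative. Since an n-disk instance needs 2^n − 1 moves, about 100 correct moves corresponds to roughly 7 disks, whereas the N = 3 River Crossing instance is solvable in only around 11 moves.

```python
from typing import Iterator, Tuple

def hanoi_moves(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> Iterator[Tuple[int, int]]:
    """Yield the (from_peg, to_peg) moves that solve an n-disk Tower of Hanoi."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, dst, aux)  # park the top n-1 disks on the spare peg
    yield (src, dst)                              # move the largest disk to the target peg
    yield from hanoi_moves(n - 1, aux, src, dst)  # stack the n-1 disks back on top

# An n-disk instance takes 2**n - 1 moves; 7 disks already needs 127,
# which is in the ~100-move range cited above.
print(sum(1 for _ in hanoi_moves(7)))  # 127
```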
Decomposing performance also reveals how LRMs manage their internal thinking process. On simpler problems the models often engage in “overthinking”: they generate a correct solution early in the trace but keep exploring incorrect paths, wasting tokens. At moderate complexity, the models tend to find the correct answer only later in their reasoning chain. Under high complexity, they fail to produce accurate solutions at all. Quantitative analysis confirms that as problem complexity increases, solution accuracy drops to near zero while the number of reasoning tokens the models allocate, counterintuitively, begins to fall.
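A minimal sketch of this kind of trace analysis, again reusing Move and score_trace from the environment sketch above: given the candidate solutions a model emits inside its reasoning trace (in order of appearance), report how far into the trace the first correct one shows up. Extracting candidates from raw text is puzzle-specific and omitted; the function name and interface are illustrative only.

```python
from typing import List, Optional

def first_correct_position(num_disks: int, candidates: List[List[Move]]) -> Optional[float]:
    """Relative position (0 = start of trace, 1 = end) of the first correct candidate,
    or None if no candidate in the trace solves the puzzle."""
    for i, moves in enumerate(candidates):
        _, ok = score_trace(num_disks, moves)  # validity check from the earlier sketch
        if ok:
            return i / max(len(candidates) - 1, 1)
    return None
```

Under this kind of metric, “overthinking” shows up as correct candidates appearing near position 0 on easy instances, drifting toward 1 at moderate complexity, and disappearing entirely at high complexity.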
Scaling Limits and the Collapse of Reasoning
This study offers a sobering assessment of how current LRMs operate. Apple’s research makes clear that, despite genuine progress, today’s reasoning models are still far from achieving generalizable reasoning. The work pinpoints how performance scales, where it collapses, and why over-reliance on benchmark accuracy fails to capture deeper reasoning behavior. It also demonstrates that controlled puzzle environments are a powerful tool for uncovering hidden weaknesses in these systems, and it underscores the need for more robust designs in the future.
Check out the paper. All credit for this research goes to the researchers of this project.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.
