
Enigmata’s multi-stage, mixed-training reinforcement learning recipe drives breakthrough performance in LLM puzzle reasoning

Large reasoning models (LRMs) trained with reinforcement learning (RL) excel at complex reasoning tasks, including mathematics, STEM, and coding. However, existing LRMs struggle with many puzzle tasks that require only pure logical reasoning skills and are easy and obvious to humans. Current work on puzzles focuses mostly on designing benchmarks for evaluation and lacks the training methods and resources modern LLMs need to address this challenge. Existing puzzle datasets also lack diversity and scalability, covering a limited range of puzzle types with little control over generation or difficulty. Moreover, given the success of the “LLM+RLVR” paradigm, obtaining large, diverse, and challenging sets of verifiable puzzle prompts for training is crucial.

Reinforcement learning with verifiable rewards (RLVR) has become a key method for improving model reasoning capabilities, eliminating the need for reward models by directly assigning rewards based on objectively verifiable answers. Puzzles are especially well-suited to RLVR, yet most prior RLVR research has overlooked their potential to provide effective reward signals. In LLM puzzle reasoning, existing benchmarks evaluate different types of reasoning, including abstract, deductive, and compositional reasoning. Few benchmarks support scalable generation and difficulty control, and those that do lack puzzle diversity. In addition, methods for improving LLMs’ puzzle-solving ability fall mainly into two categories: tool integration and RLVR.

Researchers from ByteDance Seed, Fudan University, Tsinghua University, Nanjing University, and Shanghai Jiao Tong University have proposed Enigmata, the first comprehensive toolkit for equipping LLMs with puzzle reasoning skills. It contains 36 tasks across seven categories, each with a generator that produces unlimited examples of controllable difficulty and a rule-based verifier for automatic evaluation. The researchers further developed Enigmata-Eval as a rigorous benchmark and created an optimized multi-task RLVR strategy. When applied to larger models such as Seed1.5-Thinking, puzzle data from Enigmata even improves SOTA performance on advanced math and STEM reasoning tasks such as AIME, BeyondAIME, and GPQA, demonstrating Enigmata's generalization benefits.

Enigmata-Data includes 36 puzzle tasks divided into seven main categories, including Crypto, Arithmetic, Logic, Grid, Graph, Search, and Sequential puzzles, making it the only dataset with scalability, automatic verification, and public availability across multiple task categories. Data construction follows a three-phase pipeline: task collection and design, development of automatic generators and verifiers, and sliding difficulty control. Furthermore, Enigmata-Eval was developed by systematically sampling from the broader dataset, aiming to extract 50 instances per difficulty level for each task. The final evaluation set contains 4,758 puzzle instances rather than the theoretical maximum of 5,400, because some tasks inherently yield fewer instances per difficulty level.
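The generator/verifier pattern described above can be sketched as follows. The class names and the toy summation task are illustrative assumptions for exposition, not Enigmata's actual code; the point is that a seeded generator yields unlimited reproducible instances at a chosen difficulty, and a rule-based verifier checks answers automatically:

```python
import random
from dataclasses import dataclass

@dataclass
class PuzzleInstance:
    prompt: str
    answer: str
    difficulty: str  # e.g. "easy" | "medium" | "hard"

class SumPuzzleTask:
    """Toy task: sum a list of integers; difficulty scales the list length."""
    LENGTHS = {"easy": 3, "medium": 6, "hard": 10}

    def generate(self, difficulty: str, seed: int) -> PuzzleInstance:
        rng = random.Random(seed)  # seeded -> reproducible, unlimited instances
        nums = [rng.randint(1, 99) for _ in range(self.LENGTHS[difficulty])]
        return PuzzleInstance(
            prompt=f"Compute the sum of: {nums}",
            answer=str(sum(nums)),
            difficulty=difficulty,
        )

    def verify(self, instance: PuzzleInstance, model_answer: str) -> bool:
        # Rule-based check: exact match on the normalized answer string
        return model_answer.strip() == instance.answer
```

An evaluation set like Enigmata-Eval can then be assembled by sampling a fixed number of instances per task and difficulty level from such generators.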

The proposed model outperformed most public models on Enigmata-Eval at the 32B parameter scale, demonstrating the effectiveness of the dataset and training recipe. The model stands out on the challenging ARC-AGI benchmark, surpassing strong reasoning models such as Gemini 2.5 Pro, o3-mini, and o1. Qwen2.5-32B-Enigmata shows excellent performance in structured reasoning categories, excelling at Crypto, Arithmetic, and Logic tasks and indicating effective development of rule-based reasoning capabilities. The model also shows competitive performance on Search tasks, which require strategic exploration and planning. Overall, Crypto and Arithmetic tasks tend to yield the highest accuracy, while spatial and sequential tasks remain more difficult.

In this article, the researchers introduce Enigmata, a comprehensive toolkit for equipping LLMs with advanced puzzle reasoning that integrates seamlessly with RL through verifiable, rule-based rewards. The Enigmata-trained models demonstrate excellent performance and strong generalization through RLVR training. Experiments show that when applied to larger models such as Seed1.5-Thinking (20B/200B parameters), the synthetic puzzle data brings additional gains in other domains, including math and STEM reasoning, even over already-advanced models. Enigmata provides a solid foundation for the research community to advance reasoning-model development, offering a unified framework that effectively bridges logical puzzle solving and broader reasoning capabilities in LLMs.


Check out the paper, GitHub page, and project page. All credit for this research goes to the researchers of the project.


Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he explores the practical applications of AI, focusing on understanding AI technologies and their real-world impact. He aims to articulate complex AI concepts in a clear and accessible way.
