Researchers from MetaStone-AI and USTC have introduced MetaStone-S1, a reflective generative model that matches the performance of OpenAI o3-mini through a new reflective generative form.
Key innovations
The reflective generative form
- Unified policy and reward modeling: MetaStone-S1 integrates the policy model (which generates reasoning trajectories) and a step-level process reward model (PRM) into a single architecture with shared parameters. The verifier requires only a lightweight head (just 53 million extra parameters in the 32B model, compared with a traditional standalone PRM), greatly reducing computational cost.
- Self-supervised process reward model (SPRM): SPRM eliminates the need for expensive step-level annotated data. It uses a self-supervised loss that judges the quality of intermediate reasoning steps using only the correctness of the final answer, supported by a dynamic weighting mechanism that filters out noisy labels.
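Conceptually, the SPRM objective can be sketched as a per-step binary cross-entropy against the trajectory-level outcome, with mismatching steps filtered out. The function below is a minimal illustrative sketch, not the paper's exact loss; the 0.5 threshold and the agreement-based weighting rule are assumptions.

```python
import math

def sprm_loss(step_scores, final_correct, threshold=0.5):
    """Illustrative sketch of a self-supervised process-reward loss.

    Every intermediate step inherits the trajectory-level label
    (final answer correct -> 1, wrong -> 0). A dynamic weight keeps
    only the steps whose current prediction agrees with that label,
    filtering out noisy step-level supervision.
    """
    y = 1.0 if final_correct else 0.0
    total, kept = 0.0, 0
    for s in step_scores:
        s = min(max(s, 1e-7), 1 - 1e-7)  # clamp for numerical stability
        # Dynamic weighting (assumed rule): keep the step only if its
        # prediction agrees with the trajectory-level outcome label.
        if (s > threshold) == bool(final_correct):
            total += -(y * math.log(s) + (1 - y) * math.log(1 - s))
            kept += 1
    return total / max(kept, 1)
```

With `final_correct=True` and scores `[0.9, 0.1]`, only the first step agrees with the outcome label, so the loss reduces to the cross-entropy of that single step.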
Redefining test-time scaling (TTS)
Traditional LLMs are typically improved by scaling parameters during training. MetaStone-S1 takes a different approach, increasing computation at inference time rather than simply increasing model size:
- Internal TTS: extends chains of thought to solve problems through deeper sequential reasoning, but can incur substantial computational cost.
- External TTS: generates multiple reasoning paths in parallel and uses a PRM to select the best one. This usually requires an additional model and separate step-level labels.
- MetaStone-S1's method: combines the two paradigms in a single architecture, providing efficient and accurate trajectory selection with minimal additional resource requirements.
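The external-TTS selection step above can be sketched as a best-of-k loop: sample k reasoning paths from the policy head and keep the one the shared-parameter verifier scores highest. The `generate` and `score_steps` callables and the geometric-mean aggregation below are illustrative assumptions, not the paper's exact procedure.

```python
import math

def score_trajectory(step_scores):
    """Aggregate per-step verifier scores into one trajectory score.

    The geometric mean is one common aggregation choice; the actual
    rule used in MetaStone-S1 may differ.
    """
    logs = [math.log(max(s, 1e-9)) for s in step_scores]
    return math.exp(sum(logs) / len(logs))

def best_of_n(generate, score_steps, prompt, k=8):
    """Best-of-k selection sketch for external TTS.

    `generate` stands in for the policy head (returns one reasoning
    trajectory) and `score_steps` for the SPRM head (returns per-step
    scores for a trajectory); both are hypothetical placeholders.
    """
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda traj: score_trajectory(score_steps(traj)))
```

Because the policy and verifier share parameters, the scoring pass adds little overhead on top of generation, which is what makes this selection cheap relative to a standalone PRM.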
Performance and benchmarking

MetaStone-S1 comes in three sizes (1.5B, 7B, and 32B parameters). The largest, MetaStone-S1-32B, matches or outperforms leading proprietary and open-source models on key reasoning and mathematical benchmarks, including OpenAI o3-mini.


Each size shows strong scaling properties and efficient parameter usage. For example, MetaStone-S1-1.5B outperforms comparably sized models on mathematical tasks, while the 7B and 32B sizes make effective use of their capacity and TTS strategies.
Efficiency and the “aha moment”
- Minimal overhead: compared with a traditional standalone PRM, SPRM integration adds only a small number of parameters (e.g., a 26M head versus a 72B external PRM) while producing state-of-the-art results on the tasks.
- Aha moment: training analysis reveals a clear point at which the model begins to score correct reasoning paths accurately, improving discrimination and final performance.
- Scaling behavior: MetaStone-S1's performance grows logarithmically with the computational budget (model size × inference tokens), with best-of-32 sampling offering a favorable deployment trade-off.
Flexible reasoning modes
To balance performance and resource usage, MetaStone-S1 provides three TTS inference modes:
- Low (k = 2): fastest reasoning, for quick responses.
- Medium (k = 8): better accuracy with moderate computation.
- High (k = 32): maximum accuracy for challenging tasks.
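Assuming inference cost scales roughly linearly with the number of sampled paths (an assumption for illustration, not a figure from the paper), the three modes trade off as in this small sketch; the dictionary and helper names are hypothetical.

```python
# The published TTS modes and their best-of-k sampling budgets.
TTS_MODES = {"low": 2, "medium": 8, "high": 32}

def relative_cost(mode, baseline="low"):
    """Rough relative inference cost of a TTS mode versus a baseline,
    assuming cost is proportional to the number of sampled paths."""
    if mode not in TTS_MODES or baseline not in TTS_MODES:
        raise ValueError(f"mode must be one of {sorted(TTS_MODES)}")
    return TTS_MODES[mode] / TTS_MODES[baseline]
```

Under this linear-cost assumption, high mode costs about 16x low mode, so the logarithmic performance scaling noted above is what justifies spending that extra budget only on hard tasks.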
Conclusion
MetaStone-S1 uses its novel reflective generative architecture to unify problem solving and solution verification in a single efficient framework. By achieving OpenAI o3-mini performance with fewer resources, it shows that innovation in LLM architecture can match brute-force scaling, opening new avenues for advances in, and access to, AI reasoning.
Check out the paper, the model on Hugging Face, and the GitHub page. All credit for this research goes to the researchers on this project.

Asif Razzaq is the CEO of Marktechpost Media Inc. A visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent venture is Marktechpost, an artificial intelligence media platform known for in-depth coverage of machine learning and deep learning news that is both technically sound and accessible to a broad audience. The platform receives over 2 million monthly views, illustrating its popularity among readers.