SynPref-40M and Skywork-Reward-V2: Scalable human alignment for state-of-the-art reward models

Understanding the Limitations of Current Reward Models
Although reward models play a crucial role in reinforcement learning from human feedback (RLHF), many of today's best-performing open models still struggle to reflect the full range of nuanced human preferences. Even with sophisticated training techniques, meaningful progress has been limited. One major reason appears to be the shortcomings of current preference datasets, which are often narrow in scope, synthetically generated, or poorly vetted. While rule-based systems work well for clear-cut tasks such as mathematics or coding, they usually fail to capture nuanced human judgment. In addition, common benchmarks such as RewardBench are becoming less reliable indicators of real-world RM performance, showing poor correlation with downstream task success.
Challenges of Preference Data Creation and New Methods
Traditionally, creating high-quality preference data has relied on human annotators, but this approach is time-consuming, expensive, and sometimes inconsistent. To address this, recent techniques such as RLAIF use LLMs to generate annotations automatically, sometimes even outperforming humans. Newer approaches aim to combine the strengths of both by pairing LLM-generated data with human-verified labels. Meanwhile, reward models have evolved from simple scoring schemes such as the Bradley-Terry model to more complex frameworks, including generative and direct-optimization methods. Despite the many solid open models and datasets now available, accurately capturing subtle human preferences across diverse tasks and languages remains a challenge.
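To make the Bradley-Terry formulation concrete, the sketch below shows the pairwise loss that scalar reward models of this kind are typically trained with: the model assigns each response a score, and the loss pushes the score of the preferred response above that of the rejected one. This is a minimal PyTorch illustration with toy values, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood of P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy scores that a scalar reward-model head might produce for a batch of pairs
chosen = torch.tensor([1.2, 0.3, 2.1])
rejected = torch.tensor([0.4, 0.9, 1.5])
print(bradley_terry_loss(chosen, rejected).item())
```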
Introducing SynPref-40M: A Large-Scale Human Preference Dataset
Researchers from 2050 Research, Skywork AI introduced SynPref-40M, a massive dataset of 40 million preference pairs curated through a two-stage human-AI pipeline. Human annotators ensure quality through rigorous verification, while LLMs scale up data curation under human guidance. From this, the team developed Skywork-Reward-V2, a family of eight reward models (0.6B–8B parameters) trained on a high-quality subset of 26 million pairs. These models achieve state-of-the-art results across seven leading benchmarks, excelling in alignment with human preferences, correctness, safety, objectivity, and robustness. The study emphasizes that success comes not just from data volume but from careful, iterative curation that blends human expertise with AI scalability.
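For readers who want to try the released checkpoints, the snippet below sketches how a reward model of this family is typically queried through the Hugging Face transformers sequence-classification interface. The repository id and chat-template usage are assumptions based on common practice for open reward models; check the official Hugging Face page for the exact names.

```python
# Minimal scoring sketch; the model id below is an assumed example, not confirmed here.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Skywork/Skywork-Reward-V2-Llama-3.1-8B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

conversation = [
    {"role": "user", "content": "Explain RLHF in one sentence."},
    {"role": "assistant", "content": "RLHF fine-tunes a model against a reward model trained on human preference data."},
]
input_ids = tokenizer.apply_chat_template(
    conversation, tokenize=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    score = model(input_ids).logits[0].item()  # higher score = more preferred response
print(score)
```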
A Scalable Two-Stage Human-AI Curation Pipeline
Current open reward models often suffer from overfitting to narrow benchmarks such as RewardBench, which limits their real-world usefulness. To address this, the researchers introduced a two-stage human-AI pipeline for curating large-scale preference data. Stage 1 starts with human-verified annotations that guide LLMs in labeling diverse preference attributes, followed by iterative training and error analysis to refine the reward model. Stage 2 scales the process using consistency checks between the current best model and a human-trained "gold" reward model, filtering reliable samples without further human input. This approach strikes a balance between quality and scalability, ultimately yielding tens of millions of high-quality preference pairs.
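The stage-2 consistency check lends itself to a short sketch. The code below is an illustrative approximation of the filtering idea described above, not the authors' pipeline: a candidate pair is retained only when both the human-anchored "gold" reward model and the current best reward model agree with its chosen/rejected label. The function names and scoring interface are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical scoring interface: (prompt, response) -> scalar reward
ScoreFn = Callable[[str, str], float]

def consistency_filter(
    pairs: List[Dict[str, str]],   # each item: {"prompt", "chosen", "rejected"}
    gold_rm: ScoreFn,              # reward model anchored on human-verified data
    best_rm: ScoreFn,              # current best reward model from the previous round
) -> List[Dict[str, str]]:
    kept = []
    for p in pairs:
        gold_agrees = gold_rm(p["prompt"], p["chosen"]) > gold_rm(p["prompt"], p["rejected"])
        best_agrees = best_rm(p["prompt"], p["chosen"]) > best_rm(p["prompt"], p["rejected"])
        if gold_agrees and best_agrees:  # both models endorse the label -> keep as reliable
            kept.append(p)
    return kept
```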
Benchmarking Skywork-Reward-V2: Compact yet Powerful Models
The Skywork-Reward-V2 series delivers strong performance across multiple benchmarks, outperforming much larger models (e.g., 70B parameters) and emerging generative reward models. Trained on Qwen3 (0.6B–8B) and Llama 3.1/3.2 (1B–8B) backbones, these models score highly on RewardBench, PPE, RM-Bench, and JudgeBench, with the best variant (Llama-3.1-8B-40M) surpassing all others at an average score of 88.6. Despite their smaller sizes, the Skywork-Reward-V2 models benefit from high-quality preference data (SynPref-40M) and efficient training setups, which help them generalize better in real-world RLHF scenarios. Notably, even mid-sized models such as Qwen3-1.7B outperform some 70B models, underscoring that training data quality and methodology matter more than raw parameter count.
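Most of the benchmarks cited here largely reduce to pairwise accuracy: the fraction of test pairs on which the reward model scores the chosen response above the rejected one. The sketch below shows that metric in isolation; it mirrors the general scoring scheme rather than any specific benchmark harness.

```python
from typing import List, Tuple

def pairwise_accuracy(scores: List[Tuple[float, float]]) -> float:
    """scores: list of (chosen_score, rejected_score) pairs from a reward model."""
    correct = sum(1 for chosen, rejected in scores if chosen > rejected)
    return correct / len(scores) if scores else 0.0

# Toy example with three scored pairs; two are ranked correctly.
print(pairwise_accuracy([(1.3, 0.2), (0.1, 0.8), (2.0, 1.1)]))  # ~0.667
```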
Conclusion and Future Outlook: Scaling with Precision
In summary, SynPref-40M is a large-scale preference dataset built through a two-stage human-AI collaboration that combines human judgment with LLM-based scalability. Using a curated subset of 26 million preference pairs, the team developed Skywork-Reward-V2, a suite of eight reward models (0.6B–8B parameters) that outperform existing models across seven key benchmarks. These models generalize well in alignment with human values, correctness, safety, and robustness to bias. Extensive studies confirm that data quality and curation methodology are the key drivers of performance. Going forward, the researchers aim to explore new training strategies as reward models become central to LLM development and alignment.
Check out the Paper, Hugging Face models, and GitHub page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, YouTube, and Spotify, join our 100K+ ML SubReddit, and subscribe to our newsletter.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.