SynPref-40M and Skywork-Reward-V2: Scalable human alignment for state-of-the-art reward models
Understanding the limitations of current reward models

Although reward models play a crucial role in reinforcement learning from human feedback (RLHF), many of...