Can Wrong Answers Improve Mathematical Reasoning? Qwen2.5-Math Shows Surprising Gains from Reinforcement Learning with Verifiable Rewards (RLVR)

In natural language processing (NLP), reinforcement learning (RL) methods such as reinforcement learning from human feedback (RLHF) have improved model outputs by optimizing responses against feedback signals. A specific variant, reinforcement learning with verifiable rewards (RLVR), extends this approach by using automatic signals, such as mathematical correctness or syntactic features, as feedback, enabling large-scale tuning of language models. RLVR is particularly interesting because it promises to strengthen a model's reasoning abilities without extensive human supervision. This intersection of automatic feedback and reasoning tasks is an active area of research, where developers aim to understand how models learn to structure reasoning about mathematical, logical, or otherwise structured problems with limited supervision.
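For concreteness, here is a minimal sketch of what a verifiable correctness reward might look like in this setting. The function names, the \boxed{} extraction heuristic, and the binary 0/1 reward are illustrative assumptions, not the authors' exact implementation.

```python
import re

def extract_boxed_answer(response: str) -> str | None:
    """Return the contents of the last \\boxed{...} expression in a response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def verifiable_reward(response: str, reference_answer: str) -> float:
    """RLVR-style correctness reward: 1.0 if the boxed answer matches the reference, else 0.0."""
    predicted = extract_boxed_answer(response)
    return 1.0 if predicted is not None and predicted == reference_answer else 0.0

# A correct boxed answer earns reward 1.0; anything else earns 0.0.
print(verifiable_reward(r"The sum is \boxed{42}.", "42"))  # 1.0
print(verifiable_reward("I think it is 42.", "42"))        # 0.0
```

Because the check is fully automatic, rewards like this can be computed at training scale without human annotators, which is the core appeal of RLVR.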
An ongoing challenge in machine learning is building models that can reason effectively under minimal or imperfect supervision. In mathematical problem-solving tasks, the correct answer may not be readily available, and researchers must find other ways to guide the model's learning. Models are typically trained on ground-truth data, but labeling large datasets with perfect accuracy is impractical, especially for reasoning tasks that require understanding complex structures such as proofs or programmatic steps. It therefore remains an open question whether models can learn when exposed to noisy, misleading, or even incorrect signals during training. This matters because models that depend too heavily on perfect feedback may not generalize well when such supervision is unavailable, limiting their usefulness in real-world settings.
Several existing techniques aim to enhance a model's reasoning ability through reinforcement learning (RL), with RLVR a particular focus. Traditionally, RLVR uses ground-truth labels, correct answers verified by humans or automated tools, to provide rewards during training. Some methods relax this requirement by using majority-vote labels or simple format-based heuristics, such as rewarding answers that follow a particular output style. Other methods go further and assign rewards at random, providing a positive signal without regard to the correctness of the answer. These approaches are designed to explore whether models can learn with minimal guidance, but they have focused primarily on specific models, such as Qwen, raising concerns about how well the findings generalize across architectures.
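As a rough illustration of these weaker reward signals, the sketch below shows how format, random, incorrect-label, and majority-vote rewards could be implemented. Every function name, the 0.5 default probability, and the substring matching are simplifying assumptions for exposition rather than the paper's code.

```python
import random
from collections import Counter

def format_reward(response: str) -> float:
    """Format-only reward: 1.0 if the response contains a \\boxed{...} expression, regardless of correctness."""
    return 1.0 if "\\boxed{" in response else 0.0

def random_reward(response: str, p: float = 0.5) -> float:
    """Random reward: 1.0 with probability p, ignoring the response entirely."""
    return 1.0 if random.random() < p else 0.0

def incorrect_label_reward(response: str, wrong_answer: str) -> float:
    """Incorrect-label reward: reward agreement with a deliberately wrong answer."""
    return 1.0 if wrong_answer in response else 0.0

def majority_vote_label(sampled_answers: list[str]) -> str:
    """Majority-vote pseudo-label: the most common answer across sampled rollouts stands in for ground truth."""
    return Counter(sampled_answers).most_common(1)[0][0]
```

None of these signals requires knowing the true answer, which is exactly why it is surprising that training against them could improve accuracy at all.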
Researchers at the University of Washington, the Allen Institute for AI, and the University of California, Berkeley investigated this question by testing a variety of reward signals on Qwen2.5-Math, a family of large language models fine-tuned for mathematical reasoning. They tested ground-truth rewards based on boxed expressions, random rewards, incorrect-answer rewards, majority-vote rewards, and format rewards. Notably, they observed that even entirely spurious signals, such as random rewards and rewards for wrong answers, can lead to substantial performance improvements in Qwen models. For example, training Qwen2.5-Math-7B with ground-truth rewards improved MATH-500 accuracy by 28.8%, while using incorrect labels yielded a 24.6% gain. Random rewards still produced a 21.4% gain, and format rewards a 16.4% gain. Majority-vote rewards delivered a 26.5% accuracy improvement. These improvements were not limited to a single model: Qwen2.5-Math-1.5B also showed strong gains, with accuracy up 17.6% under format rewards and 24.4% under incorrect labels. However, the same reward strategies failed to deliver similar benefits on other model families such as Llama3 and OLMo2, which showed minimal or negative changes when trained with spurious rewards. For example, Llama3.1-8B's performance declined by up to 8.5% under certain spurious signals, highlighting the model-specific nature of the observed improvements.
The research team's approach involved fine-tuning the models with these different reward signals via RLVR training, replacing genuine ground-truth supervision with heuristic or random feedback. They found that Qwen models could still learn to produce high-quality reasoning outputs even in the absence of correct answers. A key insight is that Qwen models tend to exhibit a distinctive behavior dubbed "code reasoning": generating mathematical solutions structured like code, especially in Python-like formats, regardless of whether the reward signal is meaningful. This tendency toward code reasoning became more frequent over training, rising to over 90% of responses in Qwen2.5-Math-7B when trained with spurious rewards. Answers that included code reasoning achieved markedly higher accuracy, typically around 64%, compared with only 29% for answers without this pattern. These patterns appeared consistently, suggesting that spurious rewards may unlock latent abilities learned during pre-training rather than introducing new reasoning skills.
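To make the code-reasoning measurement concrete, the following sketch shows one way to estimate how often code-style solutions appear and how their accuracy compares with plain-text solutions. The marker heuristic and function names are assumptions for illustration, not the authors' evaluation code.

```python
def contains_code_reasoning(response: str) -> bool:
    """Heuristically flag Python-style reasoning in a solution (markers are illustrative)."""
    markers = ("def ", "print(", "import ")
    return any(marker in response for marker in markers)

def code_reasoning_stats(responses: list[str], is_correct: list[bool]) -> dict:
    """Tally how often code-style reasoning appears and compare its accuracy to plain-text reasoning."""
    with_code = [ok for resp, ok in zip(responses, is_correct) if contains_code_reasoning(resp)]
    without_code = [ok for resp, ok in zip(responses, is_correct) if not contains_code_reasoning(resp)]
    return {
        "code_reasoning_rate": len(with_code) / max(len(responses), 1),
        "accuracy_with_code": sum(with_code) / max(len(with_code), 1),
        "accuracy_without_code": sum(without_code) / max(len(without_code), 1),
    }
```

Tracking these two accuracies side by side is what reveals the reported gap (roughly 64% with code reasoning versus 29% without).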
The performance data underscores the surprising robustness of Qwen models. The gains from random rewards (21.4% on MATH-500) and incorrect labels (24.6%) nearly match the 28.8% gain from ground-truth rewards. A similar trend held on tasks such as AMC, where format, incorrect, and random rewards each produced gains of roughly 18%, only slightly below the roughly 25% from ground-truth or majority-vote rewards. Even on AIME 2024, spurious rewards such as format (+13.0%), incorrect (+8.7%), and random (+6.3%) yielded meaningful gains, although the advantage of ground-truth labels (+12.8%) remained evident, especially on AIME 2025 problems created after the model's training cutoff.
Key takeaways from the research include:
- Qwen2.5-Math-7B gained 28.8% accuracy on MATH-500 with ground-truth rewards, versus 24.6% with incorrect rewards, 21.4% with random rewards, 16.4% with format rewards, and 26.5% with majority-vote rewards.
- A code-reasoning pattern emerged in Qwen models, rising from 66.7% to over 90% of responses under RLVR; responses with code reasoning reached roughly 64% accuracy versus 29% without it.
- Non-Qwen models, such as Llama3 and OLMo2, did not show similar improvements; Llama3.1-8B suffered performance drops of up to 8.5% under spurious rewards.
- In many cases, the gains from spurious signals appeared within 50 training steps, indicating that the reasoning ability is elicited rapidly.
- The study cautions that RLVR research should avoid generalizing from results on Qwen models alone, because the effectiveness of spurious rewards is not universal.
In conclusion, these findings suggest that while Qwen models can exploit spurious signals to improve performance, the same is not true for other model families. Non-Qwen models such as Llama3 and OLMo2 showed flat or negative performance changes when trained with spurious signals. The study highlights the importance of validating RLVR methods on diverse model families rather than relying solely on Qwen-centric results, as many recent papers have done.
Check out the paper, the official release, and the GitHub page for details. All credit for this research goes to the researchers on the project.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that provides in-depth coverage of machine learning and deep learning news in a way that is both technically sound and understandable to a broad audience. The platform draws over 2 million views per month, demonstrating its popularity among readers.
