CURE: A co-evolving reinforcement learning framework for code and unit test generation in LLMs

Introduction
Large Language Models (LLMs) have greatly improved in reasoning and accuracy through reinforcement learning (RL) and test-time scaling techniques. Although they outperform traditional unit test generation methods, most existing approaches (such as o1-Coder and UTGEN) require supervision from ground-truth code. This supervision increases the cost of data collection and limits the amount of usable training data.
Limitations of existing methods
Traditional unit test generation depends on:
- Software analysis methods, which are rule-based and rigid.
- Neural machine translation techniques, which often lack semantic alignment.
Although recent prompt-based and agentic approaches have improved performance, they still depend heavily on labeled code for fine-tuning. This dependency limits adaptability and scalability, especially in large-scale real-world deployment scenarios.
CURE: A self-supervised co-evolution approach
Researchers from the University of Chicago, Princeton University, Peking University, and ByteDance Seed introduce CURE, a self-supervised reinforcement learning framework that jointly trains a code generator and a unit test generator without any ground-truth code.
CURE operates through a self-play mechanism:
- The LLM generates both correct and incorrect code.
- The unit test generator learns to distinguish failure modes and refines itself accordingly.
This bidirectional co-evolution strengthens both code generation and verification without external supervision (see the sketch below).
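A minimal sketch of this self-play loop, assuming hypothetical helpers (`sample_codes`, `sample_tests`, `passes`, `label_correct`, `policy_update`); it illustrates the structure of one training step, not the authors' implementation.

```python
def cure_step(llm, task):
    # Sample candidate solutions and candidate unit tests from the same LLM.
    codes = sample_codes(llm, task, n=16)   # a mix of correct and incorrect code
    tests = sample_tests(llm, task, n=16)

    # Pass/fail matrix: passed[i][j] is True if code i passes test j.
    passed = [[passes(code, test) for test in tests] for code in codes]

    # Training-time correctness judgments for the sampled codes (hypothetical
    # helper; how this is done is abstracted away here, and the framework
    # requires no ground-truth code).
    is_correct = [label_correct(task, code) for code in codes]

    # Jointly update the coder (rewarded for passing tests) and the tester
    # (rewarded for separating correct from incorrect code).
    policy_update(llm, codes, tests, passed, is_correct)
```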
Architecture and methodology
Base models and sampling strategy
CURE is built on the Qwen2.5-7B and 14B Instruct models, with Qwen3-4B used for the long chain-of-thought (CoT) variant. Each training step samples:
- 16 candidate code completions.
- 16 task-derived unit tests.
Sampling is performed with vLLM at temperature 1.0 and top-p 1.0. For long-CoT models, a response-length-aware transformation penalizes long outputs, improving inference-time efficiency. A sketch of this sampling setup follows.
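A sketch of the sampling setup using vLLM; the model name and prompt contents are illustrative, while the decoding parameters (temperature 1.0, top-p 1.0, 16 samples) follow the description above.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # one of the base models used
params = SamplingParams(temperature=1.0, top_p=1.0, n=16, max_tokens=2048)

code_prompt = "Solve the following task in Python: <task description>"
test_prompt = "Write unit tests for the following task: <task description>"

codes = llm.generate([code_prompt], params)[0].outputs  # 16 candidate codes
tests = llm.generate([test_prompt], params)[0].outputs  # 16 candidate tests
```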
Reward functions and optimization
CURE introduces a mathematically grounded reward formulation:
- It maximizes reward precision, defined as the probability that correct code scores higher than incorrect code under the generated unit tests.
- It applies response-length-based reward adjustments to long-CoT responses to reduce inference latency.
Optimization proceeds via a policy-gradient method that jointly updates the coder and the unit tester to improve each other's performance. An illustrative computation of the precision reward is sketched below.
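An illustrative way to estimate this precision-style reward, where a code's score is the number of generated tests it passes; this follows the stated definition rather than the paper's exact formula.

```python
from itertools import product

def precision_reward(passed, is_correct):
    """Empirical probability that a correct code outscores an incorrect one.

    passed[i][j]: True if code i passes generated test j.
    is_correct[i]: training-time correctness label for code i.
    """
    scores = [sum(row) for row in passed]
    correct = [s for s, ok in zip(scores, is_correct) if ok]
    wrong = [s for s, ok in zip(scores, is_correct) if not ok]
    if not correct or not wrong:
        return 0.0  # degenerate batch: no correct/incorrect pairs to compare
    wins = sum(c > w for c, w in product(correct, wrong))
    return wins / (len(correct) * len(wrong))
```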

Benchmark datasets and evaluation metrics
CURE was evaluated on five standard coding datasets:
- LiveBench
- MBPP
- LiveCodeBench
- CodeContests
- CodeForces
Evaluation metrics:
- Unit test accuracy
- One-shot code generation accuracy
- Best-of-N (BoN) accuracy with 16 code and test samples (a minimal BoN sketch follows this list).
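A minimal sketch of Best-of-N selection under these settings, assuming a hypothetical pass/fail oracle `passes`; it illustrates the metric, not the authors' evaluation harness.

```python
def best_of_n(codes, tests, passes):
    # Among N candidate codes, return the one that passes the most of the
    # N generated unit tests (ties broken arbitrarily by max()).
    def score(code):
        return sum(passes(code, test) for test in tests)
    return max(codes, key=score)

# Example: with 16 sampled codes and 16 sampled tests, as in the evaluation:
# best = best_of_n(codes, tests, passes)
```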

Performance and efficiency gains
The ReasonFlux-Coder models derived with CURE achieve:
- +37.8% in unit test accuracy.
- +5.3% in one-shot code generation accuracy.
- +9.0% in BoN accuracy.
Notably, ReasonFlux-Coder-4B achieves a 64.8% reduction in average unit test response length, which directly improves inference speed. Across all benchmarks, these models outperform traditional coding supervised fine-tuned models (e.g., Qwen2.5-Coder-Instruct).
Application to commercial LLMs
When ReasonFlux-Coder-4B is paired with GPT-series models:
- GPT-4o-mini gains +5.5% BoN accuracy.
- GPT-4.1-mini improves by +1.8%.
- API costs are reduced while performance improves, indicating a cost-effective option for production-level inference pipelines.
Use as a label-free reward model for fine-tuning
The CURE-trained unit test generator can be reused as a reward model in RL training. Unit tests generated by ReasonFlux-Coder-4B yield improvements comparable to supervision from human-labeled tests, enabling a fully label-free reinforcement learning pipeline (sketched below).
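A sketch of how the trained unit test generator could serve as a label-free reward model; `test_generator.generate` and `passes` are hypothetical interfaces.

```python
def label_free_reward(task, candidate_code, test_generator, passes, n_tests=16):
    # Generate unit tests for the task with no human labeling involved.
    tests = test_generator.generate(task, n=n_tests)
    # Reward the candidate by the fraction of generated tests it passes.
    return sum(passes(candidate_code, t) for t in tests) / n_tests
```

This scalar reward can then stand in for human-labeled test supervision inside a standard policy-gradient fine-tuning loop.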
Broader applicability and future directions
Beyond BoN, the ReasonFlux-Coder models integrate seamlessly with agentic coding frameworks such as:
- MPSC (Multi-Perspective Self-Consistency)
- AlphaCodium
- S*
These systems benefit from CURE's ability to iteratively refine both code and tests. CURE also improves agentic unit test generation accuracy by 25.1%, further underscoring its versatility.
Conclusion
CURE represents a significant advance in self-supervised learning for code generation and validation, enabling large language models to jointly evolve their coding and unit test generation capabilities without relying on ground-truth code. By leveraging a co-evolutionary reinforcement learning framework, CURE not only improves core performance metrics such as one-shot accuracy and Best-of-N selection, but also improves inference efficiency through response-length-aware optimization. Its compatibility with existing agentic coding pipelines and its ability to serve as a label-free reward model make it a scalable and cost-effective solution for both training and deployment scenarios.
View the paper and GitHub page. All credit for this research goes to the researchers of the project.

Sajjad Ansari is a final-year undergraduate student at IIT Kharagpur. As a technology enthusiast, he explores practical applications of AI, with a focus on understanding AI technologies and their real-world impact. He aims to explain complex AI concepts in a clear and accessible way.
