NVIDIA AI Releases ProRLv2: Advancing Reasoning in Language Models with Prolonged Reinforcement Learning (RL)
What is ProRLv2?
ProRLv2 is the latest version of NVIDIA's Prolonged Reinforcement Learning (ProRL), designed specifically to push the boundaries of reasoning in large language models (LLMs). By extending reinforcement learning (RL) training from 2,000 to 3,000 steps, ProRLv2 systematically tests how prolonged RL unlocks new solution spaces, creativity, and advanced reasoning that were previously inaccessible, even in smaller models such as the 1.5B-parameter Nemotron-Research-Reasoning-Qwen-1.5B-v2.
Key Innovations in ProRLv2
ProRLv2 combines several innovations to overcome common limitations of RL in LLM training:
- REINFORCE++-Baseline: A robust RL algorithm that enables long-horizon optimization across thousands of steps, handling the instability typical of RL with LLMs.
- KL Divergence Regularization and Reference Policy Reset: Periodically refreshes the reference model with the current best checkpoint, preventing the KL term from prematurely dominating the RL objective and enabling steady progress and continued exploration.
- Decoupled Clipping and Dynamic Sampling (DAPO): Encourages discovery of diverse solutions by boosting unlikely tokens and focusing the learning signal on prompts of intermediate difficulty (a minimal sketch of the clipping and KL terms follows this list).
- Scheduled Length Penalty: Applied periodically, it helps maintain diversity and prevents entropy from collapsing as training lengthens.
- Scaled Training Horizon: ProRLv2 extends RL training from 2,000 to 3,000 steps, directly testing how far longer RL training can expand reasoning ability.
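To make these mechanisms concrete, here is a minimal PyTorch sketch of how DAPO-style decoupled clipping and a KL penalty toward a resettable reference policy might combine into one loss. The function name, clip bounds, and KL coefficient are illustrative assumptions, not NVIDIA's released implementation:

import torch

def dapo_kl_loss(logp_new, logp_old, logp_ref, advantages,
                 clip_low=0.2, clip_high=0.28, kl_coef=0.001):
    """Hypothetical sketch: asymmetric (decoupled) clipping plus a KL
    regularizer toward a reference policy that would be periodically
    reset to the best checkpoint. All hyperparameters are assumptions."""
    ratio = torch.exp(logp_new - logp_old)  # per-token importance ratio
    # Decoupled clipping: the looser upper bound gives low-probability
    # tokens more room to grow, encouraging exploration.
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    pg_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    # Simple KL estimate that keeps the policy near the reference.
    kl = (logp_new - logp_ref).mean()
    return pg_loss + kl_coef * kl

The asymmetric bounds (clip_high > clip_low) are the core of decoupled clipping: unlikely tokens get extra room to gain probability mass, which is how DAPO encourages diverse solutions, while the KL term and periodic reference resets keep long training runs stable.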
How ProRLv2 extends LLM reasoning
Nemotron-Research-Reasoning-Qwen-1.5B-v2, trained with ProRLv2 for 3,000 RL steps, sets a new state of the art among open-weight 1.5B models on reasoning tasks spanning math, code, science, and logic puzzles:
- Performance exceeds the previous version and competitors such as DeepSeek-R1-1.5B.
- Continued gains with more RL steps: Longer training yields ongoing improvement, especially on tasks where the base model initially performs poorly, indicating a genuine expansion of reasoning boundaries.
- Generalization: ProRLv2 not only improves pass@1 accuracy but also produces novel reasoning and solution strategies on tasks unseen during training.
- Benchmarks: Average pass@1 gains include 14.7% on math, 13.9% on coding, 54.8% on logic puzzles, 25.1% on STEM reasoning, and 18.1% on instruction-following tasks, with further v2 improvements on unseen and harder benchmarks (the pass@k metric behind these numbers is sketched below).
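For context on the pass@1 numbers above: pass@1 is the fraction of single samples that solve a task, and the standard unbiased pass@k estimator (Chen et al., 2021) generalizes it. A small Python reference, added here for clarity rather than taken from the ProRLv2 release:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn from n total
    attempts of which c are correct, solves the task."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=4, k=1))  # pass@1 reduces to c/n = 0.25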


Why it matters
The main finding of ProRLv2 is that continued RL training, with careful exploration and regularization, reliably expands what LLMs can learn and generalize. Rather than hitting an early plateau or overfitting, prolonged RL allows smaller models to rival much larger ones in reasoning, suggesting that scaling RL itself is as important as scaling model or dataset size.
Using Nemotron-Research-Reasoning-Qwen-1.5B-v2
The latest checkpoint is available for testing on Hugging Face. To load the model:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Download the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Research-Reasoning-Qwen-1.5B")
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Research-Reasoning-Qwen-1.5B")
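A minimal generation call might look like the following; the prompt and decoding settings are illustrative assumptions, not the model's recommended configuration:

# Illustrative only: the prompt and max_new_tokens are assumptions.
prompt = "Solve step by step: what is 17 * 24?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))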
In conclusion
ProRLv2 redefines the limits of reasoning in language models by showing that how RL is scaled matters as much as model size or data. With advanced regularization and smart training schedules, it enables deep, creative, and generalizable reasoning even in compact architectures. The future lies in how far RL can push reasoning, not just how big models can get.
Check out the blog and model on Hugging Face here. Also check out our tutorials, code, and notebooks on our GitHub page. Follow us on Twitter, and don't forget to join our 100k+ ML SubReddit and subscribe to our newsletter.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who has been researching applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.