NVIDIA AI Releases ProRLv2: Advancing Reasoning in Language Models with Prolonged Reinforcement Learning (RL)
What is ProRLv2?
ProRLv2 is the latest version of NVIDIA's Prolonged Reinforcement Learning (ProRL), designed specifically to push the boundaries of reasoning in large language models (LLMs). By extending reinforcement learning (RL) training from 2,000 to 3,000 steps, ProRLv2 systematically tests how prolonged RL unlocks new solution spaces, creativity, and advanced reasoning that were previously inaccessible, even in smaller models such as the 1.5B-parameter Nemotron-Research-Reasoning-Qwen-1.5B-v2.
Key Innovations in ProRLv2
ProRLv2 combines several innovations to overcome common limitations of RL in LLM training:
- REINFORCE++-Baseline: A robust RL algorithm that enables long-horizon optimization across thousands of steps, handling the instability typical of RL with LLMs.
- KL Divergence Regularization and Reference Policy Reset: Periodically refreshes the reference model with the current best checkpoint, preventing the KL term from prematurely dominating the RL objective and enabling steady progress and continued exploration.
- Decoupled Clipping and Dynamic Sampling (DAPO): Encourages discovery of diverse solutions by boosting unlikely tokens and focusing the learning signal on prompts of intermediate difficulty (a minimal sketch of the clipping and KL terms follows this list).
- Scheduled Length Penalty: Applied periodically, it helps maintain diversity and prevents entropy from collapsing as training lengthens.
- Scaled Training Horizon: ProRLv2 extends RL training from 2,000 to 3,000 steps, directly testing how far longer RL training can expand reasoning ability.
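To make these mechanisms concrete, here is a minimal PyTorch sketch of how DAPO-style decoupled clipping and a KL penalty toward a resettable reference policy might combine into one loss. The function name, clip bounds, and KL coefficient are illustrative assumptions, not NVIDIA's released implementation:

import torch

def dapo_kl_loss(logp_new, logp_old, logp_ref, advantages,
                 clip_low=0.2, clip_high=0.28, kl_coef=0.001):
    """Hypothetical sketch: asymmetric (decoupled) clipping plus a KL
    regularizer toward a reference policy that would be periodically
    reset to the best checkpoint. All hyperparameters are assumptions."""
    ratio = torch.exp(logp_new - logp_old)  # per-token importance ratio
    # Decoupled clipping: the looser upper bound gives low-probability
    # tokens more room to grow, encouraging exploration.
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    pg_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    # Simple KL estimate that keeps the policy near the reference.
    kl = (logp_new - logp_ref).mean()
    return pg_loss + kl_coef * kl

The asymmetric bounds (clip_high > clip_low) are the core of decoupled clipping: unlikely tokens get extra room to gain probability mass, which is how DAPO encourages diverse solutions, while the KL term and periodic reference resets keep long training runs stable.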
How ProRLv2 extends LLM reasoning
Nemotron-Research-Reasoning-Qwen-1.5B-v2, trained with ProRLv2 for 3,000 RL steps, sets a new state of the art among open-weight 1.5B models on reasoning tasks spanning math, code, science, and logic puzzles:
- Performance exceeds the previous version and competitors such as DeepSeek-R1-1.5B.
- Continued gains with more RL steps: Longer training yields ongoing improvement, especially on tasks where the base model initially performs poorly, indicating a genuine expansion of reasoning boundaries.
- Generalization: ProRLv2 not only improves pass@1 accuracy but also produces novel reasoning and solution strategies on tasks unseen during training.
- Benchmarks: Average pass@1 gains include 14.7% on math, 13.9% on coding, 54.8% on logic puzzles, 25.1% on STEM reasoning, and 18.1% on instruction-following tasks, with further v2 improvements on unseen and harder benchmarks (the pass@k metric behind these numbers is sketched below).
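For context on the pass@1 numbers above: pass@1 is the fraction of single samples that solve a task, and the standard unbiased pass@k estimator (Chen et al., 2021) generalizes it. A small Python reference, added here for clarity rather than taken from the ProRLv2 release:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn from n total
    attempts of which c are correct, solves the task."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=4, k=1))  # pass@1 reduces to c/n = 0.25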


Why it matters
The main finding of ProRLv2 is that continued RL training, with careful exploration and regularization, reliably expands what LLMs can learn and generalize. Rather than hitting an early plateau or overfitting, prolonged RL allows smaller models to rival much larger ones in reasoning, suggesting that scaling RL itself is as important as scaling model or dataset size.
Using Nemotron-Research-Reasoning-Qwen-1.5B-v2
The latest checkpoint is available for testing on Hugging Face. To load the model:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Download the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Research-Reasoning-Qwen-1.5B")
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Research-Reasoning-Qwen-1.5B")
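A minimal generation call might look like the following; the prompt and decoding settings are illustrative assumptions, not the model's recommended configuration:

# Illustrative only: the prompt and max_new_tokens are assumptions.
prompt = "Solve step by step: what is 17 * 24?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))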
In conclusion
ProRLv2 redefines the limits of reasoning in language models by showing that how RL is scaled matters as much as model size or data. With advanced regularization and smart training schedules, it enables deep, creative, and generalizable reasoning even in compact architectures. The future lies in how far RL can push reasoning, not just how big models can get.
Check out the blog and model on Hugging Face here. Also check out our tutorials, code, and notebooks on our GitHub page. Follow us on Twitter, and don't forget to join our 100k+ ML SubReddit and subscribe to our newsletter.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who has been researching applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.