QeRL: NVFP4-quantized reinforcement learning (RL) brings 32B LLM training to a single H100 while improving exploration

What would you build if you could run reinforcement learning (RL) post-training on a 32B LLM on a single H100, in 4-bit NVFP4, with BF16-level accuracy and a 1.2–1.5× step speedup? NVIDIA researchers (with collaborators at MIT, the University of Hong Kong, and Tsinghua University) have open-sourced QeRL (Quantization-enhanced Reinforcement Learning), a training framework that runs RL post-training with 4-bit FP4 (NVFP4) weights while keeping gradient math in higher precision through LoRA. The research team reports rollout-phase speedups of >1.5×, roughly 1.8× end-to-end versus QLoRA in one setting, and, for the first time, RL training of a 32B policy on a single H100-80GB GPU.

What does QeRL change in the reinforcement learning (RL) loop?

Most RLHF/GRPO/DAPO pipelines spend the bulk of their wall-clock time in rollouts (token generation). QeRL switches the policy's weight path to NVFP4 (FP4) with dual-level scaling while keeping the logits and gradients in higher precision through LoRA, so backpropagation stays stable even as the sampling path runs on hardware-efficient FP4×BF16 kernels (Marlin). The result is faster prefill and decoding during rollouts without maintaining a separate full-precision policy.

Mechanically, the research team integrates Marlin-based FP4 kernels into the rollout and prefill paths, while LoRA restricts the set of trainable parameters. This directly targets the stage that dominates RL cost and latency on long reasoning trajectories.
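To make the split compute path concrete, here is a minimal sketch, not the released QeRL code: a frozen quantized base weight serves the matmul while BF16 LoRA adapters carry all gradients. The class and parameter names are illustrative assumptions; in QeRL the base path would invoke a fused FP4×BF16 kernel (e.g., Marlin) on NVFP4-packed weights rather than the BF16 stand-in used here.

```python
# Illustrative sketch only -- not the released QeRL implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenQuantLinearWithLoRA(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 16, alpha: int = 32):
        super().__init__()
        # Stand-in for the NVFP4 base weight: frozen, never receives gradients.
        self.base_weight = nn.Parameter(
            0.02 * torch.randn(out_features, in_features, dtype=torch.bfloat16),
            requires_grad=False,
        )
        # LoRA adapters stay in BF16 and are the only trainable tensors.
        self.lora_A = nn.Parameter(0.02 * torch.randn(rank, in_features, dtype=torch.bfloat16))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank, dtype=torch.bfloat16))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base path: in QeRL this matmul would run on packed FP4 weights via Marlin.
        base_out = F.linear(x, self.base_weight)
        # LoRA path: low-rank, higher-precision correction that RL actually trains.
        lora_out = F.linear(F.linear(x, self.lora_A), self.lora_B) * self.scaling
        return base_out + lora_out

# Usage: only lora_A and lora_B appear in the optimizer's parameter list.
layer = FrozenQuantLinearWithLoRA(4096, 4096)
y = layer(torch.randn(2, 8, 4096, dtype=torch.bfloat16))
```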

Quantization as exploration, made schedulable

The core empirical finding: deterministic FP4 quantization raises policy entropy, flattening the token distribution early in training and improving exploration relative to 16-bit LoRA and NF4-based QLoRA baselines. To control this effect over time, QeRL introduces Adaptive Quantization Noise (AQN): channel-wise Gaussian perturbations mapped into the LayerNorm scale parameters and annealed on an exponential schedule. Because the noise lives inside existing scale parameters, kernel fusion stays intact (no extra weight tensors) as training transitions from exploration to exploitation.
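The sketch below shows one way such a scheme can be wired up; the function names, annealing constants, and the multiplicative form of the perturbation are assumptions rather than the paper's exact hyperparameters. The key idea it illustrates is that the noise is folded into the per-channel normalization scale, so no new weight tensor is introduced.

```python
# AQN-style noise scheduling sketch (assumed names/constants, not QeRL's code).
import torch

def aqn_sigma(step: int, total_steps: int,
              sigma_start: float = 1e-2, sigma_end: float = 1e-4) -> float:
    """Exponentially anneal the noise scale from sigma_start down to sigma_end."""
    frac = step / max(total_steps, 1)
    return sigma_start * (sigma_end / sigma_start) ** frac

def perturbed_norm_scale(clean_scale: torch.Tensor, step: int, total_steps: int) -> torch.Tensor:
    """Return the LayerNorm/RMSNorm scale with annealed channel-wise Gaussian noise.

    clean_scale is kept aside unmodified so the perturbation never accumulates
    across training steps.
    """
    sigma = aqn_sigma(step, total_steps)
    noise = torch.randn_like(clean_scale) * sigma
    return clean_scale * (1.0 + noise)  # multiplicative form assumed here

# Early steps get larger noise (more exploration), later steps almost none.
scale = torch.ones(4096)
noisy_early = perturbed_norm_scale(scale, step=10, total_steps=1000)
noisy_late = perturbed_norm_scale(scale, step=900, total_steps=1000)
```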

In ablations, QeRL shows faster reward growth and higher final scores on mathematical-reasoning tasks under both GRPO and DAPO, consistent with the hypothesis that structured parameter-space noise can be a useful driver of exploration in RL, even though such noise is usually detrimental in supervised fine-tuning.
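As a reminder of why entropy matters here, GRPO scores each sampled completion against its own group: a minimal, illustrative sketch of group-relative advantages follows (not tied to QeRL's training code). More diverse sampling makes it likelier that a group contains both correct and incorrect completions, which is exactly when the normalized advantages carry signal.

```python
# Minimal GRPO-style group-relative advantage sketch (illustrative only).
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [num_prompts, group_size] scalar rewards, one per sampled completion."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: one prompt, four sampled completions with binary correctness rewards.
adv = group_relative_advantages(torch.tensor([[0.0, 1.0, 0.0, 1.0]]))
```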

Reported results

On Qwen2.5 backbone models, the research team shows that NVFP4+LoRA outperforms vanilla (16-bit) LoRA and QLoRA in both rollout throughput and overall training time, with >2× rollout throughput over QLoRA on 14B/32B models and roughly 1.8× end-to-end speedup over QLoRA in a representative setting. They also demonstrate GRPO training of a 32B policy on a single H100-80GB, enabled by the lower memory footprint of weight-only FP4.

Accuracy is competitive with higher-precision baselines. For a 7B model, the research team reports GSM8K = 90.8% and MATH500 = 77.4%, exceeding 16-bit LoRA and QLoRA under their settings and matching full-parameter fine-tuning. On broader mathematical benchmarks such as BigMath, QeRL maintains parity or better while converging faster thanks to improved exploration.

What this is, and what it is not

QeRL is weight-only FP4 quantization plus LoRA updates; it does not claim FP4 precision for logits or gradients. The gains are concentrated in rollout/prefill throughput and memory footprint, and the empirical evidence is that quantization-induced entropy helps RL exploration when AQN conditions it over the course of training. Whether the approach generalizes beyond mathematical-reasoning tasks to other domains or to safety/instruction-tuning use cases depends on reward design and sequence length.

Main points

  • QeRL combines NVFP4 4-bit weight quantization with LoRA to speed up the rollout phase and reduce memory, enabling RL training of a 32B LLM on a single H100-80GB.
  • Quantization acts as exploration: FP4 increases policy entropy, while Adaptive Quantization Noise (AQN) schedules channel-wise noise through the LayerNorm scale parameters.
  • Reported efficiency: >1.5× faster rollouts compared to 16-bit LoRA; ~1.8× faster end-to-end compared to QLoRA; >2× rollout throughput compared to QLoRA on 14B/32B setups.
  • Accuracy is maintained: Qwen2.5-7B reaches 90.8% on GSM8K and 77.4% on MATH500, matching full-parameter fine-tuning under the paper's settings.
  • NVFP4 is a hardware-optimized 4-bit floating-point format with two levels of scaling (FP8 E4M3 block scales plus an FP32 per-tensor scale), enabling efficient Marlin-based kernels; a dequantization sketch follows this list.
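The arithmetic implied by that two-level scaling can be sketched as follows. This is illustrative only: the 16-element block size and the function signature are assumptions, and real kernels such as Marlin operate on packed 4-bit data rather than on decoded floats.

```python
# Two-level NVFP4 dequantization sketch: value = fp4_code * fp8_block_scale * fp32_tensor_scale.
import torch

def dequantize_nvfp4(fp4_values: torch.Tensor,
                     block_scales: torch.Tensor,
                     tensor_scale: float,
                     block_size: int = 16) -> torch.Tensor:
    """fp4_values: [..., N] decoded FP4 (E2M1) values as floats.
    block_scales: [..., N // block_size] per-block scales (stored as FP8 E4M3)."""
    n = fp4_values.shape[-1]
    blocks = fp4_values.reshape(*fp4_values.shape[:-1], n // block_size, block_size)
    scales = block_scales.to(torch.float32).unsqueeze(-1)  # one scale per block
    return (blocks * scales * tensor_scale).reshape(fp4_values.shape)
```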

QeRL accelerates the rollout phase of RL. It quantizes weights to NVFP4 and uses LoRA to keep updates and logits at higher precision. It reports >1.5× rollout speedups and can train a 32B policy on a single H100-80GB GPU. It adds Adaptive Quantization Noise, turning exploration into a controllable signal during training. Results are shown mainly on mathematical-reasoning tasks with GRPO and DAPO, and the gains rely on NVFP4 kernel support such as Marlin.


Check out the Paper and the full code on GitHub.


Asif Razzaq is the CEO of Marktechpost Media Inc. A visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform known for in-depth coverage of machine learning and deep learning news that is both technically sound and accessible to a broad audience. The platform draws more than 2 million monthly views, a testament to its popularity among readers.

