Dynamic fine-tuning (DFT): Bridging the generalization gap in supervised fine-tuning (SFT) of LLMs

Supervised fine-tuning (SFT) is the standard technique for adapting large language models (LLMs) to new tasks by training them on expert demonstration datasets. It is valued for its simplicity and its ability to quickly instill expert-like behavior, but it generally generalizes worse than reinforcement learning (RL). RL lets the model explore diverse strategies, which leads to stronger generalization, but it requires substantial compute, careful hyperparameter tuning, and access to reward signals, which are not always practical. Although hybrid approaches combining SFT and RL exist, a question remains: can SFT itself be fundamentally improved? This matters especially when the dataset lacks negative samples or a reward model.

Existing attempts to address the limitations of SFT and RL have produced various hybrid approaches. A common strategy pairs an initial SFT phase with subsequent RL refinement, as in InstructGPT. Alternatives such as interleaving SFT and RL steps, or direct preference optimization (DPO), aim to integrate imitation and reinforcement signals more tightly. Techniques such as negative-aware fine-tuning (NFT) let models improve by explicitly modeling their own incorrect outputs. Theoretical work has tried to unify SFT and RL by treating SFT as reward-weighted or implicit RL, but these efforts have not established an exact mathematical equivalence between SFT and offline policy gradients.

A team of researchers from Southeast University, UC Berkeley, Shanghai Jiao Tong University, South South Technical University, and Wuhan University proposed Dynamic Fine-Tuning (DFT) as a remedy for the limited generalization of SFT in LLMs. Through mathematical analysis, they showed that the standard SFT gradient implicitly encodes a flawed reward structure, which limits the model's ability to generalize effectively. DFT addresses this by dynamically rescaling the objective with the probability of each token, which stabilizes gradient updates. This modification improves generalization across multiple benchmarks and base models. In addition, DFT shows competitive performance in offline RL settings, offering a simpler alternative to traditional RL approaches.
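To make the mechanism concrete, below is a minimal PyTorch sketch of a token-level SFT loss reweighted by each token's detached probability, the kind of one-line change the paper describes. This is an illustrative reconstruction, not the authors' released code; the function name `dft_loss` and its signature are assumptions.

```python
import torch
import torch.nn.functional as F

def dft_loss(logits, target_ids, pad_token_id=-100):
    """Minimal sketch of a DFT-style objective: token-level cross-entropy
    reweighted by each token's (detached) predicted probability.

    logits:     (batch, seq_len, vocab) model outputs
    target_ids: (batch, seq_len) ground-truth token ids; padded positions use pad_token_id
    """
    log_probs = F.log_softmax(logits, dim=-1)

    # Gather the log-probability assigned to each target token, masking padding.
    mask = target_ids.ne(pad_token_id)
    safe_targets = target_ids.clamp(min=0)
    token_log_probs = log_probs.gather(-1, safe_targets.unsqueeze(-1)).squeeze(-1)

    # Standard SFT would average -token_log_probs over non-padded tokens.
    # DFT instead rescales each token's loss by its current probability
    # (with stop-gradient), counteracting the implicit 1/p reward weighting
    # hidden in the ordinary SFT gradient.
    token_probs = token_log_probs.detach().exp()
    loss = -(token_probs * token_log_probs * mask).sum() / mask.sum()
    return loss
```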

DFT is evaluated in a standard SFT setup where only expert demonstration data is available, with no negative samples, reward models, or verification signals. Models were trained on the NuminaMath CoT dataset, which contains 860K mathematical problems and solutions drawn from a variety of sources, including Chinese high-school mathematics exercises and problems from US and international mathematical olympiads. In the offline RL setting, DFT was tested under the rejection sampling fine-tuning (RFT) framework: answers are generated for 10K math questions and only the correct ones are retained, yielding 140,000 training examples. Positive-negative preference pairs were also constructed from the generated responses for DPO training.
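The rejection-sampling data construction described above can be sketched roughly as follows. This is a hypothetical outline rather than the paper's pipeline; `generate_answers` and `is_correct` are placeholder helpers supplied by the caller (a sampling routine and an answer verifier).

```python
def build_rft_dataset(questions, references, generate_answers, is_correct, n_samples=8):
    """Hypothetical sketch of RFT-style data construction: sample candidate answers,
    keep verified-correct ones as SFT/DFT training data, and pair correct/incorrect
    answers as preference data for DPO."""
    sft_examples, dpo_pairs = [], []
    for question, reference in zip(questions, references):
        candidates = generate_answers(question, n=n_samples)   # sample from the base model
        correct = [a for a in candidates if is_correct(a, reference)]
        incorrect = [a for a in candidates if not is_correct(a, reference)]
        sft_examples.extend((question, a) for a in correct)    # retained correct answers
        if correct and incorrect:                               # chosen/rejected pair for DPO
            dpo_pairs.append((question, correct[0], incorrect[0]))
    return sft_examples, dpo_pairs
```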

In the SFT setup, DFT surpasses standard SFT across all evaluated LLMs, showing stronger generalization and robustness on challenging benchmarks where standard SFT yields minimal or even negative gains. It also exhibits better learning efficiency and faster convergence, and outperforms importance-weighted SFT (IW-SFT) in most cases. In the offline RL setting, DFT beats both offline and online RL baselines: its average score of 35.43 exceeds the best offline method, RFT, by 11.46 points and the strongest online RL algorithm, GRPO, by 3.43 points. DFT also scores 64.71 on MATH500, slightly above GRPO, and achieves notable gains on harder tasks such as AMC23 (+7.19 over GRPO) and Minerva Math (+6.23 over GRPO).

In this work, the researchers addressed the generalization gap between SFT and RL. They introduced Dynamic Fine-Tuning (DFT), a simple yet powerful method that dynamically reweights the SFT loss using token probabilities. This one-line modification stabilizes learning and enhances generalization, as demonstrated by performance improvements on mathematical reasoning benchmarks. However, the evaluation of DFT is limited to math-focused datasets and models of up to 7B parameters, and it has not been tested on other domains or larger models. The study is also restricted to text-only scenarios. Future work aims to extend DFT to broader benchmarks, larger models, and vision-language tasks to verify its effectiveness across modalities.




Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. A technology enthusiast, he explores the practical applications of AI, focusing on understanding AI technologies and their real-world impact. He aims to articulate complex AI concepts in a clear and accessible way.
