Reinforcement learning, not fine-tuning: Nemotron-Research-Tool-N1 trains LLMs to use tools with minimal supervision and maximum generalization

Equipping LLMs with external tools has become a popular strategy, delivering strong performance across diverse domains. Existing research depends on synthesizing large volumes of tool-use trajectories with advanced language models and applying SFT to enhance the tool-calling capabilities of LLMs. The key limitation is that such synthetic datasets fail to capture explicit reasoning steps, resulting in shallow tool-call training. In many cases, reasoning is omitted entirely during training or deferred to inference time through prompting techniques. This leads to pseudo-reasoning: models simply learn to mimic surface-level patterns without genuinely understanding the underlying decision-making process.
Existing research explores multiple avenues for enhancing the tool-use capabilities of LLMs, and prior approaches have centered on two key strategies. The first focuses on dataset curation and model refinement, involving the creation of large-scale supervised datasets and the application of advanced training techniques such as SFT and DPO. LLMs are combined with a variety of external tools, including search engines, calculators, vision tools, and Python interpreters, to extend their functional capabilities. The second strategy targets reasoning improvements, shifting from traditional train-time scaling to more sophisticated test-time scaling strategies. Earlier methods relied on step-level supervision and learned reward models to guide reasoning trajectories.
Researchers from NVIDIA, Pennsylvania State University, and the University of Washington have proposed the Nemotron-Research-Tool-N1 series to address the limitations of existing tool-use methods. It departs from traditional SFT and reasoning-trace distillation techniques by adopting a distinct RL paradigm. Drawing inspiration from the success of DeepSeek-R1, the team developed a lightweight supervision method that evaluates only the structural validity and functional correctness of tool calls. The Nemotron-Research-Tool-N1 models use a binary reward mechanism, allowing the model to develop its own reasoning strategies without explicitly annotated reasoning trajectories.
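To make the idea concrete, the sketch below shows what a binary reward of this kind might look like in Python. The `<think>`/`<tool_call>` tag scheme follows the output format the paper describes, but the function name, signature, and exact matching logic here are illustrative assumptions rather than the authors' implementation.

```python
import json
import re

def binary_tool_reward(completion: str, ground_truth_calls: list[dict]) -> float:
    """Minimal sketch of an R1-style binary reward for tool calling.

    Returns 1.0 only if the completion follows the expected format
    (reasoning inside <think> tags, calls inside <tool_call> tags)
    AND the parsed tool calls match the ground truth; otherwise 0.0.
    """
    # Structural check: both the reasoning block and the tool-call
    # block must be present and well-formed.
    think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    call = re.search(r"<tool_call>(.*?)</tool_call>", completion, re.DOTALL)
    if think is None or call is None:
        return 0.0

    # Functional check: the predicted calls must match the reference
    # (tool names and arguments), independent of ordering.
    try:
        predicted = json.loads(call.group(1))
    except json.JSONDecodeError:
        return 0.0
    if isinstance(predicted, dict):
        predicted = [predicted]

    def canonical(calls):
        # Serialize each call with sorted keys so argument order is ignored.
        return sorted(json.dumps(c, sort_keys=True) for c in calls)

    return 1.0 if canonical(predicted) == canonical(ground_truth_calls) else 0.0
```

A reward shaped this way grants credit only when the model both reasons in the expected format and produces a functionally correct call, which is what lets it learn reasoning strategies without any annotated reasoning traces to imitate.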
The researchers unified and preprocessed data from existing tool-calling datasets, xLAM and a subset of ToolACE, which provide single-turn and multi-turn synthetic tool-calling trajectories. They also created a lightweight prompting template to guide tool-call generation, with explicit instructions to place intermediate reasoning inside <think>...</think> tags and tool invocations inside <tool_call>...</tool_call> tags.
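For illustration, here is one plausible rendering of such a lightweight template; the wording is paraphrased rather than the paper's exact prompt, and the example tool schema and question are invented.

```python
# A hedged reconstruction of a lightweight tool-calling prompt template.
# The <think>/<tool_call> tag scheme follows the paper's described format;
# the instruction wording below is paraphrased, not quoted.
TOOL_PROMPT_TEMPLATE = """You are given a user question and a list of available tools.
Based on the question, you may need to make one or more tool calls.

First, reason about which tools (if any) to use and with what arguments.
Write your reasoning inside <think> and </think> tags.
Then output the tool calls as a JSON list inside <tool_call> and </tool_call> tags.

Available tools:
{tools}

Question:
{question}
"""

# Illustrative usage with a made-up tool schema and question.
prompt = TOOL_PROMPT_TEMPLATE.format(
    tools='[{"name": "get_weather", "parameters": {"city": "string"}}]',
    question="What is the weather in Seattle right now?",
)
```

Keeping the template this minimal avoids over-constraining the model's reasoning, leaving the RL signal, rather than imitation of curated traces, to shape how it thinks before calling a tool.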
Results on the BFCL and API-Bank benchmarks show strong performance from the Nemotron-Research-Tool-N1 models. On BFCL, the Tool-N1-7B/14B models outperform closed-source models such as GPT-4o and specialized open models such as xLAM-2-70B and ToolACE-8B. They also surpass SFT baselines trained on the same data sources, highlighting the effectiveness of the R1-style RL approach. The API-Bank benchmark validates these findings, with Tool-N1-7B/14B improving accuracy by 4.12% and 5.03% over GPT-4o. These results demonstrate the potential of the proposed method to enhance the tool-calling capabilities of large language models through a novel reinforcement learning paradigm.
In summary, the researchers introduced Nemotron-Research-Tool-N1, a significant advance in LLM tool-use capabilities. The study demonstrates a paradigm shift away from traditional SFT methods by introducing a rule-based RL approach. The proposed method enables models to develop sophisticated reasoning strategies without explicitly annotated reasoning trajectories. Benchmark evaluations on BFCL and API-Bank consistently validate the effectiveness of the approach, showing significant improvements over existing baselines. These findings open new avenues for developing more adaptable and intelligent language models that generate reasoning strategies on their own.
Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 90K+ ML SubReddit.

Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he explores the practical applications of AI, with a focus on understanding AI technologies and their real-world impact. He aims to explain complex AI concepts in a clear and accessible way.