Reinforcement learning, not fine-tuning: Nemotron-Research-Tool-N1 trains LLMs to use tools with minimal supervision and maximum generalization

Equipping LLMs with external tools has become a popular strategy, delivering strong performance across diverse domains. Existing research depends on synthesizing large volumes of tool-use trajectories with advanced language models and applying SFT to enhance the tool-calling capabilities of LLMs. The key limitation is that such synthetic datasets fail to capture explicit reasoning steps, resulting in shallow tool-call training. In many cases, reasoning is omitted entirely during training or deferred to inference time through prompting techniques. This leads to pseudo-reasoning: models simply learn to mimic surface-level patterns without genuinely understanding the underlying decision-making process.
Existing research explores multiple avenues for enhancing the tool-use capabilities of LLMs, and prior approaches have centered on two key strategies. The first focuses on dataset curation and model refinement, involving the creation of large-scale supervised datasets and the application of advanced training techniques such as SFT and DPO. LLMs are combined with a variety of external tools, including search engines, calculators, vision tools, and Python interpreters, to extend their functional capabilities. The second strategy targets reasoning improvements, shifting from traditional train-time scaling to more sophisticated test-time scaling strategies. Earlier methods relied on step-level supervision and learned reward models to guide reasoning trajectories.
Researchers from NVIDIA, Pennsylvania State University, and the University of Washington have proposed the Nemotron-Research-Tool-N1 series to address the limitations of existing tool-use methods. It departs from traditional SFT and reasoning-trace distillation techniques by adopting a distinct RL paradigm. Drawing inspiration from the success of DeepSeek-R1, the team developed a lightweight supervision method that evaluates only the structural validity and functional correctness of tool calls. The Nemotron-Research-Tool-N1 models use a binary reward mechanism, allowing the model to develop its own reasoning strategies without explicitly annotated reasoning trajectories.
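To make the idea concrete, the sketch below shows what a binary reward of this kind might look like in Python. The `<think>`/`<tool_call>` tag scheme follows the output format the paper describes, but the function name, signature, and exact matching logic here are illustrative assumptions rather than the authors' implementation.

```python
import json
import re

def binary_tool_reward(completion: str, ground_truth_calls: list[dict]) -> float:
    """Minimal sketch of an R1-style binary reward for tool calling.

    Returns 1.0 only if the completion follows the expected format
    (reasoning inside <think> tags, calls inside <tool_call> tags)
    AND the parsed tool calls match the ground truth; otherwise 0.0.
    """
    # Structural check: both the reasoning block and the tool-call
    # block must be present and well-formed.
    think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    call = re.search(r"<tool_call>(.*?)</tool_call>", completion, re.DOTALL)
    if think is None or call is None:
        return 0.0

    # Functional check: the predicted calls must match the reference
    # (tool names and arguments), independent of ordering.
    try:
        predicted = json.loads(call.group(1))
    except json.JSONDecodeError:
        return 0.0
    if isinstance(predicted, dict):
        predicted = [predicted]

    def canonical(calls):
        # Serialize each call with sorted keys so argument order is ignored.
        return sorted(json.dumps(c, sort_keys=True) for c in calls)

    return 1.0 if canonical(predicted) == canonical(ground_truth_calls) else 0.0
```

A reward shaped this way grants credit only when the model both reasons in the expected format and produces a functionally correct call, which is what lets it learn reasoning strategies without any annotated reasoning traces to imitate.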
The researchers unified and preprocessed data from existing tool-calling datasets, xLAM and a subset of ToolACE, which provide single-turn and multi-turn synthetic tool-calling trajectories. They also created a lightweight prompting template to guide tool-call generation, with explicit instructions to place intermediate reasoning inside <think>...</think> tags and tool invocations inside <tool_call>...</tool_call> tags.
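For illustration, here is one plausible rendering of such a lightweight template; the wording is paraphrased rather than the paper's exact prompt, and the example tool schema and question are invented.

```python
# A hedged reconstruction of a lightweight tool-calling prompt template.
# The <think>/<tool_call> tag scheme follows the paper's described format;
# the instruction wording below is paraphrased, not quoted.
TOOL_PROMPT_TEMPLATE = """You are given a user question and a list of available tools.
Based on the question, you may need to make one or more tool calls.

First, reason about which tools (if any) to use and with what arguments.
Write your reasoning inside <think> and </think> tags.
Then output the tool calls as a JSON list inside <tool_call> and </tool_call> tags.

Available tools:
{tools}

Question:
{question}
"""

# Illustrative usage with a made-up tool schema and question.
prompt = TOOL_PROMPT_TEMPLATE.format(
    tools='[{"name": "get_weather", "parameters": {"city": "string"}}]',
    question="What is the weather in Seattle right now?",
)
```

Keeping the template this minimal avoids over-constraining the model's reasoning, leaving the RL signal, rather than imitation of curated traces, to shape how it thinks before calling a tool.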
Results on the BFCL and API-Bank benchmarks show strong performance from the Nemotron-Research-Tool-N1 models. On BFCL, the Tool-N1-7B/14B models outperform closed-source models such as GPT-4o and specialized open models such as xLAM-2-70B and ToolACE-8B. They also surpass SFT baselines trained on the same data sources, highlighting the effectiveness of the R1-style RL approach. The API-Bank benchmark validates these findings, with Tool-N1-7B/14B improving accuracy by 4.12% and 5.03% over GPT-4o. These results demonstrate the potential of the proposed method to enhance the tool-calling capabilities of large language models through a novel reinforcement learning paradigm.
In summary, the researchers introduced Nemotron-Research-Tool-N1, a significant advance in LLM tool-use capabilities. The study demonstrates a paradigm shift away from traditional SFT methods by introducing a rule-based RL approach. The proposed method enables models to develop sophisticated reasoning strategies without explicitly annotated reasoning trajectories. Benchmark evaluations on BFCL and API-Bank consistently validate the effectiveness of the approach, showing significant improvements over existing baselines. These findings open new avenues for developing more adaptable and intelligent language models that generate reasoning strategies on their own.
Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 90K+ ML SubReddit.

Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he explores the practical applications of AI, with a focus on understanding AI technologies and their real-world impact. He aims to explain complex AI concepts in a clear and accessible way.