ServiceNow AI Releases Apriel-Nemotron-15b-Thinker: A Compact Yet Powerful Reasoning Model for Enterprise-Scale Deployment and Efficiency

Today’s AI models are expected to handle complex tasks such as solving mathematical problems, interpreting logical statements, and supporting enterprise decision-making. Building such models requires integrating mathematical reasoning, scientific understanding, and advanced pattern recognition. As demand grows for intelligent agents in real-time applications, such as coding assistants and business automation tools, there is a pressing need to combine strong reasoning performance with efficient memory and token usage so that models remain deployable on practical hardware.
A core challenge in AI development is the resource intensity of large-scale reasoning models. Despite their strong capabilities, these models often demand substantial memory and compute, which limits their real-world applicability. This creates a gap between what an advanced model can achieve and what users can actually deploy. Even well-resourced enterprises may struggle to run models that require tens of gigabytes of memory or incur high inference costs. The problem is not just building smarter models, but ensuring they are efficient and deployable on real-world platforms. High-performance models such as QwQ-32B, o1-mini, and EXAONE-Deep-32B excel at mathematical reasoning and academic benchmarks. However, their reliance on high-end GPUs and heavy token consumption limits their use in production environments. These models highlight an ongoing trade-off in AI deployment: high accuracy at the expense of scalability and efficiency.
To address this gap, ServiceNow researchers introduced Apriel-Nemotron-15b-Thinker. The model has 15 billion parameters, a relatively modest size compared with high-performing peers such as QwQ-32B and EXAONE-Deep-32B, which are roughly twice as large, yet it delivers competitive results. Its main advantages are its memory footprint and token efficiency: while producing comparable outputs, it requires roughly half the memory of QwQ-32B and EXAONE-Deep-32B. This directly improves operational efficiency in enterprise environments and makes it possible to integrate high-performance reasoning models into real-world applications without large-scale infrastructure upgrades.
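As a rough back-of-the-envelope check (not a figure from the release), the weight-memory difference alone is easy to estimate: at bfloat16 precision each parameter occupies 2 bytes, so a 15B-parameter model needs about 30 GB for weights versus roughly 64 GB for a 32B-parameter model, before counting KV cache and activations. A minimal sketch:

```python
# Rough weight-memory estimate at bf16 (2 bytes per parameter).
# Illustrative arithmetic only; real serving memory also includes
# KV cache, activations, and runtime overhead.

def weight_memory_gb(num_params_billions: float, bytes_per_param: int = 2) -> float:
    return num_params_billions * 1e9 * bytes_per_param / 1e9  # gigabytes

for name, size_b in [("Apriel-Nemotron-15b-Thinker", 15), ("QwQ-32B (comparison)", 32)]:
    print(f"{name}: ~{weight_memory_gb(size_b):.0f} GB of weights at bf16")
```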
The development of Apriel-Nemotron-15b-Thinker follows a structured three-stage training method designed to strengthen specific aspects of the model’s reasoning capabilities. In the initial stage, Continual Pre-Training (CPT), the model is exposed to over 100 billion tokens. These are not general-purpose texts but carefully selected examples from areas that demand deep reasoning: mathematical logic, programming challenges, scientific literature, and logical inference tasks. This exposure provides the foundational reasoning ability that distinguishes the model from others. The second stage applies supervised fine-tuning (SFT) on 200,000 high-quality demonstrations. These examples further calibrate the model’s responses to reasoning challenges, improving performance on tasks that require accuracy and attention to detail. The final stage uses GRPO (Group Relative Policy Optimization) to refine the model’s outputs by aligning them with expected results across key tasks. This pipeline is intended to make the model intelligent, precise, structured, and scalable.
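The release describes these stages at a high level rather than as code, but the general shape of an SFT-then-GRPO pipeline can be sketched with Hugging Face TRL, whose SFTTrainer and GRPOTrainer cover the second and third stages. The checkpoint, dataset names, and reward function below are illustrative placeholders, not ServiceNow’s actual setup:

```python
# Hypothetical sketch of an SFT -> GRPO pipeline using Hugging Face TRL.
# Checkpoint, dataset names, and the reward function are placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, GRPOConfig, GRPOTrainer

BASE_MODEL = "your-org/your-15b-base-checkpoint"  # placeholder, not the real CPT checkpoint

# Stage 2: supervised fine-tuning on curated reasoning demonstrations.
sft_data = load_dataset("your-org/reasoning-demonstrations", split="train")  # placeholder
sft_trainer = SFTTrainer(
    model=BASE_MODEL,
    train_dataset=sft_data,
    args=SFTConfig(output_dir="sft-checkpoint"),
)
sft_trainer.train()

# Stage 3: GRPO refinement with a task-specific reward signal.
def reward_final_answer(completions, **kwargs):
    # Placeholder reward: score 1.0 when a completion contains a final-answer tag.
    return [1.0 if "<answer>" in c else 0.0 for c in completions]

grpo_data = load_dataset("your-org/reasoning-prompts", split="train")  # placeholder
grpo_trainer = GRPOTrainer(
    model="sft-checkpoint",
    reward_funcs=reward_final_answer,
    train_dataset=grpo_data,
    args=GRPOConfig(output_dir="grpo-checkpoint"),
)
grpo_trainer.train()
```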
On enterprise-oriented tasks such as MBPP, BFCL, Enterprise RAG, MT-Bench, MixEval, IFEval, and Multi-Challenge, the model delivers competitive or superior performance compared to larger models. In terms of token efficiency, it consumes 40% fewer tokens than QwQ-32B, substantially reducing inference costs. On the memory side, it achieves all of this with roughly 50% of the memory required by QwQ-32B and EXAONE-Deep-32B, a significant improvement in deployment feasibility. Even on academic benchmarks such as AIME-24, AIME-25, AMC-23, MATH-500, and GPQA, the model holds its own, often matching or exceeding the performance of larger models while remaining significantly lighter in compute requirements.
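To make the 40% token-reduction claim concrete, here is a simple illustrative cost calculation; the baseline token count and per-token price are hypothetical, and only the 40% ratio comes from the reported comparison:

```python
# Illustrative inference-cost comparison based on the reported ~40% token savings.
# The baseline token count and per-token price below are hypothetical.
BASELINE_OUTPUT_TOKENS = 2_000     # hypothetical tokens per response for a 32B-class reasoner
PRICE_PER_1K_TOKENS = 0.002        # hypothetical serving cost in dollars

apriel_tokens = BASELINE_OUTPUT_TOKENS * (1 - 0.40)  # 40% fewer tokens
for name, tokens in [("QwQ-32B-class baseline", BASELINE_OUTPUT_TOKENS),
                     ("Apriel-Nemotron-15b-Thinker", apriel_tokens)]:
    cost = tokens / 1000 * PRICE_PER_1K_TOKENS
    print(f"{name}: {tokens:.0f} tokens -> ${cost:.4f} per response")
```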
Several key takeaways from the research on Apriel-Nemotron-15b-Thinker:
- Apriel-Nemotron-15b-Thinker has 15 billion parameters, significantly fewer than QwQ-32B or EXAONE-Deep-32B, yet it performs competitively.
- Trained with a three-stage pipeline: CPT on 100B+ tokens, SFT on 200K demonstrations, and a final GRPO refinement stage.
- Consumes about 50% less memory than QwQ-32B, making it easier to deploy on enterprise hardware.
- Uses 40% fewer tokens than QwQ-32B on production tasks, reducing inference costs and improving speed.
- Matches or outperforms larger models on benchmarks such as MBPP, BFCL, Enterprise RAG, GPQA, and MATH-500.
- Optimized for agentic and enterprise tasks, making it practical for corporate automation, coding agents, and logical assistants.
- Designed for real-world use, avoiding over-reliance on lab-scale computing environments.
Check out the model on Hugging Face.
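For readers who want to try it, a minimal inference sketch using the Transformers library is shown below; the repository id and chat-template usage follow common Hugging Face conventions and should be verified against the model card.

```python
# Minimal inference sketch using Hugging Face Transformers.
# The repository id is assumed from the Hugging Face listing; verify it on the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ServiceNow-AI/Apriel-Nemotron-15b-Thinker"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user",
             "content": "A train travels 120 km in 1.5 hours. What is its average speed?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```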

Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who studies applications of machine learning in healthcare.