
Nous Research releases Hermes 4: an open AI model family with hybrid reasoning

Nous Research has published Hermes 4, a family of open models (based on Llama 3.1 checkpoints, in 14B, 70B, and 405B parameter sizes) that reaches frontier-level performance through training techniques alone. Hermes 4 introduces hybrid reasoning: the models can switch between standard responses and explicit reasoning traces, marked with <think> tags, when complex problems require deeper deliberation.

What makes Hermes 4 particularly important is that it achieves state-of-the-art performance among open models while maintaining complete transparency and neutral alignment, suggesting that sophisticated reasoning capabilities can be developed entirely through an open-source approach.

DataForge: Graph-based synthetic data generation

DataForge is the core component behind Hermes 4's training data. It is a graph-based synthetic data generation system that changes how training data is created. Unlike traditional curation approaches, DataForge operates over a directed acyclic graph (DAG) in which each node implements a PDDL (Planning Domain Definition Language)-style operation interface.

Each node declares preconditions, postconditions, and a transformation, which makes it possible to compose complex data pipelines automatically. Starting from pre-training seed data drawn from DCLM and FineWeb, the system can, for example, convert a Wikipedia article into a rap song and then generate instruction–answer pairs grounded in that conversion.
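To make the node abstraction concrete, here is a minimal sketch of how a PDDL-style data node might look, not Nous Research's actual implementation: each node names the preconditions it requires, the postconditions it produces, and a transformation, and a simple driver chains nodes whose conditions line up. The class name, fields, and example transformations are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DataForgeNode:
    """Hypothetical PDDL-style node: applies only when its preconditions
    are present in the sample's state, then adds its postconditions."""
    name: str
    preconditions: set[str]      # state keys that must already exist
    postconditions: set[str]     # state keys this node produces
    transform: Callable[[dict], dict]

def run_pipeline(sample: dict, nodes: list[DataForgeNode]) -> dict:
    """Repeatedly apply any node whose preconditions are satisfied,
    building up the sample like a tiny planning problem."""
    state = dict(sample)
    applied = True
    while applied:
        applied = False
        for node in nodes:
            if node.preconditions.issubset(state) and not node.postconditions.issubset(state):
                state.update(node.transform(state))
                applied = True
    return state

# Illustrative nodes: article -> rap song -> instruction/answer pair.
nodes = [
    DataForgeNode(
        name="article_to_rap",
        preconditions={"article"},
        postconditions={"rap_song"},
        transform=lambda s: {"rap_song": f"(rap version of) {s['article'][:50]}..."},
    ),
    DataForgeNode(
        name="rap_to_instruction",
        preconditions={"rap_song"},
        postconditions={"instruction", "answer"},
        transform=lambda s: {
            "instruction": "Summarize the following rap in one sentence.",
            "answer": s["rap_song"],
        },
    ),
]

print(run_pipeline({"article": "Wikipedia text about photosynthesis ..."}, nodes))
```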

This method generates roughly 5 million samples totaling about 19 billion tokens. Reasoning samples are deliberately over-weighted: they average around five times as many tokens as non-reasoning samples, to accommodate long thinking traces of up to 16,000 tokens.

Rejection sampling at an unprecedented scale

Hermes 4 uses Atropos, Nous Research's open-source reinforcement learning environment, to perform rejection sampling against roughly 1,000 distinct task verifiers. This large-scale verification infrastructure filters for high-quality reasoning trajectories across diverse domains.

Key verification environments include answer-format training (rewarding correct formatting across more than 150 output formats), instruction following (using RLVR-style IFEval tasks with complex constraints), schema adherence (generating JSON validated against Pydantic models), and tool use (training agentic behavior).
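As a hedged illustration of the schema-adherence idea (not Atropos's actual API), a verifier can simply try to validate a model's JSON output against a Pydantic model and emit a binary reward. The schema and function below are assumptions for illustration only.

```python
from pydantic import BaseModel, ValidationError

class WeatherReport(BaseModel):
    """Hypothetical target schema the model is asked to produce."""
    city: str
    temperature_c: float
    conditions: str

def schema_reward(model_output: str) -> float:
    """Return 1.0 if the output parses as JSON matching the schema, else 0.0."""
    try:
        WeatherReport.model_validate_json(model_output)
        return 1.0
    except ValidationError:
        return 0.0

print(schema_reward('{"city": "Paris", "temperature_c": 18.5, "conditions": "cloudy"}'))  # 1.0
print(schema_reward('{"city": "Paris"}'))  # 0.0 (missing fields)
```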

The rejection-sampling process produces a large pool of verified reasoning trajectories, with multiple distinct solution paths that reach the same verified result. This ensures the model learns robust reasoning patterns rather than memorizing specific solution templates.
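A minimal sketch of the overall rejection-sampling loop, assuming a `generate` function for the current model and a per-task `verifier` such as the one above (both are placeholders, not the actual Atropos interfaces): sample several candidate trajectories per prompt and keep only those that pass verification.

```python
from typing import Callable

def rejection_sample(
    prompts: list[str],
    generate: Callable[[str], str],        # placeholder: draws one trajectory from the model
    verifier: Callable[[str, str], bool],  # placeholder: task-specific (prompt, output) -> pass/fail
    samples_per_prompt: int = 8,
) -> list[tuple[str, str]]:
    """Keep only (prompt, trajectory) pairs that the verifier accepts."""
    kept = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            trajectory = generate(prompt)
            if verifier(prompt, trajectory):
                kept.append((prompt, trajectory))
    return kept
```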

Length control: solving the overlong-generation problem

One of Hermes 4's most innovative contributions addresses the overlong-reasoning problem: reasoning models tend to produce excessively long chains of thought without terminating. The team found that, in reasoning mode, their 14B model hit the maximum context length 60% of the time on LiveCodeBench.

Their remarkably effective solution is a second supervised fine-tuning stage that teaches the model to stop reasoning at 30,000 tokens:

  1. Generate reasoning traces from the current policy
  2. Insert a closing </think> token at exactly 30,000 tokens
  3. Train only on the termination decision, not on the reasoning chain
  4. Apply gradient updates solely to the </think> and end-of-text tokens

This approach achieved significant results: a 78.4% reduction in overlong generation on AIME'24, 65.3% on AIME'25, and 79.8% on LiveCodeBench, at a relative accuracy cost of only 4.7% to 12.7%. By concentrating the learning signal entirely on the termination decision, the method avoids the risk of model collapse while teaching an efficient "counting" behavior.
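To illustrate the "train only on the termination decision" idea, here is a hedged sketch with assumed token IDs and helper names, not the actual Hermes 4 training code: the </think> and end-of-text tokens are spliced into an overlong trace at the 30,000-token boundary, and the loss mask zeroes out every position except those termination tokens.

```python
import torch
import torch.nn.functional as F

THINK_END_ID = 128003   # assumed id for the </think> token
EOT_ID = 128009         # assumed id for the end-of-text token
CUTOFF = 30_000         # termination boundary in tokens

def build_termination_example(trace_ids: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Truncate an overlong reasoning trace at the cutoff, append the
    termination tokens, and build a mask that is True only on them."""
    truncated = trace_ids[:CUTOFF]
    terminator = torch.tensor([THINK_END_ID, EOT_ID])
    input_ids = torch.cat([truncated, terminator])
    loss_mask = torch.zeros_like(input_ids, dtype=torch.bool)
    loss_mask[-2:] = True   # gradients flow only through the termination decision
    return input_ids, loss_mask

def masked_ce_loss(logits: torch.Tensor, labels: torch.Tensor, loss_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over masked positions only (standard causal LM shift)."""
    logits, labels, mask = logits[:-1], labels[1:], loss_mask[1:]
    per_token = F.cross_entropy(logits, labels, reduction="none")
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```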

Benchmark performance and neutral alignment

Hermes 4 demonstrates state-of-the-art performance among open models. The 405B model achieves 96.3% on MATH-500 (reasoning mode), 81.9% on AIME'24, 78.1% on AIME'25, 70.5% on GPQA Diamond, and 61.3% on LiveCodeBench.

Particularly noteworthy is its performance on RefusalBench, where it scores 57.1% in reasoning mode, the highest among the evaluated models and well ahead of GPT-4o (17.67%) and Claude Sonnet 4 (17%). This suggests the model is willing to engage with controversial topics while maintaining appropriate boundaries, reflecting Nous Research's philosophy of neutral alignment.

Technical architecture and training

Hermes 4 was trained with a modified TorchTitan stack on 192 NVIDIA B200 GPUs. The system handles a highly heterogeneous sample-length distribution through efficient packing (achieving >99.9% batch efficiency), flex attention, and careful loss masking in which only assistant tokens contribute to the cross-entropy loss.
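To make the packing idea concrete, here is a hedged sketch using a simple greedy first-fit-decreasing packer, not necessarily what the modified TorchTitan stack does: variable-length samples are grouped into fixed 16,384-token bins so that very little of each batch is wasted on padding. The helper names and example lengths are illustrative.

```python
MAX_LEN = 16_384  # training context length

def pack_samples(sample_lengths: list[int], max_len: int = MAX_LEN) -> list[list[int]]:
    """Greedy first-fit-decreasing packing of sample indices into bins of
    at most max_len tokens; each bin becomes one packed training sequence."""
    order = sorted(range(len(sample_lengths)), key=lambda i: -sample_lengths[i])
    bins: list[list[int]] = []
    bin_space: list[int] = []
    for i in order:
        length = sample_lengths[i]
        for b, space in enumerate(bin_space):
            if length <= space:
                bins[b].append(i)
                bin_space[b] -= length
                break
        else:
            bins.append([i])
            bin_space.append(max_len - length)
    return bins

def batch_efficiency(sample_lengths: list[int], bins: list[list[int]], max_len: int = MAX_LEN) -> float:
    """Fraction of batch tokens that are real (non-padding) tokens."""
    used = sum(sample_lengths[i] for b in bins for i in b)
    return used / (len(bins) * max_len)

lengths = [12_000, 3_500, 800, 15_000, 1_200, 400, 9_000, 7_000]
bins = pack_samples(lengths)
print(bins, f"{batch_efficiency(lengths, bins):.3f}")
```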

Training follows a cosine learning-rate schedule with a global batch size of 384 samples at a 16,384-token context length, combining data parallelism, tensor parallelism, and fully sharded data parallelism.
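For readers less familiar with this setup, here is a hedged sketch of what the optimizer and schedule might look like in PyTorch. Only the cosine schedule, batch shape, and context length come from the article; the stand-in model, peak learning rate, and step count are assumptions.

```python
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for the actual model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)   # assumed peak LR
total_steps = 10_000                                          # assumed step count
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    # ... forward/backward on a packed global batch of 384 samples at 16,384 tokens ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()  # learning rate decays along a cosine curve
```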

Summary

Hermes 4 marks a significant advance in open-source AI development, demonstrating that frontier-level reasoning capabilities can be achieved through transparent, reproducible approaches without relying on proprietary training data or closed development processes. By combining graph-based synthetic data generation, large-scale rejection sampling, and an elegant length-control mechanism, Nous Research has created models that not only match the performance of leading proprietary systems but also maintain neutral alignment and transparency, making them genuinely useful tools rather than restrictive assistants.


Check out the paper, technical details, and models on Hugging Face.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is both technically sound and accessible to a wide audience. The platform receives over 2 million views per month, reflecting its popularity among readers.
