Huawei Introduces Pangu Ultra MoE: A 718B-Parameter Sparse Language Model Trained Efficiently on Ascend NPUs Using Simulation-Driven Architecture and System-Level Optimization

Sparse large language models (LLMs) based on the Mixture-of-Experts (MoE) framework have gained traction because they scale capacity efficiently by activating only a subset of parameters for each token. This dynamic sparsity lets MoE models retain high representational capacity while limiting the computation spent per token. However, as their complexity grows and model sizes approach a trillion parameters, training them demands algorithmic innovation together with tightly integrated hardware and software optimization. These challenges are especially acute when deploying models on non-standard AI accelerators such as Ascend NPUs, which require specific architectural alignment to deliver optimal performance.
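To make the sparsity idea concrete, the sketch below shows top-k expert routing in a MoE layer using NumPy. The sizes, the random router, and the per-expert feed-forward matrices are hypothetical placeholders for illustration, not Pangu Ultra MoE's actual implementation.

```python
# Minimal sketch of top-k expert routing in a Mixture-of-Experts layer.
# All shapes and weights are hypothetical placeholders.
import numpy as np

num_tokens, hidden, num_experts, top_k = 8, 16, 4, 2

tokens = np.random.randn(num_tokens, hidden)
router_weights = np.random.randn(hidden, num_experts)

# Router scores and per-token selection of the top-k experts.
logits = tokens @ router_weights
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
topk_experts = np.argsort(-probs, axis=-1)[:, :top_k]  # (num_tokens, top_k)

# Only the selected experts' parameters are applied to each token, so
# per-token compute stays roughly constant even as num_experts grows.
expert_ffns = [np.random.randn(hidden, hidden) for _ in range(num_experts)]
output = np.zeros_like(tokens)
for t in range(num_tokens):
    for e in topk_experts[t]:
        output[t] += probs[t, e] * (tokens[t] @ expert_ffns[e])
```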

A major technical challenge is the low utilization of hardware resources when training sparse LLMs. Because only a portion of the parameters is active for each token, workloads become unbalanced across devices, leading to synchronization delays and underused compute. The imbalance also affects memory utilization, since different experts process different numbers of tokens and sometimes exceed their capacity. These inefficiencies compound at scale: on clusters of thousands of AI chips, communication and memory-management bottlenecks significantly hinder throughput. In practice, the inability to fully exploit sparsity limits the deployment of such models on hardware systems such as Ascend NPUs.
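The sketch below illustrates the load-imbalance problem under a simplifying assumption of one expert per device and a made-up skewed routing distribution: devices effectively wait for the busiest expert, so utilization is roughly the mean load divided by the maximum load.

```python
# Rough sketch of how skewed token routing leaves devices unevenly loaded.
# Assumes one expert per device; the routing skew is a hypothetical example.
import numpy as np

num_tokens, num_experts = 4096, 8
rng = np.random.default_rng(0)

# Skewed routing: some experts attract far more tokens than others.
routing_skew = np.array([0.30, 0.20, 0.15, 0.10, 0.10, 0.06, 0.05, 0.04])
assignments = rng.choice(num_experts, size=num_tokens, p=routing_skew)
tokens_per_expert = np.bincount(assignments, minlength=num_experts)

# Devices must wait for the busiest expert, so effective utilization is
# roughly mean load / max load across experts.
utilization = tokens_per_expert.mean() / tokens_per_expert.max()
print(tokens_per_expert, f"approx. utilization: {utilization:.2f}")
```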

Several strategies have been proposed to address these challenges. They include auxiliary losses that balance the token distribution across experts and dropping strategies that limit expert overload by discarding tokens beyond a capacity threshold. However, these techniques either degrade model performance or introduce inefficiencies in memory and computation. Other efforts include heuristic expert placement and conventional communication patterns such as all-to-all dispatch, but these generally do not scale well or sustain high throughput. Furthermore, standard memory-saving techniques such as recomputation are usually coarse-grained, targeting entire layers rather than specific operations, and therefore increase runtime without proportional memory savings.
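For illustration, the following sketch implements the two mitigations mentioned above in a generic, Switch-Transformer-style form: an auxiliary load-balancing loss and capacity-based token dropping. It is not drawn from the Pangu codebase, and all sizes and probabilities are placeholders.

```python
# Sketch of an auxiliary load-balancing loss (Switch-Transformer style) and
# capacity-based token dropping. Not Pangu-specific; values are placeholders.
import numpy as np

num_tokens, num_experts, capacity_factor = 1024, 8, 1.25
rng = np.random.default_rng(0)
router_probs = rng.dirichlet(np.ones(num_experts), size=num_tokens)
assignments = router_probs.argmax(axis=-1)

# Auxiliary loss: penalize correlation between the fraction of tokens routed
# to each expert and that expert's mean router probability.
frac_tokens = np.bincount(assignments, minlength=num_experts) / num_tokens
mean_probs = router_probs.mean(axis=0)
aux_loss = num_experts * np.sum(frac_tokens * mean_probs)

# Capacity-based dropping: tokens beyond an expert's capacity are discarded,
# which bounds memory and compute but loses information for dropped tokens.
capacity = int(capacity_factor * num_tokens / num_experts)
tokens_per_expert = np.bincount(assignments, minlength=num_experts)
dropped = np.maximum(tokens_per_expert - capacity, 0).sum()
print(f"aux_loss={aux_loss:.3f}, dropped {dropped} of {num_tokens} tokens")
```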

Researchers from Huawei Cloud’s Pangu team introduced a highly structured and optimized training method for large MoE models tailored to Ascend NPUs. They developed Pangu Ultra MoE, a sparse LLM with 718 billion parameters, focusing on aligning the model architecture and system design with the capabilities of Ascend hardware. Their approach begins with a simulation-based model configuration process that evaluates thousands of architectures using metrics grounded in actual hardware behavior. These simulations inform design decisions before any physical training run, saving substantial computing resources and enabling informed tuning of model hyperparameters.
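The snippet below sketches only the general idea of simulation-driven configuration search: rank many candidate (layers, hidden size, expert count) combinations with a cheap cost model instead of training them. The cost function here is a made-up stand-in, not the hardware-accurate simulator used by the Pangu team.

```python
# Illustrative sketch of simulation-driven configuration search: score many
# candidate architectures with a stand-in cost model rather than training them.
from itertools import product

def simulated_step_time(layers, hidden, experts):
    # Hypothetical proxy: compute cost grows with layers * hidden^2,
    # communication cost grows with the number of experts.
    compute = layers * (hidden ** 2) / 1e9
    comm = 0.002 * experts * layers
    return compute + comm

candidates = product([48, 61, 72], [6144, 7680, 8192], [128, 256, 512])
ranked = sorted(candidates, key=lambda c: simulated_step_time(*c))
for layers, hidden, experts in ranked[:3]:
    print(layers, hidden, experts,
          round(simulated_step_time(layers, hidden, experts), 3))
```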

The simulation method analyzes combinations of parameters such as the number of layers, hidden size, and expert count under a five-dimensional parallelism strategy spanning pipeline parallelism, tensor parallelism, expert parallelism, data parallelism, and context parallelism. The final model configuration adopted by Huawei uses 256 experts, a hidden size of 7680, and 61 transformer layers. To further optimize performance, the researchers integrated an adaptive pipeline overlap mechanism to hide communication costs and used hierarchical all-to-all communication to reduce inter-node data transfer. They also employed fine-grained recomputation, such as recomputing only the key-value vectors in the attention module, and introduced tensor swapping to dynamically offload activation memory to the host.
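As a minimal sketch, assuming a Megatron-style layout, the snippet below shows how such a five-dimensional parallelism configuration might be expressed as a config object. The degrees shown are illustrative placeholders, not the values used for Pangu Ultra MoE.

```python
# Sketch of a five-dimensional parallelism configuration; degrees are placeholders.
from dataclasses import dataclass

@dataclass
class ParallelConfig:
    pipeline: int  # pipeline-parallel stages (layers split across devices)
    tensor: int    # tensor-parallel shards within each weight matrix
    expert: int    # expert-parallel groups (experts spread across devices)
    data: int      # data-parallel replicas
    context: int   # context-parallel splits along the sequence dimension

    def world_size(self) -> int:
        # Assumption of this sketch: expert parallelism reuses ranks from the
        # data-parallel dimension rather than adding devices, so it is not
        # included in the product.
        return self.pipeline * self.tensor * self.data * self.context

cfg = ParallelConfig(pipeline=8, tensor=8, expert=16, data=12, context=1)
print(cfg, "-> devices:", cfg.world_size())
```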

Pangu Ultra MoE achieves 30.0% model FLOPs utilization (MFU) and processes 1.46 million tokens per second on 6,000 Ascend NPUs, compared with a baseline of 18.9% MFU and 0.61 million tokens per second on 4,000 NPUs. The researchers also proposed dynamic expert placement strategies that improve device-level load balancing and deliver a 10% MFU improvement. The model is competitive in benchmark evaluations, scoring 81.3% on AIME 2024, 97.4% on MATH-500, 94.8% on CLUEWSC, and 91.5% on MMLU. In the healthcare domain, it outperforms DeepSeek R1 with scores of 87.1% on MedQA and 80.8% on MedMCQA, confirming its strength in domain-specific applications.
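For readers unfamiliar with the metric, the sketch below shows how MFU is typically computed: achieved training FLOPs per second (roughly 6 times the active parameter count times tokens per second) divided by the cluster's peak FLOPs per second. The active-parameter count and per-chip peak used in the example are hypothetical placeholders, not published Pangu or Ascend figures.

```python
# Sketch of the standard MFU calculation. Inputs in the example are
# hypothetical placeholders, not published Pangu or Ascend numbers.
def mfu(active_params, tokens_per_sec, num_devices, peak_flops_per_device):
    achieved = 6 * active_params * tokens_per_sec   # approx. training FLOPs/s
    peak = num_devices * peak_flops_per_device      # cluster peak FLOPs/s
    return achieved / peak

# Example: 40e9 active parameters, 1.46e6 tokens/s, 6,000 devices,
# 300 TFLOPs/s peak per device (all assumed values).
print(f"{mfu(40e9, 1.46e6, 6000, 300e12):.3f}")
```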

This study illustrates how Huawei’s Pangu team effectively addressed the core difficulties of training large MoE models on specialized hardware. Their systematic architecture search, efficient communication techniques, and tailored memory optimizations represent a strong framework for scalable AI training. The work demonstrates practical ways to unlock the performance potential of sparse models and sets a direction for future system-aware AI design.


Check out the paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don’t forget to join our 95k+ ML SubReddit.

Nikhil is a consulting intern at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who researches applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he explores new advancements and creates opportunities to contribute.
