
Huawei CloudMatrix: A Peer-to-Peer AI Data Center Architecture for Scalable and Efficient LLM Serving

Parameter counts for LLMs have grown rapidly, alongside widespread adoption of Mixture-of-Experts (MoE) designs and very long context lengths. Models such as DeepSeek-R1, Llama-4, and Qwen-3 now reach into the trillions of parameters, demanding enormous compute, memory bandwidth, and fast inter-chip communication. MoE improves efficiency but introduces challenges in expert routing, while context windows of up to millions of tokens put heavy pressure on attention computation and on KV cache storage, which scales with the number of concurrent users. In real deployments, unpredictable input lengths, uneven expert activation, and bursty query traffic complicate serving further. Addressing these pressures calls for a ground-up rethinking of AI infrastructure through hardware-software co-design, adaptive orchestration, and elastic resource management.
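To make the KV cache pressure concrete, here is a rough back-of-the-envelope sketch. All model dimensions below are illustrative assumptions, not the configuration of DeepSeek-R1 or any model named in the article; the point is only how quickly cache memory grows with context length and concurrency.

```python
# Rough KV-cache sizing sketch; the dimensions are hypothetical, not a real model's.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, dtype_bytes=2):
    """Bytes of KV cache for ONE sequence: a K and a V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * dtype_bytes

# Hypothetical model: 64 layers, 8 KV heads (grouped-query attention), head_dim 128, FP16.
per_seq = kv_cache_bytes(num_layers=64, num_kv_heads=8, head_dim=128, context_len=128_000)
print(f"KV cache per 128K-token sequence: {per_seq / 1e9:.1f} GB")

# With many concurrent users, the cache alone can exceed a single accelerator's memory.
for users in (1, 8, 64):
    print(f"{users:>3} concurrent sequences -> {users * per_seq / 1e9:,.0f} GB")
```

Even with these modest assumptions, a few dozen long-context users need terabytes of cache, which is why the architecture discussed below treats KV cache as a pooled, distributed resource rather than something pinned to one chip.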

Recent advances in LLMs are shaped by three main trends: growing parameter counts, sparse MoE architectures, and extended context windows. Models like Llama 4, DeepSeek-V3, and Google's PaLM push scale into the trillions of parameters, while MoE designs activate only a subset of experts for each token, balancing capacity with efficiency. Meanwhile, context windows now span hundreds of thousands to millions of tokens, enabling long-form inference through large key-value (KV) caches but straining compute and memory. These advances put enormous pressure on data centers, demanding more compute, memory, and bandwidth while introducing challenges in parallelism strategy, workload heterogeneity, data convergence, and storage performance.
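The sparse-activation idea behind MoE can be shown in a few lines. The sketch below is a minimal top-k gating example with made-up dimensions; production MoE layers add load-balancing losses, capacity limits, and expert-parallel communication that are omitted here.

```python
import numpy as np

# Minimal top-k MoE gating sketch (illustrative dimensions only).
rng = np.random.default_rng(0)
num_tokens, hidden, num_experts, top_k = 16, 64, 8, 2

x = rng.standard_normal((num_tokens, hidden))        # token activations
w_gate = rng.standard_normal((hidden, num_experts))  # router weights

logits = x @ w_gate
topk_idx = np.argsort(logits, axis=-1)[:, -top_k:]   # chosen experts per token
topk_logits = np.take_along_axis(logits, topk_idx, axis=-1)
weights = np.exp(topk_logits) / np.exp(topk_logits).sum(-1, keepdims=True)

# Only top_k of num_experts experts run per token -> sparse compute,
# but each token must be routed to wherever its chosen experts live.
print("experts for token 0:", topk_idx[0], "weights:", np.round(weights[0], 2))
```

The efficiency win is that only 2 of 8 experts run per token here, but the cost is routing: tokens must be shipped to the devices hosting their experts, which is exactly the communication pattern the architecture below is built to accelerate.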

Huawei researchers introduced CloudMatrix, a new AI data center architecture designed to handle the growing demands of large-scale LLMs. Its first implementation, CloudMatrix384, combines 384 Ascend 910C NPUs with 192 Kunpeng CPUs, all linked by a high-bandwidth, low-latency unified bus that enables fully peer-to-peer communication. The design allows compute, memory, and network resources to be disaggregated and pooled flexibly, making it well suited to MoE parallelism and distributed KV cache access. On top of this, CloudMatrix-Infer provides an optimized serving framework that includes peer-to-peer resource pools, large-scale expert parallelism, and hardware-aware optimizations such as pipelining and INT8 quantization. Evaluations on DeepSeek-R1 show state-of-the-art throughput, efficiency, and scalability.
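The article does not show CloudMatrix-Infer's scheduler, so the snippet below is only a toy illustration of the disaggregation idea it describes: prefill and decode run in separate pools, handing work off through a shared KV cache. All class, method, and URI names here are hypothetical.

```python
from dataclasses import dataclass, field
from collections import deque
from typing import Optional

# Toy sketch of prefill/decode disaggregation (hypothetical names, not
# CloudMatrix-Infer's actual API): new requests go to a prefill pool that
# builds the KV cache, then move to a decode pool that streams tokens.

@dataclass
class Request:
    rid: int
    prompt_len: int
    kv_cache_ref: Optional[str] = None   # handle into a shared cache pool

@dataclass
class ServingCluster:
    prefill_queue: deque = field(default_factory=deque)
    decode_queue: deque = field(default_factory=deque)

    def admit(self, req: Request):
        self.prefill_queue.append(req)

    def step_prefill(self):
        # Prefill workers process whole prompts and publish KV blocks to a shared pool.
        while self.prefill_queue:
            req = self.prefill_queue.popleft()
            req.kv_cache_ref = f"kvpool://req-{req.rid}"
            self.decode_queue.append(req)

    def step_decode(self):
        # Decode workers read the shared KV cache and generate one token per step.
        return [(r.rid, r.kv_cache_ref) for r in self.decode_queue]

cluster = ServingCluster()
cluster.admit(Request(rid=1, prompt_len=4096))
cluster.step_prefill()
print(cluster.step_decode())
```

The design choice this models is that prefill (compute-bound, whole prompts) and decode (latency-bound, one token at a time) scale independently, which only works well if any decode worker can reach any cached KV block quickly.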

Huawei CloudMatrix is a new AI data center architecture built on peer-to-peer high-bandwidth interconnects and fine-grained resource disaggregation. Its first production-scale implementation, CloudMatrix384, integrates 384 Ascend 910C NPUs and 192 Kunpeng CPUs into a single supernode, all linked by a unified bus network that enables direct all-to-all communication. The design allows compute, memory, and network resources to be shared seamlessly and scaled independently, so the supernode operates as one tightly coupled system. By avoiding the bottlenecks of traditional hierarchical designs, CloudMatrix384 is particularly effective for communication-heavy tasks such as large-scale MoE parallelism and distributed KV cache management, making it well suited to scalable LLM serving.
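To see why all-to-all bandwidth is the critical resource for large-scale expert parallelism, the sketch below counts, with made-up routing decisions and rank counts, how many token copies each expert-hosting rank would receive in a single MoE layer. The numbers are purely illustrative.

```python
import numpy as np

# Sketch of token dispatch under expert parallelism (illustrative sizes only):
# each token's top-k experts may live on different ranks, so every MoE layer
# triggers an all-to-all exchange whose volume the interconnect must absorb.
rng = np.random.default_rng(1)
num_tokens, num_experts, top_k, num_ranks = 1024, 256, 8, 32
experts_per_rank = num_experts // num_ranks

# Pretend the router already picked top_k experts per token
# (duplicates are possible in this toy version).
chosen = rng.integers(0, num_experts, size=(num_tokens, top_k))
dest_rank = chosen // experts_per_rank

tokens_sent = np.bincount(dest_rank.ravel(), minlength=num_ranks)
print("token copies sent to each rank:", tokens_sent)
print("total token copies on the wire per layer:", tokens_sent.sum())  # num_tokens * top_k
```

Every MoE layer multiplies traffic by the top-k factor, which is why a flat, peer-to-peer unified bus is a better fit than a hierarchical network where cross-group hops become the bottleneck.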

The researchers evaluated CloudMatrix-Infer on the DeepSeek-R1 model using a CloudMatrix384 supernode. The system reaches a prefill throughput of 6,688 tokens per second per NPU and a decode throughput of 1,943 tokens per second per NPU while keeping per-token output latency below 50 milliseconds, outperforming comparable systems such as SGLang on NVIDIA H100 and DeepSeek's own serving stack on H800. Even under a stricter latency target of below 15 milliseconds, it still sustains 538 tokens per second in decoding. In addition, INT8 quantization on the Ascend 910C preserves accuracy across 16 benchmarks, showing that the efficiency gains do not come at the cost of model quality.
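The article does not detail Huawei's INT8 recipe, so the snippet below is a generic symmetric per-channel weight quantization sketch, shown only to illustrate why 8-bit weights can cut memory by 4x versus FP32 (2x versus FP16) with a small reconstruction error. It is not CloudMatrix-Infer's actual scheme, and the matrix size is arbitrary.

```python
import numpy as np

# Generic symmetric per-output-channel INT8 weight quantization sketch
# (NOT Huawei's specific recipe, which the article does not describe).
rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float32)   # hypothetical weight matrix

scale = np.abs(w).max(axis=1, keepdims=True) / 127.0        # one scale per output channel
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale

rel_err = np.linalg.norm(w - w_dequant) / np.linalg.norm(w)
print(f"relative quantization error: {rel_err:.4f}")
print(f"memory: {w.nbytes / 2**20:.0f} MiB fp32 -> {w_int8.nbytes / 2**20:.0f} MiB int8")
```

Smaller weights mean less memory traffic per token, which is where much of the decode throughput gain comes from; the reported 16-benchmark accuracy parity indicates the quantization error stays below what affects task performance.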

In short, Huawei CloudMatrix is a next-generation AI data center architecture designed to overcome the scalability limits of traditional clusters. Its first production system, CloudMatrix384, combines 384 Ascend 910C NPUs and 192 Kunpeng CPUs in a fully peer-to-peer supernode connected through a high-bandwidth, low-latency unified bus. To exploit this design, the study proposes CloudMatrix-Infer, which disaggregates prefill, decode, and caching into independent resource pools, supports large-scale expert parallelism, and applies hardware-aware optimizations such as pipelining and INT8 quantization. In DeepSeek-R1 testing it achieved superior throughput and latency compared with NVIDIA-based systems while preserving accuracy, demonstrating its potential for large-scale AI deployment.


Check out the Technical Paper.


Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. He is very interested in solving practical problems, and he brings a new perspective to the intersection of AI and real-life solutions.
