DeepSeek AI Releases DeepGEMM: An FP8 GEMM Library that Supports Both Dense and MoE GEMMs, Powering V3/R1 Training and Inference

Efficient matrix multiplication remains a critical component of modern deep learning and high-performance computing. As models grow increasingly complex, conventional approaches to General Matrix Multiplication (GEMM) often face challenges related to memory bandwidth constraints, numerical precision, and suboptimal hardware utilization. These issues are compounded by emerging mixed-precision formats such as FP8, which require careful handling to avoid computational inaccuracies. Recent advances in GPU architecture, particularly NVIDIA's Hopper tensor cores, create opportunities for performance improvements, but only if software is designed to take full advantage of these capabilities. In this context, there is a need for tools that not only address these performance bottlenecks but also maintain simplicity and transparency in their design.
DeepGEMM, released by DeepSeek AI, marks a thoughtful approach to enhancing FP8 GEMM operations. Designed specifically for efficient and clean FP8 matrix multiplications, DeepGEMM supports both standard and Mix-of-Experts (MoE) grouped GEMMs. The library is written in CUDA and stands out for its use of lightweight Just-In-Time (JIT) compilation, which compiles kernels at runtime. This design choice means no lengthy compile-time step is required during installation, making it straightforward to integrate into existing projects. DeepGEMM is tailored to NVIDIA Hopper tensor cores, ensuring that it leverages modern hardware capabilities while addressing inherent challenges such as imprecise FP8 accumulation.
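To give a feel for what integration looks like, here is a minimal sketch of calling an FP8 GEMM through the library's Python interface. The function name gemm_fp8_fp8_bf16_nt follows the repository's README, which describes each FP8 operand as a (tensor, scaling-factor) pair; the quantization helper below is a simplified stand-in written for this example (the library itself uses per-token scaling for the left operand and 128x128 block scaling for the right), so the exact signatures should be verified against the repository.

```python
import torch
import deep_gemm

# Simplified stand-in helper (not part of DeepGEMM): quantize a BF16 tensor
# to FP8 (e4m3) with one scaling factor per 128-element block along each row.
def per_block_cast_to_fp8(x: torch.Tensor, block: int = 128):
    m, n = x.shape
    x_view = x.view(m, n // block, block)
    amax = x_view.abs().float().amax(dim=2, keepdim=True).clamp(min=1e-4)
    scale = amax / 448.0                      # 448 is the max value of FP8 e4m3
    x_fp8 = (x_view / scale).to(torch.float8_e4m3fn).view(m, n)
    return x_fp8, scale.squeeze(2)

m, k, n = 128, 4096, 4096
x = torch.randn(m, k, device='cuda', dtype=torch.bfloat16)
y = torch.randn(n, k, device='cuda', dtype=torch.bfloat16)
out = torch.empty(m, n, device='cuda', dtype=torch.bfloat16)

# Each FP8 operand is passed as a (tensor, scales) pair; the kernel is
# JIT-compiled on first use for this shape and hardware configuration.
deep_gemm.gemm_fp8_fp8_bf16_nt(per_block_cast_to_fp8(x),
                               per_block_cast_to_fp8(y), out)
```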
Technical details and benefits
At its core, DeepGEMM uses fine-grained scaling combined with FP8 arithmetic to balance speed and numerical accuracy. To counteract the imprecision of FP8 tensor-core accumulation, the library employs a two-level accumulation strategy via CUDA cores, often described as promotion. This approach minimizes error during computation without sacrificing performance. The implementation is remarkably concise, with a single core kernel function comprising roughly 300 lines of code. This simplicity not only aids understanding of the underlying principles but also invites further refinement by the community.
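The idea behind promotion can be sketched in plain PyTorch: partial products over small K-blocks stand in for runs of FP8 tensor-core MMAs with limited-precision accumulation, and each partial sum is promoted into an FP32 accumulator, standing in for the CUDA-core registers. This is a conceptual simulation of the strategy only, not DeepGEMM's actual CUDA implementation, and the block size is an illustrative choice.

```python
import torch

def gemm_with_promotion(a_fp8: torch.Tensor, b_fp8: torch.Tensor,
                        k_block: int = 128) -> torch.Tensor:
    """Conceptual simulation of two-level accumulation (promotion).

    Each K-block partial product mimics a run of low-precision FP8
    tensor-core MMAs; the running sum is kept in FP32, mimicking
    promotion onto CUDA-core registers.
    """
    m, k = a_fp8.shape
    acc = torch.zeros(m, b_fp8.shape[1], dtype=torch.float32,
                      device=a_fp8.device)
    for k0 in range(0, k, k_block):
        a_blk = a_fp8[:, k0:k0 + k_block].to(torch.float16)  # low-precision operands
        b_blk = b_fp8[k0:k0 + k_block, :].to(torch.float16)
        partial = a_blk @ b_blk                              # limited-precision partial sum
        acc += partial.to(torch.float32)                     # promote into FP32 accumulator
    return acc

a = torch.randn(64, 1024, device='cuda').to(torch.float8_e4m3fn)
b = torch.randn(1024, 64, device='cuda').to(torch.float8_e4m3fn)
c = gemm_with_promotion(a, b)
```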
DeepGEMM draws inspiration from established libraries such as CUTLASS and CuTe, yet it deliberately avoids heavy reliance on complex templates or algebraic frameworks. Instead, the focus is on a clean, accessible codebase that concentrates on optimizing GEMM operations for both normal and grouped configurations. Support for grouped GEMMs, designed for MoE models, is implemented in two forms: contiguous and masked layouts. Each is carefully structured to accommodate varying token counts per expert, reflecting the practical requirements of modern inference and training tasks.
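A hedged sketch of what a contiguous grouped layout involves: tokens routed to each expert are packed one expert after another along the M dimension, with each expert's segment padded up to a fixed M alignment so that a single kernel can sweep across all groups. The alignment value and helper below are illustrative assumptions for this example; DeepGEMM exposes its own utilities for querying the actual required alignment.

```python
import torch

def build_contiguous_layout(tokens_per_expert, hidden, m_alignment=128):
    """Illustrative packing for a contiguous grouped-GEMM layout.

    Each expert's tokens occupy one contiguous, alignment-padded segment
    along the M dimension; m_alignment=128 is an assumed value here.
    """
    segments, expert_ids = [], []
    for expert, m in enumerate(tokens_per_expert):
        padded_m = -(-m // m_alignment) * m_alignment  # round M up to the alignment
        seg = torch.zeros(padded_m, hidden)
        seg[:m] = torch.randn(m, hidden)               # this expert's real tokens
        segments.append(seg)
        expert_ids.append(torch.full((padded_m,), expert, dtype=torch.int32))
    # One flat (sum_of_padded_m, hidden) tensor that a grouped kernel can sweep.
    return torch.cat(segments), torch.cat(expert_ids)

x, ids = build_contiguous_layout([300, 50, 1000], hidden=4096)
print(x.shape, ids.shape)  # torch.Size([1536, 4096]) torch.Size([1536])
```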
Performance insights and considerations
The performance data reported in the DeepGEMM repository gives a clear picture of its efficiency gains. Testing on NVIDIA H800 GPUs with NVCC 12.8 shows that DeepGEMM achieves speedups over a carefully optimized CUTLASS-based implementation across a range of matrix dimensions. For normal GEMM operations, speedup factors range from about 1.4x to 2.7x, depending on the specific matrix shape. For the grouped GEMMs used in MoE models, both the contiguous and masked layouts show consistent, though more modest, improvements, with speedups of roughly 1.1x to 1.2x.
These performance gains are the result of several thoughtful design decisions. The library's JIT compilation strategy allows kernel parameters, such as block sizes, the number of pipeline stages, and warpgroups, to be optimized dynamically for specific GEMM shapes and hardware configurations. In addition, the use of Hopper's Tensor Memory Accelerator (TMA) helps optimize data movement, a significant factor in achieving high performance on modern GPU architectures. The repository also documents several utility functions that help developers align tensor dimensions and configure shared memory, so that the library can integrate smoothly into larger systems; a usage sketch follows below.
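As a rough sketch of how such utilities might be used, the snippet below pads the M dimension of a grouped input up to the alignment the library reports. The utility names (get_m_alignment_for_contiguous_layout, ceil_div) follow those listed in the repository's README, but they should be verified against the current API before use.

```python
import torch
import deep_gemm

# Utility names follow the repository's README; verify against the current API.
alignment = deep_gemm.get_m_alignment_for_contiguous_layout()

def pad_m_for_contiguous_layout(x: torch.Tensor) -> torch.Tensor:
    """Pad the M dimension up to the library's required alignment."""
    m, hidden = x.shape
    padded_m = deep_gemm.ceil_div(m, alignment) * alignment
    if padded_m == m:
        return x
    pad = torch.zeros(padded_m - m, hidden, device=x.device, dtype=x.dtype)
    return torch.cat([x, pad], dim=0)

tokens = torch.randn(300, 4096, device='cuda', dtype=torch.bfloat16)
padded = pad_m_for_contiguous_layout(tokens)
print(padded.shape)  # M rounded up to a multiple of the reported alignment
```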
In conclusion
DeepGEMM represents a measured and effective approach to the challenges of FP8 GEMM computation. By focusing on both precision and performance, the library offers an elegant solution for researchers and practitioners looking to optimize matrix multiplications on NVIDIA Hopper tensor cores. Its design emphasizes clarity and accessibility, pairing a concise codebase with runtime JIT compilation that eliminates precompilation steps. Whether for standard GEMMs or the more specialized grouped GEMMs required by MoE models, DeepGEMM offers a practical, well-documented platform for improving computational efficiency.
For those looking to improve the efficiency of their deep learning pipelines or to understand modern GPU optimization techniques, DeepGEMM is a valuable resource. The repository is released under the MIT license and invites further exploration and refinement with the support of the developer community.
Check out the GitHub repository. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 80k+ ML SubReddit.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.