This AI paper by DeepSeek-AI explores how DeepSeek-V3 delivers high-performance language modeling while minimizing hardware overhead and maximizing computational efficiency.

The growth in developing and deploying large language models (LLMs) is closely tied to architectural innovation, large-scale datasets, and hardware improvements. Models such as DeepSeek-V3, GPT-4o, Claude 3.5 Sonnet, and Llama-3 have demonstrated how scaling enhances reasoning and dialogue capabilities. However, as performance increases, so do the demands on compute, memory, and communication bandwidth, placing significant pressure on hardware. Without parallel progress in model and infrastructure co-design, these models remain accessible only to organizations with massive resources. This makes optimizing training cost, inference speed, and memory efficiency a key area of research.
The core challenge is the mismatch between model size and hardware capability. LLM memory consumption grows by more than 1000% per year, while high-speed memory bandwidth grows by less than 50%. During inference, caching prior context in key-value (KV) caches adds memory strain and slows down processing. Dense models activate all parameters for every token, driving up computational cost, especially for models with hundreds of billions of parameters. This results in billions of floating-point operations per token and high energy demand. Time per output token (TPOT), a key performance metric, also suffers, directly impacting user experience. These problems call for solutions that go beyond simply adding more hardware.
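For a rough sense of scale, the sketch below applies the common ~2 × (parameter count) FLOPs-per-token approximation for dense decoder-only models. The model sizes are hypothetical, and the estimate covers only the matrix multiplies, ignoring attention over the context, so it understates the full per-token cost.

```python
# Rough per-token compute for dense decoder-only models, using the common
# ~2 * (parameter count) FLOPs-per-token approximation (matrix multiplies only).
# Model sizes are hypothetical; attention over long contexts is ignored, so the
# true per-token cost is higher than shown here.

def dense_gflops_per_token(num_params: float) -> float:
    return 2 * num_params / 1e9

for name, params in [("7B dense", 7e9), ("70B dense", 70e9)]:
    print(f"{name}: ~{dense_gflops_per_token(params):,.0f} GFLOPs per generated token")
```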
Techniques such as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce memory usage by sharing key-value heads. Windowed KV caches lower memory further by storing only the most recent tokens, but can limit long-range understanding. Quantization to low-bit formats such as 4-bit and 8-bit compresses memory further, though sometimes at a cost in accuracy. Reduced-precision formats such as BF16 and FP8 improve training speed and efficiency. While useful, these techniques tend to address individual problems rather than offering a comprehensive answer to the scaling challenge.
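To make the MQA/GQA trade-off concrete, here is a minimal sketch comparing per-token KV-cache sizes under full multi-head attention, grouped-query attention, and multi-query attention. The layer count, head counts, head dimension, and BF16 precision are illustrative assumptions, not any specific model's configuration.

```python
# Per-token KV-cache size under MHA, GQA, and MQA for an assumed configuration
# (64 layers, 64 query heads, head dim 128, BF16). Purely illustrative numbers.

def kv_kb_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # Each layer caches one key and one value vector per KV head.
    return num_layers * num_kv_heads * head_dim * 2 * bytes_per_elem / 1024

layers, head_dim = 64, 128
for name, kv_heads in [("MHA (64 KV heads)", 64), ("GQA (8 KV heads)", 8), ("MQA (1 KV head)", 1)]:
    print(f"{name:>18}: {kv_kb_per_token(layers, kv_heads, head_dim):7.1f} KB per token")
```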
Researchers at DeepSeek-AI have proposed a more integrated and efficient strategy with DeepSeek-V3, designed to scale intelligently rather than excessively. Utilizing 2,048 NVIDIA H800 GPUs, the model achieves state-of-the-art performance while focusing on cost-efficiency. Instead of relying on expansive infrastructure, the team designed the model architecture to work in harmony with hardware constraints. Central to this effort are innovations such as Multi-head Latent Attention (MLA) for memory optimization, a Mixture-of-Experts (MoE) framework for computational efficiency, and FP8 mixed-precision training to accelerate performance without sacrificing accuracy. A custom multi-plane network topology is also employed to minimize inter-device communication overhead. Collectively, these components make DeepSeek-V3 a scalable and accessible solution, capable of rivaling much larger systems while operating on significantly leaner resources.
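The sketch below illustrates the generic top-k routing idea behind an MoE layer: a router scores the experts for each token and only the top-k experts actually run. It is a simplified stand-in, not DeepSeek-V3's actual router (which adds shared experts and load-balancing mechanisms); the expert count, top-k, and dimensions are assumed for illustration.

```python
import numpy as np

# Minimal sketch of top-k expert routing in a Mixture-of-Experts layer.
# Expert count, top-k, and dimensions are illustrative, not DeepSeek-V3's configuration.

rng = np.random.default_rng(0)
num_experts, top_k, d_model = 64, 4, 1024

# Each "expert" here is a single weight matrix standing in for a feed-forward block.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(num_experts)]
router_w = rng.standard_normal((d_model, num_experts)) * 0.02

def moe_forward(x):
    # Router scores -> probabilities -> pick top-k experts for this token.
    logits = x @ router_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    chosen = np.argsort(probs)[-top_k:]          # only these experts run
    gates = probs[chosen] / probs[chosen].sum()  # renormalize the chosen gates
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(f"Activated {top_k}/{num_experts} experts; output shape {out.shape}")
```

Only a small fraction of the layer's parameters is touched per token, which is the source of the compute savings described in the next paragraph.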
The architecture achieves memory efficiency by reducing the KV cache requirement per token to only 70 KB using MLA, compared to 327 KB and 516 KB for Qwen-2.5 and Llama-3.1, respectively. This reduction is accomplished by compressing the attention keys and values into a smaller latent vector jointly trained with the model. The MoE design further boosts computational efficiency: the total parameter count reaches 671 billion, but only 37 billion are activated per token. This stands in stark contrast to dense models that require full parameter activation. For example, Llama-3.1 needs 2,448 GFLOPs per token, while DeepSeek-V3 operates at just 250 GFLOPs. Additionally, the architecture integrates a Multi-Token Prediction (MTP) module, enabling the generation of multiple tokens in a single step. The system achieves up to 1.8× faster generation, with real-world measurements showing an 80-90% token acceptance rate for speculative decoding.
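A minimal sketch of the latent-compression idea behind MLA follows: instead of caching full per-head keys and values, a single low-dimensional latent vector per token is cached and up-projected at attention time. The dimensions are illustrative assumptions, and the real MLA design includes additional components (such as decoupled positional embeddings and joint training of the projections) not shown here.

```python
import numpy as np

# Simplified sketch of the latent KV-cache idea behind Multi-head Latent Attention:
# cache one small latent vector per token instead of full per-head keys/values,
# then up-project it when attention is computed. Dimensions are illustrative only.

rng = np.random.default_rng(0)
d_model, n_heads, head_dim, d_latent = 4096, 32, 128, 512

W_down = rng.standard_normal((d_model, d_latent)) * 0.02            # compress hidden state
W_up_k = rng.standard_normal((d_latent, n_heads * head_dim)) * 0.02  # expand to keys
W_up_v = rng.standard_normal((d_latent, n_heads * head_dim)) * 0.02  # expand to values

hidden = rng.standard_normal(d_model)   # hidden state for one new token
latent = hidden @ W_down                # this small vector is what gets cached

full_kv_bytes = 2 * n_heads * head_dim * 2   # K + V, BF16, one layer
latent_bytes = d_latent * 2                  # latent only, BF16, one layer
print(f"per-token, per-layer cache: {full_kv_bytes} B (full KV) vs {latent_bytes} B (latent)")

# At attention time the cached latent is expanded back into per-head keys/values.
k = (latent @ W_up_k).reshape(n_heads, head_dim)
v = (latent @ W_up_v).reshape(n_heads, head_dim)
print("reconstructed K/V shapes:", k.shape, v.shape)
```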
Using a system interconnected with CX7 400 Gbps InfiniBand NICs, DeepSeek-V3 achieves a theoretical TPOT of 14.76 milliseconds, equivalent to roughly 67 tokens per second. With higher-bandwidth setups such as NVIDIA GB200 NVL72 offering 900 GB/s, this could drop to 0.82 ms TPOT, potentially reaching around 1,200 tokens per second. Practical throughput is lower due to compute-communication overlap and memory limitations, but the framework lays the foundation for future high-speed implementations. FP8 precision adds further speed gains: the training framework applies tile-wise 1×128 and block-wise 128×128 quantization, with accuracy loss of less than 0.25% compared to BF16. These results were validated on smaller 16B and 230B parameter versions before integration into the 671B model.
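The sketch below illustrates block-wise scaled quantization in the spirit of the 128×128 weight blocks mentioned above. FP8 is only emulated here (NumPy has no FP8 dtype); the E4M3 maximum of 448 and the crude mantissa rounding are common FP8 conventions assumed for illustration, not the paper's exact kernels.

```python
import numpy as np

# Block-wise scaled quantization sketch: each 128x128 block gets its own scale,
# so an outlier in one block cannot squash the precision of the rest of the matrix.
# FP8 (E4M3) is emulated crudely; constants follow common FP8 practice.

FP8_MAX, BLOCK = 448.0, 128

def fake_fp8(x):
    # Crude E4M3 emulation: clamp to the format's range, keep ~3 mantissa bits.
    x = np.clip(x, -FP8_MAX, FP8_MAX)
    m, e = np.frexp(x)
    return np.ldexp(np.round(m * 16) / 16, e)

def quantize_blockwise(w):
    q = np.empty_like(w)
    scales = np.empty((w.shape[0] // BLOCK, w.shape[1] // BLOCK))
    for i in range(scales.shape[0]):
        for j in range(scales.shape[1]):
            blk = w[i*BLOCK:(i+1)*BLOCK, j*BLOCK:(j+1)*BLOCK]
            s = np.abs(blk).max() / FP8_MAX + 1e-12   # per-block scale
            scales[i, j] = s
            q[i*BLOCK:(i+1)*BLOCK, j*BLOCK:(j+1)*BLOCK] = fake_fp8(blk / s)
    return q, scales  # values stored in (emulated) FP8, one higher-precision scale per block

w = np.random.default_rng(0).standard_normal((256, 256))
q, scales = quantize_blockwise(w)
w_hat = q * np.kron(scales, np.ones((BLOCK, BLOCK)))
print(f"mean relative error: {np.abs(w - w_hat).mean() / np.abs(w).mean():.4f}")
```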
Key takeaways from the DeepSeek-V3 study include:
- MLA compression reduces the KV cache size per token from 516 KB (as in Llama-3.1) to 70 KB, greatly lowering memory requirements during inference.
- Of the 671 billion total parameters, only 37 billion are activated per token, greatly reducing compute and memory requirements without compromising model performance.
- DeepSeek-V3 requires only 250 GFLOPs per token, compared with 2,448 GFLOPs for dense models like Llama-3.1, underscoring its computational efficiency.
- Reaches 67 tokens per second (TPS) on a 400 Gbps InfiniBand network, with the potential to scale to around 1,200 TPS using advanced interconnects like NVL72.
- Multi-Token Prediction (MTP) improves generation speed by 1.8×, with a token acceptance rate of 80-90%, enhancing inference throughput (see the sketch after this list).
- FP8 mixed-precision training enables faster computation with accuracy degradation of less than 0.25%, validated through ablations on smaller 16B and 230B models before large-scale use.
- The model can run on a $10,000 server equipped with a consumer-grade GPU, delivering nearly 20 TPS and making high-performance LLMs far more accessible.
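As a quick sanity check on the reported MTP speedup referenced in the list above, the arithmetic below relates the speculative acceptance rate to expected tokens per decoding step, assuming one extra drafted token per step and ignoring drafting overhead; it is back-of-the-envelope reasoning, not the paper's measurement methodology.

```python
# Back-of-the-envelope link between MTP acceptance rate and generation speedup,
# assuming one extra token is drafted per step and draft overhead is ignored.

for acceptance in (0.80, 0.85, 0.90):
    tokens_per_step = 1 + acceptance   # 1 guaranteed token + the accepted draft token
    print(f"acceptance {acceptance:.0%} -> ~{tokens_per_step:.2f}x tokens per step")
```

At the reported 80-90% acceptance, this simple model predicts roughly 1.8-1.9 tokens per step, consistent with the 1.8× speedup cited above.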
In summary, the study presents a well-rounded framework for building powerful, resource-conscious large language models. By directly addressing fundamental constraints such as memory limits, high computational cost, and inference latency, the researchers demonstrate that intelligent architecture-hardware co-design can unlock high performance without relying on massive infrastructure. DeepSeek-V3 is a clear example that efficiency and scalability can coexist, enabling broader adoption of cutting-edge AI capabilities across a wider range of organizations. This approach shifts the narrative from scaling by brute force to scaling through smarter engineering.
Check out the paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 90K+ ML SubReddit.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
