
NVIDIA AI Introduces Fast-dLLM: A Training-Free Framework That Brings KV Cache and Parallel Decoding to Diffusion LLMs

Diffusion-based large language models (LLMs) are being explored as a promising alternative to traditional autoregressive models, offering the potential to generate multiple tokens simultaneously. By using bidirectional attention mechanisms, these models aim to accelerate decoding and, in theory, provide faster inference than autoregressive systems. In practice, however, diffusion models often struggle to deliver competitive inference speeds, limiting their ability to match the real-world performance of autoregressive LLMs.

The main challenge is the inefficiency of inference in diffusion-based LLMs. These models typically do not support the key-value (KV) cache mechanism, which is critical for speeding up inference by reusing previously computed attention states. Without KV caching, every new generation step in a diffusion model repeats the full attention computation, making it computationally intensive. Furthermore, when multiple tokens are decoded simultaneously, a key feature of diffusion models, generation quality often deteriorates because the conditional independence assumption disrupts token dependencies. This makes diffusion models unreliable for real-world deployment despite their theoretical advantages.
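To make the dependency problem concrete, here is a small, self-contained toy example (not from the paper): a joint distribution over two adjacent tokens in which only coherent pairs are likely. A decoder that samples each position independently from its marginal, as naive parallel decoding does, produces incoherent pairs about half the time, even though the joint distribution assigns them only 4% probability. The word pairs and numbers below are invented purely for illustration.

```python
import random

# Toy joint distribution over two adjacent token positions.
# Coherent word pairs carry almost all of the probability mass.
joint = {
    ("high", "school"): 0.48,
    ("junior", "college"): 0.48,
    ("high", "college"): 0.02,   # incoherent
    ("junior", "school"): 0.02,  # incoherent
}

def marginal(pos):
    """Marginal distribution for one position, which is all a parallel
    decoder sees if it assumes the two positions are independent."""
    m = {}
    for pair, p in joint.items():
        m[pair[pos]] = m.get(pair[pos], 0.0) + p
    return m

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

random.seed(0)
m0, m1 = marginal(0), marginal(1)

# Decode both positions independently many times and count how often the
# result is a pair the joint distribution considers nearly impossible.
trials = 10_000
incoherent = sum(
    joint[(sample(m0), sample(m1))] <= 0.02 for _ in range(trials)
)
print(f"incoherent pairs: {incoherent / trials:.1%}")  # ~50% vs. 4% under the joint
```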

Attempts to improve diffusion LLMs have focused on strategies such as block-wise generation and partial caching. For example, models such as LLaDA and Dream incorporate masked diffusion techniques to facilitate multi-token generation. However, they still lack an effective key-value (KV) cache system, and parallel decoding in these models often results in incoherent output. While some methods use auxiliary models to approximate token dependencies, these approaches introduce additional complexity without fully resolving the underlying performance issues. As a result, diffusion LLMs continue to lag behind autoregressive models in both generation speed and quality.

Researchers from NVIDIA, the University of Hong Kong, and MIT introduced Fast-dLLM, a framework that addresses these limitations without requiring retraining. Fast-dLLM brings two innovations to diffusion LLMs: a block-wise approximate KV cache mechanism and a confidence-aware parallel decoding strategy. The approximate KV cache is tailored to the bidirectional nature of diffusion models, allowing activations from previous decoding steps to be stored and reused efficiently. Confidence-aware parallel decoding selectively decodes tokens based on a confidence threshold, reducing errors caused by the token-independence assumption. This approach offers a balance between speed and generation quality, making it a practical solution for diffusion-based text generation tasks.
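The sketch below shows one way these two ideas could fit together in a block-wise decoding loop. It is an illustrative toy, not the authors' implementation: `toy_model`, `MASK`, the block size, and the threshold value are placeholders, and a real system would operate on model logits and cached attention states rather than random numbers.

```python
import random

random.seed(0)
MASK = None  # placeholder for a not-yet-decoded position

def toy_model(seq, span):
    """Stand-in for the diffusion LLM forward pass: for each masked position
    in the current block, return a (token_id, confidence) prediction.
    A real model would run bidirectional attention here, reusing the
    block-wise approximate KV cache for tokens outside the block."""
    return {i: (random.randrange(1000), random.random())
            for i in range(*span) if seq[i] is MASK}

def generate(prompt_ids, gen_len=16, block_size=8, threshold=0.7):
    seq = list(prompt_ids) + [MASK] * gen_len

    for start in range(len(prompt_ids), len(seq), block_size):
        span = (start, min(start + block_size, len(seq)))
        # Approximate KV cache: in Fast-dLLM, activations for tokens outside
        # this block would be computed once here and reused for every
        # decoding step inside the block.
        while any(seq[i] is MASK for i in range(*span)):
            preds = toy_model(seq, span)
            # Confidence-aware parallel decoding: accept every prediction
            # whose confidence clears the threshold; if none do, accept the
            # single most confident one so decoding always advances
            # (this fallback is an assumption of the sketch).
            accepted = {i: t for i, (t, c) in preds.items() if c >= threshold}
            if not accepted:
                i, (t, _) = max(preds.items(), key=lambda kv: kv[1][1])
                accepted = {i: t}
            for i, t in accepted.items():
                seq[i] = t
        # After a block is completed, the cache would be refreshed so that
        # later blocks attend to the newly generated tokens.

    return seq[len(prompt_ids):]

print(generate(prompt_ids=[1, 2, 3]))
```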

In more detail, Fast-dLLM's KV cache is implemented by dividing the sequence into blocks. Before a block is generated, the KV activations of the other blocks are computed and stored, then reused across subsequent decoding steps. After a block is generated, the cache is updated over all tokens, minimizing computational redundancy while maintaining accuracy. The DualCache version extends this approach by caching both prefix and suffix tokens, exploiting the high similarity between adjacent inference steps demonstrated by the cosine similarity heatmaps in the paper. For the parallel decoding component, the system evaluates the confidence of each token and decodes only those whose confidence exceeds a set threshold. This prevents dependency violations from simultaneous sampling and ensures higher-quality generation even when multiple tokens are decoded in a single step.
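As a concrete illustration of the thresholding step, the sketch below picks which masked positions to decode from a matrix of logits, taking each position's confidence as its maximum softmax probability. The function name, the tensor shapes, and the fallback of always decoding the single most confident position are assumptions of this sketch, not details confirmed by the article.

```python
import numpy as np

def confidence_select(logits, threshold=0.9):
    """Given logits of shape (num_masked_positions, vocab_size) for the
    masked positions of the current block, return the positions to decode
    this step and their chosen token ids. Confidence is the maximum
    softmax probability at each position."""
    # Numerically stable softmax over the vocabulary dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

    tokens = probs.argmax(axis=-1)      # greedy token per position
    confidence = probs.max(axis=-1)     # its probability = confidence

    chosen = np.flatnonzero(confidence >= threshold)
    if chosen.size == 0:
        # Decode at least the most confident position so the block keeps
        # filling in (assumed fallback for this sketch).
        chosen = np.array([confidence.argmax()])
    return chosen, tokens[chosen]

# Tiny usage example with fake logits for 4 positions over a 5-token vocabulary.
rng = np.random.default_rng(0)
fake_logits = rng.normal(size=(4, 5)) * 3
positions, token_ids = confidence_select(fake_logits, threshold=0.9)
print(positions, token_ids)
```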

Fast-dLLM achieves significant performance improvements in benchmark evaluations. On the GSM8K dataset, for example, it reaches a 27.6× speedup over the baseline model in the 8-shot configuration at a generation length of 1024 tokens, with 76.0% accuracy. On the MATH benchmark, it achieves a 6.5× speedup at roughly 39.3% accuracy. On HumanEval, it delivers up to a 3.2× acceleration while maintaining 54.3% accuracy, and on MBPP it reaches a 7.8× speedup at a generation length of 512 tokens. Across all tasks and models, accuracy remains within 1–2 points of the baseline, indicating that Fast-dLLM's acceleration does not meaningfully degrade output quality.

The research team effectively addressed the core bottlenecks of diffusion-based LLMs by introducing a novel caching strategy and a confidence-aware decoding mechanism. By improving inference efficiency and decoding quality, Fast-dLLM demonstrates that diffusion LLMs can approach or even surpass autoregressive models in speed while maintaining high accuracy, making them deployable in real-world language generation applications.


Check out the paper and project page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, join our 95k+ ML SubReddit, and subscribe to our newsletter.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.
