How Radial Attention reduces the cost of video diffusion by 4.4 times without sacrificing quality
Introduction to video diffusion models and their computational challenges
Diffusion models have made impressive progress in generating high-quality, coherent videos, building on their success in image synthesis. However, handling the extra temporal dimension in video sharply increases computational demands, especially because self-attention scales quadratically with sequence length. This makes it difficult to train or run these models efficiently on long videos. Approaches like Sparse VideoGen use attention head classification to speed up inference, but they struggle with accuracy and generalization during training. Other methods replace softmax attention with linear alternatives, though these often require significant architectural changes. Interestingly, the natural energy decay of physical signals over space and time inspires new, more efficient modeling strategies.
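To see why this matters, here is a rough back-of-the-envelope sketch (the frame count and tokens-per-frame value are illustrative assumptions, not figures from the paper): dense self-attention scores every token pair, so making a video 4× longer multiplies the attention work by roughly 16×.

```python
# Illustrative estimate only: dense 3D self-attention compares every
# spatiotemporal token with every other one, so cost grows quadratically
# with the total number of tokens.

def dense_attention_pairs(num_frames: int, tokens_per_frame: int = 1_560) -> int:
    """Number of query-key pairs scored by dense attention (hypothetical sizes)."""
    n = num_frames * tokens_per_frame  # total spatiotemporal tokens
    return n * n

base = dense_attention_pairs(num_frames=32)
longer = dense_attention_pairs(num_frames=128)  # a 4x longer video
print(f"4x more frames -> {longer / base:.0f}x more attention work")  # prints 16x
```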
The evolution of attention mechanisms in video synthesis
Early video models extended 2D architectures by adding temporal components, while newer approaches such as DiT and Latte improved spatiotemporal modeling with advanced attention mechanisms. Although 3D dense attention achieves state-of-the-art performance, its computational cost grows rapidly with video length, making long-video generation expensive. Techniques such as timestep distillation, quantization, and sparse attention help relieve this burden, but they often overlook the unique structure of video data. Alternatives such as linear or hierarchical attention improve efficiency, yet they often struggle in practice to preserve detail or scale to long videos.
Introduction to space-time energy decay and radial attention
Researchers from MIT, NVIDIA, Princeton, UC Berkeley, Stanford University, and First Intelligence have identified a phenomenon they call spatiotemporal energy decay in video diffusion models: attention scores between tokens decrease as spatial or temporal distance grows, mirroring how physical signals naturally fade. Inspired by this, they propose Radial Attention, a sparse attention mechanism with O(n log n) complexity. It uses a static attention mask in which each token attends mainly to spatially nearby tokens, with the attention window shrinking as temporal distance increases. This allows pre-trained models to generate videos up to four times longer, cutting training costs by 4.4× and inference time by 3.7× while preserving video quality.
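The sketch below illustrates the shrinking-window idea under simplifying assumptions (a 1D spatial layout and a window that halves each time the temporal distance doubles); it is not the authors' released implementation, but it shows how a static mask of this shape keeps only a near-O(n log n) fraction of the dense attention entries.

```python
import torch

def radial_style_mask(num_frames: int, tokens_per_frame: int,
                      base_window: int) -> torch.Tensor:
    """Boolean [n, n] mask; True means the query token may attend to the key token.

    Simplified sketch: each token attends to spatially nearby tokens, and the
    spatial window roughly halves every time the temporal distance doubles.
    """
    n = num_frames * tokens_per_frame
    idx = torch.arange(n)
    frame = idx // tokens_per_frame            # temporal index of each token
    pos = idx % tokens_per_frame               # spatial index within a frame (1D for simplicity)

    dt = (frame[:, None] - frame[None, :]).abs()   # temporal distance between tokens
    ds = (pos[:, None] - pos[None, :]).abs()       # spatial distance between tokens

    # Window shrinks exponentially with temporal distance:
    # base_window / 2^floor(log2(dt)) for dt >= 1, full base_window otherwise.
    shrink = torch.where(dt > 0,
                         torch.exp2(torch.floor(torch.log2(dt.float()))),
                         torch.ones_like(dt, dtype=torch.float32))
    window = base_window / shrink
    return ds <= window

# Example with small, hypothetical sizes: report how sparse the mask is.
mask = radial_style_mask(num_frames=16, tokens_per_frame=64, base_window=64)
print(f"retained entries: {mask.float().mean().item():.1%} of dense attention")
```

Because the retained window per frame pair decays with temporal distance, the total number of attended pairs grows roughly as n log n rather than n², which is where the efficiency gain comes from.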
Sparse attention using the principle of energy decay
Radial Attention is based on the insight that attention scores in video models decrease with growing spatial and temporal distance, a phenomenon the authors call spatiotemporal energy decay. Rather than attending to all tokens equally, Radial Attention strategically prunes the weaker interactions. It introduces a sparse attention mask whose coverage decays exponentially outward in both space and time, retaining only the most relevant interactions. This yields O(n log n) complexity, making it substantially faster and more efficient than dense attention. Additionally, with minimal LoRA-based fine-tuning, pre-trained models can be adapted to generate longer videos efficiently.
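As a rough illustration of the LoRA-based adaptation step, the sketch below wraps a frozen linear projection with a trainable low-rank update; the rank and scaling values are assumptions for demonstration, not the paper's training configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper (generic sketch, not the paper's training code).

    The frozen pre-trained projection is augmented with a trainable low-rank
    update B @ A, so only a small number of parameters needs fine-tuning when
    adapting a model to the sparse attention pattern and longer videos.
    """

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # keep pre-trained weights frozen
            p.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # start as a zero update (no behavior change)
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example: wrap one attention projection of a hypothetical transformer block.
q_proj = LoRALinear(nn.Linear(1024, 1024), rank=16)
out = q_proj(torch.randn(2, 256, 1024))  # (batch, tokens, channels)
```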
Evaluation on leading video diffusion models
Radial Attention was evaluated on three leading text-to-video diffusion models: Mochi 1, HunyuanVideo, and Wan2.1, showing improvements in both speed and quality across all of them. Compared with existing sparse attention baselines such as SVG and PowerAttention, it delivers better perceptual quality along with significant computational gains, including inference up to 3.7× faster and training costs 4.4× lower for extended videos. It scales effectively to 4× longer video lengths and remains compatible with existing LoRAs, including style LoRAs. Notably, LoRA fine-tuning with Radial Attention outperforms full fine-tuning in some cases, underscoring its effectiveness and resource efficiency for high-quality long-video generation.
Conclusion: Scalable and effective long video generation
In summary, Radial Attention is a sparse attention mechanism designed to handle long-video generation in diffusion models efficiently. Inspired by the observed decline in attention scores as spatial and temporal distances increase, which the researchers term spatiotemporal energy decay, the approach mimics this natural decay to reduce computation. It uses a static attention mask with exponentially shrinking windows, achieving up to 1.9× speedup and supporting videos up to 4× longer. With lightweight LoRA-based fine-tuning, it substantially reduces training (4.4×) and inference (3.7×) costs while preserving video quality across multiple state-of-the-art diffusion models.
Check out the Paper and GitHub page. All credit for this research goes to the researchers on the project.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.