Samsung researchers introduce ANSE (Active Noise Selection for Generation): a model-aware framework for improving text-to-video diffusion models through attention-based uncertainty estimation

Video generation models have become a core technology for creating dynamic content by converting text prompts into high-quality video sequences. Diffusion models in particular have established themselves as the primary approach to this task: they start from random noise and iteratively refine it into realistic video frames. Text-to-video (T2V) models extend this capability by aligning temporal dynamics and text cues with the generated content, producing videos that are both visually compelling and semantically accurate. Despite architectural advances such as latent diffusion models and motion-aware attention modules, a significant challenge remains: ensuring consistent, high-quality video generation across different runs, especially when the only change is the initial random noise seed. This challenge highlights the need for smarter, model-aware noise selection strategies that avoid unpredictable outputs and wasted compute.

The core problem lies in how diffusion models initialize their generation process from Gaussian noise. The specific noise seed used can strongly affect final video quality, temporal consistency, and prompt fidelity; the same text prompt can produce completely different videos depending on the random seed. Existing methods often address this with hand-crafted noise priors or frequency-based adjustments: FreeInit and FreqPrior apply external filtering techniques, while others such as PyoCo introduce structured noise patterns. However, these approaches rest on assumptions that may not hold across datasets or models, require multiple full sampling passes (incurring high computational cost), and fail to exploit the model’s internal attention signals, which could indicate which seeds are most promising. A more principled, model-aware approach is therefore needed to guide noise selection without heavy computational penalties or reliance on hand-crafted priors.
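To make the seed dependence concrete, here is a minimal sketch; the latent shape is an illustrative assumption, not CogVideoX’s actual one. Two different seeds yield entirely different starting latents for the same prompt, while reusing a seed is exactly reproducible:

```python
import numpy as np

# Toy video-latent shape (frames, channels, height, width) -- illustrative only.
LATENT_SHAPE = (16, 4, 32, 32)

def init_noise(seed, shape=LATENT_SHAPE):
    """Draw the initial Gaussian noise a diffusion sampler starts from."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape)

# Same prompt, different seeds: completely different starting points,
# which is why the final videos can differ so much across runs.
noise_a, noise_b = init_noise(0), init_noise(1)
print(float(np.abs(noise_a - noise_b).mean()))  # clearly nonzero
```

Everything downstream of this draw is deterministic in a standard sampler, so the seed is the single lever that ANSE tries to choose well.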

Samsung Research’s team introduced ANSE (Active Noise Selection for Generation), an active noise selection framework for video diffusion models. ANSE solves the noise selection problem by using internal model signals, specifically attention-based uncertainty estimation, to guide noise seed selection. At the core of ANSE is BANSA (Bayesian Active Noise Selection via Attention), a novel acquisition function that quantifies the consistency and confidence of the model’s attention maps under stochastic perturbation. To run efficiently at inference time, the research team approximates BANSA with Bernoulli-masked attention sampling, which injects randomness directly into the attention computation without requiring multiple full forward passes. This stochastic approach lets the model estimate the stability of its attention behavior across different noise seeds and select those that promote more confident and coherent attention patterns, which correlate with improved video quality.
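A minimal sketch of the Bernoulli-masking idea: randomly drop a fraction of attention connections so that repeated passes yield slightly different attention maps without rerunning the full network. The masking point (the attention logits) and the keep probability `p_keep` are assumptions for illustration, not the paper’s exact recipe:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bernoulli_masked_attention(q, k, p_keep=0.9, rng=None):
    """One stochastic attention map: Bernoulli-mask the attention logits
    before the softmax. Each call with a fresh rng gives a perturbed map,
    so K calls approximate K stochastic forward passes cheaply."""
    if rng is None:
        rng = np.random.default_rng()
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)          # (Nq, Nk) scaled dot-product logits
    mask = rng.random(logits.shape) < p_keep
    logits = np.where(mask, logits, -1e9)  # drop some connections at random
    return softmax(logits, axis=-1)        # rows remain valid distributions
```

With `p_keep=1.0` this reduces to ordinary scaled dot-product attention; lowering `p_keep` increases the spread among the sampled maps, which is exactly the signal BANSA measures.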

BANSA works by evaluating the entropy of attention maps produced at specific layers during early denoising steps. The researchers determined that layer 14 of the CogVideoX-2B model and layer 19 of the CogVideoX-5B model correlate sufficiently (above a 0.7 threshold) with full-layer uncertainty estimates, significantly reducing computational overhead. The BANSA score is computed by comparing the average entropy of the individual attention maps with the entropy of their mean, where a lower BANSA score indicates more confident and consistent attention patterns. This score is used to rank a pool of M = 10 candidate noise seeds, each evaluated with K = 10 stochastic forward passes. The seed with the lowest BANSA score is then used to generate the final video, improving quality without model retraining or external priors.
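The scoring described above can be sketched as follows. This assumes a BALD-style form consistent with the description (entropy of the mean map minus the mean entropy of the individual maps, so disagreement between stochastic passes raises the score), with toy row-stochastic matrices standing in for the model’s attention maps:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Mean Shannon entropy of the rows of a row-stochastic matrix."""
    return float(-(p * np.log(p + eps)).sum(axis=-1).mean())

def bansa_score(attn_maps):
    """BANSA-style score for one seed from K stochastic attention maps:
    entropy of the mean map minus the mean entropy of individual maps.
    Identical, confident maps score ~0; diverging maps score higher."""
    attn_maps = np.asarray(attn_maps)      # (K, Nq, Nk), rows sum to 1
    mean_map = attn_maps.mean(axis=0)
    return entropy(mean_map) - float(np.mean([entropy(a) for a in attn_maps]))

def select_seed(scores_by_seed):
    """Pick the candidate seed with the lowest BANSA score."""
    return min(scores_by_seed, key=scores_by_seed.get)
```

In the full pipeline, each of the M candidate seeds would get K Bernoulli-masked attention maps from an early denoising step at the chosen layer, and only the lowest-scoring seed proceeds to full video generation.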

On the CogVideoX-2B model, ANSE raises the total VBench score from 81.03 to 81.66 (+0.63), with a quality gain of +0.48 and a semantic consistency gain of +1.23. On the larger CogVideoX-5B model, ANSE improves the total VBench score from 81.52 to 81.71 (+0.25), with quality up +0.17 and semantic consistency up +0.60. Notably, these gains come at an inference-time increase of only 8.68% for CogVideoX-2B and 13.78% for CogVideoX-5B. By contrast, earlier methods such as FreeInit and FreqPrior require roughly a 200% increase in inference time, making ANSE significantly more efficient. Qualitative assessments further highlight these benefits, showing that ANSE improves visual clarity, semantic consistency, and motion portrayal. For example, videos for prompts such as “playing the piano” and “a zebra running” show more natural, anatomically correct motion under ANSE, while for dynamic prompts such as “an explosion,” ANSE-generated videos better capture the transitions.

The study also explored different acquisition functions, comparing BANSA against random noise selection and a purely entropy-based approach. BANSA with Bernoulli-masked attention achieved the highest total score (81.66 on CogVideoX-2B), outperforming random selection (81.03) and the entropy-based approach (81.13). The study also found that increasing the number of stochastic forward passes improves performance up to K = 10, beyond which it saturates; likewise, performance saturates at a noise pool size of M = 10. In a control experiment where seeds with the highest BANSA scores were intentionally selected, video quality degraded, confirming that lower BANSA scores are associated with better generation results.

While ANSE improves noise selection, it does not change the generation process itself, meaning that some low-quality seeds can still lead to suboptimal videos. The team acknowledges this limitation and suggests that BANSA is best viewed as a practical surrogate for more computationally intensive alternatives, such as sampling with every seed and post-filtering the results. They also suggest that future work could integrate information-theoretic refinements or active learning strategies to further improve generation quality.

Several key takeaways from the research include:

  • ANSE improves the total VBench score for video generation: from 81.03 to 81.66 on CogVideoX-2B, and from 81.52 to 81.71 on CogVideoX-5B.
  • Quality and semantic alignment gains are +0.48 and +1.23 on CogVideoX-2B, and +0.17 and +0.60 on CogVideoX-5B, respectively.
  • The inference-time increase is modest: +8.68% for CogVideoX-2B and +13.78% for CogVideoX-5B.
  • BANSA scores computed with Bernoulli-masked attention outperform random and entropy-based noise selection methods.
  • A layer selection strategy reduces computational load by estimating uncertainty at a single layer: layer 14 for CogVideoX-2B and layer 19 for CogVideoX-5B.
  • ANSE achieves its efficiency by avoiding the multiple full sampling passes that methods such as FreeInit require, which demand far more inference time.
  • The study confirms that low BANSA scores reliably correlate with higher video quality, making them an effective criterion for seed selection.

In summary, this study addresses the challenge of unpredictable video generation in diffusion models by introducing a model-aware noise selection framework that leverages internal attention signals. By quantifying uncertainty through BANSA and selecting noise seeds that minimize it, the researchers provide an effective and efficient way to improve video quality and semantic alignment in text-to-video models. ANSE’s design combines attention-based uncertainty estimation with computational efficiency, allowing it to scale across model sizes without significant runtime cost, offering a practical solution for enhancing video generation in T2V systems.


View the paper and project page. All credit for this research goes to the researchers on the project.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is both technically sound and accessible to a wide audience. The platform receives over 2 million monthly views, illustrating its popularity among readers.
