This AI Paper from Microsoft Introduces WINA: A Training-Free Sparse Activation Framework for Efficient Large Language Model Inference

Large Language Models (LLMs) with billions of parameters power AI-driven services across many industries. However, their massive scale and complex architecture make the computational cost of inference a major challenge. As these models continue to grow, balancing computational efficiency with output quality has become a central research question.
The core challenge lies in how LLMs perform inference. Every time an input is processed, the entire model is activated, consuming substantial compute. For most tasks this full activation is unnecessary, since only a small fraction of neurons contributes meaningfully to the final output. Existing sparse activation methods attempt to address this by selectively deactivating less important neurons. However, they usually consider only the magnitude of the hidden states, ignoring the critical role the weight matrices play in propagating errors through the network. This oversight can lead to high approximation errors and degrade model performance, especially at higher sparsity levels.
Sparse activation techniques include Mixture-of-Experts (MoE) approaches, used in models such as GPT-4 and Mixtral, which rely on additional training to learn routers that decide which experts to activate for a given input. Other methods, such as TEAL and CATS, reduce computation by pruning neurons based on the magnitude of their hidden activations, but they still leave room for improvement. These methods often struggle to balance sparsity and accuracy, since they may mistakenly deactivate important neurons or retain neurons with minimal impact. They also require model-specific threshold tuning, which limits their flexibility across architectures.
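Conceptually, such magnitude-based pruning keeps only the hidden-state entries with the largest absolute values and zeroes out the rest before the next matrix multiplication. The toy sketch below illustrates the general idea only; it is not the released implementation of TEAL or CATS, and the function name and `sparsity` parameter are our own.

```python
import torch

def magnitude_only_mask(hidden: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out all but the largest-|h| entries of a 1-D hidden state.

    The keep/drop decision uses |h| alone and ignores the downstream
    weight matrix, which is the limitation WINA targets.
    """
    k = max(1, int(hidden.numel() * (1.0 - sparsity)))  # number of entries to keep
    keep = torch.topk(hidden.abs(), k).indices
    mask = torch.zeros_like(hidden)
    mask[keep] = 1.0
    return hidden * mask

# Example: drop 65% of a toy hidden state based on magnitude alone.
h = torch.randn(16)
h_sparse = magnitude_only_mask(h, sparsity=0.65)
```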
Researchers from Microsoft, Renmin University of China, New York University, and the South China University of Technology have proposed a new approach called WINA (Weight Informed Neuron Activation) to address these problems. WINA introduces a sparse activation technique that uses hidden-state magnitudes together with the column-wise ℓ2 norms of weight matrices to determine which neurons to activate during inference. By accounting for the combined effect of input magnitude and weight importance, WINA produces a more effective sparsification strategy that adapts across the model's layers without any retraining or fine-tuning.
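As a rough sketch of this selection rule (based on the description above rather than the authors' released code; the helper name and `sparsity` parameter are our own), the weight-informed criterion for a linear layer y = Wx could look like the following:

```python
import torch

def wina_style_mask(x: torch.Tensor, weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Weight-informed sparse activation for a linear layer y = weight @ x.

    Each input element is scored by |x_i| * ||weight[:, i]||_2, i.e. its
    magnitude times the L2 norm of the weight column it multiplies, and
    only the top-K scoring elements are kept.
    """
    col_norms = torch.linalg.norm(weight, dim=0)   # ||W[:, i]||_2 for each column
    scores = x.abs() * col_norms                   # combined magnitude/weight criterion
    k = max(1, int(x.numel() * (1.0 - sparsity)))  # number of elements to keep
    keep = torch.topk(scores, k).indices
    mask = torch.zeros_like(x)
    mask[keep] = 1.0
    return x * mask

# Toy usage: sparsify the input of a 32 -> 64 linear layer at 65% sparsity.
W = torch.randn(64, 32)            # (out_features, in_features)
x = torch.randn(32)
y_approx = W @ wina_style_mask(x, W, sparsity=0.65)
```

In contrast to the magnitude-only rule above, an input element with a modest activation can still be kept if the weight column it feeds is large, and vice versa.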
The WINA method rests on a simple but powerful idea: neurons with strong activations and large weight magnitudes are more likely to influence downstream computation. To exploit this, WINA computes the element-wise product of hidden-state magnitudes and column-wise weight norms, then selects the top-K components according to this combined metric. This lets WINA build a sparse sub-network that retains the most important signals while discarding redundant activations. The method also includes a tensor transformation step that enforces column orthogonality in the weight matrices, ensuring that the theoretical error bounds translate into real-world performance. Together, these steps keep the approximation error tightly bounded while delivering substantial computational savings.
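One way such a column-orthogonalizing transformation could be realized, assuming an SVD-based rotation that is absorbed into the layer's input (a sketch under our own assumptions, not necessarily the paper's exact construction):

```python
import torch

# Take the SVD W = U @ diag(S) @ Vt and right-multiply W by V. The columns of
# W @ V = U @ diag(S) are mutually orthogonal, and because V is orthogonal the
# layer's output is unchanged if the input is rotated by Vt.
W = torch.randn(64, 32, dtype=torch.float64)
U, S, Vt = torch.linalg.svd(W, full_matrices=False)
V = Vt.T
W_orth = W @ V                      # column-orthogonal weight matrix

x = torch.randn(32, dtype=torch.float64)
y_original = W @ x
y_transformed = W_orth @ (Vt @ x)   # W @ V @ Vt @ x == W @ x
assert torch.allclose(y_original, y_transformed)

# The Gram matrix of a column-orthogonal matrix is diagonal.
gram = W_orth.T @ W_orth
off_diag = gram - torch.diag(torch.diagonal(gram))
assert off_diag.abs().max() < 1e-8
```

Because the rotation is orthogonal, the computation the layer performs is preserved, while the transformed weights gain the column-orthogonality that the method's error analysis relies on.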
The research team evaluated WINA on several large language models, including Qwen-2.5-7B, Llama-2-7B, Llama-3-8B, and Phi-4-14B, across a range of tasks and sparsity levels. WINA outperformed TEAL and CATS on every tested model and sparsity setting. For example, at 65% sparsity on Qwen-2.5-7B, WINA improved average performance by up to 2.94% over TEAL and by 1.41% over TEAL-Transform. On Llama-3-8B, WINA delivered a 1.06% gain at 50% sparsity and a 2.41% gain at 65% sparsity. Even at high sparsity levels, WINA retained stronger performance on reasoning-intensive tasks such as GSM8K and ARC Challenge. WINA also delivered consistent compute savings, reducing floating-point operations by up to 63.7% on Llama-2-7B and by 62.7% on Phi-4-14B.
All in all, WINA provides a robust, training-free solution for efficient large language model inference by combining hidden-state magnitudes with column-wise weight-matrix norms. The approach addresses the limitations of earlier methods such as TEAL, which rely on activation magnitude alone and incur larger approximation errors, delivering both improved accuracy and significant computational savings. The work represents an important step toward more efficient LLM inference methods that adapt to different models without any additional training.
Check out the paper and GitHub page. All credit for this research goes to the researchers of the project.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
