This AI Paper Introduces WINGS: A Dual-Learner Architecture that Prevents Text-Only Forgetting in Multimodal Large Language Models

Multimodal LLMs: Extending Capabilities Across Text and Vision
Extending large language models (LLMs) to handle multiple modalities, particularly images and text, enables more interactive and intuitive AI systems. Multimodal LLMs (MLLMs) can interpret visual content, answer questions about images, and hold conversations that involve both text and pictures. Their ability to reason across the visual and language domains makes them increasingly valuable in applications such as education, content generation, and interactive assistants.
The Challenge of Text-Only Forgetting in MLLMs
However, integrating vision into LLMs creates a problem. When trained on datasets that mix images with text, MLLMs often lose the ability to handle purely textual tasks. This phenomenon, known as text-only forgetting, occurs because the visual tokens inserted into the language sequence pull the model's attention away from the text. As a result, the MLLM begins to prioritize image-related content and performs poorly on tasks that require only language understanding, such as basic reasoning, reading comprehension, or text-based question answering.
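To make the attention-shift intuition concrete, here is a minimal diagnostic sketch (PyTorch, not from the paper) that measures how much of the attention mass from text-token queries lands on visual tokens in a single decoder layer; the function name and tensor shapes are illustrative assumptions.

```python
import torch

def attention_share_on_visual(attn_weights: torch.Tensor,
                              visual_mask: torch.Tensor) -> float:
    """Fraction of attention mass that text-token queries place on visual tokens.

    attn_weights: (heads, seq_len, seq_len) attention matrix from one decoder layer.
    visual_mask:  (seq_len,) boolean mask, True where the token is an image token.
    Illustrative diagnostic only; not part of the WINGS paper's code.
    """
    text_queries = ~visual_mask
    # Attention from every text query to every visual key, summed over heads.
    mass_on_visual = attn_weights[:, text_queries][:, :, visual_mask].sum()
    total_mass = attn_weights[:, text_queries].sum()
    return (mass_on_visual / total_mass).item()
```

If this share grows as visual tokens are added, text queries are spending less of their attention budget on the surrounding language, which is the shift the article attributes text-only forgetting to.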
Limitations of Existing Mitigation Strategies
Several methods attempt to address this degradation. Some reintroduce large amounts of text-only data during training, while others alternate between text-only and multimodal fine-tuning. These strategies aim to remind the model of its original language competence. Other designs add adapter layers or prompt tuning. However, these techniques often increase training cost, require complex switching logic at inference time, or fail to fully recover text understanding. The problem ultimately stems from how the model's attention shifts once image tokens are introduced into the sequence.
Introducing WINGS: Alibaba and Nanjing University's Dual-Learner Approach
Researchers from Alibaba Group's AI Business team and Nanjing University have introduced a new approach called WINGS. The design adds two new modules, visual and textual learners, to each layer of the MLLM. These learners operate in parallel with the model's core attention mechanism, a structure that resembles "wings" attached to either side of the attention layer. A routing component controls how much each learner contributes based on the current mix of tokens, allowing the model to dynamically balance its focus between visual and textual information. A sketch of how such a layer could be wired is shown below.
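The following PyTorch sketch illustrates one plausible wiring of a WINGS-style layer, assuming the base attention stays as-is and a router produces per-token mixing weights; the class name, shapes, and the soft-routing scheme are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class WingsStyleLayer(nn.Module):
    """Decoder layer with parallel 'wing' learners and a token-level router (sketch)."""

    def __init__(self, hidden: int, main_attn: nn.Module,
                 visual_learner: nn.Module, text_learner: nn.Module):
        super().__init__()
        self.main_attn = main_attn            # base attention of the LLM layer
        self.visual_learner = visual_learner  # attends over the visual tokens
        self.text_learner = text_learner      # attends over the text tokens
        self.router = nn.Linear(hidden, 2)    # per-token weights for the two learners

    def forward(self, hidden_states: torch.Tensor, visual_mask: torch.Tensor):
        base_out = self.main_attn(hidden_states)
        vis_out = self.visual_learner(hidden_states, visual_mask)
        txt_out = self.text_learner(hidden_states, ~visual_mask)
        # Soft routing: each token decides how much of each learner to mix in.
        weights = torch.softmax(self.router(hidden_states), dim=-1)
        return base_out + weights[..., :1] * vis_out + weights[..., 1:] * txt_out
```

The key design point is that both learners run in parallel with the original attention and are added to its output, so the base language pathway is never replaced, only supplemented.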
Low-Rank Residual Attention (LoRRA): Balancing Efficiency and Modality Awareness
The WINGS architecture uses a mechanism called Low-Rank Residual Attention (LoRRA), which keeps computation lightweight while allowing the learners to capture modality-specific information. In the first stage of training, only the visual learners are activated to align image features. In the second stage, the visual and textual learners are trained together with a router module that uses attention weights to assign responsibility between them. Each learner uses efficient low-rank attention blocks to interact with either the image tokens or the surrounding text, and their outputs are combined residually with the main model's output. This ensures that visual attention does not overwhelm text understanding. A sketch of such a learner block follows.
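Below is a minimal sketch of a low-rank residual attention learner, assuming queries come from the full hidden states while keys and values come only from one modality's tokens through rank-r projections; the class name, the default rank, and the masking interface are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRRABlock(nn.Module):
    """Low-rank residual attention learner for one modality (illustrative sketch)."""

    def __init__(self, hidden: int, rank: int = 64):
        super().__init__()
        # Low-rank projections keep the extra parameters and compute small.
        self.q_proj = nn.Linear(hidden, rank, bias=False)
        self.k_proj = nn.Linear(hidden, rank, bias=False)
        self.v_proj = nn.Linear(hidden, rank, bias=False)
        self.out_proj = nn.Linear(rank, hidden, bias=False)

    def forward(self, hidden_states: torch.Tensor, modality_mask: torch.Tensor):
        # hidden_states: (batch, seq, hidden); modality_mask: (seq,) bool
        q = self.q_proj(hidden_states)
        kv_tokens = hidden_states[:, modality_mask]   # only this modality's tokens
        k, v = self.k_proj(kv_tokens), self.v_proj(kv_tokens)
        attn = F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        return self.out_proj(attn @ v)  # added residually to the main branch's output
```

In this reading, the low rank keeps the learners' parameter count and compute small relative to the base layer, consistent with the article's claim that the extra modules stay lightweight.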
WINGS Performance on Text-Only and Multimodal Benchmarks
In terms of performance, WINGS shows strong results. On the MMLU dataset, it achieved a text-only score of 60.53, a 9.70-point improvement over a comparable baseline model. On CMMLU, it scored 69.82, 9.36 points above the baseline. On reasoning tasks it gained 11.9 points on RACE-High and 11.12 points on WSC. On multimodal benchmarks such as MMMU-VAL, WINGS improved by 4.78 points. It also delivered strong results on the IIT benchmark, handling interleaved text-and-image multi-turn dialogues more effectively than other open-source MLLMs of comparable scale.
Conclusion: Toward More Balanced and Generalizable MLLMs
In summary, the researchers address catastrophic text-only forgetting in MLLMs by introducing WINGS, an architecture that pairs dedicated visual and textual learners with attention-based routing. By analyzing attention shift and designing targeted interventions, they preserve text performance while enhancing visual understanding, yielding a more balanced and more effective multimodal model.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our newsletter.
Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who researches applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he explores new advancements and creates opportunities to contribute.
