This AI Paper Introduces WINGS: A Dual-Learner Architecture that Prevents Text-Only Forgetting in Multimodal Large Language Models

Multimodal LLMs: Extending Capabilities Across Text and Vision
Extending large language models (LLMs) to handle multiple modalities, particularly images and text, enables more interactive and intuitive AI systems. Multimodal LLMs (MLLMs) can interpret visual content, answer questions about images, and hold conversations that involve both text and pictures. Their ability to reason across the visual and language domains makes them increasingly valuable in applications such as education, content generation, and interactive assistants.
The Challenge of Text-Only Forgetting in MLLMs
However, integrating vision into LLMs creates a problem. When trained on datasets that mix images with text, MLLMs often lose the ability to handle purely textual tasks. This phenomenon, known as text-only forgetting, occurs because the visual tokens inserted into the language sequence pull the model's attention away from the text. As a result, the MLLM begins to prioritize image-related content and performs poorly on tasks that require only language understanding, such as basic reasoning, reading comprehension, or text-based question answering.
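To make the attention-shift intuition concrete, here is a minimal diagnostic sketch (PyTorch, not from the paper) that measures how much of the attention mass from text-token queries lands on visual tokens in a single decoder layer; the function name and tensor shapes are illustrative assumptions.

```python
import torch

def attention_share_on_visual(attn_weights: torch.Tensor,
                              visual_mask: torch.Tensor) -> float:
    """Fraction of attention mass that text-token queries place on visual tokens.

    attn_weights: (heads, seq_len, seq_len) attention matrix from one decoder layer.
    visual_mask:  (seq_len,) boolean mask, True where the token is an image token.
    Illustrative diagnostic only; not part of the WINGS paper's code.
    """
    text_queries = ~visual_mask
    # Attention from every text query to every visual key, summed over heads.
    mass_on_visual = attn_weights[:, text_queries][:, :, visual_mask].sum()
    total_mass = attn_weights[:, text_queries].sum()
    return (mass_on_visual / total_mass).item()
```

If this share grows as visual tokens are added, text queries are spending less of their attention budget on the surrounding language, which is the shift the article attributes text-only forgetting to.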
Limitations of Existing Mitigation Strategies
Several methods attempt to address this degradation. Some reintroduce large amounts of text-only data during training, while others alternate between text-only and multimodal fine-tuning. These strategies aim to remind the model of its original language competence. Other designs add adapter layers or prompt tuning. However, these techniques often increase training cost, require complex switching logic at inference time, or fail to fully recover text understanding. The problem ultimately stems from how the model's attention shifts once image tokens are introduced into the sequence.
Introducing WINGS: Alibaba and Nanjing University's Dual-Learner Approach
Researchers from Alibaba Group's AI Business team and Nanjing University have introduced a new approach called WINGS. The design adds two new modules, visual and textual learners, to each layer of the MLLM. These learners operate in parallel with the model's core attention mechanism, a structure that resembles "wings" attached to either side of the attention layer. A routing component controls how much each learner contributes based on the current mix of tokens, allowing the model to dynamically balance its focus between visual and textual information. A sketch of how such a layer could be wired is shown below.
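The following PyTorch sketch illustrates one plausible wiring of a WINGS-style layer, assuming the base attention stays as-is and a router produces per-token mixing weights; the class name, shapes, and the soft-routing scheme are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class WingsStyleLayer(nn.Module):
    """Decoder layer with parallel 'wing' learners and a token-level router (sketch)."""

    def __init__(self, hidden: int, main_attn: nn.Module,
                 visual_learner: nn.Module, text_learner: nn.Module):
        super().__init__()
        self.main_attn = main_attn            # base attention of the LLM layer
        self.visual_learner = visual_learner  # attends over the visual tokens
        self.text_learner = text_learner      # attends over the text tokens
        self.router = nn.Linear(hidden, 2)    # per-token weights for the two learners

    def forward(self, hidden_states: torch.Tensor, visual_mask: torch.Tensor):
        base_out = self.main_attn(hidden_states)
        vis_out = self.visual_learner(hidden_states, visual_mask)
        txt_out = self.text_learner(hidden_states, ~visual_mask)
        # Soft routing: each token decides how much of each learner to mix in.
        weights = torch.softmax(self.router(hidden_states), dim=-1)
        return base_out + weights[..., :1] * vis_out + weights[..., 1:] * txt_out
```

The key design point is that both learners run in parallel with the original attention and are added to its output, so the base language pathway is never replaced, only supplemented.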
Low-Rank Residual Attention (LoRRA): Balancing Efficiency and Modality Awareness
The WINGS architecture uses a mechanism called Low-Rank Residual Attention (LoRRA), which keeps computation lightweight while allowing the learners to capture modality-specific information. In the first stage of training, only the visual learners are activated to align image features. In the second stage, the visual and textual learners are trained together with a router module that uses attention weights to assign responsibility between them. Each learner uses efficient low-rank attention blocks to interact with either the image tokens or the surrounding text, and their outputs are combined residually with the main model's output. This ensures that visual attention does not overwhelm text understanding. A sketch of such a learner block follows.
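Below is a minimal sketch of a low-rank residual attention learner, assuming queries come from the full hidden states while keys and values come only from one modality's tokens through rank-r projections; the class name, the default rank, and the masking interface are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRRABlock(nn.Module):
    """Low-rank residual attention learner for one modality (illustrative sketch)."""

    def __init__(self, hidden: int, rank: int = 64):
        super().__init__()
        # Low-rank projections keep the extra parameters and compute small.
        self.q_proj = nn.Linear(hidden, rank, bias=False)
        self.k_proj = nn.Linear(hidden, rank, bias=False)
        self.v_proj = nn.Linear(hidden, rank, bias=False)
        self.out_proj = nn.Linear(rank, hidden, bias=False)

    def forward(self, hidden_states: torch.Tensor, modality_mask: torch.Tensor):
        # hidden_states: (batch, seq, hidden); modality_mask: (seq,) bool
        q = self.q_proj(hidden_states)
        kv_tokens = hidden_states[:, modality_mask]   # only this modality's tokens
        k, v = self.k_proj(kv_tokens), self.v_proj(kv_tokens)
        attn = F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        return self.out_proj(attn @ v)  # added residually to the main branch's output
```

In this reading, the low rank keeps the learners' parameter count and compute small relative to the base layer, consistent with the article's claim that the extra modules stay lightweight.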
WINGS Performance on Text-Only and Multimodal Benchmarks
In terms of performance, WINGS shows strong results. On the MMLU dataset, it achieved a text-only score of 60.53, a 9.70-point improvement over a comparable baseline model. On CMMLU, it scored 69.82, 9.36 points above the baseline. On reasoning tasks it gained 11.9 points on RACE-High and 11.12 points on WSC. On multimodal benchmarks such as MMMU-VAL, WINGS improved by 4.78 points. It also delivered strong results on the IIT benchmark, handling interleaved text-and-image multi-turn dialogues more effectively than other open-source MLLMs of comparable scale.
Conclusion: Toward More Balanced and Generalizable MLLMs
In summary, the researchers address catastrophic text-only forgetting in MLLMs by introducing WINGS, an architecture that pairs dedicated visual and textual learners with attention-based routing. By analyzing attention shift and designing targeted interventions, they preserve text performance while enhancing visual understanding, yielding a more balanced and more effective multimodal model.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our newsletter.
Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who researches applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he explores new advancements and creates opportunities to contribute.
