Qwen Releases Qwen2.5-VL-32B-Instruct: A 32B-Parameter VLM That Surpasses Qwen2.5-VL-72B and Models Such as GPT-4o Mini

In the evolving field of artificial intelligence, vision-language models (VLMs) have become indispensable tools for enabling computers to interpret and generate insights from visual and textual data. Despite these advances, balancing model performance with computational efficiency remains a challenge, especially when deploying large models in resource-constrained settings.
Qwen has launched Qwen2.5-VL-32B-Instruct, a 32-billion-parameter VLM that surpasses its larger predecessor, Qwen2.5-VL-72B, as well as models such as GPT-4o Mini, and is released under the Apache 2.0 license. This development reflects a commitment to open-source collaboration and meets the need for a high-performance yet manageable model.
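As a brief illustration of how an instruct-tuned VLM like this is typically queried, the sketch below builds a multimodal chat request in the "messages" format common to open VLM chat templates. The image URL and helper name here are illustrative assumptions, not part of the Qwen release.

```python
# Build a multimodal chat request in the common "messages" format used by
# many open VLM chat templates (an illustrative sketch, not official Qwen code).

def build_vlm_request(image_url: str, question: str) -> list[dict]:
    """Return a chat-style message list pairing one image with a text question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},  # the visual input
                {"type": "text", "text": question},     # the instruction
            ],
        }
    ]

# Example: ask the model to describe a chart (the URL is a placeholder).
messages = build_vlm_request(
    "https://example.com/chart.png",
    "Summarize the trend shown in this chart.",
)
```

A processor for the model would then render this list through its chat template before generation; the exact API depends on the serving stack in use.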
Technically, the Qwen2.5-VL-32B-Instruct model provides several enhancements:
- Visual understanding: The model excels at recognizing objects and analyzing text, charts, icons, graphics, and layouts within images.
- Agentic capability: It acts as a dynamic visual agent, able to reason and direct tools for computer and phone interactions.
- Video understanding: The model can comprehend videos over an hour long and pinpoint relevant segments, demonstrating advanced temporal localization.
- Object localization: It accurately identifies objects in images by generating bounding boxes or points, providing stable JSON output for coordinates and attributes.
- Structured output generation: The model supports structured outputs for data such as invoices, forms, and tables, benefiting applications in finance and commerce.
These features enhance the applicability of the model in various fields where nuanced multimodal understanding is required.
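The object-localization feature above returns coordinates as JSON. The sketch below shows how a client might validate such output; the field names (`bbox_2d`, `label`) follow a common grounding-output convention and are an assumption here, not a guaranteed schema.

```python
import json

def parse_detections(raw: str) -> list[dict]:
    """Parse a model's JSON localization output into validated detections.

    Assumes each entry carries a 4-number box [x1, y1, x2, y2] and a label,
    per a common grounding-output convention (field names are assumptions).
    """
    detections = []
    for item in json.loads(raw):
        x1, y1, x2, y2 = item["bbox_2d"]
        if x2 <= x1 or y2 <= y1:
            continue  # skip degenerate boxes
        detections.append({"label": item["label"], "box": (x1, y1, x2, y2)})
    return detections

# Hypothetical model output for an invoice image.
raw_output = '[{"bbox_2d": [40, 60, 380, 120], "label": "invoice number"}]'
dets = parse_detections(raw_output)
```

Validating the JSON on the client side this way guards downstream pipelines against malformed or degenerate boxes.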
The empirical evaluation highlights the advantages of the model:
- Visual tasks: On the Massive Multi-discipline Multimodal Understanding (MMMU) benchmark, the model scored 70.0, surpassing Qwen2-VL-72B's 64.5. On MathVista, it achieved 74.7 versus the earlier 70.5. Notably, on OCRBenchV2 the model scored 57.2/59.1, a significant improvement over the previous 47.8/46.1. On Android Control tasks, it reached 69.6/93.3, exceeding the prior 66.4/84.4.
- Text tasks: The model scores 78.4 on MMLU and delivers competitive performance relative to models such as GPT-4o Mini in some areas, including mathematical reasoning.
These results highlight the model's balanced capabilities across diverse tasks.
In short, Qwen2.5-VL-32B-Instruct represents a major advance in vision-language modeling, achieving a harmonious balance of performance and efficiency. Its open-source availability under the Apache 2.0 license encourages the global AI community to explore, adapt, and build on this powerful model, potentially accelerating innovation and application across sectors.
Check out the model weights. All credit for this research goes to the researchers on the project.
The post Qwen Releases Qwen2.5-VL-32B-Instruct: A 32B-Parameter VLM That Surpasses Qwen2.5-VL-72B and Models Such as GPT-4o Mini appeared first on MarkTechPost.