Baidu Releases ERNIE-4.5-VL-28B-A3B-Thinking: A Compact, Open-Source Multimodal Reasoning Model in the ERNIE-4.5 Family
How do you get flagship-level multimodal reasoning over documents, charts, and videos while running only a 3B-class activation budget in production? Baidu has added a new model to the open-source ERNIE-4.5 family: ERNIE-4.5-VL-28B-A3B-Thinking, a vision-language model focused on document, chart, and video understanding with a small active-parameter budget.

Architecture and Training Setup
ERNIE-4.5-VL-28B-A3B-Thinking is built on the ERNIE-4.5-VL-28B-A3B Mixture-of-Experts (MoE) architecture. The series is designed as a heterogeneous multimodal MoE, with parameters shared across text and vision plus modality-specific experts. The published checkpoint totals about 30B parameters (the name reflects the 28B-VL branch of the architecture), and only 3B parameters are activated per token through the A3B routing scheme. This gives the model the compute and memory profile of a 3B-class model while retaining a much larger pool of expert capacity for reasoning.
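To make the total-versus-active distinction concrete, here is a minimal, self-contained sketch of top-k expert routing. It is illustrative only: the real ERNIE router, expert sizes, and modality-specific experts are not described in this article, and every dimension below is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Illustrative top-k MoE layer: many experts exist, few run per token."""

    def __init__(self, d_model=64, n_experts=16, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # per-token gating scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x)                          # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # pick k experts/token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e                 # tokens routed to e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

layer = TinyMoELayer()
tokens = torch.randn(10, 64)
print(layer(tokens).shape)
total = sum(p.numel() for p in layer.experts.parameters())
active = total * layer.top_k // len(layer.experts)  # rough per-token share
print(f"expert params total: {total:,}, roughly active per token: {active:,}")
```

Only about top_k / n_experts of the expert parameters participate in any single token's forward pass, which is the same reason a ~30B-total model can run with a 3B-class activation cost.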
The model underwent an additional mid-training stage on a large corpus of visual-language reasoning data. This stage aims to improve representation power and semantic alignment between the visual and linguistic modalities, which matters for dense text in documents and fine structure in diagrams. Most importantly, ERNIE-4.5-VL-28B-A3B-Thinking uses multimodal reinforcement learning on verifiable tasks, combining GSPO and IcePop strategies with dynamic difficulty sampling to stabilize MoE training and keep pushing the model toward difficult examples.
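Baidu's exact GSPO and IcePop formulations are not spelled out here, but dynamic difficulty sampling itself can be sketched generically: keep RL rollouts focused on prompts the current policy neither always solves nor always fails. The thresholds and the `solve_rate` oracle below are assumptions for illustration.

```python
import random

def dynamic_difficulty_sample(prompts, solve_rate, k=8, lo=0.2, hi=0.8):
    """Sample a training batch biased toward 'frontier' prompts.

    solve_rate(prompt) -> empirical pass rate of the current policy on a
    verifiable task (e.g., fraction of n rollouts that pass the checker).
    Prompts that are nearly always solved (> hi) or nearly never solved
    (< lo) carry little learning signal, so they are filtered out first.
    """
    frontier = [p for p in prompts if lo <= solve_rate(p) <= hi]
    pool = frontier if len(frontier) >= k else prompts  # fall back if sparse
    return random.sample(pool, k)

# Toy usage: pretend pass rates are already measured for a handful of prompts.
rates = {"easy": 0.95, "hard": 0.05, "mid1": 0.50, "mid2": 0.40,
         "mid3": 0.60, "mid4": 0.30, "mid5": 0.55, "mid6": 0.70}
batch = dynamic_difficulty_sample(list(rates), rates.get, k=4)
print(batch)  # only mid-difficulty prompts are eligible here
```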
Key Capabilities
Baidu researchers position the model as a lightweight multimodal reasoning engine that activates only 3B parameters per token while approaching the behavior of much larger flagship systems on internal benchmarks. Officially listed capabilities include visual reasoning, STEM reasoning, visual grounding, thinking with images, tool utilization, and video understanding.
Thinking with images is central: the model can zoom into a region, reason over the cropped view, and then integrate those local observations into a final answer. When internal knowledge falls short, tool utilization extends this by invoking external tools such as image search. Both capabilities are exposed through the reasoning-parser and tool-call-parser paths at deployment time.
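The article describes this loop only at a high level. A minimal sketch of the crop-and-re-query pattern is below; `ask_model` is a hypothetical stand-in for whatever served endpoint you deploy (only the PIL cropping is concrete), and the box coordinates are invented for illustration.

```python
from PIL import Image

def ask_model(image, question):
    # Hypothetical stand-in for the deployed VLM endpoint; replace this with
    # a real call to your served ERNIE-4.5-VL model (vLLM, FastDeploy, etc.).
    # The Thinking variant can decide on its own to request a zoom, which the
    # reasoning/tool-call parsers surface as a structured tool call.
    return f"[model answer to {question!r} on a {image.size} image]"

def zoom_and_answer(path, question, box):
    """Emulate one 'think with images' step: crop a region, then re-query."""
    full = Image.open(path)
    # box = (left, top, right, bottom) in pixels, e.g. a chart legend.
    crop = full.crop(box)
    local = ask_model(crop, f"Zoomed view. {question}")
    # Integrate the local observation back into the full-image context.
    return ask_model(full, f"Given this detail: {local}\n{question}")

# Toy demo on a blank placeholder image.
Image.new("RGB", (800, 600), "white").save("demo.png")
print(zoom_and_answer("demo.png", "What does the legend say?", (500, 20, 780, 120)))
```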
Performance and Positioning
Compared with Qwen2.5-VL-7B and Qwen2.5-VL-32B, the lightweight vision-language model ERNIE-4.5-VL-28B-A3B achieves competitive or superior performance on many benchmarks while using fewer active parameters. The ERNIE-4.5-VL model also supports thinking and non-thinking modes, with thinking mode improving reasoning-focused tasks while maintaining strong perceptual quality.
For the Thinking variant specifically, Baidu researchers describe ERNIE-4.5-VL-28B-A3B-Thinking as closely matching the performance of industry flagship models on internal multimodal benchmarks.
Main Points
- ERNIE-4.5-VL-28B-A3B-Thinking uses a Mixture-of-Experts architecture with approximately 30B total parameters and only 3B active parameters per token, enabling efficient multimodal reasoning.
- The model is optimized for document, chart, and video understanding, with an additional mid-training stage for visual-language reasoning and multimodal reinforcement learning using GSPO, IcePop, and dynamic difficulty sampling.
- Thinking with images lets the model iteratively zoom into image regions and reason over the crops, while tool utilization can invoke external tools such as image search for long-tail recognition.
- It demonstrates strong performance on chart and diagram analysis, STEM circuit problems, visual grounding with JSON bounding-box output, and video clip localization with timestamped answers.
- The model is released under the Apache License 2.0, supports deployment through Transformers, vLLM, and FastDeploy, and can be fine-tuned through ERNIEKit with SFT, LoRA, and DPO, enabling commercial multimodal applications (a minimal inference sketch follows below).
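As a concrete starting point, a generic Hugging Face Transformers chat-template call might look like the sketch below. The exact processor classes and message schema for this model should be taken from Baidu's model card; the image URL is a placeholder, and the prompt illustrates the JSON bounding-box grounding use case from the list above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"  # Hugging Face repo id

# trust_remote_code pulls in the model's custom multimodal classes.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto",
    trust_remote_code=True,
)

# One grounding-style request: ask for boxes as JSON, per the use case above.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},  # placeholder
        {"type": "text",
         "text": "Locate every axis label and return JSON bounding boxes."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, past the prompt length.
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```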
Comparison Table
| Model | Training stage | Total / active parameters | Modalities | Context length (tokens) |
|---|---|---|---|---|
| ERNIE-4.5-VL-28B-A3B-Base | Pre-training | 28B total, 3B active per token | Text, vision | 131,072 |
| ERNIE-4.5-VL-28B-A3B (PT) | Post-trained chat model | 28B total, 3B active per token | Text, vision | 131,072 |
| ERNIE-4.5-VL-28B-A3B-Thinking | Reasoning-oriented mid-training on top of ERNIE-4.5-VL-28B-A3B | 28B-branch architecture, 3B active per token, ~30B parameters as sized on Hugging Face | Text, vision | 131,072 (the FastDeploy example uses a 131,072 max model length) |
| Qwen2.5-VL-7B-Instruct | Post-trained vision-language model | ~8B total (7B class) | Text, images, video | 32,768 text positions in the config (max_position_embeddings) |
| Qwen2.5-VL-32B-Instruct | Post-trained large VL model with enhanced tuning | ~33B total | Text, images, video | 32,768 text positions (same Qwen2.5-VL text-config family) |
ERNIE-4.5-VL-28B-A3B-Thinking is a practical release for teams that want multimodal reasoning over documents, charts, and videos with only 3B active parameters, while still drawing on a Mixture-of-Experts architecture with ~30B total parameters under the Apache License 2.0. It combines thinking with images, tool utilization, and multimodal reinforcement learning into a deployable stack that directly targets real-world document, chart, and video understanding workloads.
Check out the repo, model weights, and technical details.