Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics
Although robotic control has recently progressed through large-scale vision-language-action (VLA) models, real-world deployment is still limited by hardware and data requirements. Most VLA models depend on transformer-based backbones with billions of parameters, which brings significant memory and compute costs. This restricts experimentation to well-resourced labs and cloud environments and excludes practitioners working with low-cost hardware. Furthermore, much of the current progress in VLA research is either proprietary or based on non-reproducible methods, which hinders open research. Finally, data heterogeneity across robotic platforms (differences in morphology, sensors, and control modes) poses further challenges to generalization and cross-platform learning.
Hugging Face Introduces SmolVLA: A Lightweight, Open VLA Framework
Hugging Face presents SmolVLA, a compact vision-language-action model developed for affordability and deployment efficiency. Unlike conventional VLAs, SmolVLA is trained entirely on community-collected datasets and is optimized to run on a single GPU or in CPU-only environments. The model architecture combines a pretrained vision-language model (SmolVLM-2), trimmed for efficiency, with a transformer-based action expert. This structure enables efficient low-level control from natural language instructions and RGB camera inputs.
A notable feature of SmolVLA is its asynchronous inference stack, which decouples action prediction from execution. This design enables low-latency control suitable for real-time applications, even in resource-constrained settings. SmolVLA is released under an open license with accompanying code, training data, and deployment tools.
Architectural Overview and Design Trade-offs
The SmolVLA model consists of two main components:
- Perception module (SmolVLM-2): A pretrained, compact vision-language encoder processes sequences of RGB images, sensorimotor states, and language instructions. To improve efficiency, the model limits the number of visual tokens through downsampling and uses only the lower half of the transformer layers, based on empirical findings that earlier layers often produce more transferable features.
- Action expert: A lightweight transformer, trained with flow matching, predicts sequences of continuous control actions. The action expert alternates between self-attention and cross-attention layers, balancing internal action coherence against conditioning on the perception inputs. Causal masking is applied to enforce temporal consistency.
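The mechanisms named in the two components above can be illustrated with a small, dependency-free sketch. It is not the SmolVLA implementation: `keep_lower_half`, `causal_mask`, `euler_flow_sample`, `toy_velocity`, and `TARGET` are hypothetical names, the velocity field is a hand-written oracle rather than a trained network, and the time convention (integrating from noise at t=0 to an action at t=1 along a straight-line path) is a common flow-matching choice assumed here, not one confirmed by the article.

```python
import random


def keep_lower_half(layers):
    """Keep only the first (lowest) half of a stack of transformer layers."""
    return layers[: len(layers) // 2]


def causal_mask(n):
    """mask[i][j] is True when action token i may attend to token j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def euler_flow_sample(velocity, noise, steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k / steps  # t stays strictly below 1
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Assumption: a toy oracle field that knows the target action. In a real
# action expert this would be a learned network conditioned on perception.
TARGET = [0.2, -0.5, 0.9]


def toy_velocity(x, t):
    # Velocity of the straight-line path from the current point to TARGET.
    return [(g - xi) / (1.0 - t) for g, xi in zip(TARGET, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in TARGET]
action = euler_flow_sample(toy_velocity, noise, steps=10)
print(action)  # lands on TARGET up to floating-point error
```

With the oracle field, the final Euler step moves the sample exactly onto the target, which makes the role of the integrator easy to see. A learned velocity field would only approximate this behavior.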
To reduce computational overhead, linear projections are used to align the token dimensions of the different modalities. The model generates chunks of actions instead of single-step predictions, which reduces the frequency of inference calls. Training uses bfloat16 precision and Torch's JIT compilation for runtime optimization.
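The effect of predicting several actions per forward pass can be made concrete with a short sketch that counts how often a policy is queried during a fixed-length control loop. `hypothetical_policy` and `count_inference_calls` are illustrative names, and the chunk sizes are arbitrary values chosen for the example, not settings taken from SmolVLA.

```python
from collections import deque


def hypothetical_policy(chunk_size):
    """Stand-in for one model forward pass that returns a chunk of actions."""
    return [0.0] * chunk_size


def count_inference_calls(total_steps, chunk_size):
    """Run a control loop and count how often the policy must be queried."""
    calls = 0
    pending = deque()  # actions predicted but not yet executed
    for _ in range(total_steps):
        if not pending:
            pending.extend(hypothetical_policy(chunk_size))
            calls += 1
        pending.popleft()  # "execute" the next action
    return calls


print(count_inference_calls(100, 1))   # single-step prediction
print(count_inference_calls(100, 50))  # chunked prediction
```

For a fixed number of control steps, the number of forward passes shrinks roughly in proportion to the chunk size. The trade-off is that actions later in a chunk are based on an older observation.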
Empirical Evaluation: Simulation and Real-World Performance
SmolVLA was evaluated on simulation benchmarks (LIBERO and Meta-World) and on real-world robotic tasks using the low-cost SO100 and SO101 platforms. The model is trained from scratch on ~23K episodes across 481 community datasets, with task labels generated automatically by a VLM. Evaluation metrics include task-level success rates under both in-distribution and out-of-distribution conditions.
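As a rough illustration of the metric, a task-level success rate is the fraction of successful episodes per task and evaluation condition. The sketch below assumes a hypothetical episode log of `(task, condition, succeeded)` tuples. The entries are made-up placeholders, not results from the SmolVLA evaluation.

```python
from collections import defaultdict


def success_rates(episodes):
    """Return {(task, condition): success fraction} from episode outcomes."""
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, trials]
    for task, condition, succeeded in episodes:
        entry = counts[(task, condition)]
        entry[0] += int(succeeded)
        entry[1] += 1
    return {key: wins / trials for key, (wins, trials) in counts.items()}


# Hypothetical log, for illustration only.
episodes = [
    ("stack", "in-distribution", True),
    ("stack", "in-distribution", True),
    ("stack", "in-distribution", False),
    ("stack", "in-distribution", True),
    ("stack", "out-of-distribution", True),
    ("stack", "out-of-distribution", False),
]

for key, rate in success_rates(episodes).items():
    print(key, f"{rate:.0%}")
```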
On the LIBERO benchmark, SmolVLA (0.45B) achieves an average success rate of 87.3%, closely matching or surpassing larger models such as π₀ (3.3B). On Meta-World, the model outperforms diffusion policies and smaller-scale VLAs across task difficulty levels. These results are notable given SmolVLA's smaller training footprint and the absence of robotics-specific pretraining.

In real-world settings, SmolVLA achieves an average success rate of 78.3% across pick-and-place, stacking, and sorting tasks, outperforming both ACT (trained from scratch) and π₀ (fine-tuned). Furthermore, SmolVLA generalizes across robot embodiments: despite being trained only on SO100 data, it maintains its performance on SO101.
Performance Implications of Asynchronous Inference
SmolVLA's asynchronous inference stack improves control efficiency by overlapping prediction and execution. Compared with traditional synchronous inference, this approach reduces the average task time by about 30% and doubles the number of actions completed in fixed-time scenarios. This is especially beneficial for edge deployments, where inference delays degrade real-time performance.
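The idea of overlapping prediction with execution can be sketched with a producer thread and a bounded queue. This is a simplified illustration, not SmolVLA's actual stack: `predict_chunk`, `run_async`, and `idealized_times` are hypothetical names, the "observation" is just a counter, and the timing model assumes constant per-chunk inference and execution times.

```python
import queue
import threading


def predict_chunk(observation, chunk_size):
    """Stand-in for a policy forward pass that returns a chunk of actions."""
    return [(observation, i) for i in range(chunk_size)]


def run_async(num_chunks=3, chunk_size=4):
    """Execute chunks on the main thread while a worker predicts the next one."""
    chunks = queue.Queue(maxsize=1)  # at most one chunk buffered ahead

    def predictor():
        for observation in range(num_chunks):
            chunks.put(predict_chunk(observation, chunk_size))
        chunks.put(None)  # sentinel: no more chunks

    worker = threading.Thread(target=predictor, daemon=True)
    worker.start()

    executed = []
    while True:
        chunk = chunks.get()
        if chunk is None:
            break
        for action in chunk:
            executed.append(action)  # stand-in for sending to the robot
    worker.join()
    return executed


def idealized_times(num_chunks, t_infer, t_exec):
    """Total time without and with overlap, assuming constant durations."""
    sync = num_chunks * (t_infer + t_exec)
    overlapped = t_infer + (num_chunks - 1) * max(t_infer, t_exec) + t_exec
    return sync, overlapped


print(len(run_async(num_chunks=3, chunk_size=4)))
```

In the synchronous case the robot idles during every forward pass. With overlap, only the first forward pass is paid in full as long as inference is no slower than executing a chunk. The actual savings depend on the hardware and the task.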
Conclusion
SmolVLA shows that compact, reproducible, and open-source VLA models can support efficient robotic control on low-cost hardware. Through careful architectural choices (layer pruning, chunked action prediction, and asynchronous execution), SmolVLA maintains performance while greatly reducing computational requirements.
The model's open training and deployment stack, combined with real-world evaluations, lays a practical foundation for further research on efficient and accessible robot learning. Future directions include expanding cross-embodiment datasets, scaling model capacity without sacrificing latency, and exploring joint training on multimodal corpora beyond robotics data.
Check out the paper and model on Hugging Face. All credit for this research goes to the researchers on the project.

Asif Razzaq is the CEO of Marktechpost Media Inc. As an entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that offers in-depth coverage of machine learning and deep learning news that is both technically sound and understandable to a wide audience. The platform has over 2 million views per month.