Meta AI Releases V-JEPA 2: An Open-Source Self-Supervised World Model for Understanding, Prediction, and Planning

Meta AI has released V-JEPA 2, a scalable open-source world model designed to learn from video at internet scale and enable robust visual understanding, future-state prediction, and zero-shot planning. Built on the Joint Embedding Predictive Architecture (JEPA), V-JEPA 2 demonstrates how self-supervised learning from passive internet video, combined with minimal robot interaction data, can yield a modular foundation for intelligent physical agents.

Scalable self-supervised pretraining from 1 million hours of video

V-JEPA 2 is pretrained on more than 1 million hours of internet-scale video plus 1 million images. The model uses a visual mask denoising objective, learning to reconstruct masked spatiotemporal patches in a latent representation space. This avoids the inefficiency of pixel-level prediction by focusing on predictable scene dynamics while ignoring irrelevant noise.
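To make the objective concrete, below is a minimal sketch of a JEPA-style masked latent-prediction loss. The `context_encoder`, `target_encoder`, and `predictor` callables are hypothetical stand-ins for the actual ViT modules, not Meta's implementation.

```python
# Minimal sketch of a JEPA-style masked latent-prediction loss (illustrative, not Meta's code).
import torch
import torch.nn.functional as F

def jepa_masked_loss(context_encoder, target_encoder, predictor, video_patches, mask):
    """video_patches: (B, N, D) flattened spatiotemporal patch tokens.
    mask: (B, N) boolean, True where patches are hidden from the context encoder."""
    with torch.no_grad():
        # Target representations come from a frozen/EMA "teacher" encoder over the full clip.
        targets = target_encoder(video_patches)                     # (B, N, D_latent)
    # The context encoder only sees the visible patches (zeroing is a simplification here).
    context = context_encoder(video_patches * (~mask).unsqueeze(-1).float())
    # The predictor fills in latent representations at the masked locations.
    preds = predictor(context, mask)                                # (B, N, D_latent)
    # Regression in latent space -- no pixel reconstruction.
    return F.smooth_l1_loss(preds[mask], targets[mask])
```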

To scale JEPA pretraining to this level, Meta researchers introduced four key techniques:

  • Data scaling: A 22M-sample dataset (VideoMix22M) was constructed from public sources including SSv2, Kinetics, HowTo100M, YT-Temporal-1B, and ImageNet.
  • Model scaling: The encoder capacity was expanded to more than 1B parameters using ViT-g.
  • Training schedule: A progressive resolution strategy was adopted, and pretraining was extended to 252K iterations (see the sketch after this list).
  • Spatial-temporal augmentation: The model was trained on progressively longer and higher-resolution clips, reaching 64 frames at 384×384 resolution.
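As an illustration of such a schedule, the sketch below partitions the 252K pretraining iterations into progressively longer, higher-resolution stages. Only the final 64-frame, 384×384 endpoint is reported in the release; the intermediate stage boundaries and values here are assumptions for illustration.

```python
# Illustrative progressive spatial-temporal schedule. The stage splits are assumed;
# only the total of 252K iterations and the final 64-frame / 384x384 stage are reported.
PRETRAIN_STAGES = [
    # (iterations, frames_per_clip, resolution)
    (90_000, 16, 256),   # short, low-resolution clips first (assumed values)
    (90_000, 32, 320),   # intermediate stage (assumed values)
    (72_000, 64, 384),   # final stage reaches 64 frames at 384x384
]

def stage_for_iteration(step: int):
    """Return (frames_per_clip, resolution) for a given training step."""
    boundary = 0
    for iters, frames, res in PRETRAIN_STAGES:
        boundary += iters
        if step < boundary:
            return frames, res
    return PRETRAIN_STAGES[-1][1:]  # stay at the final stage after 252K steps
```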

These design choices yielded an average accuracy of 88.2% across six benchmark tasks (SSv2, Diving-48, Jester, Kinetics, COIN, and ImageNet), surpassing previous baselines.

Understanding via masked representation learning

V-JEPA 2 exhibits strong motion understanding. On the Something-Something v2 benchmark, it achieves 77.3% top-1 accuracy, surpassing models such as InternVideo and VideoMAEv2. For appearance understanding, it remains competitive with state-of-the-art image-text pretrained models such as DINOv2 and PEcoreG.

The encoder's representations were evaluated with attentive probes, confirming that self-supervised learning alone can yield transferable, domain-agnostic visual features suitable for a variety of classification tasks.
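An attentive probe can be as simple as a learned query that attention-pools the frozen encoder's patch tokens before a linear classifier. The sketch below shows that general pattern with illustrative dimensions; it is not the exact probe configuration used in the evaluation.

```python
# Rough sketch of an attentive probe over frozen features (dimensions are illustrative).
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    def __init__(self, feat_dim: int = 1408, num_classes: int = 174, num_heads: int = 16):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, feat_dim))       # single learned query
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, frozen_features: torch.Tensor) -> torch.Tensor:
        # frozen_features: (B, N, D) patch tokens from the frozen V-JEPA 2 encoder.
        q = self.query.expand(frozen_features.size(0), -1, -1)
        pooled, _ = self.attn(q, frozen_features, frozen_features)   # attention pooling
        return self.head(pooled.squeeze(1))                          # class logits
```

Only the probe's parameters are trained; the encoder stays frozen, which is what makes this a measure of representation quality rather than fine-tuning capacity.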

Temporal reasoning via video question answering

To evaluate temporal reasoning, the V-JEPA 2 encoder is aligned with a multimodal large language model and evaluated on multiple video question-answering tasks. Despite the absence of language supervision during pretraining, the model achieves:

  • 84.0% on PerceptionTest
  • 76.9% on TempCompass
  • 44.5% on MVP
  • 36.7% on TemporalBench
  • 40.3% on TOMATO

These results challenge the assumption that visual-language alignment requires co-training from the start, suggesting that a pretrained video encoder can be aligned post hoc with strong generalization.
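A common way to perform such post-hoc alignment is to train a small projector that maps the frozen video features into the language model's token embedding space. The sketch below assumes a simple MLP projector and illustrative dimensions; it is not the exact alignment recipe used in this work.

```python
# Hedged sketch of post-hoc visual-language alignment via a learned projector.
import torch
import torch.nn as nn

class VideoToLLMProjector(nn.Module):
    def __init__(self, video_dim: int = 1408, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(video_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N, video_dim) from the frozen V-JEPA 2 encoder.
        # Output: (B, N, llm_dim) "visual tokens" that can be prepended to the text prompt.
        return self.proj(video_tokens)
```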

V-JEPA 2-AC: A latent world model for robotic planning

A key innovation in this release is V-JEPA 2-AC, an action-conditioned variant of the pretrained encoder. V-JEPA 2-AC is fine-tuned using only 62 hours of unlabeled robot video from the Droid dataset, learning to predict future video embeddings conditioned on robot actions and poses. The architecture is a 300M-parameter transformer with block-causal attention, trained with teacher-forcing and rollout objectives.
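Conceptually, the teacher-forcing objective regresses the next frame embedding from past embeddings and the intervening actions. A minimal sketch is below; the `predictor` interface is a hypothetical stand-in for the 300M-parameter transformer.

```python
# Sketch of a teacher-forcing step for an action-conditioned latent predictor.
import torch
import torch.nn.functional as F

def teacher_forcing_step(predictor, frame_embeds, actions):
    """frame_embeds: (B, T, D) embeddings from the frozen V-JEPA 2 encoder.
    actions: (B, T-1, A) end-effector actions/poses between consecutive frames."""
    # With block-causal attention, the prediction at step t only attends to steps <= t.
    preds = predictor(frame_embeds[:, :-1], actions)   # (B, T-1, D) predicted next embeddings
    targets = frame_embeds[:, 1:]                      # ground-truth next-frame embeddings
    return F.l1_loss(preds, targets)
```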

This enables zero-shot planning via model-predictive control. The model infers action sequences using the cross-entropy method (CEM) to minimize the distance between imagined future states and a visual goal. It achieves high success rates on reaching, grasping, and pick-and-place tasks with robot arms unseen during training, in different labs, without any reward supervision or additional data collection.
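A receding-horizon CEM planner over latent rollouts might look roughly like the sketch below; the `world_model.rollout` interface and all hyperparameters are assumptions used only for illustration.

```python
# Minimal cross-entropy method (CEM) planner over latent rollouts (illustrative only).
import torch

def cem_plan(world_model, current_embed, goal_embed, horizon=5, action_dim=7,
             pop=128, elites=16, iters=3):
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(iters):
        # Sample candidate action sequences around the current distribution.
        candidates = mean + std * torch.randn(pop, horizon, action_dim)
        with torch.no_grad():
            # Roll out each candidate in latent space; assumed to return final embeddings (pop, D).
            final_embeds = world_model.rollout(current_embed, candidates)
        # Score candidates by distance between imagined future state and the visual goal.
        scores = torch.norm(final_embeds - goal_embed, dim=-1)
        elite_idx = scores.topk(elites, largest=False).indices
        elite = candidates[elite_idx]
        mean, std = elite.mean(dim=0), elite.std(dim=0) + 1e-4
    return mean[0]  # execute the first action, then replan (receding-horizon MPC)
```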

Benchmark: Strong performance and planning efficiency

V-JEPA 2-AC was compared against baselines such as Octo (behavioral cloning) and Cosmos (a latent diffusion world model):

  • Plans execute in about 16 seconds per step (versus 4 minutes for Cosmos).
  • It achieves a 100% success rate on reach tasks.
  • It outperforms the baselines on grasp and manipulation tasks across object types.

Notably, it operates with a single monocular RGB camera, without calibration or environment-specific fine-tuning, underscoring the generalization ability of the learned world model.

Conclusion

Meta’s V-JEPA 2 represents a significant advance in scalable self-supervised learning for physical intelligence. By decoupling observation learning from action conditioning and leveraging large-scale passive video, V-JEPA 2 shows that general-purpose visual representations can be harnessed for real-world perception and control.


Check out the Paper, Models on Hugging Face, and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, join our 99K+ ML SubReddit, and subscribe to our newsletter.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.
