Generalist AI launches GEN-θ: a new class of embodied foundation models built for multi-modal training directly on high-fidelity raw physical interaction data

How do you build a single model that can learn physical skills from messy real-world robot data without relying on simulation? Generalist AI has unveiled GEN-θ, a family of embodied foundation models trained directly on high-fidelity raw physical interaction data rather than Internet video or simulation. The system is designed to establish scaling laws for robotics, much as large language models did for text, but grounded in continuous sensorimotor streams from real robots operating in homes, warehouses, and workplaces.

Harmonic reasoning: thinking and acting in real time

GEN-θ is introduced as an embodied foundation model architecture that builds on the strengths of vision and language models and extends them with native support for human-level reflexes and physical common sense. Its core feature is harmonic reasoning, in which the model is trained to think and act simultaneously over an asynchronous, continuous-time stream of perception and action tokens.

The design targets the specific constraints of robotics. A language model can take extra time to think before replying, but a robot must act while the physics of the world keeps evolving. Harmonic reasoning interleaves the sensing and acting streams so that GEN-θ can scale to very large model sizes without relying on a System 1/System 2 split or heavy inference-time tricks to keep the controller responsive.
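The article does not publish GEN-θ's implementation, but the idea of decoupled perception and action streams can be sketched with two asynchronous loops: perception tokens arrive at sensor rate while the control loop keeps emitting actions from the freshest observation it has, never blocking on a slower "thinking" pass. Everything below (the rates, the `perceive`/`act` names, the placeholder forward pass) is illustrative, not Generalist's code.

```python
import asyncio

async def perceive(queue: asyncio.Queue, steps: int = 20) -> None:
    """Push perception tokens at sensor rate (here ~50 Hz)."""
    for t in range(steps):
        await queue.put(f"obs_{t}")      # camera/proprioception token
        await asyncio.sleep(0.02)

async def act(queue: asyncio.Queue, steps: int = 10) -> None:
    """Emit actions at control rate (~20 Hz) from the freshest observation."""
    latest = None
    for step in range(steps):
        while not queue.empty():         # drain the queue to the newest token
            latest = queue.get_nowait()
        # Placeholder for the model's forward pass over perception tokens.
        action = f"action_given_{latest}"
        print(f"step {step}: {action}")
        await asyncio.sleep(0.05)

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    # Sensing and acting run concurrently: acting never waits for sensing.
    await asyncio.gather(perceive(queue), act(queue))

asyncio.run(main())
```

The point of the sketch is that the action loop always has something to execute, even when perception and deliberation lag behind, which is the constraint the harmonic reasoning design addresses.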

GEN-θ is also explicitly cross-embodiment. The same architecture has been run on different robots and tested on 6-DoF, 7-DoF, and 16+-DoF semi-humanoid systems, allowing a single pre-training run to serve a heterogeneous fleet.

Beyond the intelligence threshold of robotics

The Generalist AI team reports a phase shift in capabilities as GEN-θ scales in high-data regimes. Their scaling experiments also show that models must be large enough to absorb large amounts of physical interaction data.

The behavior by model size is as follows:

  • 1B-parameter models struggle to absorb complex and diverse sensorimotor data during pre-training; their weights stop taking in new information, a failure mode the research team describes as rigidity.
  • 6B models begin to benefit from pre-training and show strong multi-task capabilities.
  • 7B+ models internalize large-scale robot pre-training, so only thousands of post-training steps are needed to transfer to downstream tasks.

Plots of next-action validation prediction error on held-out long-horizon downstream tasks, across model sizes and pre-training compute, show the 1B model plateauing early while the 6B and 7B models continue to improve as pre-training increases. The research team links this phase transition to Moravec's Paradox, arguing that physical common sense and dexterity seem to require a higher computational threshold than abstract verbal reasoning, and that GEN-θ operates beyond that activation point.

The Generalist AI team says GEN-θ has been scaled past 10B parameters, and that larger variants adapt to new tasks with progressively less post-training.

The scaling laws of robotics

Another focus of the research is scaling laws that relate pre-training data and compute to downstream post-training performance. The research team sampled checkpoints from GEN-θ training runs on different subsets of the pre-training dataset, then post-trained those checkpoints on multi-task, language-conditioned data. This supervised fine-tuning stage spans 16 task sets, covering dexterous tasks such as building Lego, industrial workflows such as fast-food packaging, and general tasks including arbitrary instruction following.

Across tasks, more pre-training improves post-training validation loss and next-action prediction error. At sufficient model scale, the relationship between pre-training dataset size and downstream validation error is well described by a power law of the form:

L(D) = (D_c / D)^{α_D}

where D is the number of action trajectories in pre-training, D_c and α_D are fitted constants, and L(D) is the validation error on the downstream task. The formula lets robotics teams estimate how much pre-training data is needed to reach a target next-action prediction error, or how much downstream labeled data can be traded for additional pre-training.
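As a worked example, the power law can be fitted to measured (D, L) pairs and then inverted to budget data for a target error. The data points below are made up purely for illustration; only the functional form comes from the article.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical measurements: pre-training trajectories D vs. downstream
# next-action validation error L (illustrative numbers only).
D = np.array([1e4, 3e4, 1e5, 3e5, 1e6])
L = np.array([0.42, 0.31, 0.22, 0.16, 0.12])

def power_law(D, D_c, alpha):
    return (D_c / D) ** alpha          # L(D) = (D_c / D)^{alpha_D}

(D_c, alpha), _ = curve_fit(power_law, D, L, p0=(1e3, 0.3))

# Invert the fit: trajectories needed for a target validation error.
# L = (D_c / D)^alpha  =>  D = D_c / L^(1 / alpha)
target_error = 0.10
D_needed = D_c / target_error ** (1.0 / alpha)
print(f"D_c ≈ {D_c:.3g}, alpha ≈ {alpha:.3g}, need ≈ {D_needed:.3g} trajectories")
```

The same fit can be read in the other direction to decide whether collecting more pre-training trajectories or more downstream labels is the cheaper way to hit a given error target.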

Robot-scale data engine and infrastructure

GEN-θ was trained on an in-house dataset containing 270,000 hours of real-world manipulation trajectories collected in thousands of homes, warehouses, and workplaces around the world, and data operations currently add more than 10,000 hours per week. The Generalist AI team claims that GEN-θ is trained on orders of magnitude more real-world manipulation data than previously available large robotics datasets.

To sustain this regime, the research team built custom hardware, data loaders, and network infrastructure, including dedicated Internet lines to handle uplink bandwidth from distributed sites. The pipeline uses multi-cloud contracts, custom upload machines, and roughly 10,000 compute cores for continuous multi-modal processing. The team reports compression and data-loading techniques that handle tens of petabytes of data, beyond what leading video foundation models ingest, resulting in a system that can absorb 6.85 years of real-world manipulation experience per day of training.
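To put those figures in proportion, a quick back-of-the-envelope check, using only numbers quoted in this article, shows how much faster training consumes experience than the fleet collects it:

```python
# All inputs are figures quoted in this article.
HOURS_PER_YEAR = 365 * 24                 # 8,760
absorbed_per_day = 6.85 * HOURS_PER_YEAR  # ≈ 60,000 hours absorbed per training day
collected_per_day = 10_000 / 7            # ≈ 1,429 hours of new data per day
print(f"absorbed per day:  {absorbed_per_day:,.0f} h")
print(f"collected per day: {collected_per_day:,.0f} h")
print(f"ratio: {absorbed_per_day / collected_per_day:.0f}x")
```

Every hour of collected data is therefore revisited many times per training day, which helps explain the emphasis on compression and data-loading throughput rather than on collection alone.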

How GEN-θ is pre-trained matters as much as its size

The Generalist AI team ran large-scale ablations over 8 pre-training datasets and 10 long-horizon task sets. They found that different data mixtures, not just more data, produce models with different behavior across three task groups: dexterity, real-world applications, and generalization. Performance is measured with the validation mean squared error (MSE) of the next action and the reverse Kullback-Leibler (KL) divergence between the model policy and a Gaussian distribution centered on the ground-truth action.

Models with low MSE and low reverse KL are better candidates for supervised fine-tuning. A model with higher MSE but lower reverse KL has a more multimodal action distribution and can be a better starting point for reinforcement learning.
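A minimal sketch of the two metrics, assuming the policy exposes samples and their log-probabilities (the function names and the Monte Carlo estimator here are illustrative, not the team's evaluation code):

```python
import numpy as np
from scipy.stats import multivariate_normal

def next_action_mse(pred_actions: np.ndarray, true_actions: np.ndarray) -> float:
    """Validation MSE between predicted and ground-truth next actions."""
    return float(np.mean((pred_actions - true_actions) ** 2))

def reverse_kl(policy_samples: np.ndarray, policy_log_probs: np.ndarray,
               true_action: np.ndarray, sigma: float = 0.05) -> float:
    """Monte Carlo estimate of KL(pi || q), where q is a Gaussian centered
    on the ground-truth action. Low values mean the policy concentrates
    near the demonstrated action; a multimodal policy scores higher."""
    q = multivariate_normal(mean=true_action,
                            cov=sigma ** 2 * np.eye(true_action.shape[0]))
    # KL(pi || q) ≈ E_{a ~ pi}[log pi(a) - log q(a)]
    return float(np.mean(policy_log_probs - q.logpdf(policy_samples)))

# Toy usage with a 7-DoF action and a hypothetical policy's samples.
rng = np.random.default_rng(0)
true_a = np.zeros(7)
samples = rng.normal(0.0, 0.05, size=(1024, 7))      # stand-in policy samples
log_probs = multivariate_normal(np.zeros(7),
                                0.05 ** 2 * np.eye(7)).logpdf(samples)
print(next_action_mse(samples, np.tile(true_a, (1024, 1))))
print(reverse_kl(samples, log_probs, true_a))
```

In this toy case the policy matches the Gaussian around the ground truth, so the reverse KL estimate is near zero; a policy that spreads mass over several valid grasps would score higher on reverse KL even if each mode is individually accurate.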

Key takeaways

  1. GEN-θ is an embodied foundation model trained on high-fidelity raw physical interaction data (rather than simulations or Internet video) that uses harmonic reasoning to think and act simultaneously under real-world physical constraints.
  2. Scaling experiments show a capability threshold around 7B parameters: smaller models become rigid under high data load, while larger models keep improving with more pre-training.
  3. GEN-θ exhibits a clean scaling law in which downstream post-training performance follows a power law in the amount of pre-training data, allowing the team to predict how much data and compute a target error level requires.
  4. The system is trained on more than 270,000 hours of real-world manipulation data, growing by roughly 10,000 hours per week, supported by custom multi-cloud infrastructure that absorbs 6.85 years of experience per training day.
  5. Large-scale ablations over 8 pre-training datasets and 10 long-horizon task sets show that data quality (measured by validation MSE and reverse KL) and mixture design matter as much as scale, since different mixtures produce models better suited to supervised fine-tuning or to reinforcement learning.

GEN-θ positions embodied foundation models as a serious attempt to bring scaling laws to robotics through harmonic reasoning, large-scale multimodal pre-training, and explicit analysis of data mixtures. The research shows that 7B+ models trained on 270,000 hours of real-world manipulation data, with roughly 10,000 more hours added each week, can cross the intelligence threshold, and that more physical interaction data predictably improves dexterity, real-world application, and generalization performance downstream.

