Genie Envisioner: A unified video generation platform for scalable, instruction-driven robotic manipulation
Embodied AI agents that can perceive, think, and act in the real world mark a critical step toward the future of robotics. A core challenge is building scalable, reliable manipulation skills: the ability to intentionally grasp and control objects through selective contact. Although advances span analytical methods, model-based approaches, and large-scale data-driven learning, most systems still treat data collection, training, and evaluation as disconnected stages. Each stage often demands custom setups, manual curation, and task-specific tweaks, creating friction that slows progress, hides failure modes, and reduces reproducibility. This underscores the need for a unified framework that streamlines learning and evaluation.
Robot manipulation research has shifted from analytical models to neural world models that learn dynamics directly from sensory inputs in pixel and latent space. Large-scale video generation models can produce realistic visuals, but they often lack the action conditioning, long-horizon temporal consistency, and multi-view reasoning required for control. Vision-language-action (VLA) models follow instructions but are constrained by imitation-based learning, which hampers error recovery and planning. Policy evaluation remains challenging as well: physics simulators demand extensive tuning, while real-world testing is resource-intensive. Existing evaluation metrics often emphasize visual quality rather than task success, underscoring the need for benchmarks that better capture real-world manipulation performance.
Developed by researchers at the Agibot Genie Team, NUS LV-LAB, and BUAA, Genie Envisioner (GE) is a unified platform for robotic manipulation that combines policy learning, simulation, and evaluation in a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world manipulation tasks. GE-Act maps these representations to precise action trajectories, while GE-Sim provides fast, action-conditioned video-based simulation. The EWMBench benchmark evaluates visual fidelity, physical accuracy, and instruction-action alignment. Trained across diverse robots and tasks, GE lays the groundwork for scalable, memory-aware, and physics-grounded embodied intelligence research.
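To make the division of labor concrete, here is a minimal Python sketch of how the three components could fit together. The class names, method signatures, and tensor shapes are illustrative assumptions for exposition, not the released API; only the overall data flow (instruction-conditioned latent video prediction, then action decoding, then action-conditioned simulation) follows the article.

```python
import numpy as np

# Minimal interface sketch of the three GE components. All class and method
# names, shapes, and the instruction string are illustrative assumptions;
# only the data flow mirrors the article's description.

class GEBase:
    """Stand-in for GE-Base: instruction-conditioned video diffusion."""
    def predict_latents(self, frames, instruction, horizon):
        # A real model would run multi-view latent video diffusion here;
        # the stub just returns random latents of shape (horizon, 256).
        return np.random.randn(horizon, 256)

class GEAct:
    """Stand-in for GE-Act: lightweight decoder from latents to actions."""
    def decode_actions(self, latents):
        # A real model would integrate a learned flow; the stub maps each
        # latent to a bounded 7-DoF action.
        return np.tanh(latents[:, :7])

class GESim:
    """Stand-in for GE-Sim: action-conditioned neural simulation."""
    def step(self, frames, actions):
        # A real model would render the predicted next frames given the
        # actions; the stub returns the input frames unchanged.
        return frames

frames = np.zeros((3, 224, 224, 3))  # e.g., three camera views
world_model, policy, simulator = GEBase(), GEAct(), GESim()

latents = world_model.predict_latents(frames, "fold the towel", horizon=54)
actions = policy.decode_actions(latents)   # (54, 7) action chunk
frames = simulator.step(frames, actions)   # a closed loop would repeat this
print(actions.shape)
```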
GE’s design centers on three key components. GE-Base is a multi-view, instruction-conditioned video diffusion model trained on over one million robotic manipulation episodes. It learns latent trajectories that capture how a scene evolves under a given command. Building on this, GE-Act converts these latent video representations into real action signals through a lightweight flow-matching decoder, delivering fast, precise motor control even on robots absent from the training data. GE-Sim repurposes GE-Base’s generative backbone as an action-conditioned neural simulator, enabling closed-loop, video-based policy rollouts at speeds far beyond real hardware. The EWMBench suite then evaluates the system holistically on video realism, physical consistency, and the alignment between instructions and resulting actions.
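The flow-matching decoder is the most algorithmic piece of this pipeline, so a hedged sketch may help: flow matching trains a velocity field whose ODE transports noise into target samples, and decoding amounts to integrating that field. In the sketch below the velocity field is a random linear stand-in rather than a trained network, and the conditioning, step count, and shapes (a 54-step, 7-DoF action chunk, per the article) are assumptions for illustration.

```python
import numpy as np

# Hedged sketch of flow-matching decoding: sample noise, then integrate a
# velocity field v(x, t | latents) from t=0 to t=1 with Euler steps. In
# GE-Act the field would be a trained network conditioned on GE-Base video
# latents; here it is a random linear stand-in.

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(256, 7))  # stand-in for learned conditioning

def velocity(x, t, latents):
    """Stub velocity field; drifts x toward a latent-dependent target."""
    return latents @ W - x

def flow_matching_decode(latents, n_steps=10):
    """Euler integration of dx/dt = v(x, t | latents), noise -> actions."""
    x = rng.standard_normal((latents.shape[0], 7))  # (54, 7) initial noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity(x, i * dt, latents)
    return x

latents = rng.standard_normal((54, 256))  # would come from GE-Base
actions = flow_matching_decode(latents)
print(actions.shape)                      # (54, 7) action trajectory
```

Because the whole 54-step chunk is decoded in one short ODE solve rather than step by step, this style of decoder is what makes sub-second trajectory generation plausible.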
In evaluations, Genie Envisioner demonstrates strong real-world and simulated performance across a variety of robot manipulation tasks. GE-Act generates control rapidly (a 54-step trajectory in 200 ms) and consistently surpasses leading vision-language-action baselines in stepwise and end-to-end success rates. It adapts to new robot embodiments, such as the Agilex Cobot Magic and Dual Franka, with only one hour of task-specific data, and performs well on complex deformable-object tasks. GE-Sim delivers high-fidelity, action-conditioned simulation for scalable, closed-loop policy testing. The EWMBench benchmark confirms GE-Base’s superior temporal consistency, motion coherence, and scene stability compared with state-of-the-art video models, in close agreement with human quality judgments.
In summary, Genie Envisioner is a unified, scalable platform for dual-arm robot manipulation that combines policy learning, simulation, and evaluation in a single video-generative framework. Its core, GE-Base, is an instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic patterns of real-world robot interactions. GE-Act converts these representations into accurate, adaptive action plans, even on new robot types with minimal retraining. GE-Sim provides high-fidelity, action-conditioned simulation for closed-loop policy refinement, while EWMBench rigorously assesses realism, motion consistency, and instruction alignment. Extensive real-world testing highlights the system’s strong performance, making it a solid foundation for general-purpose, instruction-driven embodied intelligence.
Check out the paper and GitHub page for more details.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.