Scaling World Models for Generalist Robots

20-03-2026

Speaker 1: Shenyuan Gao

Shenyuan Gao is a final-year Ph.D. student at HKUST and a research intern with the NVIDIA GEAR team. He has published multiple papers at top-tier conferences in robot learning and computer vision (ICML/NeurIPS/CVPR/ECCV/RSS/IROS). His primary research interests include generative world models and their applications in embodied AI.

Speaker 2: Seonghyeon Ye

Seonghyeon Ye is a final-year Ph.D. student at KAIST AI and a research intern with the NVIDIA GEAR team. His research focuses on developing world action models for generalist robot policies, most recently through projects such as LAPA, DreamGen, and DreamZero. His work has been published at top venues including ICLR (’22, ’23, ’24, ’25), CoRL (’25), NeurIPS (’24), ICML (’24), CVPR (’25), and EMNLP (’21, ’22, ’23, ’24). Beyond academia, he has held research internships at Microsoft Research and LG AI.

Abstract

This episode covers two recent works on world models from the NVIDIA GEAR team:

DreamDojo is a generalist robot world model that makes generalizable future predictions controlled by action inputs. It is pretrained on 44k hours of diverse egocentric human data, using latent actions as a universal proxy to improve the transfer of physical knowledge. After distillation, the model runs in real time at 10 FPS with improved context consistency. DreamDojo enables several important applications of generative world models, including live teleoperation, policy evaluation, and model-based planning.

DreamZero is a World Action Model (WAM) built on a pretrained video diffusion backbone. Unlike vision-language-action models (VLAs), WAMs learn physical dynamics by predicting future world states and actions, using video as a dense representation of how the world evolves. By jointly modeling video and action, DreamZero learns diverse skills effectively from heterogeneous robot data without relying on repetitive demonstrations. This yields a more than 2x improvement in generalization to new tasks and environments over state-of-the-art VLAs in real-robot experiments. Crucially, through model and system optimizations, we enable a 14B-parameter autoregressive video diffusion model to perform real-time closed-loop control at 7 Hz.