Speaker
Wenlong Huang is a PhD candidate in Computer Science at Stanford University, advised by Professor Fei-Fei Li. He received his B.A. in Computer Science from UC Berkeley, where he was advised by Professor Deepak Pathak, Dr. Igor Mordatch, and Professor Pieter Abbeel. He studies the intersection of robotic manipulation, foundation models, and 3D computer vision. His work has won the Outstanding Paper Award in Robot Learning at ICRA 2023 and the Best Paper Award at the CoRL 2024 LEAP Workshop, and was a Best Paper Finalist at ICRA 2025. He received the Stanford School of Engineering Fellowship and was selected as a finalist for the NVIDIA Graduate Fellowship and the Citadel GQS Fellowship.
Abstract
A pinnacle goal of spatial intelligence is an action-conditioned 3D world model: a model of physics intuition that predicts how a scene evolves under a contemplated action for a specific embodiment. Towards this goal, I will introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared spatial representation of 3D point flows. By tying raw sensory observation to an embodiment-agnostic action space, PointWorld invites a natural analogy to large-scale language modeling: a next-token prediction objective for interaction in 3D space and time, capturing properties central to manipulation. I will present a systematic investigation of how to scale up such models, a careful examination of the behaviors that emerge with scale, and a demonstration that a single checkpoint can zero-shot synthesize diverse in-the-wild manipulation behaviors. I will close with an alternative roadmap for scaling robotic intelligence: rather than scaling through on-task demonstrations, we can scale through three pillars grounded in space (world modeling, behavioral modeling, and semantic reasoning), so that robots can imagine and evaluate possible futures, then synthesize behaviors that go beyond the data they were trained on.