Towards Spatial Supersensing in Video

Speaker

Shusheng Yang is currently a second-year Ph.D. student at NYU Courant, advised by Prof. Saining Xie. His research lies at the intersection of computer vision and multimodal learning, with a particular focus on visual representation learning, spatial intelligence, lifelong video understanding, and unified models/world modeling. Please check out Shusheng’s latest work here: https://scholar.google.com/citations?user=v6dmW5cntoMC&hl=en.

Abstract

While multimodal large language models (MLLMs) have advanced video understanding, we demonstrate that they still treat video as a sparse set of frames, underrepresent spatial structure, and rely heavily on textual recall. In this work, we propose a hierarchy of "spatial supersensing" capabilities in video, arguing that genuine video intelligence requires not only text-based knowledge recall and semantic perception but also spatial cognition and predictive world modeling. To measure progress, we introduce VSI-SUPER, a benchmark designed for video sequences of arbitrary length. To test whether current limitations stem from data scarcity, we curate VSI-590K and train Cambrian-S; while this model excels on standard benchmarks, it remains limited on VSI-SUPER. Finally, we prototype predictive sensing, which uses latent frame prediction and surprise estimation to handle unbounded visual streams. This approach improves Cambrian-S’s performance on VSI-SUPER and marks an early step toward spatial supersensing.
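
To make the predictive-sensing idea concrete, below is a minimal sketch of surprise-driven processing of a frame stream: a latent predictor guesses the next frame's embedding, and only frames whose observed embedding deviates strongly from the prediction (high "surprise") are retained in a compact memory. The function names, the MSE-based surprise score, and the threshold are illustrative assumptions, not the Cambrian-S implementation.

```python
import torch
import torch.nn.functional as F


def surprise_score(predicted_latent: torch.Tensor, actual_latent: torch.Tensor) -> float:
    """Surprise = prediction error between predicted and observed frame latents."""
    return F.mse_loss(predicted_latent, actual_latent).item()


def stream_with_surprise(frames, encoder, predictor, threshold: float = 0.5):
    """Process an unbounded frame stream, keeping only surprising frames.

    `encoder` maps a frame to a latent embedding; `predictor` maps the
    previous latent to a prediction of the current one. Both are assumed
    callables; the threshold is a hypothetical hyperparameter.
    """
    memory = []            # compact memory of high-surprise moments
    prev_latent = None
    for frame in frames:
        latent = encoder(frame)                 # frame -> latent embedding
        if prev_latent is not None:
            predicted = predictor(prev_latent)  # predict current latent from the past
            if surprise_score(predicted, latent) > threshold:
                memory.append(latent)           # retain only unexpected frames
        prev_latent = latent
    return memory
```

Because memory grows only with the number of surprising events rather than with video length, this kind of gating is one way to keep the cost of processing arbitrarily long streams bounded.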

Video

Coming soon. Stay tuned. :-)