Speaker 1: Weiyao Wang
Weiyao Wang is a Research Engineer at Meta MSL dedicated to giving AI the ability to reason about the physical world in 3D. An alumnus of Duke University, Weiyao has authored numerous high-impact papers at CVPR, ICCV, and NeurIPS, focusing on 3D and multimodal learning. He is a core contributor to SAM 3D, a generative framework that breaks the “data barrier” for 3D object reconstruction.
Abstract 1: SAM 3D: 3Dfy Anything in Images
The Segment Anything Model (SAM) revolutionized 2D computer vision by providing a foundation for universal image segmentation. However, our physical world is inherently 3D, and the “data barrier” for high-fidelity 3D reconstruction has long hindered the development of a similarly universal 3D foundation model. In this talk, I will present SAM 3D, a generative framework capable of “3D-fying” any object within a single natural image. Unlike previous methods that struggle with occlusion and scene clutter, SAM 3D predicts geometry, texture, and layout with unprecedented robustness. I will detail our human-and-model-in-the-loop pipeline that allowed us to scale 3D data annotation to new heights, and our multi-stage training framework that aligns synthetic pretraining with real-world complexity. With a 5:1 win rate in human preference tests over state-of-the-art baselines, SAM 3D marks a significant step toward visually grounded 3D reconstruction in the wild.
Speaker 2: Xitong Yang
Xitong Yang is a Research Scientist at Meta MSL, working on 3D human pose estimation and fine-grained video understanding. He received his Ph.D. in Computer Science from the University of Maryland and has published papers on large-scale video understanding and multimodal learning at top-tier venues such as CVPR and ECCV. Xitong is one of the core contributors to SAM 3D Body.
Abstract 2: SAM 3D Body: Robust Full-Body Human Mesh Recovery
Human mesh recovery aims to reconstruct detailed 3D human shape and pose from images, yet remains challenging under real-world conditions. In this talk, I’ll present SAM 3D Body (3DB), a promptable model for single-image full-body 3D human mesh recovery that achieves state-of-the-art accuracy and strong in-the-wild generalization across body, hands, and feet. 3DB is the first regression-based model built on Momentum Human Rig (MHR), a new parametric mesh representation that decouples skeletal structure from surface shape. The model adopts an encoder–decoder architecture and supports prompt-guided inference with 2D keypoints and masks, enabling interactive control in the spirit of the SAM family. I’ll cover the large-scale data engine behind 3DB—including hard example mining and a multi-stage annotation pipeline—followed by architectural details and evaluation results on both internal and external benchmarks.