Learning to Edit 3D Objects and Scenes

Speaker

Fangyin Wei is a final-year PhD candidate in Computer Science at Princeton University, working with Szymon Rusinkiewicz and Thomas Funkhouser. Fangyin’s research lies at the intersection of computer vision and graphics, with the goal of learning to build a realistic 3D world. Her past research spans topics from generative models for image synthesis and editing to neural renderers for 3D shape synthesis and editing, tackling challenges ranging from learning disentangled representations to learning without labels. Her research has been published in top venues in both computer vision (CVPR, ECCV, ICCV, 3DV) and graphics (SIGGRAPH Asia). She received her B.S. in CS from Peking University in 2018. Her work experience includes research internships on 3D vision topics at Google Research, Uber ATG R&D, Meta Reality Labs, and Microsoft Research.

Abstract

3D editing plays a key role in many fields, ranging from AR/VR, industrial and art design, to robotics. However, existing 3D editing tools either (i) demand labor-intensive manual effort and struggle to scale to many examples, or (ii) use optimization and machine learning but produce unsatisfactory results (e.g., losing details or supporting only coarse edits). These shortcomings often arise from editing in geometric space rather than in a structure-aware semantic space, and the latter is the key to automatic 3D editing at scale. While learning a structure-aware space yields significantly improved efficiency and accuracy, labeled datasets for training 3D editing models do not exist. In this talk, I will present novel approaches for learning to edit 3D objects and scenes in a structure-aware semantic space with noisy or no supervision.

The first part of the talk addresses how to extract the underlying structure to edit 3D objects, focusing on two critical properties: semantic shape parts and articulations. Our semantic editing method enables targeted edits to an object’s semantic parameters (e.g., the pose of a person’s arm or the length of an airplane’s wing), leading to better preservation of input details and improved accuracy compared to previous work. Next, I will introduce a 3D annotation-free method that learns to model the geometry, articulation, and appearance of articulated objects from color images. The model works on an entire category (as opposed to typical NeRF extensions that overfit to a single scene), enables applications such as few-shot reconstruction and animation of static objects, and generalizes to real-world captures.

The second part of the talk tackles how to extract structure for scene editing. I will present an automatic system that removes clutter (frequently moving objects such as clothes or chairs) from 3D scenes and inpaints the resulting holes with coherent geometry and texture. We address challenges including the lack of well-defined clutter annotations, entangled semantics and geometry, and multi-view inconsistency.

In summary, this presentation will demonstrate techniques that exploit the underlying structure of 3D data for editing. Our work opens up new research directions, such as leveraging structure from image-text joint embedding models to empower 3D editing models with stronger semantic understanding.

Video