Generalist Embodied AI in an Open World

Speaker

Xiaojian Ma is a research scientist at the Beijing Institute for General Artificial Intelligence (BIGAI). He received his Ph.D. in Computer Science from UCLA and a bachelor's degree in Computer Science from Tsinghua University. His research focuses primarily on large-scale multimodal learning for understanding, reasoning, and skill learning. In particular, he is interested in building models and agents that can learn from 2D/3D vision and text data and perform a wide range of reasoning, embodied planning, and control tasks. He has worked at DeepMind, NVIDIA Research, and Google Brain Robotics with a focus on large-scale machine learning. His research has been recognized with a best paper award at an ICML workshop and with research fellowships.

Abstract

From generalist manipulators to humanoids, robotics and embodied AI are at center stage again, but surrounded by a completely different AI landscape, where large pretrained models such as LLMs and VLMs are advancing on multiple fronts of human intelligence. Embodied AI itself is also experiencing a paradigm shift: from closed-world, static settings to more realistic, open-world, and dynamic environments. In this talk, I will present some of our recent efforts to bring more open-endedness to the world of embodied agents. We will first cover SQA3D, a new benchmark for embodied reasoning in 3D scenes. It combines open-vocabulary, knowledge-intensive, and situated reasoning, and poses substantial challenges to existing ML models, including LLMs. Building on this groundwork, I will provide some updates on developing open-world generalist embodied agents by leveraging these large models and their principles. Specifically, we explore some key ingredients in developing a vision-based multi-task agent controller in Minecraft, including multimodal fusion and horizon prediction. To further enable solving complex long-horizon tasks, we propose a hierarchical goal-execution agent architecture based on large models, which is among the best agents so far on the “ObtainDiamond” challenge. Finally, I will review some ongoing and possible future directions.
