Learning visual language models for video understanding


Antoine Yang is a Research Scientist at Google DeepMind on the Computer Vision team in London. In 2023, he completed his Ph.D. in the WILLOW team of Inria Paris and École Normale Supérieure. In 2020, he received a double MSc degree in Applied Mathematics from École Polytechnique and ENS Paris-Saclay. He previously interned at Huawei Noah’s Ark Lab and Google Research Perception.


Language models have become increasingly powerful in recent years, but they cannot perceive the world around us. This talk presents several state-of-the-art methods for developing visual language models that can reason about videos. We first focus on solving video question answering in a scalable manner, without manual annotation (zero-shot setting). To this end, we propose an approach that automatically generates video question answering data from narrated videos using text-only question-generation models, and show that a multi-modal transformer trained contrastively on the generated data can answer visual questions in a zero-shot manner. We then present an alternative zero-shot video question answering approach that bypasses the data generation procedure and directly leverages bidirectional language models. This is done by adding lightweight trainable parameters that incorporate vision into the language model while keeping its weights frozen, and by training on Web-scraped video-caption pairs. Finally, we address the challenging problem of temporally localizing and captioning events in untrimmed videos. We design a visual language model that treats this task as a sequence-to-sequence problem and benefits substantially from pretraining on narrated videos at scale.