Large-Scale Visual Representation Learning with Vision Transformers


Bio: Xiaohua Zhai is a staff researcher and manager on the Google Research, Brain team in Zürich. He received his PhD from Peking University in 2014. His research interests include large-scale representation learning, multimodal learning, transfer learning, and self-supervised learning.



Attention-based neural networks such as Vision Transformers (ViT) [1] have recently achieved state-of-the-art results on many computer vision benchmarks (e.g. the Visual Task Adaptation Benchmark [2]). Scale is a primary ingredient in attaining excellent results. In this talk, I will first share the empirical results of our scaling-law study of Vision Transformers and the recipe for training a ViT-G [3] model with up to two billion parameters. Then I will present “Locked-image Tuning” (LiT) [4], which trains a text model to “read out” good representations from a pre-trained, locked image model for new tasks. A LiT model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval. LiT thus provides an alternative to fine-tuning for adapting an existing model to new tasks.
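
To make the zero-shot transfer step concrete, here is a minimal sketch in Python with NumPy. It assumes the image and text embeddings have already been computed; the toy arrays below are stand-ins for the outputs of a locked image model and a trained text model, and the function names `l2_normalize` and `zero_shot_classify` are hypothetical, not taken from the LiT code. Each image is assigned to the class whose text embedding has the highest cosine similarity.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_classify(image_emb, class_text_emb):
    """Pick, for each image, the class whose text embedding is most similar.

    image_emb:      (num_images, dim) array, assumed to come from the locked image model.
    class_text_emb: (num_classes, dim) array, assumed to come from the text model,
                    one row per class description.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(class_text_emb)
    sims = img @ txt.T  # (num_images, num_classes) cosine similarities
    return sims.argmax(axis=-1), sims


# Toy embeddings (made up for illustration, not real model outputs).
class_text_emb = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0]])
image_emb = np.array([[0.9, 0.1, 0.0],
                      [0.2, 2.0, 0.1]])

pred, sims = zero_shot_classify(image_emb, class_text_emb)
print(pred)
```

No image-model weights are updated in this step: new classes are introduced only by embedding new text descriptions, which is what makes the transfer zero-shot. Retrieval follows the same pattern, ranking images by similarity to a query text instead of taking the argmax over classes.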


  1. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  2. A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark
  3. Scaling Vision Transformers
  4. LiT: Zero-Shot Transfer with Locked-image Text Tuning