Speaker
Yue Zhao is a fourth-year PhD student at the University of Texas at Austin, supervised by Prof. Philipp Krähenbühl. He obtained his MPhil degree from the Multimedia Laboratory at the Chinese University of Hong Kong, supervised by Prof. Dahua Lin. Before that, he received his Bachelor’s degree from Tsinghua University. His current research focuses on computer vision, particularly video analysis and understanding. He is a recipient of the 2024-2025 NVIDIA fellowship.
Abstract
Recent advances in vision-language models are largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video-instruction-tuned model (VIIT) is then used to auto-label millions of videos to generate high-quality captions. We show that the adapted video-language model performs well on a wide range of video-language benchmarks. In addition, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%.
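
For readers unfamiliar with the final step described above, the sketch below illustrates the general idea of contrastively training a video-text dual encoder on (video, caption) pairs in the CLIP style. The encoder modules, feature dimensions, and temperature initialization here are illustrative assumptions, not the speaker's actual architecture; only the symmetric InfoNCE objective itself is standard.

```python
# Minimal sketch of CLIP-style contrastive training for a video-text dual encoder
# on (video, auto-generated caption) pairs. Backbones are stubbed out with linear
# projections; dimensions and temperature are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, video_dim=768, text_dim=512, embed_dim=256):
        super().__init__()
        # Stand-ins for real video / text backbones (e.g. a frame-level ViT
        # and a text transformer); here just linear projections for brevity.
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        # Learnable temperature, initialized to ~log(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t, self.logit_scale.exp()

def contrastive_loss(v, t, scale):
    # Symmetric InfoNCE: matched (video, caption) pairs lie on the diagonal.
    logits = scale * v @ t.T                       # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)    # video -> text
    loss_t2v = F.cross_entropy(logits.T, targets)  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)

# Toy usage with random tensors standing in for encoded clips and captions.
model = DualEncoder()
video_feats = torch.randn(8, 768)   # batch of pooled video features
text_feats = torch.randn(8, 512)    # batch of pooled caption features
v, t, scale = model(video_feats, text_feats)
loss = contrastive_loss(v, t, scale)
loss.backward()
```

Trained this way, the shared embedding space supports zero-shot text-to-video retrieval by ranking videos according to their similarity to an encoded text query.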