A Vision-and-Language Approach to Computer Vision in the Wild: Modeling & Benchmark


Bio: Chunyuan Li is currently a Principal Researcher in the Deep Learning Team at Microsoft Research, Redmond. Before that, Chunyuan obtained his PhD at Duke University, working on probabilistic deep learning. He also spent time with Uber AI, Adobe Research, NIST and INRIA. At MSR, Chunyuan is mainly working on large-scale pre-training in computer vision (CV) and vision-language multimodality (MM), with a focus on building transferable vision models that can effortlessly generalize to a wide range of downstream CV & MM tasks. Chunyuan’s research has been published at many top-tier conferences, including multiple oral / spotlight presentations at NeurIPS, ICLR, ICML, CVPR and ACL, and was a Best Paper Finalist at CVPR 2022.

Homepage: https://chunyuan.li/


The future of AI lies in creating systems like foundation models that are pre-trained once and can then handle countless downstream tasks directly (zero-shot) or adapt to new tasks quickly (few-shot). In this talk, I will discuss our recent research on building such a transferable system in computer vision (CV), one that can effortlessly generalize to a wide range of visual recognition tasks in the wild. (1) As research background, I will briefly cover our efforts on modeling. We take a vision-and-language (VL) approach, in which every visual recognition task can be reformulated as an image-and-text matching problem. This is exemplified by UniCL [1] / Florence [2] for image classification, GLIP [3] for object detection, and K-LITE [4], which demonstrates an advantage of reformulating CV as VL: it allows leveraging external knowledge. (2) I will also present the ELEVATER benchmark [5], which evaluates the task-level transfer ability of pre-trained visual models in order to measure research progress in this direction. It consists of 20 image classification datasets and 35 object detection datasets. Building on this benchmark, we are also organizing an ECCV workshop [6] that aims to bring the community together to collaboratively tackle the challenge of computer vision in the wild.
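
As a rough illustration of the image-and-text matching reformulation mentioned above, the sketch below treats classification as picking the text prompt whose embedding is closest to the image embedding. It is a minimal sketch under stated assumptions: the embeddings are hand-made toy vectors standing in for the outputs of an image encoder and a text encoder, the prompt wording and the names `cosine` and `classify_by_matching` are hypothetical, and nothing here is taken from the UniCL, Florence, GLIP, or K-LITE code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_by_matching(image_embedding, text_embeddings):
    """Score each text prompt against the image and return the best match.

    text_embeddings maps a prompt (one per class) to its embedding.
    Returns the best-matching prompt and the full score dictionary.
    """
    scores = {
        prompt: cosine(image_embedding, embedding)
        for prompt, embedding in text_embeddings.items()
    }
    best_prompt = max(scores, key=scores.get)
    return best_prompt, scores


# Toy 3-d embeddings (assumed values, not real encoder outputs).
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

label, scores = classify_by_matching(image_embedding, text_embeddings)
print(label)
```

Because the class set is just a dictionary of text prompts, adding or swapping classes only changes the prompts and needs no new classifier head, which is the property that makes zero-shot transfer possible under this formulation.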


  1. Unified Contrastive Learning in Image-Text-Label Space https://arxiv.org/abs/2204.03610
  2. Florence: A New Foundation Model for Computer Vision https://arxiv.org/abs/2111.11432
  3. Grounded Language-Image Pre-training https://arxiv.org/abs/2112.03857
  4. K-LITE: Learning Transferable Visual Models with External Knowledge https://arxiv.org/abs/2204.09222
  5. ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models https://arxiv.org/abs/2204.08790
  6. ECCV Workshop https://computer-vision-in-the-wild.github.io/eccv-2022/