LLaVA: A Vision-and-Language Approach to Computer Vision in the Wild

Speaker

Chunyuan Li is currently a Research Lead at ByteDance/TikTok, based in the Seattle area. From 2018 to 2023, he worked as a Principal Researcher in the Deep Learning Team at Microsoft Research, Redmond. Before that, he obtained his PhD at Duke University, working on probabilistic deep learning, and spent time with Uber AI, Adobe Research, NIST, and INRIA. At MSR, Chunyuan worked mainly on large-scale pre-training in computer vision (CV) and vision-language multimodality (MM), with a focus on building transferable vision models that generalize effortlessly to a wide range of downstream CV and MM tasks. His research has appeared frequently at top-tier conferences, including dozens of oral/spotlight presentations at NeurIPS, ICLR, ICML, CVPR, and ACL, and he was a Best Paper Finalist at CVPR 2022. He has served as an Area Chair for NeurIPS, ICML, EMNLP, and AAAI, and as a Guest Editor of IJCV. More info: https://chunyuan.li/.

Abstract

The future of AI lies in creating systems, such as foundation models, that are pre-trained once and can then handle countless downstream tasks directly (zero-shot) or adapt to new tasks quickly (few-shot). In this talk, I will discuss our vision-language approach to achieving “Computer Vision in the Wild” (CVinW): building a transferable computer vision (CV) system that generalizes effortlessly to a wide range of visual recognition tasks in the wild. I will first describe the definition and current status of CVinW and briefly summarize our efforts on benchmarks and modeling. I will then dive into the Large Language-and-Vision Assistant (LLaVA) and its series, the first open-source project to exhibit GPT-4V-level capabilities in image understanding and reasoning, which demonstrates a promising path to building customizable large multimodal models that follow human intent at an affordable cost.

Video

Coming soon. Stay tuned. :-)