Vision-and-Language Alignment - Towards Universal Multimodal AI

Speaker

Junnan Li is currently a senior research manager at Salesforce Research. Before that, he obtained his PhD at the National University of Singapore. His research focuses on building generative AI models that can understand and generate data in multiple modalities, including vision, language, and code. In particular, he is interested in the efficient pre-training of multimodal models. He believes in the value of open-source research.

Abstract

Vision and language are two of the most fundamental modalities for machine intelligence. This talk will introduce a series of works from Salesforce Research on building vision-and-language foundation models. The papers covered include ALBEF (NeurIPS'21), ALPRO (CVPR'22), BLIP (ICML'22), PnP-VQA (EMNLP'22 Findings), and the latest BLIP-2. In particular, this talk will focus on BLIP-2, the next-generation vision-language pre-training framework that enables LLMs to understand images.

Video