Learning from Language Models for Visual Intelligence

Speaker

Boyi Li is a Research Scientist at NVIDIA Research and a Postdoctoral Scholar at Berkeley AI Research. Her research interests are in computer vision and machine learning, with a primary focus on multimodal and data-efficient machine learning for building intelligent systems.

Homepage: https://sites.google.com/site/boyilics/home

Abstract

The computer vision community has embraced specialized models trained on datasets with fixed object categories, such as ImageNet or COCO. However, relying solely on visual knowledge may limit flexibility and generality, as it requires additional labeled data and hinders user interaction. In this talk, I will discuss our recent work on leveraging language models for visual intelligence. We explore how language models enhance flexibility and investigate the grounding information they carry. First, I will present LSeg, our novel multimodal approach to language-driven semantic image segmentation. LSeg combines text and image embeddings, enabling generalization to unseen categories without retraining. Second, we examine the importance of extralinguistic signals by comparing multimodal cues to large language models. Our text-only framework achieves state-of-the-art performance in unsupervised constituency parsing, suggesting that language models encode valuable grounding information. These findings open up opportunities for multimodal representations in visual intelligence.
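To make the LSeg idea concrete, the sketch below shows only the matching mechanism described in the abstract: every pixel embedding is compared against text embeddings of the current label set, and each pixel takes the best-matching label. The encoders here are random stand-ins (assumptions for illustration), not the actual LSeg or CLIP models, and the shapes and label names are hypothetical.

```python
# Minimal sketch of the LSeg matching step, assuming stand-in encoders.
# Real LSeg uses a CLIP-style text encoder and a dense image encoder that
# produces one embedding per pixel; here both are replaced by random tensors.
import torch
import torch.nn.functional as F

embed_dim = 512
labels = ["dog", "grass", "sky", "other"]   # label set chosen at test time (hypothetical)

# Stand-in embeddings: (num_labels, dim) for text, (H, W, dim) for pixels.
text_embeddings = F.normalize(torch.randn(len(labels), embed_dim), dim=-1)
pixel_embeddings = F.normalize(torch.randn(64, 64, embed_dim), dim=-1)

# Per-pixel similarity to every label embedding, then pick the best label.
logits = torch.einsum("hwd,kd->hwk", pixel_embeddings, text_embeddings)  # (H, W, K)
segmentation = logits.argmax(dim=-1)                                     # (H, W) label ids

print(segmentation.shape)  # torch.Size([64, 64])
```

Because the labels enter only through their text embeddings, swapping in a previously unseen category amounts to encoding a new word and re-running the matching, which is what allows generalization without retraining.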

Video

Coming soon. Stay tuned. :-)