The Case for Reasoning Beyond Recognition


Jack Hessel is a Research Scientist at the Allen Institute for AI. He earned his PhD from Cornell University. These days, he works on improving human-AI collaboration, including aligning model behavior with human intent (RLHF) and expanding language models with new modalities (e.g., vision) for a more complete view of the world.


Algorithms that can jointly process modalities like images and text are needed for next-generation search, accessibility, and robot interaction tools. Simply recognizing objects in images, however, is rarely sufficient; to be truly useful, machines must be capable of deeper commonsense inferences about sophisticated multimodal contexts. I'll discuss our recent and ongoing work in three related directions beyond recognition: 1) visual temporal reasoning; 2) visual abductive reasoning; and 3) visual uncommonsense reasoning.