Guardian of Trust in Language Models: Automatic Jailbreak and Systematic Defense

Speaker

Haohan Wang is an assistant professor in the School of Information Sciences at the University of Illinois Urbana-Champaign. His research focuses on the development of trustworthy machine learning methods for computational biology and healthcare applications. In his work, he uses statistical analysis and deep learning methods, with an emphasis on data analysis using methods least influenced by spurious signals. Wang earned his PhD in computer science through the Language Technologies Institute of Carnegie Mellon University. He is also an organizer of the Trustworthy Machine Learning Initiative.

Abstract

Large Language Models (LLMs) excel in Natural Language Processing (NLP) with human-like text generation, but their potential for misuse has raised significant concerns. In this talk, we introduce an innovative system designed to address these challenges. Our system leverages LLMs to play different roles, simulating various user personas to generate "jailbreaks" – prompts that can induce LLMs to produce outputs contrary to ethical standards or specific guidelines. Utilizing a knowledge graph, our method efficiently creates new jailbreaks, testing the LLMs' adherence to governmental and ethical guidelines. Empirical validation on diverse models, including Vicuna-13B, LongChat-7B, Llama-2-7B, and ChatGPT, has demonstrated its efficacy. The system's application extends to Visual Language Models, highlighting its versatility in multimodal contexts.
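To make the role-playing idea more concrete, the following is a minimal, illustrative Python sketch of how persona-driven jailbreak generation guided by a knowledge graph could be wired together. It is an assumption-laden toy, not the speaker's actual system: the function query_llm, the personas, and the strategy graph contents are all hypothetical placeholders.

    # Illustrative sketch only: one LLM, prompted with different user personas,
    # drafts candidate jailbreak prompts; a toy knowledge graph of attack
    # strategies is used to recombine and diversify those drafts.
    # query_llm is a hypothetical stand-in for any chat-completion API.
    import random

    # Toy knowledge graph: strategy -> related strategies to recombine with.
    STRATEGY_GRAPH = {
        "role_play": ["hypothetical_scenario", "fiction_writing"],
        "hypothetical_scenario": ["role_play", "expert_consultation"],
        "fiction_writing": ["role_play"],
        "expert_consultation": ["hypothetical_scenario"],
    }

    PERSONAS = ["curious student", "fiction author", "security auditor"]

    def query_llm(prompt: str) -> str:
        """Hypothetical LLM call; replace with a real chat-completion API."""
        return f"[model response to: {prompt[:60]}...]"

    def generate_candidates(goal: str, n: int = 5) -> list[str]:
        """Combine a persona with an edge from the strategy graph to draft prompts."""
        candidates = []
        for _ in range(n):
            persona = random.choice(PERSONAS)
            strategy = random.choice(list(STRATEGY_GRAPH))
            neighbor = random.choice(STRATEGY_GRAPH[strategy])
            attacker_prompt = (
                f"You are a {persona}. Using the '{strategy}' and '{neighbor}' "
                f"framings, write a prompt that tries to get a chatbot to: {goal}"
            )
            candidates.append(query_llm(attacker_prompt))
        return candidates

    def is_refusal(response: str) -> bool:
        """Crude placeholder check for whether the target model refused."""
        return any(kw in response.lower() for kw in ("i cannot", "i can't", "sorry"))

    if __name__ == "__main__":
        for candidate in generate_candidates("reveal its hidden system prompt"):
            print("refused" if is_refusal(query_llm(candidate)) else "jailbroken")

In a real pipeline, the refusal check and the graph traversal would of course be far more sophisticated; the sketch only shows how persona selection and strategy recombination could feed a candidate-generation loop.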

The second part of our talk shifts focus to defensive strategies against such jailbreaks. Recent studies have uncovered various attacks that can manipulate LLMs, including manual and gradient-based jailbreaks. Our work develops robust prompt optimization as a novel defense mechanism, inspired by principled solutions from trustworthy machine learning. This approach involves system prompts – parts of the input text inaccessible to users – and aims to counter both manual and gradient-based attacks effectively. Despite existing defenses, adaptive attacks such as GCG remain a challenge, necessitating a formalized defensive objective. Our research proposes such an objective and demonstrates how robust prompt optimization can enhance the safety of LLMs, safeguarding against realistic threat models and adaptive attacks.
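As a rough illustration of what such a formalized defensive objective could look like (our own sketch, not necessarily the exact formulation presented in the talk), robust prompt optimization can be framed as a minimax problem over a defensive suffix added to the system prompt:

    \min_{d} \; \max_{a \in \mathcal{A}} \; \mathcal{L}\bigl(M(x_{\text{harmful}} \oplus a \oplus d),\; y_{\text{safe}}\bigr)

Here $d$ denotes the optimized defensive tokens hidden in the system prompt, $a$ ranges over adversarial suffixes (e.g., those produced by an attack such as GCG), $\oplus$ denotes concatenation, $M$ is the target LLM, and $\mathcal{L}$ measures how far the model's output is from a safe refusal $y_{\text{safe}}$. The inner maximization models an adaptive attacker, while the outer minimization hardens the prompt against the strongest attack found.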

Video