Evaluating the Universality of 'Do Anything Now' Jailbreak Prompts on Large Language Models: Content Warning: This paper contains unfiltered and harmful examples. Conference

Nabavirazavi, S, Zad, S, Iyengar, SS. (2025). Evaluating the Universality of 'Do Anything Now' Jailbreak Prompts on Large Language Models: Content Warning: This paper contains unfiltered and harmful examples. 691-696. 10.1109/CCWC62904.2025.10903781

cited authors

  • Nabavirazavi, S; Zad, S; Iyengar, SS

abstract

  • The advent of large language models (LLMs) and their capabilities in diverse downstream tasks have attracted considerable adversarial attention, leading to various attack strategies. Among notable adversarial techniques targeting LLMs, jailbreak attacks have recently emerged as a means to bypass safety mechanisms and generate harmful content. Jailbreaks mislead LLMs to instruct users on tasks that have a negative societal impact, including illegal activities, malware distribution, and fraud. To systematically evaluate and robustify the corresponding refusal mechanisms and safety alignment measures against jailbreak strategies in subsequent iterations, a scalable approach is essential. The widespread presence of jailbreak scenarios, the inconsistency between different LLM usage policies, and the frequency of LLM updates make it challenging to have a unified evaluation pipeline. This paper introduces LLM Jailbreak Evaluator (LLM-JBE), an innovative framework that efficiently evaluates jailbreak prompts across a wide range of LLMs. We demonstrate the capabilities of LLM-JBE by measuring the effectiveness of jailbreak prompts across various OpenAI models. In particular, we demonstrate that basic jailbreak prompts remain highly effective and achieve attack success rates above 50% on GPT-4, GPT-3.5, and o1. The code is available at github.com/sina-nabavi/LLM-JBE.

publication date

  • January 1, 2025

Digital Object Identifier (DOI)

  • 10.1109/CCWC62904.2025.10903781

start page

  • 691

end page

  • 696