Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation

Conference

Zhuang, J, Jin, H, Zhang, Y et al. (2025). Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation. 2147-2160. 10.18653/v1/2025.findings-emnlp.114

cited authors

Zhuang, J; Jin, H; Zhang, Y; Kang, Z; Zhang, W; Dagher, GG; Wang, H

authors

Zhang, Wenbin

abstract

Intent detection, a core component of natural language understanding, has evolved considerably into a crucial mechanism for safeguarding large language models (LLMs). While prior work has applied intent detection to enhance LLMs’ moderation guardrails, showing significant success against content-level jailbreaks, the robustness of intent-aware guardrails under malicious manipulation remains under-explored. In this work, we investigate the vulnerability of intent-aware guardrails and show that LLMs exhibit implicit intent detection capabilities. We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, that first transforms harmful inquiries into structured outlines and then reframes them into declarative-style narratives by iteratively optimizing prompts via feedback loops to enhance jailbreak success for red-teaming purposes. Extensive experiments across four public benchmarks and various black-box LLMs indicate that our framework consistently outperforms several cutting-edge jailbreak methods and evades even advanced Intent Analysis (IA) and Chain-of-Thought (CoT)-based defenses. Specifically, our “FSTR+SPIN” variant achieves attack success rates ranging from 88.25% to 96.54% against CoT-based defenses on the o1 model, and from 86.75% to 97.12% on the GPT-4o model under IA-based defenses. These findings highlight a critical weakness in LLMs’ safety mechanisms and suggest that intent manipulation poses a growing challenge to content moderation guardrails.

publication date

January 1, 2025

Digital Object Identifier (DOI)

https://doi.org/10.18653/v1/2025.findings-emnlp.114

start page

2147

end page

2160
