Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation
Preprint

Zhuang, Jun; Jin, Haibo; Zhang, Ye; et al. (2025). Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation. 10.48550/arxiv.2505.18556

cited authors

  • Zhuang, Jun; Jin, Haibo; Zhang, Ye; Kang, Zhengjian; Zhang, Wenbin; Dagher, Gaby G; Wang, Haohan

publication date

  • August 25, 2025

keywords

  • 46 Information and Computing Sciences
  • 4609 Information Systems

Digital Object Identifier (DOI)

  • 10.48550/arxiv.2505.18556