Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation

Preprint

Zhuang, Jun; Jin, Haibo; Zhang, Ye; et al. (2025). Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation. 10.48550/arxiv.2505.18556

cited authors

Zhuang, Jun; Jin, Haibo; Zhang, Ye; Kang, Zhengjian; Zhang, Wenbin; Dagher, Gaby G; Wang, Haohan

authors

Zhang, Wenbin

publication date

August 25, 2025

keywords

46 Information and Computing Sciences
4609 Information Systems

Digital Object Identifier (DOI)

https://doi.org/10.48550/arxiv.2505.18556

FIU Discovery
