Paper Number
ICIS2025-2633
Paper Type
Short
Abstract
Evaluating the effectiveness of hate speech detoxification is an emerging challenge, particularly as large language models (LLMs) become central to content moderation. While text detoxification (TD) presents a promising alternative to deletion or banning, current evaluation methods remain limited. Human evaluation is costly and inconsistent, and existing automatic metrics often fail to capture social sensitivity. We introduce SAFE-TD, a Structured Agentic Framework for Evaluation of TD, which simulates three agent roles to assess detoxified outputs from multiple perspectives. Our preliminary analysis reveals four outcome types and identifies a critical risk: the generation of implicit hate speech that appears neutral but retains harmful meaning. These findings expose under-explored trade-offs in TD and limitations in existing evaluation practices. SAFE-TD contributes a scalable, socially grounded approach to evaluating LLM-based TD, offering a foundation for more ethical and nuanced AI development for online safety.
Recommended Citation
Phan, Thuy Linh (Isabella); Boyce, James; Xie, Hetiao (Slim); Namvar, Morteza; and Risius, Marten, "Same Same but Different: Evaluating Hate Speech Detoxification through an LLM-based Agentic Framework" (2025). ICIS 2025 Proceedings. 34.
https://aisel.aisnet.org/icis2025/gen_ai/gen_ai/34
Same Same but Different: Evaluating Hate Speech Detoxification through an LLM-based Agentic Framework
Comments
12-GenAI