Paper Number

ECIS2026-1681

Paper Type

SP

Abstract

Hate speech detection in political discourse is hindered by the scarcity of domain-specific hate examples and severe class imbalance in election-related data. To address this challenge, we develop a topic-aware synthetic data generation pipeline that uses large language models to produce contextually grounded hate-speech samples aligned with discourse from the 2024 U.S. election. We manually annotate 6,499 tweets, apply BERTopic to identify thematic structure, and generate synthetic hate tweets conditioned on representative examples and topic-level cues. These synthetic samples are combined with the original dataset to fine-tune transformer-based classifiers. The augmented dataset yields significant improvements in hate-speech detection, with the best-performing model increasing its Hate-class F1 score from 0.67 to 0.88 after augmentation. These findings demonstrate that LLM-generated synthetic data can effectively enrich rare hate expressions and substantially enhance classifier performance in politically charged contexts.
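As a rough illustration of the augmentation and evaluation steps described in the abstract, the sketch below appends synthetic Hate-class rows to a toy dataset and computes a per-class F1 score. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`class_f1`, `augment`, `class_share`), the row fields, the label strings, and the placeholder data are all hypothetical, and the topic modelling, LLM generation, and classifier fine-tuning stages are omitted.

```python
from collections import Counter


def class_f1(y_true, y_pred, positive="hate"):
    """F1 for a single class: harmonic mean of its precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def augment(original, synthetic):
    """Combine original and synthetic rows, tagging each with its source."""
    tagged_original = [dict(row, source="original") for row in original]
    tagged_synthetic = [dict(row, source="synthetic") for row in synthetic]
    return tagged_original + tagged_synthetic


def class_share(rows, label="hate"):
    """Fraction of rows carrying the given label."""
    counts = Counter(row["label"] for row in rows)
    return counts[label] / len(rows)


# Hypothetical placeholder rows; real data would hold annotated tweet text.
original = [
    {"text": f"<tweet {i}>", "label": "hate" if i < 1 else "not_hate", "topic": i % 2}
    for i in range(10)
]
# Hypothetical synthetic rows, one batch per topic, all labelled as the rare class.
synthetic = [
    {"text": f"<synthetic tweet {i}>", "label": "hate", "topic": i % 2}
    for i in range(4)
]

train = augment(original, synthetic)
print(f"hate share before: {class_share(original):.2f}")
print(f"hate share after:  {class_share(train):.2f}")
```

Tagging each row with its source keeps synthetic samples out of the evaluation split, so a score such as the Hate-class F1 is measured on original data only. That separation is an assumption of this sketch; the abstract does not state how the evaluation split was built.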

Jun 14th, 12:00 AM

Synthetic Data Generation Using LLMs For Hate Speech Detection In Political Posts

