Hate speech detection is a critical task, especially on social media, where harmful content spreads quickly. However, the inherently subjective nature of hate speech leads to frequent disagreement among annotators, particularly for subtle or borderline content, according to a new study by Somaiyeh Dehghan, Mehmet Umut Sen, and Berrin Yanikoglu. Their paper, published on arXiv, examines this largely overlooked problem and evaluates a range of aggregation methods for handling annotator disagreement.
The Problem of Annotator Disagreement
The researchers note that traditional approaches often discard non-consensus samples or force a 'gold standard' through expert adjudication. Both practices throw away valuable information about uncertainty and diverse human perspectives, which can bias models and produce over-optimistic results. The study analyzes several aggregation methods, including majority voting and ordinal strategies (minimum, maximum, and mean), and measures their impact across binary, 4-class, and 6-class classification tasks.
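The aggregation strategies named above can be illustrated with a minimal Python sketch. It assumes each sample carries a list of integer labels on an ordinal severity scale (0 = least severe). The function name `aggregate`, the tie-breaking rule, and the rounding rule are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter
from statistics import mean


def aggregate(labels, method="majority"):
    """Collapse one sample's ordinal annotator labels into a single label."""
    if not labels:
        raise ValueError("need at least one annotation")
    if method == "majority":
        counts = Counter(labels)
        top = max(counts.values())
        # Assumption: ties are broken toward the lower (less severe) label.
        return min(label for label, n in counts.items() if n == top)
    if method == "min":
        return min(labels)
    if method == "max":
        return max(labels)
    if method == "mean":
        # Assumption: round half up to the nearest class index
        # (int(x + 0.5) is valid because labels are non-negative).
        return int(mean(labels) + 0.5)
    raise ValueError(f"unknown method: {method}")


# Three annotators on a hypothetical 4-class severity scale.
annotations = [0, 2, 3]
for m in ("majority", "min", "max", "mean"):
    print(m, aggregate(annotations, m))
```

For the disputed sample `[0, 2, 3]`, this prints 0 for majority (a three-way tie), 0 for min, 3 for max, and 2 for mean, which shows how much the training label can depend on the aggregation rule when annotators disagree.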
Key Findings: Modeling Disagreement Improves Robustness
The paper demonstrates that filtering out non-consensus samples yields over-optimistic results. Instead, the authors show that annotator disagreement, when properly modeled, is a valuable resource for building more robust and reliable systems. They also use annotators' perceived hate speech strength scores to explore regression-based and hybrid modeling approaches, finding that perceived strength provides a complementary signal that enhances classification performance.
| Aggregation Method | Description | Impact on Performance |
|---|---|---|
| Majority Voting | Standard label assignment based on most common annotation | Baseline method |
| Ordinal (min, max, mean) | Uses ordered labels from annotators | Mixed results across tasks |
| Regression-based | Uses continuous hate speech strength scores | Enhances classification |
| Hybrid | Combines classification with strength signals | Achieves new state-of-the-art |
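To see why dropping non-consensus samples can flatter a model, the sketch below compares accuracy on the full evaluation set with accuracy on the consensus-only subset. The data is an invented toy example, not results from the paper, and the helper names (`has_consensus`, `accuracy`, `evaluate`) are hypothetical.

```python
def has_consensus(labels):
    """True when every annotator assigned the same label."""
    return len(set(labels)) == 1


def accuracy(pairs):
    """Fraction of (gold, predicted) pairs that match; None if there are no pairs."""
    if not pairs:
        return None
    return sum(gold == pred for gold, pred in pairs) / len(pairs)


def evaluate(samples):
    """samples: list of (annotator_labels, gold_label, predicted_label) tuples."""
    all_pairs = [(gold, pred) for _, gold, pred in samples]
    consensus_pairs = [
        (gold, pred) for labels, gold, pred in samples if has_consensus(labels)
    ]
    return {"all": accuracy(all_pairs), "consensus_only": accuracy(consensus_pairs)}


# Invented toy data: the model is right on unanimous samples, wrong on disputed ones.
samples = [
    ([1, 1, 1], 1, 1),
    ([0, 0, 0], 0, 0),
    ([0, 1, 1], 1, 0),
    ([0, 0, 1], 0, 1),
]
print(evaluate(samples))
```

On this toy set the output is `{'all': 0.5, 'consensus_only': 1.0}`: the same model looks perfect once the disputed samples are removed, even though it fails on exactly the borderline content that is hardest to moderate.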
State-of-the-Art Results for Turkish Tweets
The researchers applied their methods to Turkish tweets and established new state-of-the-art results for hate speech detection in that language. The study highlights that the perceived strength signal, when incorporated, improves model accuracy and robustness.
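This summary does not describe how the authors combine the strength signal with classification. One common way to do it, shown here purely as an assumed illustration, is a multi-task objective that adds a weighted squared-error term on the strength score to the usual cross-entropy loss. The function `hybrid_loss` and its `weight` parameter are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_loss(class_logits, true_class, predicted_strength, true_strength, weight=0.5):
    """Cross-entropy on the class label plus weighted squared error on strength.

    Assumption: a simple multi-task sum; the paper's actual formulation may differ.
    """
    probs = softmax(class_logits)
    cross_entropy = -math.log(probs[true_class])
    squared_error = (predicted_strength - true_strength) ** 2
    return cross_entropy + weight * squared_error


# Hypothetical single sample: 4-class logits and a strength score on a 0-10 scale.
loss = hybrid_loss([2.0, 0.5, 0.1, -1.0], true_class=0,
                   predicted_strength=3.0, true_strength=4.0)
print(round(loss, 4))
```

In a sketch like this, `weight` controls how strongly the strength score influences training. Setting it to 0 recovers a plain classifier.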
Implications for Enterprise AI Applications
Although the study focuses on hate speech, its findings have broader implications for any classification task where human annotation is subjective, including content moderation, customer feedback analysis, and even areas like supply chain risk assessment where expert judgments may vary. For CTOs and technology leaders building AI systems, the research underscores the importance of preserving annotator disagreement rather than discarding it, since modeling that disagreement can lead to more reliable models.
The paper is available on arXiv and was submitted on February 12, 2025, with multiple revisions through June 2026.