Detecting harmful content in multi-turn conversations remains a challenge for large language models (LLMs), which often rely solely on internal parametric knowledge with no explicit grounding in external normative principles. The result can be inconsistent judgments in socially nuanced contexts and limited interpretability. To address this, researchers have proposed RoTRAG, a retrieval-augmented generation framework that incorporates concise human-written moral norms, called Rules of Thumb (RoTs), into LLM-based harm assessment.
The Challenge of Harm Detection in Multi-Turn Dialogue
According to the paper published on arXiv, most existing methods for harm detection rely mainly on models’ internal parametric knowledge. This approach often produces inconsistent judgments in socially nuanced contexts, offers limited interpretability, and repeats the same reasoning across conversational turns. Multi-turn dialogue compounds the problem, because a judgment must account for the full conversational context rather than isolated utterances.
RoTRAG: Retrieving Moral Norms for Grounded Reasoning
RoTRAG addresses these limitations by retrieving relevant RoTs from an external corpus for each turn. These RoTs serve as explicit normative evidence for turn-level reasoning and final severity classification. To improve efficiency, the framework introduces a lightweight binary routing classifier that decides whether a new turn requires retrieval-grounded reasoning or can reuse existing context. This mechanism reduces redundant computation without sacrificing performance, according to the researchers.
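The per-turn flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the tiny `ROT_CORPUS`, the token-overlap retriever `retrieve_rots`, and the new-token heuristic in `needs_retrieval` are all hypothetical stand-ins for the paper's external corpus, retriever, and trained binary routing classifier.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Hypothetical toy corpus; a real system would query a large external RoT corpus.
ROT_CORPUS = [
    "It is wrong to threaten other people.",
    "It is good to help someone who is struggling.",
    "You should not share private information about others.",
]

def tokens(text: str) -> set[str]:
    """Lowercase word set with surrounding punctuation stripped."""
    return {w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")}

def retrieve_rots(turn: str, k: int = 2) -> list[str]:
    """Rank RoTs by Jaccard token overlap (assumed stand-in for a real retriever)."""
    query = tokens(turn)
    def score(rot: str) -> float:
        rot_tokens = tokens(rot)
        union = query | rot_tokens
        return len(query & rot_tokens) / len(union) if union else 0.0
    return sorted(ROT_CORPUS, key=score, reverse=True)[:k]

def needs_retrieval(turn: str, seen: set[str]) -> bool:
    """Assumed stand-in for the binary routing classifier:
    route to retrieval only when the turn adds at least three unseen tokens."""
    return len(tokens(turn) - seen) >= 3

@dataclass
class DialogueState:
    seen_tokens: set[str] = field(default_factory=set)
    active_rots: list[str] = field(default_factory=list)
    retrieval_calls: int = 0

def process_turn(state: DialogueState, turn: str) -> list[str]:
    """Return the RoTs that ground this turn, retrieving only when routed to."""
    if not state.active_rots or needs_retrieval(turn, state.seen_tokens):
        state.active_rots = retrieve_rots(turn)
        state.retrieval_calls += 1
    state.seen_tokens |= tokens(turn)
    return state.active_rots

# Two-turn demo: the short follow-up reuses the RoTs retrieved for the first turn.
state = DialogueState()
dialogue = ["I want to threaten my neighbor over the fence.", "Yes, I want to."]
rots_per_turn = [process_turn(state, turn) for turn in dialogue]
```

In the demo, the second turn adds almost no new content, so the router skips retrieval and the turn is grounded in the RoTs already in context. In a full system, the returned RoTs would be passed to the LLM as explicit evidence for turn-level reasoning and the final severity classification.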
Performance Gains: ~40% Relative F1 Improvement and Reduced Error
The research team evaluated RoTRAG on two benchmark datasets: ProsocialDialog and Safety Reasoning Multi Turn Dialogue. Compared with competitive baselines, RoTRAG consistently improved both harm classification and severity estimation. The reported results include an average relative gain of around 40% in F1 across the benchmark datasets and an average relative reduction of 8.4% in distributional error. The following table summarizes key outcomes:
| Metric | Improvement |
|---|---|
| Average relative F1 gain | ~40% |
| Average relative reduction in distributional error | 8.4% |
| Redundant computation across turns | Reduced by the routing classifier (qualitative) |
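Because both headline figures are relative changes, it may help to see how such percentages are typically computed. The sketch below uses the standard definitions of relative gain and relative reduction; the input scores are hypothetical values chosen only to illustrate the arithmetic and are not taken from the paper, and the exact averaging procedure the authors used is not specified here.

```python
def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement of a higher-is-better metric such as F1."""
    return (new - baseline) / baseline

def relative_reduction(baseline: float, new: float) -> float:
    """Relative decrease of a lower-is-better metric such as distributional error."""
    return (baseline - new) / baseline

# Hypothetical scores for illustration only (NOT results from the paper).
f1_gain = relative_gain(baseline=0.50, new=0.70)
error_drop = relative_reduction(baseline=0.250, new=0.229)

print(f"Relative F1 gain: {f1_gain:.1%}")
print(f"Relative error reduction: {error_drop:.1%}")
```

A relative gain of about 40% therefore does not mean F1 rose by 40 points: in this made-up case it corresponds to an absolute increase of 0.20 on a baseline of 0.50.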
Implications for Enterprise AI
For enterprise technology decision-makers evaluating AI for content moderation, customer service chatbots, or social media monitoring, RoTRAG suggests a practical way to make LLM-based harm detection more consistent and interpretable. By grounding judgments in external normative principles, the framework reduces reliance on opaque internal knowledge and exposes its reasoning through the retrieved Rules of Thumb. The lightweight routing classifier also addresses efficiency concerns, which could make the approach more practical for real-time applications. While the research focuses on dialogue harm detection, the retrieval-augmented methodology could be adapted to other domains requiring principled reasoning, such as compliance checking or automated moderation in enterprise communication platforms.