Reinforcement Learning with Chain-of-Thought Supervision Boosts Hateful Meme Detection Accuracy by Over 2 Points

A new reinforcement learning-based post-training method using Group Relative Policy Optimization and chain-of-thought supervision improves hateful and propagandistic meme detection. On the FHM benchmark, accuracy rose from 79.9% to 82.0%; on ArMeme, macro-F1 increased by 7.6 points to 0.612. The approach also generates natural-language explanations for predictions.

iGEN Editorial

June 16, 2026

Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone, according to a new paper on arXiv. Although thinking-based multimodal large language models (MLLMs) have advanced vision-language understanding, their application to meme content moderation remains underexplored. The authors, Mohamed Bayan Kmainasi, Mucahid Kutlu, Ali Ezzat Shahroor, Abul Hasnat, and Firoj Alam, propose a reinforcement learning-based post-training method that uses task-specific rewards and Group Relative Policy Optimization (GRPO) to improve both classification performance and reference-based explanation quality.

Technical Approach: GRPO and Chain-of-Thought Supervision

The researchers conducted a systematic empirical study of off-the-shelf MLLMs for hateful and propagandistic meme understanding across English and Arabic benchmarks. They extended existing meme datasets in two ways: with weakly supervised chain-of-thought (CoT) rationales obtained via distillation, and with fine-grained propaganda annotations produced by multiple LLMs. The core contribution is a GRPO-based objective with thinking-length regularization that jointly optimizes classification accuracy and explanation quality. They also investigated self-supervised GRPO on unlabeled memes using consensus-based pseudo-labels.
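To make the moving parts concrete, here is a minimal Python sketch of three ideas named above: a consensus-based pseudo-label, a composite reward with a thinking-length penalty, and GRPO-style group-relative advantages. It is an illustration under stated assumptions, not the authors' implementation. The function names, the reward weights, the token budget, the agreement threshold, and the use of the population standard deviation are all hypothetical choices made for this example.

```python
from collections import Counter
from statistics import mean, pstdev


def consensus_pseudo_label(sampled_labels, min_agreement=0.6):
    """Majority-vote label over sampled predictions.

    Returns None when the most common label falls below the
    (hypothetical) agreement threshold, so the meme can be skipped.
    """
    label, count = Counter(sampled_labels).most_common(1)[0]
    return label if count / len(sampled_labels) >= min_agreement else None


def reward(pred_label, target_label, explanation_score, thinking_tokens,
           max_thinking_tokens=256, w_explain=0.5, w_length=0.2):
    """Hypothetical composite reward.

    correctness (0 or 1) + weighted explanation score in [0, 1]
    - weighted penalty for exceeding the thinking-token budget.
    """
    correctness = 1.0 if pred_label == target_label else 0.0
    overflow = max(0, thinking_tokens - max_thinking_tokens) / max_thinking_tokens
    return correctness + w_explain * explanation_score - w_length * min(overflow, 1.0)


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of sampled responses for a single meme:
# (predicted label, explanation score, number of thinking tokens).
group = [
    ("hateful", 0.8, 120),
    ("hateful", 0.4, 400),
    ("not_hateful", 0.6, 90),
    ("hateful", 0.7, 200),
]

# Self-supervised setting: the target is the group's own consensus.
target = consensus_pseudo_label([label for label, _, _ in group])
rewards = [reward(label, target, score, n) for label, score, n in group]
advantages = group_relative_advantages(rewards)

for (label, _, _), r, a in zip(group, rewards, advantages):
    print(f"{label:12s} reward={r:.3f} advantage={a:+.3f}")
```

In this toy group, the response that disagrees with the consensus receives the lowest advantage, and the response that overruns the token budget is ranked below the shorter correct ones. In a labeled setting, `target` would be the gold label rather than the consensus.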

Benchmark Results on FHM and ArMeme

Experiments on the Hateful Memes (FHM) and ArMeme benchmarks showed gains over previously reported results. On FHM, accuracy rose from 79.9% to 82.0% (+2.1 percentage points). On ArMeme, macro-F1 rose from 0.536 to 0.612 with explanations (+7.6 points, and +6.1 points relative to the original ArMeme benchmark). The method also generates natural-language explanations for its predictions.

Metric	Baseline	Proposed Approach	Improvement
FHM Accuracy	79.9%	82.0%	+2.1 points
ArMeme Macro-F1 (with explanations)	0.536	0.612	+7.6 points
ArMeme Macro-F1 (vs original benchmark)	—	—	+6.1 points

On ArMeme, sequence-classification baselines remain stronger in terms of raw accuracy, whereas the GRPO-based approach provides more balanced per-class performance along with explanations.
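The gap between accuracy and macro-F1 is easiest to see on an imbalanced toy example. The sketch below uses made-up labels and predictions (not ArMeme data) to show how a classifier that favors the majority class can score higher accuracy while a more balanced classifier scores higher macro-F1, which averages per-class F1 with equal weight per class.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced data: 8 majority-class items, 2 minority-class items.
y_true = ["A"] * 8 + ["B"] * 2

# Classifier 1 always predicts the majority class.
majority_pred = ["A"] * 10

# Classifier 2 catches every minority item but mislabels three majority items.
balanced_pred = ["A"] * 5 + ["B"] * 3 + ["B"] * 2

for name, pred in [("majority", majority_pred), ("balanced", balanced_pred)]:
    print(f"{name:9s} accuracy={accuracy(y_true, pred):.3f} "
          f"macro-F1={macro_f1(y_true, pred):.3f}")
```

Here the majority-only classifier reaches 0.8 accuracy but earns an F1 of zero on class B, which pulls its macro-F1 below that of the balanced classifier.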

Generating Natural-Language Explanations

A key advantage of the method is that it produces a natural-language explanation for each classification decision. This can make the model's decisions more transparent and easier to audit, an important requirement for enterprise content moderation systems.

Implications for Enterprise AI Content Moderation

For enterprises operating social platforms or customer communications channels, a model that flags harmful memes accurately and explains its reasoning may help reduce false positives and improve moderation efficiency. The researchers publicly released their code, data extensions, and evaluation resources, enabling adoption and further refinement by the developer community.
