Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone, according to a new paper on arXiv. Although thinking-based multimodal large language models (MLLMs) have advanced vision-language understanding, their application to meme content moderation remains underexplored. The authors, Mohamed Bayan Kmainasi, Mucahid Kutlu, Ali Ezzat Shahroor, Abul Hasnat, and Firoj Alam, propose a reinforcement learning-based post-training method that uses task-specific rewards and Group Relative Policy Optimization (GRPO) to improve both classification performance and reference-based explanation quality.
Technical Approach: GRPO and Chain-of-Thought Supervision
The researchers conducted a systematic empirical study of off-the-shelf MLLMs for hateful and propagandistic meme understanding across English and Arabic benchmarks. They extended existing meme datasets with weakly supervised chain-of-thought (CoT) rationales via distillation and multi-LLM fine-grained propaganda annotations. The core contribution is a GRPO-based objective with thinking-length regularization that jointly optimizes classification accuracy and explanation quality. Additionally, they investigated self-supervised GRPO on unlabeled memes using consensus-based pseudo-labels.
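To make the two GRPO ideas in this paragraph concrete, here is a minimal Python sketch. It is not the authors' implementation: the reward shape, the `max_think` budget, the `length_weight`, and the `min_agreement` threshold are all hypothetical, and the explanation-quality reward term is omitted because the article does not specify it. The only part taken from GRPO itself is the standardization of each reward against the other samples drawn for the same input.

```python
from collections import Counter
from statistics import mean, pstdev


def reward(pred, gold, n_think_tokens, max_think=256, length_weight=0.2):
    """Hypothetical task reward: 1.0 for a correct label, minus a capped
    penalty when the thinking trace exceeds an assumed token budget."""
    correct = 1.0 if pred == gold else 0.0
    overflow = max(0, n_think_tokens - max_think) / max_think
    return correct - length_weight * min(overflow, 1.0)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward against the mean and
    standard deviation of its own sampling group (population std assumed)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def consensus_pseudo_label(sampled_labels, min_agreement=0.6):
    """Majority vote over sampled predictions for an unlabeled meme.
    Returns None when agreement falls below the assumed threshold."""
    label, count = Counter(sampled_labels).most_common(1)[0]
    return label if count / len(sampled_labels) >= min_agreement else None


# One meme, four sampled completions: (predicted label, thinking tokens).
samples = [("hateful", 120), ("not-hateful", 90), ("hateful", 400), ("hateful", 200)]
gold = "hateful"

rewards = [reward(pred, gold, n_tokens) for pred, n_tokens in samples]
advantages = group_relative_advantages(rewards)

# Self-supervised variant: derive the target from the samples themselves.
pseudo = consensus_pseudo_label([pred for pred, _ in samples])

print(rewards)
print(advantages)
print(pseudo)
```

In this sketch the wrong answer receives a negative advantage, and the correct but overlong answer is ranked below the correct, concise ones. In the pseudo-label case, `pseudo` would stand in for `gold` when computing rewards on unlabeled memes.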
Benchmark Results on FHM and ArMeme
Experiments on the Hateful Memes (FHM) and ArMeme benchmarks showed gains over previously reported results. On FHM, accuracy rose from 79.9% to 82.0% (+2.1 percentage points). On ArMeme, macro-F1 rose from 0.536 to 0.612 with explanations (+7.6 points, and +6.1 points relative to the original ArMeme benchmark). The method also generates natural-language explanations for its predictions.
| Metric | Baseline | Proposed Approach | Improvement |
|---|---|---|---|
| FHM Accuracy | 79.9% | 82.0% | +2.1 points |
| ArMeme Macro-F1 (with explanations) | 0.536 | 0.612 | +7.6 points |
| ArMeme Macro-F1 (vs original benchmark) | — | — | +6.1 points |
On ArMeme, sequence-classification baselines remain stronger in terms of raw accuracy, whereas the GRPO-based approach provides more balanced per-class performance along with explanations.
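The gap between raw accuracy and macro-F1 is easiest to see on imbalanced data. The sketch below uses invented toy labels (not ArMeme data) and two hypothetical classifiers: one that always predicts the majority class, and one that trades some majority-class accuracy for recall on the minority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy, imbalanced ground truth: 8 majority-class items, 2 minority-class items.
y_true = ["neg"] * 8 + ["pos"] * 2

# Hypothetical classifier A: always predicts the majority class.
majority_pred = ["neg"] * 10

# Hypothetical classifier B: catches both minority items but misfires on 3 majority items.
balanced_pred = ["neg"] * 5 + ["pos"] * 3 + ["pos"] * 2

for name, pred in [("majority", majority_pred), ("balanced", balanced_pred)]:
    print(name, round(accuracy(y_true, pred), 3), round(macro_f1(y_true, pred), 3))
```

Classifier A has the higher accuracy here, while classifier B has the higher macro-F1 because it does not ignore the minority class. That is the kind of trade-off the ArMeme comparison describes.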
Generating Natural-Language Explanations
A key advantage of the method is that it produces a natural-language explanation for each classification decision. Exposing the reasoning behind a label makes the model's output easier to audit and contest, which matters for enterprise content moderation systems.
Implications for Enterprise AI Content Moderation
For enterprises operating social platforms or customer communications channels, the ability to accurately flag harmful memes while explaining the reasoning can reduce false positives and improve moderation efficiency. The researchers publicly released their code, data extensions, and evaluation resources, enabling adoption and further refinement by the developer community.