
Reinforcement Learning with Chain-of-Thought Supervision Boosts Hateful Meme Detection Accuracy by Over 2%

A new reinforcement learning-based post-training method using Group Relative Policy Optimization and chain-of-thought supervision improves hateful and propagandistic meme detection. On the FHM benchmark, accuracy rose from 79.9% to 82.0%; on ArMeme, macro-F1 increased by 7.6 points to 0.612. The approach also generates natural-language explanations for predictions.

iGEN Editorial
June 16, 2026

Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone, according to a new paper on arXiv. Although thinking-based multimodal large language models (MLLMs) have advanced vision-language understanding, their application to meme content moderation remains underexplored. The authors (Mohamed Bayan Kmainasi, Mucahid Kutlu, Ali Ezzat Shahroor, Abul Hasnat, and Firoj Alam) propose a reinforcement learning-based post-training method that uses task-specific rewards and Group Relative Policy Optimization (GRPO) to improve both classification performance and reference-based explanation quality.

Technical Approach: GRPO and Chain-of-Thought Supervision

The researchers conducted a systematic empirical study of off-the-shelf MLLMs for hateful and propagandistic meme understanding across English and Arabic benchmarks. They extended existing meme datasets with weakly supervised chain-of-thought (CoT) rationales via distillation and multi-LLM fine-grained propaganda annotations. The core contribution is a GRPO-based objective with thinking-length regularization that jointly optimizes classification accuracy and explanation quality. Additionally, they investigated self-supervised GRPO on unlabeled memes using consensus-based pseudo-labels.
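To make the mechanics concrete, here is a minimal sketch of two ideas from the paragraph above: group-relative advantages computed from a reward that penalizes long reasoning, and consensus-based pseudo-labeling. It is not the authors' implementation. The reward shape, `target_len`, `length_weight`, `min_agreement`, and the label strings are all hypothetical, and the paper's explanation-quality reward term is omitted.

```python
from collections import Counter
from statistics import mean, pstdev


def reward(pred, gold, n_think_tokens, target_len=128, length_weight=0.1):
    """Hypothetical reward: 1 for a correct label, minus a penalty that
    grows with how far the reasoning overshoots target_len tokens."""
    correct = 1.0 if pred == gold else 0.0
    overshoot = max(0, n_think_tokens - target_len) / target_len
    return correct - length_weight * overshoot


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the other samples in its group,
    which is the core idea behind GRPO's critic-free advantage estimate.
    Population standard deviation is an assumption made here."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def consensus_pseudo_label(preds, min_agreement=0.75):
    """Return the majority label if enough samples agree, else None
    (the unlabeled meme would then be skipped)."""
    label, count = Counter(preds).most_common(1)[0]
    return label if count / len(preds) >= min_agreement else None


# Four sampled responses for one meme: (predicted label, reasoning length).
group = [("hateful", 100), ("hateful", 300), ("not_hateful", 90), ("hateful", 120)]
rewards = [reward(p, "hateful", n) for p, n in group]
advantages = group_relative_advantages(rewards)
pseudo = consensus_pseudo_label([p for p, _ in group])
```

In this toy group, the correct but verbose second response earns a smaller advantage than the correct and concise ones, and the wrong response earns the lowest. With three of four samples agreeing, the pseudo-label would be accepted at the assumed 0.75 threshold.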

Benchmark Results on FHM and ArMeme

Experiments on the Hateful Memes (FHM) and ArMeme benchmarks showed gains over previously reported results. On FHM, accuracy rose from 79.9% to 82.0%, a gain of 2.1 percentage points. On ArMeme, macro-F1 with explanations rose from 0.536 to 0.612, a gain of 7.6 points, or 6.1 points over the original ArMeme benchmark result. The method also generates natural-language explanations for its predictions.

| Metric | Baseline | Proposed Approach | Improvement |
| FHM Accuracy | 79.9% | 82.0% | +2.1 points |
| ArMeme Macro-F1 (with explanations) | 0.536 | 0.612 | +7.6 points |
| ArMeme Macro-F1 (vs original benchmark) | | | +6.1 points |
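The reported gains are absolute differences (points), which are easy to confuse with relative percentage changes. The short sketch below recomputes both from the figures above; the helper names are illustrative only.

```python
def point_gain(before, after, scale=1.0):
    """Absolute gain, optionally rescaled (e.g. 100 to turn F1 into points)."""
    return round((after - before) * scale, 1)


def relative_gain_pct(before, after):
    """Gain expressed as a percentage of the starting value."""
    return round(100 * (after - before) / before, 1)


fhm_points = point_gain(79.9, 82.0)                  # accuracy, already in %
fhm_relative = relative_gain_pct(79.9, 82.0)
armeme_points = point_gain(0.536, 0.612, scale=100)  # macro-F1 on a 0-1 scale
armeme_relative = relative_gain_pct(0.536, 0.612)

print(fhm_points, fhm_relative)        # 2.1 2.6
print(armeme_points, armeme_relative)  # 7.6 14.2
```

So the FHM improvement is 2.1 percentage points (about 2.6% relative), and the ArMeme improvement is 7.6 macro-F1 points (about 14.2% relative).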

On ArMeme, sequence-classification baselines remain stronger in terms of raw accuracy, whereas the GRPO-based approach provides more balanced per-class performance along with explanations.
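A model can win on accuracy and still lose on macro-F1 when classes are imbalanced, because macro-F1 weights every class equally. The toy example below illustrates this with made-up labels and predictions; none of the data comes from ArMeme or from the paper.

```python
def accuracy(gold, pred):
    """Fraction of examples labeled correctly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in sorted(set(gold) | set(pred)):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced test set: 8 majority-class, 2 minority-class items.
gold = ["benign"] * 8 + ["harmful"] * 2

# Model X always predicts the majority class.
pred_x = ["benign"] * 10
# Model Y catches both minority items but mislabels 3 majority items.
pred_y = ["benign"] * 5 + ["harmful"] * 3 + ["harmful"] * 2

print(accuracy(gold, pred_x), round(macro_f1(gold, pred_x), 3))  # 0.8 0.444
print(accuracy(gold, pred_y), round(macro_f1(gold, pred_y), 3))  # 0.7 0.67
```

Model X has the higher accuracy, but Model Y has the higher macro-F1 because it performs reasonably on both classes. This is the kind of trade-off the article describes between the sequence-classification baselines and the GRPO-based approach.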

Generating Natural-Language Explanations

A key advantage of the method is that it produces a natural-language explanation for each classification decision. Exposing the reasoning behind a label makes individual decisions easier to review and audit, which matters for enterprise content moderation systems.

Implications for Enterprise AI Content Moderation

For enterprises operating social platforms or customer communications channels, a model that flags harmful memes and explains its reasoning could help reviewers catch false positives and moderate more efficiently. The researchers publicly released their code, data extensions, and evaluation resources, enabling adoption and further refinement by the developer community.

