
Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models

A new method called Safe Trigger leverages the latent safety awareness of Large Reasoning Models to improve safety alignment without external data. Using Supervised Fine-Tuning and Direct Preference Optimization, the approach reduces Attack Success Rate on harmful and jailbreak benchmarks while preserving general performance.

iGEN Editorial
June 16, 2026
Large Reasoning Models (LRMs) are highly capable at complex tasks, yet remain vulnerable to sophisticated jailbreaks and direct harmful queries. According to a paper on arXiv by Ke Miao, Jiaxin Li, Hongliang Chen, Yuke Hu, and Zhan Qin, prior safety alignment methods depend heavily on external, manually annotated data. However, the researchers observed that LRMs can inherently identify safety risks when re-presented with the original queries alongside their own reasoning trajectories, a capability they term Latent Safety Awareness.

To exploit this, the team proposed a two-stage training approach called Safe Trigger. First, they use Supervised Fine-Tuning (SFT) to explicitly induce safe tags that trigger safety analysis and guidance following the initial reasoning content for unsafe queries. For general queries, standard responses are preserved, ensuring adaptive triggering. Second, they apply Direct Preference Optimization (DPO) to further enhance the correctness and stability of the safety analysis and guidance. Notably, the responses required for both training stages are entirely generated by the models being optimized, eliminating the need for external annotation.
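The two stages can be pictured as two data-construction steps over the model's own outputs. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation: the `<safe>` tag spelling, the record layouts, and the `is_correct` judge callback are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Assumed tag spelling; the article only says "safe tags".
SAFE_OPEN, SAFE_CLOSE = "<safe>", "</safe>"


@dataclass
class SFTExample:
    prompt: str
    target: str


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def build_sft_example(query: str, reasoning: str, answer: str,
                      safety_analysis: Optional[str]) -> SFTExample:
    """Stage 1 (SFT): unsafe queries get a tagged safety analysis after the
    initial reasoning; general queries keep the standard response."""
    if safety_analysis is None:
        # General query: no safe tag, so triggering stays adaptive.
        target = f"{reasoning}\n{answer}"
    else:
        target = (f"{reasoning}\n{SAFE_OPEN}{safety_analysis}{SAFE_CLOSE}"
                  f"\n{answer}")
    return SFTExample(prompt=query, target=target)


def build_preference_pair(
    query: str,
    candidates: List[str],
    is_correct: Callable[[str, str], bool],
) -> Optional[PreferencePair]:
    """Stage 2 (DPO): pair a self-generated response judged correct with one
    judged incorrect. Returns None when no contrast exists."""
    good = [c for c in candidates if is_correct(query, c)]
    bad = [c for c in candidates if not is_correct(query, c)]
    if not good or not bad:
        return None
    return PreferencePair(prompt=query, chosen=good[0], rejected=bad[0])
```

In this sketch every `reasoning`, `answer`, and candidate string is assumed to come from the model being trained, which mirrors the article's point that no external annotation is needed. How the paper actually judges the correctness of a safety analysis is not described here, so `is_correct` stands in for that step.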

Experimental results demonstrate significant safety enhancement. The Attack Success Rate (ASR) of DeepSeek-R1-Distill-Llama-8B dropped, on average, by 24.65% on harmful benchmarks and by 36.72% on jailbreak benchmarks. The method exerts almost no negative impact on general performance or user experience.

Benchmark    Average ASR Reduction
Harmful      24.65%
Jailbreak    36.72%
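For readers unfamiliar with the metric, ASR is the share of attack prompts that succeed in eliciting a harmful response. The Python sketch below shows one way an average reduction across benchmarks could be computed. It assumes the reduction is measured in percentage points, which the article does not specify, and the benchmark names and numbers are made-up placeholders rather than results from the paper.

```python
from typing import Dict, Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of attack prompts that elicited a harmful response."""
    if not outcomes:
        raise ValueError("need at least one attack outcome")
    return 100.0 * sum(bool(o) for o in outcomes) / len(outcomes)


def average_reduction(before: Dict[str, float],
                      after: Dict[str, float]) -> float:
    """Mean per-benchmark drop in ASR, in percentage points (an assumption)."""
    if not before or before.keys() != after.keys():
        raise ValueError("before/after must cover the same benchmarks")
    drops = [before[name] - after[name] for name in before]
    return sum(drops) / len(drops)


# Illustrative placeholder values only, not figures from the paper.
baseline = {"bench_a": 40.0, "bench_b": 60.0}
trained = {"bench_a": 10.0, "bench_b": 20.0}
print(attack_success_rate([True, False, False, True]))  # 50.0
print(average_reduction(baseline, trained))             # 35.0
```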

The paper argues that the Safe Trigger method leverages the model's own latent safety awareness, reducing reliance on external data. This approach could be adapted for enterprise AI deployments where safety alignment is critical, such as supply chain decision-support systems or customer-facing logistics chatbots. Because safety analysis is triggered without compromising general performance, organizations could deploy LRMs with greater confidence, especially in regulated environments.

Future work may explore extending Safe Trigger to other model families and real-world testing scenarios. The researchers have made their findings available on arXiv for community review.

