Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models

A new method called Safe Trigger leverages the latent safety awareness of Large Reasoning Models to improve safety alignment without external data. Using Supervised Fine-Tuning and Direct Preference Optimization, the approach reduces Attack Success Rate on harmful and jailbreak benchmarks while preserving general performance.

iGEN Editorial

June 16, 2026

Large Reasoning Models (LRMs) are highly capable at complex tasks, yet remain vulnerable to sophisticated jailbreaks and direct harmful queries. According to a paper on arXiv by Ke Miao, Jiaxin Li, Hongliang Chen, Yuke Hu, and Zhan Qin, prior safety alignment methods depend heavily on external, manually annotated data. However, the researchers observed that LRMs can identify safety risks on their own when re-presented with the original query alongside their own reasoning trajectory, a capability they term Latent Safety Awareness.

To exploit this, the team proposed a two-stage training approach called Safe Trigger. First, Supervised Fine-Tuning (SFT) teaches the model to emit explicit safe tags after its initial reasoning on unsafe queries; the tags trigger a safety analysis followed by safe guidance. For general queries, the standard response is preserved, so the trigger fires adaptively rather than on every input. Second, Direct Preference Optimization (DPO) further improves the correctness and stability of the safety analysis and guidance. Notably, the responses used in both training stages are generated entirely by the models being optimized, eliminating the need for external annotation.
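The two stages can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the tag spelling (`<safe>`, `</safe>`), the `Sample` fields, and the `is_unsafe` flag are hypothetical names chosen here, and `dpo_loss` is the standard DPO objective for a single preference pair rather than anything specific to Safe Trigger.

```python
import math
from dataclasses import dataclass

# Hypothetical tag spelling; the article only says "safe tags".
SAFE_OPEN, SAFE_CLOSE = "<safe>", "</safe>"


@dataclass
class Sample:
    query: str
    reasoning: str             # the model's own initial reasoning trajectory
    is_unsafe: bool            # hypothetical flag from the model's self-check
    safety_analysis: str = ""  # model-generated; used only for unsafe queries
    response: str = ""         # safe guidance (unsafe) or standard answer (general)


def build_sft_target(s: Sample) -> str:
    """Stage 1 target: add the safe-tagged analysis only for unsafe queries."""
    if s.is_unsafe:
        tagged = f"{SAFE_OPEN}{s.safety_analysis}{SAFE_CLOSE}"
        return f"{s.reasoning}\n{tagged}\n{s.response}"
    # General queries keep the standard response, so the trigger stays adaptive.
    return f"{s.reasoning}\n{s.response}"


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Stage 2: standard DPO loss for one pair of sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    unsafe = Sample("q", "think", True, "risk", "refuse")
    print(build_sft_target(unsafe))
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In this sketch the loss falls below log 2 whenever the policy prefers the chosen response more strongly than the reference model does, which is the direction DPO pushes during the second stage.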

The reported experiments show a substantial safety improvement. The Attack Success Rate (ASR) of DeepSeek-R1-Distill-Llama-8B dropped, on average, by 24.65% on harmful benchmarks and by 36.72% on jailbreak benchmarks. According to the paper, the method has almost no negative impact on general performance or user experience.

Benchmark type	Average ASR reduction
Harmful	24.65%
Jailbreak	36.72%
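For readers unfamiliar with the metric, ASR is generally the fraction of attack prompts that elicit a harmful response. The short Python sketch below shows that calculation and the two common ways of stating a reduction. The function names and the sample numbers are made up for illustration, and the article does not say whether the reported figures are absolute percentage points or relative reductions.

```python
def attack_success_rate(outcomes):
    """Fraction of attack prompts judged to have produced a harmful response."""
    if not outcomes:
        raise ValueError("need at least one attack outcome")
    return sum(bool(o) for o in outcomes) / len(outcomes)


def asr_reduction(before, after):
    """Return (absolute, relative) reduction between two ASR values in [0, 1]."""
    absolute = before - after
    relative = absolute / before if before else 0.0
    return absolute, relative


if __name__ == "__main__":
    # Illustrative outcomes only, not data from the paper.
    baseline = attack_success_rate([True, True, False, False])    # 2 of 4 succeed
    aligned = attack_success_rate([True, False, False, False])    # 1 of 4 succeeds
    print(asr_reduction(baseline, aligned))
```

With these made-up inputs, the same change reads as 25 percentage points in absolute terms and 50% in relative terms, which is why the distinction matters when comparing reported reductions.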

The paper argues that the Safe Trigger method leverages the model's own latent safety awareness, reducing reliance on external data. The approach could be adapted for enterprise AI deployments where safety alignment is critical, such as supply chain decision-support systems or customer-facing logistics chatbots. If safety analysis can be triggered without compromising general performance, organizations may be able to deploy LRMs with greater confidence, especially in regulated environments.

Future work may explore extending Safe Trigger to other model families and real-world testing scenarios. The researchers have made their findings available on arXiv for community review.

Sources:

Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models
