Large Reasoning Models (LRMs) are highly capable at complex tasks, yet remain vulnerable to sophisticated jailbreaks and direct harmful queries. According to a paper on arXiv by Ke Miao, Jiaxin Li, Hongliang Chen, Yuke Hu, and Zhan Qin, prior safety alignment methods depend heavily on external, manually annotated data. However, the researchers observed that LRMs can inherently identify safety risks when re-presented with the original query alongside their own reasoning trajectory, a capability they term Latent Safety Awareness.
To exploit this, the team proposed a two-stage training approach called Safe Trigger. First, they use Supervised Fine-Tuning (SFT) to teach the model to emit safe tags after its initial reasoning on unsafe queries; these tags trigger a safety analysis and guidance. For general queries, standard responses are preserved, so the trigger fires adaptively rather than on every input. Second, they apply Direct Preference Optimization (DPO) to further improve the correctness and stability of the safety analysis and guidance. Notably, the responses required for both training stages are generated entirely by the models being optimized, eliminating the need for external annotation.
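The data construction for the two stages might look roughly like the Python sketch below. It is an illustration under stated assumptions, not the authors' implementation: the literal tag strings, the `build_sft_example` and `build_preference_pair` helpers, and the `is_safe_output` judge are all hypothetical, since the summary above does not specify the tag format or how preferred and rejected responses are selected.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical tag strings: the source only says "safe tags", not their literal form.
SAFE_TAG_OPEN = "<safe>"
SAFE_TAG_CLOSE = "</safe>"


@dataclass
class SFTExample:
    prompt: str
    target: str


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def build_sft_example(query: str, reasoning: str, response: str,
                      is_unsafe: bool, safety_analysis: str = "") -> SFTExample:
    """Stage 1 (SFT): add a tagged safety analysis after the reasoning, for unsafe queries only."""
    if is_unsafe:
        target = f"{reasoning}\n{SAFE_TAG_OPEN}{safety_analysis}{SAFE_TAG_CLOSE}\n{response}"
    else:
        # General queries keep the standard response, so the trigger stays adaptive.
        target = f"{reasoning}\n{response}"
    return SFTExample(prompt=query, target=target)


def build_preference_pair(query: str, candidates: List[str],
                          is_safe_output: Callable[[str], bool]) -> Optional[PreferencePair]:
    """Stage 2 (DPO): pair one judged-safe sample with one judged-unsafe sample.

    `is_safe_output` is an assumed judge; the source does not describe the selection rule.
    """
    safe = [c for c in candidates if is_safe_output(c)]
    unsafe = [c for c in candidates if not is_safe_output(c)]
    if not safe or not unsafe:
        return None  # no contrast available, so this query yields no pair
    return PreferencePair(prompt=query, chosen=safe[0], rejected=unsafe[0])
```

In this sketch, every `reasoning`, `response`, `safety_analysis`, and `candidates` string would be sampled from the model being trained, which mirrors the paper's claim that no external annotation is needed.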
Experimental results demonstrate significant safety enhancement. The Attack Success Rate (ASR) of DeepSeek-R1-Distill-Llama-8B dropped, on average, by 24.65% on harmful benchmarks and by 36.72% on jailbreak benchmarks. The method exerts almost no negative impact on general performance or user experience.

| Benchmark type | Average ASR reduction (DeepSeek-R1-Distill-Llama-8B) |
|---|---|
| Harmful | 24.65% |
| Jailbreak | 36.72% |

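
For readers unfamiliar with the metric, the sketch below shows one way ASR and an average reduction across benchmarks could be computed. It assumes the reduction is an absolute difference in percentage points averaged over benchmarks; the summary does not state whether the reported figures are absolute or relative, and the function names and sample numbers here are purely illustrative, not results from the paper.

```python
from typing import Dict, Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of attack prompts that elicited a harmful response."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return 100.0 * sum(outcomes) / len(outcomes)


def average_asr_reduction(before: Dict[str, float], after: Dict[str, float]) -> float:
    """Mean per-benchmark drop in ASR, in percentage points (assumed definition)."""
    drops = [before[name] - after[name] for name in before]
    return sum(drops) / len(drops)


# Illustrative numbers only, not taken from the paper.
baseline = {"bench_a": 40.0, "bench_b": 20.0}
aligned = {"bench_a": 10.0, "bench_b": 10.0}
print(average_asr_reduction(baseline, aligned))  # 20.0
```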
The paper argues that the Safe Trigger method leverages the model's own latent safety awareness, reducing reliance on external data. This approach could be adapted for enterprise AI deployments where safety alignment is critical, such as supply chain decision-support systems or customer-facing logistics chatbots. If safety analysis can be triggered without compromising general performance, organizations can deploy LRMs with greater confidence, especially in regulated environments.
Future work may explore extending Safe Trigger to other model families and real-world testing scenarios. The researchers have made their findings available on arXiv for community review.