Enterprise technology leaders training large language models (LLMs) face a critical challenge: understanding how different training methods shape model behavior. On-policy distillation (OPD) is increasingly used to improve LLM reasoning, but its training dynamics have remained poorly understood. A new research paper published on arXiv provides a detailed analysis of OPD's update geometry in parameter space, comparing it with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR).
The paper, by Zhennan Shen, Yanshu Li, Qingyu Yin, Chak Tou Leong, Zhilin Wang, Yanxu Chen, Rongduo Han, Sunbowen Lee, and Yi R. Fung, characterizes the trajectory of OPD updates and finds that they occupy a distinct regime. According to the paper, a suite of parameter-space diagnostics consistently places OPD in a "relaxed off-principal" regime: compared with SFT, its updates affect fewer weights and avoid principal directions more strongly, while compared with RLVR, they remain less tightly constrained.
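To make these two diagnostics concrete, here is a minimal NumPy sketch of how update sparsity and principal-direction alignment could be measured for a single weight matrix. It is an illustration under assumptions, not the paper's code: the function names `update_density` and `principal_energy_fraction`, the tolerance `tol`, the cutoff `k`, the reading of "principal directions" as the top singular directions of the pre-update weights, and the random toy matrices are all hypothetical. The paper's own diagnostics may be defined differently.

```python
import numpy as np

def update_density(w_before, w_after, tol=1e-8):
    """Fraction of weight entries that changed by more than tol (assumed metric)."""
    delta = w_after - w_before
    return float(np.mean(np.abs(delta) > tol))

def principal_energy_fraction(w_before, w_after, k):
    """Share of the update's squared Frobenius norm that lies in the top-k
    singular subspaces of the pre-update weights (assumed metric).
    Values near 0 mean the update avoids the principal directions."""
    delta = w_after - w_before
    u, _, vt = np.linalg.svd(w_before, full_matrices=False)
    u_k, v_k = u[:, :k], vt[:k, :].T
    projected = u_k.T @ delta @ v_k  # k x k coordinates in the principal basis
    total = np.sum(delta ** 2)
    return float(np.sum(projected ** 2) / total) if total > 0 else 0.0

# Toy data (hypothetical): a random matrix stands in for a pretrained layer.
rng = np.random.default_rng(0)
w0 = rng.normal(size=(64, 32))
u, s, vt = np.linalg.svd(w0, full_matrices=False)

# Update along the largest singular direction ("principal").
w_principal = w0 + 0.01 * np.outer(u[:, 0], vt[0])
# Update along the smallest singular direction ("off-principal").
w_off = w0 + 0.01 * np.outer(u[:, -1], vt[-1])
# Update that touches only about 5% of the entries.
mask = rng.random(w0.shape) < 0.05
w_sparse = w0 + 0.01 * mask

print("principal update:", principal_energy_fraction(w0, w_principal, k=4))
print("off-principal update:", principal_energy_fraction(w0, w_off, k=4))
print("sparse update density:", update_density(w0, w_sparse))
```

On real checkpoints, the same two functions would be applied layer by layer to the weights before and after training. A method in the off-principal regime would be expected to show a low density and a low principal energy fraction, under the assumptions stated above.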
Key Findings: Update Geometry and Subspace Locking
Beyond static localization, the authors observed that OPD exhibits subspace locking: its cumulative updates rapidly enter a narrow low-dimensional channel. The paper reports that constraining training to the update subspace formed early in training preserves OPD performance but substantially degrades SFT, indicating that the locked subspace is functionally sufficient for OPD.
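The constrained-training experiment described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the authors' procedure: it assumes the early subspace is taken as the top-`k` left singular vectors of the cumulative update, and the helper names `early_update_basis`, `project_update`, and `retained_fraction`, along with the toy matrices, are hypothetical.

```python
import numpy as np

def early_update_basis(w_init, w_early, k):
    """Orthonormal basis for the top-k left singular directions of the
    cumulative update accumulated early in training (assumed definition)."""
    u, _, _ = np.linalg.svd(w_early - w_init, full_matrices=False)
    return u[:, :k]

def project_update(step, basis):
    """Keep only the component of a proposed update that lies in span(basis)."""
    return basis @ (basis.T @ step)

def retained_fraction(step, basis):
    """Share of the step's squared Frobenius norm that survives projection."""
    total = np.sum(step ** 2)
    if total == 0:
        return 0.0
    return float(np.sum(project_update(step, basis) ** 2) / total)

# Toy data (hypothetical): the early update is rank 2 by construction.
rng = np.random.default_rng(0)
w_init = rng.normal(size=(32, 16))
a = rng.normal(size=(32, 2))
w_early = w_init + 0.01 * a @ rng.normal(size=(2, 16))
basis = early_update_basis(w_init, w_early, k=2)

# A later step that stays in the same channel passes through the projection
# almost unchanged; an unconstrained random step loses most of its energy.
step_locked = a @ rng.normal(size=(2, 16))
step_random = rng.normal(size=(32, 16))

print("locked step retained:", retained_fraction(step_locked, basis))
print("random step retained:", retained_fraction(step_random, basis))
```

In this picture, a method whose updates already live in the early subspace loses little when each step is projected, which is one way to read the paper's report that the constraint preserves OPD performance but substantially degrades SFT.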
| Training method | Weights affected by updates | Avoidance of principal directions | Constraint tightness | Subspace locking |
|---|---|---|---|---|
| OPD | Fewer than SFT | Stronger than SFT | Looser than RLVR | Yes (cumulative updates rapidly enter a narrow low-dimensional channel) |
| SFT | More than OPD | Weaker than OPD | Not reported | Early subspace not sufficient (performance degrades substantially when constrained to it) |
| RLVR | Not reported | Not reported | Tighter than OPD | Not reported |
The paper also details control experiments: sparsifying the update tokens and shifting rollout generation off-policy preserve the rank dynamics, whereas mixing the OPD objective with RLVR changes them. Overall, these results suggest that OPD is not merely an intermediate point between SFT and RLVR, but induces its own update geometry.
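One simple way to track how low-dimensional a cumulative update is across checkpoints is the stable rank, sketched below in NumPy. Using stable rank here is an assumption made for illustration: the article does not say which rank measure the paper uses, and the names `stable_rank` and `rank_trajectory` and the toy checkpoints are hypothetical.

```python
import numpy as np

def stable_rank(matrix):
    """Squared Frobenius norm divided by squared spectral norm.
    It never exceeds the true rank and equals 1 for a rank-1 matrix."""
    s = np.linalg.svd(matrix, compute_uv=False)
    return float(np.sum(s ** 2) / s[0] ** 2) if s[0] > 0 else 0.0

def rank_trajectory(checkpoints):
    """Stable rank of the cumulative update at each checkpoint,
    measured relative to the first checkpoint."""
    w_init = checkpoints[0]
    return [stable_rank(w - w_init) for w in checkpoints[1:]]

# Toy data (hypothetical): every step moves along one fixed rank-1 direction,
# so the cumulative update stays in a one-dimensional channel.
rng = np.random.default_rng(0)
w_init = rng.normal(size=(32, 16))
direction = np.outer(rng.normal(size=32), rng.normal(size=16))
checkpoints = [w_init] + [w_init + 0.01 * t * direction for t in (1, 2, 3)]

trajectory = rank_trajectory(checkpoints)
print(trajectory)
```

Comparing such trajectories between a baseline run and a modified run (for example, with sparsified update tokens) is one plausible way to check whether a change preserves or alters the rank dynamics.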
Implications for Enterprise AI Training
For technology leaders investing in LLM training, these distinctions matter. The research offers a way to diagnose training methods by their parameter-space behavior. The finding that OPD's early update subspace is functionally sufficient suggests potential efficiency gains: in the paper's experiments, constraining training to that subspace preserved OPD performance, whereas the same constraint substantially degraded SFT. The control experiments also show that the rank dynamics survive two changes to the setup, namely sparsifying the update tokens and shifting rollout generation off-policy, which suggests that some practical variations in the training pipeline may leave the update geometry intact.
However, the paper does not quantify computational savings or real-world performance metrics. The authors focused on characterizing update geometry through parameter-space diagnostics and control experiments. Future work may connect these insights to concrete cost reductions or performance improvements in enterprise applications.
As AI adoption accelerates in supply chain, logistics, and trade finance, understanding training dynamics becomes more important for building reliable and efficient models. This research contributes to that understanding by offering a geometric lens on how different training methods shape LLM reasoning capabilities.