Enterprises deploying AI agents for complex multi-step tasks often rely on large models that are too expensive for real-time inference. Knowledge distillation, which transfers capabilities from a large teacher model to a smaller student, is a common solution. But in multi-turn interactions, student errors can accumulate, pushing the agent into states the teacher never encountered—rendering the teacher's guidance useless at the moment it is most needed.
According to a new paper on arXiv, a research team has developed a method called Guided On-Policy Distillation (Guided-OPD) that directly tackles this compounding error problem. The technique is designed for multi-turn agents that plan, invoke tools, and interact with environments.
The Problem with Multi-Turn Distillation
Standard On-Policy Distillation (OPD) works by having the student interact with an environment and using the teacher to supervise its actions. However, the researchers found a characteristic failure mode: small student errors across turns push the trajectory out of the teacher's familiar state distribution. As the paper states, "the teacher's supervision becomes least reliable precisely where the student needs it most." This undermines the entire distillation process.
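To get a feel for why small per-turn errors matter, consider a deliberately simplified back-of-the-envelope model that is not taken from the paper: assume each turn independently keeps the trajectory inside the teacher's familiar distribution with some fixed probability. The function name and the 0.95 figure below are illustrative assumptions only.

```python
def prob_in_distribution(per_turn_stay: float, turns: int) -> float:
    """Chance that all `turns` turns stay in-distribution.

    Assumes (hypothetically) that turns are independent and share the
    same per-turn probability; real agents will not be this tidy.
    """
    return per_turn_stay ** turns


if __name__ == "__main__":
    # Illustrative value: the student stays on track 95% of the time per turn.
    for turns in (1, 5, 10, 20):
        print(turns, round(prob_in_distribution(0.95, turns), 3))
```

Under these toy assumptions, a student that stays on track 95% of the time per turn finishes a 20-turn episode fully in-distribution only about 36% of the time, which is the kind of drift the researchers describe.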
How Guided-OPD Works
Guided-OPD mixes teacher-generated turns and student-generated turns within each rollout. The probability that the teacher intervenes on a given turn follows a curriculum that decays over the course of training. Early on, heavy guidance keeps the student's trajectory close to the teacher's distribution, so the teacher's supervision stays reliable. As training progresses, the intervention is gradually withdrawn, eventually restoring the purely on-policy regime used at inference.
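The mixing idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear decay schedule, the per-turn coin flip, and every name here (`intervention_prob`, `mixed_rollout`, the toy policies and environment) are hypothetical, and the article does not specify how the distillation loss is applied to the resulting trajectory.

```python
import random
from typing import Callable, List, Tuple

State = int
Action = str


def intervention_prob(step: int, total_steps: int,
                      p_start: float = 0.9, p_end: float = 0.0) -> float:
    """Teacher intervention probability at a training step.

    Assumption: a linear decay from p_start to p_end. The source only
    says the probability follows a decaying curriculum.
    """
    if total_steps <= 1:
        return p_end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return p_start + (p_end - p_start) * frac


def mixed_rollout(student_act: Callable[[State], Action],
                  teacher_act: Callable[[State], Action],
                  env_step: Callable[[State, Action], Tuple[State, bool]],
                  initial_state: State,
                  max_turns: int,
                  p_teacher: float,
                  rng: random.Random) -> List[Tuple[str, State, Action]]:
    """Roll out one episode, choosing teacher or student per turn."""
    state = initial_state
    trajectory = []
    for _ in range(max_turns):
        # Per-turn coin flip: the teacher takes the turn with probability p_teacher.
        use_teacher = rng.random() < p_teacher
        actor = "teacher" if use_teacher else "student"
        action = teacher_act(state) if use_teacher else student_act(state)
        trajectory.append((actor, state, action))
        state, done = env_step(state, action)
        if done:
            break
    return trajectory


def toy_env_step(state: State, action: Action) -> Tuple[State, bool]:
    """Hypothetical environment: the episode ends after three turns."""
    next_state = state + 1
    return next_state, next_state >= 3


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 50, 99):
        p = intervention_prob(step, total_steps=100)
        episode = mixed_rollout(lambda s: "student_action",
                                lambda s: "teacher_action",
                                toy_env_step, 0, 10, p, rng)
        print(step, round(p, 2), [actor for actor, _, _ in episode])
```

With `p_teacher` at 0.0 the rollout is purely on-policy, which corresponds to the end of the curriculum and to how the student runs at inference.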
Experimental Results
The team evaluated Guided-OPD on three benchmarks: ALFWorld, ScienceWorld, and WebShop. Using a Qwen3-30B-A3B teacher model and distilling Qwen3 student models, Guided-OPD improved average Score by 21.1% and average Success Rate by 25.5% compared to vanilla OPD. Notably, gains were larger for smaller student models, making the technique especially attractive for cost-sensitive deployments.
| Metric | Improvement over vanilla OPD |
|---|---|
| Average Score | 21.1% |
| Average Success Rate | 25.5% |
The paper, listed on arXiv under computer science and machine learning, does not name the research institution or give a timeline for commercial availability. The gains reported here are relative to vanilla OPD, so the results suggest that enterprises running large language models for customer service, code generation, or tool-use agents could cut inference costs while giving up less task success than standard distillation would cost them.
Implications for Enterprise Automation
Multi-turn agents are increasingly used in customer support, supply chain optimization, and internal knowledge retrieval. Deploying a smaller, faster student that retains more of a much larger model's capability, without the expensive hardware the larger model requires, could lower operational costs. By keeping the teacher's supervision reliable early in training and then withdrawing it, the curriculum offers a practical way to make distilled multi-turn agents more dependable.