New Algorithm for Multi-Turn AI Agents Reduces Compounding Errors in Knowledge Distillation

A new algorithm called Guided On-Policy Distillation (Guided-OPD) addresses the failure mode where small student models compound errors in multi-turn tasks. By mixing teacher and student turns and using a curriculum that decays teacher intervention, the method improves average score by 21.1% and success rate by 25.5% over vanilla OPD.

iGEN Editorial

June 16, 2026

Enterprises deploying AI agents for complex multi-step tasks often rely on large models that are too expensive for real-time inference. Knowledge distillation, which transfers capabilities from a large teacher model to a smaller student, is a common solution. But in multi-turn interactions, student errors can accumulate and push the agent into states the teacher never encountered, which makes the teacher's guidance unreliable at the moment it is most needed.

According to a new paper on arXiv, a research team has developed a method called Guided On-Policy Distillation (Guided-OPD) that directly tackles this compounding error problem. The technique is designed for multi-turn agents that plan, invoke tools, and interact with environments.

The Problem with Multi-Turn Distillation

Standard On-Policy Distillation (OPD) works by having the student interact with an environment and using the teacher to supervise its actions. However, the researchers found a characteristic failure mode: small student errors across turns push the trajectory out of the teacher's familiar state distribution. As the paper states, "the teacher's supervision becomes least reliable precisely where the student needs it most." This undermines the entire distillation process.
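To give a rough sense of why small per-turn errors matter over long interactions, the following is a minimal back-of-the-envelope sketch in Python. It is not taken from the paper: it assumes, purely for illustration, that each turn independently has a fixed probability of pushing the trajectory outside the teacher's familiar state distribution. The function name and parameters are hypothetical.

```python
def in_distribution_probability(per_turn_error: float, num_turns: int) -> float:
    """Probability that a trajectory is still in-distribution after num_turns.

    Illustrative assumption (not from the paper): every turn independently
    leaves the teacher's familiar states with probability per_turn_error.
    """
    if not 0.0 <= per_turn_error <= 1.0:
        raise ValueError("per_turn_error must be between 0 and 1")
    if num_turns < 0:
        raise ValueError("num_turns must be non-negative")
    return (1.0 - per_turn_error) ** num_turns


if __name__ == "__main__":
    # Compare short and long interactions at the same per-turn error rate.
    for turns in (1, 5, 10, 20):
        prob = in_distribution_probability(0.05, turns)
        print(f"{turns:2d} turns -> {prob:.3f}")
```

Under this simplified assumption, a 5% per-turn error rate leaves only about a 36% chance that a 20-turn trajectory is still within the teacher's familiar states. Real agent errors are unlikely to be independent, so the numbers are indicative only.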

How Guided-OPD Works

Guided-OPD mixes teacher-generated turns and student-generated turns within each rollout. The teacher's intervention probability follows a curriculum that decays over time. Early in training, the student's trajectory stays close to the teacher's distribution because of heavy guidance. As training progresses, the intervention is gradually withdrawn, eventually restoring the purely on-policy regime used at inference.
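The mixing scheme described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the paper's implementation: the article does not specify the shape of the decay schedule, so a linear decay is assumed here, and the names `teacher_probability`, `mixed_rollout`, `teacher_act`, `student_act`, and `env_step` are placeholders invented for this example.

```python
import random


def teacher_probability(step, total_steps, p_start=1.0, p_end=0.0):
    """Teacher intervention probability at a given training step.

    Assumption: linear decay from p_start to p_end. The actual schedule
    used by Guided-OPD is not described in the article.
    """
    if total_steps <= 1:
        return p_end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return p_start + (p_end - p_start) * frac


def mixed_rollout(teacher_act, student_act, env_step, initial_state,
                  num_turns, p_teacher, rng):
    """Run one rollout, choosing the acting model independently per turn."""
    state = initial_state
    trajectory = []
    for turn in range(num_turns):
        # With probability p_teacher the teacher generates this turn;
        # otherwise the student acts on its own.
        use_teacher = rng.random() < p_teacher
        actor = "teacher" if use_teacher else "student"
        action = teacher_act(state) if use_teacher else student_act(state)
        trajectory.append((turn, actor, action))
        state = env_step(state, action)
    return trajectory


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-ins: the state is an integer and the action is the next state.
    teacher = lambda s: s + 1
    student = lambda s: s
    step_fn = lambda s, a: a
    for step in (0, 5, 10):
        p = teacher_probability(step, total_steps=11)
        actors = [a for _, a, _ in mixed_rollout(teacher, student, step_fn, 0, 6, p, rng)]
        print(f"step {step}: p_teacher={p:.2f} actors={actors}")
```

At the final training step the intervention probability reaches zero, so every turn comes from the student, which matches the purely on-policy regime the article says is used at inference. How the teacher's supervision signal is then applied to the student's turns is outside the scope of this sketch.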

Metric	Improvement over Vanilla OPD
Average Score	21.1%
Average Success Rate	25.5%

Experimental Results

The team evaluated Guided-OPD on three benchmarks: ALFWorld, ScienceWorld, and WebShop. Using a Qwen3-30B-A3B teacher model and distilling Qwen3 student models, Guided-OPD improved average Score by 21.1% and average Success Rate by 25.5% compared to vanilla OPD. Notably, gains were larger for smaller student models, making the technique especially attractive for cost-sensitive deployments.

The paper, which appears on arXiv under computer science and machine learning, does not name the research institution or provide a timeline for commercial availability. However, the results suggest that enterprises using large language models for customer service, code generation, or tool-use agents could reduce inference costs while giving up less task success than they would with vanilla OPD.

Implications for Enterprise Automation

Multi-turn agents are increasingly used in customer support, supply chain optimization, and internal knowledge retrieval. Deploying a smaller, faster model that approaches the performance of a much larger one, without the need for expensive hardware, could lower operational costs. By keeping early training close to the teacher's distribution and then withdrawing guidance, the curriculum approach offers a practical path to more reliable distilled agents.

