
New Algorithm for Multi-Turn AI Agents Reduces Compounding Errors in Knowledge Distillation

A new algorithm called Guided On-Policy Distillation (Guided-OPD) addresses the failure mode where small student models compound errors in multi-turn tasks. By mixing teacher and student turns and using a curriculum that decays teacher intervention, the method improves average score by 21.1% and success rate by 25.5% over vanilla OPD.

iGEN Editorial
June 16, 2026

Enterprises deploying AI agents for complex multi-step tasks often rely on large models that are too expensive for real-time inference. Knowledge distillation, which transfers capabilities from a large teacher model to a smaller student, is a common solution. But in multi-turn interactions, student errors can accumulate and push the agent into states the teacher never encountered, so the teacher's guidance becomes unreliable at the moment it is most needed.

According to a new paper on arXiv, a research team has developed a method called Guided On-Policy Distillation (Guided-OPD) that directly tackles this compounding error problem. The technique is designed for multi-turn agents that plan, invoke tools, and interact with environments.

The Problem with Multi-Turn Distillation

Standard On-Policy Distillation (OPD) works by having the student interact with an environment and using the teacher to supervise its actions. However, the researchers found a characteristic failure mode: small student errors across turns push the trajectory out of the teacher's familiar state distribution. As the paper states, "the teacher's supervision becomes least reliable precisely where the student needs it most." This undermines the entire distillation process.
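A toy calculation shows why small per-turn errors matter so much over long interactions. It is a simplified illustration rather than the paper's analysis: it assumes each turn independently has the same chance of an error, and that a single error is enough to push the trajectory off the teacher's familiar distribution. The function name and the numbers are hypothetical.

```python
def prob_on_distribution(per_turn_error: float, num_turns: int) -> float:
    """Chance that a rollout has no error in any of its turns.

    Assumes independent, identically likely errors per turn (a simplification).
    """
    if not 0.0 <= per_turn_error <= 1.0:
        raise ValueError("per_turn_error must be in [0, 1]")
    return (1.0 - per_turn_error) ** num_turns


if __name__ == "__main__":
    # With a 5% error chance per turn, longer tasks drift off-distribution quickly.
    for turns in (1, 5, 10, 20):
        print(turns, round(prob_on_distribution(0.05, turns), 3))
```

Under these assumptions, a student that is right 95% of the time on any single turn completes a 20-turn task without error only about 36% of the time.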

How Guided-OPD Works

Guided-OPD mixes teacher-generated turns and student-generated turns within each rollout. The teacher's intervention probability follows a curriculum that decays over time. Early in training, the student's trajectory stays close to the teacher's distribution because of heavy guidance. As training progresses, the intervention is gradually withdrawn, eventually restoring the purely on-policy regime used at inference.
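The mixing scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the article does not give the shape of the decay schedule, so a linear decay is assumed here, and `teacher_prob`, `mixed_rollout`, and the callables passed to them are hypothetical names. How the collected turns are then used to supervise the student is not covered.

```python
import random
from typing import Callable, List, Tuple


def teacher_prob(step: int, total_steps: int,
                 p_start: float = 1.0, p_end: float = 0.0) -> float:
    """Teacher intervention probability at a given training step.

    Linear decay from p_start to p_end is an assumption for illustration.
    """
    if total_steps <= 1:
        return p_end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return p_start + (p_end - p_start) * frac


def mixed_rollout(teacher_act: Callable[[int], int],
                  student_act: Callable[[int], int],
                  env_step: Callable[[int, int], int],
                  initial_state: int,
                  num_turns: int,
                  p_teacher: float,
                  rng: random.Random) -> List[Tuple[str, int, int]]:
    """Roll out one episode, choosing the acting model independently per turn."""
    state = initial_state
    trajectory = []
    for _ in range(num_turns):
        # With probability p_teacher the teacher produces this turn,
        # otherwise the student does.
        use_teacher = rng.random() < p_teacher
        actor = "teacher" if use_teacher else "student"
        action = teacher_act(state) if use_teacher else student_act(state)
        trajectory.append((actor, state, action))
        state = env_step(state, action)
    return trajectory


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 50, 99):
        p = teacher_prob(step, total_steps=100)
        rollout = mixed_rollout(
            teacher_act=lambda s: 1,        # toy teacher: always action 1
            student_act=lambda s: 0,        # toy student: always action 0
            env_step=lambda s, a: s + a,    # toy environment: state is a counter
            initial_state=0, num_turns=8, p_teacher=p, rng=rng)
        print(step, round(p, 2), [actor for actor, _, _ in rollout])
```

At the final step the intervention probability reaches zero, so every turn comes from the student, which matches the purely on-policy regime the article says is used at inference.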

Metric                  Improvement over vanilla OPD
Average Score           21.1%
Average Success Rate    25.5%

Experimental Results

The team evaluated Guided-OPD on three benchmarks: ALFWorld, ScienceWorld, and WebShop. Using a Qwen3-30B-A3B teacher model and distilling Qwen3 student models, Guided-OPD improved average Score by 21.1% and average Success Rate by 25.5% compared to vanilla OPD. Notably, gains were larger for smaller student models, making the technique especially attractive for cost-sensitive deployments.

The paper, which appears on arXiv under computer science and machine learning, does not name the research institution or provide a timeline for commercial availability. However, the results suggest that enterprises using large language models for customer service, code generation, or tool-use agents could reduce inference costs without sacrificing task success rates.

Implications for Enterprise Automation

Multi-turn agents are increasingly used in customer support, supply chain optimization, and internal knowledge retrieval. Deploying a smaller, faster model that approaches the performance of a much larger one, without the need for expensive hardware, could lower operational costs. The decaying-intervention curriculum offers a practical way to make such distilled agents more reliable.

