Multi-turn tool-use agents must reason, call external tools, and adapt to observations across several interaction turns. Post-training such agents is challenging: reinforcement learning (RL) often suffers from sparse rewards and weak credit assignment, while supervised fine-tuning (SFT) on expert traces provides dense process supervision but can over-constrain the model to fixed trajectories. To address this tension, researchers have proposed PACT (Privileged Trace Co-Training), which keeps rollout generation prompt-only and uses expert traces exclusively as training-time optimization signals.
The Challenge of Training Tool-Use Agents
Tool-use agents are AI systems that invoke external APIs, databases, or software tools to complete tasks. In multi-turn settings they must also maintain context across several steps, which makes training harder. According to the paper, RL matches the prompt-only setting the agent faces at inference but suffers from sparse rewards and weak credit assignment. SFT on expert traces provides dense process supervision but can over-constrain the model, pushing it to follow fixed trajectories rather than explore alternative solutions.
PACT: A New Co-Training Framework
PACT introduces two complementary signals that use expert traces to guide optimization without using them during rollout generation. First, a trace-conditioned RL surrogate evaluates prompt-only rollouts under the context of expert traces. Second, a component-aware SFT loss supervises reasoning prefixes and tool calls with annealed strength. To reduce over-reliance on the training-only trace context, PACT further incorporates a prompt-only anchoring mechanism. The researchers also provide a latent-trace view that connects the two trace-based objectives and explains how expert traces can guide optimization without being used during rollout.
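To make the interplay of these signals concrete, here is a minimal Python sketch of how an annealed, component-weighted SFT term might be combined with an RL surrogate and an anchoring term. It is an illustration under stated assumptions, not the paper's formulation: the linear schedule, the additive combination, the component tags, and every name (`annealed_weight`, `component_sft_loss`, `combined_loss`, `anchor_coef`) are hypothetical, and the RL surrogate and anchoring values are treated as precomputed scalars.

```python
from typing import Mapping, Sequence


def annealed_weight(step: int, total_steps: int,
                    start: float = 1.0, end: float = 0.0) -> float:
    """Linearly move the SFT strength from `start` to `end`.

    The linear shape is an assumption; the source only says the
    SFT supervision is applied "with annealed strength".
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return start + (end - start) * frac


def component_sft_loss(token_logprobs: Sequence[float],
                       components: Sequence[str],
                       component_weights: Mapping[str, float]) -> float:
    """Weighted mean negative log-likelihood over expert-trace tokens.

    Each token carries a tag such as "reasoning" or "tool_call"
    (hypothetical tags); tokens with an unlisted tag get weight 0.
    """
    if len(token_logprobs) != len(components):
        raise ValueError("token_logprobs and components must align")
    weighted_nll = 0.0
    weight_sum = 0.0
    for logprob, component in zip(token_logprobs, components):
        w = component_weights.get(component, 0.0)
        weighted_nll += -logprob * w
        weight_sum += w
    return weighted_nll / weight_sum if weight_sum > 0 else 0.0


def combined_loss(rl_surrogate: float, sft_loss: float, anchor_loss: float,
                  step: int, total_steps: int,
                  anchor_coef: float = 0.1) -> float:
    """Assumed additive combination of the three training signals."""
    sft_coef = annealed_weight(step, total_steps)
    return rl_surrogate + sft_coef * sft_loss + anchor_coef * anchor_loss
```

In this sketch the SFT term dominates early in training and fades as `step` approaches `total_steps`, which is one plausible reading of "annealed strength": dense supervision early on, with the RL signal taking over later so the model is not locked to fixed trajectories.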
| Training Method | Strengths | Weaknesses |
|---|---|---|
| Reinforcement Learning (RL) | Matches prompt-only inference | Sparse rewards, weak credit assignment |
| Supervised Fine-Tuning (SFT) | Dense process supervision | Over-constrains to fixed trajectories |
| PACT | Uses RL and SFT signals together while keeping rollouts prompt-only | None reported in the source |
Experimental Results
The team evaluated PACT on three benchmarks: FTRL, BFCL, and ToolHop. Across all three, PACT consistently improved over strong SFT- and RL-based baselines. The paper highlights the value of privileged trace co-training for multi-turn tool-use learning, showing that expert traces can be effectively used as optimization signals without being revealed during inference.
Implications for Enterprise Automation
While the research is primarily academic, the ability to train more robust multi-turn tool-use agents has direct relevance for enterprise technology. Such agents could automate complex workflows in supply chain management, trade documentation, and logistics, where systems must reason, call multiple APIs, and adapt to changing observations. The PACT framework targets a tension that can hold back production deployment of AI agents: balancing exploration against adherence to expert knowledge.
The paper is authored by Du, Zhenbang; Luo, Jun; Zheng, Zhiwei; Yuan, Xiangchi; Kejing; Shi, Dachuan; Jin, Qirui; He, Qijia; Zou, Shaofeng; Liang, Yingbin; and Lee, Wenke. It is available on arXiv under the identifier 2606.16215.