A newly published arXiv study describes a counterintuitive effect in AI training, dubbed the Quality-Utility Paradox. Training Small Language Models (SLMs) on high-reward data generated by a stronger Oracle model can degrade their mathematical reasoning, even though reward models score that data as higher quality.
The Paradox
The researchers found that SLMs trained on traces refined or synthesized by a stronger Oracle consistently underperform SLMs trained on their own traces selected through rejection sampling. According to the paper, the paradox holds across multiple model families, including Qwen2.5, LLaMA-3, and DeepSeek.
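For readers unfamiliar with the baseline, rejection sampling on the learner's own outputs can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: the `Answer:` trace format, the `make_toy_slm` stand-in generator, and the `k` and `max_keep` parameters are all assumptions made for the example.

```python
import random
from typing import Callable, Dict, List, Tuple


def extract_answer(trace: str) -> str:
    """Return the text after the last 'Answer:' marker (assumed trace format)."""
    return trace.rsplit("Answer:", 1)[-1].strip()


def rejection_sample(
    problems: List[Tuple[str, str]],
    generate: Callable[[str], str],
    k: int = 4,
    max_keep: int = 1,
) -> List[Dict[str, str]]:
    """Sample k traces per problem from the learner and keep the correct ones."""
    kept: List[Dict[str, str]] = []
    for question, reference in problems:
        accepted: List[str] = []
        for _ in range(k):
            trace = generate(question)
            # Accept only traces whose final answer matches the reference,
            # and skip exact duplicates.
            if extract_answer(trace) == reference and trace not in accepted:
                accepted.append(trace)
        kept.extend({"question": question, "trace": t} for t in accepted[:max_keep])
    return kept


def make_toy_slm(seed: int = 0) -> Callable[[str], str]:
    """Hypothetical stand-in for the SLM: adds two integers, sometimes off by one."""
    rng = random.Random(seed)

    def generate(question: str) -> str:
        a, b = (int(x) for x in question.split("+"))
        guess = a + b + rng.choice([0, 0, 1])
        return f"{a} plus {b} gives {guess}. Answer: {guess}"

    return generate


if __name__ == "__main__":
    data = rejection_sample([("2+3", "5"), ("10+7", "17")], make_toy_slm(), k=4)
    for row in data:
        print(row)
```

Every kept trace is written in the learner's own wording, which is the property the study credits for the better downstream results. In practice `generate` would wrap a call to the SLM being trained.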
Why Oracle Refinement Backfires
The analysis shows that Oracle refinement couples logical repair with a distributional drift away from the SLM's native reasoning distribution. This drift increases the learner's adaptation cost, which can outweigh the benefit of the improved logic. In other words, the refined data looks more correct in isolation but is less compatible with the small model's own reasoning style, so the model must first unlearn its existing patterns, which is costly.
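One way to build intuition for adaptation cost is to measure how surprising a trace is to the learner, for example as the mean per-token negative log-likelihood. The sketch below is only a proxy chosen for illustration; the article does not state how the paper defines the quantity. The `UnigramLearner` class is a hypothetical toy stand-in for the SLM, and the example traces are invented.

```python
import math
from collections import Counter
from typing import List


class UnigramLearner:
    """Toy stand-in for the SLM: an add-one-smoothed unigram model."""

    def __init__(self, native_traces: List[str]) -> None:
        tokens = [tok for trace in native_traces for tok in trace.split()]
        self.counts = Counter(tokens)
        self.total = len(tokens)
        # One extra vocabulary slot is shared by all unseen tokens.
        self.vocab = len(self.counts) + 1

    def log_prob(self, token: str) -> float:
        return math.log((self.counts[token] + 1) / (self.total + self.vocab))


def adaptation_cost(learner: UnigramLearner, trace: str) -> float:
    """Mean per-token negative log-likelihood of a trace under the learner."""
    tokens = trace.split()
    if not tokens:
        raise ValueError("trace must contain at least one token")
    return -sum(learner.log_prob(tok) for tok in tokens) / len(tokens)


if __name__ == "__main__":
    learner = UnigramLearner([
        "first add 2 and 3 to get 5 so the answer is 5",
        "first add 4 and 1 to get 5 so the answer is 5",
    ])
    native_style = "first add 2 and 1 to get 3 so the answer is 3"
    oracle_style = "Observe that by commutativity the sum evaluates to 3 hence 3"
    print("native-style cost:", adaptation_cost(learner, native_style))
    print("oracle-style cost:", adaptation_cost(learner, oracle_style))
```

Both example traces reach the same correct answer, but the Oracle-style wording sits far from what the toy learner has seen, so its cost is higher. With a real SLM, the same quantity would come from the model's token log-probabilities.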
Style-Aligned Refinement
To test this mechanism, the authors introduced Style-Aligned Refinement, a technique that preserves the native trajectory of the SLM while retaining logical repair from the Oracle. This intervention lowers adaptation cost and restores downstream utility, providing a concrete solution to the paradox.
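The idea of keeping the learner's wording while borrowing only the Oracle's corrections can be illustrated with a toy example. This is not the authors' implementation, which the article does not describe: `toy_oracle_repair` is a hypothetical Oracle that only checks simple additions, and `style_aligned_refine` is an assumed name for the step-wise merge.

```python
import re
from typing import Callable, List, Optional

# Matches simple additions such as "12 + 9 = 20" inside a reasoning step.
ADDITION = re.compile(r"(\d+) \+ (\d+) = (\d+)")


def toy_oracle_repair(step: str) -> Optional[str]:
    """Hypothetical Oracle: return a corrected step, or None if no fix is needed."""
    match = ADDITION.search(step)
    if match is None:
        return None
    a, b, claimed = (int(g) for g in match.groups())
    if a + b == claimed:
        return None
    # Replace only the wrong number and leave the rest of the wording intact.
    return step[: match.start(3)] + str(a + b) + step[match.end(3):]


def style_aligned_refine(
    native_steps: List[str],
    repair: Callable[[str], Optional[str]],
) -> List[str]:
    """Keep the learner's own steps and substitute only the repaired ones."""
    refined = []
    for step in native_steps:
        fixed = repair(step)
        refined.append(step if fixed is None else fixed)
    return refined


if __name__ == "__main__":
    slm_trace = [
        "I start with 12 apples",
        "Then I get 9 more, so 12 + 9 = 20",
    ]
    for line in style_aligned_refine(slm_trace, toy_oracle_repair):
        print(line)
```

The contrast is with a full Oracle rewrite, which would replace both steps with the stronger model's phrasing. A realistic version would also need to propagate a fix to any later steps that depend on the corrected value, which this sketch does not attempt.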
Key Findings at a Glance
| Approach | Data Source | Reasoning Performance (Relative) |
|---|---|---|
| Oracle refinement | Stronger model's traces | Lower (due to distributional drift) |
| Rejection sampling | SLM's own traces | Higher (better compatibility) |
| Style-Aligned Refinement | Oracle improvement on SLM's traces | Restored (combines logic + style) |
Implications for Enterprise AI
For CTOs and technology leaders deploying AI in specialized domains such as supply chain, trade documentation, or customs compliance, the study highlights a critical consideration: the training data that scores highest on automated metrics may not yield the best-performing small models. Data quality and learner-data compatibility need to be optimized jointly. The findings suggest that rejection sampling of the model's own outputs, or style-aligned refinement, may reduce training costs and improve real-world reasoning accuracy, which fits enterprise needs for efficient, domain-specific fine-tuning.
The datasets and code are available at the URL provided in the paper.