The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Mathematical Reasoning

A research paper identifies a 'Quality-Utility Paradox' in mathematical reasoning distillation: data refined by stronger models (Oracle) receives high reward scores but impairs small model performance compared to using the model's own self-generated traces. The authors propose Style-Aligned Refinement to preserve native reasoning patterns while incorporating logical corrections.

iGEN Editorial

June 16, 2026

A study newly published on arXiv reports a counterintuitive phenomenon in AI training, dubbed the Quality-Utility Paradox: training Small Language Models (SLMs) on high-reward data generated by a stronger Oracle model can degrade their mathematical reasoning, even though reward models score that data as higher quality.

The Paradox

Researchers found that when SLMs are trained on traces refined or synthesized by a stronger Oracle, they consistently underperform compared to training on traces generated by the SLM itself and selected through rejection sampling. The paradox holds across multiple model families, including Qwen2.5, LLaMA-3, and DeepSeek, according to the paper.
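As a rough illustration of the baseline described above, here is a minimal sketch of rejection sampling over a model's own traces. It assumes a trace is accepted when its final answer matches a reference answer; the article does not spell out the paper's selection criterion. The names rejection_sample, toy_extract_answer, and make_toy_generator are hypothetical, and the toy generator stands in for a real SLM.

```python
import random
from typing import Callable, List, Optional


def rejection_sample(
    problem: str,
    reference_answer: str,
    generate: Callable[[str], str],
    extract_answer: Callable[[str], Optional[str]],
    num_samples: int = 8,
) -> List[str]:
    """Sample traces from the learner and keep those with the correct final answer.

    Assumption: correctness is an exact match against a reference answer.
    """
    kept = []
    for _ in range(num_samples):
        trace = generate(problem)
        if extract_answer(trace) == reference_answer:
            kept.append(trace)
    return kept


def toy_extract_answer(trace: str) -> Optional[str]:
    """Hypothetical parser: read whatever follows the last 'Answer:' marker."""
    marker = "Answer:"
    if marker not in trace:
        return None
    return trace.rsplit(marker, 1)[1].strip()


def make_toy_generator(seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for an SLM: picks one of a few canned traces."""
    rng = random.Random(seed)
    candidates = [
        "2 + 2 = 4. Answer: 4",
        "2 + 2 = 5. Answer: 5",
        "I am not sure.",
    ]

    def generate(problem: str) -> str:
        return rng.choice(candidates)

    return generate


if __name__ == "__main__":
    kept = rejection_sample(
        "What is 2 + 2?", "4", make_toy_generator(0), toy_extract_answer, num_samples=20
    )
    print(f"kept {len(kept)} of 20 sampled traces")
```

Because every kept trace was written by the learner itself, the resulting training set stays inside the model's own reasoning distribution, which is the property the study credits for the better downstream results.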

Why Oracle Refinement Backfires

The analysis shows that Oracle refinement couples logical repair with a distributional drift away from the SLM's native reasoning distribution. This drift increases the learner's adaptation cost, which can outweigh the benefit of improved reasoning logic. In other words, while the refined data appears more correct in isolation, it is less compatible with the small model's inherent reasoning style, so the model must unlearn its own patterns, which is a costly process.
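One way to make the idea of drift concrete is to score each trace by how surprising it is to the learner. The sketch below uses mean per-token negative log-likelihood under the SLM as a proxy for adaptation cost. This proxy is an assumption for illustration; the article does not say how the paper measures adaptation cost. The function names are hypothetical, and the log-probabilities in the example are invented toy values, not results from the study.

```python
import math
from typing import Sequence


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Mean negative log-likelihood of a trace, given the learner's per-token log-probs."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return -sum(token_logprobs) / len(token_logprobs)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(mean_nll(token_logprobs))


if __name__ == "__main__":
    # Invented toy values: log-probs the SLM might assign to each token of a trace.
    self_generated = [-0.2, -0.1, -0.3]
    oracle_refined = [-1.2, -0.9, -1.5]

    for name, logprobs in [("self-generated", self_generated), ("oracle-refined", oracle_refined)]:
        print(f"{name}: mean NLL = {mean_nll(logprobs):.3f}, perplexity = {perplexity(logprobs):.3f}")
```

Under this proxy, a trace the learner finds improbable has a higher score, which corresponds to a larger gap between the training data and the model's native distribution.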

Style-Aligned Refinement

To test this mechanism, the authors introduced Style-Aligned Refinement, a technique that preserves the native trajectory of the SLM while retaining logical repair from the Oracle. This intervention lowers adaptation cost and restores downstream utility, providing a concrete solution to the paradox.
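The article does not describe how Style-Aligned Refinement is implemented, so the following is only one possible reading, not the authors' method: keep the SLM's own steps verbatim and substitute Oracle text only at the steps flagged as logically wrong. The function style_aligned_refine, its step-indexed oracle_fixes input, and the example trace are all hypothetical.

```python
from typing import Dict, List, Tuple


def style_aligned_refine(
    slm_steps: List[str], oracle_fixes: Dict[int, str]
) -> Tuple[List[str], float]:
    """Apply Oracle corrections to flagged steps and leave every other step untouched.

    Assumption: the Oracle returns corrections keyed by step index.
    Returns the refined steps and the fraction of original steps preserved verbatim.
    """
    for index in oracle_fixes:
        if not 0 <= index < len(slm_steps):
            raise IndexError(f"no step at index {index}")

    refined = [oracle_fixes.get(i, step) for i, step in enumerate(slm_steps)]

    if not slm_steps:
        return refined, 1.0
    unchanged = sum(1 for before, after in zip(slm_steps, refined) if before == after)
    return refined, unchanged / len(slm_steps)


if __name__ == "__main__":
    # Hypothetical SLM trace with an arithmetic slip in the second step.
    steps = [
        "Let x be the number of apples.",
        "Then 3x = 12, so x = 5.",
        "So there are 5 apples.",
    ]
    fixes = {1: "Then 3x = 12, so x = 4.", 2: "So there are 4 apples."}

    refined, preserved = style_aligned_refine(steps, fixes)
    print("\n".join(refined))
    print(f"fraction of steps preserved: {preserved:.2f}")
```

The preserved fraction gives a crude handle on how much of the learner's original trajectory survives the repair, in contrast to a full Oracle rewrite that would replace every step.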

Key Findings at a Glance

Approach	Data Source	Reasoning Performance (Relative)
Oracle refinement	Stronger model's traces	Lower (due to distributional drift)
Rejection sampling	SLM's own traces	Higher (better compatibility)
Style-Aligned Refinement	Oracle improvement on SLM's traces	Restored (combines logic + style)

Implications for Enterprise AI

For CTOs and technology leaders deploying AI in specialized domains such as supply chain, trade documentation, or customs compliance, the study highlights a critical consideration: the training data that scores highest on automated metrics may not yield the best-performing small models. Instead, learner-data compatibility must be optimized jointly with data quality. The findings suggest that rejection sampling of the model's own outputs, or style-aligned refinement, can reduce training costs and improve real-world reasoning accuracy, in line with enterprise needs for efficient, domain-specific model fine-tuning.

The datasets and code are available at the URL provided in the paper.
