Artificial Intelligence #ai #machine learning
The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Mathematical Reasoning
A research paper identifies a 'Quality-Utility Paradox' in mathematical reasoning distillation: training data refined by a stronger model (the Oracle) receives high reward scores, yet a small model trained on it performs worse than one trained on its own self-generated traces. The authors propose Style-Aligned Refinement, which preserves the small model's native reasoning patterns while incorporating logical corrections.
Jun 16, 2026 1 source