Cloud-based LLM fine-tuning platforms increasingly host reinforcement learning from human feedback (RLHF) workloads, in which a learned reward model serves as a proxy for human quality judgments. As Gao et al. (2023) demonstrated, this proxy can diverge from real-world feedback under sustained optimization, a phenomenon called reward overoptimization. Existing platform schedulers fail to detect this divergence: non-clairvoyant schedulers optimize job completion time without any quality signal; SLAQ-style quality-aware schedulers rely on training loss, which drops monotonically even during reward hacking; and classical per-job early stopping requires human monitoring and does not free shared GPUs. A preprint on arXiv proposes EvalStop, a composable scheduling primitive that terminates a job after k consecutive eval-score declines, releases its GPUs, preserves the best checkpoint, and otherwise delegates to any base scheduler.
The Problem of Reward Overoptimization
Reward overoptimization occurs when the learned reward model, used as a proxy for human quality, diverges from downstream evaluation metrics under sustained optimization pressure. This leads to wasted compute and degraded model quality. The paper frames scheduler-level early stopping as a detection problem and evaluates EvalStop in a discrete-event simulator where RLHF workloads mix reward-hacking and structurally healthy runs, with ground-truth labels hidden from schedulers.
How EvalStop Works
EvalStop monitors eval-score trends and triggers job termination after k consecutive declines. It is designed as a thin wrapper that can be composed with any base scheduler (e.g., SRTF-Est). The primitive releases GPUs to other jobs and saves the best checkpoint before termination. The paper tests EvalStop on a simulated mix of RLHF and non-RLHF workloads.
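The stopping rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the names `EvalStopMonitor` and `run_with_evalstop` are hypothetical, and the sketch assumes that a "decline" means an eval score strictly lower than the previous evaluation and that any non-decline resets the counter. GPU release and delegation to a base scheduler are left out.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class EvalStopMonitor:
    """Hypothetical per-job monitor: flags a job after k consecutive eval-score declines."""
    k: int                                # consecutive declines that trigger termination
    best_score: float = float("-inf")     # highest eval score seen so far
    best_step: Optional[int] = None       # evaluation index of the best checkpoint
    declines: int = 0                     # current run of consecutive declines
    last_score: Optional[float] = None    # score from the previous evaluation

    def observe(self, step: int, score: float) -> bool:
        """Record one eval score; return True if the job should be terminated."""
        # Track the best checkpoint so it can be preserved on termination.
        if score > self.best_score:
            self.best_score = score
            self.best_step = step
        # Assumption: a decline is a score strictly below the previous one;
        # anything else resets the streak.
        if self.last_score is not None and score < self.last_score:
            self.declines += 1
        else:
            self.declines = 0
        self.last_score = score
        return self.declines >= self.k


def run_with_evalstop(scores: Sequence[float], k: int) -> Tuple[Optional[int], Optional[int]]:
    """Replay a sequence of eval scores; return (stop_step or None, best_step)."""
    monitor = EvalStopMonitor(k=k)
    for step, score in enumerate(scores):
        if monitor.observe(step, score):
            # A real scheduler would release the job's GPUs here and hand
            # control back to the base scheduler.
            return step, monitor.best_step
    return None, monitor.best_step
```

With k = 3 and the made-up score sequence 0.50, 0.55, 0.60, 0.58, 0.57, 0.55, 0.54, the sketch stops at the sixth evaluation (index 5) and reports index 2 as the best checkpoint. A run whose score dips only one evaluation at a time is never stopped.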
Performance Benchmarks
On RLHF-heavy workloads (80% RLHF, 64 GPUs), EvalStop achieved precision of 98% and recall of 99%, with a false positive rate of 1.5%. Compared to the SRTF-Est baseline, EvalStop improved job completion time (JCT) by 9% and cut wasted compute by 22% (p<0.05). Trivial fixed-progress and loss-plateau competitors either incurred a 65% false positive rate on healthy RLHF runs or missed over half of true hacking cases. The gains compose across every base scheduler tested (9–25% JCT improvement). Detection quality remained stable under eval noise (precision at least 91% at noise standard deviation ≤0.05) and across hacking base rates (precision at least 89% for 20–80% hacking fractions).
| Metric | EvalStop | Competitor (Fixed-Progress) | Competitor (Loss-Plateau) |
|---|---|---|---|
| Precision | 98% | Not reported | Not reported |
| Recall | 99% | Not reported | Not reported |
| False Positive Rate | 1.5% | 65% | Not reported |
| JCT Improvement vs SRTF-Est | 9% | Not reported | Not reported |
| Wasted Compute Reduction vs SRTF-Est | 22% | Not reported | Not reported |
| Hacking Detection Rate | 99% (same as recall) | Not reported | <50% |
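For readers less familiar with the detection metrics in the table, the sketch below shows how precision, recall, and false positive rate follow from a confusion matrix in which a "positive" is a reward-hacking job flagged for termination. The function name `detection_metrics` and the counts in the usage lines are illustrative assumptions, not figures from the paper.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute detection metrics from confusion-matrix counts.

    tp: hacking jobs correctly terminated
    fp: healthy jobs wrongly terminated
    tn: healthy jobs left running
    fn: hacking jobs that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # share of terminations that were justified
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # share of hacking jobs caught
    fpr = fp / (fp + tn) if (fp + tn) else 0.0         # share of healthy jobs wrongly stopped
    return {"precision": precision, "recall": recall, "false_positive_rate": fpr}


# Made-up counts purely for illustration (not from the paper).
example = detection_metrics(tp=90, fp=10, tn=190, fn=10)
print(example)
```

Under this framing, a detector that misses more than half of the true hacking cases has recall below 50%, and a 65% false positive rate means that roughly two out of three healthy runs are stopped unnecessarily.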
Implications for Enterprise AI Platforms
For CTOs and technology procurement leaders evaluating cloud LLM fine-tuning platforms, EvalStop offers a lightweight, composable mechanism to reduce wasted GPU hours and improve model quality without altering existing schedulers. The paper's simulation results suggest that detection precision stays at or above 89% across the tested range of reward-hacking fractions (20–80%) and at or above 91% under eval noise with a standard deviation of up to 0.05. This could translate to more efficient use of expensive compute resources in multi-tenant AI platforms, particularly those serving RLHF workloads for enterprise applications such as supply chain optimization or trade finance document analysis.