Large language models (LLMs) are increasingly used to allocate limited computation across verification, test-time scaling, and other selective-compute decisions. These policies rely on a global signal comparability assumption: equal scores should carry comparable decision value across inputs. According to a recent paper on arXiv by Yang Jinlong, this assumption fails in practice because the signals are heteroskedastic: uncertainty quality varies across cost strata, and some regions exhibit near-random discriminability despite concentrating many errors.
The paper, titled "Heteroskedastic Signals in Budgeted LLM Verification: Structural Heterogeneity Limits Optimization Gains," uses budgeted verification as a controlled diagnostic setting. The authors find that global online adaptation yields inconsistent gains over static thresholding, while structural heterogeneity limits optimization improvements.
The Heteroskedastic Signal Problem
Allocation policies built on the global signal comparability assumption treat an uncertainty score of, say, 0.8 as meaning the same thing whether the input is simple or complex. The paper shows that this is not the case. In certain cost strata (inputs requiring more computational resources), signal quality degrades disproportionately. The authors characterize the resulting distortion of global allocation and show that its upper bound scales with cross-stratum signal-quality dispersion.
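To make the idea of stratum-dependent signal quality concrete, the following is a minimal sketch that measures discriminability separately in each cost stratum. It is not the paper's code: the use of AUROC as the quality measure, the max-minus-min definition of dispersion, the function names, and the toy data are all assumptions chosen for illustration.

```python
from collections import defaultdict


def auroc(scores, labels):
    """Probability that a random error outscores a random non-error (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return float("nan")  # undefined when a stratum has only one class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_by_stratum(scores, labels, strata):
    """Group items by stratum and compute AUROC within each group."""
    groups = defaultdict(lambda: ([], []))
    for s, y, g in zip(scores, labels, strata):
        groups[g][0].append(s)
        groups[g][1].append(y)
    return {g: auroc(sc, lb) for g, (sc, lb) in groups.items()}


# Toy data (hypothetical): label 1 marks an erroneous output.
scores = [0.9, 0.8, 0.2, 0.1, 0.9, 0.2, 0.8, 0.1]
labels = [1, 1, 0, 0, 1, 0, 0, 1]
strata = ["low"] * 4 + ["high"] * 4

quality = auroc_by_stratum(scores, labels, strata)
# Assumed dispersion measure: spread between best and worst stratum.
dispersion = max(quality.values()) - min(quality.values())
print(quality, dispersion)
```

In this toy example the score separates errors perfectly in the "low" stratum and is no better than chance in the "high" stratum, so a score of 0.8 does not carry the same decision value in both.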
The Intervention Hierarchy
To separate weak signals, optimization instability, and structural heterogeneity, the authors propose a controlled intervention hierarchy:
- Threshold: A simple static thresholding policy.
- MP-Adapt: Online adaptation without stratification.
- MP-Strat: Stratified adaptation that partially recovers performance.
- CST (Cost-Stratified Thresholding): A deliberately simple intervention that adjusts thresholds per cost stratum.
Across benchmarks, CST stood out. The paper reports that CST improves hit rate by up to 17 percentage points in strongly heterogeneous settings without gradient updates. This result identifies structural heterogeneity, rather than optimizer weakness alone, as the primary bottleneck.
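The contrast between a single global threshold and per-stratum thresholds can be sketched as follows. This is an illustrative toy, not the paper's CST implementation: the threshold values are hand-picked, the strata names and data are hypothetical, and hit rate is assumed here to mean the fraction of verified items that are actual errors.

```python
def select(scores, strata, thresholds):
    """Return indices of items whose score meets their own stratum's threshold."""
    return [i for i, (s, g) in enumerate(zip(scores, strata)) if s >= thresholds[g]]


def hit_rate(selected, is_error):
    """Assumed definition: share of verified items that turn out to be errors."""
    if not selected:
        return 0.0
    return sum(is_error[i] for i in selected) / len(selected)


# Toy data (hypothetical): the score is informative in "cheap" but noisy in "costly".
scores = [0.9, 0.7, 0.6, 0.3, 0.95, 0.85, 0.8, 0.75]
is_error = [1, 1, 1, 0, 0, 1, 0, 0]
strata = ["cheap"] * 4 + ["costly"] * 4

# Global policy: the same threshold everywhere.
global_sel = select(scores, strata, {"cheap": 0.8, "costly": 0.8})
# Stratified policy: hand-picked thresholds that spend the same budget of four
# verifications mostly where the signal is reliable.
cst_sel = select(scores, strata, {"cheap": 0.6, "costly": 0.9})

global_hit = hit_rate(global_sel, is_error)
cst_hit = hit_rate(cst_sel, is_error)
print(len(global_sel), global_hit)
print(len(cst_sel), cst_hit)
```

Both policies verify four items, but the stratified one catches more errors because it lowers the bar where the score is trustworthy and raises it where the score is noisy. No gradient updates are involved; in practice the per-stratum thresholds would need to be chosen from held-out data rather than by hand.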
Benchmarks and Models
The experiments were conducted on two popular benchmarks, MBPP for coding and MATH for mathematics. The table below summarizes the models tested:
| Model | Benchmark | Key Finding |
|---|---|---|
| Qwen3-8B | MBPP, MATH | CST improved hit rate significantly in heterogeneous settings |
| LLaMA3-8B | MBPP, MATH | Similar pattern of heteroskedastic signals |
| GPT-4o-mini | MBPP, MATH | Global adaptation inconsistent; CST partially recovers performance |
According to the paper, the results show that misaligned feedback structure cannot always be repaired by stronger optimization. Enterprise teams deploying LLMs for verification tasks, such as validating outputs in supply chain document processing or code generation, should account for signal heterogeneity across cost strata. A one-size-fits-all confidence threshold may leave valuable accuracy on the table.
Implications for Enterprise AI
For CTOs and technology procurement leaders, this research underscores a critical design principle: when LLMs are used for budgeted verification, the verification policy must be adapted to the structural heterogeneity of signals. Simple stratified thresholding can outperform complex global optimization methods, reducing the need for expensive retraining or fine-tuning. The paper suggests that ignoring heteroskedasticity can lead to suboptimal allocation of compute resources, especially in high-stakes settings like trade documentation verification or customs classification where error costs are high.
The paper is available on arXiv under the identifier 2606.15841 and is licensed under a Creative Commons Attribution 4.0 International License.