The Challenge of LLM Uncertainty in Enterprise Applications
As large language models (LLMs) are increasingly deployed in supply chain forecasting, customs classification, and logistics optimization, executives need to be able to trust the confidence scores these models attach to their outputs. According to a new study on arXiv, the scale on which LLMs report confidence (typically 0–100) is not neutral; it influences the quality of uncertainty estimation. The research, titled "Rescaling Confidence: What Scale Design Reveals About LLM Metacognition," was conducted by Yuyang Dai and Yuxia Wang.
Key Findings on Scale Design
The study examined verbalized confidence across six LLMs and three datasets. It found that more than 78% of responses concentrated on just three round-number values, indicating heavy discretization. The authors systematically manipulated confidence scales along three dimensions: granularity, boundary placement, and range regularity. They evaluated metacognitive sensitivity using a metric called meta-d'.
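The discretization described above can be checked on any set of logged confidence scores. The following is a minimal sketch, not the authors' code: the function name `top_k_share` and the sample values are hypothetical and purely illustrative.

```python
from collections import Counter


def top_k_share(confidences, k=3):
    """Return the fraction of responses that fall on the k most frequent values."""
    if not confidences:
        raise ValueError("confidences must be non-empty")
    counts = Counter(confidences)
    # most_common(k) returns up to k (value, count) pairs, highest count first
    top = counts.most_common(k)
    return sum(n for _, n in top) / len(confidences)


# Hypothetical verbalized confidences on a 0-100 scale (illustrative, not study data)
sample = [90, 95, 90, 80, 90, 95, 85, 80, 90, 70]
share = top_k_share(sample)
print(f"Share of responses on the top 3 values: {share:.0%}")
```

A high share on a handful of values suggests the model is using only a small part of the scale it was offered.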
The most striking result: a 0–20 scale consistently improved metacognitive efficiency over the standard 0–100 format. In contrast, boundary compression (e.g., compressing the ends of the scale) degraded performance, and round-number preferences persisted even when irregular ranges were used.
Implications for Supply Chain AI
For enterprise technology leaders evaluating LLM-based tools for trade documentation, customs risk scoring, or demand forecasting, these findings are directly relevant. If a supply chain AI system reports 85% confidence in a shipment delay prediction, that number may be an artefact of the scale rather than a genuine measure of certainty. The study suggests that adopting a 0–20 scale could yield more reliable uncertainty estimates.
| Dimension | Description | Effect on Metacognition |
|---|---|---|
| Granularity | Number of distinct values available | Higher granularity not necessarily better; 0–20 outperformed 0–100 |
| Boundary placement | Where the scale ends are set | Compression degraded performance |
| Range regularity | Even vs. uneven intervals | Round-number preferences persisted |
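One way a team might experiment with scale design is to make the scale a parameter of the prompt and map the reported value back to a common 0–1 range for comparison. The sketch below assumes a hypothetical `ConfidenceScale` helper; the prompt wording and endpoint labels are assumptions, not the prompts used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfidenceScale:
    """Hypothetical helper describing an integer confidence scale."""
    low: int
    high: int

    def __post_init__(self):
        if self.high <= self.low:
            raise ValueError("high must be greater than low")

    def prompt_suffix(self) -> str:
        # Assumed wording; adapt to the prompt format of the system under test
        return (
            f"State your confidence as an integer from {self.low} to {self.high}, "
            f"where {self.low} means certainly wrong and {self.high} means certainly correct."
        )

    def normalize(self, reported: float) -> float:
        """Map a reported value onto [0, 1] so different scales can be compared."""
        if not self.low <= reported <= self.high:
            raise ValueError(f"{reported} is outside the scale {self.low}-{self.high}")
        return (reported - self.low) / (self.high - self.low)


scale_20 = ConfidenceScale(0, 20)
scale_100 = ConfidenceScale(0, 100)
print(scale_20.prompt_suffix())
print(scale_20.normalize(17), scale_100.normalize(85))
```

Normalizing only makes scores comparable across scales; it does not by itself show which scale yields better uncertainty estimates. That still requires evaluating the scores against outcomes.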
What This Means for Technology Procurement
When sourcing LLM-powered platforms for logistics, CTOs should examine how confidence scores are calibrated. According to the researchers, scale design should be treated as a first-class experimental variable in LLM evaluation. This echoes broader concerns in the field about model calibration and trustworthiness.
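For a procurement team that wants a quick calibration check on a vendor's scores, expected calibration error (ECE) is a common starting point. Note that this is a different, simpler measure than the meta-d' metric used in the study. The sketch below assumes confidences have already been normalized to the 0–1 range and that each prediction has a known correct/incorrect outcome; the function name and bin count are illustrative choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy per bin."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidences must lie in [0, 1]")
        # Equal-width bins; a confidence of exactly 1.0 goes into the last bin
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, bool(is_correct)))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical predictions: normalized confidence and whether each was correct
confs = [0.9, 0.9, 0.9, 0.9, 0.5, 0.5]
outcomes = [True, True, True, False, True, False]
print(f"ECE: {expected_calibration_error(confs, outcomes):.3f}")
```

A lower value means stated confidence tracks observed accuracy more closely. Running the same check on scores elicited with different scales is one way to compare scale designs on an organization's own data.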
Next Steps for Practitioners
Enterprises integrating LLMs into supply chain workflows should request transparency from vendors on the confidence scale used and consider customizing scales for their use cases. The arXiv study provides a starting point for evidence-based scale design, but further research is needed to generalize these results to domain-specific models.
The findings are published on arxiv.org under the Computer Science > Artificial Intelligence category. According to the paper, all code and data are available for reproduction.