Enterprise AI systems increasingly rely on large language models (LLMs) to make decisions that must align with organizational values, ethical guidelines, and regulatory requirements. However, ensuring that a model consistently prioritizes the right values—especially when those values conflict—remains a fundamental challenge. Traditional methods judge adherence only from output behavior, which can be fragile. A new paper on arXiv presents a technique called Constitutional Value Potentials (CVP) that reads and steers a model's internal priority margins directly from its activations.
The Problem of Value Conflicts in LLMs
When an LLM faces a choice between two competing values—for example, honesty versus privacy—what matters is not merely which value it mentions but which one it is willing to sacrifice. According to the paper, output evidence is "most fragile on value conflicts." The authors, Che, Tong, Wu, and Rui, argue that the arbitration between values can be read from activations in a structured margin readout, rather than inferred solely from final output.
How Constitutional Value Potentials Work
CVP learns a scalar potential for each value from the model's hidden state. This potential represents an internal pressure to preserve that value. Critically, the training signal comes not from the prompt but from an independent judge's verdict on which value the model's own response actually preserved. The signed difference between two potentials forms a priority margin. A constitutional clause becomes the claim that a margin stays positive, and a single monitor score flags when it does not.
The authors tested CVP on three scales of the Qwen2.5 model family. The monitor predicts conflict violations with an AUROC (Area Under the Receiver Operating Characteristic curve) up to 0.95, outperforming a strong hidden-state probe. It also generalizes to held-out synthetic conflicts across all three model scales.
Key Results and Capabilities
The readout signal appears early—from the prompt tail and the first response token. This early detection enables two important capabilities:
- Adversarial priority hack detection: The same signal reveals whether an adversarial input has actually pushed the model toward a violation, rather than only whether the prompt looks adversarial.
- Intervention tests: Steering along a value direction shifts judged trade-offs in the intended direction. This suggests that value-relevant priorities are accessible as activation-space margins, not just as output behavior.
| Metric | CVP Performance |
|---|---|
| Conflict prediction AUROC | Up to 0.95 |
| Model scales tested | Three Qwen2.5 sizes |
| Detection timing | From prompt tail and first response token |
| Generalization | Held-out synthetic conflicts |
| Intervention effectiveness | Shifts trade-offs as intended |
Implications for Enterprise AI
For CTOs and technology leaders deploying LLMs in sensitive applications—such as customer service, content moderation, or regulatory compliance—the ability to monitor and steer internal value priorities is a significant step forward. Rather than relying solely on post-hoc output checks, CVP offers a real-time monitor that can flag potential violations before they manifest in output. The method also supports proactive steering, allowing organizations to nudge model behavior toward desired value trade-offs without full retraining.
While the research is preliminary and focused on synthetic conflicts, the results suggest that internal value representations are structured and accessible. This could lead to more reliable and auditable AI systems in enterprise environments. As the paper states, "these results suggest that some constitution-relevant priorities are accessible as activation-space margins, rather than only as output behavior."