Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or do scale and provider diversity reshape equilibrium behaviour in competitive multi-agent settings? A new study published on arXiv by researchers Bolívar and Francisco León Zúñiga addresses this question, extending the benchmark established by Willis et al. using evolutionary game theory and the Iterated Prisoner's Dilemma (IPD).
Research Background and Methodology
The prior benchmark by Willis et al. found consistent cooperative biases in ChatGPT-4o and Claude 3.5 Sonnet. Bolívar and Zúñiga applied the identical protocol to four frontier models released in 2025-2026: Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3.1 Pro, and GPT-5.4 Mini. The experiment tested three prompting styles (Default, Prose, Self-Refine) and four population compositions (balanced and biased, each with and without noise), with 500 Moran-process iterations per condition.
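To make the evolutionary setup concrete, here is a minimal sketch of a Moran process over IPD strategies. It is not the study's code. The payoff values, the three hand-written strategies (stand-ins for the LLM-generated ones), the round count, the noise model (each action flipped with a fixed probability), and all function names are assumptions chosen for illustration.

```python
import random

# Conventional Prisoner's Dilemma payoffs (assumed; not taken from the study).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

# Hypothetical fixed strategies standing in for LLM-generated ones.
def always_cooperate(own_history, opp_history):
    return "C"

def always_defect(own_history, opp_history):
    return "D"

def tit_for_tat(own_history, opp_history):
    return opp_history[-1] if opp_history else "C"

def flip(action):
    return "D" if action == "C" else "C"

def play_ipd(strat_a, strat_b, rounds, noise, rng):
    """Play one iterated game; return each player's mean per-round payoff."""
    hist_a, hist_b = [], []
    total_a = total_b = 0
    for _ in range(rounds):
        a = strat_a(hist_a, hist_b)
        b = strat_b(hist_b, hist_a)
        # Assumed noise model: each intended action flips with probability `noise`.
        if rng.random() < noise:
            a = flip(a)
        if rng.random() < noise:
            b = flip(b)
        pay_a, pay_b = PAYOFF[(a, b)]
        total_a += pay_a
        total_b += pay_b
        hist_a.append(a)
        hist_b.append(b)
    return total_a / rounds, total_b / rounds

def moran_process(population, rounds=20, noise=0.0, max_steps=500, seed=0):
    """Birth-death updating: fitness-proportional birth, uniform random death."""
    rng = random.Random(seed)
    population = list(population)
    for _ in range(max_steps):
        if len(set(population)) == 1:
            break  # one strategy has taken over the population (fixation)
        fitness = []
        for i, strat in enumerate(population):
            scores = [play_ipd(strat, other, rounds, noise, rng)[0]
                      for j, other in enumerate(population) if j != i]
            fitness.append(sum(scores) / len(scores))
        # A tiny offset keeps the weights valid even if a fitness is zero.
        weights = [f + 1e-9 for f in fitness]
        parent = rng.choices(range(len(population)), weights=weights, k=1)[0]
        dying = rng.randrange(len(population))
        population[dying] = population[parent]
    return population

if __name__ == "__main__":
    start = [always_cooperate] * 2 + [tit_for_tat] * 2 + [always_defect] * 2
    final = moran_process(start, noise=0.05)
    print(sorted(s.__name__ for s in final))
```

In a setup like this, a run could be labelled cooperative or aggressive according to which kind of strategy dominates the final population; how the study itself classifies equilibria is not specified in this summary.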
Key Findings
Cooperative bias persists across providers (H1): Ten of twelve model-prompt combinations favoured cooperative equilibria in balanced noiseless conditions. However, cross-provider divergence is substantial (H3): Gemini 2.5 Flash reached up to 77% aggressive equilibria under biased conditions, while GPT-5.4 Mini reached 70% cooperative equilibria under Self-Refine.
Support for aggressive capability parity is partial (H2): Self-Refine raised the Inclination to Cooperate under Decent (ICD) in all models, with Gemini 3.1 Pro Refine achieving the highest ICD in the dataset (0.925). However, Default and Prose prompts showed no systematic narrowing of the gap between providers.
Evidence on noise robustness is directionally positive but not robustly confirmed (H4): Average noise sensitivity was about 6 percentage points for Claude Sonnet 4.6 versus 13 pp for Claude 3.5 Sonnet, but this cross-study gap is not statistically significant once the predecessor's unreported sampling error is propagated.
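The significance caveat can be illustrated with a short calculation. The sketch below assumes the two noise-sensitivity estimates are independent and approximately normal, so their standard errors add in quadrature. The function names are hypothetical, and the standard errors passed to `gap_z_score` are inputs the reader would have to supply, since the source does not report them.

```python
from math import sqrt
from statistics import NormalDist

def gap_z_score(gap_pp, se_new_pp, se_old_pp):
    """z-score of a difference between two independent estimates (all in pp)."""
    return gap_pp / sqrt(se_new_pp ** 2 + se_old_pp ** 2)

def max_combined_se(gap_pp, alpha=0.05):
    """Largest combined standard error at which the gap is still significant
    in a two-sided test at level `alpha` (normal approximation)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return gap_pp / z_crit

gap = 13 - 6  # pp, the cross-study difference quoted above
print(round(max_combined_se(gap), 2))  # 3.57
```

Under these assumptions, the 7 pp gap would clear the 5% threshold only if the combined standard error of the two studies were below roughly 3.57 pp, which shows how a modest sampling error on the predecessor's side can be enough to erase significance.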
Comparative Results Summary
| Model | Prompt Style | Reported Outcome | Notes |
|---|---|---|---|
| Claude Sonnet 4.6 | Default | Cooperative (balanced noiseless) | Average noise sensitivity about 6 pp |
| Gemini 2.5 Flash | Default | Up to 77% aggressive equilibria (biased conditions) | Highest aggression |
| Gemini 3.1 Pro | Self-Refine | Cooperative (ICD 0.925) | Highest ICD in dataset |
| GPT-5.4 Mini | Self-Refine | 70% cooperative equilibria | Strong cooperative bias |
The study concludes that provider identity, rather than model generation, is the strongest correlate of equilibrium outcomes; noise remains a universal challenge regardless of model size or vintage.
Implications for Enterprise AI Agent Systems
For technology leaders evaluating multi-agent AI deployments, these findings underscore that model choice can significantly affect system-level cooperation, with implications for tasks requiring coordinated action, such as automated negotiation, supply chain scheduling, or collaborative problem-solving. The persistence of cooperative bias in newer models suggests inherent behavioural tendencies that may reduce the need for explicit coordination mechanisms, but the substantial cross-provider variation warns against assuming uniform behaviour across vendors. Noise sensitivity remains unresolved, highlighting the need for robust prompting strategies in real-world environments where imperfect information is common.