Study Finds Persistent Cooperative Bias in Next-Gen LLM Agents but Significant Provider Divergence

A new study by Bolívar and Zúñiga extends previous benchmarks on cooperative behaviour in LLM agent systems, testing four frontier models from Anthropic, Google, and OpenAI. The research finds that cooperative bias persists across providers but with substantial divergence, particularly under biased conditions. Noise remains a universal challenge.

iGEN Editorial

June 16, 2026

Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or do scale and provider diversity reshape equilibrium behaviour in competitive multi-agent settings? A new study published on arXiv by researchers Bolívar and Francisco León Zúñiga addresses this question, extending the benchmark established by Willis et al. using evolutionary game theory and the Iterated Prisoner's Dilemma (IPD).

Research Background and Methodology

The prior benchmark by Willis et al. found consistent cooperative biases in ChatGPT-4o and Claude 3.5 Sonnet. Bolívar and Zúñiga applied the identical protocol to four frontier models released in 2025-2026: Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3.1 Pro, and GPT-5.4 Mini. The experiment tested three prompting styles (Default, Prose, Self-Refine) and four population compositions (balanced and biased, with and without noise), using Moran iterations (n=500 per condition).
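For readers unfamiliar with the setup, the following is a minimal sketch of a Moran process over Iterated Prisoner's Dilemma strategies. It is not the study's code: the three hand-written strategies stand in for the LLM-generated ones, and the payoff values, round count, and function names are illustrative assumptions.

```python
import random

# Conventional Prisoner's Dilemma payoffs (assumed here, not taken from the study).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def always_cooperate(opponent_moves):
    return "C"

def always_defect(opponent_moves):
    return "D"

def tit_for_tat(opponent_moves):
    # Cooperate first, then copy the opponent's last observed move.
    return opponent_moves[-1] if opponent_moves else "C"

def flip(move):
    return "D" if move == "C" else "C"

def play_ipd(strat_a, strat_b, rounds, noise, rng):
    """Play one iterated game; each move is flipped with probability `noise`."""
    seen_by_a, seen_by_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strat_a(seen_by_a)
        move_b = strat_b(seen_by_b)
        if rng.random() < noise:
            move_a = flip(move_a)
        if rng.random() < noise:
            move_b = flip(move_b)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        seen_by_a.append(move_b)
        seen_by_b.append(move_a)
    return score_a, score_b

def moran_process(population, rounds=20, noise=0.0, max_steps=10_000, seed=0):
    """Repeat birth-death steps until one strategy fixates or max_steps is hit."""
    rng = random.Random(seed)
    pop = list(population)
    for _ in range(max_steps):
        if len(set(pop)) == 1:
            break  # a single strategy has taken over the population
        fitness = [0.0] * len(pop)
        for i in range(len(pop)):
            for j in range(i + 1, len(pop)):
                s_i, s_j = play_ipd(pop[i], pop[j], rounds, noise, rng)
                fitness[i] += s_i
                fitness[j] += s_j
        # Reproduce in proportion to fitness; replace a uniformly chosen member.
        parent = rng.choices(range(len(pop)), weights=fitness, k=1)[0]
        dead = rng.randrange(len(pop))
        pop[dead] = pop[parent]
    return pop
```

Calling `moran_process([always_cooperate] * 5 + [always_defect] * 5, noise=0.05)` would correspond loosely to a balanced, noisy population; a biased composition simply starts with unequal counts. The strategy that remains at the end indicates whether that run ended in a cooperative or an aggressive equilibrium.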

Key Findings

Cooperative bias persists across providers (H1): Ten of twelve model-prompt combinations favoured cooperative equilibria in balanced noiseless conditions. However, cross-provider divergence is substantial (H3): Gemini 2.5 Flash reached up to 77% aggressive equilibria under biased conditions, while GPT-5.4 Mini reached 70% cooperative equilibria under Self-Refine.

Support for aggressive capability parity is partial (H2): Self-Refine raised the Inclination to Cooperate under Decent (ICD) in all models, with Gemini 3.1 Pro Refine achieving the highest ICD in the dataset (0.925). However, Default and Prose prompts showed no systematic narrowing of the gap between providers.

Evidence on noise robustness is directionally positive but not robustly confirmed (H4): Average noise sensitivity was about 6 percentage points for Claude Sonnet 4.6 versus 13 pp for Claude 3.5 Sonnet, but this cross-study gap is not statistically significant once the predecessor's unreported sampling error is propagated.
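To illustrate why a 7-point gap can fail a significance test, here is a hedged sketch of how sampling error propagates into the difference between two independent estimates. The 6 pp and 13 pp figures come from the article; both standard errors are hypothetical placeholders (the predecessor's is described as unreported), and the function name is an assumption for illustration.

```python
import math

def gap_significance(est_a, se_a, est_b, se_b):
    """Two-sided z-test for the difference of two independent estimates."""
    gap = est_b - est_a
    # Independent errors add in quadrature.
    se_gap = math.sqrt(se_a ** 2 + se_b ** 2)
    z = gap / se_gap
    # Two-sided p-value under a normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return gap, se_gap, z, p_value

# 6 pp and 13 pp are the reported sensitivities; the standard errors
# (3 pp and 4 pp) are HYPOTHETICAL values chosen only for illustration.
gap, se_gap, z, p_value = gap_significance(6.0, 3.0, 13.0, 4.0)
print(f"gap={gap:.1f} pp, se={se_gap:.1f} pp, z={z:.2f}, p={p_value:.3f}")
```

Under these assumed errors the gap has a z-score of 1.4, which falls short of the conventional 1.96 threshold. The study's actual uncertainty estimates may differ.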

Comparative Results Summary

Model	Prompt Style	Equilibrium Outcome (Balanced Noiseless)	Notes
Claude Sonnet 4.6	Default	Cooperative	Low noise sensitivity (6 pp)
Gemini 2.5 Flash	Default	Aggressive (up to 77% under biased)	Highest aggression
Gemini 3.1 Pro	Self-Refine	Cooperative (ICD 0.925)	Highest ICD in dataset
GPT-5.4 Mini	Self-Refine	70% Cooperative	Strong cooperative bias

The study concludes that provider identity, rather than model generation, is the strongest correlate of equilibrium outcomes; noise remains a universal challenge regardless of model size or vintage.

Implications for Enterprise AI Agent Systems

For technology leaders evaluating multi-agent AI deployments, these findings indicate that model choice can significantly affect system-level cooperation. This matters for tasks requiring coordinated action, such as automated negotiation, supply chain scheduling, or collaborative problem-solving. The persistence of cooperative bias in newer models suggests inherent behavioural tendencies that may reduce the need for explicit coordination mechanisms, but the substantial cross-provider variation warns against assuming uniform behaviour across vendors. Noise sensitivity remains unresolved, so deployments in real-world environments, where imperfect information is common, still need robust prompting strategies.
