Large language model (LLM) agents are increasingly deployed to take autonomous actions such as screening job applicants, recommending credit limits, and triaging patients. Yet standard fairness benchmarks for LLMs grade only the models' answers to static questions, not the actions they take when given agency. This gap leaves enterprises exposed to real-world discrimination risks that answer-based tests cannot detect.
To close that gap, a group of researchers—Morla, Triveni, Bellibaltu, Rohith Reddy, Manpreet, Kapoor, and Manmeet Singh—has released AgentFairBench, a cheap, reproducible, multi-domain benchmark that measures demographic disparity in the actions of LLM agents. The paper is available on arXiv.
How AgentFairBench Works
AgentFairBench is grounded in a companion framework called the Bias Conduction Framework (BCF). It spans three regulator-anchored domains: hiring, lending, and medical triage. The benchmark uses synthetic, demographic-neutral profiles evaluated in counterfactual matched sets that vary only a name-coded race × gender signal—a methodology in the tradition of the classic Bertrand-Mullainathan correspondence studies.
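A counterfactual matched set can be sketched as follows. This is a minimal illustration, not the benchmark's actual data generator: the `Profile` fields, the placeholder group labels, and the 3 × 2 race × gender split are all assumptions (six groups is consistent with the six-group spread reported in the pilot, but the article does not give the exact breakdown). Real name lists are deliberately replaced by placeholder tokens.

```python
from dataclasses import dataclass, replace
from itertools import product

# Placeholder labels (assumption): the article does not list the actual groups.
RACES = ("race_1", "race_2", "race_3")
GENDERS = ("female", "male")


@dataclass(frozen=True)
class Profile:
    """A synthetic applicant/patient record. Field names are hypothetical."""
    profile_id: int
    domain: str           # e.g. "hiring", "lending", "triage"
    qualifications: str   # identical across all variants in a matched set
    name: str = ""        # the only field that carries the demographic signal
    group: str = ""       # bookkeeping label, never shown to the agent


def placeholder_name(race: str, gender: str) -> str:
    # Stand-in for a lookup into a validated name list (not reproduced here).
    return f"<name:{race}:{gender}>"


def matched_set(base: Profile) -> list[Profile]:
    """Return one variant per race x gender cell; only `name` and `group` vary."""
    return [
        replace(base, name=placeholder_name(r, g), group=f"{r}/{g}")
        for r, g in product(RACES, GENDERS)
    ]


base = Profile(0, "hiring", "5 years experience, BSc, two strong references")
variants = matched_set(base)
for v in variants:
    print(v.group, "|", v.name, "|", v.qualifications)
```

Because every variant shares the same `qualifications` text, any difference in the agent's action across a matched set can be attributed to the name signal or to sampling noise, which is why a noise floor matters later in the analysis.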
Agents are tested under four levels of increasing agency:
- Direct (single prompt)
- Chain-of-thought (step-by-step reasoning)
- Multi-agent deliberation (multiple LLMs discuss the decision)
- Tool-augmented (agent can call external APIs or databases)
The harness, written using only NumPy, computes several metrics: counterfactual flip rate, mean absolute score difference (MASD), action-rate disparity, and tool-invocation disparity. It also provides bootstrap confidence intervals, paired statistical tests, and false-discovery-rate control. According to the paper, a full evaluation costs single-digit dollars per model.
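The sketch below shows one plausible way to compute three of these metrics with NumPy. The article names the metrics but does not give their formulas, so the definitions here are assumptions: flip rate as the share of matched sets whose binary decision is not unanimous, MASD as the mean absolute score gap over all group pairs, and action-rate disparity as the largest minus the smallest per-group action rate. The toy scores and the threshold of 5 are invented for illustration.

```python
import numpy as np


def flip_rate(decisions: np.ndarray) -> float:
    """Share of matched sets (rows) whose decision differs across groups (columns)."""
    return float(np.mean(decisions.min(axis=1) != decisions.max(axis=1)))


def masd(scores: np.ndarray) -> float:
    """Mean absolute score difference over all unordered group pairs and all sets."""
    i, j = np.triu_indices(scores.shape[1], k=1)
    return float(np.mean(np.abs(scores[:, i] - scores[:, j])))


def action_rate_disparity(decisions: np.ndarray) -> float:
    """Gap between the highest and lowest per-group action rate."""
    rates = decisions.mean(axis=0)
    return float(rates.max() - rates.min())


def bootstrap_ci(stat, data, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling whole matched sets (rows)."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    draws = [stat(data[rng.integers(0, n, size=n)]) for _ in range(n_boot)]
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Toy data (invented): 4 matched sets x 6 groups, scores on a 1-10 scale.
scores = np.array([
    [7, 7, 7, 7, 7, 7],
    [4, 4, 5, 4, 4, 4],
    [6, 6, 6, 6, 6, 6],
    [5, 5, 5, 5, 5, 3],
], dtype=float)
decisions = scores >= 5  # hypothetical action threshold

print("flip rate:", flip_rate(decisions))
print("MASD:", masd(scores))
print("action-rate disparity:", action_rate_disparity(decisions))
print("MASD 95% CI:", bootstrap_ci(masd, scores))
```

Resampling whole rows keeps each matched set intact, so the bootstrap respects the paired design. Tool-invocation disparity could reuse `action_rate_disparity` on a boolean matrix that records whether the agent called a tool, again under the assumed definition.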
Key Findings from the Pilot Study
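The researchers conducted a pilot study involving 864 decisions plus a test-retest replication. A critical methodological finding emerged: comparing a six-group score spread against a two-run noise difference can overstate disparity by approximately 2.4× solely because of statistic arity. A spread taken over six groups is expected to be larger than a difference between two runs even when both reflect nothing but noise, so the two numbers are not directly comparable.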
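The arity effect is easy to reproduce in a small simulation. The sketch below assumes independent, identically distributed Gaussian noise, which is a simplification and not the paper's setup; under that assumption the expected six-value range is a little over twice the expected two-value absolute difference. The paper's ~2.4× figure comes from its own pilot data, so this toy ratio should be read as showing the direction and rough size of the effect, not as a replication.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sim = 200_000
sigma = 1.0  # pure noise, no true group effect anywhere

# Statistic A: max-minus-min spread across six group scores.
groups = rng.normal(0.0, sigma, size=(n_sim, 6))
spread6 = (groups.max(axis=1) - groups.min(axis=1)).mean()

# Statistic B: absolute difference between two repeated runs.
runs = rng.normal(0.0, sigma, size=(n_sim, 2))
diff2 = np.abs(runs[:, 0] - runs[:, 1]).mean()

# Naive comparison: A looks like a "disparity" far above the B "noise floor".
ratio = spread6 / diff2
print(f"mean six-group spread:  {spread6:.3f}")
print(f"mean two-run difference: {diff2:.3f}")
print(f"apparent overstatement:  {ratio:.2f}x")

# Arity-matched floor: compare against a six-value spread drawn from noise.
noise6 = rng.normal(0.0, sigma, size=(n_sim, 6))
floor6 = (noise6.max(axis=1) - noise6.min(axis=1)).mean()
print(f"arity-matched ratio:     {spread6 / floor6:.2f}x")
```

Once the noise floor is computed with the same statistic over the same number of values, the ratio falls back to roughly 1, which is the behavior an arity-matched comparison is meant to guarantee.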
When a proper arity-matched noise floor and an omnibus group test were used, Claude Haiku 4.5 showed no demographic effect above sampling noise: 0 of 120 pairwise contrasts and 0 of 9 omnibus contrasts survived correction. A planted-bias test confirmed that the instrument can detect disparity when it is present, which supports the benchmark's sensitivity.
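A planted-bias check of this kind can be sketched in a few lines. Everything specific here is an assumption: the article does not say which paired test or which false-discovery-rate procedure the harness uses, so this example substitutes a paired sign-flip permutation test and the Benjamini-Hochberg procedure. The simulated scores, the one-point penalty, and the sample sizes are invented.

```python
import numpy as np


def sign_flip_pvalue(diffs, n_perm=5000, rng=None):
    """Two-sided paired sign-flip permutation test on the mean difference."""
    if rng is None:
        rng = np.random.default_rng(0)
    observed = abs(diffs.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = np.abs((signs * diffs).mean(axis=1))
    return (1 + np.sum(null >= observed)) / (n_perm + 1)


def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of hypotheses rejected at FDR level q (step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()   # largest rank meeting its threshold
        reject[order[:k + 1]] = True
    return reject


rng = np.random.default_rng(42)
n_profiles, n_groups = 48, 6

# Simulated agent scores: shared profile quality plus small per-variant noise.
quality = rng.normal(6.0, 1.0, size=(n_profiles, 1))
scores = quality + rng.normal(0.0, 0.3, size=(n_profiles, n_groups))

# Planted bias (invented): penalize the last group by one point.
scores[:, 5] -= 1.0

idx_i, idx_j = np.triu_indices(n_groups, k=1)
pairs = [(int(i), int(j)) for i, j in zip(idx_i, idx_j)]
pvals = [sign_flip_pvalue(scores[:, i] - scores[:, j], rng=rng) for i, j in pairs]
reject = benjamini_hochberg(pvals, q=0.05)
flagged = {pair for pair, r in zip(pairs, reject) if r}
print("contrasts flagged after correction:", sorted(flagged))
```

If the pipeline is sensitive, every contrast involving the penalized group should be flagged. Running the same code without the planted penalty gives a baseline for how often contrasts are flagged by chance.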
Implications for Enterprise AI Adoption
For CTOs and technology leaders deploying LLM agents in high-stakes domains, AgentFairBench offers a sound, adoption-ready instrument for auditing fairness of agent actions. The authors have released the code, data, and harness under open licenses, with an anonymized review artifact available. A live leaderboard with a held-out private split and a contamination canary also allows external models to be submitted for testing.
| Metric | Value |
|---|---|
| Pilot decisions | 864, plus a test-retest replication |
| Disparity overstatement from a naive (arity-mismatched) noise comparison | ~2.4× |
| Claude Haiku 4.5 pairwise contrasts surviving correction | 0 of 120 |
| Claude Haiku 4.5 omnibus contrasts surviving correction | 0 of 9 |
| Cost per model | Single-digit USD |
The benchmark's focus on actions rather than answers aligns with the real risk for enterprises: an LLM that answers a fairness question correctly might still act unfairly when granted autonomy. AgentFairBench provides a rigorous, low-cost way to detect that gap before deployment.