AgentFairBench: Do LLM Agents Discriminate When They Act?

Manmeet Singh Kapoor; Manpreet Singh; Rohith Reddy Bellibaltu; Triveni Morla

arxiv: 2606.16723 · v1 · pith:AMOU5DLGnew · submitted 2026-06-15 · 💻 cs.AI

Pith reviewed 2026-06-27 03:50 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM agents · fairness benchmark · demographic disparity · counterfactual testing · hiring lending triage · action disparity · statistical controls

The pith

LLM agents can be audited for demographic disparity in actions using counterfactual name-coded profiles across hiring, lending and triage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AgentFairBench to measure whether LLM agents produce different actions based on demographic signals when performing real tasks. Synthetic profiles that differ only in race and gender signals from names are run through four agent scaffolds in three regulator-anchored domains. Metrics including counterfactual flip rate and mean absolute score difference are computed with bootstrap intervals, paired tests, and false-discovery-rate control. The pilot of 864 decisions shows that after matching the noise floor to the statistic arity, Claude Haiku 4.5 yields no surviving contrasts, while a planted-bias check confirms the method detects disparity when present.
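
As a rough illustration of the two metrics named above, the sketch below computes a counterfactual flip rate and a mean absolute score difference (MASD) from a matrix of agent outputs. The array layout (one row per matched set, one column per demographic variant), the function names, and the exact definitions are assumptions made for illustration; the paper's harness may define these quantities differently.

```python
import numpy as np

def flip_rate(actions: np.ndarray) -> float:
    """Fraction of matched sets whose action is not identical across all variants.

    actions: (n_sets, n_groups) array of discrete actions (e.g. 0 = reject, 1 = advance).
    """
    # Compare every variant in a set against the first variant of that set.
    flipped = (actions != actions[:, :1]).any(axis=1)
    return float(flipped.mean())

def masd(scores: np.ndarray) -> float:
    """Mean absolute score difference over all unordered variant pairs and all sets."""
    n_groups = scores.shape[1]
    i, j = np.triu_indices(n_groups, k=1)  # indices of unordered pairs (i < j)
    return float(np.abs(scores[:, i] - scores[:, j]).mean())

# Toy data (invented): 3 matched sets, 3 variants each.
actions = np.array([[1, 1, 1],
                    [1, 0, 1],
                    [0, 0, 0]])
scores = np.array([[0.8, 0.8, 0.8],
                   [0.6, 0.4, 0.6],
                   [0.2, 0.2, 0.2]])
print(flip_rate(actions))  # one of the three sets contains a flip
print(masd(scores))
```

Treating a set as "flipped" when any variant disagrees is one possible convention; a pairwise definition would count each disagreeing pair separately.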

Core claim

AgentFairBench evaluates LLM agents on matched synthetic profiles that vary solely by name-coded race-by-gender signals in hiring, lending, and medical triage. Under direct, chain-of-thought, multi-agent, and tool-augmented scaffolds, it computes action disparity metrics with an arity-matched null and FDR control. The 864-decision pilot finds zero of 120 pairwise and zero of nine omnibus contrasts significant for Claude Haiku 4.5, with the instrument validated by successful detection of planted bias.
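
One plausible reading of "action disparity" is the gap in favourable-action rates between groups. The sketch below is an assumption-laden illustration of that reading: the function name, the 0/1 coding of actions, and the max-minus-min summary are hypothetical choices, not taken from the paper.

```python
import numpy as np

def action_rate_disparity(actions: np.ndarray, groups: np.ndarray) -> tuple[dict, float]:
    """Per-group rate of the favourable action and the max-min gap between groups.

    actions: 1-D array of 0/1 decisions; groups: 1-D array of group labels, same length.
    """
    rates = {str(g): float(actions[groups == g].mean()) for g in np.unique(groups)}
    gap = max(rates.values()) - min(rates.values())
    return rates, gap

# Toy data (invented): two groups of four decisions each.
actions = np.array([1, 1, 0, 1, 0, 0, 1, 1])
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
rates, gap = action_rate_disparity(actions, groups)
print(rates, gap)
```

A raw gap like this says nothing about sampling noise on its own, which is why the benchmark pairs such statistics with a noise floor and multiple-testing control.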

What carries the argument

Counterfactual matched sets of demographic-neutral synthetic profiles that isolate a name-coded race x gender signal, evaluated under four agent scaffolds and scored by a NumPy harness for flip rate, MASD, and action-rate disparity with bootstrap CIs and FDR correction.
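
To make the bootstrap step concrete, here is a minimal percentile-bootstrap sketch that resamples matched sets with replacement. It is a simple stand-in under stated assumptions: the paper does not say here which interval type it uses, and the function name, the number of resamples, and the simulated input are all hypothetical.

```python
import numpy as np

def bootstrap_ci(per_set_values: np.ndarray, n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean, resampling matched sets with replacement."""
    rng = np.random.default_rng(seed)
    n = len(per_set_values)
    idx = rng.integers(0, n, size=(n_boot, n))       # one row of indices per resample
    boot_means = per_set_values[idx].mean(axis=1)
    low, high = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)

# Simulated per-set absolute score differences (invented data, not pilot results).
rng = np.random.default_rng(1)
per_set_abs_diff = np.abs(rng.normal(0.0, 0.05, size=200))
low, high = bootstrap_ci(per_set_abs_diff)
print(low, per_set_abs_diff.mean(), high)
```

Resampling whole matched sets, rather than individual decisions, keeps the counterfactual pairing intact inside each resample.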

If this is right

Fairness evaluation of LLM agents shifts from grading text outputs to auditing concrete actions in regulator-anchored domains.
The arity-matched noise floor prevents the roughly 2.4-fold overstatement of disparity that statistic arity alone would otherwise produce.
External models can be submitted to a live leaderboard with a held-out private split and contamination canary for standardized testing.
The open NumPy harness and artifacts make single-digit-dollar reproducibility feasible for any new agent scaffold.
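
The arity point in the list above can be illustrated with a small simulation. Under pure noise, the spread (max minus min) of six group means is compared against two candidate noise floors: the absolute difference of two runs, and the spread of six noise-only draws. The Gaussian noise model, the unit variance, and all variable names are assumptions; the ratio this toy setup produces is above 2 but need not match the paper's ~2.4 figure, which depends on the paper's own statistic and data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_groups, sigma = 100_000, 6, 1.0

# Pure noise: every group mean comes from the same distribution, so there is no true disparity.
group_means = rng.normal(0.0, sigma, size=(n_trials, n_groups))
six_group_spread = group_means.max(axis=1) - group_means.min(axis=1)

# Naive noise floor: absolute difference between two repeated runs (arity 2).
two_runs = rng.normal(0.0, sigma, size=(n_trials, 2))
two_run_diff = np.abs(two_runs[:, 0] - two_runs[:, 1])

# Arity-matched noise floor: spread of six independent noise-only draws (arity 6).
null_means = rng.normal(0.0, sigma, size=(n_trials, n_groups))
matched_floor = null_means.max(axis=1) - null_means.min(axis=1)

naive_ratio = six_group_spread.mean() / two_run_diff.mean()
matched_ratio = six_group_spread.mean() / matched_floor.mean()
print(round(naive_ratio, 2), round(matched_ratio, 2))
```

The naive ratio is well above 1 even though no disparity was simulated, while the matched ratio sits near 1, which is the behaviour a sound null should have.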

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same counterfactual design could be applied to test other protected attributes or intersectional signals once the name-coded method is validated.
If later models show detectable bias under the same instrument, the framework supplies a direct route to compare debiasing techniques in agent rather than base-model settings.
The planted-bias validation step offers a template for confirming sensitivity in any new fairness benchmark before null results are interpreted.
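
A planted-bias check of the kind mentioned in the last point could look like the sketch below: simulate unbiased paired scores, inject a known penalty into one variant, and confirm that a paired test flags it. The sign-flip permutation test is a stand-in chosen for illustration, since the paper's exact paired test is not specified here; the data, the penalty size, and the function name are hypothetical.

```python
import numpy as np

def paired_sign_flip_pvalue(diffs: np.ndarray, n_perm: int = 5000, seed: int = 0) -> float:
    """Two-sided p-value for mean(diffs) == 0 via random sign flips of paired differences."""
    rng = np.random.default_rng(seed)
    observed = abs(diffs.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, len(diffs)))
    null = np.abs((signs * diffs).mean(axis=1))
    # Add-one correction keeps the p-value strictly positive.
    return float((1 + (null >= observed).sum()) / (1 + n_perm))

rng = np.random.default_rng(42)
n_sets = 100
base = rng.normal(0.6, 0.1, size=n_sets)            # shared profile quality per matched set
score_a = base + rng.normal(0, 0.03, size=n_sets)   # variant A
score_b = base + rng.normal(0, 0.03, size=n_sets)   # variant B, no bias simulated

p_clean = paired_sign_flip_pvalue(score_a - score_b)

# Plant a known penalty on variant B and check that the test detects it.
planted_penalty = 0.05
p_planted = paired_sign_flip_pvalue(score_a - (score_b - planted_penalty))
print(p_clean, p_planted)
```

If the planted case did not produce a small p-value, a null result on real data would say little, because the instrument would lack the power to see an effect of that size.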

Load-bearing premise

The synthetic profiles carry no unintended correlations beyond the name-coded demographic signal and the three chosen domains plus four scaffolds are representative enough to speak about agent discrimination in general.
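
The neutrality half of this premise is something a harness can check mechanically. The sketch below builds a matched set from one template and asserts that variants differ in the name field only. The `Profile` fields, the group labels, and the placeholder names are all hypothetical; a real audit would use validated name lists and the benchmark's own profile schema.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Profile:
    name: str
    years_experience: int
    degree: str
    summary: str

# Hypothetical group -> name mapping, used only to illustrate the mechanics.
NAMES = {
    "group_1": "Name One",
    "group_2": "Name Two",
    "group_3": "Name Three",
}

def matched_set(template: Profile, names: dict[str, str]) -> dict[str, Profile]:
    """Copy the template once per group, changing only the name field."""
    fields = asdict(template)
    return {g: Profile(**{**fields, "name": n}) for g, n in names.items()}

def differing_fields(a: Profile, b: Profile) -> set[str]:
    """Names of the fields whose values differ between two profiles."""
    da, db = asdict(a), asdict(b)
    return {k for k in da if da[k] != db[k]}

template = Profile("PLACEHOLDER", 5, "BSc", "Five years of backend development experience.")
variants = matched_set(template, NAMES)

# Neutrality check: every pair of variants may differ in the name only.
keys = list(variants)
for i in range(len(keys)):
    for j in range(i + 1, len(keys)):
        assert differing_fields(variants[keys[i]], variants[keys[j]]) == {"name"}
```

A field-level check like this cannot catch correlations hidden inside free text, such as a summary that happens to echo the name, so it is a necessary rather than sufficient test of neutrality.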

What would settle it

Re-running the full benchmark on Claude Haiku 4.5 with a doubled sample size and finding at least one pairwise or omnibus contrast that survives FDR correction would falsify the reported absence of demographic effects.

Figures

Figures reproduced from arXiv: 2606.16723 by Manmeet Singh Kapoor, Manpreet Singh, Rohith Reddy Bellibaltu, Triveni Morla.

Figure 2. (Image not reproduced here; no caption was recovered.)

Original abstract

Large language model (LLM) agents increasingly take actions (screening applicants, recommending credit, triaging patients), yet fairness for LLMs is still measured by grading answers. We introduce AgentFairBench, a cheap, reproducible, multi-domain benchmark for demographic disparity in the actions of LLM agents. Grounded in a companion framework, the Bias Conduction Framework (BCF, restated here), it spans three regulator-anchored domains: hiring, lending, and medical triage. Synthetic, demographic-neutral profiles are evaluated in counterfactual matched sets that vary only a name-coded race x gender signal (in the Bertrand-Mullainathan tradition), under four agent scaffolds of increasing agency (direct, chain-of-thought, multi-agent deliberation, tool-augmented). A NumPy-only harness computes counterfactual flip rate, mean absolute score difference (MASD), action-rate disparity, and tool-invocation disparity, with bootstrap confidence intervals, paired tests, and false-discovery-rate control, for single-digit dollars per model. A live leaderboard with a held-out private split and a contamination canary admits external models by submission. Our pilot (864 decisions plus a test-retest replication) carries a methodological lesson: comparing a six-group score spread against a two-run noise difference overstates disparity by ~2.4x through statistic arity alone. Against an arity-matched noise floor and an omnibus group test, claude haiku 4 5 shows no demographic effect above sampling noise (0 of 120 pairwise and 0 of 9 omnibus contrasts survive correction); a planted-bias test confirms the instrument detects disparity when present. The contribution is a sound, sensitive, adoption-ready instrument, the arity-matched null methodology, and open artifacts to scale it. Code, data, and harness are released under open licenses, with an anonymized review artifact.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

AgentFairBench supplies a ready-to-use benchmark for agent actions with an arity-matched null and planted-bias check, but the no-disparity pilot result for Claude Haiku rests on the unverified neutrality of its synthetic profiles.

The main takeaway is a new benchmark that moves fairness testing from grading LLM answers to measuring actual decisions in hiring, lending, and triage. It runs name-based counterfactuals across race and gender, four agent scaffolds, and reports flip rates plus score differences with bootstrap intervals, paired tests, and FDR control. The pilot on 864 decisions finds no effect above noise for Claude Haiku after corrections, and a planted-bias test shows the instrument can detect disparity when it is present.

What the work does cleanly is release the NumPy harness, data, and leaderboard under open licenses, plus flag the arity artifact that inflates six-group spreads by roughly 2.4 times. That statistical lesson and the sensitivity check are concrete and reusable.

The soft spots are the usual ones for synthetic audit studies. Everything turns on the profiles being neutral except for the name signal; any hidden correlations in the generated text would undermine the null result. Three regulator domains and four scaffolds are a reasonable start but do not automatically generalize to other agent deployments. The abstract leaves profile construction and exact test details thin, though the internal logic described in the stress-test note holds together.

This is for teams that need an off-the-shelf instrument to audit deployed agents rather than for theorists. Readers who care about practical fairness tooling or the statistical correction will get value from the artifacts. It deserves peer review because it ships a validated, low-cost benchmark and addresses a clear measurement gap without overclaiming.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces AgentFairBench, a reproducible multi-domain benchmark for measuring demographic disparity in the actions (rather than text outputs) of LLM agents. Grounded in the Bias Conduction Framework, it evaluates synthetic, name-coded counterfactual profiles across hiring, lending, and medical triage under four agent scaffolds of increasing agency. A pilot of 864 decisions plus replication, using bootstrap CIs, paired tests, FDR control, and an arity-matched null, reports that Claude Haiku 4.5 shows no detectable demographic effect (0/120 pairwise and 0/9 omnibus contrasts survive correction); a planted-bias test confirms instrument sensitivity. Code, data, and a leaderboard with held-out split are released.

Significance. If the central empirical result holds, the work supplies a cheap, adoption-ready instrument for action-level fairness evaluation of LLM agents, together with an open harness and the arity-matched null methodology that prevents overstatement of disparity through statistic arity. The planted-bias validation directly addresses sensitivity, and the release of artifacts enables external scaling and verification.

minor comments (3)

[Abstract] The model identifier 'claude haiku 4 5' should be written consistently (e.g., 'Claude Haiku 4.5' or the precise API name) to avoid ambiguity.
[Methods] The manuscript states that profile construction follows the Bertrand-Mullainathan tradition and that profiles are 'demographic-neutral except for the name-coded signal'; §3 or the methods appendix should supply the exact template text, name lists, and any auxiliary variables used so that the neutrality assumption can be inspected directly.
[Statistical analysis] The abstract mentions paired tests and FDR control without naming the procedures; the main text or supplementary material should list the precise test statistics, the omnibus test employed, and the FDR procedure (e.g., Benjamini-Hochberg) with the number of tests corrected.
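
For readers unfamiliar with the procedure named in the last comment, a minimal sketch of the Benjamini-Hochberg step-up rule follows. It illustrates the standard procedure only; whether the paper uses this exact variant, and at what level q, is not stated here, and the example p-values are invented.

```python
import numpy as np

def benjamini_hochberg(pvalues: np.ndarray, q: float = 0.05) -> np.ndarray:
    """Boolean mask of hypotheses rejected by the Benjamini-Hochberg step-up procedure."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m   # q * rank / m for ranks 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()         # largest rank that meets its threshold
        reject[order[: k + 1]] = True          # reject every hypothesis up to that rank
    return reject

# Invented p-values for six contrasts.
pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.20, 0.74])
print(benjamini_hochberg(pvals))
```

Reporting the number of tests m alongside q matters because the thresholds scale with 1/m, so the same p-value can survive in a family of nine contrasts and fail in a family of 120.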

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of AgentFairBench, its grounding in the Bias Conduction Framework, the pilot results on Claude Haiku 4.5, and the recommendation for minor revision. No major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark

The paper is an empirical benchmark study that constructs synthetic counterfactual profiles varying only a name-coded signal, applies explicitly defined metrics (counterfactual flip rate, MASD, action-rate disparity), and validates via planted-bias tests plus FDR-controlled statistical procedures. The Bias Conduction Framework is restated within the paper itself rather than imported as a load-bearing premise. No equations, predictions, or uniqueness claims reduce by construction to fitted inputs or self-citations; the planted-bias test and arity-matched null are independent checks on instrument sensitivity. All load-bearing elements are externally falsifiable through the released harness and data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities can be extracted. The central claim rests on unstated assumptions about profile neutrality and domain representativeness.

pith-pipeline@v0.9.1-grok · 5879 in / 1104 out tokens · 33364 ms · 2026-06-27T03:50:01.872934+00:00 · methodology

[1] [1]

American Economic Review , author =

M. Bertrand and S. Mullainathan, ‘‘Are emily and greg more employable than lakisha and jamal? a field experiment on labor market discrimination,’’ American Economic Review, 94(4), 991-1013, 2004. [Online]. Available: https://www.aeaweb.org/articles?id=10.1257/0002828042002561

work page doi:10.1257/0002828042002561 2004

[2] [2]

Parrish, A

A. Parrish, A. Chen, N. Nangia, V . Padmakumar, J. Phang, J. Thompson, P . M. Htut, and S. R. Bowman, ‘‘Bbq: A hand-built bias benchmark for question answering,’’ inFindings of the Association for Computational Linguistics: ACL 2022, 2022, arXiv:2110.08193. [Online]. Available: https://arxiv.org/abs/2110.08193

Pith/arXiv arXiv 2022

[3] [3]

Nadeem, A

M. Nadeem, A. Bethke, and S. Reddy, ‘‘Stereoset: Measuring stereotypical bias in pretrained language models,’’ inProceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP 2021), 2021, arXiv:2004.09456. [Online]. Available: https://arxiv.org/abs/2004.09456

arXiv 2021

[4] [4]

Nangia, C

N. Nangia, C. V ania, R. Bhalerao, and S. R. Bowman, ‘‘Crows-pairs: A challenge dataset for measuring social biases in masked language models,’’ inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), 2020, arXiv:2010.00133. [Online]. Available: https://arxiv.org/abs/2010.00133

arXiv 2020

[5] [5]

X. Liu, H. Y u, H. Zhang, Y . Xu, X. Lei, H. Lai, Y . Gu, H. Ding, K. Men, K. Y ang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y . Su, H. Sun, M. Huang, Y . Dong, and J. Tang, ‘‘Agentbench: Evaluating llms as agents,’’ inICLR 2024 (International Conference on Learning Representations), 2024, arXiv:2308.03688. [Online]. Available: htt...

Pith/arXiv arXiv 2024

[6] [6]

S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y . Bisk, D. Fried, U. Alon, and G. Neubig, ‘‘Webarena: A realistic web environment for building autonomous agents,’’ inICLR 2024 (International Conference on Learning Representations), 2024, arXiv:2307.13854. [Online]. Available: https://arxiv.org/abs/2307.13854

Pith/arXiv arXiv 2024

[7] [7]

C. E. Jimenez, J. Y ang, A. Wettig, S. Y ao, K. Pei, O. Press, and K. Narasimhan, ‘‘Swe-bench: Can language models resolve real- world github issues?’’ inICLR 2024 (International Conference on Learning Representations), 2024, arXiv:2310.06770. [Online]. Available: https://arxiv.org/abs/2310.06770 14 VOLUME 14, 2026 Morlaet al.: AgentFairBench: Do LLM Agen...

Pith/arXiv arXiv 2024

[8] [8]

X. Deng, Y . Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y . Su, ‘‘Mind2web: Towards a generalist agent for the web,’’ inNeurIPS 2023 Datasets and Benchmarks Track (Spotlight), 2023, arXiv:2306.06070. [Online]. Available: https://arxiv.org/abs/2306.06070

Pith/arXiv arXiv 2023

[9] [9]

U.S. Equal Employment Opportunity Commission (EEOC), ‘‘Assessing adverse impact in software, algorithms, and artificial intelligence used in employment selection procedures under title vii of the civil rights act of 1964,’’ U.S. Equal Employment Opportunity Commission, Technical Assistance Document (issued May 18, 2023), AI and Algorithmic Fairness Initia...

1964

[10] [10]

[Online]

New Y ork City Council; NYC Department of Consumer and Worker Protection (DCWP), ‘‘Local law 144 of 2021: Automated employment decision tools (aedt) - final rule (dcwp), subchapter 25 of title 6 of the rules of the city of new york,’’ New Y ork City Administrative Code 20-870 et seq.; effective Jan 1, 2023, enforcement began July 5, 2023, 2023, nYC Local ...

2021

[11] [11]

Consumer Financial Protection Bureau (CFPB), ‘‘Equal credit opportunity act (regulation b), 12 cfr part 1002,’’ 15 U.S.C

U.S. Consumer Financial Protection Bureau (CFPB), ‘‘Equal credit opportunity act (regulation b), 12 cfr part 1002,’’ 15 U.S.C. 1691 et seq. (ECOA, enacted 1974); implemented by Regulation B, 12 CFR Part 1002 (CFPB), 1974, 12 CFR Part 1002. [Online]. Available: https://www.consumerfinance.gov/rules-policy/regulations/1002/

1974

[12] [12]

S. L. Blodgett, S. Barocas, H. Daumé III, and H. Wallach, ‘‘Language (technology) is power: A critical survey of ‘‘bias’’ in nlp,’’ inProceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), 5454–5476, 2020, arXiv:2005.14050. [Online]. Available: https://aclanthology.org/2020.acl-main.485/

arXiv 2020

[13] [13]

I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernoncourt, T. Y u, R. Zhang, and N. K. Ahmed, ‘‘Bias and fairness in large language models: A survey,’’Computational Linguistics, 50(3), 1097–1179, 2024, arXiv:2309.00770. [Online]. Available: https: //aclanthology.org/2024.cl-3.8/

arXiv 2024

[14] [14]

Tamkin, A

A. Tamkin, A. Askell, L. Lovitt, E. Durmus, N. Joseph, S. Kravec, K. Nguyen, J. Kaplan, and D. Ganguli, ‘‘Evaluating and mitigating discrimination in language model decisions,’’arXiv preprint, 2023, arXiv:2312.03689; dataset: Anthropic/discrim-eval. [Online]. Available: https://arxiv.org/abs/2312.03689

arXiv 2023

[15] [15]

J. An, D. Huang, C. Lin, and M. Tai, ‘‘Measuring gender and racial biases in large language models: Intersectional evidence from automated resume evaluation,’’PNAS Nexus, 2025. [Online]. Available: https://doi.org/10.1093/pnasnexus/pgaf089

work page doi:10.1093/pnasnexus/pgaf089 2025

[16] [16]

A. Haim, A. Salinas, and J. Nyarko, ‘‘What’s in a name? auditing large language models for race and gender bias,’’arXiv preprint, 2024, arXiv:2402.14875. [Online]. Available: https://arxiv.org/abs/2402.14875

arXiv 2024

[17] [17]

Jiang, K

Y . Jiang, K. C. Black, G. Geng, D. Park, J. Zou, A. Y . Ng, and J. H. Chen, ‘‘Medagentbench: A realistic virtual ehr environment to benchmark medical llm agents,’’arXiv preprint (also NEJM AI), 2025, arXiv:2501.14654. [Online]. Available: https://arxiv.org/abs/2501.14654

arXiv 2025

[18] [18]

Y . Xiao, J. Huang, R. He, J. Xiao, M. R. Mousavi, Y . Liu, K. Li, Z. Chen, and J. M. Zhang, ‘‘Fairmedqa: Benchmarking bias in large language models for medical question answering,’’arXiv preprint, 2025, arXiv:2505.19562. [Online]. Available: https://arxiv.org/abs/2505.19562

arXiv 2025

[19] [19]

Adappanavar, K

S. Adappanavar, K. Shailya, G. S. Krishnan, S. Natarajan, and B. Ravindran, ‘‘mfarm: Towards multi-faceted fairness assessment based on harms in clinical decision support,’’arXiv preprint, 2025, arXiv:2509.02007. [Online]. Available: https://arxiv.org/abs/2509.02007

arXiv 2025

[20] [20]

R. J. Y oung and A. M. Matthews, ‘‘Equitriage: A fairness audit of gender bias in llm-based emergency department triage,’’arXiv preprint, 2026, arXiv:2605.03998. [Online]. Available: https://arxiv.org/abs/2605.03998

Pith/arXiv arXiv 2026

[21] [21]

Mayilvaghanan, S

K. Mayilvaghanan, S. Gupta, and A. Kumar, ‘‘Counterfactual fairness evaluation of llm-based contact center agent quality assurance system,’’ arXiv preprint, 2026, arXiv:2602.14970. [Online]. Available: https: //arxiv.org/abs/2602.14970

arXiv 2026

[22] [22]

Dwork, M

C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel, ‘‘Fairness through awareness,’’ inProceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS ’12), 214-226, 2012. [Online]. Available: https://dl.acm.org/doi/10.1145/2090236.2090255

work page doi:10.1145/2090236.2090255 2012

[23] [23]

Hardt, E

M. Hardt, E. Price, and N. Srebro, ‘‘Equality of opportunity in supervised learning,’’ inAdvances in Neural Information Processing Systems 29 (NIPS 2016), 3315-3323, 2016, arXiv:1610.02413. [Online]. Available: https://arxiv.org/abs/1610.02413

Pith/arXiv arXiv 2016

[24] [24]

M. J. Kusner, J. R. Loftus, C. Russell, and R. Silva, ‘‘Counterfactual fairness,’’ inAdvances in Neural Information Processing Systems 30 (NIPS 2017), 4066-4076, 2017, arXiv:1703.06856. [Online]. Available: https://arxiv.org/abs/1703.06856

Pith/arXiv arXiv 2017

[25] [25]

Kleinberg, S

J. Kleinberg, S. Mullainathan, and M. Raghavan, ‘‘Inherent trade-offs in the fair determination of risk scores,’’ inProceedings of the 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), 2017, arXiv:1609.05807. [Online]. Available: https://arxiv.org/abs/1609. 05807

[26]

A. Chouldechova, “Fair prediction with disparate impact: A study of bias in recidivism prediction instruments,” Big Data, 5(2), 153–163, 2017, arXiv:1703.00056. [Online]. Available: https://arxiv.org/abs/1703.00056

[27]

European Parliament and Council of the European Union, “Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act),” Official Journal of the European Union, L series, published 12 July 2024, 2024, Regulation (EU) 2024/1689. [Online]. Availa...

[28]

National Institute of Standards and Technology (NIST), “Artificial intelligence risk management framework (AI RMF 1.0),” NIST, U.S. Department of Commerce (published January 26, 2023), 2023, NIST AI 100-1. [Online]. Available: https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf

[29]

M. De-Arteaga, A. Romanov, H. Wallach, J. Chayes, C. Borgs, A. Chouldechova, S. Geyik, K. Kenthapadi, and A. T. Kalai, “Bias in bios: A case study of semantic representation bias in a high-stakes setting,” in Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* 2019), 120–128, 2019, arXiv:1901.09451. [Online]. Available: htt...

[30]

J. Buolamwini and T. Gebru, “Gender shades: Intersectional accuracy disparities in commercial gender classification,” in Proceedings of the 1st Conference on Fairness, Accountability and Transparency (FAT* 2018), PMLR 81:77–91, 2018. [Online]. Available: https://proceedings.mlr.press/v81/buolamwini18a.html

[31]

B. Efron, “Better bootstrap confidence intervals,” Journal of the American Statistical Association, 82(397), 171–185, 1987. [Online]. Available: https://www.tandfonline.com/doi/abs/10.1080/01621459.1987.10478410

[32]

T. J. DiCiccio and B. Efron, “Bootstrap confidence intervals,” Statistical Science, 11(3), 189–228, 1996. [Online]. Available: https://projecteuclid.org/journals/statistical-science/volume-11/issue-3/Bootstrap-confidence-intervals/10.1214/ss/1032280214.full

[33]

Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: A practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 57, no. 1, pp. 289–300, 1995

[34]

Z. Yang et al., “Compared to what? Baselines and metrics for counterfactual prompting,” arXiv preprint, 2026, arXiv:2605.01048. [Online]. Available: https://arxiv.org/abs/2605.01048

[35]

S. M. Gaddis, “How black are Lakisha and Jamal? Racial perceptions from names used in correspondence audit studies,” Sociological Science, vol. 4, pp. 469–489, 2017, DOI: 10.15195/v4.a21. [Online]. Available: https://doi.org/10.15195/v4.a21

VOLUME 14, 2026 15 Morla et al.: AgentFairBench: Do LLM Agents Discriminate When They Act?

TRIVENI MORLA is a Data An...
