pith:I4FI2JPH
In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores
Standardized-test scores for LLM fairness are dominated by prompt wording choices unrelated to fairness itself.
arxiv:2605.12530 v1 · 2026-04-21 · cs.CL · cs.AI · cs.CY
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{I4FI2JPHCXWERRKM2WWGSQLV3F}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Surface-level prompt construction choices, although entirely orthogonal to the fairness question being tested, account for the majority of score variance, shift fairness conclusions in both direction and magnitude, and produce severe discordance in model rankings.
Underlying assumption: conversational behavior observed in the multi-agent MAC-Fairness setup is a valid, generalizable proxy for real-world fairness and is not itself distorted by the artificial dialogue structure or the agent identities chosen.
Standardized-test benchmarks for LLM fairness are unreliable because prompt wording alone drives most score variance and ranking changes, while a multi-agent conversational framework reveals consistent model-specific fairness behaviors across millions of dialogues.
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-18T03:10:02.691430Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
470a8d25e715ec48c54cd5ac694175d9798a17ba7c66e061b9efebaa4e298406
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/I4FI2JPHCXWERRKM2WWGSQLV3F \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 470a8d25e715ec48c54cd5ac694175d9798a17ba7c66e061b9efebaa4e298406
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "7e006b07dc5ff1fda2bd23737db799bee303d478e7a420d9183a0b1e4c463eb9",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CY"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-04-21T18:38:50Z",
    "title_canon_sha256": "ee053f754da8a53b2955526e7832ae7e9ce032b9073784015f36aecd87e4a021"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.12530",
    "kind": "arxiv",
    "version": 1
  }
}
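The same check can be run offline against the record printed on this page. The sketch below mirrors the serialization used in the verification one-liner (sorted keys, no whitespace between tokens, non-ASCII characters left unescaped) and compares the SHA-256 digest with the published canonical hash. The names `canonical_bytes`, `canonical_hash` and `verify` are illustrative and not part of any Pith tooling. It also assumes the JSON shown here is the complete `canonical_record` returned by the API, and it does not check the Ed25519 signature.

```python
import hashlib
import json

# Record copied from the "Canonical record JSON" section of this page.
# Assumption: this is the complete canonical_record the API returns.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "7e006b07dc5ff1fda2bd23737db799bee303d478e7a420d9183a0b1e4c463eb9",
        "cross_cats_sorted": ["cs.AI", "cs.CY"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2026-04-21T18:38:50Z",
        "title_canon_sha256": "ee053f754da8a53b2955526e7832ae7e9ce032b9073784015f36aecd87e4a021",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.12530", "kind": "arxiv", "version": 1},
}

# Canonical hash published on this page.
EXPECTED_HASH = "470a8d25e715ec48c54cd5ac694175d9798a17ba7c66e061b9efebaa4e298406"


def canonical_bytes(record):
    """Serialize like the one-liner: sorted keys, compact separators, UTF-8."""
    text = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode()


def canonical_hash(record):
    """Hex SHA-256 digest of the canonical serialization (hypothetical helper)."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


def verify(record, expected=EXPECTED_HASH):
    """Return True when the recomputed digest equals the expected one."""
    return canonical_hash(record) == expected


if __name__ == "__main__":
    digest = canonical_hash(RECORD)
    print(digest)
    print("match" if digest == EXPECTED_HASH else "mismatch")
```

Because keys are sorted before hashing, the order in which the record's fields are written does not affect the digest. A mismatch would suggest either that the record was altered or that the copy on this page differs from what the API serves.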