Pith Number

pith:I4FI2JPH

pith:2026:I4FI2JPHCXWERRKM2WWGSQLV3F
not attested · not anchored · not stored · refs resolved

In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores

Brando Miranda, Deonna Owens, Sang T. Truong, Sanmi Koyejo, Shreyas Sharma, Yibo Jacky Zhang, Zeyu Tang

Standardized-test scores for LLM fairness are dominated by prompt wording choices unrelated to fairness itself.

arxiv:2605.12530 v1 · 2026-04-21 · cs.CL · cs.AI · cs.CY

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{I4FI2JPHCXWERRKM2WWGSQLV3F}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp
2. Internet Archive
3. Author claim: open
4. Citations: open
5. Replications: open
Portable graph bundle: live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
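The page does not describe the merge algorithm itself, so the following is only a toy illustration of what "deterministic" can mean here: deduplicate events by content hash, sort them by a key every mirror can reproduce, and fold them in that order. The event fields (`kind`, `at`, `ref`), the state layout, and the two sample events are hypothetical placeholders, not Pith's schema or real data, and signature checking is left out entirely.

```python
import hashlib
import json


def event_id(event: dict) -> str:
    """Content-address an event by hashing its sorted, compact JSON form."""
    blob = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(blob).hexdigest()


def merge(record: dict, *event_sets: list) -> dict:
    """Fold events into one state, independent of arrival order and duplicates."""
    # Deduplicate by content hash so the same event from two mirrors counts once.
    unique = {event_id(e): e for events in event_sets for e in events}
    state = {"record": record, "citations": [], "replications": []}
    # Sort by (timestamp, content hash): a total order every mirror can reproduce.
    for eid in sorted(unique, key=lambda k: (unique[k]["at"], k)):
        event = unique[eid]
        # Hypothetical event kinds; anything else is ignored in this sketch.
        if event["kind"] in ("citation", "replication"):
            state[event["kind"] + "s"].append(event["ref"])
    return state


# Placeholder inputs for illustration only.
record = {"source": {"id": "2605.12530", "kind": "arxiv", "version": 1}}
a = {"kind": "citation", "at": "2026-05-19T00:00:00Z", "ref": "example-citing-work"}
b = {"kind": "replication", "at": "2026-05-20T00:00:00Z", "ref": "example-replication"}

# Two mirrors that received the events in different orders, one with a duplicate.
mirror_1 = merge(record, [a, b])
mirror_2 = merge(record, [b], [a, b])
print(mirror_1 == mirror_2)
```

Because the fold order depends only on event content, two mirrors holding the same set of events reach the same state regardless of how the events arrived.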

Claims

C1 · strongest claim

Surface-level prompt construction choices, although entirely orthogonal to the fairness question being tested, account for the majority of score variance, shift fairness conclusions in both direction and magnitude, and result in severe discordance in model rankings.

C2 · weakest assumption

That conversational behavior observed in the multi-agent MAC-Fairness setup is a valid, generalizable proxy for real-world fairness that is not itself distorted by the artificial dialogue structure or agent identities chosen.

C3 · one-line summary

Standardized-test benchmarks for LLM fairness are unreliable because prompt wording alone drives most score variance and ranking changes, while a multi-agent conversational framework reveals consistent model-specific fairness behaviors across millions of dialogues.

References

15 extracted · 15 resolved · 9 Pith anchors

[1] Phi-4 Technical Report · arXiv:2412.08905
[2] Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs · arXiv:2503.01743
[3] ORPP: Self-optimizing role-playing prompts to enhance language model capabilities · 2025
[4] The Llama 3 Herd of Models · arXiv:2407.21783
[5] LLM generated persona is a promise with a catch · 2020 · arXiv:2601.08584

Receipt and verification

First computed: 2026-05-18T03:10:02.691430Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

470a8d25e715ec48c54cd5ac694175d9798a17ba7c66e061b9efebaa4e298406

Aliases

arxiv: 2605.12530 · arxiv_version: 2605.12530v1 · doi: 10.48550/arxiv.2605.12530 · pith_short_12: I4FI2JPHCXWE · pith_short_16: I4FI2JPHCXWERRKM · pith_short_8: I4FI2JPH
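
The identifiers on this page appear to be related in a simple way: the 26-character Pith Number matches the first 26 characters of the Base32 (RFC 4648 alphabet) encoding of the canonical hash, and each `pith_short_N` alias is the first N characters of the Pith Number. This is inferred from the values shown here, not from a published specification, so treat the sketch below as a consistency check rather than the official derivation. The helper names are illustrative.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "470a8d25e715ec48c54cd5ac694175d9798a17ba7c66e061b9efebaa4e298406"
PITH_NUMBER = "I4FI2JPHCXWERRKM2WWGSQLV3F"


def base32_prefix(hex_digest: str, length: int) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str, lengths=(8, 12, 16)) -> dict:
    """Build the pith_short_N aliases as plain prefixes (assumed rule)."""
    return {f"pith_short_{n}": pith_number[:n] for n in lengths}


derived = base32_prefix(CANONICAL_HASH, len(PITH_NUMBER))
aliases = short_aliases(PITH_NUMBER)

print(derived == PITH_NUMBER)  # True for the values on this page
print(aliases)
```

If the assumed rule holds in general, a short alias can be checked against a full Pith Number with a prefix comparison alone, without contacting the service.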
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/I4FI2JPHCXWERRKM2WWGSQLV3F \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 470a8d25e715ec48c54cd5ac694175d9798a17ba7c66e061b9efebaa4e298406
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "7e006b07dc5ff1fda2bd23737db799bee303d478e7a420d9183a0b1e4c463eb9",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CY"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-04-21T18:38:50Z",
    "title_canon_sha256": "ee053f754da8a53b2955526e7832ae7e9ce032b9073784015f36aecd87e4a021"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.12530",
    "kind": "arxiv",
    "version": 1
  }
}
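
The shell pipeline above can also be mirrored offline. The sketch below embeds the canonical record exactly as displayed on this page and applies the same serialization the one-liner uses (sorted keys, compact separators, `ensure_ascii=False`, UTF-8) before hashing. Whether the digest matches the published hash depends on the displayed record being byte-for-byte what the service signs, which is an assumption here; the function and constant names are illustrative.

```python
import hashlib
import json

# Published hash, copied from the "Canonical hash" section of this page.
EXPECTED_HASH = "470a8d25e715ec48c54cd5ac694175d9798a17ba7c66e061b9efebaa4e298406"

# Canonical record as displayed on this page.
CANONICAL_RECORD = {
    "metadata": {
        "abstract_canon_sha256": "7e006b07dc5ff1fda2bd23737db799bee303d478e7a420d9183a0b1e4c463eb9",
        "cross_cats_sorted": ["cs.AI", "cs.CY"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2026-04-21T18:38:50Z",
        "title_canon_sha256": "ee053f754da8a53b2955526e7832ae7e9ce032b9073784015f36aecd87e4a021",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.12530", "kind": "arxiv", "version": 1},
}


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    blob = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


digest = canonical_sha256(CANONICAL_RECORD)
print(digest)
print("matches published hash:", digest == EXPECTED_HASH)
```

Sorting keys before hashing is what makes the check independent of the key order in which a client happens to receive or store the record.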