Pith Number
pith:VO6XP5ZK
pith:2024:VO6XP5ZKUME7MAU7Z4QSVSJCPJ
not attested
not anchored
not stored
refs resolved
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
LLM-as-a-Judge systems carry 12 measurable biases that automated tests can isolate and that persist in specific tasks.
arxiv:2410.02736 v2 · 2024-10-03 · cs.CL · cs.AI
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{VO6XP5ZKUME7MAU7Z4QSVSJCPJ}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Record completeness
1. Bitcoin timestamp
2. Internet Archive
3. Author claim
4. Citations
5. Replications
✓ Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same
current state with the deterministic merge algorithm.
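The page does not specify the merge algorithm. As an illustration only of what "deterministic" means here, the sketch below folds events in an order derived from their content, so two mirrors that receive the same events in different orders (or with duplicates) compute the same state. The event shape (`field`, `value`), the content-derived ID, and the last-ID-wins rule are all hypothetical, and signature checking is omitted.

```python
import hashlib
import json


def event_id(event: dict) -> str:
    """Hypothetical content-derived ID: SHA-256 of the event's sorted-key JSON."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(payload).hexdigest()


def merge(canonical_record: dict, events: list[dict]) -> dict:
    """Fold events into a state without depending on arrival order."""
    state = {"record": canonical_record, "fields": {}}
    # Deduplicate by content ID, then apply in sorted-ID order so that
    # every mirror replays exactly the same sequence.
    unique = {event_id(e): e for e in events}
    for eid in sorted(unique):
        e = unique[eid]
        # Hypothetical conflict rule: the event with the larger ID wins.
        state["fields"][e["field"]] = e["value"]
    return state
```

With this design, `merge(record, [a, b])` and `merge(record, [b, a, a])` return equal states, which is the property a mirror needs in order to recompute the current state independently.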
Claims
C1 strongest claim
Empirical results suggest that there remains room for improvement in the reliability of LLM-as-a-Judge.
C2 weakest assumption
That automated principle-guided modifications can isolate and accurately quantify each of the 12 biases without introducing confounding effects or missing interactions between biases.
C3 one line summary
LLM-as-a-Judge systems exhibit significant biases in specific tasks despite strong overall performance, as measured by the new CALM quantification framework.
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-17T23:38:50.337443Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) |
| Schema | pith-number/v1.0 |
Canonical hash
abbd77f72aa309f6029fcf212ac9227a70180b4aa44b1c48a28fec07a9aae446
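In this record, the identifier VO6XP5ZKUME7MAU7Z4QSVSJCPJ coincides with the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash above, and the short form VO6XP5ZK with the first 8. The page does not state this rule, so treating it as the general derivation is an assumption; the sketch below only reproduces the relationship observed here. The function name `pith_number_from_hash` is hypothetical.

```python
import base64


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the leading characters.

    Assumption: the 26- and 8-character truncations are inferred from this
    single record, not taken from a published specification.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


canonical = "abbd77f72aa309f6029fcf212ac9227a70180b4aa44b1c48a28fec07a9aae446"
print(pith_number_from_hash(canonical))     # VO6XP5ZKUME7MAU7Z4QSVSJCPJ
print(pith_number_from_hash(canonical, 8))  # VO6XP5ZK
```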
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/VO6XP5ZKUME7MAU7Z4QSVSJCPJ \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: abbd77f72aa309f6029fcf212ac9227a70180b4aa44b1c48a28fec07a9aae446
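The same check can be run offline once the canonical record has been saved. The sketch below mirrors the Python step of the command above: parse the JSON, re-serialize it with sorted keys, compact separators, and `ensure_ascii=False`, then take the SHA-256 of the UTF-8 bytes. The function names `canonical_hash` and `verify` are hypothetical helpers, not part of any Pith tooling.

```python
import hashlib
import json


def canonical_hash(record: dict) -> str:
    """SHA-256 over the record's canonical JSON, as in the one-liner above."""
    canonical = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    ).encode()                   # str.encode() defaults to UTF-8
    return hashlib.sha256(canonical).hexdigest()


def verify(record_json: str, expected_hex: str) -> bool:
    """Return True if the record text hashes to the expected digest."""
    return canonical_hash(json.loads(record_json)) == expected_hex.lower()
```

Because the record is re-serialized before hashing, indentation and key order in the saved file do not matter; only the parsed content does.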
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "15d0441096fc3846abc1211dad7de3d69d54921544abd2784ec4e69943d81333",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-10-03T17:53:30Z",
    "title_canon_sha256": "a4a9a58b4de587f75b3d630412b31da1be2405b88b67e40080c5396deb80bf42"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2410.02736",
    "kind": "arxiv",
    "version": 2
  }
}