Pith Number

pith:VO6XP5ZK

pith:2024:VO6XP5ZKUME7MAU7Z4QSVSJCPJ
Status: not attested · not anchored · not stored · refs resolved

Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge

Chao Huang, Dongping Chen, Jiayi Ye, Nitesh V Chawla, Nuno Moniz, Pin-Yu Chen, Qihui Zhang, Tian Gao, Werner Geyer, Xiangliang Zhang, Yanbo Wang, Yue Huang

LLM-as-a-Judge systems carry 12 measurable biases that automated tests can isolate and that persist in specific tasks.

arxiv:2410.02736 v2 · 2024-10-03 · cs.CL · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{VO6XP5ZKUME7MAU7Z4QSVSJCPJ}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Empirical results suggest that there remains room for improvement in the reliability of LLM-as-a-Judge.

C2 · weakest assumption

That automated principle-guided modifications can isolate and accurately quantify each of the 12 biases without introducing confounding effects or missing interactions between biases.

C3 · one-line summary

LLM-as-a-Judge systems exhibit significant biases in specific tasks despite strong overall performance, as measured by the new CALM quantification framework.

References

25 extracted · 25 resolved · 0 Pith anchors


Formal links

2 machine-checked theorem links

Cited by

28 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:50.337443Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

abbd77f72aa309f6029fcf212ac9227a70180b4aa44b1c48a28fec07a9aae446

Aliases

arxiv: 2410.02736 · arxiv_version: 2410.02736v2 · doi: 10.48550/arxiv.2410.02736 · pith_short_12: VO6XP5ZKUME7 · pith_short_16: VO6XP5ZKUME7MAU7 · pith_short_8: VO6XP5ZK
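
The short aliases above are plain prefixes of the full Pith Number. For this record, the full number also matches the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. The Python sketch below checks that relationship. The function name `pith_number_from_hash` is hypothetical, and treating this as the general derivation rule is an assumption inferred from this one record, not something the page states.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "abbd77f72aa309f6029fcf212ac9227a70180b4aa44b1c48a28fec07a9aae446"
PITH_NUMBER = "VO6XP5ZKUME7MAU7Z4QSVSJCPJ"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters.

    Hypothetical helper: the truncation length of 26 is inferred from this
    record and is an assumption, not a documented rule.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


derived = pith_number_from_hash(CANONICAL_HASH)
print(derived == PITH_NUMBER)

# The short aliases are prefixes of the full number.
for n in (8, 12, 16):
    print(f"pith_short_{n}: {derived[:n]}")
```

If the assumption holds for other records, the same check gives a quick way to confirm that a Pith Number and a canonical hash belong together without contacting the server.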

Agent API

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/VO6XP5ZKUME7MAU7Z4QSVSJCPJ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: abbd77f72aa309f6029fcf212ac9227a70180b4aa44b1c48a28fec07a9aae446

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "15d0441096fc3846abc1211dad7de3d69d54921544abd2784ec4e69943d81333",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-10-03T17:53:30Z",
    "title_canon_sha256": "a4a9a58b4de587f75b3d630412b31da1be2405b88b67e40080c5396deb80bf42"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2410.02736",
    "kind": "arxiv",
    "version": 2
  }
}
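
The verification one-liner above can also be run offline once the record is saved. The Python sketch below mirrors the same canonicalization (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes) and hashes the result with SHA-256. The helper name `canonical_sha256` is hypothetical. Whether the JSON shown on this page is byte-for-byte what the API returns under `.canonical_record` is an assumption, so the digest should be compared against the expected value and not presumed to match.

```python
import hashlib
import json

# Record copied from the "Canonical record JSON" section of this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "15d0441096fc3846abc1211dad7de3d69d54921544abd2784ec4e69943d81333",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2024-10-03T17:53:30Z",
        "title_canon_sha256": "a4a9a58b4de587f75b3d630412b31da1be2405b88b67e40080c5396deb80bf42",
    },
    "schema_version": "1.0",
    "source": {"id": "2410.02736", "kind": "arxiv", "version": 2},
}

EXPECTED = "abbd77f72aa309f6029fcf212ac9227a70180b4aa44b1c48a28fec07a9aae446"


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(RECORD)
print(digest)
print("match" if digest == EXPECTED else "mismatch")
```

Because keys are sorted before hashing, the order in which the record's fields are written does not affect the digest. Only the values and the key names do.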