pith:TXA7XASW
Large Language Models are not Fair Evaluators
Large language models used as evaluators favor responses according to their order in the prompt.
arxiv:2305.17926 v2 · 2023-05-29 · cs.CL · cs.AI · cs.IR
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{TXA7XASWTYJWVSMDZLVQCVWHTE}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
The quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation skews the evaluation result, making one model appear considerably superior to the other; e.g., Vicuna-13B could beat ChatGPT on 66 of 80 tested queries with ChatGPT as an evaluator.
The analysis assumes that human annotations collected on the Vicuna benchmark questions constitute a stable and unbiased ground truth against which LLM judgments can be calibrated.
LLMs show strong position bias when scoring model outputs, which makes rankings easy to manipulate. Calibrating with multiple evidence, balancing response positions, and adding selective human input reduce this bias and bring the judgments closer to human ones.
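The position-balancing idea in the last claim can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the `Judge` signature, `balanced_scores`, and `toy_biased_judge` are hypothetical names, and the toy judge stands in for an LLM call by scoring response length plus a fixed bonus for whichever response is shown first. Each pair is scored in both orders, several times, and the scores are averaged.

```python
from statistics import mean
from typing import Callable, Tuple

# Hypothetical judge signature (an assumption for this sketch):
# (question, first_response, second_response) -> (score_first, score_second).
# A real judge would prompt an LLM and parse its scores.
Judge = Callable[[str, str, str], Tuple[float, float]]


def balanced_scores(judge: Judge, question: str, a: str, b: str,
                    k: int = 3) -> Tuple[float, float]:
    """Average the scores of a and b over both orderings, k samples each."""
    a_scores, b_scores = [], []
    for _ in range(k):
        # Ordering 1: a is shown first.
        s_a, s_b = judge(question, a, b)
        a_scores.append(s_a)
        b_scores.append(s_b)
        # Ordering 2: b is shown first, so map the scores back to a and b.
        s_b, s_a = judge(question, b, a)
        a_scores.append(s_a)
        b_scores.append(s_b)
    return mean(a_scores), mean(b_scores)


def toy_biased_judge(question: str, first: str, second: str) -> Tuple[float, float]:
    """Toy stand-in: quality is the response length, and the first
    slot receives a fixed +2.0 bonus to imitate position bias."""
    return len(first) + 2.0, float(len(second))


single = toy_biased_judge("q", "aaaa", "bbbb")
balanced = balanced_scores(toy_biased_judge, "q", "aaaa", "bbbb", k=3)
print(single, balanced)
```

With two equally long responses, a single call to the toy judge favors whichever response appears first, while the balanced average gives both the same score.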
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-17T23:38:14.153571Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
9dc1fb82569e136ac983caeb0156c79920a0c1cf5b2d10e6e2cfada985e5d478
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/TXA7XASWTYJWVSMDZLVQCVWHTE \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 9dc1fb82569e136ac983caeb0156c79920a0c1cf5b2d10e6e2cfada985e5d478
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "e107e13651404e94eae5d12c7e083470d09dac7978c012369e39c8927b4ec367",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.IR"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-05-29T07:41:03Z",
    "title_canon_sha256": "0ba7c25aed0362032899ff9fac27d26763553d15813c42c934f6e07274c9398c"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2305.17926",
    "kind": "arxiv",
    "version": 2
  }
}
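The hashing step of the shell one-liner above can be unpacked into a few lines of Python for checking a saved copy of the record offline. This sketch assumes, as the one-liner does, that the canonical hash is SHA-256 over the record serialized with sorted keys, compact separators, and no ASCII escaping. `canonical_hash` is an illustrative name, and the record literal is copied from the JSON above.

```python
import hashlib
import json


def canonical_hash(record: dict) -> str:
    """SHA-256 of the record as compact JSON with sorted keys."""
    payload = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode()  # UTF-8 by default
    return hashlib.sha256(payload).hexdigest()


# Record copied from the "Canonical record JSON" section.
record = {
    "metadata": {
        "abstract_canon_sha256": "e107e13651404e94eae5d12c7e083470d09dac7978c012369e39c8927b4ec367",
        "cross_cats_sorted": ["cs.AI", "cs.IR"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2023-05-29T07:41:03Z",
        "title_canon_sha256": "0ba7c25aed0362032899ff9fac27d26763553d15813c42c934f6e07274c9398c",
    },
    "schema_version": "1.0",
    "source": {"id": "2305.17926", "kind": "arxiv", "version": 2},
}

# Hash listed on the page under "Canonical hash".
EXPECTED = "9dc1fb82569e136ac983caeb0156c79920a0c1cf5b2d10e6e2cfada985e5d478"

digest = canonical_hash(record)
print(digest)
print("match" if digest == EXPECTED else "mismatch")
```

Because the keys are sorted before serialization, the order in which the record's fields are written does not affect the digest. List order, as in `cross_cats_sorted`, still does.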