Pith Number

pith:TXA7XASW

pith:2023:TXA7XASWTYJWVSMDZLVQCVWHTE
not attested · not anchored · not stored · refs resolved

Large Language Models are not Fair Evaluators

Binghuai Lin, Dawei Zhu, Lei Li, Liang Chen, Peiyi Wang, Qi Liu, Tianyu Liu, Yunbo Cao, Zefan Cai, Zhifang Sui

Large language models used as evaluators favor responses according to their order in the prompt.

arxiv:2305.17926 v2 · 2023-05-29 · cs.CL · cs.AI · cs.IR

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{TXA7XASWTYJWVSMDZLVQCVWHTE}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

The quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other; e.g., Vicuna-13B could beat ChatGPT on 66 out of 80 tested queries with ChatGPT as an evaluator.

C2 weakest assumption

That human annotations collected on the Vicuna benchmark questions constitute a stable and unbiased ground truth against which LLM judgments can be calibrated.

C3 one-line summary

LLMs show strong position bias when scoring model outputs, which makes rankings easy to manipulate. Calibrating with multiple pieces of evidence, balancing response positions, and selectively adding human input reduces this bias and brings LLM judgments closer to human judgments.
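The position-balancing idea in the summary can be illustrated with a small sketch: score the pair in both orders and average each response's scores, so that neither response keeps a fixed slot. This is an illustration under stated assumptions, not the paper's implementation. The `Judge` signature, `balanced_scores`, `verdict`, and the toy `first_slot_judge` are hypothetical names introduced here; a real judge would call an LLM.

```python
from typing import Callable, Tuple

# Hypothetical judge signature (an assumption, not an API from the paper):
# (question, first_response, second_response) -> (score_first, score_second)
Judge = Callable[[str, str, str], Tuple[float, float]]


def balanced_scores(judge: Judge, question: str,
                    resp_a: str, resp_b: str) -> Tuple[float, float]:
    """Query the judge with both orderings and average each response's scores."""
    a_first, b_second = judge(question, resp_a, resp_b)  # A in the first slot
    b_first, a_second = judge(question, resp_b, resp_a)  # B in the first slot
    return (a_first + a_second) / 2, (b_first + b_second) / 2


def verdict(score_a: float, score_b: float) -> str:
    """Turn a pair of scores into 'A', 'B', or 'tie'."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


def first_slot_judge(question: str, first: str, second: str) -> Tuple[float, float]:
    """Toy judge that ignores content and always favors the first slot."""
    return 9.0, 7.0


# With a single ordering, the toy judge's winner depends only on position.
single_ab = verdict(*first_slot_judge("q", "answer A", "answer B"))
b_score, a_score = first_slot_judge("q", "answer B", "answer A")
single_ba = verdict(a_score, b_score)

# Averaging over both orderings cancels the purely positional preference.
balanced = verdict(*balanced_scores(first_slot_judge, "q", "answer A", "answer B"))
print(single_ab, single_ba, balanced)
```

The toy judge is deliberately extreme: it shows that a purely positional preference flips the single-order verdict and cancels out once both orders are averaged. A real evaluator mixes positional and content signals, so balancing reduces the bias rather than guaranteeing its removal.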

References

72 extracted · 72 resolved · 10 Pith anchors


Formal links

2 machine-checked theorem links

Cited by

25 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:14.153571Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

9dc1fb82569e136ac983caeb0156c79920a0c1cf5b2d10e6e2cfada985e5d478

Aliases

arxiv: 2305.17926 · arxiv_version: 2305.17926v2 · doi: 10.48550/arxiv.2305.17926 · pith_short_12: TXA7XASWTYJW · pith_short_16: TXA7XASWTYJWVSMD · pith_short_8: TXA7XASW
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/TXA7XASWTYJWVSMDZLVQCVWHTE \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 9dc1fb82569e136ac983caeb0156c79920a0c1cf5b2d10e6e2cfada985e5d478
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "e107e13651404e94eae5d12c7e083470d09dac7978c012369e39c8927b4ec367",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.IR"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-05-29T07:41:03Z",
    "title_canon_sha256": "0ba7c25aed0362032899ff9fac27d26763553d15813c42c934f6e07274c9398c"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2305.17926",
    "kind": "arxiv",
    "version": 2
  }
}
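For readers without `curl` or `jq`, the same check can be sketched in plain Python against the canonical record shown above. It applies the serialization used by the one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8) and compares the digest with the listed canonical hash. The function name `canonical_sha256` and the variable names are illustrative, and the sketch assumes the record printed on this page is exactly what the API returns.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record using the same canonical JSON form as the shell one-liner."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "e107e13651404e94eae5d12c7e083470d09dac7978c012369e39c8927b4ec367",
        "cross_cats_sorted": ["cs.AI", "cs.IR"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2023-05-29T07:41:03Z",
        "title_canon_sha256": "0ba7c25aed0362032899ff9fac27d26763553d15813c42c934f6e07274c9398c",
    },
    "schema_version": "1.0",
    "source": {"id": "2305.17926", "kind": "arxiv", "version": 2},
}

# Canonical hash as listed on this page.
EXPECTED = "9dc1fb82569e136ac983caeb0156c79920a0c1cf5b2d10e6e2cfada985e5d478"

digest = canonical_sha256(record)
print(digest)
print("match" if digest == EXPECTED else "mismatch")
```

Because keys are sorted before hashing, the order in which the fields are typed does not affect the digest. A mismatch would suggest that the record was transcribed incorrectly or that the served record differs from the one shown here.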