Pith Number

pith:TXA7XASW

pith:2023:TXA7XASWTYJWVSMDZLVQCVWHTE

not attested · not anchored · not stored · refs resolved

Large Language Models are not Fair Evaluators

Binghuai Lin, Dawei Zhu, Lei Li, Liang Chen, Peiyi Wang, Qi Liu, Tianyu Liu, Yunbo Cao, Zefan Cai, Zhifang Sui

Large language models used as evaluators favor responses according to their order in the prompt.

arxiv:2305.17926 v2 · 2023-05-29 · cs.CL · cs.AI · cs.IR

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{TXA7XASWTYJWVSMDZLVQCVWHTE}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna-13B could beat ChatGPT on 66 over 80 tested queries with ChatGPT as an evaluator.

C2 weakest assumption

That human annotations collected on the Vicuna benchmark questions constitute a stable and unbiased ground truth against which LLM judgments can be calibrated.

C3 one-line summary

LLMs show strong position bias when scoring model outputs, which makes rankings easy to manipulate. Calibrating with multiple evidence, balancing response positions, and adding selective human input reduces this bias and brings the judgments closer to human ones.
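The position-balancing idea in the summary can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_biased_judge` is a hypothetical stand-in for an LLM evaluator (it scores by length and adds a fixed bonus to whichever response appears first), and `balanced_scores` is an assumed helper name. The sketch scores each pair in both orders and averages the two scores per response, so the first-slot bonus cancels out.

```python
from statistics import mean
from typing import Callable, Tuple

# A judge sees (first_response, second_response) and returns one score per slot.
Judge = Callable[[str, str], Tuple[float, float]]


def toy_biased_judge(first: str, second: str) -> Tuple[float, float]:
    """Hypothetical evaluator: length-based score plus a bonus for the first slot."""
    return (len(first) + 3.0, float(len(second)))


def balanced_scores(judge: Judge, a: str, b: str) -> Tuple[float, float]:
    """Score the pair in both orders and average each response's two scores."""
    a_first, b_second = judge(a, b)  # a is shown first
    b_first, a_second = judge(b, a)  # b is shown first
    return (mean([a_first, a_second]), mean([b_second, b_first]))


a, b = "abcd", "abcde"
print(toy_biased_judge(a, b))   # verdict with a first
print(toy_biased_judge(b, a))   # verdict with b first
print(balanced_scores(toy_biased_judge, a, b))
```

With a single ordering, the toy judge prefers whichever response comes first. After balancing, the result no longer depends on the order in which the two responses are passed.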

References

72 extracted · 72 resolved · 10 Pith anchors

[1] Belinkov, Y.; Poliak, A.; Shieber, S.; Van Durme, B.; and Rush, A. 2019. Don't Take the Premise for Granted: Mitigating Artifacts in Natural Language Inference. In Proceedings of the 57th Annual Mee 2019

[3] Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ram 2020

[5] Cai, Z.; Tu, L.; and Gimpel, K. 2017. Pay Attention to the Ending: Strong Neural Baselines for the ROC Story Cloze Task. In Proceedings of the 55th Annual Meeting of the Association for Computational L 2017

[6] PaLM: Scaling Language Modeling with Pathways 2022 · arXiv:2204.02311

[9] LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model 2023 · arXiv:2304.15010

Formal links

2 machine-checked theorem links

Cited by

25 papers in Pith

Pramana: A Protocol-Layer Treatment of Claim Verification in Autonomous Agent Networks

Agentic AI Translate: An Agentic Translator Prototype for Translation as Communication Design

S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization

Receipt and verification

First computed	2026-05-17T23:38:14.153571Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

9dc1fb82569e136ac983caeb0156c79920a0c1cf5b2d10e6e2cfada985e5d478

Aliases

arxiv: 2305.17926 · arxiv_version: 2305.17926v2 · doi: 10.48550/arxiv.2305.17926 · pith_short_12: TXA7XASWTYJW · pith_short_16: TXA7XASWTYJWVSMD · pith_short_8: TXA7XASW
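The identifiers on this page are related in a way that can be checked offline. The short aliases are prefixes of the 26-character Pith Number, and for this record the Pith Number coincides with the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. The second point is an observation from the values shown here, not a documented rule of the scheme. The helper names below are assumptions made for illustration.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "9dc1fb82569e136ac983caeb0156c79920a0c1cf5b2d10e6e2cfada985e5d478"
PITH_NUMBER = "TXA7XASWTYJWVSMDZLVQCVWHTE"


def base32_prefix(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str, lengths=(8, 12, 16)) -> dict:
    """Derive the pith_short_N aliases as plain prefixes of the full identifier."""
    return {f"pith_short_{n}": pith_number[:n] for n in lengths}


print(base32_prefix(CANONICAL_HASH) == PITH_NUMBER)
print(short_aliases(PITH_NUMBER))
```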

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/TXA7XASWTYJWVSMDZLVQCVWHTE \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 9dc1fb82569e136ac983caeb0156c79920a0c1cf5b2d10e6e2cfada985e5d478
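
The same canonicalization can be run offline in Python. The sketch below mirrors the serialization settings of the one-liner above (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes) and applies them to the canonical record as printed on this page. The function names are assumptions. Whether the digest equals the published canonical hash depends on the served `canonical_record` matching the JSON shown here, so compare the printed value against the hash above rather than assuming it.

```python
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and compact separators, then encode as UTF-8."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_sha256(record: dict) -> str:
    """Hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "e107e13651404e94eae5d12c7e083470d09dac7978c012369e39c8927b4ec367",
        "cross_cats_sorted": ["cs.AI", "cs.IR"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2023-05-29T07:41:03Z",
        "title_canon_sha256": "0ba7c25aed0362032899ff9fac27d26763553d15813c42c934f6e07274c9398c",
    },
    "schema_version": "1.0",
    "source": {"id": "2305.17926", "kind": "arxiv", "version": 2},
}

digest = canonical_sha256(record)
print(digest)  # compare with the canonical hash listed on this page
```

Because keys are sorted before hashing, the digest does not depend on the order in which the fields were written, which is what lets a mirror recompute the same value from the same record.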

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "e107e13651404e94eae5d12c7e083470d09dac7978c012369e39c8927b4ec367",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.IR"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-05-29T07:41:03Z",
    "title_canon_sha256": "0ba7c25aed0362032899ff9fac27d26763553d15813c42c934f6e07274c9398c"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2305.17926",
    "kind": "arxiv",
    "version": 2
  }
}