{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:TXA7XASWTYJWVSMDZLVQCVWHTE","short_pith_number":"pith:TXA7XASW","schema_version":"1.0","canonical_sha256":"9dc1fb82569e136ac983caeb0156c79920a0c1cf5b2d10e6e2cfada985e5d478","source":{"kind":"arxiv","id":"2305.17926","version":2},"attestation_state":"computed","paper":{"title":"Large Language Models are not Fair Evaluators","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Large language models used as evaluators favor responses according to their order in the prompt.","cross_cats":["cs.AI","cs.IR"],"primary_cat":"cs.CL","authors_text":"Binghuai Lin, Dawei Zhu, Lei Li, Liang Chen, Peiyi Wang, Qi Liu, Tianyu Liu, Yunbo Cao, Zefan Cai, Zhifang Sui","submitted_at":"2023-05-29T07:41:03Z","abstract_excerpt":"In this paper, we uncover a systematic bias in the evaluation paradigm of adopting large language models~(LLMs), e.g., GPT-4, as a referee to score and compare the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna-13B could beat ChatGPT on 66 over 80 tested queries with ChatGPT as an evaluator. To address this issue, we propos"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2305.17926","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2023-05-29T07:41:03Z","cross_cats_sorted":["cs.AI","cs.IR"],"title_canon_sha256":"0ba7c25aed0362032899ff9fac27d26763553d15813c42c934f6e07274c9398c","abstract_canon_sha256":"e107e13651404e94eae5d12c7e083470d09dac7978c012369e39c8927b4ec367"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:14.154368Z","signature_b64":"n4KSdimmxw//JnOxg98iSbywg41rExhD0WYoSMXtPNp3chReulGb+qHSEwXhjpumte+2WILBb51kQWAChrPEAg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"9dc1fb82569e136ac983caeb0156c79920a0c1cf5b2d10e6e2cfada985e5d478","last_reissued_at":"2026-05-17T23:38:14.153571Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:14.153571Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Large Language Models are not Fair Evaluators","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Large language models used as evaluators favor responses according to their order in the prompt.","cross_cats":["cs.AI","cs.IR"],"primary_cat":"cs.CL","authors_text":"Binghuai Lin, Dawei Zhu, Lei Li, Liang Chen, Peiyi Wang, Qi Liu, Tianyu Liu, Yunbo Cao, Zefan Cai, Zhifang Sui","submitted_at":"2023-05-29T07:41:03Z","abstract_excerpt":"In this paper, we uncover a systematic bias in the evaluation paradigm of adopting large language models~(LLMs), e.g., GPT-4, as a referee to score and compare the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna-13B could beat ChatGPT on 66 over 80 tested queries with ChatGPT as an evaluator. To address this issue, we propos"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna-13B could beat ChatGPT on 66 over 80 tested queries with ChatGPT as an evaluator.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That human annotations collected on the Vicuna benchmark questions constitute a stable and unbiased ground truth against which LLM judgments can be calibrated.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LLMs show strong position bias when scoring model outputs, allowing easy manipulation of rankings, but calibration with multiple evidence, position balancing, and selective human input reduces this bias to better match human judgments.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Large language models used as evaluators favor responses according to their order in the prompt.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"e8fe65c4f96d4c05160237347496aeaa0d28cd7376949bd6ca18235d3e7000e4"},"source":{"id":"2305.17926","kind":"arxiv","version":2},"verdict":{"id":"e5367dc8-1fb8-402d-a38f-16ce3a1be36a","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T12:05:24.426089Z","strongest_claim":"the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna-13B could beat ChatGPT on 66 over 80 tested queries with ChatGPT as an evaluator.","one_line_summary":"LLMs show strong position bias when scoring model outputs, allowing easy manipulation of rankings, but calibration with multiple evidence, position balancing, and selective human input reduces this bias to better match human judgments.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That human annotations collected on the Vicuna benchmark questions constitute a stable and unbiased ground truth against which LLM judgments can be calibrated.","pith_extraction_headline":"Large language models used as evaluators favor responses according to their order in the prompt."},"references":{"count":72,"sample":[{"doi":"","year":2019,"title":"Belinkov, Y.; Poliak, A.; Shieber, S.; Van Durme, B.; and Rush, A. 2019. Don ' t Take the Premise for Granted: Mitigating Artifacts in Natural Language Inference. In Proceedings of the 57th Annual Mee","work_id":"099acdd3-ec45-46fa-8a38-bc935d292cc0","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert - Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ram","work_id":"7482ffb4-572e-4267-8c72-e5a5ff5eb542","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2017,"title":"Cai, Z.; Tu, L.; and Gimpel, K. 2017. Pay Attention to the Ending:Strong Neural Baselines for the ROC Story Cloze Task. In Proceedings of the 55th Annual Meeting of the Association for Computational L","work_id":"ea44d084-0b7c-41a6-9fc3-9c354af34f7a","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","ref_index":6,"cited_arxiv_id":"2204.02311","is_internal_anchor":true},{"doi":"","year":2023,"title":"LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model","work_id":"0fe2cfd8-d442-4ceb-b1a9-a465704f39b2","ref_index":9,"cited_arxiv_id":"2304.15010","is_internal_anchor":true}],"resolved_work":72,"snapshot_sha256":"40b18502bd3aa18145e3d4ed24d5660671bf6c58b05ec29b63d4e06b22e95d5a","internal_anchors":10},"formal_canon":{"evidence_count":2,"snapshot_sha256":"310b8c173c245eed116c79825b9fc1cc714eca83ec05998962494fd6024e87be"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2305.17926","created_at":"2026-05-17T23:38:14.153684+00:00"},{"alias_kind":"arxiv_version","alias_value":"2305.17926v2","created_at":"2026-05-17T23:38:14.153684+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2305.17926","created_at":"2026-05-17T23:38:14.153684+00:00"},{"alias_kind":"pith_short_12","alias_value":"TXA7XASWTYJW","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"TXA7XASWTYJWVSMD","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"TXA7XASW","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":25,"internal_anchor_count":25,"sample":[{"citing_arxiv_id":"2605.20312","citing_title":"Pramana: A Protocol-Layer Treatment of Claim Verification in Autonomous Agent Networks","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17041","citing_title":"Agentic AI Translate: An Agentic Translator Prototype for Translation as Communication Design","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17076","citing_title":"S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17554","citing_title":"Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2509.23542","citing_title":"On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2401.05561","citing_title":"TrustLLM: Trustworthiness in Large Language Models","ref_index":194,"is_internal_anchor":true},{"citing_arxiv_id":"2409.12917","citing_title":"Training Language Models to Self-Correct via Reinforcement Learning","ref_index":121,"is_internal_anchor":true},{"citing_arxiv_id":"2512.23213","citing_title":"Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2311.16867","citing_title":"The Falcon Series of Open Language Models","ref_index":218,"is_internal_anchor":true},{"citing_arxiv_id":"2309.00267","citing_title":"RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback","ref_index":114,"is_internal_anchor":true},{"citing_arxiv_id":"2312.08935","citing_title":"Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations","ref_index":85,"is_internal_anchor":true},{"citing_arxiv_id":"2305.19118","citing_title":"Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate","ref_index":74,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04539","citing_title":"RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2504.07615","citing_title":"VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04539","citing_title":"RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2412.05579","citing_title":"LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods","ref_index":239,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23178","citing_title":"Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04539","citing_title":"RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01311","citing_title":"The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01123","citing_title":"PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2604.20273","citing_title":"ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04066","citing_title":"Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04065","citing_title":"Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2306.05685","citing_title":"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02765","citing_title":"U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning","ref_index":117,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/TXA7XASWTYJWVSMDZLVQCVWHTE","json":"https://pith.science/pith/TXA7XASWTYJWVSMDZLVQCVWHTE.json","graph_json":"https://pith.science/api/pith-number/TXA7XASWTYJWVSMDZLVQCVWHTE/graph.json","events_json":"https://pith.science/api/pith-number/TXA7XASWTYJWVSMDZLVQCVWHTE/events.json","paper":"https://pith.science/paper/TXA7XASW"},"agent_actions":{"view_html":"https://pith.science/pith/TXA7XASWTYJWVSMDZLVQCVWHTE","download_json":"https://pith.science/pith/TXA7XASWTYJWVSMDZLVQCVWHTE.json","view_paper":"https://pith.science/paper/TXA7XASW","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2305.17926&json=true","fetch_graph":"https://pith.science/api/pith-number/TXA7XASWTYJWVSMDZLVQCVWHTE/graph.json","fetch_events":"https://pith.science/api/pith-number/TXA7XASWTYJWVSMDZLVQCVWHTE/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/TXA7XASWTYJWVSMDZLVQCVWHTE/action/timestamp_anchor","attest_storage":"https://pith.science/pith/TXA7XASWTYJWVSMDZLVQCVWHTE/action/storage_attestation","attest_author":"https://pith.science/pith/TXA7XASWTYJWVSMDZLVQCVWHTE/action/author_attestation","sign_citation":"https://pith.science/pith/TXA7XASWTYJWVSMDZLVQCVWHTE/action/citation_signature","submit_replication":"https://pith.science/pith/TXA7XASWTYJWVSMDZLVQCVWHTE/action/replication_record"}},"created_at":"2026-05-17T23:38:14.153684+00:00","updated_at":"2026-05-17T23:38:14.153684+00:00"}