pith. machine review for the scientific record.

arxiv: 2305.17926 · v2 · submitted 2023-05-29 · 💻 cs.CL · cs.AI · cs.IR

Recognition: 2 theorem links · Lean Theorem

Large Language Models are not Fair Evaluators

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 12:05 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords: position bias · LLM evaluator · fair evaluation · calibration · Vicuna benchmark · human alignment · GPT-4 · response ordering

The pith

Large language models used as evaluators favor responses according to their order in the prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper demonstrates that LLMs such as GPT-4 exhibit a strong position bias when asked to score or rank responses from other models. Simply reversing the order in which two candidate answers appear can flip which one receives the higher rating, allowing researchers to engineer the apparent superiority of one model over another. On the Vicuna benchmark the authors show that presenting answers in a particular sequence lets Vicuna-13B defeat ChatGPT on 66 of 80 queries when ChatGPT itself serves as judge. They introduce three calibration techniques—generating multiple pieces of evidence before scoring, averaging across many orderings, and routing difficult cases to humans via an entropy measure—to reduce the bias and produce judgments that better match new human annotations of the same data.

Core claim

We uncover a systematic bias in the evaluation paradigm of adopting large language models (LLMs), e.g., GPT-4, as a referee to score and compare the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna-13B could beat ChatGPT on 66 over 80 tested queries with ChatGPT as an evaluator. To address this issue, we propose a calibration framework with three simple yet effective strategies.

What carries the argument

Position bias in LLM-as-judge prompts, mitigated by a calibration framework that combines multiple evidence generation, aggregation over varied response orders, and entropy-based selection for human review.
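
The first of these strategies can be sketched as a prompt that demands evidence before scores; the template below is illustrative (our wording and parameter names, not the paper's actual prompt):

```python
# Illustrative sketch of Multiple Evidence Calibration: the judge model is
# asked to write k independent analyses BEFORE committing to any scores,
# so the final ratings are conditioned on explicit evidence.
# Template text and function names are hypothetical, not from the paper.
EVIDENCE_FIRST_TEMPLATE = """\
Question: {question}

Response A: {response_a}

Response B: {response_b}

First write {k} independent analyses comparing the two responses on
accuracy, helpfulness, and relevance. Only after all analyses, output
two final scores from 1 to 10 as 'A: <score>' and 'B: <score>'."""

def build_evidence_first_prompt(question, response_a, response_b, k=3):
    """Fill the template; the judge model's completion is parsed downstream."""
    return EVIDENCE_FIRST_TEMPLATE.format(
        question=question, response_a=response_a, response_b=response_b, k=k
    )
```

Repeating this with several analyses per call is what the "multiple evidence" bullet below refers to: the spread of the resulting scores is itself a signal of judgment stability.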

If this is right

  • Generating multiple pieces of evaluation evidence before assigning a final rating increases consistency across repeated judgments.
  • Averaging scores obtained from many different response orderings reduces the influence of any single presentation sequence.
  • Entropy calculated over position-diverse judgments identifies examples where LLM and human decisions are likely to diverge, allowing targeted human intervention.
  • The resulting calibrated scores show higher agreement with human win/tie/lose labels than uncalibrated LLM outputs.
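
The ordering-average step admits a compact sketch, assuming a hypothetical `judge(question, first, second)` callable that returns two scores on a 1-10 scale (this is our stand-in, not the authors' implementation):

```python
def balanced_position_scores(judge, question, resp_a, resp_b):
    """Balanced Position Calibration (sketch): query the judge under both
    orderings of the two candidates and average each candidate's scores,
    so neither presentation position systematically benefits."""
    s1_a, s1_b = judge(question, resp_a, resp_b)   # A shown first
    s2_b, s2_a = judge(question, resp_b, resp_a)   # B shown first
    return (s1_a + s2_a) / 2, (s1_b + s2_b) / 2

def verdict(score_a, score_b):
    """Map averaged scores to a win/tie/lose label for candidate A."""
    if score_a == score_b:
        return "tie"
    return "win" if score_a > score_b else "lose"
```

With a judge that, say, adds a constant bonus to whichever answer appears first, the two orderings cancel and equal-quality answers come out tied.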

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evaluation pipelines that rely on a single fixed ordering of candidates should be revised to include explicit randomization or multi-order averaging by default.
  • The same ordering sensitivity may appear in other LLM judgment tasks such as pairwise preference collection or automatic grading of open-ended answers.
  • Public release of the human annotations creates a reusable test set for measuring progress on bias-resistant evaluators.
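
The first extension above, randomization by default, amounts to little more than a seeded shuffle per query; the function name and seeding scheme here are ours, not from the paper:

```python
import random

def randomized_order(candidates, query_id, seed=0):
    """Sketch of 'randomize presentation order by default': derive a
    per-query ordering from a string-seeded RNG, so the order is
    reproducible across runs but not fixed across the benchmark."""
    rng = random.Random(f"{seed}:{query_id}")
    order = list(candidates)
    rng.shuffle(order)
    return order
```

A pipeline that logs `query_id` and `seed` alongside each judgment can then reproduce any ordering exactly when auditing for position effects.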

Load-bearing premise

Human win/tie/lose annotations on the Vicuna benchmark questions supply a stable reference standard for measuring and correcting bias in LLM judgments.

What would settle it

Re-evaluating the same 80 queries after applying balanced position calibration across all presentation orders and checking whether the number of cases in which Vicuna-13B beats ChatGPT drops substantially below 66.
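
That check can be sketched as a counting loop, with `judge` and `dataset` as stand-ins for the evaluator model and the 80 query triples (our names, not the paper's):

```python
def count_wins(judge, dataset, calibrated):
    """Count queries where candidate A beats candidate B, either under a
    single fixed presentation order or under balanced position calibration.
    `dataset` holds (question, resp_a, resp_b) triples; `judge` returns
    (score_first, score_second) for the two responses as presented."""
    wins = 0
    for question, resp_a, resp_b in dataset:
        if calibrated:
            s1a, s1b = judge(question, resp_a, resp_b)   # A shown first
            s2b, s2a = judge(question, resp_b, resp_a)   # B shown first
            a, b = (s1a + s2a) / 2, (s1b + s2b) / 2
        else:
            a, b = judge(question, resp_a, resp_b)       # A always first
        wins += int(a > b)
    return wins
```

If the 66-of-80 figure is mostly position bias, the calibrated count should fall toward the human-annotated win rate.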

read the original abstract

In this paper, we uncover a systematic bias in the evaluation paradigm of adopting large language models~(LLMs), e.g., GPT-4, as a referee to score and compare the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna-13B could beat ChatGPT on 66 over 80 tested queries with ChatGPT as an evaluator. To address this issue, we propose a calibration framework with three simple yet effective strategies: 1) Multiple Evidence Calibration, which requires the evaluator model to generate multiple evaluation evidence before assigning ratings; 2) Balanced Position Calibration, which aggregates results across various orders to determine the final score; 3) Human-in-the-Loop Calibration, which introduces a balanced position diversity entropy to measure the difficulty of each example and seeks human assistance when needed. We also manually annotate the "win/tie/lose" outcomes of responses from ChatGPT and Vicuna-13B in the Vicuna Benchmark's question prompt, and extensive experiments demonstrate that our approach successfully mitigates evaluation bias, resulting in closer alignment with human judgments. We release our code and human annotation at \url{https://github.com/i-Eval/FairEval} to facilitate future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs used as evaluators (e.g., GPT-4) exhibit systematic position bias: simply swapping the order of two candidate responses in the prompt can reverse win rates, allowing e.g. Vicuna-13B to beat ChatGPT on 66 of 80 Vicuna-benchmark queries. The authors propose three calibration strategies—Multiple Evidence Calibration, Balanced Position Calibration, and Human-in-the-Loop Calibration using position-diversity entropy—and validate them by manually collecting win/tie/lose labels on the same 80 queries, reporting improved agreement with these human labels after calibration.

Significance. If the order-bias effect and the effectiveness of the three calibrations hold, the result is significant for the growing literature that uses LLM-as-judge protocols; the public release of code and the human annotations is a clear strength that supports reproducibility and follow-up work.

major comments (2)
  1. [Human annotation protocol (described in the experiments section)] The central validation rests on the manually collected human win/tie/lose labels, yet the manuscript supplies no inter-annotator agreement statistic (Fleiss’ or Cohen’s kappa), no exact annotation instructions, no count of annotators per item, and no disagreement-resolution procedure. Because the claim that the calibrations “result in closer alignment with human judgments” is only as strong as the stability of those labels, this omission is load-bearing for the empirical conclusion.
  2. [§3 and the main results table] The reported 66/80 reversal under order swap is presented as evidence of a general, easily exploitable bias, but the paper does not test whether the magnitude of the effect is stable across different evaluator models, prompt templates, or response-length distributions; without such checks the claim that “the quality ranking … can be easily hacked” risks over-generalization from a single pair and single evaluator.
minor comments (2)
  1. [Abstract] The abstract states that the three strategies are “simple yet effective” but does not preview the quantitative improvement (e.g., change in agreement rate or win-rate delta) that would let readers immediately gauge the practical gain.
  2. [§3.3] Notation for the position-diversity entropy in the Human-in-the-Loop strategy should be defined explicitly with a formula rather than described only in prose.
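
One plausible formalization of that entropy, assuming verdicts from the different orderings are pooled into an empirical win/tie/lose distribution (our notation, not necessarily the paper's):

```python
import math
from collections import Counter

def position_diversity_entropy(verdicts):
    """Hypothetical formalization of balanced position diversity entropy:
    pool the win/tie/lose verdicts produced under different response
    orderings and compute the Shannon entropy (in bits) of their empirical
    distribution. High entropy means the judge flips with position, so the
    example is a candidate for human review."""
    counts = Counter(verdicts)
    n = len(verdicts)
    return -sum((c / n) * math.log(c / n, 2) for c in counts.values())
```

A threshold on this quantity then routes only the high-entropy (position-unstable) examples to annotators.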

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects for improving the rigor and clarity of our work. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Human annotation protocol (described in the experiments section)] The central validation rests on the manually collected human win/tie/lose labels, yet the manuscript supplies no inter-annotator agreement statistic (Fleiss’ or Cohen’s kappa), no exact annotation instructions, no count of annotators per item, and no disagreement-resolution procedure. Because the claim that the calibrations “result in closer alignment with human judgments” is only as strong as the stability of those labels, this omission is load-bearing for the empirical conclusion.

    Authors: We agree that the human annotation details are essential for assessing label reliability. The annotations were performed by three independent annotators per query using written instructions focused on helpfulness, accuracy, and relevance to determine win/tie/lose. Disagreements were resolved via majority vote. We will add a new subsection detailing the full protocol, including the exact instructions (in an appendix), annotator count, and inter-annotator agreement (Fleiss’ kappa). This will be incorporated in the revised manuscript to strengthen the empirical claims. revision: yes

  2. Referee: [§3 and the main results table] The reported 66/80 reversal under order swap is presented as evidence of a general, easily exploitable bias, but the paper does not test whether the magnitude of the effect is stable across different evaluator models, prompt templates, or response-length distributions; without such checks the claim that “the quality ranking … can be easily hacked” risks over-generalization from a single pair and single evaluator.

    Authors: We acknowledge the value of broader testing for generalizability. Our core demonstration uses a standard setup with GPT-4 on the Vicuna benchmark to illustrate the systematic position bias and its exploitability. The calibration strategies are model-agnostic by design. In revision we will add a limitations discussion noting that exact magnitudes may vary with other evaluators or templates, and include a small-scale check on one additional model where feasible. The central finding of bias in this widely adopted evaluation paradigm remains valid and actionable. revision: partial

Circularity Check

0 steps flagged

No significant circularity; validation uses independent human annotations

full rationale

The paper's core claims rest on empirical demonstrations of position bias in LLM evaluators and the effectiveness of three calibration strategies, all validated by direct comparison to separately collected human win/tie/lose labels on the Vicuna benchmark. These human annotations constitute an external benchmark rather than a quantity derived from the LLM outputs or the proposed calibrations themselves. No equations, fitted parameters, or self-citations are shown to reduce the reported improvements to the paper's own inputs by construction. Minor self-citation of prior LLM evaluation work may exist but is not load-bearing for the central empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is almost entirely empirical and relies on standard assumptions about human judgment reliability rather than new mathematical axioms or fitted parameters.

axioms (1)
  • domain assumption: Human win/tie/lose annotations collected on the Vicuna benchmark questions form a reliable reference standard for response quality.
    Used to measure how well the calibrated LLM judgments align with human preferences.

pith-pipeline@v0.9.0 · 5582 in / 1242 out tokens · 45796 ms · 2026-05-17T12:05:24.426089+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice

    cs.LG 2026-05 unverdicted novelty 7.0

    An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.

  2. RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

    cs.CL 2026-05 unverdicted novelty 6.0

    RLearner-LLM achieves up to 6x gains in NLI entailment over standard fine-tuning by using an automated hybrid DPO pipeline that balances logic and fluency across multiple model sizes and domains.

  3. RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

    cs.CL 2026-05 unverdicted novelty 6.0

    RLearner-LLM's Hybrid-DPO fuses DeBERTa NLI and LLM verifier scores to deliver up to 6x higher NLI entailment than standard SFT while preserving answer coverage across academic domains.

  4. PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.

  5. ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks

    cs.AI 2026-04 unverdicted novelty 6.0

    ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing ranki...

  6. VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    cs.CV 2025-04 unverdicted novelty 6.0

    VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.

  7. Training Language Models to Self-Correct via Reinforcement Learning

    cs.LG 2024-09 unverdicted novelty 6.0

    SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

  8. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

    cs.AI 2023-12 conditional novelty 6.0

    Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.

  9. The Falcon Series of Open Language Models

    cs.CL 2023-11 conditional novelty 6.0

    Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

  10. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

    cs.CL 2023-09 conditional novelty 6.0

    RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.

  11. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    cs.CL 2023-06 accept novelty 6.0

    GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.

  12. RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

    cs.CL 2026-05 unverdicted novelty 5.0

    Hybrid-DPO combining NLI and verifier scores delivers up to 6x NLI improvement over SFT baselines across multiple LLMs and domains while preserving answer coverage and inference speed.

  13. U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning

    cs.AI 2026-05 unverdicted novelty 5.0

    U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.

  14. Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

    cs.AI 2026-04 unverdicted novelty 5.0

    Style bias dominates LLM-as-a-Judge systems far more than position bias, with debiasing strategies providing model-dependent gains and public tools released for replication.

  15. Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 5.0

    APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.

  16. Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.

  17. Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

    cs.CL 2025-12 unverdicted novelty 5.0

    LLM-PeerReview ensembles LLMs by scoring responses with LLM-as-Judge and selecting the best via averaging or truth inference, beating Smoothie-Global by 6.9-7.3 points on four datasets.

  18. Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

    cs.CL 2023-05 conditional novelty 5.0

    Multi-agent debate with tit-for-tat arguments and a judge LLM improves reasoning by preventing LLMs from locking into incorrect initial solutions.

  19. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    cs.CL 2024-12 accept novelty 3.0

    A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 17 Pith papers · 10 internal anchors

  1. [1]

    Belinkov, Y.; Poliak, A.; Shieber, S.; Van Durme, B.; and Rush, A. 2019. Don't Take the Premise for Granted: Mitigating Artifacts in Natural Language Inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL)

  2. [3]

    Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford...

  3. [5]

    Cai, Z.; Tu, L.; and Gimpel, K. 2017. Pay Attention to the Ending: Strong Neural Baselines for the ROC Story Cloze Task. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL)

  4. [6]

    PaLM: Scaling Language Modeling with Pathways

    Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H. W.; Sutton, C.; Gehrmann, S.; Schuh, P.; Shi, K.; Tsvyashchenko, S.; Maynez, J.; Rao, A.; Barnes, P.; Tay, Y.; Shazeer, N. M.; Prabhakaran, V.; Reif, E.; Du, N.; Hutchinson, B. C.; Pope, R.; Bradbury, J.; Austin, J.; Isard, M.; Gur-Ari, G.; Yin, P.; Duke, T.; ...

  5. [9]

    Gao, P.; Han, J.; Zhang, R.; Lin, Z.; Geng, S.; Zhou, A.; Zhang, W.; Lu, P.; He, C.; Yue, X.; Li, H.; and Qiao, Y. J. 2023. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model. ArXiv, abs/2304.15010

  6. [10]

    Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; and Parikh, D. 2017. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  7. [11]

    Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S.; and Smith, N. A. 2018. Annotation Artifacts in Natural Language Inference Data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics

  8. [12]

    He, T.; Zhang, J.; Wang, T.; Kumar, S.; Cho, K.; Glass, J.; and Tsvetkov, Y. 2023. On the Blind Spots of Model-Based Evaluation Metrics for Text Generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 12067--12097. Toronto, Canada: Association for Computational Linguistics

  9. [13]

    Levy, O.; Remus, S.; Biemann, C.; and Dagan, I. 2015. Do Supervised Distributional Methods Really Learn Lexical Inference Relations? In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics

  10. [15]

    Lin, C.-Y. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, 74--81. Barcelona, Spain: Association for Computational Linguistics

  11. [16]

    Liu, T.; Xin, Z.; Chang, B.; and Sui, Z. 2020a. HypoNLI: Exploring the Artificial Patterns of Hypothesis-only Bias in Natural Language Inference. In Proceedings of the 12th Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association. ISBN 979-10-95546-34-4

  12. [17]

    Liu, T.; Xin, Z.; Ding, X.; Chang, B.; and Sui, Z. 2020b. An Empirical Study on Model-agnostic Debiasing Strategies for Robust Natural Language Inference. In Proceedings of the 24th Conference on Computational Natural Language Learning. Online: Association for Computational Linguistics

  13. [19]

    McCoy, T.; Pavlick, E.; and Linzen, T. 2019. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL)

  14. [20]

    McHugh, M. L. 2012. Interrater reliability: the kappa statistic. Biochemia medica, 22(3): 276--282

  15. [21]

    Min, S.; Wallace, E.; Singh, S.; Gardner, M.; Hajishirzi, H.; and Zettlemoyer, L. 2019. Compositional Questions Do Not Necessitate Multi-hop Reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL)

  16. [22]

    OpenAI. 2022. Introducing ChatGPT

  17. [23]

    OpenAI. 2023. GPT-4 Technical Report. CoRR, abs/2303.08774

  18. [24]

    Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311--318. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics

  19. [25]

    Peng, B.; Li, C.; He, P.; Galley, M.; and Gao, J. 2023. Instruction Tuning with GPT-4. ArXiv, abs/2304.03277

  20. [26]

    Schwartz, R.; Sap, M.; Konstas, I.; Zilles, L.; Choi, Y.; and Smith, N. A. 2017. The Effect of Different Writing Tasks on Linguistic Style: A Case Study of the ROC Story Cloze Task. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL)

  21. [29]

    Sun, Z.; Shen, Y.; Zhou, Q.; Zhang, H.; Chen, Z.; Cox, D. D.; Yang, Y.; and Gan, C. 2023. Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision. ArXiv, abs/2305.03047

  22. [30]

    Turpin, M.; Michael, J.; Perez, E.; and Bowman, S. R. 2023. Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. CoRR, abs/2305.04388

  23. [31]

    Wang, P.; Song, Y.; Liu, T.; Lin, B.; Cao, Y.; Li, S.; and Sui, Z. 2022. Learning Robust Representations for Continual Relation Extraction via Adversarial Class Augmentation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 6264--6278. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics

  24. [32]

    Wang, P.; Xun, R.; Liu, T.; Dai, D.; Chang, B.; and Sui, Z. 2021. Behind the Scenes: An Exploration of Trigger Biases Problem in Few-Shot Event Classification. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 1969--1978

  25. [35]

    Xia, H.; Wang, P.; Liu, T.; Lin, B.; Cao, Y.; and Sui, Z. 2023. Enhancing Continual Relation Extraction via Classifier Decomposition. In Findings of the Association for Computational Linguistics: ACL 2023, 10053--10062. Toronto, Canada: Association for Computational Linguistics

  26. [36]

    Xu, C.; Guo, D.; Duan, N.; and McAuley, J. 2023. Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data. ArXiv, abs/2304.01196

  27. [37]

    Yuan, W.; Neubig, G.; and Liu, P. 2021. BARTScore: Evaluating Generated Text as Text Generation. In Ranzato, M.; Beygelzimer, A.; Dauphin, Y. N.; Liang, P.; and Vaughan, J. W., eds., Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, 27263--27277

  28. [38]

    Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi, Y. 2020. BERTScore: Evaluating Text Generation with BERT. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net

  29. [40]

    Zhou, C.; Liu, P.; Xu, P.; Iyer, S.; Sun, J.; Mao, Y.; Ma, X.; Efrat, A.; Yu, P.; Yu, L.; Zhang, S.; Ghosh, G.; Lewis, M.; Zettlemoyer, L.; and Levy, O. 2023. LIMA: Less Is More for Alignment

  30. [41]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena , author=. arXiv preprint arXiv:2306.05685 , year=

  31. [42]

    ArXiv , year=

    Instruction Tuning with GPT-4 , author=. ArXiv , year=

  32. [43]

    ArXiv , year=

    Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision , author=. ArXiv , year=

  33. [44]

    ArXiv , year=

    Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data , author=. ArXiv , year=

  34. [45]

    LIMA: Less Is More for Alignment , author=

  35. [46]

    ArXiv , year=

    PaLM: Scaling Language Modeling with Pathways , author=. ArXiv , year=

  36. [47]

    ArXiv , year=

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model , author=. ArXiv , year=

  37. [48]

    Tom B. Brown and Benjamin Mann and Nick Ryder and Melanie Subbiah and Jared Kaplan and Prafulla Dhariwal and Arvind Neelakantan and Pranav Shyam and Girish Sastry and Amanda Askell and Sandhini Agarwal and Ariel Herbert. Language Models are Few-Shot Learners , booktitle =. 2020 , url =

  38. [49]

    B leu: a Method for Automatic Evaluation of Machine Translation

    Papineni, Kishore and Roukos, Salim and Ward, Todd and Zhu, Wei-Jing. B leu: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002. doi:10.3115/1073083.1073135

  39. [50]

    ROUGE : A Package for Automatic Evaluation of Summaries

    Lin, Chin-Yew. ROUGE : A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. 2004

  40. [51]

    Weinberger and Yoav Artzi , title =

    Tianyi Zhang and Varsha Kishore and Felix Wu and Kilian Q. Weinberger and Yoav Artzi , title =. 8th International Conference on Learning Representations,. 2020 , url =

  41. [52]

    BARTScore: Evaluating Generated Text as Text Generation , booktitle =

    Weizhe Yuan and Graham Neubig and Pengfei Liu , editor =. BARTScore: Evaluating Generated Text as Text Generation , booktitle =. 2021 , url =

  42. [53]

    GPT-4 Technical Report

    OpenAI , title =. CoRR , volume =. 2023 , url =. doi:10.48550/arXiv.2303.08774 , eprinttype =. 2303.08774 , timestamp =

  43. [54]

    2022 , url =

    OpenAI , title =. 2022 , url =

  44. [55]

    A Survey on In-context Learning

    A Survey for In-context Learning , author=. arXiv preprint arXiv:2301.00234 , year=

  45. [56]

    Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

    Miles Turpin and Julian Michael and Ethan Perez and Samuel R. Bowman , title =. CoRR , volume =. 2023 , url =. doi:10.48550/arXiv.2305.04388 , eprinttype =. 2305.04388 , timestamp =

  46. [57]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author=

  47. [58]

    arXiv preprint arXiv:2304.00612 , year=

    Eight things to know about large language models , author=. arXiv preprint arXiv:2304.00612 , year=

  48. [59]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Self-consistency improves chain of thought reasoning in language models , author=. arXiv preprint arXiv:2203.11171 , year=

  49. [60]

    arXiv preprint arXiv:2306.05087 , year=

    PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization , author=. arXiv preprint arXiv:2306.05087 , year=

  50. [61]

    GitHub repository , howpublished =

    Dai Juntao and Ji Jiaming and Pan Xuehai and SunRuiyang , title =. GitHub repository , howpublished =. 2023 , publisher =

  51. [62]

    Biochemia medica , volume=

    Interrater reliability: the kappa statistic , author=. Biochemia medica , volume=. 2012 , publisher=

  52. [63]

    arXiv preprint arXiv:2306.07932 , year=

    Human-in-the-Loop through Chain-of-Thought , author=. arXiv preprint arXiv:2306.07932 , year=

  53. [64]

    Proceedings of the 30th ACM International Conference on Information & Knowledge Management , pages=

    Behind the Scenes: An Exploration of Trigger Biases Problem in Few-Shot Event Classification , author=. Proceedings of the 30th ACM International Conference on Information & Knowledge Management , pages=

  54. [65]

    Annotation Artifacts in Natural Language Inference Data

    Gururangan, Suchin and Swayamdipta, Swabha and Levy, Omer and Schwartz, Roy and Bowman, Samuel and Smith, Noah A. Annotation Artifacts in Natural Language Inference Data. Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics. 2018

  55. [66]

    Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

    McCoy, Tom and Pavlick, Ellie and Linzen, Tal. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). 2019

  56. [67]

    Don't Take the Premise for Granted: Mitigating Artifacts in Natural Language Inference

    Belinkov, Yonatan and Poliak, Adam and Shieber, Stuart and Van Durme, Benjamin and Rush, Alexander. Don't Take the Premise for Granted: Mitigating Artifacts in Natural Language Inference. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). 2019

  57. [68]

    HypoNLI: Exploring the Artificial Patterns of Hypothesis-only Bias in Natural Language Inference

    Liu, Tianyu and Xin, Zheng and Chang, Baobao and Sui, Zhifang. HypoNLI: Exploring the Artificial Patterns of Hypothesis-only Bias in Natural Language Inference. Proceedings of the 12th Language Resources and Evaluation Conference. 2020

  58. [69]

    An Empirical Study on Model-agnostic Debiasing Strategies for Robust Natural Language Inference

    Liu, Tianyu and Xin, Zheng and Ding, Xiaoan and Chang, Baobao and Sui, Zhifang. An Empirical Study on Model-agnostic Debiasing Strategies for Robust Natural Language Inference. Proceedings of the 24th Conference on Computational Natural Language Learning. 2020

  59. [70]

    Compositional Questions Do Not Necessitate Multi-hop Reasoning

    Min, Sewon and Wallace, Eric and Singh, Sameer and Gardner, Matt and Hajishirzi, Hannaneh and Zettlemoyer, Luke. Compositional Questions Do Not Necessitate Multi-hop Reasoning. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). 2019

  60. [71]

    Pay Attention to the Ending: Strong Neural Baselines for the ROC Story Cloze Task

    Cai, Zheng and Tu, Lifu and Gimpel, Kevin. Pay Attention to the Ending: Strong Neural Baselines for the ROC Story Cloze Task. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL). 2017

  61. [72]

    The Effect of Different Writing Tasks on Linguistic Style: A Case Study of the ROC Story Cloze Task

    Schwartz, Roy and Sap, Maarten and Konstas, Ioannis and Zilles, Leila and Choi, Yejin and Smith, Noah A. The Effect of Different Writing Tasks on Linguistic Style: A Case Study of the ROC Story Cloze Task. Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL). 2017

  62. [73]

    Do Supervised Distributional Methods Really Learn Lexical Inference Relations?

    Levy, Omer and Remus, Steffen and Biemann, Chris and Dagan, Ido. Do Supervised Distributional Methods Really Learn Lexical Inference Relations?. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics. 2015

  63. [74]

    Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

    Goyal, Yash and Khot, Tejas and Summers-Stay, Douglas. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. 2017

  64. [75]

    Error analysis prompting enables human-like translation evaluation in large language models: A case study on ChatGPT

    Error analysis prompting enables human-like translation evaluation in large language models: A case study on ChatGPT. arXiv preprint arXiv:2303.13809. 2023

  65. [76]

    On the Blind Spots of Model-Based Evaluation Metrics for Text Generation

    He, Tianxing and Zhang, Jingyu and Wang, Tianle and Kumar, Sachin and Cho, Kyunghyun and Glass, James and Tsvetkov, Yulia. On the Blind Spots of Model-Based Evaluation Metrics for Text Generation. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.674

  66. [77]

    RestGPT: Connecting Large Language Models with Real-World Applications via RESTful APIs

    RestGPT: Connecting Large Language Models with Real-World Applications via RESTful APIs. arXiv preprint arXiv:2306.06624. 2023

  67. [78]

    AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback

    AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. arXiv preprint arXiv:2305.14387. 2023

  68. [79]

    How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources

    How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources. arXiv preprint arXiv:2306.04751. 2023

  69. [80]

    Learning Robust Representations for Continual Relation Extraction via Adversarial Class Augmentation

    Wang, Peiyi and Song, Yifan and Liu, Tianyu and Lin, Binghuai and Cao, Yunbo and Li, Sujian and Sui, Zhifang. Learning Robust Representations for Continual Relation Extraction via Adversarial Class Augmentation. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. doi:10.18653/v1/2022.emnlp-main.420

  70. [81]

    RepCL: Exploring Effective Representation for Continual Text Classification

    RepCL: Exploring Effective Representation for Continual Text Classification. arXiv preprint arXiv:2305.07289. 2023

  71. [82]

    Enhancing Continual Relation Extraction via Classifier Decomposition

    Xia, Heming and Wang, Peiyi and Liu, Tianyu and Lin, Binghuai and Cao, Yunbo and Sui, Zhifang. Enhancing Continual Relation Extraction via Classifier Decomposition. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.638

  72. [83]

    M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning

    M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning. arXiv preprint arXiv:2306.04387. 2023