pith. machine review for the scientific record.

arxiv: 2305.17926 · v2 · submitted 2023-05-29 · 💻 cs.CL · cs.AI · cs.IR

Recognition: 2 theorem links · Lean Theorem

Large Language Models are not Fair Evaluators

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 12:05 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords: position bias · LLM evaluator · fair evaluation · calibration · Vicuna benchmark · human alignment · GPT-4 · response ordering

The pith

Large language models used as evaluators favor responses according to their order in the prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper demonstrates that LLMs such as GPT-4 exhibit a strong position bias when asked to score or rank responses from other models. Simply reversing the order in which two candidate answers appear can flip which one receives the higher rating, allowing researchers to engineer the apparent superiority of one model over another. On the Vicuna benchmark the authors show that presenting answers in a particular sequence lets Vicuna-13B defeat ChatGPT on 66 of 80 queries when ChatGPT itself serves as judge. They introduce three calibration techniques—generating multiple pieces of evidence before scoring, averaging across many orderings, and routing difficult cases to humans via an entropy measure—to reduce the bias and produce judgments that better match new human annotations of the same data.

Core claim

We uncover a systematic bias in the evaluation paradigm of adopting large language models (LLMs), e.g., GPT-4, as a referee to score and compare the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna-13B could beat ChatGPT on 66 over 80 tested queries with ChatGPT as an evaluator. To address this issue, we propose a calibration framework with three simple yet effective strategies.

What carries the argument

Position bias in LLM-as-judge prompts, mitigated by a calibration framework that combines multiple evidence generation, aggregation over varied response orders, and entropy-based selection for human review.
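
The first of these strategies can be sketched as a prompt that demands evidence before scores; the template below is illustrative (our wording and parameter names, not the paper's actual prompt):

```python
# Illustrative sketch of Multiple Evidence Calibration: the judge model is
# asked to write k independent analyses BEFORE committing to any scores,
# so the final ratings are conditioned on explicit evidence.
# Template text and function names are hypothetical, not from the paper.
EVIDENCE_FIRST_TEMPLATE = """\
Question: {question}

Response A: {response_a}

Response B: {response_b}

First write {k} independent analyses comparing the two responses on
accuracy, helpfulness, and relevance. Only after all analyses, output
two final scores from 1 to 10 as 'A: <score>' and 'B: <score>'."""

def build_evidence_first_prompt(question, response_a, response_b, k=3):
    """Fill the template; the judge model's completion is parsed downstream."""
    return EVIDENCE_FIRST_TEMPLATE.format(
        question=question, response_a=response_a, response_b=response_b, k=k
    )
```

Repeating this with several analyses per call is what the "multiple evidence" bullet below refers to: the spread of the resulting scores is itself a signal of judgment stability.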

If this is right

  • Generating multiple pieces of evaluation evidence before assigning a final rating increases consistency across repeated judgments.
  • Averaging scores obtained from many different response orderings reduces the influence of any single presentation sequence.
  • Entropy calculated over position-diverse judgments identifies examples where LLM and human decisions are likely to diverge, allowing targeted human intervention.
  • The resulting calibrated scores show higher agreement with human win/tie/lose labels than uncalibrated LLM outputs.
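
The ordering-average step admits a compact sketch, assuming a hypothetical `judge(question, first, second)` callable that returns two scores on a 1-10 scale (this is our stand-in, not the authors' implementation):

```python
def balanced_position_scores(judge, question, resp_a, resp_b):
    """Balanced Position Calibration (sketch): query the judge under both
    orderings of the two candidates and average each candidate's scores,
    so neither presentation position systematically benefits."""
    s1_a, s1_b = judge(question, resp_a, resp_b)   # A shown first
    s2_b, s2_a = judge(question, resp_b, resp_a)   # B shown first
    return (s1_a + s2_a) / 2, (s1_b + s2_b) / 2

def verdict(score_a, score_b):
    """Map averaged scores to a win/tie/lose label for candidate A."""
    if score_a == score_b:
        return "tie"
    return "win" if score_a > score_b else "lose"
```

With a judge that, say, adds a constant bonus to whichever answer appears first, the two orderings cancel and equal-quality answers come out tied.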

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evaluation pipelines that rely on a single fixed ordering of candidates should be revised to include explicit randomization or multi-order averaging by default.
  • The same ordering sensitivity may appear in other LLM judgment tasks such as pairwise preference collection or automatic grading of open-ended answers.
  • Public release of the human annotations creates a reusable test set for measuring progress on bias-resistant evaluators.
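
The first extension above, randomization by default, amounts to little more than a seeded shuffle per query; the function name and seeding scheme here are ours, not from the paper:

```python
import random

def randomized_order(candidates, query_id, seed=0):
    """Sketch of 'randomize presentation order by default': derive a
    per-query ordering from a string-seeded RNG, so the order is
    reproducible across runs but not fixed across the benchmark."""
    rng = random.Random(f"{seed}:{query_id}")
    order = list(candidates)
    rng.shuffle(order)
    return order
```

A pipeline that logs `query_id` and `seed` alongside each judgment can then reproduce any ordering exactly when auditing for position effects.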

Load-bearing premise

Human win/tie/lose annotations on the Vicuna benchmark questions supply a stable reference standard for measuring and correcting bias in LLM judgments.

What would settle it

Re-evaluating the same 80 queries after applying balanced position calibration across all presentation orders and checking whether the number of cases in which Vicuna-13B beats ChatGPT drops substantially below 66.
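
That check can be sketched as a counting loop, with `judge` and `dataset` as stand-ins for the evaluator model and the 80 query triples (our names, not the paper's):

```python
def count_wins(judge, dataset, calibrated):
    """Count queries where candidate A beats candidate B, either under a
    single fixed presentation order or under balanced position calibration.
    `dataset` holds (question, resp_a, resp_b) triples; `judge` returns
    (score_first, score_second) for the two responses as presented."""
    wins = 0
    for question, resp_a, resp_b in dataset:
        if calibrated:
            s1a, s1b = judge(question, resp_a, resp_b)   # A shown first
            s2b, s2a = judge(question, resp_b, resp_a)   # B shown first
            a, b = (s1a + s2a) / 2, (s1b + s2b) / 2
        else:
            a, b = judge(question, resp_a, resp_b)       # A always first
        wins += int(a > b)
    return wins
```

If the 66-of-80 figure is mostly position bias, the calibrated count should fall toward the human-annotated win rate.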

read the original abstract

In this paper, we uncover a systematic bias in the evaluation paradigm of adopting large language models~(LLMs), e.g., GPT-4, as a referee to score and compare the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna-13B could beat ChatGPT on 66 over 80 tested queries with ChatGPT as an evaluator. To address this issue, we propose a calibration framework with three simple yet effective strategies: 1) Multiple Evidence Calibration, which requires the evaluator model to generate multiple evaluation evidence before assigning ratings; 2) Balanced Position Calibration, which aggregates results across various orders to determine the final score; 3) Human-in-the-Loop Calibration, which introduces a balanced position diversity entropy to measure the difficulty of each example and seeks human assistance when needed. We also manually annotate the "win/tie/lose" outcomes of responses from ChatGPT and Vicuna-13B in the Vicuna Benchmark's question prompt, and extensive experiments demonstrate that our approach successfully mitigates evaluation bias, resulting in closer alignment with human judgments. We release our code and human annotation at \url{https://github.com/i-Eval/FairEval} to facilitate future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs used as evaluators (e.g., GPT-4) exhibit systematic position bias: simply swapping the order of two candidate responses in the prompt can reverse win rates, allowing e.g. Vicuna-13B to beat ChatGPT on 66 of 80 Vicuna-benchmark queries. The authors propose three calibration strategies—Multiple Evidence Calibration, Balanced Position Calibration, and Human-in-the-Loop Calibration using position-diversity entropy—and validate them by manually collecting win/tie/lose labels on the same 80 queries, reporting improved agreement with these human labels after calibration.

Significance. If the order-bias effect and the effectiveness of the three calibrations hold, the result is significant for the growing literature that uses LLM-as-judge protocols; the public release of code and the human annotations is a clear strength that supports reproducibility and follow-up work.

major comments (2)
  1. [Human annotation protocol (described in the experiments section)] The central validation rests on the manually collected human win/tie/lose labels, yet the manuscript supplies no inter-annotator agreement statistic (Fleiss’ or Cohen’s kappa), no exact annotation instructions, no count of annotators per item, and no disagreement-resolution procedure. Because the claim that the calibrations “result in closer alignment with human judgments” is only as strong as the stability of those labels, this omission is load-bearing for the empirical conclusion.
  2. [§3 and the main results table] The reported 66/80 reversal under order swap is presented as evidence of a general, easily exploitable bias, but the paper does not test whether the magnitude of the effect is stable across different evaluator models, prompt templates, or response-length distributions; without such checks the claim that “the quality ranking … can be easily hacked” risks over-generalization from a single pair and single evaluator.
minor comments (2)
  1. [Abstract] The abstract states that the three strategies are “simple yet effective” but does not preview the quantitative improvement (e.g., change in agreement rate or win-rate delta) that would let readers immediately gauge the practical gain.
  2. [§3.3] Notation for the position-diversity entropy in the Human-in-the-Loop strategy should be defined explicitly with a formula rather than described only in prose.
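
One plausible formalization of that entropy, assuming verdicts from the different orderings are pooled into an empirical win/tie/lose distribution (our notation, not necessarily the paper's):

```python
import math
from collections import Counter

def position_diversity_entropy(verdicts):
    """Hypothetical formalization of balanced position diversity entropy:
    pool the win/tie/lose verdicts produced under different response
    orderings and compute the Shannon entropy (in bits) of their empirical
    distribution. High entropy means the judge flips with position, so the
    example is a candidate for human review."""
    counts = Counter(verdicts)
    n = len(verdicts)
    return -sum((c / n) * math.log(c / n, 2) for c in counts.values())
```

A threshold on this quantity then routes only the high-entropy (position-unstable) examples to annotators.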

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects for improving the rigor and clarity of our work. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Human annotation protocol (described in the experiments section)] The central validation rests on the manually collected human win/tie/lose labels, yet the manuscript supplies no inter-annotator agreement statistic (Fleiss’ or Cohen’s kappa), no exact annotation instructions, no count of annotators per item, and no disagreement-resolution procedure. Because the claim that the calibrations “result in closer alignment with human judgments” is only as strong as the stability of those labels, this omission is load-bearing for the empirical conclusion.

    Authors: We agree that the human annotation details are essential for assessing label reliability. The annotations were performed by three independent annotators per query using written instructions focused on helpfulness, accuracy, and relevance to determine win/tie/lose. Disagreements were resolved via majority vote. We will add a new subsection detailing the full protocol, including the exact instructions (in an appendix), annotator count, and inter-annotator agreement (Fleiss’ kappa). This will be incorporated in the revised manuscript to strengthen the empirical claims. revision: yes

  2. Referee: [§3 and the main results table] The reported 66/80 reversal under order swap is presented as evidence of a general, easily exploitable bias, but the paper does not test whether the magnitude of the effect is stable across different evaluator models, prompt templates, or response-length distributions; without such checks the claim that “the quality ranking … can be easily hacked” risks over-generalization from a single pair and single evaluator.

    Authors: We acknowledge the value of broader testing for generalizability. Our core demonstration uses a standard setup with GPT-4 on the Vicuna benchmark to illustrate the systematic position bias and its exploitability. The calibration strategies are model-agnostic by design. In revision we will add a limitations discussion noting that exact magnitudes may vary with other evaluators or templates, and include a small-scale check on one additional model where feasible. The central finding of bias in this widely adopted evaluation paradigm remains valid and actionable. revision: partial

Circularity Check

0 steps flagged

No significant circularity; validation uses independent human annotations

full rationale

The paper's core claims rest on empirical demonstrations of position bias in LLM evaluators and the effectiveness of three calibration strategies, all validated by direct comparison to separately collected human win/tie/lose labels on the Vicuna benchmark. These human annotations constitute an external benchmark rather than a quantity derived from the LLM outputs or the proposed calibrations themselves. No equations, fitted parameters, or self-citations are shown to reduce the reported improvements to the paper's own inputs by construction. Minor self-citation of prior LLM evaluation work may exist but is not load-bearing for the central empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is almost entirely empirical and relies on standard assumptions about human judgment reliability rather than new mathematical axioms or fitted parameters.

axioms (1)
  • domain assumption: Human win/tie/lose annotations collected on the Vicuna benchmark questions form a reliable reference standard for response quality.
    Used to measure how well the calibrated LLM judgments align with human preferences.

pith-pipeline@v0.9.0 · 5582 in / 1242 out tokens · 45796 ms · 2026-05-17T12:05:24.426089+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice

    cs.LG 2026-05 unverdicted novelty 7.0

    An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.

  2. RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

    cs.CL 2026-05 unverdicted novelty 6.0

    RLearner-LLM achieves up to 6x gains in NLI entailment over standard fine-tuning by using an automated hybrid DPO pipeline that balances logic and fluency across multiple model sizes and domains.

  3. RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

    cs.CL 2026-05 unverdicted novelty 6.0

    RLearner-LLM's Hybrid-DPO fuses DeBERTa NLI and LLM verifier scores to deliver up to 6x higher NLI entailment than standard SFT while preserving answer coverage across academic domains.

  4. PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.

  5. ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks

    cs.AI 2026-04 unverdicted novelty 6.0

    ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing ranki...

  6. VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    cs.CV 2025-04 unverdicted novelty 6.0

    VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.

  7. Training Language Models to Self-Correct via Reinforcement Learning

    cs.LG 2024-09 unverdicted novelty 6.0

    SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

  8. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

    cs.AI 2023-12 conditional novelty 6.0

    Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.

  9. The Falcon Series of Open Language Models

    cs.CL 2023-11 conditional novelty 6.0

    Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

  10. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

    cs.CL 2023-09 conditional novelty 6.0

    RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.

  11. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    cs.CL 2023-06 accept novelty 6.0

    GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.

  12. RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

    cs.CL 2026-05 unverdicted novelty 5.0

    Hybrid-DPO combining NLI and verifier scores delivers up to 6x NLI improvement over SFT baselines across multiple LLMs and domains while preserving answer coverage and inference speed.

  13. U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning

    cs.AI 2026-05 unverdicted novelty 5.0

    U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.

  14. Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

    cs.AI 2026-04 unverdicted novelty 5.0

    Style bias dominates LLM-as-a-Judge systems far more than position bias, with debiasing strategies providing model-dependent gains and public tools released for replication.

  15. Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 5.0

    APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.

  16. Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.

  17. Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

    cs.CL 2025-12 unverdicted novelty 5.0

    LLM-PeerReview ensembles LLMs by scoring responses with LLM-as-Judge and selecting the best via averaging or truth inference, beating Smoothie-Global by 6.9-7.3 points on four datasets.

  18. Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

    cs.CL 2023-05 conditional novelty 5.0

    Multi-agent debate with tit-for-tat arguments and a judge LLM improves reasoning by preventing LLMs from locking into incorrect initial solutions.

  19. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    cs.CL 2024-12 accept novelty 3.0

    A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 17 Pith papers · 10 internal anchors

  1. [1]

    Belinkov, Y.; Poliak, A.; Shieber, S.; Van Durme, B.; and Rush, A. 2019. Don't Take the Premise for Granted: Mitigating Artifacts in Natural Language Inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL)

  2. [3]

    Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford...

  3. [5]

    Cai, Z.; Tu, L.; and Gimpel, K. 2017. Pay Attention to the Ending: Strong Neural Baselines for the ROC Story Cloze Task. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL)

  4. [6]

    PaLM: Scaling Language Modeling with Pathways

    Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H. W.; Sutton, C.; Gehrmann, S.; Schuh, P.; Shi, K.; Tsvyashchenko, S.; Maynez, J.; Rao, A.; Barnes, P.; Tay, Y.; Shazeer, N. M.; Prabhakaran, V.; Reif, E.; Du, N.; Hutchinson, B. C.; Pope, R.; Bradbury, J.; Austin, J.; Isard, M.; Gur-Ari, G.; Yin, P.; Duke, T.; ...

  5. [9]

    Gao, P.; Han, J.; Zhang, R.; Lin, Z.; Geng, S.; Zhou, A.; Zhang, W.; Lu, P.; He, C.; Yue, X.; Li, H.; and Qiao, Y. J. 2023. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model. ArXiv, abs/2304.15010

  6. [10]

    Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; and Parikh, D. 2017. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  7. [11]

    Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S.; and Smith, N. A. 2018. Annotation Artifacts in Natural Language Inference Data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics

  8. [12]

    He, T.; Zhang, J.; Wang, T.; Kumar, S.; Cho, K.; Glass, J.; and Tsvetkov, Y. 2023. On the Blind Spots of Model-Based Evaluation Metrics for Text Generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 12067--12097. Toronto, Canada: Association for Computational Linguistics

  9. [13]

    Levy, O.; Remus, S.; Biemann, C.; and Dagan, I. 2015. Do Supervised Distributional Methods Really Learn Lexical Inference Relations? In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics

  10. [15]

    Lin, C.-Y. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, 74--81. Barcelona, Spain: Association for Computational Linguistics

  11. [16]

    Liu, T.; Xin, Z.; Chang, B.; and Sui, Z. 2020a. HypoNLI: Exploring the Artificial Patterns of Hypothesis-only Bias in Natural Language Inference. In Proceedings of the 12th Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association. ISBN 979-10-95546-34-4

  12. [17]

    Liu, T.; Xin, Z.; Ding, X.; Chang, B.; and Sui, Z. 2020b. An Empirical Study on Model-agnostic Debiasing Strategies for Robust Natural Language Inference. In Proceedings of the 24th Conference on Computational Natural Language Learning. Online: Association for Computational Linguistics

  13. [19]

    McCoy, T.; Pavlick, E.; and Linzen, T. 2019. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL)

  14. [20]

    McHugh, M. L. 2012. Interrater reliability: the kappa statistic. Biochemia medica, 22(3): 276--282

  15. [21]

    Min, S.; Wallace, E.; Singh, S.; Gardner, M.; Hajishirzi, H.; and Zettlemoyer, L. 2019. Compositional Questions Do Not Necessitate Multi-hop Reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL)

  16. [22]

    OpenAI. 2022. Introducing ChatGPT

  17. [23]

    OpenAI. 2023. GPT-4 Technical Report. CoRR, abs/2303.08774

  18. [24]

    Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311--318. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics

  19. [25]

    Peng, B.; Li, C.; He, P.; Galley, M.; and Gao, J. 2023. Instruction Tuning with GPT-4. ArXiv, abs/2304.03277

  20. [26]

    Schwartz, R.; Sap, M.; Konstas, I.; Zilles, L.; Choi, Y.; and Smith, N. A. 2017. The Effect of Different Writing Tasks on Linguistic Style: A Case Study of the ROC Story Cloze Task. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL)

  21. [29]

    Sun, Z.; Shen, Y.; Zhou, Q.; Zhang, H.; Chen, Z.; Cox, D. D.; Yang, Y.; and Gan, C. 2023. Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision. ArXiv, abs/2305.03047

  22. [30]

    Turpin, M.; Michael, J.; Perez, E.; and Bowman, S. R. 2023. Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. CoRR, abs/2305.04388

  23. [31]

    Wang, P.; Song, Y.; Liu, T.; Lin, B.; Cao, Y.; Li, S.; and Sui, Z. 2022. Learning Robust Representations for Continual Relation Extraction via Adversarial Class Augmentation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 6264--6278. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics

  24. [32]

    Wang, P.; Xun, R.; Liu, T.; Dai, D.; Chang, B.; and Sui, Z. 2021. Behind the Scenes: An Exploration of Trigger Biases Problem in Few-Shot Event Classification. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 1969--1978

  25. [35]

    Xia, H.; Wang, P.; Liu, T.; Lin, B.; Cao, Y.; and Sui, Z. 2023. Enhancing Continual Relation Extraction via Classifier Decomposition. In Findings of the Association for Computational Linguistics: ACL 2023, 10053--10062. Toronto, Canada: Association for Computational Linguistics

  26. [36]

    Xu, C.; Guo, D.; Duan, N.; and McAuley, J. 2023. Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data. ArXiv, abs/2304.01196

  27. [37]

    Yuan, W.; Neubig, G.; and Liu, P. 2021. BARTScore: Evaluating Generated Text as Text Generation. In Ranzato, M.; Beygelzimer, A.; Dauphin, Y. N.; Liang, P.; and Vaughan, J. W., eds., Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, 27263--27277

  28. [38]

    Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi, Y. 2020. BERTScore: Evaluating Text Generation with BERT. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net

  29. [40]

    Zhou, C.; Liu, P.; Xu, P.; Iyer, S.; Sun, J.; Mao, Y.; Ma, X.; Efrat, A.; Yu, P.; Yu, L.; Zhang, S.; Ghosh, G.; Lewis, M.; Zettlemoyer, L.; and Levy, O. 2023. LIMA: Less Is More for Alignment

  30. [41]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena , author=. arXiv preprint arXiv:2306.05685 , year=

  31. [42]

    ArXiv , year=

    Instruction Tuning with GPT-4 , author=. ArXiv , year=

  32. [43]

    ArXiv , year=

    Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision , author=. ArXiv , year=

  33. [44]

    ArXiv , year=

    Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data , author=. ArXiv , year=

  34. [45]

    LIMA: Less Is More for Alignment , author=

  35. [46]

    ArXiv , year=

    PaLM: Scaling Language Modeling with Pathways , author=. ArXiv , year=

  36. [47]

    ArXiv , year=

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model , author=. ArXiv , year=

  37. [48]

    Tom B. Brown and Benjamin Mann and Nick Ryder and Melanie Subbiah and Jared Kaplan and Prafulla Dhariwal and Arvind Neelakantan and Pranav Shyam and Girish Sastry and Amanda Askell and Sandhini Agarwal and Ariel Herbert. Language Models are Few-Shot Learners , booktitle =. 2020 , url =

  38. [49]

    B leu: a Method for Automatic Evaluation of Machine Translation

    Papineni, Kishore and Roukos, Salim and Ward, Todd and Zhu, Wei-Jing. B leu: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002. doi:10.3115/1073083.1073135

  39. [50]

    ROUGE : A Package for Automatic Evaluation of Summaries

    Lin, Chin-Yew. ROUGE : A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. 2004

  40. [51]

    Weinberger and Yoav Artzi , title =

    Tianyi Zhang and Varsha Kishore and Felix Wu and Kilian Q. Weinberger and Yoav Artzi , title =. 8th International Conference on Learning Representations,. 2020 , url =

  41. [52]

    BARTScore: Evaluating Generated Text as Text Generation , booktitle =

    Weizhe Yuan and Graham Neubig and Pengfei Liu , editor =. BARTScore: Evaluating Generated Text as Text Generation , booktitle =. 2021 , url =

  42. [53]

    GPT-4 Technical Report

    OpenAI , title =. CoRR , volume =. 2023 , url =. doi:10.48550/arXiv.2303.08774 , eprinttype =. 2303.08774 , timestamp =

  43. [54]

    2022 , url =

    OpenAI , title =. 2022 , url =

  44. [55]

    A Survey on In-context Learning

    A Survey for In-context Learning , author=. arXiv preprint arXiv:2301.00234 , year=

  45. [56]

    Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

    Miles Turpin and Julian Michael and Ethan Perez and Samuel R. Bowman , title =. CoRR , volume =. 2023 , url =. doi:10.48550/arXiv.2305.04388 , eprinttype =. 2305.04388 , timestamp =

  46. [57]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author=

  47. [58]

    arXiv preprint arXiv:2304.00612 , year=

    Eight things to know about large language models , author=. arXiv preprint arXiv:2304.00612 , year=

  48. [59]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Self-consistency improves chain of thought reasoning in language models , author=. arXiv preprint arXiv:2203.11171 , year=

  49. [60]

    arXiv preprint arXiv:2306.05087 , year=

    PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization , author=. arXiv preprint arXiv:2306.05087 , year=

  50. [61]

    GitHub repository , howpublished =

    Dai Juntao and Ji Jiaming and Pan Xuehai and SunRuiyang , title =. GitHub repository , howpublished =. 2023 , publisher =

  51. [62]

    Biochemia medica , volume=

    Interrater reliability: the kappa statistic , author=. Biochemia medica , volume=. 2012 , publisher=

  52. [63]

    arXiv preprint arXiv:2306.07932 , year=

    Human-in-the-Loop through Chain-of-Thought , author=. arXiv preprint arXiv:2306.07932 , year=

  53. [64]

    Proceedings of the 30th ACM International Conference on Information & Knowledge Management , pages=

    Behind the Scenes: An Exploration of Trigger Biases Problem in Few-Shot Event Classification , author=. Proceedings of the 30th ACM International Conference on Information & Knowledge Management , pages=

  54. [65]

    Annotation Artifacts in Natural Language Inference Data

    Gururangan, Suchin and Swayamdipta, Swabha and Levy, Omer and Schwartz, Roy and Bowman, Samuel and Smith, Noah A. Annotation Artifacts in Natural Language Inference Data. Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics. 2018

  55. [66]

    Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

    McCoy, Tom and Pavlick, Ellie and Linzen, Tal. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). 2019

  56. [67]

    Don't Take the Premise for Granted: Mitigating Artifacts in Natural Language Inference

    Belinkov, Yonatan and Poliak, Adam and Shieber, Stuart and Van Durme, Benjamin and Rush, Alexander. Don't Take the Premise for Granted: Mitigating Artifacts in Natural Language Inference. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). 2019

  57. [68]

    HypoNLI: Exploring the Artificial Patterns of Hypothesis-only Bias in Natural Language Inference

    Liu, Tianyu and Xin, Zheng and Chang, Baobao and Sui, Zhifang. HypoNLI: Exploring the Artificial Patterns of Hypothesis-only Bias in Natural Language Inference. Proceedings of the 12th Language Resources and Evaluation Conference. 2020

  58. [69]

    An Empirical Study on Model-agnostic Debiasing Strategies for Robust Natural Language Inference

    Liu, Tianyu and Xin, Zheng and Ding, Xiaoan and Chang, Baobao and Sui, Zhifang. An Empirical Study on Model-agnostic Debiasing Strategies for Robust Natural Language Inference. Proceedings of the 24th Conference on Computational Natural Language Learning. 2020

  59. [70]

    Compositional Questions Do Not Necessitate Multi-hop Reasoning

    Min, Sewon and Wallace, Eric and Singh, Sameer and Gardner, Matt and Hajishirzi, Hannaneh and Zettlemoyer, Luke. Compositional Questions Do Not Necessitate Multi-hop Reasoning. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). 2019

  60. [71]

    Pay Attention to the Ending: Strong Neural Baselines for the ROC Story Cloze Task

    Cai, Zheng and Tu, Lifu and Gimpel, Kevin. Pay Attention to the Ending: Strong Neural Baselines for the ROC Story Cloze Task. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL). 2017

  61. [72]

    The Effect of Different Writing Tasks on Linguistic Style: A Case Study of the ROC Story Cloze Task

    Schwartz, Roy and Sap, Maarten and Konstas, Ioannis and Zilles, Leila and Choi, Yejin and Smith, Noah A. The Effect of Different Writing Tasks on Linguistic Style: A Case Study of the ROC Story Cloze Task. Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL). 2017

  62. [73]

    Do Supervised Distributional Methods Really Learn Lexical Inference Relations?

    Levy, Omer and Remus, Steffen and Biemann, Chris and Dagan, Ido. Do Supervised Distributional Methods Really Learn Lexical Inference Relations?. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics. 2015

  63. [74]

    Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

    Goyal, Yash and Khot, Tejas and Summers-Stay, Douglas. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. 2017

  64. [75]

    Error analysis prompting enables human-like translation evaluation in large language models: A case study on ChatGPT

    Error analysis prompting enables human-like translation evaluation in large language models: A case study on ChatGPT. arXiv preprint arXiv:2303.13809. 2023

  65. [76]

    On the Blind Spots of Model-Based Evaluation Metrics for Text Generation

    He, Tianxing and Zhang, Jingyu and Wang, Tianle and Kumar, Sachin and Cho, Kyunghyun and Glass, James and Tsvetkov, Yulia. On the Blind Spots of Model-Based Evaluation Metrics for Text Generation. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.674

  66. [77]

    RestGPT: Connecting Large Language Models with Real-World Applications via RESTful APIs

    RestGPT: Connecting Large Language Models with Real-World Applications via RESTful APIs. arXiv preprint arXiv:2306.06624. 2023

  67. [78]

    AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback

    AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. arXiv preprint arXiv:2305.14387. 2023

  68. [79]

    How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources

    How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources. arXiv preprint arXiv:2306.04751. 2023

  69. [80]

    Learning Robust Representations for Continual Relation Extraction via Adversarial Class Augmentation

    Wang, Peiyi and Song, Yifan and Liu, Tianyu and Lin, Binghuai and Cao, Yunbo and Li, Sujian and Sui, Zhifang. Learning Robust Representations for Continual Relation Extraction via Adversarial Class Augmentation. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. doi:10.18653/v1/2022.emnlp-main.420

  70. [81]

    RepCL: Exploring Effective Representation for Continual Text Classification

    RepCL: Exploring Effective Representation for Continual Text Classification. arXiv preprint arXiv:2305.07289. 2023

  71. [82]

    Enhancing Continual Relation Extraction via Classifier Decomposition

    Xia, Heming and Wang, Peiyi and Liu, Tianyu and Lin, Binghuai and Cao, Yunbo and Sui, Zhifang. Enhancing Continual Relation Extraction via Classifier Decomposition. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.638

  72. [83]

    M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning

    M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning. arXiv preprint arXiv:2306.04387. 2023