Recognition: 2 Lean theorem links
Large Language Models are not Fair Evaluators
Pith reviewed 2026-05-17 12:05 UTC · model grok-4.3
The pith
Large language models used as evaluators favor responses according to their order in the prompt.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We uncover a systematic bias in the evaluation paradigm of adopting large language models (LLMs), e.g., GPT-4, as a referee to score and compare the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna-13B could beat ChatGPT on 66 over 80 tested queries with ChatGPT as an evaluator. To address this issue, we propose a calibration framework with three simple yet effective strategies.
What carries the argument
Position bias in LLM-as-judge prompts, mitigated by a calibration framework that combines multiple evidence generation, aggregation over varied response orders, and entropy-based selection for human review.
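As a rough illustration of the first two strategies, the sketch below samples several evidence-then-score judgments per ordering and averages the scores over both response orders. The prompt template, the `call_judge` interface, and the score parsing are hypothetical stand-ins, not the paper's released FairEval implementation.

```python
import re
import statistics
from typing import Callable, Tuple

# Hypothetical judge interface: takes a prompt, returns the judge model's text output.
JudgeFn = Callable[[str], str]

EVAL_TEMPLATE = """Question: {question}

Response A: {resp_a}

Response B: {resp_b}

First write your evaluation evidence, then give two 1-10 scores on a
final line formatted exactly as: Scores: <score for A> <score for B>"""


def parse_scores(judge_output: str) -> Tuple[float, float]:
    """Pull the two numeric scores from the judge's final 'Scores:' line."""
    match = re.search(r"Scores:\s*(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)", judge_output)
    if match is None:
        raise ValueError("Judge output did not contain a parsable score line.")
    return float(match.group(1)), float(match.group(2))


def calibrated_scores(question: str, resp_1: str, resp_2: str,
                      call_judge: JudgeFn, k: int = 3) -> Tuple[float, float]:
    """Multiple Evidence Calibration (k evidence-then-score samples per ordering)
    combined with Balanced Position Calibration (average over both orderings)."""
    scores_1, scores_2 = [], []
    for resp_a, resp_b, a_is_resp_1 in ((resp_1, resp_2, True), (resp_2, resp_1, False)):
        for _ in range(k):  # independent evidence generations for this ordering
            output = call_judge(EVAL_TEMPLATE.format(
                question=question, resp_a=resp_a, resp_b=resp_b))
            score_a, score_b = parse_scores(output)
            if a_is_resp_1:
                scores_1.append(score_a)
                scores_2.append(score_b)
            else:
                scores_1.append(score_b)
                scores_2.append(score_a)
    # Averaging over both orderings removes the advantage of any fixed position.
    return statistics.mean(scores_1), statistics.mean(scores_2)
```

Whether the judge is sampled with temperature, how many repetitions are used, and the exact prompt wording all matter in practice; the load-bearing step is that no downstream score depends on a single presentation order.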
If this is right
- Generating multiple pieces of evaluation evidence before assigning a final rating increases consistency across repeated judgments.
- Averaging scores obtained from many different response orderings reduces the influence of any single presentation sequence.
- Entropy calculated over position-diverse judgments identifies examples where LLM and human decisions are likely to diverge, allowing targeted human intervention (a code sketch follows this list).
- The resulting calibrated scores show higher agreement with human win/tie/lose labels than uncalibrated LLM outputs.
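A minimal sketch of the entropy-based routing idea in the third bullet above, assuming win/tie/lose verdicts have already been collected under several position-diverse judgments per example. The function name `bpde`, the log base, and the threshold are illustrative choices rather than the paper's exact definition.

```python
import math
from collections import Counter
from typing import List


def bpde(outcomes: List[str]) -> float:
    """Entropy over win/tie/lose outcomes gathered from position-diverse
    judgments of the same example (higher = less stable, harder example)."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    entropy = 0.0
    for label in ("win", "tie", "lose"):
        p = counts.get(label, 0) / total
        if p > 0:
            entropy -= p * math.log(p, 3)  # log base 3 so the maximum is 1.0
    return entropy


def route_to_humans(per_example_outcomes: List[List[str]],
                    threshold: float = 0.8) -> List[int]:
    """Return indices of examples whose judgments disagree enough across
    orderings that a human annotator should adjudicate them."""
    return [i for i, outs in enumerate(per_example_outcomes)
            if bpde(outs) >= threshold]


# Example: three judgments per example, taken under different response orders.
judgments = [
    ["win", "win", "win"],   # stable -> keep the LLM verdict
    ["win", "lose", "tie"],  # unstable -> send to a human
]
print(route_to_humans(judgments))  # [1]
```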
Where Pith is reading between the lines
- Evaluation pipelines that rely on a single fixed ordering of candidates should be revised to include explicit randomization or multi-order averaging by default.
- The same ordering sensitivity may appear in other LLM judgment tasks such as pairwise preference collection or automatic grading of open-ended answers.
- Public release of the human annotations creates a reusable test set for measuring progress on bias-resistant evaluators.
Load-bearing premise
Human win/tie/lose annotations on the Vicuna benchmark questions supply a stable reference standard for measuring and correcting bias in LLM judgments.
What would settle it
Re-evaluating the same 80 queries after applying balanced position calibration across all presentation orders and checking whether the number of cases in which Vicuna-13B beats ChatGPT drops substantially below 66.
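In outline, that check could look like the tally below: judge each query under both response orders, average the scores per response, and count how many calibrated comparisons still favor Vicuna-13B. The `judge` callable and the record layout are assumptions, and tie handling is omitted.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]  # expects keys: "question", "vicuna", "chatgpt"
# (question, response_a, response_b) -> (score_a, score_b) from the LLM judge.
PairScoreFn = Callable[[str, str, str], Tuple[float, float]]


def count_vicuna_wins(records: List[Record], judge: PairScoreFn) -> int:
    """Re-run the comparison with Balanced Position Calibration and count
    how many queries still come out in Vicuna-13B's favor."""
    wins = 0
    for r in records:
        v1, c1 = judge(r["question"], r["vicuna"], r["chatgpt"])   # Vicuna shown first
        c2, v2 = judge(r["question"], r["chatgpt"], r["vicuna"])   # ChatGPT shown first
        if mean([v1, v2]) > mean([c1, c2]):
            wins += 1
    return wins
```

If position bias drives most of the original 66/80 figure, this calibrated count should fall well below 66.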
original abstract
In this paper, we uncover a systematic bias in the evaluation paradigm of adopting large language models (LLMs), e.g., GPT-4, as a referee to score and compare the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna-13B could beat ChatGPT on 66 over 80 tested queries with ChatGPT as an evaluator. To address this issue, we propose a calibration framework with three simple yet effective strategies: 1) Multiple Evidence Calibration, which requires the evaluator model to generate multiple evaluation evidence before assigning ratings; 2) Balanced Position Calibration, which aggregates results across various orders to determine the final score; 3) Human-in-the-Loop Calibration, which introduces a balanced position diversity entropy to measure the difficulty of each example and seeks human assistance when needed. We also manually annotate the "win/tie/lose" outcomes of responses from ChatGPT and Vicuna-13B in the Vicuna Benchmark's question prompt, and extensive experiments demonstrate that our approach successfully mitigates evaluation bias, resulting in closer alignment with human judgments. We release our code and human annotation at https://github.com/i-Eval/FairEval to facilitate future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs used as evaluators (e.g., GPT-4) exhibit systematic position bias: simply swapping the order of two candidate responses in the prompt can reverse win rates, allowing e.g. Vicuna-13B to beat ChatGPT on 66 of 80 Vicuna-benchmark queries. The authors propose three calibration strategies—Multiple Evidence Calibration, Balanced Position Calibration, and Human-in-the-Loop Calibration using position-diversity entropy—and validate them by manually collecting win/tie/lose labels on the same 80 queries, reporting improved agreement with these human labels after calibration.
Significance. If the order-bias effect and the effectiveness of the three calibrations hold, the result is significant for the growing literature that uses LLM-as-judge protocols; the public release of code and the human annotations is a clear strength that supports reproducibility and follow-up work.
major comments (2)
- [Human annotation protocol (described in the experiments section)] The central validation rests on the manually collected human win/tie/lose labels, yet the manuscript supplies no inter-annotator agreement statistic (Fleiss’ or Cohen’s kappa), no exact annotation instructions, no count of annotators per item, and no disagreement-resolution procedure. Because the claim that the calibrations “result in closer alignment with human judgments” is only as strong as the stability of those labels, this omission is load-bearing for the empirical conclusion.
- [§3 and the main results table] The reported 66/80 reversal under order swap is presented as evidence of a general, easily exploitable bias, but the paper does not test whether the magnitude of the effect is stable across different evaluator models, prompt templates, or response-length distributions; without such checks the claim that “the quality ranking … can be easily hacked” risks over-generalization from a single pair and single evaluator.
minor comments (2)
- [Abstract] The abstract states that the three strategies are “simple yet effective” but does not preview the quantitative improvement (e.g., change in agreement rate or win-rate delta) that would let readers immediately gauge the practical gain.
- [§3.3] Notation for the position-diversity entropy in the Human-in-the-Loop strategy should be defined explicitly with a formula rather than described only in prose; one plausible form is sketched below.
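One plausible way to write such an entropy explicitly, over the win/tie/lose verdicts collected across the balanced set of response orderings; the paper's actual formula may differ in normalization or in the outcome set.

```latex
% Balanced position diversity entropy (BPDE) for a single example, where the
% p_y are outcome frequencies across the position-diverse judgments.
\mathrm{BPDE} \;=\; -\sum_{y \,\in\, \{\mathrm{win},\,\mathrm{tie},\,\mathrm{lose}\}} p_y \log p_y,
\qquad
p_y \;=\; \frac{\#\{\text{judgments with outcome } y\}}{\#\{\text{judgments}\}}.
```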
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects for improving the rigor and clarity of our work. We address each major comment below and will revise the manuscript accordingly.
point-by-point responses
Referee: [Human annotation protocol (described in the experiments section)] The central validation rests on the manually collected human win/tie/lose labels, yet the manuscript supplies no inter-annotator agreement statistic (Fleiss’ or Cohen’s kappa), no exact annotation instructions, no count of annotators per item, and no disagreement-resolution procedure. Because the claim that the calibrations “result in closer alignment with human judgments” is only as strong as the stability of those labels, this omission is load-bearing for the empirical conclusion.
Authors: We agree that the human annotation details are essential for assessing label reliability. The annotations were performed by three independent annotators per query using written instructions focused on helpfulness, accuracy, and relevance to determine win/tie/lose. Disagreements were resolved via majority vote. We will add a new subsection detailing the full protocol, including the exact instructions (in an appendix), annotator count, and inter-annotator agreement (Fleiss’ kappa). This will be incorporated in the revised manuscript to strengthen the empirical claims. revision: yes
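For concreteness, a standard Fleiss' kappa computation over per-item win/tie/lose labels could look like the sketch below; the toy data and the function itself are illustrative, not the authors' annotation tooling.

```python
from collections import Counter
from typing import List


def fleiss_kappa(labels_per_item: List[List[str]]) -> float:
    """Fleiss' kappa for categorical labels; each inner list holds the labels
    that the annotators (same count per item) gave one item."""
    categories = sorted({lab for item in labels_per_item for lab in item})
    n_items = len(labels_per_item)
    n_raters = len(labels_per_item[0])

    # n_ij: how many raters assigned category j to item i.
    counts = [[Counter(item)[c] for c in categories] for item in labels_per_item]

    # P_i: agreement within item i; P_bar: mean over items.
    p_i = [(sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items

    # P_e: chance agreement from overall category proportions.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)


# Toy example: 3 annotators labeling 4 comparisons.
print(round(fleiss_kappa([
    ["win", "win", "win"],
    ["tie", "tie", "win"],
    ["lose", "lose", "lose"],
    ["win", "tie", "win"],
]), 3))
```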
Referee: [§3 and the main results table] The reported 66/80 reversal under order swap is presented as evidence of a general, easily exploitable bias, but the paper does not test whether the magnitude of the effect is stable across different evaluator models, prompt templates, or response-length distributions; without such checks the claim that “the quality ranking … can be easily hacked” risks over-generalization from a single pair and single evaluator.
Authors: We acknowledge the value of broader testing for generalizability. Our core demonstration uses a standard setup with GPT-4 on the Vicuna benchmark to illustrate the systematic position bias and its exploitability. The calibration strategies are model-agnostic by design. In revision we will add a limitations discussion noting that exact magnitudes may vary with other evaluators or templates, and include a small-scale check on one additional model where feasible. The central finding of bias in this widely adopted evaluation paradigm remains valid and actionable. revision: partial
Circularity Check
No significant circularity; validation uses independent human annotations
full rationale
The paper's core claims rest on empirical demonstrations of position bias in LLM evaluators and the effectiveness of three calibration strategies, all validated by direct comparison to separately collected human win/tie/lose labels on the Vicuna benchmark. These human annotations constitute an external benchmark rather than a quantity derived from the LLM outputs or the proposed calibrations themselves. No equations, fitted parameters, or self-citations are shown to reduce the reported improvements to the paper's own inputs by construction. Minor self-citation of prior LLM evaluation work may exist but is not load-bearing for the central empirical result.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human win/tie/lose annotations collected on the Vicuna benchmark questions form a reliable reference standard for response quality.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context"
- IndisputableMonolith.Foundation.LawOfExistence.defect_zero_iff_one (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We also manually annotate the "win/tie/lose" outcomes of responses from ChatGPT and Vicuna-13B"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice
  An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.
- RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
  RLearner-LLM achieves up to 6x gains in NLI entailment over standard fine-tuning by using an automated hybrid DPO pipeline that balances logic and fluency across multiple model sizes and domains.
- RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
  RLearner-LLM's Hybrid-DPO fuses DeBERTa NLI and LLM verifier scores to deliver up to 6x higher NLI entailment than standard SFT while preserving answer coverage across academic domains.
- PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
  PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.
- ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks
  ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing ranki...
- VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
  VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
- Training Language Models to Self-Correct via Reinforcement Learning
  SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
- Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
  Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
- The Falcon Series of Open Language Models
  Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
- RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
  RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
  GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.
- RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
  Hybrid-DPO combining NLI and verifier scores delivers up to 6x NLI improvement over SFT baselines across multiple LLMs and domains while preserving answer coverage and inference speed.
- U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning
  U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.
- Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines
  Style bias dominates LLM-as-a-Judge systems far more than position bias, with debiasing strategies providing model-dependent gains and public tools released for replication.
- Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning
  APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
- Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
  FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
- Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process
  LLM-PeerReview ensembles LLMs by scoring responses with LLM-as-Judge and selecting the best via averaging or truth inference, beating Smoothie-Global by 6.9-7.3 points on four datasets.
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
  Multi-agent debate with tit-for-tat arguments and a judge LLM improves reasoning by preventing LLMs from locking into incorrect initial solutions.
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
  A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.