LLM-evaluation tropes: Perspectives on the validity of LLM-evaluations. arXiv preprint arXiv:2504.19076

All that’s ‘human’ is not gold: Evaluating human evaluation of generated text · 2011 · arXiv 2504.19076

2 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

cs.CL · 2026-04-08 · unverdicted · novelty 7.0

Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.

Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing

cs.CL · 2026-04-01 · unverdicted · novelty 4.0

A scoping review organizes decades of NLP evaluation debates into a taxonomy of recurring concerns and trade-offs with a structured checklist for better evaluation design.

citing papers explorer

Showing 2 of 2 citing papers.

Self-Preference Bias in Rubric-Based Evaluation of Large Language Models · cs.CL · 2026-04-08 · unverdicted · none · ref 4
Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.
Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing · cs.CL · 2026-04-01 · unverdicted · none · ref 7
A scoping review organizes decades of NLP evaluation debates into a taxonomy of recurring concerns and trade-offs with a structured checklist for better evaluation design.