Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
Title resolution pending
12 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
LLMs show significant biases in conflict event classification, with open-weight models exhibiting false illegitimation and adapted models showing actor bias and lexical sensitivity, making them unsuitable for unsupervised deployment.
TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
The paper proposes combining the Metadata Prior Dominance Score (MPDS) with an evidence-intervention statistic ΔEvi to audit whether weak-label benchmarks measure true evidence dependence rather than metadata shortcuts, illustrated with synthetic HotpotQA, SNLI, and FEVER examples.
LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on difficult examples.
Later LLM layers align better with human cognitive effort in syntactic ambiguity than early layers do, indicating dual processing modes and complementary benefits from multi-layer probability updates.
Social identity markers in medical questions degrade LLM accuracy and uncertainty calibration, producing a calibration crisis that is non-additive for intersectional cases.
Rigorous interpretability can function as a principled form of model evaluation if its claims are falsifiable, reproducible, and predictive.
LLM reaches >=0.95 accuracy on 60 number theory problems with optimal hints; LightGBM classifier empirically supports Dirichlet conductor conjecture via zero features at 93.9% test accuracy for small q.
A historical review of NLG evaluation practices from 1990 to 2026, noting the rise of experimental methods and predicting increased focus on impact, qualitative, and safety evaluation.
citing papers explorer
-
Measuring Massive Multitask Language Understanding
Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
-
Are LLMs Ready for Conflict Monitoring? Empirical Evidence from West Africa
LLMs show significant biases in conflict event classification, with open-weight models exhibiting false illegitimation and adapted models showing actor bias and lexical sensitivity, making them unsuitable for unsupervised deployment.
-
TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation
TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
-
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.
-
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
-
Metadata Predictability Is Not Evidence Dependence: An Intervention-Based Audit for Weak-Label Benchmarks
The paper proposes combining the Metadata Prior Dominance Score (MPDS) with an evidence-intervention statistic ΔEvi to audit whether weak-label benchmarks measure true evidence dependence rather than metadata shortcuts, illustrated with synthetic HotpotQA, SNLI, and FEVER examples.
-
LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling
LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on difficult examples.
-
Dual Alignment Between Language Model Layers and Human Sentence Processing
Later LLM layers align better with human cognitive effort in syntactic ambiguity than early layers do, indicating dual processing modes and complementary benefits from multi-layer probability updates.
-
Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA
Social identity markers in medical questions degrade LLM accuracy and uncertainty calibration, producing a calibration crisis that is non-additive for intersectional cases.
-
Rigorous Interpretation Is a Form of Evaluation
Rigorous interpretability can function as a principled form of model evaluation if its claims are falsifiable, reproducible, and predictive.
-
Artificial Intelligence in Number Theory: LLMs for Algorithm Generation and Ensemble Methods for Conjecture Verification
LLM reaches >=0.95 accuracy on 60 number theory problems with optimal hints; LightGBM classifier empirically supports Dirichlet conductor conjecture via zero features at 93.9% test accuracy for small q.
-
NLG Evaluation: Past, Present, Future
A historical review of NLG evaluation practices from 1990 to 2026, noting the rise of experimental methods and predicting increased focus on impact, qualitative, and safety evaluation.