Effort and ability appraisals match or beat confidence in predicting LLM failures, with effort giving less overoptimistic and more stable signals across model sizes and task types.
All Physics samples are chosen randomly (ensuring they are text-only) from the original HLE dataset, leading to 168 items
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs
Effort and ability appraisals match or beat confidence in predicting LLM failures, with effort giving less overoptimistic and more stable signals across model sizes and task types.