Effort and ability appraisals match or beat confidence in predicting LLM failures, with effort giving less overoptimistic and more stable signals across model sizes and task types.
75 items are subsampled for each of the subtasks, leading to 150 total samples
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
citation-role summary
background 1
citation-polarity summary
fields
cs.CL 1years
2026 1verdicts
UNVERDICTED 1roles
background 1polarities
background 1representative citing papers
citing papers explorer
-
Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs
Effort and ability appraisals match or beat confidence in predicting LLM failures, with effort giving less overoptimistic and more stable signals across model sizes and task types.