Effort and ability appraisals match or beat confidence in predicting LLM failures, with effort giving less overoptimistic and more stable signals across model sizes and task types.
Evaluation uses the BLEU score, computed with the sacrebleu package in Python.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs