Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs
Pith reviewed 2026-05-11 02:13 UTC · model grok-4.3
The pith
Effort and ability appraisals from LLMs predict correctness more reliably than confidence across tasks and model sizes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Drawing on cognitive appraisal theory, the authors elicit six appraisal dimensions alongside confidence from 12 LLMs on 38 tasks. Competence-related dimensions, particularly effort and ability, consistently match or outperform confidence in predicting model failure. Effort produces less overoptimistic estimates that remain stable across model sizes, while affective dimensions add only marginal value. The most informative dimension varies with task type, with effort strongest on reasoning tasks and ability or confidence strongest on retrieval tasks.
What carries the argument
Multidimensional appraisal-based self-assessment: the model gives a separate rating for each of six appraisal dimensions (effort and ability among them), which decomposes self-evaluation beyond a single confidence score.
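A minimal sketch of what collecting such ratings could look like. Everything here is an assumption made for illustration: the JSON reply format, the 0-100 scale, and the prompt wording are hypothetical, and only the dimensions named on this page (confidence, effort, ability) are included, not the paper's full set of six.

```python
import json

# Only the dimensions named on this page; the paper's other appraisal
# dimensions are not listed here, so they are deliberately left out.
DIMENSIONS = ("confidence", "effort", "ability")


def build_prompt(task: str) -> str:
    """Hypothetical prompt wording; the paper's actual templates are not shown here."""
    fields = ", ".join(f'"{d}": <0-100>' for d in DIMENSIONS)
    return (
        f"{task}\n\n"
        "Solve the task, then rate yourself. Reply with JSON only:\n"
        f'{{"answer": <your answer>, {fields}}}'
    )


def parse_ratings(raw: str, lo: float = 0.0, hi: float = 100.0) -> dict:
    """Parse a JSON reply and rescale each rating to the range [0, 1]."""
    data = json.loads(raw)
    ratings = {}
    for dim in DIMENSIONS:
        value = float(data[dim])
        if not lo <= value <= hi:
            raise ValueError(f"{dim} rating {value} outside [{lo}, {hi}]")
        ratings[dim] = (value - lo) / (hi - lo)
    return ratings
```

Keeping each dimension as its own field, rather than folding them into one score, is what lets later analysis compare them against confidence one at a time.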
If this is right
- Effort ratings give more stable and less overoptimistic failure predictions than confidence across model scales.
- The most useful appraisal dimension changes with task demands, favoring effort on reasoning tasks and ability on retrieval tasks.
- Affective dimensions supply only weak additional signals for performance prediction.
- Multidimensional self-assessment offers a route to more reliable and safer LLM deployment in varied settings.
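The two properties claimed above, predictive power and overoptimism, can each be put into a single number. The sketch below is illustrative rather than the authors' evaluation code: it assumes every rating has been rescaled to [0, 1] and oriented so that higher means "more likely correct" (for effort this sketch uses 1 - effort, which is an assumption, not the paper's stated mapping), and the data are made up.

```python
from typing import Sequence


def auroc(scores: Sequence[float], correct: Sequence[int]) -> float:
    """Chance that a random correct item outscores a random incorrect one.

    Ties count as half a win. 0.5 means the score carries no signal.
    """
    pos = [s for s, c in zip(scores, correct) if c == 1]
    neg = [s for s, c in zip(scores, correct) if c == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def overoptimism_gap(scores: Sequence[float], correct: Sequence[int]) -> float:
    """Mean self-rating minus observed accuracy; positive means overoptimistic."""
    return sum(scores) / len(scores) - sum(correct) / len(correct)


# Hypothetical toy data: six items, three answered correctly.
correct = [1, 1, 0, 1, 0, 0]
confidence = [0.9, 0.9, 0.9, 0.8, 0.9, 0.7]
low_effort = [0.8, 0.7, 0.4, 0.6, 0.3, 0.5]  # assumed to be 1 - effort
```

On this toy data the confidence ratings barely separate correct from incorrect items and sit well above the true accuracy, while the effort-derived score separates them cleanly. The numbers are constructed to show the metrics, not to reproduce any result from the paper.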
Where Pith is reading between the lines
- Prompting models to report effort could be integrated into existing verification pipelines to flag likely errors without extra computation.
- Task-type differences suggest that systems could dynamically select which appraisal dimension to query based on detected problem class.
- If effort estimates prove robust, they might reduce reliance on post-hoc calibration methods that require ground-truth labels.
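The first two ideas above can be combined into a small routing sketch. This is speculative: the task-type labels, the 0.5 threshold, the `Appraisal` container, and the reading of high effort as a warning sign are all assumptions made for illustration, and the mapping simply mirrors the task-type pattern summarized on this page.

```python
from dataclasses import dataclass

# Hypothetical routing table: effort for reasoning, ability for retrieval.
DIMENSION_BY_TASK = {"reasoning": "effort", "retrieval": "ability"}


@dataclass
class Appraisal:
    """Self-ratings rescaled to [0, 1]."""
    confidence: float
    effort: float
    ability: float


def failure_risk(appraisal: Appraisal, task_type: str) -> float:
    """Pick the dimension for this task type and turn it into a risk score."""
    dim = DIMENSION_BY_TASK.get(task_type, "confidence")  # fallback is an assumption
    value = getattr(appraisal, dim)
    # Assumed orientation: high effort signals risk, while high ability or
    # confidence signals safety.
    return value if dim == "effort" else 1.0 - value


def should_flag(appraisal: Appraisal, task_type: str, threshold: float = 0.5) -> bool:
    """Flag an answer for verification when its risk reaches the threshold."""
    return failure_risk(appraisal, task_type) >= threshold
```

For example, `should_flag(Appraisal(confidence=0.9, effort=0.8, ability=0.9), "reasoning")` flags the answer because the effort rating is high, even though confidence alone would have let it through.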
Load-bearing premise
That the verbalized appraisal dimensions LLMs produce actually reflect meaningful internal states and carry new information about correctness beyond what confidence and task difficulty already provide.
What would settle it
An experiment in which effort ratings show no higher correlation with actual correctness than confidence ratings when tested on new models or tasks after controlling for difficulty.
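One way to run the "controlling for difficulty" part of such a test is a first-order partial correlation. The sketch below is a generic textbook formula, not the paper's analysis; treating correctness as a 0/1 variable and using a per-item difficulty proxy (for instance, the share of models that fail the item) are assumptions.

```python
import math
from typing import Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x: Sequence[float], y: Sequence[float], z: Sequence[float]) -> float:
    """Correlation of x and y after removing the linear effect of z from both."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    denom = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denom == 0:
        raise ValueError("z is perfectly correlated with x or y")
    return (r_xy - r_xz * r_yz) / denom
```

Calling `partial_corr(effort, correct, difficulty)` and `partial_corr(confidence, correct, difficulty)` on the same items gives the two numbers the test compares. If the effort value is no larger in magnitude than the confidence value on held-out models or tasks, the claim is in trouble.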
Original abstract
Large Language Models (LLMs) are increasingly used in settings where reliable self-assessment is critical. Assessing model reliability has evolved from using probabilistic correctness estimates to, more recently, eliciting verbalized confidence. Confidence, however, has been shown to be an inconsistent and overoptimistic predictor of model correctness. Drawing on cognitive appraisal theory, a framework from human psychology that decomposes self-evaluation into multiple components, we propose a multidimensional perspective on model self-assessment. We elicit six appraisal-based dimensions of self-assessment, alongside confidence, and evaluate their utility for predicting model failure across 12 LLMs and 38 tasks spanning eight domains. We find that competence-related appraisal dimensions, particularly effort and ability, consistently match or outperform confidence across most settings. Effort additionally yields less overoptimistic estimates that remain stable across model sizes. In contrast, affective dimensions provide marginally predictive signals. Furthermore, the most informative dimension varies systematically with task characteristics: effort is most predictive for reasoning-intensive tasks, while ability and confidence dominate on retrieval-oriented tasks. Broadly, our findings indicate that structured multidimensional self-assessment is a promising approach to improving the reliability and safety of language model deployment across diverse real-world settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs' self-assessments can be improved by eliciting six appraisal dimensions drawn from cognitive appraisal theory (in addition to standard confidence) and that competence-related dimensions, especially effort and ability, match or outperform confidence for predicting correctness. Effort is reported to produce less overoptimistic estimates that remain stable across model scales. The best-performing dimension varies by task type (effort for reasoning tasks, ability/confidence for retrieval tasks), while affective dimensions add only marginal value. These patterns are reported across 12 LLMs and 38 tasks in eight domains.
Significance. If the central empirical patterns hold after controls for incremental validity, the work would provide a concrete, psychologically grounded alternative to single-metric confidence prompting, with direct implications for safer LLM deployment in high-stakes settings. The evaluation scope (12 models, 38 tasks) is a clear strength and supports claims of generality. The paper also earns credit for grounding the proposal in an established psychological framework rather than ad-hoc prompting variants.
major comments (2)
- [Results] Results section: The claim that effort and ability 'consistently match or outperform confidence' and that effort is 'less overoptimistic' and 'stable across model sizes' is load-bearing for the central contribution, yet the manuscript reports no partial correlations, hierarchical regressions, or difficulty-matched ablations that isolate incremental predictive validity beyond raw confidence scores and task-inherent difficulty. Without these, apparent superiority could be an artifact of prompt phrasing or shared variance with perceived hardness.
- [Methods / Experimental Setup] Experimental setup and methods: The abstract and main text provide no exact prompting templates for the six appraisal dimensions, no description of statistical tests or multiple-comparison corrections, no data-exclusion criteria, and no baseline controls that hold task difficulty constant while varying model self-knowledge. These omissions prevent verification that the reported patterns reflect genuine multidimensional self-assessment rather than surface-level prompt effects.
minor comments (2)
- [Abstract] The abstract states '38 tasks spanning eight domains' without enumerating the domains or providing a task list or citation; adding this would improve reproducibility.
- [Figures / Tables] Figure captions and table legends should explicitly state whether error bars represent standard error, confidence intervals, or standard deviation, and whether significance markers are corrected for multiple comparisons.
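For reference, the reporting choices the referee asks about can be sketched as follows. This is an illustration of two standard procedures, the Holm step-down correction and a normal-approximation interval for accuracy, and not a description of what the paper did; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def accuracy_error_bars(
    n_correct: int, n_items: int, z: float = 1.96
) -> Tuple[float, float, Tuple[float, float]]:
    """Return accuracy, its standard error, and a normal-approximation interval.

    z = 1.96 corresponds to a roughly 95% confidence interval.
    """
    acc = n_correct / n_items
    se = math.sqrt(acc * (1 - acc) / n_items)
    return acc, se, (acc - z * se, acc + z * se)
```

The gap between the standard error and the interval half-width (a factor of about two at the 95% level) is why a caption needs to say which one the bars show.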
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and the recommendation for major revision. We address each major comment below, agreeing that additional analyses and methodological details will strengthen the manuscript. We will incorporate the suggested changes in the revised version.
- Referee: [Results] Results section: The claim that effort and ability 'consistently match or outperform confidence' and that effort is 'less overoptimistic' and 'stable across model sizes' is load-bearing for the central contribution, yet the manuscript reports no partial correlations, hierarchical regressions, or difficulty-matched ablations that isolate incremental predictive validity beyond raw confidence scores and task-inherent difficulty. Without these, apparent superiority could be an artifact of prompt phrasing or shared variance with perceived hardness.
  Authors: We agree that incremental validity analyses would strengthen the central claims. The current manuscript relies on direct comparisons of predictive performance (e.g., accuracy in forecasting correctness) across dimensions. In revision, we will add partial correlations and hierarchical regressions that control for confidence scores and task difficulty (using baseline performance or perceived hardness as proxies). We will also include difficulty-matched ablations to rule out artifacts from prompt phrasing or shared variance. These additions will clarify the unique contributions of effort and ability.
  Revision: yes
- Referee: [Methods / Experimental Setup] Experimental setup and methods: The abstract and main text provide no exact prompting templates for the six appraisal dimensions, no description of statistical tests or multiple-comparison corrections, no data-exclusion criteria, and no baseline controls that hold task difficulty constant while varying model self-knowledge. These omissions prevent verification that the reported patterns reflect genuine multidimensional self-assessment rather than surface-level prompt effects.
  Authors: We agree these details should be more accessible. Prompting templates for the six dimensions are in the appendix; we will move representative examples to the main text. We will add a methods subsection describing the statistical tests, multiple-comparison corrections, and data-exclusion criteria. For baseline controls holding task difficulty constant, we will add analyses using difficulty covariates or matched subsets where possible, and discuss any remaining limitations regarding model self-knowledge variation.
  Revision: yes
Circularity Check
No significant circularity in empirical evaluation
The paper is an empirical study that elicits verbalized self-assessment dimensions from LLMs and compares their ability to predict correctness against confidence across 12 models and 38 tasks. No mathematical derivations, equations, or parameter-fitting steps are described that would reduce any reported predictive performance to inputs by construction. The central findings rest on observed experimental outcomes rather than self-definitions, self-citations as load-bearing premises, or renamed known results. This is a self-contained empirical comparison with no circular reduction in the derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs can generate verbalized responses to appraisal-based questions that reflect meaningful internal states or processing characteristics relevant to correctness.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We elicit six appraisal-based dimensions of self-assessment, alongside confidence, and evaluate their utility for predicting model failure across 12 LLMs and 38 tasks"
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "competence-related appraisal dimensions, particularly effort and ability, consistently match or outperform confidence"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.