Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs

James Z. Wang; Leona Chen; Lucas Craig; Samarth Khanna; Sree Bhattacharyya; Tharun Dilliraj

arXiv: 2605.07806 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-11 02:13 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords LLM self-assessment · performance prediction · confidence calibration · cognitive appraisal · model reliability · failure detection

The pith

Effort and ability appraisals from LLMs predict correctness more reliably than confidence across tasks and model sizes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether breaking self-assessment into multiple psychological dimensions improves predictions of when large language models will fail. Standard confidence estimates tend to be overoptimistic and unstable, but dimensions drawn from cognitive appraisal theory, especially perceived effort and ability, match or beat confidence on most of the 38 tasks tested. Effort stands out for giving less inflated forecasts that hold steady even as models grow larger. The best dimension shifts with the task: effort works best on reasoning problems while ability and confidence lead on retrieval ones. This structured approach points toward safer ways to decide when to trust model outputs in real applications.

Core claim

Drawing on cognitive appraisal theory, the authors elicit six appraisal dimensions alongside confidence from 12 LLMs on 38 tasks. Competence-related dimensions, particularly effort and ability, consistently match or outperform confidence in predicting model failure. Effort produces less overoptimistic estimates that remain stable across model sizes, while affective dimensions add only marginal value. The most informative dimension varies with task type, with effort strongest on reasoning tasks and ability or confidence strongest on retrieval tasks.
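As a minimal sketch of what scoring a single dimension as a failure predictor can look like, the snippet below computes AUROC, the metric named in the Figure 7 caption, for confidence and for effort. The ratings are made up for illustration, and the sign convention (a higher effort rating means the model found the item harder) is an assumption, not something this page states.

```python
from typing import Sequence


def auroc(scores: Sequence[float], correct: Sequence[int]) -> float:
    """Probability that a random correct item outscores a random incorrect one.

    Ties count as half a win (the Mann-Whitney formulation of AUROC).
    """
    pos = [s for s, c in zip(scores, correct) if c == 1]
    neg = [s for s, c in zip(scores, correct) if c == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical 1-10 ratings for six items (illustrative only).
correct = [1, 1, 1, 0, 0, 0]
confidence = [9, 9, 10, 9, 9, 8]
effort = [2, 3, 4, 8, 7, 3]  # assumed: higher effort = harder item

auroc_confidence = auroc(confidence, correct)
# Negate effort so that a higher score again means "more likely correct".
auroc_effort = auroc([-e for e in effort], correct)
print(f"confidence AUROC: {auroc_confidence:.3f}")
print(f"effort AUROC:     {auroc_effort:.3f}")
```

An AUROC of 0.5 means the ratings carry no ranking information about correctness; comparing dimensions on the same items is what makes the head-to-head claim checkable.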

What carries the argument

Multidimensional appraisal-based self-assessment, which elicits separate ratings for dimensions such as effort, ability, and others to decompose model self-evaluation beyond single confidence scores.

If this is right

Effort ratings give more stable and less overoptimistic failure predictions than confidence across model scales.
The most useful appraisal dimension changes with task demands, favoring effort on reasoning tasks and ability on retrieval tasks.
Affective dimensions supply only weak additional signals for performance prediction.
Multidimensional self-assessment offers a route to more reliable and safer LLM deployment in varied settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Prompting models to report effort could be integrated into existing verification pipelines to flag likely errors without extra computation.
Task-type differences suggest that systems could dynamically select which appraisal dimension to query based on detected problem class.
If effort estimates prove robust, they might reduce reliance on post-hoc calibration methods that require ground-truth labels.
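The dynamic-selection idea above could be prototyped roughly as follows: on a labeled development set, pick for each task type the dimension with the highest AUROC, then query that dimension at deployment time. The record layout, the `SIGN` table, the task-type labels, and the data are hypothetical; this is a sketch of the idea, not the paper's method.

```python
from collections import defaultdict


def auroc(scores, labels):
    """Rank-based AUROC; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Assumed direction of each rating: +1 if higher means "more likely correct".
SIGN = {"confidence": 1, "ability": 1, "effort": -1}


def best_dimension_per_task_type(records):
    """Map each task type to the dimension with the highest dev-set AUROC."""
    by_type = defaultdict(list)
    for r in records:
        by_type[r["task_type"]].append(r)
    best = {}
    for task_type, rows in by_type.items():
        labels = [r["correct"] for r in rows]
        scored = {d: auroc([s * r[d] for r in rows], labels)
                  for d, s in SIGN.items()}
        best[task_type] = max(scored, key=scored.get)
    return best


def row(task_type, correct, confidence, ability, effort):
    return {"task_type": task_type, "correct": correct,
            "confidence": confidence, "ability": ability, "effort": effort}


# Hypothetical development set (illustrative only).
dev = [
    row("reasoning", 1, 9, 8, 3), row("reasoning", 1, 9, 7, 4),
    row("reasoning", 0, 9, 8, 8), row("reasoning", 0, 9, 7, 9),
    row("retrieval", 1, 10, 9, 2), row("retrieval", 1, 9, 9, 2),
    row("retrieval", 0, 9, 5, 2), row("retrieval", 0, 8, 4, 2),
]
best = best_dimension_per_task_type(dev)
print(best)
```

Selecting on the same data used for evaluation would inflate the apparent gain, so any real use would need a held-out split.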

Load-bearing premise

That the verbalized appraisal dimensions LLMs produce actually reflect meaningful internal states and carry new information about correctness beyond what confidence and task difficulty already provide.

What would settle it

An experiment in which effort ratings show no higher correlation with actual correctness than confidence ratings when tested on new models or tasks after controlling for difficulty.
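One hedged way to run the "controlling for difficulty" part of such a check is a first-order partial correlation between a rating and correctness, with a difficulty proxy held fixed. The sketch below uses Pearson correlations for brevity (a rank-based variant would apply the same formula to ranks); the difficulty proxy and all data are assumptions.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def partial_corr(x, y, z):
    """Correlation of x and y after linearly removing z from both."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Hypothetical per-item data (illustrative only).
effort = [2, 3, 8, 7, 3, 9, 5, 6]                       # 1-10 rating
correct = [1, 1, 0, 0, 1, 0, 1, 0]                      # 1 = correct
difficulty = [0.2, 0.2, 0.8, 0.8, 0.2, 0.8, 0.5, 0.5]   # assumed proxy, e.g. 1 - task accuracy

raw = pearson(effort, correct)
partial = partial_corr(effort, correct, difficulty)
print(f"raw correlation:     {raw:+.3f}")
print(f"partial correlation: {partial:+.3f}")
```

If the partial correlation for effort shrinks to roughly that of confidence once difficulty is held fixed, the rating is mostly re-encoding how hard the item is rather than adding independent signal.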

Figures

Figures reproduced from arXiv: 2605.07806 by James Z. Wang, Leona Chen, Lucas Craig, Samarth Khanna, Sree Bhattacharyya, Tharun Dilliraj.

**Figure 2.** McFadden's pseudo-R² denoting the incremental variance added when adding a single dimension, and all dimensions, to a confidence-only baseline. The underlying model in both cases is logistic regression.

**Figure 3.** Comparison of prompting strategies for eliciting confidence with other dimensions. The chosen …

**Figure 4.** Calibration analysis for Standard and Hard subsets. (a) Brier score decomposition averaged across …

**Figure 5.** Frequency with which a given dimension emerges as the most predictive of failure, within a given …

**Figure 6.** Factor analysis heatmaps for all benchmarks, across all models, including both Standard and Hard tasks. We study self-assessment ratings of LLMs under seven total dimensions, which are principally categorized as the affective and competence dimensions. We examine, through the model self-reports, how they internally represent these dimensions, and whether they follow our prior categorization based on theory. …

**Figure 7.** Complete AUROC heatmap, when using a single dimension at a time as the threshold to predict …

**Figure 8.** Complete heatmap using Spearman rank correlation, considering one dimension at a time.

**Figure 9.** Gains with three ensembling methods: naive ensembling through a mean of all dimensions, …

**Figure 10.** Mean feature importance of each dimension in the ensemble methods, for (a) the Standard subset, …

**Figure 11.** Frequency with which each dimension is found to be the best discriminator across the different …

**Figure 12.** Frequency with which each dimension is found to be the best discriminator across the different …

**Figure 13.** ECE scores with 95% confidence intervals for Standard tasks.

**Figure 14.** ECE scores with 95% confidence intervals for Hard tasks.

**Figure 15.** Reliability curves shown only with effort, confidence, and ability for Standard tasks.

**Figure 16.** Reliability curves shown only with effort, confidence, and ability for Hard tasks.

**Figure 17.** Effect sizes denoting the magnitude of change in metacognitive ratings between each sub-category …

**Figure 18.** Improvement in AUROC scores, over using only confidence, when using task-adaptive best …

**Figure 19.** Improvement in ECE scores, over using only confidence, when using task-adaptive best dimension.

**Figure 20.** Summary of ceiling compression results. From the accompanying text (Section N, Properties of Dimensional Ratings; N.1, Spread of Ratings): In this section, we consider some external properties of the dimensional ratings to inform potential future explorations of their mechanistic structure. First, we quantify ceiling compression: the percentage of scores at the maximum value of the rating scale (10/10 for all dimensions except effort, where the …

**Figure 21.** Visualization of average ratings and task difficulty, for each dimension, averaged across models.

**Figure 22.** Correlations between standard deviation of accuracy within each domain, and the AUROC …

Original abstract

Large Language Models (LLMs) are increasingly used in settings where reliable self-assessment is critical. Assessing model reliability has evolved from using probabilistic correctness estimates to, more recently, eliciting verbalized confidence. Confidence, however, has been shown to be an inconsistent and overoptimistic predictor of model correctness. Drawing on cognitive appraisal theory, a framework from human psychology that decomposes self-evaluation into multiple components, we propose a multidimensional perspective on model self-assessment. We elicit six appraisal-based dimensions of self-assessment, alongside confidence, and evaluate their utility for predicting model failure across 12 LLMs and 38 tasks spanning eight domains. We find that competence-related appraisal dimensions, particularly effort and ability, consistently match or outperform confidence across most settings. Effort additionally yields less overoptimistic estimates that remain stable across model sizes. In contrast, affective dimensions provide marginally predictive signals. Furthermore, the most informative dimension varies systematically with task characteristics: effort is most predictive for reasoning-intensive tasks, while ability and confidence dominate on retrieval-oriented tasks. Broadly, our findings indicate that structured multidimensional self-assessment is a promising approach to improving the reliability and safety of language model deployment across diverse real-world settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Effort and ability ratings match or beat confidence for failure prediction in this setup, but the paper leaves open whether they add anything beyond task difficulty.

The main point is that prompting LLMs for effort and ability appraisals gives failure predictions that hold up at least as well as confidence across 12 models and 38 tasks, with effort looking more stable and less overoptimistic. The task-specific pattern (effort stronger on reasoning tasks, ability and confidence on retrieval) also comes through clearly in the results they report. That framing from cognitive appraisal theory is new in the calibration literature and the scale of the comparison is decent, so the empirical patterns are worth seeing even if they are not revolutionary. The experiments cover enough ground to make the consistency claim believable on its face.

The soft spot is the missing controls. The abstract and results do not show partial correlations or regressions that isolate the appraisal dimensions from raw confidence scores or from simple task difficulty measures. If effort ratings largely track how hard the model perceives the question to be, then the reported edge could be an artifact of prompt wording rather than genuine extra signal. Without those ablations or difficulty-matched baselines, it is hard to judge how much the multidimensional approach actually moves the needle.

This paper is aimed at people working on LLM reliability and self-assessment for deployment. A reader already following calibration work will find the head-to-head numbers useful and the task breakdowns informative. It is solid enough on the empirical side to deserve a serious referee, though the review will likely focus on whether the incremental validity holds up once difficulty and confidence are properly controlled for. I would send it out for review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs' self-assessments can be improved by eliciting six appraisal dimensions drawn from cognitive appraisal theory (in addition to standard confidence) and that competence-related dimensions, especially effort and ability, match or outperform confidence for predicting correctness. Effort is reported to produce less overoptimistic estimates that remain stable across model scales. The best-performing dimension varies by task type (effort for reasoning tasks, ability/confidence for retrieval tasks), while affective dimensions add only marginal value. These patterns are reported across 12 LLMs and 38 tasks in eight domains.

Significance. If the central empirical patterns hold after controls for incremental validity, the work would provide a concrete, psychologically grounded alternative to single-metric confidence prompting, with direct implications for safer LLM deployment in high-stakes settings. The evaluation scope (12 models, 38 tasks) is a clear strength and supports claims of generality. The paper also earns credit for grounding the proposal in an established psychological framework rather than ad-hoc prompting variants.

major comments (2)

[Results] Results section: The claim that effort and ability 'consistently match or outperform confidence' and that effort is 'less overoptimistic' and 'stable across model sizes' is load-bearing for the central contribution, yet the manuscript reports no partial correlations, hierarchical regressions, or difficulty-matched ablations that isolate incremental predictive validity beyond raw confidence scores and task-inherent difficulty. Without these, apparent superiority could be an artifact of prompt phrasing or shared variance with perceived hardness.
[Methods / Experimental Setup] Experimental setup and methods: The abstract and main text provide no exact prompting templates for the six appraisal dimensions, no description of statistical tests or multiple-comparison corrections, no data-exclusion criteria, and no baseline controls that hold task difficulty constant while varying model self-knowledge. These omissions prevent verification that the reported patterns reflect genuine multidimensional self-assessment rather than surface-level prompt effects.

minor comments (2)

[Abstract] The abstract states '38 tasks spanning eight domains' without enumerating the domains or providing a task list or citation; adding this would improve reproducibility.
[Figures / Tables] Figure captions and table legends should explicitly state whether error bars represent standard error, confidence intervals, or standard deviation, and whether significance markers are corrected for multiple comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and the recommendation for major revision. We address each major comment below, agreeing that additional analyses and methodological details will strengthen the manuscript. We will incorporate the suggested changes in the revised version.

Referee: [Results] Results section: The claim that effort and ability 'consistently match or outperform confidence' and that effort is 'less overoptimistic' and 'stable across model sizes' is load-bearing for the central contribution, yet the manuscript reports no partial correlations, hierarchical regressions, or difficulty-matched ablations that isolate incremental predictive validity beyond raw confidence scores and task-inherent difficulty. Without these, apparent superiority could be an artifact of prompt phrasing or shared variance with perceived hardness.

Authors: We agree that incremental validity analyses would strengthen the central claims. The current manuscript relies on direct comparisons of predictive performance (e.g., accuracy in forecasting correctness) across dimensions. In revision, we will add partial correlations and hierarchical regressions that control for confidence scores and task difficulty (using baseline performance or perceived hardness as proxies). We will also include difficulty-matched ablations to rule out artifacts from prompt phrasing or shared variance. These additions will clarify the unique contributions of effort and ability.
Revision: yes

Referee: [Methods / Experimental Setup] Experimental setup and methods: The abstract and main text provide no exact prompting templates for the six appraisal dimensions, no description of statistical tests or multiple-comparison corrections, no data-exclusion criteria, and no baseline controls that hold task difficulty constant while varying model self-knowledge. These omissions prevent verification that the reported patterns reflect genuine multidimensional self-assessment rather than surface-level prompt effects.

Authors: We agree these details should be more accessible. Prompting templates for the six dimensions are in the appendix; we will move representative examples to the main text. We will add a methods subsection describing the statistical tests, multiple-comparison corrections, and data-exclusion criteria. For baseline controls holding task difficulty constant, we will add analyses using difficulty covariates or matched subsets where possible, and discuss any remaining limitations regarding model self-knowledge variation.
Revision: yes
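The hierarchical-regression check discussed in this exchange, and the McFadden pseudo-R² measure named in the Figure 2 caption, can be sketched as a comparison between a confidence-only logistic regression and one that also includes a second dimension. The fitting routine below is a plain gradient-ascent stand-in written for self-containment, and the ratings are invented; neither is taken from the paper.

```python
import math


def _sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _softplus(t):
    # Numerically stable log(1 + exp(t)).
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def fit_logistic(X, y, lr=1.0, steps=5000):
    """Gradient ascent on the mean log-likelihood; returns weights, bias first."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for row, label in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], row))
            err = label - _sigmoid(z)
            grad[0] += err
            for j, xj in enumerate(row):
                grad[j + 1] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w


def log_likelihood(w, X, y):
    total = 0.0
    for row, label in zip(X, y):
        z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], row))
        # log p(y=1) = -softplus(-z); log p(y=0) = -softplus(z)
        total -= _softplus(-z) if label == 1 else _softplus(z)
    return total


def mcfadden_r2(X, y):
    """1 - LL(model) / LL(intercept-only model)."""
    p0 = sum(y) / len(y)
    ll_null = sum(y) * math.log(p0) + (len(y) - sum(y)) * math.log(1 - p0)
    return 1.0 - log_likelihood(fit_logistic(X, y), X, y) / ll_null


# Hypothetical 1-10 ratings for eight items (illustrative only), rescaled to [0, 1].
correct = [1, 1, 1, 0, 0, 0, 1, 0]
confidence = [9, 9, 10, 9, 9, 8, 9, 10]
effort = [2, 3, 4, 8, 7, 3, 6, 5]

X_conf = [[c / 10] for c in confidence]
X_full = [[c / 10, e / 10] for c, e in zip(confidence, effort)]

r2_conf = mcfadden_r2(X_conf, correct)
r2_full = mcfadden_r2(X_full, correct)
gain = r2_full - r2_conf
print(f"confidence only:     {r2_conf:.3f}")
print(f"confidence + effort: {r2_full:.3f}")
print(f"incremental gain:    {gain:.3f}")
```

A difficulty covariate could be added as a further column in both models, so that the gain attributed to effort is measured over a baseline that already knows how hard the task is.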

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation

The paper is an empirical study that elicits verbalized self-assessment dimensions from LLMs and compares their ability to predict correctness against confidence across 12 models and 38 tasks. No mathematical derivations, equations, or parameter-fitting steps are described that would reduce any reported predictive performance to inputs by construction. The central findings rest on observed experimental outcomes rather than self-definitions, self-citations as load-bearing premises, or renamed known results. This is a self-contained empirical comparison with no circular reduction in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that LLMs can produce interpretable verbal responses to appraisal questions that carry genuine predictive signal about correctness, plus the representativeness of the chosen 38 tasks and 12 models.

axioms (1)

Domain assumption: LLMs can generate verbalized responses to appraisal-based questions that reflect meaningful internal states or processing characteristics relevant to correctness.
The entire evaluation presupposes that the six elicited dimensions are not merely linguistic artifacts but carry usable information about model reliability.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We elicit six appraisal-based dimensions of self-assessment, alongside confidence, and evaluate their utility for predicting model failure across 12 LLMs and 38 tasks"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "competence-related appraisal dimensions, particularly effort and ability, consistently match or outperform confidence"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 1 internal anchor

[1]

Qwen3 Technical Report

URLhttps://aclanthology.org/2025.findings-acl.1316/. Weihao Xuan, Rui Yang, Heli Qi, Qingcheng Zeng, Yunze Xiao, Aosong Feng, Dairui Liu, Yun Xing, Junjue Wang, Fan Gao, et al. Mmlu-prox: A multilingual benchmark for advanced large language model evaluation. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2024.emnlp-main.443 2025
[2]

Code Line Description: 60 items, ground truth in the form of multiple choices (same options provided in the prompt too), evaluation using exact string match to provide accuracy

work page
[3]

Auto Debugging: 35 items, ground truth in the form of open-text, evaluation using exact string match to provide accuracy

work page
[4]

The evaluation produces results for both the average compile rate and the overall accuracy

Python Programming: 32 items (across very easy, easy, medium and hard difficulty levels), ground truth is not directly available, and evaluation is performed by execution of code produced by models, using the exact protocol from BIG-Bench. The evaluation produces results for both the average compile rate and the overall accuracy. 29 •Math:

work page
[5]

Evaluation is using exact string match to provide accuracy

Mathematical Induction: 70 items, ground truth in the form of Yes/No multiple choice, where the same options are provided in the prompts too. Evaluation is using exact string match to provide accuracy

work page
[6]

Evaluated using exact string match to provide accuracy

Checkmate-in-one: 150 items, ground truth available directly in the form of the final correct move, and no options are provided in the prompt. Evaluated using exact string match to provide accuracy

work page
[7]

Evaluation again uses exact string match to provide accuracy

Evaluating Information Essentiality: 60 items, ground truth in the form of options, which are also provided in the prompt to the models. Evaluation again uses exact string match to provide accuracy

work page
[8]

Evaluation uses exact string match, and provides accuracy score

Dynamic Counting: 150 items, ground truth in the form of options, that are also provided in the prompt to the models. Evaluation uses exact string match, and provides accuracy score. •Science:

work page
[9]

75 items are subsampled for each of the subtasks, leading to 150 total samples

Periodic Elements: includes two subtasks (named 0 and 1 within BIG-Bench) one of which involves direct recall of periodic elements, and one involves minor manipulation beyond simple recall. 75 items are subsampled for each of the subtasks, leading to 150 total samples. Ground truth is directly available in open-text format as the correct answer (the name ...

work page
[10]

Evaluation uses exact string match, leading to accuracy scores

Physics: 150 items, ground truth available in the form of multiple choices, provided also in the prompts to the models. Evaluation uses exact string match, leading to accuracy scores. •Multilingual Reasoning:

work page
[11]

Ground truth is present as an utterance within the original question, as one of the provided two utterances has to be chosen as the cause

Indic Cause and Effect: 50 items each for Hindi, Bengali, and Malayalam, with 2 sub- tasks each, involving different formats of questions for cause and effect. Ground truth is present as an utterance within the original question, as one of the provided two utterances has to be chosen as the cause. Evaluation is through an exact string match and provides a...

work page
[12]

Kanji ASCII: Includes 75 items, each, for a ’pronunciation’ and a ’meaning’ task. Ground truth for the pronunciation task is available in the form of a list of words, all of which are possible correct answers, and are hence not provided within the prompt to the model. Evaluation marks a model response as correct if it matches any one of the words within t...

work page
[13]

Evaluation uses an exact string match and provides an accuracy score

Proverb Translation: 72 items, ground truth available in the form of options that are provided to the model as part of the prompt. Evaluation uses an exact string match and provides an accuracy score. •Understanding World: 30

work page
[14]

Cause and Effect: 51 items, ground truth available in the form of options that are also provided in the prompt, evaluation uses exact string match to produce accuracy scores

work page
[15]

Physical Intuition: 81 items, ground truth available in the form of options that are also provided in the prompt, evaluation uses exact string match to produce accuracy scores

work page
[16]

Commonsense Reasoning: 150 items, True/False options provided in the prompt, evalu- ation uses exact string match to produce accuracy scores

work page
[17]

•Known Failure Modes:

Fable: 150 items, ground truth available in the form of options that are also provided in the prompt, evaluation uses exact string match to produce accuracy scores. •Known Failure Modes:

work page
[18]

Evaluation uses exact string match, and produces accuracy scores

Known Unknowns: 46 items, ground truth available in the form options (including an "Unknown" option) that is provided to the models in the prompt. Evaluation uses exact string match, and produces accuracy scores

work page
[19]

Evaluation uses exact string match, and rewards a model response if it matches any one of the possible correct answers

World Unscrambling: 150 items, ground truth available in the form of a list of options, all of which are correct, and are hence not provided within the prompts. Evaluation uses exact string match, and rewards a model response if it matches any one of the possible correct answers. Metric produced is accuracy score. •Traditional NLP Tasks:

work page
[20]

Evaluation uses the BLEU score, implemented using the sacrebleu package in Python

Text Simplification: 50 items, ground truth available in the form of direct open text. Evaluation uses the BLEU score, implemented using the sacrebleu package in Python. As the individual per-item correctness scores for this task are non binary, we binarize these scores for the AUROC calculation using a threshold of 0.5

work page
[21]

Evaluation is using direct string match and provides the accuracy score

Phrase Relatedness: 100 items, ground truth available in the form of multiple choice options, which are provided in the prompt too. Evaluation is using direct string match and provides the accuracy score

work page
[22]

Evaluation is using direct string match and provides the accuracy score

GRE Reading Comprehension: 32 items, ground truth available in the form of multiple choice options, which are provided in the prompt too. Evaluation is using direct string match and provides the accuracy score

work page
[23]

Evaluation is using direct string match and provides the accuracy score

Identifying Anachronisms: 150 items, ground truth available in the form of Yes/No options, which are provided in the prompt too. Evaluation is using direct string match and provides the accuracy score. D.2 Tasks within the Hard Subset Following are the details for the inclusion of tasks within the Hard subset, described per domain: •Coding:

work page
[24]

Evaluation is using a custom implementation that executes all available test cases for a given problem

LiveCodeBenchPro (LCB-Pro) [Zheng et al., 2025]: Includes 332 items from the biannual_2025_1_6 subset of LCB-Pro, including samples from easy, medium, and hard difficulty categories. Evaluation is using a custom implementation that executes all available test cases for a given problem. The metric computed is pass rate, which shows the fraction of the tota...

work page 2025
[25]

Ground truth is available in the form of open-ended generations, and evaluation is done using exact string match

Humanity’s Last Exam (HLE) (Math Subset) [Phan et al., 2025]: Includes 500 randomly subsampled math questions, subject to the filter that they are text-only, and do not include reasoning over an image. Ground truth is available in the form of open-ended generations, and evaluation is done using exact string match. Note that this may be a stricter evaluati...

work page 2025
[26]

All Physics samples are chosen randomly (ensuring they are text-only) from the original HLE dataset, leading to 168 items

Humanity’s Last Exam (HLE) (Physics, Chemistry, and Biology subsets) [Phan et al., 2025]: Includes a total of 324 items, across Physics, Chemistry, and Biology. All Physics samples are chosen randomly (ensuring they are text-only) from the original HLE dataset, leading to 168 items. Biology (107 items) and Chemistry (49 items) are chosen from the HLE-Gold...

work page 2025
[27]

Ground truth is available in the form of short, open-ended text

MultiNRC [Fabbri et al., 2025]: 252 total items, subsampled randomly, uniformly from available languages (French, Chinese, Spanish), and categories of questions (linguistic, wordplay, cultural, math). Ground truth is available in the form of short, open-ended text. Evaluation thus uses GPT-OSS 120B as a judge, and produces binary correct/incorrect scores....

work page 2025
[28]

Ground truth is available in the form of options, and evaluation is using exact match, leading to binary correctness labels and accuracy scores

MMLU-Prox [Xuan et al., 2025]: 250 items, subsampled randomly and uniformly from the 5 hardest languages: Zulu, Yoruba, Swahili, Wolof, and Telugu. Ground truth is available in the form of options, and evaluation is using exact match, leading to binary correctness labels and accuracy scores. 10https://www.futurehouse.org/research-announcements/hle-exam 32...

work page 2025
[29]

Ground truth in the form of multiple choices, also provided within the prompts

CausalProbe [Chi et al., 2024]: 500 items, included from the hard subset only. Ground truth in the form of multiple choices, also provided within the prompts. Evaluation is using exact match, leading to accuracy scores

work page 2024
[30]

Ground truth is available in the form of Yes/No options, allowing direct string match evaluation to produce accuracy scores

ETHICS [Hendrycks et al., 2021]: 210 items, sampled uniformly and randomly from the hardest subsets: deontology, virtue ethics, and justice. Ground truth is available in the form of Yes/No options, allowing direct string match evaluation to produce accuracy scores

work page 2021
[31]

Includes several aspects of morality, such as, harm, authority, purity, fairness, etc

MoralBench [Ji et al., 2025a]: 43 items, chosen from the comparison format available within this benchmark, which admits binary ground truth for easier evaluation, leading to accuracy scores. Includes several aspects of morality, such as, harm, authority, purity, fairness, etc. •Understanding Humans:

work page
[32]

Ground truth is available in the form of options and allows calculation of binary correctness labels and accuracy scores

EmoBench [Sabour et al., 2024]: 100 items from the harder emotion action subset, cover- ing categories of personal and third-person specific questions. Ground truth is available in the form of options and allows calculation of binary correctness labels and accuracy scores

work page 2024
[33]

Ground truth is available in the form of options, allowing the calculation of accuracy scores

ToMBench [Chen et al., 2024]: 200 items sampled uniformly and randomly across available languages (Chinese and English) and categories of questions (non-literal com- munication, emotion, desire, belief, intention, and knowledge). Ground truth is available in the form of options, allowing the calculation of accuracy scores. •Known Failures:

[34]

Artificial Analysis Omniscience (hallucination) [Jackson et al., 2025]: 500 items, subsampled uniformly at random across the available question subjects. Ground truth is available as free-form generations. Evaluation uses GPT-OSS 120B with the original evaluation prompt format from the benchmark.

•Traditional NLP Tasks:

[35]

answer": [your response to the original task],

Artificial Analysis Long Context Reasoning [Team, 2025]: All 100 samples from the original task are used. Ground truth is available as free text and is evaluated using GPT-OSS 120B with the original evaluation prompt format from the benchmark.

E Vanilla Experimental Setup Prompt

Our main experiments involve asking models for ratings after solving the dom...

[36]

1. e4 e5 2. Nf3 Nc6 3. d4 exd4 4. Nxd4 Bc5 5. Nxc6 bxc6 6. Nc3 Nf6 7. h3 d6 8. Bd3 Qe7 9. O-O d5 10. exd5 cxd5 11. Re1 Be6 12. Bf4 c6 13. a3 O-O 14. b4 Bb6 15. Na4 Bc7 16. Bxc7 Qxc7 17. Nc5 Rfe8 18. Nxe6 Rxe6

[37]

19. Rxe6 fxe6 20. Qf3 Qe5 21. Rd1 Rf8 22. Kf1 Qh2 23. Qg3 Qxg3 24. fxg3 Ne4+ 25. Kg1 Nxg3 26. Re1 Rf6 27. Kh2 Nh5 28. Ba6 Rf2 29. Bd3 Nf4

[38]

No because I heard that it is really hard

30. Kg3 Nxd3 31. cxd3 Rf6 32. Kh4 h6 33. g4 Kh7 34. Kh5 g6+ 35. Kh4 Rf3 36. Rxe6 g5+ 37. Kh5

Code Line Description

Task Description: This task asks models to give an English language description of Python code. The goal of this task is to test the ability of language models to understand computer programming. Specifically, our task measures whether the languag...

[39]

Cole is under 7 feet tall

[40]

Every player on Cole’s varsity basketball team is above 6’1". Options:
Statement 1 alone is sufficient while statement 2 alone is insufficient
Statement 2 alone is sufficient while statement 1 alone is insufficient
Either statement 1 or statement 2 is sufficient
Statement 1 and statement 2 taken together are sufficient
Neither statement 1 nor statement 2 ...

[1]

Qwen3 Technical Report

URL https://aclanthology.org/2025.findings-acl.1316/.

Weihao Xuan, Rui Yang, Heli Qi, Qingcheng Zeng, Yunze Xiao, Aosong Feng, Dairui Liu, Yun Xing, Junjue Wang, Fan Gao, et al. Mmlu-prox: A multilingual benchmark for advanced large language model evaluation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2024.emnlp-main.443 2025

[2]

Code Line Description: 60 items. Ground truth is in the form of multiple choices (the same options are provided in the prompt). Evaluation uses exact string match to provide accuracy.

[3]

Auto Debugging: 35 items. Ground truth is in the form of open text. Evaluation uses exact string match to provide accuracy.

[4]

Python Programming: 32 items (across very easy, easy, medium, and hard difficulty levels). Ground truth is not directly available; evaluation is performed by executing the code produced by models, using the exact protocol from BIG-Bench. The evaluation reports both the average compile rate and the overall accuracy.

•Math:

[5]

Mathematical Induction: 70 items. Ground truth is in the form of Yes/No multiple choice, and the same options are provided in the prompts. Evaluation uses exact string match to provide accuracy.

[6]

Checkmate-in-one: 150 items. Ground truth is available directly in the form of the final correct move, and no options are provided in the prompt. Evaluation uses exact string match to provide accuracy.

[7]

Evaluating Information Essentiality: 60 items. Ground truth is in the form of options, which are also provided in the prompt to the models. Evaluation again uses exact string match to provide accuracy.

[8]

Dynamic Counting: 150 items. Ground truth is in the form of options, which are also provided in the prompt to the models. Evaluation uses exact string match and provides an accuracy score.

•Science:

[9]

Periodic Elements: includes two subtasks (named 0 and 1 within BIG-Bench), one of which involves direct recall of periodic elements, while the other involves minor manipulation beyond simple recall. 75 items are subsampled for each subtask, leading to 150 total samples. Ground truth is directly available in open-text format as the correct answer (the name ...

[10]

Physics: 150 items. Ground truth is available in the form of multiple choices, which are also provided in the prompts to the models. Evaluation uses exact string match, yielding accuracy scores.

•Multilingual Reasoning:

[11]

Indic Cause and Effect: 50 items each for Hindi, Bengali, and Malayalam, with 2 subtasks each, involving different question formats for cause and effect. Ground truth is present as an utterance within the original question, as one of the two provided utterances has to be chosen as the cause. Evaluation is through an exact string match and provides a...

[12]

Kanji ASCII: Includes 75 items each for a 'pronunciation' and a 'meaning' task. Ground truth for the pronunciation task is available in the form of a list of words, all of which are possible correct answers and are hence not provided within the prompt to the model. Evaluation marks a model response as correct if it matches any one of the words within t...
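The "match any one of the accepted answers" rule can be sketched in Python as follows. This is a minimal illustration, not the benchmark's scoring code: the helpers `is_correct` and `accuracy` are hypothetical, the example strings are placeholders, and the whitespace trimming and case folding before comparison are assumptions that go beyond the strict exact match described in the text.

```python
def is_correct(response, accepted):
    """Return True if the response equals any accepted answer.

    Trimming and case folding are assumed normalisation steps; drop them
    for a strict character-by-character match.
    """
    norm = response.strip().casefold()
    return any(norm == answer.strip().casefold() for answer in accepted)


def accuracy(responses, accepted_lists):
    """Fraction of responses that match at least one of their accepted answers."""
    if len(responses) != len(accepted_lists):
        raise ValueError("responses and accepted_lists must have the same length")
    if not responses:
        raise ValueError("no items to score")
    labels = [is_correct(r, a) for r, a in zip(responses, accepted_lists)]
    return sum(labels) / len(labels)


# Placeholder data: three items, each with its own list of acceptable answers.
responses = [" Alpha", "beta", "delta"]
accepted_lists = [["alpha", "alfa"], ["beta"], ["gamma"]]
score = accuracy(responses, accepted_lists)
```

Tasks with a single gold answer are the special case where each accepted list has exactly one element.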

[13]

Proverb Translation: 72 items. Ground truth is available in the form of options that are provided to the model as part of the prompt. Evaluation uses an exact string match and provides an accuracy score.

•Understanding World:

[14]

Cause and Effect: 51 items. Ground truth is available in the form of options that are also provided in the prompt. Evaluation uses exact string match to produce accuracy scores.

[15]

Physical Intuition: 81 items. Ground truth is available in the form of options that are also provided in the prompt. Evaluation uses exact string match to produce accuracy scores.

[16]

Commonsense Reasoning: 150 items. True/False options are provided in the prompt. Evaluation uses exact string match to produce accuracy scores.

[17]

Fable: 150 items. Ground truth is available in the form of options that are also provided in the prompt. Evaluation uses exact string match to produce accuracy scores.

•Known Failure Modes:

[18]

Known Unknowns: 46 items. Ground truth is available in the form of options (including an "Unknown" option) that are provided to the models in the prompt. Evaluation uses exact string match and produces accuracy scores.

[19]

Word Unscrambling: 150 items. Ground truth is available in the form of a list of options, all of which are correct and are hence not provided within the prompts. Evaluation uses exact string match and rewards a model response if it matches any one of the possible correct answers. The metric produced is an accuracy score.

•Traditional NLP Tasks:

[20]

Text Simplification: 50 items. Ground truth is available in the form of direct open text. Evaluation uses the BLEU score, implemented using the sacrebleu package in Python. As the individual per-item correctness scores for this task are non-binary, we binarize these scores for the AUROC calculation using a threshold of 0.5.
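The binarization step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' pipeline: per-item scores are assumed to already lie in [0, 1], a score equal to the threshold is assumed to count as correct, and the names `binarize`, `auroc`, `scores`, and `ratings` are hypothetical. The AUROC here is the standard pairwise (Mann-Whitney) formulation, with ties counted as half.

```python
def binarize(scores, threshold=0.5):
    """Map continuous per-item scores to 0/1 correctness labels.

    Treating score == threshold as correct is an assumption.
    """
    return [1 if s >= threshold else 0 for s in scores]


def auroc(labels, ratings):
    """Probability that a correct item receives a higher rating than an incorrect one."""
    pos = [r for label, r in zip(labels, ratings) if label == 1]
    neg = [r for label, r in zip(labels, ratings) if label == 0]
    if not pos or not neg:
        raise ValueError("AUROC is undefined unless both classes are present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Placeholder values: per-item quality scores and the model's self-assessment ratings.
scores = [0.82, 0.31, 0.55, 0.12]
ratings = [5, 2, 3, 3]

labels = binarize(scores)        # [1, 0, 1, 0]
value = auroc(labels, ratings)   # 3.5 winning pairs out of 4
```

Because AUROC depends only on the ordering of the ratings, the rating scale itself does not need to be rescaled before the calculation.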

[21]

Phrase Relatedness: 100 items. Ground truth is available in the form of multiple choice options, which are also provided in the prompt. Evaluation uses direct string match and provides the accuracy score.

[22]

GRE Reading Comprehension: 32 items. Ground truth is available in the form of multiple choice options, which are also provided in the prompt. Evaluation uses direct string match and provides the accuracy score.

[23]

Identifying Anachronisms: 150 items. Ground truth is available in the form of Yes/No options, which are also provided in the prompt. Evaluation uses direct string match and provides the accuracy score.

D.2 Tasks within the Hard Subset

Following are the details for the inclusion of tasks within the Hard subset, described per domain:

•Coding:

[24]

LiveCodeBenchPro (LCB-Pro) [Zheng et al., 2025]: Includes 332 items from the biannual_2025_1_6 subset of LCB-Pro, including samples from easy, medium, and hard difficulty categories. Evaluation uses a custom implementation that executes all available test cases for a given problem. The metric computed is pass rate, which shows the fraction of the tota...

[25]

Humanity’s Last Exam (HLE) (Math Subset) [Phan et al., 2025]: Includes 500 randomly subsampled math questions, subject to the filter that they are text-only and do not require reasoning over an image. Ground truth is available in the form of open-ended generations, and evaluation is done using exact string match. Note that this may be a stricter evaluati...

[26]

Humanity’s Last Exam (HLE) (Physics, Chemistry, and Biology subsets) [Phan et al., 2025]: Includes a total of 324 items across Physics, Chemistry, and Biology. All Physics samples are chosen randomly (ensuring they are text-only) from the original HLE dataset, leading to 168 items. Biology (107 items) and Chemistry (49 items) are chosen from the HLE-Gold...

[27]

MultiNRC [Fabbri et al., 2025]: 252 total items, subsampled uniformly at random across available languages (French, Chinese, Spanish) and categories of questions (linguistic, wordplay, cultural, math). Ground truth is available in the form of short, open-ended text. Evaluation thus uses GPT-OSS 120B as a judge and produces binary correct/incorrect scores....

[28]

MMLU-Prox [Xuan et al., 2025]: 250 items, subsampled uniformly at random from the 5 hardest languages: Zulu, Yoruba, Swahili, Wolof, and Telugu. Ground truth is available in the form of options, and evaluation uses exact match, yielding binary correctness labels and accuracy scores.

10 https://www.futurehouse.org/research-announcements/hle-exam ...
