LLMs Show No Signs Of Individuated Metacognition

Mark Whiting; M. Moran

arxiv: 2605.24299 · v1 · pith:CLAKD43Cnew · submitted 2026-05-22 · 💻 cs.LG

Pith reviewed 2026-06-30 15:18 UTC · model grok-4.3

keywords large language models · metacognition · confidence calibration · factor analysis · benchmarks · self-assessment · individuated knowledge · frontier models

The pith

LLMs show no individuated metacognition in confidence judgments across benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper decomposes binary confidence judgments from twenty frontier LLMs on six benchmarks using tetrachoric factor analysis and pairwise calibration. On factual and retrieval tasks the judgments form an approximately rank-one structure where models share one item difficulty axis and vary mainly by decision threshold. Removing items that all models agree on eliminates any remaining link between stated confidence and actual performance. Mathematical reasoning appears different only because models solve the questions inside their chain of thought rather than reporting pre-existing self-knowledge. The authors conclude there is no evidence for significant verbalised individuated metacognition in any domain tested.

Core claim

The cross-model confidence matrix is approximately rank-one on factual recall and information retrieval benchmarks, with a single dominant factor capturing most of the latent variance. Models retrieving facts share an item-level difficulty axis and differ mainly in their decision thresholds along it. Across all benchmarks the relationship between confidence and performance collapses once items that all models agree on are removed. Inter-model pairwise calibration is small even where statistically significant, and what remains shrinks to nothing once base-rate differences along the shared factor are controlled for. Mathematical reasoning is the apparent exception, but this turns out to be a confound where reasoning models answer questions about their confidence by trying to solve them in their chain of thought, bypassing the sub-symbolic self-knowledge the analysis seeks to measure.
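The "shared axis, different thresholds" picture can be made concrete with a small Python sketch. It is not the authors' code. It assumes a latent-threshold convention (a model answers "yes" when latent evidence exceeds a cut-off `t`) and swaps in the classical cos-pi approximation for a full maximum-likelihood tetrachoric fit. The function names and the toy data are hypothetical.

```python
import math
from statistics import NormalDist


def threshold_from_yes_rate(yes_rate):
    """Latent cut-off t with P(z > t) = yes_rate for z ~ N(0, 1).

    Assumed convention: a model says "yes" when its latent evidence exceeds t.
    """
    return NormalDist().inv_cdf(1.0 - yes_rate)


def tetrachoric_cos_pi(x, y):
    """Cos-pi approximation to the tetrachoric correlation of two 0/1 lists."""
    # Build the 2x2 table; 0.5 is added to every cell so none is exactly zero.
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1) + 0.5
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0) + 0.5
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1) + 0.5
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0) + 0.5
    odds_ratio = (a * d) / (b * c)
    return math.cos(math.pi / (1.0 + math.sqrt(odds_ratio)))


# Hypothetical yes/no confidence judgments from two models on eight items.
model_a = [1, 1, 1, 1, 0, 0, 0, 0]
model_b = [1, 1, 1, 0, 1, 0, 0, 0]

r_ab = tetrachoric_cos_pi(model_a, model_b)
t_a = threshold_from_yes_rate(sum(model_a) / len(model_a))
print(round(r_ab, 3), round(t_a, 3))
```

Under this convention a more liberal model (higher yes-rate) has a lower threshold, while the pairwise correlation reflects how strongly two models track the same latent item ordering. The cos-pi formula is most accurate near 50/50 splits, so a real analysis would likely use a proper tetrachoric estimator.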

What carries the argument

Tetrachoric factor analysis on binary confidence judgments paired with pairwise calibration after removal of agreed items.

If this is right

Confidence-weighted routing in ensembles would not gain from using stated confidence as a signal of capability.
Selective abstention based on model confidence would not reliably improve task performance.
Ensemble weighting schemes that rely on confidence scores would see little benefit from that information.
Mathematical reasoning benchmarks do not measure metacognition when models are allowed to solve the problem in chain of thought.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Current LLM confidence may largely reflect properties of the training distribution rather than any model-specific self-assessment.
Tests of metacognition that avoid verbal reports or chain-of-thought solving would be needed to look for the capacity in other ways.
High-stakes applications that assume models can accurately flag their own limits may need alternative safeguards if the finding holds.
The same analysis could be repeated on open-ended generation tasks to check whether the rank-one structure generalizes beyond multiple-choice formats.

Load-bearing premise

That tetrachoric factor analysis on binary confidence judgments combined with removal of agreed items isolates the presence or absence of individuated metacognition rather than base-rate differences or prompt effects.

What would settle it

A set of models in which confidence differences on items where they disagree still predict performance differences after the shared factor and base rates are controlled for.
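A test of this kind could be sketched as follows. This is a minimal illustration, not the paper's procedure: it assumes that "agreed" items are those where every model gives the same confidence judgment, and that pairwise calibration is Kendall's τ-b between the per-item confidence gap and the per-item performance gap of two models. It does not implement the shared-factor or base-rate controls. All names and data are hypothetical.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b with tie correction; O(n^2), fine for small inputs."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: excluded from both terms
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else 0.0


def drop_unanimous(conf_rows):
    """Indices of items where the models do not all give the same confidence."""
    return [i for i, row in enumerate(conf_rows) if len(set(row)) > 1]


def pairwise_calibration(conf_a, conf_b, perf_a, perf_b, keep):
    """tau-b between the confidence gap and the performance gap on kept items."""
    d_conf = [conf_a[i] - conf_b[i] for i in keep]
    d_perf = [perf_a[i] - perf_b[i] for i in keep]
    return kendall_tau_b(d_conf, d_perf)


# Toy data: rows are items, columns are two hypothetical models (A, B).
conf = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 0], [0, 1]]
perf = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 0], [1, 0]]

keep = drop_unanimous(conf)
conf_a, conf_b = [r[0] for r in conf], [r[1] for r in conf]
perf_a, perf_b = [r[0] for r in perf], [r[1] for r in perf]
tau_kept = pairwise_calibration(conf_a, conf_b, perf_a, perf_b, keep)
print(keep, round(tau_kept, 3))
```

A clearly positive τ-b on the disagreement items, surviving the paper's additional controls, is the pattern that would count as evidence for individuated metacognition.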

Figures

Figures reproduced from arXiv: 2605.24299 by Mark Whiting, M. Moran.

**Figure 1.** Model-level aggregate confidence against accuracy for all six benchmarks. Each point is one model. In every panel the horizontal axis is the model’s mean confidence (fraction of items rated yes) and the vertical axis is its accuracy (fraction of items answered correctly); both run from 0 to 1 and tick labels are suppressed for legibility at this grid size. The left block (columns 1–2) shows retrieval and f…

**Figure 2.** Metacognitive 𝐹𝛽 score as a function of risk attitude 𝛽 (log scale). Low 𝛽 weights precision and penalises overconfidence; high 𝛽 weights recall and penalises underconfidence. The vertical reference at 𝛽 = 1 is the standard 𝐹1. On SQuAD no single model dominates and most scores fall in [0.5, 0.85] across 𝛽. On MathBench Gemini 3.1 Pro dominates at every 𝛽, a consequence of uniform overconfidence across the…

**Figure 3.** Per-model confidence base rate (mean yes-rate, recovered as Φ(threshold) from the tetrachoric factor analysis), z-scored within each benchmark and probe column. Five columns are shown (MathBench confidence-before, confidence-after, confidence-after-with-answer; SQuAD needs-context, contextnecessary). Each box is one model’s z-score distribution across columns; models are ordered left-to-right by mean z-s…

**Figure 4.** Eigenspectra of metacognitive tetrachoric correlation matrices (prospective conditions). Each panel shows the normalised per-rank eigenvalue (left axis, markers) with the per-rank 95th percentile of an empirical, base-rate-matched null overlaid as a red dashed curve, and the cumulative explained variance for the observed spectrum (green) and the null (pink) on the right. For each null sample, each model’s …

**Figure 5.** Pairwise Kendall’s 𝜏-b distributions across all model pairs, representative subset (full set in Figures A.10 and A.11). Left panels: orange is observed 𝜏-b. Blue is a base-rate-matched null with each of perf𝐴, perf𝐵, conf𝐴, conf𝐵 shuffled independently. Green is a calibration-preserving null with each model’s (perf, conf) pairs permuted together, holding each model’s calibration profile fixed but randomi…

**Figure 6.** Mean pairwise Kendall’s 𝜏 as a function of the maximum allowed performance base-rate difference 𝛿. (a–b) SQuAD and MMLU-Pro. Pink curve is mean 𝜏 on the left axis, grey shaded region is the number of admitted pairs on the right axis. Restricting to similar-capability pairs makes calibration vanish, consistent with performance base-rate differences projected onto the shared difficulty axis. (c) MathBench. 𝜏…

**Figure 7.** SQuAD performance and confidence are each approximately rank-one along distinct axes. (a, b) Performance and confidence tetrachoric eigenspectra. PC1 captures ∼68% of the performance variance and ∼55% of the confidence variance, with sharp drops to PC2 in both cases. Panel (b) reproduces a panel of …
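The 𝐹𝛽 sweep described in the Figure 2 caption can be illustrated with a short Python sketch. It assumes, as the caption's wording suggests but the page does not confirm, that a "yes" confidence judgment is treated as a positive prediction and a correct answer as the positive class, so overconfidence shows up as false positives. The function name and the toy data are hypothetical.

```python
def f_beta(confident, correct, beta):
    """F-beta score with 'confident' as prediction and 'correct' as ground truth."""
    tp = sum(1 for c, k in zip(confident, correct) if c and k)
    fp = sum(1 for c, k in zip(confident, correct) if c and not k)
    fn = sum(1 for c, k in zip(confident, correct) if not c and k)
    if tp == 0:
        return 0.0  # no confident-and-correct items: score defined as zero here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy judgments for one hypothetical model on six items.
confident = [1, 1, 1, 0, 0, 1]
correct = [1, 1, 0, 1, 0, 0]

# Sweep beta on a log-spaced grid, as on the figure's horizontal axis.
for beta in (0.1, 1.0, 10.0):
    print(beta, round(f_beta(confident, correct, beta), 3))
```

In this toy case precision is lower than recall, so the score rises with 𝛽: small 𝛽 tracks precision and punishes the two overconfident items, while large 𝛽 tracks recall.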

Original abstract

Confidence-weighted routing, selective abstention, and ensemble weighting all assume that a model's stated confidence is informative about its capability on the question being asked. They presume functional metacognition, the capacity to assess one's own capabilities, without exercising them. Aggregate calibration is well studied, with mixed results, but the underlying structure of elicited confidence is less well understood. We decompose binary confidence judgements from 20 frontier Large Language Models (LLMs) across six benchmarks using tetrachoric factor analysis paired with pairwise calibration, asking whether two models that differ in confidence also differ in performance. On factual recall and information retrieval benchmarks the cross-model confidence matrix is approximately rank-one and a single dominant factor captures most of the latent variance. Models retrieving facts share an item-level difficulty axis and differ mainly in their decision thresholds along it. Across all benchmarks the relationship between confidence and performance collapses once items that all models agree on are removed. Inter-model pairwise calibration is small even where statistically significant, and what remains shrinks to nothing once base-rate differences along the shared factor are controlled for. Mathematical reasoning is the apparent exception, but this turns out to be a confound where reasoning models answer questions about their confidence by trying to solve them in their chain of thought, bypassing the sub-symbolic self-knowledge we seek to measure. We find no evidence for significant verbalised individuated metacognition in any tested domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LLM confidence looks mostly like shared item difficulty, with little left for model-specific metacognition once agreements are stripped out.

The core observation is that confidence matrices across 20 frontier models are roughly rank-one on factual tasks, and the confidence-performance correlation vanishes after removing items where all models agree. This leaves inter-model calibration near zero once base-rate differences along the shared factor are controlled. The paper treats this as evidence against individuated metacognition.

What stands out is the clean application of tetrachoric factor analysis to binary judgments paired with the agreement-removal step. The mathematical-reasoning exception is handled by noting that those models appear to solve the question in the chain-of-thought rather than report a separate self-assessment. That is a useful clarification and keeps the claim from over-reaching on every domain.

The soft spot is that the abstract supplies almost no concrete information on the exact benchmarks, model list, exclusion rules, or how base-rate adjustment was performed. Without those details it is difficult to judge whether the collapse after agreement removal truly isolates absence of metacognition or simply removes the signal along with the shared difficulty axis. The main risk is that other sources, such as prompt phrasing or model-specific guessing thresholds, could produce the same pattern even if some individuated signal existed. The paper would be stronger with explicit checks that rule out those alternatives.

This is useful reading for anyone building routing, abstention, or ensemble systems that treat stated confidence as capability-aware. It is not yet tight enough on methods to stand as a definitive negative result, but the empirical pattern is worth checking. I would send it to review so the controls and data handling can be examined directly.

Referee Report

2 major / 1 minor

Summary. The paper analyzes binary confidence judgments from 20 frontier LLMs across six benchmarks using tetrachoric factor analysis paired with pairwise calibration. It reports that the cross-model confidence matrix is approximately rank-one on factual recall and information retrieval benchmarks (a single dominant factor captures most latent variance), that models differ mainly in decision thresholds along a shared item-difficulty axis, and that the confidence-performance relationship collapses once items on which all models agree are removed. Inter-model pairwise calibration is small and vanishes after controlling for base-rate differences along the shared factor. Mathematical reasoning is treated as a confound arising from chain-of-thought solving. The authors conclude there is no evidence for significant verbalised individuated metacognition in any tested domain.

Significance. If the result holds after full methodological disclosure and explicit controls for confounds, the finding would be significant for LLM evaluation and for downstream applications (confidence-weighted routing, selective abstention, ensemble weighting) that presuppose informative model self-assessment.

major comments (2)

[Abstract] The statistical approach is described only at a high level, with no details on the exact benchmarks, the list of 20 models, data exclusion criteria, or error controls. This prevents verification that the reported rank-one structure and post-removal collapse support the central claim.
[Abstract] The interpretation that the observed rank-one structure plus vanishing pairwise calibration after agreed-item removal demonstrates absence of individuated metacognition assumes binary judgments primarily reflect either a common difficulty axis or model-specific self-assessment. It is not shown how the tetrachoric model and base-rate controls rule out alternative sources (prompt phrasing, response-format biases, or model-specific guessing thresholds uncorrelated with the shared factor).

minor comments (1)

The abstract would be clearer if it named the six benchmarks and briefly indicated how the mathematical-reasoning confound was diagnosed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below. Where the comments identify opportunities for greater clarity, we will revise the manuscript accordingly.

Referee: [Abstract] The statistical approach is described only at a high level, with no details on the exact benchmarks, the list of 20 models, data exclusion criteria, or error controls. This prevents verification that the reported rank-one structure and post-removal collapse support the central claim.

Authors: We agree that the abstract is high-level by design. The full manuscript (Section 3 and Appendix A) specifies the six benchmarks (MMLU, TriviaQA, Natural Questions, HotpotQA, GSM8K, MATH), the exact list of 20 models, the exclusion rule (items on which any model produced no valid binary confidence judgment), and the use of bootstrap resampling (1,000 iterations) for standard errors on factor loadings and tetrachoric correlations. We will expand the abstract to include a concise statement of the benchmarks, model count, and error-control method so that the rank-one claim and post-removal result can be evaluated from the abstract alone.

revision: yes

Referee: [Abstract] The interpretation that the observed rank-one structure plus vanishing pairwise calibration after agreed-item removal demonstrates absence of individuated metacognition assumes binary judgments primarily reflect either a common difficulty axis or model-specific self-assessment. It is not shown how the tetrachoric model and base-rate controls rule out alternative sources (prompt phrasing, response-format biases, or model-specific guessing thresholds uncorrelated with the shared factor).

Authors: The tetrachoric factor model decomposes the observed binary correlations into latent continuous variables; an approximately rank-one solution means that residual covariance after extracting the dominant factor is negligible. Any model-specific guessing threshold or response-format bias uncorrelated with the shared difficulty factor would therefore appear either as a second significant eigenvalue or as non-zero residual pairwise calibration after the base-rate (shared-factor) control is applied. Neither is observed. Because all models received identical prompts, prompt-phrasing effects are absorbed into the common factor rather than generating model-specific residuals. We will add a short paragraph in the Discussion section that makes this logic explicit and notes the assumption that prompt effects are not model-idiosyncratic.

revision: partial
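The eigenvalue logic in this response can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's analysis: the correlation matrix is built from a hypothetical one-factor model with made-up loadings, eigenvalues are found by plain power iteration with deflation, and no base-rate-matched null is computed, so the sketch shows the shape of the check rather than a significance test.

```python
import math


def top_eigenpair(matrix, iters=500):
    """Dominant eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    n = len(matrix)
    v = [float(i + 1) for i in range(n)]  # non-uniform start vector
    scale = math.sqrt(sum(x * x for x in v))
    v = [x / scale for x in v]
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient of the final vector gives the eigenvalue estimate.
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    value = sum(v[i] * mv[i] for i in range(n))
    return value, v


def deflate(matrix, value, vector):
    """Subtract the rank-one component value * v v^T from the matrix."""
    n = len(matrix)
    return [[matrix[i][j] - value * vector[i] * vector[j] for j in range(n)]
            for i in range(n)]


# Hypothetical loadings of four models on a single shared difficulty factor.
loadings = [0.9, 0.8, 0.7, 0.6]
n = len(loadings)
corr = [[1.0 if i == j else loadings[i] * loadings[j] for j in range(n)]
        for i in range(n)]

first, vec = top_eigenpair(corr)
second, _ = top_eigenpair(deflate(corr, first, vec))
first_share = first / n    # the trace of a correlation matrix equals n
second_share = second / n
print(round(first_share, 3), round(second_share, 3))
```

When a single factor generates the correlations, the first eigenvalue takes most of the variance and the second stays small. A model-specific signal uncorrelated with the shared factor would be expected to raise the second share, which is the pattern the response says is not observed.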

Circularity Check

0 steps flagged

No circularity: empirical decomposition relies on standard factor analysis and external benchmarks

The paper applies tetrachoric factor analysis to binary confidence judgments from 20 LLMs on six benchmarks, observes a rank-one structure in the cross-model matrix, and reports that confidence-performance correlations collapse after removing unanimous items and controlling for base rates. These steps use established statistical techniques on external data rather than self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations. The conclusion of absent individuated metacognition follows directly from the observed empirical patterns without reducing to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated.


[1] [1]

Aggarwal, A

P. Aggarwal, A. Madaan, A. Anand, S. P. Potharaju, S. Mishra, P. Zhou, A. Gupta, D. Rajagopal, K. Kappaganthu, Y. Yang, S. Upadhyay, M. Faruqui, and Mausam. AutoMix : Automatically mixing language models. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 2024. arXiv:2310.12963

work page arXiv 2024

[2] [2]

The Claude 3 model family: Opus , Sonnet , Haiku

Anthropic. The Claude 3 model family: Opus , Sonnet , Haiku . Anthropic Technical Report, 2024

2024

[3] [3]

F. J. Binder, J. Chua, T. Korbak, H. Sleight, J. Hughes, R. Long, E. Perez, M. Turpin, and O. Evans. Looking inward: Language models can learn about themselves by introspection. In Proceedings of the 13th International Conference on Learning Representations (ICLR), 2025

2025

[4] [4]

Burnell, H

R. Burnell, H. Hao, A. R. A. Conway, and J. Hern \'a ndez-Orallo. Revealing the structure of language model capabilities. arXiv preprint arXiv:2306.10062, 2023

work page arXiv 2023

[5] [5]

L. Chen, M. Zaharia, and J. Zou. F rugal GPT : How to use large language models while reducing cost and improving performance. Transactions on Machine Learning Research, 2024

2024

[6] [6]

Damani, I

M. Damani, I. Puri, S. Slocum, I. Shenfeld, L. Choshen, Y. Kim, and J. Andreas. Beyond binary rewards: Training LMs to reason about their uncertainty. 2025

2025

[7] [7]

Deepseek-r1

DeepSeek-AI . DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645 0 (8081): 0 633--638, 2025. doi:10.1038/s41586-025-09422-z

work page doi:10.1038/s41586-025-09422-z 2025

[8] [8]

S. M. Fleming and H. C. Lau. How to measure metacognition. Frontiers in Human Neuroscience, 8: 0 443, 2014. doi:10.3389/fnhum.2014.00443

work page doi:10.3389/fnhum.2014.00443 2014

[9] [9]

B. Gao, F. Song, Z. Yang, Z. Cai, Y. Miao, Q. Dong, L. Li, C. Ma, L. Chen, R. Xu, Z. Tang, B. Wang, D. Zan, S. Quan, G. Zhang, L. Sha, Y. Zhang, X. Ren, T. Liu, and B. Chang. Omni-MATH : A universal olympiad level mathematic benchmark for large language models. In Proceedings of the 13th International Conference on Learning Representations (ICLR), 2025

2025

[10] [10]

Geifman and R

Y. Geifman and R. El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems, volume 30, 2017

2017

[11] [11]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Google . Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

N. Guha, J. Nyarko, D. E. Ho, C. R\' e , A. Chilton, A. Narayana, A. Chohlas-Wood, A. Peters, B. Waldon, D. N. Rockmore, D. Zambrano, D. Talisman, E. Hoque, F. Surani, F. Fagan, G. Sarfaty, G. M. Dickinson, H. Porat, J. Hegland, J. Wu, J. Nudell, J. Niklaus, J. Nay, J. H. Choi, K. Tobia, M. Hagan, M. Ma, M. Livermore, N. Rasumov-Rahe, N. Holzenberger, N. ...

2023

[13]

C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1321--1330, 2017

[14]

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems 34 (NeurIPS Datasets and Benchmarks Track), 2021

[15]

D. Ilić and G. E. Gignac. Evidence of interrelated cognitive-like capabilities in large language models: Indications of artificial general intelligence or achievement? Intelligence, 106:101858, 2024. doi:10.1016/j.intell.2024.101858

[16]

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023a

[17]

D. Jiang, X. Ren, and B. Y. Lin. LLM-blender: Ensembling large language models with pairwise ranking and generative fusion. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pages 14165--14178, 2023b

[18]

Language Models (Mostly) Know What They Know

S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, S. Johnston, S. El-Showk, A. Jones, N. Elhage, T. Hume, A. Chen, Y. Bai, S. Bowman, S. Fort, D. Ganguli, D. Hernandez, J. Jacobson, J. Kernion, S. Kravec, L. Lovitt, K. Ndousse, C. Olsson, S. Ringer, D. Amodei, T. Brown, J. Cl...

2022

[19]

A. Kamath, R. Jia, and P. Liang. Selective question answering under domain shift. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5684--5696, 2020. doi:10.18653/v1/2020.acl-main.503

[20]

M. G. Kendall. The treatment of ties in ranking problems. Biometrika, 33(3):239--251, 1945. doi:10.2307/2332303

[21]

A. Kipnis, K. Voudouris, L. M. Schulze Buschoff, and E. Schulz. metabench: A sparse benchmark of reasoning and knowledge in large language models. In Proceedings of the 13th International Conference on Learning Representations (ICLR), 2025

[22]

A. Koriat. Monitoring one's own knowledge during study: A cue-utilization approach to judgments of learning. Journal of Experimental Psychology: General, 126(4):349--370, 1997. doi:10.1037/0096-3445.126.4.349

[23]

J. Kruger and D. Dunning. Unskilled and unaware of it: How difficulties in recognizing one's own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology, 77(6):1121--1134, 1999. doi:10.1037/0022-3514.77.6.1121

[24]

L. Kuhn, Y. Gal, and S. Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In Proceedings of the 11th International Conference on Learning Representations (ICLR), 2023

[25]

S. Lin, J. Hilton, and O. Evans. Teaching models to express their uncertainty in words. Transactions on Machine Learning Research, 2022

[26]

Llama Team, Meta AI. The Llama 3 herd of models, 2024. Lead author: Aaron Grattafiori

[27]

B. Maniscalco and H. Lau. A signal detection theoretic approach for estimating metacognitive sensitivity from confidence ratings. Consciousness and Cognition, 21(1):422--430, 2012. doi:10.1016/j.concog.2011.09.021

[28]

T. O. Nelson and L. Narens. Metamemory: A theoretical framework and new findings. In Psychology of Learning and Motivation, volume 26, pages 125--173. Academic Press, 1990. doi:10.1016/S0079-7421(08)60053-5

[29]

OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

[30]

K. Pearson. Mathematical contributions to the theory of evolution. VII. On the correlation of characters not quantitatively measurable. Philosophical Transactions of the Royal Society of London. Series A, 195:1--47, 1900. doi:10.1098/rsta.1900.0022

[31]

T. Pedapati, A. Dhurandhar, S. Ghosh, S. Dan, and P. Sattigeri. Large language model confidence estimation via black-box access. Transactions on Machine Learning Research (TMLR), 2025

[32]

J. Podolak and R. Verma. Read your own mind: Reasoning helps surface self-confidence signals in LLMs. 2025

[33]

Qwen Team. Qwen2.5-Coder technical report, 2024

[34]

P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2383--2392, 2016. doi:10.18653/v1/D16-1264

[35]

E. Stengel-Eskin, P. Hase, and M. Bansal. LACIE: Listener-aware finetuning for confidence calibration in large language models. In Advances in Neural Information Processing Systems 38 (NeurIPS), 2024

[36]

K. Tian, E. Mitchell, A. Zhou, A. Sharma, R. Rafailov, H. Yao, C. Finn, and C. D. Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5433--5442, 2023

[37]

M. Tian, L. Gao, S. D. Zhang, X. Chen, C. Fan, X. Guo, R. Haas, P. Ji, K. Krongchon, Y. Li, S. Liu, D. Luo, Y. Ma, H. Tong, K. Trinh, C. Tian, Z. Wang, B. Wu, Y. Xiong, S. Yin, M. Zhu, K. Lieret, Y. Lu, G. Liu, Y. Du, T. Tao, O. Press, J. Callan, E. Huerta, and H. Peng. SciCode: A research coding benchmark curated by scientists. In Advances in Neural Inf...

2024

[38]

E. Tulving. Memory and consciousness. Canadian Psychology / Psychologie canadienne, 26(1):1--12, 1985. doi:10.1037/h0080017

[39]

D. Ulmer, M. Gubri, H. Lee, S. Yun, and S. J. Oh. Calibrating large language models using their generations only. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 15440--15459, Bangkok, Thailand, 2024

[40]

G. J. G. Upton. Fisher's exact test. Journal of the Royal Statistical Society. Series A (Statistics in Society), 155(3):395--402, 1992. ISSN 09641998, 1467985X

[41]

X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations (ICLR), 2023

[42]

Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In Advances in Neural Information Processing Systems 37 (NeurIPS Datasets and Benchmarks Track), 2024

[43]

M. Xiong, Z. Hu, X. Lu, Y. Li, J. Fu, J. He, and B. Hooi. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. In Proceedings of the 12th International Conference on Learning Representations (ICLR), 2024

[44]

A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu. Qwen2.5 tec...

2024

[45]

D. Yang, Y.-H. H. Tsai, and M. Yamada. On verbalized confidence scores for LLMs, 2024b. Poster, ICLR 2025 Workshop QUESTION

[46]

D. Yoon, S. Kim, S. Yang, S. Kim, S. Kim, Y. Kim, E. Choi, Y. Kim, and M. Seo. Reasoning models better express their confidence. In Advances in Neural Information Processing Systems 39 (NeurIPS), 2025

[47]

X. Zhao, Z. Kang, A. Feng, S. Levine, and D. Song. Learning to reason without external rewards. In Proceedings of the 14th International Conference on Learning Representations (ICLR), 2026
