Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas

Jon-Paul Cacioli

arXiv: 2605.06673 · v1 · submitted 2026-04-21 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-11 01:15 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords metacognitive monitoring · large language models · MMLU · domain variation · verbalized confidence · Type-2 AUROC · model families · benchmark domains

The pith

Frontier LLMs display large, consistent differences in monitoring their own accuracy across MMLU knowledge domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates metacognitive monitoring in 33 frontier large language models by having them answer 1,500 MMLU questions spread across six domains and rate their confidence on a 0-100 scale. It calculates Type-2 AUROC scores for each model in each domain to measure how well the confidence scores separate correct from incorrect answers. Even models that perform above chance overall show substantial variation between domains, with applied and professional knowledge proving easiest to monitor and formal reasoning plus natural science the hardest. This domain variation is stable enough to produce family-specific profile shapes in some model providers and is masked when only aggregate scores are reported. The findings indicate that deployment decisions for specific applications should account for these domain differences rather than relying on overall metrics.

Core claim

Aggregate metacognitive quality scores mask within-model variation across MMLU benchmark domains. Every model with above-chance aggregate monitoring showed non-trivial domain-level variation. Applied/Professional knowledge was reliably the easiest benchmark domain to monitor (mean AUROC = .742, ranked top-2 in 21 of 33 models); Formal Reasoning and Natural Science were reliably the hardest (one of the two ranked bottom-2 in 27 of 33 models).

What carries the argument

Type-2 AUROC per model-domain cell, computed from verbalized 0-100 confidence scores on 250 MMLU items per domain across six a priori domains.
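As a rough illustration of the quantity described above, the following minimal sketch computes a Type-2 AUROC for one model-domain cell: the probability that a randomly chosen correct answer received higher confidence than a randomly chosen incorrect one. The function name `type2_auroc`, the toy data, and the tie convention (ties count as half a win) are assumptions for illustration; the summary does not state how the paper handles ties or invalid responses.

```python
def type2_auroc(confidence, correct):
    """Type-2 AUROC: P(confidence on a correct item > confidence on an incorrect item).

    Ties count as 0.5 (an assumed convention). Returns None when the cell
    contains only correct or only incorrect answers, where AUROC is undefined.
    """
    pos = [c for c, ok in zip(confidence, correct) if ok]
    neg = [c for c, ok in zip(confidence, correct) if not ok]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy cell: six items with 0-100 confidence and correctness flags.
confidence = [95, 80, 60, 90, 40, 70]
correct = [True, False, True, True, False, False]

# 7 of the 9 correct/incorrect pairs are ordered correctly.
print(round(type2_auroc(confidence, correct), 3))  # 0.778
```

A value of 0.5 corresponds to confidence that carries no information about correctness, and 1.0 to perfect separation.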

If this is right

Applied/Professional knowledge domains yield reliably higher monitoring accuracy than others across nearly all models.
Formal Reasoning and Natural Science domains yield the lowest monitoring accuracy in most models.
Profile shapes of domain monitoring performance cluster significantly within certain model families such as Anthropic, Google-Gemini, and Qwen.
Newer model versions can show substantial monitoring improvements, as seen in Gemma 4 over Gemma 3.
Binary KEEP/WITHDRAW probes and verbalized confidence probes can produce discrepant results for the same model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Application-specific screening of models for metacognitive reliability could improve outcomes in domains like law or medicine where monitoring is stronger.
The domain differences may reflect uneven training data distributions across subjects rather than a general metacognitive capacity.
Future work could test whether these monitoring profiles predict performance on real-world tasks outside the MMLU benchmark.
Developers might benefit from training techniques that target weaker domains to flatten these monitoring profiles.

Load-bearing premise

That the verbalized 0-100 confidence scores genuinely measure metacognitive monitoring ability rather than just surface-level response patterns, and that the six MMLU domains capture distinct monitoring demands.

What would settle it

Finding uniform monitoring AUROC scores across all six domains when the same models are tested on a fresh set of items or when confidence is extracted from internal logits instead of verbalized responses.

Figures

Figures reproduced from arXiv: 2605.06673 by Jon-Paul Cacioli.

**Figure 2.** Mean Type-2 AUROC per domain across 33 models, with standard deviation bars. Ap…

**Figure 3.** Family-level mean aggregate AUROC (bars) with observed minimum–maximum range…

**Figure 4.** Ipsative domain profiles for all 33 models, colored by family. Thin lines show individual…

**Figure 5.** Aggregate Type-2 AUROC across generations for three families. Left: Anthropic (Opus…

**Figure 6.** Split-half stability of aggregate Type-2 AUROC across 33 models. Each point is one…

**Figure 7.** Cross-benchmark consistency between Classical Minds battery AUROC and MMLU atlas…

Original abstract

Aggregate metacognitive quality scores mask within-model variation across MMLU benchmark domains. We administered 1,500 MMLU items (250 per domain, under an a priori six-domain grouping) to 33 frontier LLMs from eight model families and computed Type-2 AUROC per model-domain cell using verbalized confidence (0-100). Total observations: 47,151. Every model with above-chance aggregate monitoring showed non-trivial domain-level variation. Applied/Professional knowledge was reliably the easiest benchmark domain to monitor (mean AUROC = .742, ranked top-2 in 21 of 33 models); Formal Reasoning and Natural Science were reliably the hardest (one of the two ranked bottom-2 in 27 of 33 models). The three middle domains were statistically indistinguishable (Kendall's W = .164). A subject-level coherence analysis (within-domain similarity ratio = 0.95) confirms the six-domain grouping is a pragmatic benchmark taxonomy, not a validated latent construct. Within-family profile-shape clustering is significant for Anthropic, Google-Gemini, and Qwen (permutation p < .0001) but not DeepSeek, Google-Gemma, or OpenAI. Gemma 4 31B showed a +.202 AUROC improvement over Gemma 3 27B. Three models classified Invalid on binary KEEP/WITHDRAW probes produced normal profiles under verbalized confidence, confirming probe-format specificity. Bootstrap 95% CIs on 198 cells have median width .199. Split-half aggregate stability r = .893; profile-level split-half is weaker (grand median r = .184). These results show stable benchmark-domain variation obscured by aggregate metrics, and support benchmark-stage domain screening as a step before deployment in specific application areas.
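The abstract reports Kendall's W = .164 for the three middle domains, a coefficient of concordance that runs from 0 (no agreement among rankers) to 1 (identical rankings). The sketch below shows the standard no-ties formula, treating each model as a ranker and each domain as a ranked object. The function name `kendalls_w` and the rank matrix are hypothetical; the paper's actual rank data are not reproduced here.

```python
def kendalls_w(rank_rows):
    """Kendall's W for m rankers (rows) each ranking n objects (columns), no ties.

    W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared deviations
    of the per-object rank totals from their mean.
    """
    m = len(rank_rows)
    n = len(rank_rows[0])
    totals = [sum(row[j] for row in rank_rows) for j in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical ranks: five models each ranking three domains (1 = easiest to monitor).
ranks = [
    [1, 2, 3],
    [2, 1, 3],
    [3, 1, 2],
    [1, 3, 2],
    [2, 3, 1],
]
w = kendalls_w(ranks)
print(w)  # rank totals are 9, 10, 11, so agreement is close to zero
```

A low W of this kind is what one would expect if the models do not agree on how to order those domains.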

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper maps domain variation in LLM metacognitive monitoring across 33 models but reports low split-half stability for those domain profiles.

The central takeaway is that aggregate metacognition scores hide domain differences on MMLU, with Applied/Professional knowledge easiest to monitor and Formal Reasoning/Natural Science hardest, and every above-chance model showing some variation. The work is new in its scale: 33 models, eight families, 47k observations, and explicit checks on within-family clustering plus probe-format effects. It does a good job laying out the data transparently, including the bootstrap CIs and the split-half numbers that show aggregate AUROC is stable while domain profiles are not. That honesty about the numbers is useful. The main soft spot is exactly what the paper flags: median split-half correlation for domain profiles sits at .184, with median CI width .199 on the 198 cells. That level of within-model noise makes the specific rankings (top-2 in 21 models, bottom-2 in 27) look more like sampling variation from the 250-item cells than stable domain effects. The a priori six-domain split is treated as pragmatic rather than validated, and the reliance on verbalized 0-100 confidence as the metacognition signal is standard but still an assumption. Readers working on LLM evaluation or deployment in narrow domains will find the descriptive patterns worth looking at, even if they treat the exact domain ordering as provisional. The paper is coherent on its own terms and reports its limitations plainly, so it deserves a serious referee rather than a desk reject; the revisions would mainly be about tightening the language around how much the instability limits the claims.

Referee Report

2 major / 1 minor

Summary. The manuscript presents results from testing 33 frontier LLMs on 1,500 MMLU items (250 per domain under an a priori six-domain taxonomy), computing Type-2 AUROC for metacognitive monitoring from verbalized 0-100 confidence scores (47,151 total observations). The central claims are that every model with above-chance aggregate monitoring exhibits non-trivial domain-level variation, that Applied/Professional knowledge is reliably easiest to monitor (mean AUROC .742, top-2 rank in 21/33 models), and that Formal Reasoning and Natural Science are reliably hardest (bottom-2 in 27/33 models), with supporting analyses of within-family profile clustering, subject-level coherence (ratio 0.95), and stability metrics.

Significance. If the domain-specific patterns hold, the work would be significant as a large-scale empirical atlas demonstrating that aggregate metacognition metrics obscure domain differences with potential implications for LLM deployment. The scale (33 models across eight families), explicit reporting of split-half and bootstrap stability, and probe-format specificity findings are strengths that enable readers to evaluate the results directly.

major comments (2)

[Stability analysis] (as reported in the abstract and results): The grand median split-half correlation for domain profiles is only r=.184 (vs. r=.893 for aggregate scores), with a median bootstrap CI width of .199 across 198 cells. This low within-model consistency indicates that the specific AUROC patterns across domains may largely reflect sampling variability in the 250-item samples rather than stable, non-trivial metacognitive differences, directly undermining the claims that Applied/Professional knowledge is 'reliably' easiest and Formal Reasoning/Natural Science 'reliably' hardest.
[Methods] The manuscript provides insufficient detail on the exact verbalized-confidence prompts, the a priori item selection process for the 250 items per domain, and the precise AUROC computation (including handling of invalid or tied responses). These omissions hinder verification of whether the Type-2 AUROC validly isolates metacognitive monitoring and whether the six-domain grouping accurately captures distinct monitoring abilities, which is load-bearing for interpreting the reported domain rankings.
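To make the stability concern concrete, here is a minimal sketch of a percentile bootstrap confidence interval for the AUROC of a single 250-item cell. The data are simulated, and the accuracy rate, confidence distributions, number of resamples, and function names (`auroc`, `bootstrap_ci`) are all assumptions for illustration; the paper's own bootstrap procedure is not specified in this summary.

```python
import random


def auroc(conf, correct):
    """Rank-based AUROC (Mann-Whitney U / (n_pos * n_neg)); ties get midranks."""
    n_pos = sum(1 for ok in correct if ok)
    n_neg = len(correct) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None  # undefined without both correct and incorrect items
    order = sorted(range(len(conf)), key=lambda i: conf[i])
    rank_sum_pos = 0.0
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and conf[order[j + 1]] == conf[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1  # average 1-based rank of this tie group
        for k in range(i, j + 1):
            if correct[order[k]]:
                rank_sum_pos += midrank
        i = j + 1
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def bootstrap_ci(conf, correct, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample items with replacement, recompute AUROC."""
    rng = random.Random(seed)
    n = len(conf)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = auroc([conf[i] for i in idx], [correct[i] for i in idx])
        if value is not None:
            stats.append(value)
    stats.sort()
    lower = stats[int(alpha / 2 * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper


# Simulated cell (assumed parameters): 250 items, about 70% answered correctly,
# with confidence somewhat higher on correct answers.
sim = random.Random(1)
correct = [sim.random() < 0.7 for _ in range(250)]
conf = [min(100, max(0, round(sim.gauss(80 if ok else 65, 15)))) for ok in correct]

point = auroc(conf, correct)
lower, upper = bootstrap_ci(conf, correct)
print(f"AUROC = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}], width = {upper - lower:.3f}")
```

The interval width depends heavily on how many incorrect answers a cell contains, so cells from highly accurate models can be expected to have wider intervals than this simulated one.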

minor comments (1)

[Abstract] The abstract states 47,151 observations but does not explicitly account for the difference from the expected 33 × 1,500 = 49,500; a brief statement of exclusion criteria would improve clarity.
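The arithmetic behind this comment can be checked directly. The short sketch below uses only the counts quoted in the abstract (33 models, 1,500 items, six domains, 47,151 observations); the variable names are illustrative.

```python
n_models = 33
n_items = 1500
n_domains = 6
reported_observations = 47151

expected = n_models * n_items                 # full design: every model answers every item
missing = expected - reported_observations    # observations not accounted for
cells = n_models * n_domains                  # model-domain cells
items_per_domain = n_items // n_domains

print(expected, missing, round(100 * missing / expected, 2), cells, items_per_domain)
# 49500 2349 4.75 198 250
```

So roughly 4.75% of the full design is absent from the reported total, and the 198 cells match the number of bootstrap intervals mentioned in the abstract.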

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major point below, providing clarifications and indicating revisions where the feedback identifies areas for improvement. We have revised the manuscript to incorporate additional methodological detail and to moderate language around domain patterns in light of the reported stability metrics.

Referee: [Stability analysis] (as reported in the abstract and results): The grand median split-half correlation for domain profiles is only r=.184 (vs. r=.893 for aggregate scores), with a median bootstrap CI width of .199 across 198 cells. This low within-model consistency indicates that the specific AUROC patterns across domains may largely reflect sampling variability in the 250-item samples rather than stable, non-trivial metacognitive differences, directly undermining the claims that Applied/Professional knowledge is 'reliably' easiest and Formal Reasoning/Natural Science 'reliably' hardest.

Authors: We agree that the reported split-half correlation of r=.184 for domain profiles and median bootstrap CI width of .199 indicate substantial within-model sampling variability, which the manuscript already quantifies explicitly. This variability means that individual model-domain AUROCs should be interpreted with caution and that claims of stability at the single-model level are not supported. However, the cross-model consistency in domain rankings (Applied/Professional knowledge ranked top-2 in 21/33 models; Formal Reasoning and Natural Science bottom-2 in 27/33 models) provides convergent evidence across independent models that the observed ordering is unlikely to be pure noise. We have revised the abstract, results, and discussion to replace 'reliably' with more qualified language (e.g., 'consistently observed across models' or 'most frequently ranked') and to foreground the stability metrics as a limitation when interpreting within-model profiles. We have also added a dedicated subsection discussing the implications of the low profile-level stability for deployment decisions.

revision: partial

Referee: [Methods] The manuscript provides insufficient detail on the exact verbalized-confidence prompts, the a priori item selection process for the 250 items per domain, and the precise AUROC computation (including handling of invalid or tied responses). These omissions hinder verification of whether the Type-2 AUROC validly isolates metacognitive monitoring and whether the six-domain grouping accurately captures distinct monitoring abilities, which is load-bearing for interpreting the reported domain rankings.

Authors: We acknowledge that the original Methods section lacked sufficient granularity for full reproducibility. In the revised manuscript we have expanded this section to include: (1) the complete verbatim verbalized-confidence prompts administered to each model; (2) the explicit a priori criteria and sampling procedure used to select the 250 items per domain from the MMLU corpus, including how the six-domain taxonomy was applied; and (3) the exact AUROC implementation details, including exclusion rules for non-numeric or out-of-range confidence responses, tie-breaking conventions, and the software package/version used for computation. These additions directly address verification of the Type-2 AUROC as a metacognitive measure and the pragmatic nature of the domain grouping (already supported by the subject-level coherence ratio of 0.95 reported in the paper).

revision: yes
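The cross-model argument in the first response can be given a rough numerical reading. The sketch below asks how likely the reported counts (21 of 33 and 27 of 33) would be if each model ordered the six domains uniformly at random and independently of every other model. Both the null model and the reading of "one of the two ranked bottom-2" as "at least one of the two named domains is in the bottom two" are assumptions made here, not analyses from the paper. Independence in particular is doubtful, since the paper itself reports within-family profile clustering.

```python
from math import comb


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_models = 33

# Chance that one named domain lands in the top 2 of 6 under a random ordering.
p_top2 = 2 / 6

# Chance that at least one of two named domains lands in the bottom 2 of 6:
# 1 - P(both bottom slots are filled by the other four domains).
p_either_bottom2 = 1 - comb(4, 2) / comb(6, 2)

tail_top = binom_tail(21, n_models, p_top2)
tail_bottom = binom_tail(27, n_models, p_either_bottom2)

print(f"P(>= 21 of 33 top-2 by chance)    = {tail_top:.2e}")
print(f"P(>= 27 of 33 bottom-2 by chance) = {tail_bottom:.2e}")
```

Under these assumptions both tail probabilities fall below 0.01, which is consistent with the authors' point that the ordering is unlikely to be pure noise. Because models from the same family are not independent draws, the true evidential weight is weaker than these numbers suggest.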

Circularity Check

0 steps flagged

No significant circularity: direct empirical computation from benchmark data

The paper's central claims rest on straightforward computation of Type-2 AUROC values from verbalized confidence scores (0-100) and ground-truth labels across 1,500 MMLU items administered to 33 models. No derivations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the reported chain. The six-domain grouping is introduced a priori and then checked post-hoc via a within-domain similarity ratio (0.95), while stability is quantified separately via split-half correlations and bootstrap CIs; these are independent empirical checks rather than reductions of the results to their own inputs. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis relies on established statistical methods and existing benchmark data without introducing new free parameters or postulated entities.

axioms (2)

domain assumption: Verbalized confidence is a suitable proxy for metacognitive monitoring.
Assumed in the use of Type-2 AUROC with 0-100 verbalized scores.
domain assumption: MMLU items can be validly grouped into the six domains a priori.
The paper uses this grouping and checks it with a subject-level coherence analysis (within-domain similarity ratio = 0.95), which it reads as support for a pragmatic benchmark taxonomy, not a validated latent construct.

pith-pipeline@v0.9.0 · 5625 in / 1580 out tokens · 56663 ms · 2026-05-11T01:15:29.933349+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: Every model with above-chance aggregate monitoring showed non-trivial domain-level variation. Applied/Professional knowledge was reliably the easiest benchmark domain to monitor (mean AUROC = .742...); Formal Reasoning and Natural Science were reliably the hardest

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Paper passage: Split-half aggregate stability r = .893; profile-level split-half is weaker (grand median r = .184)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1] Cacioli, Jon-Paul. 2026.
[2] Cacioli, Jon-Paul. 2026.
[3] Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob. 2021.
[4] Kadavath, Saurav and Conerly, Tom and Askell, Amanda and Henighan, Tom and Drain, Dawn and Perez, Ethan and Schiefer, Nicholas and Hatfield-Dodds, Zac and DasSarma, Nova and Tran-Johnson, Eli and others. 2022.
[5] Kim, Jaehwan. 2026. doi:10.20944/preprints202604.0078.v2
[6] Kumaran, Dharshan and Conmy, Arthur and Barbero, Federico and Osindero, Simon and Patraucean, Viorica and Veli. How do. 2026.
[7] Miao, Miranda Muqing and Ungar, Lyle. 2026.
[8] Wu, Sean and Gustafsson, Fredrik K. and Phillips, Edward and Gao, Boyan and Thakur, Anshul and Clifton, David A. 2026.
[9] Steyvers, Mark and Peters, Megan A. K. 2025.
[10] Wen, Bingbing and Bansal, Hritik and Semnani, Sina J. and Lam, Monica S. 2025.
[11] Xiong, Miao and Hu, Zhiyuan and Lu, Xinyang and Li, Yifei and Fu, Jie and He, Junxian and Hooi, Bryan. 2023.
[12] Haznitrama, Faiz Ghifari and Ardi, Faeyza Rishad and Oh, Alice. 2026.
[13] Larrabee, Glenn J. 2012.
