
arxiv: 2605.06673 · v1 · submitted 2026-04-21 · 💻 cs.CL · cs.AI · cs.LG

Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas

Pith reviewed 2026-05-11 01:15 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG
keywords metacognitive monitoring, large language models, MMLU, domain variation, verbalized confidence, Type-2 AUROC, model families, benchmark domains

The pith

Frontier LLMs display large, consistent differences in monitoring their own accuracy across MMLU knowledge domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates metacognitive monitoring in 33 frontier large language models by having them answer 1,500 MMLU questions, 250 in each of six domains, and rate their confidence on a 0-100 scale. It calculates a Type-2 AUROC for each model in each domain to measure how well the confidence scores separate correct from incorrect answers. Every model whose aggregate monitoring is above chance shows non-trivial variation between domains, with Applied/Professional knowledge the easiest to monitor and Formal Reasoning and Natural Science the hardest. Profile shapes cluster significantly within some model families (Anthropic, Google-Gemini, Qwen) but not others, and the variation is masked when only aggregate scores are reported. The paper concludes that deployment decisions for specific applications should be preceded by benchmark-stage domain screening rather than relying on overall metrics.

Core claim

Aggregate metacognitive quality scores mask within-model variation across MMLU benchmark domains. Every model with above-chance aggregate monitoring showed non-trivial domain-level variation. Applied/Professional knowledge was reliably the easiest benchmark domain to monitor (mean AUROC = .742, ranked top-2 in 21 of 33 models); Formal Reasoning and Natural Science were reliably the hardest (one of the two ranked bottom-2 in 27 of 33 models).

What carries the argument

Type-2 AUROC per model-domain cell, computed from verbalized 0-100 confidence scores on 250 MMLU items per domain across six a priori domains.
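For readers unfamiliar with the statistic, here is a minimal Python sketch of a Type-2 AUROC for a single model-domain cell. It uses the pairwise (Mann-Whitney) formulation, in which the score is the probability that a randomly chosen correct answer carries higher confidence than a randomly chosen incorrect one. Counting tied confidence values as half a win, the function name `type2_auroc`, and the toy data are all assumptions for illustration; the paper's own tie and exclusion rules are not described on this page.

```python
from typing import Sequence


def type2_auroc(confidence: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a correct answer outranks an incorrect one on confidence.

    Ties count as half a win (an assumed convention, not taken from the paper).
    """
    pos = [c for c, ok in zip(confidence, correct) if ok]
    neg = [c for c, ok in zip(confidence, correct) if not ok]
    if not pos or not neg:
        raise ValueError("AUROC is undefined without both correct and incorrect answers")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy cell of six items; these are not data from the paper.
conf = [90, 80, 80, 60, 95, 40]
hit = [True, True, False, False, True, False]
print(type2_auroc(conf, hit))
```

A value of 0.5 means confidence carries no information about correctness, and 1.0 means every correct answer received higher confidence than every incorrect one.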

If this is right

  • Applied/Professional knowledge domains yield reliably higher monitoring accuracy than others across nearly all models.
  • Formal Reasoning and Natural Science domains yield the lowest monitoring accuracy in most models.
  • Profile shapes of domain monitoring performance cluster significantly within certain model families such as Anthropic, Google-Gemini, and Qwen.
  • Newer model versions can show substantial monitoring improvements, as seen in Gemma 4 over Gemma 3.
  • Binary KEEP/WITHDRAW probes and verbalized confidence probes can produce discrepant results for the same model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Application-specific screening of models for metacognitive reliability could improve outcomes in domains like law or medicine where monitoring is stronger.
  • The domain differences may reflect uneven training data distributions across subjects rather than a general metacognitive capacity.
  • Future work could test whether these monitoring profiles predict performance on real-world tasks outside the MMLU benchmark.
  • Developers might benefit from training techniques that target weaker domains to flatten these monitoring profiles.

Load-bearing premise

That the verbalized 0-100 confidence scores genuinely measure metacognitive monitoring ability rather than just surface-level response patterns, and that the six MMLU domains capture distinct monitoring demands.

What would settle it

Finding uniform monitoring AUROC scores across all six domains when the same models are tested on a fresh set of items or when confidence is extracted from internal logits instead of verbalized responses.

Figures

Figures reproduced from arXiv: 2605.06673 by Jon-Paul Cacioli.

Figure 1: Type-2 AUROC by model (rows) and MMLU-domain bin (columns). Color scale is …
Figure 2: Mean Type-2 AUROC per domain across 33 models, with standard deviation bars. Ap…
Figure 3: Family-level mean aggregate AUROC (bars) with observed minimum–maximum range …
Figure 4: Ipsative domain profiles for all 33 models, colored by family. Thin lines show individual …
Figure 5: Aggregate Type-2 AUROC across generations for three families. Left: Anthropic (Opus …
Figure 6: Split-half stability of aggregate Type-2 AUROC across 33 models. Each point is one …
Figure 7: Cross-benchmark consistency between Classical Minds battery AUROC and MMLU atlas …
Original abstract

Aggregate metacognitive quality scores mask within-model variation across MMLU benchmark domains. We administered 1,500 MMLU items (250 per domain, under an a priori six-domain grouping) to 33 frontier LLMs from eight model families and computed Type-2 AUROC per model-domain cell using verbalized confidence (0-100). Total observations: 47,151. Every model with above-chance aggregate monitoring showed non-trivial domain-level variation. Applied/Professional knowledge was reliably the easiest benchmark domain to monitor (mean AUROC = .742, ranked top-2 in 21 of 33 models); Formal Reasoning and Natural Science were reliably the hardest (one of the two ranked bottom-2 in 27 of 33 models). The three middle domains were statistically indistinguishable (Kendall's W = .164). A subject-level coherence analysis (within-domain similarity ratio = 0.95) confirms the six-domain grouping is a pragmatic benchmark taxonomy, not a validated latent construct. Within-family profile-shape clustering is significant for Anthropic, Google-Gemini, and Qwen (permutation p < .0001) but not DeepSeek, Google-Gemma, or OpenAI. Gemma 4 31B showed a +.202 AUROC improvement over Gemma 3 27B. Three models classified Invalid on binary KEEP/WITHDRAW probes produced normal profiles under verbalized confidence, confirming probe-format specificity. Bootstrap 95% CIs on 198 cells have median width .199. Split-half aggregate stability r = .893; profile-level split-half is weaker (grand median r = .184). These results show stable benchmark-domain variation obscured by aggregate metrics, and support benchmark-stage domain screening as a step before deployment in specific application areas.
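The abstract reports bootstrap 95% CIs per cell without saying which bootstrap variant was used. The sketch below shows one common choice, a percentile bootstrap that resamples items with replacement within a cell. The function names, the number of resamples, the tie convention, and the synthetic 250-item cell are assumptions for illustration and are not taken from the paper.

```python
import random
from bisect import bisect_left, bisect_right
from typing import Sequence, Tuple


def auroc(conf: Sequence[float], correct: Sequence[bool]) -> float:
    """Type-2 AUROC with ties counted as half (an assumed convention)."""
    pos = [c for c, ok in zip(conf, correct) if ok]
    neg = sorted(c for c, ok in zip(conf, correct) if not ok)
    if not pos or not neg:
        raise ValueError("need both correct and incorrect answers")
    wins = 0.0
    for p in pos:
        below = bisect_left(neg, p)          # incorrect answers with lower confidence
        tied = bisect_right(neg, p) - below  # incorrect answers with equal confidence
        wins += below + 0.5 * tied
    return wins / (len(pos) * len(neg))


def bootstrap_ci(conf: Sequence[float], correct: Sequence[bool],
                 n_boot: int = 1000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for one cell, resampling items with replacement."""
    if all(correct) or not any(correct):
        raise ValueError("need both correct and incorrect answers")
    rng = random.Random(seed)
    n = len(conf)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [correct[i] for i in idx]
        if all(ys) or not any(ys):
            continue  # skip resamples where AUROC is undefined
        stats.append(auroc([conf[i] for i in idx], ys))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Synthetic 250-item cell (hypothetical, not data from the paper).
rng = random.Random(1)
correct = [rng.random() < 0.7 for _ in range(250)]
conf = [min(100, max(0, round(rng.gauss(75 if ok else 60, 15)))) for ok in correct]

point = auroc(conf, correct)
low, high = bootstrap_ci(conf, correct)
print(f"AUROC = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}], width = {high - low:.3f}")
```

The interval width printed at the end is the quantity whose median the abstract reports as .199 across the 198 cells.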

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents results from testing 33 frontier LLMs on 1,500 MMLU items (250 per domain under an a priori six-domain taxonomy), computing Type-2 AUROC for metacognitive monitoring from verbalized 0-100 confidence scores (47,151 total observations). The central claims are that every model with above-chance aggregate monitoring exhibits non-trivial domain-level variation, that Applied/Professional knowledge is reliably easiest to monitor (mean AUROC .742, top-2 rank in 21/33 models), and that Formal Reasoning and Natural Science are reliably hardest (bottom-2 in 27/33 models), with supporting analyses of within-family profile clustering, subject-level coherence (ratio 0.95), and stability metrics.

Significance. If the domain-specific patterns hold, the work would be significant as a large-scale empirical atlas demonstrating that aggregate metacognition metrics obscure domain differences with potential implications for LLM deployment. The scale (33 models across eight families), explicit reporting of split-half and bootstrap stability, and probe-format specificity findings are strengths that enable readers to evaluate the results directly.

major comments (2)
  1. [Stability analysis] Stability analysis (as reported in the abstract and results): The grand median split-half correlation for domain profiles is only r=.184 (vs. r=.893 for aggregate scores), with a median bootstrap CI width of .199 across 198 cells. This low within-model consistency indicates that the specific AUROC patterns across domains may largely reflect sampling variability in the 250-item samples rather than stable, non-trivial metacognitive differences, directly undermining the claims that Applied/Professional knowledge is 'reliably' easiest and Formal Reasoning/Natural Science 'reliably' hardest.
  2. [Methods] Methods section: The manuscript provides insufficient detail on the exact verbalized-confidence prompts, the a priori item selection process for the 250 items per domain, and the precise AUROC computation (including handling of invalid or tied responses). These omissions hinder verification of whether the Type-2 AUROC validly isolates metacognitive monitoring and whether the six-domain grouping accurately captures distinct monitoring abilities, which is load-bearing for interpreting the reported domain rankings.
minor comments (1)
  1. [Abstract] The abstract states 47,151 observations but does not explicitly account for the difference from the expected 33 × 1,500 = 49,500; a brief statement of exclusion criteria would improve clarity.
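The size of the gap raised in the minor comment can be worked out from figures already in the abstract. The short Python sketch below does only that arithmetic; it does not explain why observations are missing, and the variable names are illustrative.

```python
n_models = 33
n_domains = 6
items_per_domain = 250
reported = 47_151  # total observations stated in the abstract

expected = n_models * n_domains * items_per_domain  # 33 x 1,500 = 49,500
missing = expected - reported                       # 2,349
cells = n_models * n_domains                        # 198 model-domain cells

print(f"expected {expected}, missing {missing} "
      f"({100 * missing / expected:.2f}% of expected)")
print(f"mean usable items per cell: {reported / cells:.1f} of {items_per_domain}")
```

So about 4.75% of the expected observations are absent, which leaves roughly 238 usable items per cell on average if the shortfall were spread evenly. The page does not say whether it is.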

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major point below, providing clarifications and indicating revisions where the feedback identifies areas for improvement. We have revised the manuscript to incorporate additional methodological detail and to moderate language around domain patterns in light of the reported stability metrics.

Point-by-point responses
  1. Referee: [Stability analysis] Stability analysis (as reported in the abstract and results): The grand median split-half correlation for domain profiles is only r=.184 (vs. r=.893 for aggregate scores), with a median bootstrap CI width of .199 across 198 cells. This low within-model consistency indicates that the specific AUROC patterns across domains may largely reflect sampling variability in the 250-item samples rather than stable, non-trivial metacognitive differences, directly undermining the claims that Applied/Professional knowledge is 'reliably' easiest and Formal Reasoning/Natural Science 'reliably' hardest.

    Authors: We agree that the reported split-half correlation of r=.184 for domain profiles and median bootstrap CI width of .199 indicate substantial within-model sampling variability, which the manuscript already quantifies explicitly. This variability means that individual model-domain AUROCs should be interpreted with caution and that claims of stability at the single-model level are not supported. However, the cross-model consistency in domain rankings (Applied/Professional knowledge ranked top-2 in 21/33 models; Formal Reasoning and Natural Science bottom-2 in 27/33 models) provides convergent evidence across independent models that the observed ordering is unlikely to be pure noise. We have revised the abstract, results, and discussion to replace 'reliably' with more qualified language (e.g., 'consistently observed across models' or 'most frequently ranked') and to foreground the stability metrics as a limitation when interpreting within-model profiles. We have also added a dedicated subsection discussing the implications of the low profile-level stability for deployment decisions. revision: partial

  2. Referee: [Methods] Methods section: The manuscript provides insufficient detail on the exact verbalized-confidence prompts, the a priori item selection process for the 250 items per domain, and the precise AUROC computation (including handling of invalid or tied responses). These omissions hinder verification of whether the Type-2 AUROC validly isolates metacognitive monitoring and whether the six-domain grouping accurately captures distinct monitoring abilities, which are load-bearing for interpreting the reported domain rankings.

    Authors: We acknowledge that the original Methods section lacked sufficient granularity for full reproducibility. In the revised manuscript we have expanded this section to include: (1) the complete verbatim verbalized-confidence prompts administered to each model; (2) the explicit a priori criteria and sampling procedure used to select the 250 items per domain from the MMLU corpus, including how the six-domain taxonomy was applied; and (3) the exact AUROC implementation details, including exclusion rules for non-numeric or out-of-range confidence responses, tie-breaking conventions, and the software package/version used for computation. These additions directly address verification of the Type-2 AUROC as a metacognitive measure and the pragmatic nature of the domain grouping (already supported by the subject-level coherence ratio of 0.95 reported in the paper). revision: yes
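The first response leans on cross-model rank counts (21 of 33 and 27 of 33) as evidence against pure noise. One rough way to put a chance baseline under those counts is sketched below in Python. It assumes each model's domain ranking is an independent, uniformly random ordering of the six domains, and it reads "one of the two ranked bottom-2" as "at least one of the two named domains falls in the bottom two". Both are assumptions made here, not analyses from the paper, and independence in particular is doubtful because models within a family share training lineage.

```python
from math import comb


def binom_upper_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_models = 33
n_domains = 6

# Chance that one named domain lands in the top two of a random ranking.
p_top2 = 2 / n_domains
# Chance that at least one of two named domains lands in the bottom two:
# one minus the chance that both bottom slots go to the other four domains.
p_bottom2 = 1 - comb(n_domains - 2, 2) / comb(n_domains, 2)

print("expected top-2 count under chance:", n_models * p_top2)
print("P(at least 21 of 33):", binom_upper_tail(n_models, 21, p_top2))
print("expected bottom-2 count under chance:", n_models * p_bottom2)
print("P(at least 27 of 33):", binom_upper_tail(n_models, 27, p_bottom2))
```

Under these assumptions the chance expectations are 11 and 19.8 models respectively, so the reported counts sit above both. A permutation test that respects family structure would be a more defensible check than this binomial shortcut.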

Circularity Check

0 steps flagged

No significant circularity: direct empirical computation from benchmark data

Full rationale

The paper's central claims rest on straightforward computation of Type-2 AUROC values from verbalized confidence scores (0-100) and ground-truth labels across 1,500 MMLU items administered to 33 models. No derivations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the reported chain. The six-domain grouping is introduced a priori and then checked post-hoc via a within-domain similarity ratio (0.95), while stability is quantified separately via split-half correlations and bootstrap CIs; these are independent empirical checks rather than reductions of the results to their own inputs. The analysis is therefore self-contained against external benchmarks.
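To make the split-half check concrete, the Python sketch below computes a profile-level split-half correlation for one model. Items in each domain are randomly divided into two halves, a Type-2 AUROC is computed per domain in each half, and the two six-value profiles are correlated. The random-split procedure, the function names, the placeholder domain labels, and the synthetic data are assumptions for illustration; the paper's exact procedure is not described on this page.

```python
import random
from typing import Dict, List, Sequence, Tuple

Item = Tuple[int, bool]  # (verbalized confidence 0-100, answer correct?)


def auroc(conf: Sequence[float], correct: Sequence[bool]) -> float:
    """Pairwise Type-2 AUROC with ties counted as half (assumed convention)."""
    pos = [c for c, ok in zip(conf, correct) if ok]
    neg = [c for c, ok in zip(conf, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need both correct and incorrect answers")
    score = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return score / (len(pos) * len(neg))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half_profile_r(cells: Dict[str, List[Item]], seed: int = 0) -> float:
    """Correlate per-domain AUROC profiles from two random halves of the items."""
    rng = random.Random(seed)
    profile_a, profile_b = [], []
    for domain in sorted(cells):
        items = cells[domain][:]
        rng.shuffle(items)
        mid = len(items) // 2
        for half, profile in ((items[:mid], profile_a), (items[mid:], profile_b)):
            profile.append(auroc([c for c, _ in half], [ok for _, ok in half]))
    return pearson(profile_a, profile_b)


def fake_cell(rng: random.Random, n: int, gap: float) -> List[Item]:
    """Synthetic items: a larger gap means confidence separates outcomes better."""
    rows = []
    for _ in range(n):
        ok = rng.random() < 0.7
        mean = 70 + gap if ok else 70 - gap
        rows.append((min(100, max(0, round(rng.gauss(mean, 15)))), ok))
    return rows


# Six placeholder domains with hypothetical separations; not data from the paper.
gen = random.Random(42)
gaps = [4, 6, 7, 8, 9, 12]
cells = {f"domain_{i + 1}": fake_cell(gen, 250, g) for i, g in enumerate(gaps)}
r = split_half_profile_r(cells)
print(f"profile split-half r = {r:.3f}")
```

Because each profile has only six points and each half-cell has about 125 items, this correlation is noisy even when true domain differences exist, which is one reason a low profile-level value can coexist with a high aggregate split-half value.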

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis relies on established statistical methods and existing benchmark data without introducing new free parameters or postulated entities.

axioms (2)
  • domain assumption: Verbalized confidence is a suitable proxy for metacognitive monitoring
    Assumed in the use of Type-2 AUROC with 0-100 verbalized scores.
  • domain assumption: MMLU items can be validly grouped into the six domains a priori
    The paper uses this grouping and later checks it with a subject-level coherence analysis (within-domain similarity ratio = 0.95), which it reads as showing a pragmatic benchmark taxonomy, not a validated latent construct.

pith-pipeline@v0.9.0 · 5625 in / 1580 out tokens · 56663 ms · 2026-05-11T01:15:29.933349+00:00 · methodology


