MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition

Farhad Abtahi , Abdolamir Karbalaie , Eduardo Illueca-Fernandez , Fernando Seoane

Authors on Pith no claims yet

Pith reviewed 2026-05-10 08:46 UTC · model grok-4.3

classification 💻 cs.AI

keywords modelsabilityevaluationmetacognitioncontrolmedley-benchunderbehavioural

0 comments

The pith

MEDLEY-BENCH reveals an evaluation/control dissociation in AI metacognition where scale improves reflective scoring but not proportional belief revision, with a consistent knowing/doing gap across 35 models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Metacognition means an AI can watch its own thinking and decide when to change its mind. The new benchmark creates 130 tricky questions across five areas where models must first answer alone, then see a different answer from another model, and decide whether to revise. It scores two things: how well the model reflects on its own output (evaluation) and how sensibly it actually updates its answer (control). Across 35 models the results show bigger models get better at the reflection part but the control part stays flat. Smaller models often performed as well or better than much larger ones. The benchmark also found that every model was relatively weak at turning knowledge into action, a gap the authors call knowing versus doing. Two patterns appeared in how models revised: some followed the quality of the argument, others just followed what most models said.

Core claim

Results show a robust evaluation/control dissociation: evaluation ability increases with model size within families, whereas control does not. ... Smaller and cheaper models often matched or outperformed larger counterparts, suggesting that metacognitive competence is not simply a function of scale.

Load-bearing premise

That the constructed tasks and tier-based MMS/MAS scoring genuinely isolate metacognitive control and revision behavior rather than measuring surface-level response patterns or prompt sensitivity.

Figures

Figures reproduced from arXiv: 2604.16009 by Abdolamir Karbalaie, Eduardo Illueca-Fernandez, Farhad Abtahi, Fernando Seoane.

**Figure 2.** Figure 2: Two-dimensional cognitive map of 11 models under progressive adversarial conditions. X-axis: adver [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Cross-validation of normal-mode judge dimensions as predictors of adversarial behaviour. Scatter plot [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Ipsative ability profiles for the top 20 models. Heatmap with rows representing models (ordered [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

read the original abstract

Metacognition, the ability to monitor and regulate one's own reasoning, remains under-evaluated in AI benchmarking. We introduce MEDLEY-BENCH, a benchmark of behavioural metacognition that separates independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. The benchmark evaluates 35 models from 12 families on 130 ambiguous instances across five domains and reports two complementary scores: the Medley Metacognition Score (MMS), a tier-based aggregate of reflective updating, social robustness, and epistemic articulation, and the Medley Ability Score (MAS), derived from four metacognitive sub-abilities. Results show a robust evaluation/control dissociation: evaluation ability increases with model size within families, whereas control does not. In a follow-up progressive adversarial analysis of 11 models, we observed two behavioural profiles, i.e., models that revise primarily in response to argument quality and models that track consensus statistics. Under within-model relative profiling (ipsative scoring), evaluation was the weakest relative ability in all 35 models, indicating a systematic knowing/doing gap. Smaller and cheaper models often matched or outperformed larger counterparts, suggesting that metacognitive competence is not simply a function of scale. These findings position MEDLEY-BENCH as a tool for measuring belief revision under social pressure and suggest that future training should reward calibrated, proportional updating rather than output quality alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MEDLEY-BENCH introduces a useful separation of private and social revision under disagreement, but the claimed evaluation/control dissociation rests on scoring that may track prompt compliance more than isolated metacognition.

read the letter

The main point is that this benchmark tries to measure whether models can revise their answers when faced with ambiguous cases and conflicting outputs from other models, and it reports that evaluation ability grows with scale while actual control over revisions does not. Smaller models sometimes match or beat larger ones on the control side, which undercuts simple scaling assumptions for metacognitive behavior.

Referee Report

2 major / 2 minor

Summary. The paper introduces MEDLEY-BENCH, a benchmark for behavioral metacognition that separates independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. It evaluates 35 models from 12 families on 130 ambiguous instances across five domains, reporting the Medley Metacognition Score (MMS) as a tier-based aggregate of reflective updating, social robustness, and epistemic articulation, plus the Medley Ability Score (MAS) derived from four metacognitive sub-abilities. Central results claim a robust evaluation/control dissociation: evaluation ability increases with model size within families while control does not; smaller/cheaper models often match or outperform larger ones; ipsative (within-model relative) profiling shows evaluation as the weakest ability across all 35 models, indicating a systematic knowing/doing gap; and a follow-up analysis of 11 models identifies two behavioral profiles (argument-quality-driven vs. consensus-tracking revision).

Significance. If the dissociation and ipsative gap hold after proper validation, the work would be significant for AI metacognition research by providing a tool to measure belief revision under social pressure and showing that scale improves monitoring more than regulation. The within-family comparisons and identification of distinct revision profiles offer concrete, falsifiable patterns that could guide training objectives focused on calibrated updating rather than output quality alone. The benchmark's emphasis on ambiguous instances and inter-model disagreement addresses a gap in current evaluations.

major comments (2)

[Benchmark construction and scoring (Methods/Results sections)] The dissociation claim (evaluation scales with size; control does not) is load-bearing but rests on MMS/MAS tier scoring whose rules for reflective updating, social robustness, and the four MAS sub-abilities are not specified with sufficient detail to confirm isolation from prompt compliance or general instruction-following. Without baseline compliance tasks, matched prompt-complexity controls, or inter-rater reliability statistics for the 130 instances, larger models' higher evaluation scores could arise from superior adherence to revision prompts rather than metacognitive control per se.
[Ipsative scoring and within-model profiling (Results section)] The ipsative profiling result—that evaluation is the weakest relative ability in all 35 models—is central to the knowing/doing gap claim, yet the paper provides no explicit description of how the four MAS sub-abilities are normalized or ranked within each model, nor any statistical test confirming the gap exceeds what would be expected from random variation in tier assignments.

minor comments (2)

[Abstract] The abstract states '130 ambiguous instances' and 'five domains' but does not list the domains or example instances; adding one or two concrete examples would improve clarity without lengthening the paper.
[Follow-up analysis] The progressive adversarial analysis on 11 models is mentioned but lacks a table or figure summarizing the two behavioral profiles (argument-quality vs. consensus-tracking); a small summary table would make the finding easier to evaluate.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on MEDLEY-BENCH. The comments on scoring transparency and statistical rigor for the ipsative analysis are well-taken, and we will revise the manuscript to improve clarity while preserving the core claims.

read point-by-point responses

Referee: The dissociation claim (evaluation scales with size; control does not) is load-bearing but rests on MMS/MAS tier scoring whose rules for reflective updating, social robustness, and the four MAS sub-abilities are not specified with sufficient detail to confirm isolation from prompt compliance or general instruction-following. Without baseline compliance tasks, matched prompt-complexity controls, or inter-rater reliability statistics for the 130 instances, larger models' higher evaluation scores could arise from superior adherence to revision prompts rather than metacognitive control per se.

Authors: We agree that the Methods section would benefit from expanded detail on tier scoring. In revision we will add explicit criteria, decision rules, and examples for assigning tiers to reflective updating, social robustness, and epistemic articulation, plus the exact mapping to the four MAS sub-abilities. The benchmark deliberately uses genuine inter-model disagreement on pre-validated ambiguous items rather than direct instruction prompts; this design choice reduces (though does not eliminate) simple compliance confounds. We will add a limitations paragraph acknowledging the absence of separate baseline compliance controls and note that future extensions could include them. The 130 instances were selected via objective ambiguity heuristics across domains rather than multi-rater subjective scoring, so traditional inter-rater reliability does not directly apply; we will clarify the selection protocol and report any available validation checks. revision: partial
Referee: The ipsative profiling result—that evaluation is the weakest relative ability in all 35 models—is central to the knowing/doing gap claim, yet the paper provides no explicit description of how the four MAS sub-abilities are normalized or ranked within each model, nor any statistical test confirming the gap exceeds what would be expected from random variation in tier assignments.

Authors: We accept that the ipsative procedure requires fuller documentation. The revised manuscript will include a dedicated paragraph describing the within-model normalization (ranking the four sub-ability scores relative to each model's own MAS distribution) and will report a statistical test (e.g., a one-sample sign test or permutation test on the rank positions) demonstrating that evaluation's consistent lowest rank across all 35 models exceeds chance expectation under random tier assignment. These additions will be placed in the Results section alongside the existing profiling figure. revision: yes

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated or can be inferred beyond the general assumption that the benchmark tasks validly measure metacognition.

pith-pipeline@v0.9.0 · 5563 in / 1105 out tokens · 30173 ms · 2026-05-10T08:46:44.704115+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 16 canonical work pages · 6 internal anchors

[1]

DoLLMsknowwhattheyknow? measuringmetacognitiveefficiencywithsignaldetection theory.arXiv preprint arXiv:2603.25112, 2026

Jon-PaulCacioli. DoLLMsknowwhattheyknow? measuringmetacognitiveefficiencywithsignaldetection theory.arXiv preprint arXiv:2603.25112, 2026. URLhttps://arxiv.org/abs/2603.25112

work page arXiv 2026
[2]

Measuring the metacognition of AI

Richard Servajean and Philippe Servajean. Measuring the metacognition of AI.arXiv preprint arXiv:2603.29693, 2026. URLhttps://arxiv.org/abs/2603.29693

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

John H. Flavell. Metacognition and cognitive monitoring: A new area of cognitive-developmental inquiry. American Psychologist, 34(10):906–911, 1979. doi: 10.1037/0003-066X.34.10.906

work page doi:10.1037/0003-066x.34.10.906 1979
[4]

Nelson and Louis Narens

Thomas O. Nelson and Louis Narens. Metamemory: A theoretical framework and new findings. In Psychology of Learning and Motivation, volume 26, pages 125–173. Academic Press, 1990. doi: 10.1016/ S0079-7421(08)60053-5

1990
[5]

Galatzer-Levy, Meredith Ringel Morris, Allan Dafoe, Alison M

Ryan Burnell, Yumeya Yamamori, Orhan Firat, Kate Olszewska, Steph Hughes-Fitt, Oran Kelly, Isaac R. Galatzer-Levy, Meredith Ringel Morris, Allan Dafoe, Alison M. Snyder, Noah D. Goodman, Matthew Botvinick, and Shane Legg. Measuring progress toward AGI: A cognitive framework. Technical report, Google DeepMind, 2026. URLhttps://storage.googleapis.com/deepmi...

2026
[6]

Language Models (Mostly) Know What They Know

Saurav Kadavath et al. Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221,

work page internal anchor Pith review arXiv
[7]

Language Models (Mostly) Know What They Know

doi: 10.48550/arXiv.2207.05221. URLhttps://arxiv.org/abs/2207.05221

work page internal anchor Pith review doi:10.48550/arxiv.2207.05221
[8]

doi: 10.18653/v1/2020.emnlp-main.466

Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. AmbigQA: Answering ambiguous open-domain questions. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5783–5797, Online, 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.466. URLhttps://aclanthology.o...

work page doi:10.18653/v1/2020.emnlp-main.466 2020
[9]

ASQA: Factoid questions meet long- form answers

Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. ASQA: Factoid questions meet long- form answers. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8273–8288, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.566. URLhttps://aclantholo...

work page doi:10.18653/v1/2022.emnlp-main.566 2022
[10]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai et al. Constitutional AI: Harmlessness from AI feedback.arXiv preprint arXiv:2212.08073,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Constitutional AI: Harmlessness from AI Feedback

doi: 10.48550/arXiv.2212.08073. URLhttps://arxiv.org/abs/2212.08073

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2212.08073
[12]

Manning, Stefano Ermon, and Chelsea Finn

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023. URLhttps://openreview.net/forum?id= HPuSIXJaa9

2023
[13]

InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES)

Aaron Fanous, Jacob Goldberg, Ank A. Agarwal, Joanna Lin, Anson Zhou, Roxana Daneshjou, and Sanmi Koyejo. SycEval: Evaluating LLM sycophancy.Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 8(1):893–900, 2025. doi: 10.1609/aies.v8i1.36598

work page doi:10.1609/aies.v8i1.36598 2025
[14]

Bowman, Nandi Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Sam Ringer, Rose E

Mrinank Sharma, Katie Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Nandi Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Sam Ringer, Rose E. Yan, Ethan Zhang, Ethan Perez, and Nicholas Schiefer. Towards understanding sycophancy in language models. InThe Twelfth...

2024
[15]

Atkinson

Sohaib Imran, Ihor Kendiukhov, Matthew Broerman, Aditya Thomas, Riccardo Campanella, Rob Lamb, and Peter M. Atkinson. Are LLM belief updates consistent with bayes’ theorem?arXiv preprint arXiv:2507.17951, 2025

work page arXiv 2025
[16]

Morton Deutsch and Harold B. Gerard. A study of normative and informational social influences upon individual judgment.The Journal of Abnormal and Social Psychology, 51(3):629–636, 1955. doi: 10.1037/ h0046408

1955
[17]

Leveraging imperfection with MEDLEY: a multi- model approach harnessing bias in medical AI.Frontiers in Artificial Intelligence, 9:1701665, 2026

Farhad Abtahi, Mehdi Astaraki, and Fernando Seoane. Leveraging imperfection with MEDLEY: a multi- model approach harnessing bias in medical AI.Frontiers in Artificial Intelligence, 9:1701665, 2026. doi: 10.3389/frai.2026.1701665. URLhttps://www.frontiersin.org/journals/artificial-intelligence/ articles/10.3389/frai.2026.1701665/full. 12

work page doi:10.3389/frai.2026.1701665 2026
[18]

Fleming and Hakwan C

Stephen M. Fleming and Hakwan C. Lau. How to measure metacognition.Frontiers in Human Neuro- science, 8:443, 2014. doi: 10.3389/fnhum.2014.00443. URLhttps://www.frontiersin.org/journals/ human-neuroscience/articles/10.3389/fnhum.2014.00443/full

work page doi:10.3389/fnhum.2014.00443 2014
[19]

Farrar, Straus and Giroux, New York, 2011

Daniel Kahneman.Thinking, Fast and Slow. Farrar, Straus and Giroux, New York, 2011

2011
[20]

arXiv preprint arXiv:2603.15381 , year =

Emmanuel Dupoux, Yann LeCun, and Jitendra Malik. Why AI systems don’t learn and what to do about it: Lessons on autonomous learning from cognitive science.arXiv preprint arXiv:2603.15381, 2026. URL https://arxiv.org/abs/2603.15381

work page arXiv 2026
[21]

Measuring Faithfulness in Chain-of-Thought Reasoning

Tamera Lanham et al. Measuring faithfulness in chain-of-thought reasoning.arXiv preprint arXiv:2307.13702, 2023. doi: 10.48550/arXiv.2307.13702. URLhttps://arxiv.org/abs/2307.13702

work page Pith review doi:10.48550/arxiv.2307.13702 2023
[22]

Language models don’t always say what they think: Unfaithful explana- tions in chain-of-thought prompting

Miles Turpin et al. Language models don’t always say what they think: Unfaithful explana- tions in chain-of-thought prompting. InAdvances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023. URLhttps://proceedings.neurips.cc/paper_files/paper/2023/hash/ f2543511e5f4d4764857f9ad833a977d-Abstract-Conference.html

2023
[23]

Regulation (EU) 2024/1689 of the european parliament and of the council of 13 june 2024 laying down harmonised rules on artificial intelligence (artificial intelligence act)

European Union. Regulation (EU) 2024/1689 of the european parliament and of the council of 13 june 2024 laying down harmonised rules on artificial intelligence (artificial intelligence act). Official Journal of the European Union, 2024. URLhttps://eur-lex.europa.eu/eli/reg/2024/1689/oj

2024
[24]

The Illusion of Insight in Reasoning Models

Liv G. d’Aliberti and Manoel Horta Ribeiro. The illusion of insight in reasoning models.arXiv preprint arXiv:2601.00514, 2026. doi: 10.48550/arXiv.2601.00514. URLhttps://arxiv.org/abs/2601.00514. 13 Appendix A Formal measure definitions LetC A j ,C P j , andC S j denote the model’s confidence on claimjat Steps A, B-Private, and B-Social, respectively. Let...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.00514 2026