pith. machine review for the scientific record.

arXiv:2605.14186 · v1 · submitted 2026-05-13 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 04:46 UTC · model grok-4.3

classification 💻 cs.LG
keywords metacognition · LLM reasoning · test-time scaling · self-monitoring · feeling of knowing · judgment of learning · benchmark improvement

The pith

Large language models can use their own pre- and post-solution self-assessments to control inference and raise accuracy on reasoning tasks without any training or fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that LLMs generate useful signals about whether they know an answer before trying and whether their solution is correct after trying. These signals are turned into a control system called a metacognitive harness that decides whether to accept a solution, retry with feedback, or combine several attempts. When applied to a fixed Claude model on text, code, and multimodal benchmarks, this raises average accuracy from 48.3 percent to 56.9 percent and beats existing top entries on key leaderboards. A sympathetic reader would care because it suggests that current models already hold the knowledge needed to scale their own performance at test time, but lack an explicit way to act on that knowledge.

Core claim

Inspired by the Nelson-Narens theory of metacognition, the work demonstrates that LLMs possess latent metacognitive ability in the form of feeling-of-knowing signals before solving and judgment-of-learning signals after each attempt. By separating these monitoring signals from the reasoning process and using them to guide decisions on trust, retry with compact feedback, and aggregation, the metacognitive harness improves a fixed base model across diverse benchmarks without parameter updates.

What carries the argument

The metacognitive harness, which elicits pre-solve feeling-of-knowing (FOK) and post-solve judgment-of-learning (JOL) signals from the LLM and uses them as control inputs to decide trust, retry, or aggregate.

If this is right

  • Substantially improves accuracy on text, code, and multimodal reasoning benchmarks using a fixed model.
  • Raises pooled accuracy from 48.3 to 56.9 on public benchmark snapshots.
  • Exceeds strongest leaderboard entries on HLE-Verified, LiveCodeBench v6, and R-Bench-V.
  • Requires no parameter updates or benchmark-specific fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar harnesses could be applied to other base models to test if metacognitive signals are a general property of strong LLMs.
  • The separation of monitoring from reasoning may enable more robust self-correction loops in future systems.
  • Explicit control interfaces might unlock further test-time scaling beyond current single-pass or simple sampling methods.

Load-bearing premise

The feeling-of-knowing and judgment-of-learning signals produced by the LLM are reliable, consistent, and free from systematic bias so that they can effectively direct the control decisions.

What would settle it

If the harness produced lower accuracy than the base model on a held-out set of problems where the self-monitoring signals do not correlate with actual success, that would falsify the claim that these signals can be harnessed effectively.

Figures

Figures reproduced from arXiv:2605.14186 by Peijia Qin, Pengtao Xie, Qi Cao, Shuhao Zhang, Yufan Wang.

Figure 1: LLMs exhibit metacognitive signals, but do not use them to control reasoning. (a) We directly prompt each LLM to report a scalar self-assessment in [0, 1] before answering, denoted as FOK (Feeling of Knowing), and after answering, denoted as JOL (Judgment of Learning). (b) These self-reported scores are meaningfully correlated with actual correctness: examples with higher FOK/JOL scores achieve higher accu…

Figure 2: Metacognitive harness. Inspired by the Nelson–Narens metacognition theory, we instantiate metacognition as a two-level control loop for language model reasoning. The meta level monitors the model's reasoning state through self-reported signals, including pre-solve feeling of knowing (FOK) and post-solve judgment of learning (JOL), while the object level performs test-time scaling actions such as solving, …

Figure 3: Three diagnosis cards illustrating the verdict rubric. Each card lists six graded rows ({FOK, JOL, Joint} × {AUROC, ECE}) and a final verdict. Sonnet-4.6 (left) is the only model in the panel that passes every row; Gemini3-Flash (middle) passes the discrimination rows but fails on calibration ECEs; Gemma 4 (right) fails discrimination on FOK and calibration on both raw ECEs. The remaining six models fall w…

Figure 4: SVM decision function. Kernel selection for the joint metacognition classifier on Sonnet-4.6. Each panel shows the decision surface of an SVM trained with StandardScaler and isotonic calibration; titles report out-of-fold AUROC under RepeatedStratifiedKFold (5 splits × 3 repeats). Background color is the predicted P(correct); the dashed line marks the 0.5 boundary. (A minimal sketch of this pipeline follows the figure list.)

Figure 5: Discussions of confidence scores. (a) The harness spends more attempts on low-JOL1 problems, where it also yields larger gains. (b) After harnessing, effort decreases with confidence for both FOK and JOL. (c) JOL varies much more across problems than within the same problem, motivating our choice to use confidence for retry/stop decisions but not for aggregation. Confidence scores are not directly actionab…

Figure 6: Per-model accuracy by confidence band. Each subplot shows one model's accuracy on the Low (bottom 30%), Medium (middle 40%), and High (top 30%) bands for FOK (light blue, pre-solve confidence) and JOL (deep blue, post-solve confidence). Accuracy generally increases with confidence for most models and signals, though the trend is not strictly monotonic in every case. The main paper reports two cross-model a…

Figure 7: Per-model reasoning length by confidence band. Each subplot shows reasoning length on the Low / Medium / High bands for FOK (light orange) and JOL (deep orange), normalised by the model's overall mean length (dotted line at 1.0). The direction varies across models (Anthropic models shorten on high-confidence items, several open-weight models lengthen), but in no case does low confidence consistently buy m…

Figure 8: Per-model metacognition diagnosis cards (extended). The six models not shown in …

Figure 9: Retrieval-stage metamemory in the Nelson–Narens framework. A modern schematic of the retrieval-stage monitoring–control process described by Nelson and Narens [17]. A preliminary feeling-of-knowing judgment gates whether search should be initiated; candidate retrieval and confidence evaluation then determine whether to output an answer, continue searching, or terminate with no answer. We use this cognitive…
original abstract

Large language models (LLMs) often expose useful signals of self-monitoring: before solving a problem, they can estimate whether they are likely to succeed, and after solving it, they can judge whether their answer is likely to be correct. However, these signals are typically measured or elicited in isolation, rather than used to control inference. In this work, we ask whether LLMs possess latent metacognitive ability that can be turned into effective test-time control. Inspired by the Nelson–Narens theory from cognitive psychology, we propose a metacognitive harness that separates monitoring from reasoning. For each problem, the model first reports a pre-solve feeling-of-knowing (FOK) signal; after each solve attempt, it reports a post-solve judgment-of-learning (JOL) signal. Rather than treating these signals as passive confidence estimates, the harness turns them into an explicit control interface for reasoning: it decides when to trust the current solution, when to retry with compact metacognitive feedback, and when to pass multiple attempts to a final aggregator. Across text, code, and multimodal reasoning benchmarks, our harness substantially improves a fixed Claude Sonnet-4.6 base model without parameter updates or benchmark-specific fine-tuning. On the evaluated public benchmark snapshots, it raises pooled accuracy from 48.3 to 56.9 and exceeds the strongest listed leaderboard entries on the three primary evaluation settings: HLE-Verified, LiveCodeBench v6, and R-Bench-V. These results suggest that strong LLMs may already possess useful metacognitive ability, but require an explicit control harness to act on it during reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a metacognitive harness, inspired by Nelson-Narens theory, that elicits pre-solve feeling-of-knowing (FOK) and post-solve judgment-of-learning (JOL) signals from an LLM to control test-time decisions: whether to trust a solution, retry with metacognitive feedback, or aggregate multiple attempts. Applied without parameter updates or fine-tuning to a fixed Claude Sonnet-4.6 model, the harness is reported to raise pooled accuracy from 48.3% to 56.9% across text, code, and multimodal benchmarks and to exceed listed leaderboard entries on HLE-Verified, LiveCodeBench v6, and R-Bench-V.

Significance. If the FOK/JOL signals demonstrably supply control value beyond equivalent multi-attempt compute, the result would show that strong LLMs already possess latent, actionable metacognitive monitoring that can be turned into a general test-time scaling mechanism without training. The parameter-free, cross-domain nature of the harness is a strength, but the current evidence does not yet isolate the metacognitive component from generic ensembling.

major comments (2)
  1. [Abstract] The reported lift from 48.3% to 56.9% pooled accuracy is presented as evidence that the metacognitive harness supplies unique control value, yet no ablation is described against a signal-agnostic baseline that performs the same average number of model calls and applies the identical final aggregator; without this comparison, the attribution to FOK/JOL control rather than to generic multi-attempt scaling remains unsupported.
  2. [Abstract, Results] Decision thresholds, prompt templates for eliciting FOK and JOL, and statistical controls (e.g., variance across runs, significance tests) are not specified, leaving the link between the elicited signals and the observed accuracy gains only partially supported and difficult to reproduce.
minor comments (1)
  1. [Abstract] The three primary evaluation settings are named, but no per-benchmark breakdown or error analysis is provided to show where the harness helps or hurts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and precise comments. The distinction between metacognitive control and generic multi-attempt scaling is central to the paper's claim, and we appreciate the opportunity to strengthen the evidence for it. We address each major comment below.

point-by-point responses
  1. Referee: [Abstract] The reported lift from 48.3% to 56.9% pooled accuracy is presented as evidence that the metacognitive harness supplies unique control value, yet no ablation is described against a signal-agnostic baseline that performs the same average number of model calls and applies the identical final aggregator; without this comparison, the attribution to FOK/JOL control rather than to generic multi-attempt scaling remains unsupported.

    Authors: We agree that the current manuscript does not contain a direct ablation against a compute-matched, signal-agnostic baseline, which leaves the unique contribution of the FOK/JOL-driven decisions incompletely isolated. In the revised manuscript we will add exactly this comparison: for each benchmark we will run the identical average number of model calls per problem, apply the same final aggregator, but replace the metacognitive decision logic (trust/retry/aggregate based on FOK and JOL) with either (a) random selection among attempts or (b) a fixed policy that always aggregates the maximum number of attempts (both controls are sketched after these responses). The resulting accuracy will be reported alongside the harness results. We expect the metacognitive policy to retain a measurable advantage; if it does not, we will revise the claims accordingly. revision: yes

  2. Referee: [Abstract, Results] Decision thresholds, prompt templates for eliciting FOK and JOL, and statistical controls (e.g., variance across runs, significance tests) are not specified, leaving the link between the elicited signals and the observed accuracy gains only partially supported and difficult to reproduce.

    Authors: We acknowledge the omissions. In the revision we will add: (1) the complete prompt templates used to elicit FOK (pre-solve) and JOL (post-solve) in an appendix; (2) the exact numerical thresholds and decision rules that map the elicited signals to the actions trust/retry/aggregate, including how thresholds were selected; (3) standard deviations and full per-run accuracies across at least three independent executions for all reported figures; and (4) statistical significance tests (paired t-tests on accuracy and McNemar's test on per-problem correctness) comparing the harness to the base model and to the new signal-agnostic baseline. These additions will make the causal link between the metacognitive signals and the observed gains fully reproducible. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper describes an empirical procedure: elicit pre-solve FOK and post-solve JOL signals from a fixed LLM, then apply an external rule-based harness to decide trust/retry/aggregate. No equations, fitted parameters, or self-citations are load-bearing; the claimed accuracy lift (48.3 to 56.9) is presented as an observed outcome on public benchmarks rather than a quantity that reduces to the inputs by construction. The control logic is independent of the evaluation data and does not rename or smuggle in prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLMs can produce usable metacognitive signals when prompted and that these signals can be turned into reliable control decisions without further training.

axioms (1)
  • domain assumption: LLMs produce reliable feeling-of-knowing and judgment-of-learning signals when explicitly prompted.
    This is invoked as the basis for the harness to function as an effective control interface.

pith-pipeline@v0.9.0 · 5613 in / 1342 out tokens · 40756 ms · 2026-05-15T04:46:21.059145+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 10 internal anchors

  1. [1]

    Harness engineering: Leveraging codex in an agent-first world

    Ryan Lopopolo. Harness engineering: Leveraging codex in an agent-first world. https://openai.com/index/harness-engineering/, February 2026. OpenAI Engineering Blog. Accessed: 2026-05-04

  2. [2]

    GPT-4 Technical Report

    Josh Achiam et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022

  4. [4]

    Scaling llm test-time compute optimally can be more effective than scaling model parameters for reasoning

    Charlie Snell, Jaeho Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters for reasoning. In International Conference on Learning Representations, 2025

  5. [5]

    Code Llama: Open Foundation Models for Code

    Baptiste Rozière et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023

  6. [6]

    Solving Quantitative Reasoning Problems with Language Models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858, 2022

  7. [7]

    PAL: Program-aided Language Models

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435, 2023

  8. [8]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023

  9. [9]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023

  10. [10]

    Effective harnesses for long-running agents

    Anthropic. Effective harnesses for long-running agents. https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents, November 2025. Anthropic Engineering Blog

  11. [11]

    The anatomy of an agent harness

    Vivek Trivedy. The anatomy of an agent harness. https://www.langchain.com/blog/the-anatomy-of-an-agent-harness, March 2026. LangChain Blog

  12. [12]

    Harness capabilities

    LangChain. Harness capabilities. https://docs.langchain.com/oss/python/deepagents/harness, 2026. LangChain Documentation

  13. [13]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Thomas Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022

  14. [14]

    Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms

    Miao Xiong, Zhixiong Hu, Xuming Lu, Yufei Li, Jiaxin Fu, Jiazhen He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. In International Conference on Learning Representations, 2024

  15. [15]

    Metacognitive capabilities of llms: An exploration in mathematical problem solving

    Aniket Didolkar, Anirudh Goyal, Nan Rosemary Ke, Siyuan Guo, Michal Valko, Timothy Lillicrap, Danilo Rezende, Yoshua Bengio, Michael Mozer, and Sanjeev Arora. Metacognitive capabilities of llms: An exploration in mathematical problem solving. In Advances in Neural Information Processing Systems, volume 37, 2024

  16. [16]

    Fact-level confidence calibration and self-correction

    Yifan Yuan, Bin Xu, Hao Tan, Fei Sun, Tong Xiao, Wen Li, Hua Shen, and Xueqi Cheng. Fact-level confidence calibration and self-correction. arXiv preprint arXiv:2411.13343, 2024

  17. [17]

    Metamemory: A theoretical framework and new findings

    Thomas O. Nelson and Louis Narens. Metamemory: A theoretical framework and new findings. In Gordon H. Bower, editor, Psychology of Learning and Motivation, volume 26, pages 125–173. Academic Press, 1990

  18. [18]

    Metacognition and cognitive monitoring: A new area of cognitive-developmental inquiry

    John H. Flavell. Metacognition and cognitive monitoring: A new area of cognitive-developmental inquiry. American Psychologist, 34(10):906–911, 1979

  19. [19]

    Memory and the feeling-of-knowing experience

    John T. Hart. Memory and the feeling-of-knowing experience. Journal of Educational Psychology, 56(4):208–216, 1965

  20. [20]

    What determines initial feeling of knowing? Familiarity with question terms, not with the answer

    Lynne M. Reder and Frank E. Ritter. What determines initial feeling of knowing? Familiarity with question terms, not with the answer. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18(3):435–451, 1992

  21. [21]

    Metacognition: Knowing about Knowing

    Janet Metcalfe and Arthur P. Shimamura, editors. Metacognition: Knowing about Knowing. MIT Press, 1994

  22. [22]

    On Verbalized Confidence Scores for LLMs

    Dong Yang, Yu-Hsuan H. Tsai, and Makoto Yamada. On verbalized confidence scores for llms. arXiv preprint arXiv:2412.14737, 2024

  23. [23]

    Tuning-free accountable intervention for llm deployment: A metacognitive approach

    Zhen Tan, Jie Peng, Song Wang, Lijie Hu, Tianlong Chen, and Huan Liu. Tuning-free accountable intervention for llm deployment: A metacognitive approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 25237–25245, 2025

  24. [24]

    Decoupling metacognition from cognition: A framework for quantifying metacognitive ability in llms

    Guoqing Wang, Wen Wu, Guangze Ye, Zhenxiao Cheng, Xi Chen, and Hong Zheng. Decoupling metacognition from cognition: A framework for quantifying metacognitive ability in llms. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 25353–25361, 2025

  25. [25]

    Large language models have intrinsic meta-cognition, but need a good lens

    Ziyang Ma, Qingyue Yuan, Zhenglin Wang, and Deyu Zhou. Large language models have intrinsic meta-cognition, but need a good lens. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3460–3477, 2025

  26. [26]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Sean Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, volume 36, 2023

  27. [27]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, volume 36, 2023

  28. [28]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  29. [29]

    Solving math word problems with process- and outcome-based feedback

    Jonathan Uesato, Nate Kushman, Ramana Kumar, H. Francis Song, Noah Yamamoto Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022

  30. [30]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International Conference on Learning Representations, 2024

  31. [31]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023

  32. [32]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Fei-Fei Li, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

  33. [33]

    Skywork-reward-v2: Scaling preference data curation via human-ai synergy

    Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, and Yahui Zhou. Skywork-reward-v2: Scaling preference data curation via human-ai synergy. arXiv preprint arXiv:2507.01352, 2025

  34. [34]

    Visualprm: An effective process reward model for multimodal reasoning

    Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, Lewei Lu, Haodong Duan, Yu Qiao, Jifeng Dai, and Wenhai Wang. Visualprm: An effective process reward model for multimodal reasoning. arXiv preprint arXiv:2503.10291, 2025

  35. [35]

    Learning to reason across parallel samples for llm reasoning

    Jianing Qi, Xi Ye, Hao Tang, Zhigang Zhu, and Eunsol Choi. Learning to reason across parallel samples for llm reasoning. arXiv preprint arXiv:2506.09014, 2025

  36. [36]

    Claude sonnet 4.6

    Anthropic. Claude sonnet 4.6. https://www.anthropic.com/claude/sonnet, 2026. Accessed 2026-05-03

  37. [37]

    Claude sonnet 4.6 system card

    Anthropic. Claude sonnet 4.6 system card. https://www.anthropic.com/claude-sonnet-4-6-system-card, February 2026. Accessed 2026-05-03

  38. [38]

    All models

    OpenAI. All models. https://developers.openai.com/api/docs/models/all, 2026. Accessed 2026-05-03

  39. [39]

    Gpt-5.2 model

    OpenAI. Gpt-5.2 model. https://developers.openai.com/api/docs/models/gpt-5.2, 2026. Accessed 2026-05-03

  40. [40]

    Gpt-5 mini model

    OpenAI. Gpt-5 mini model. https://developers.openai.com/api/docs/models/gpt-5-mini, 2026. Accessed 2026-05-03

  41. [41]

    o3 model

    OpenAI. o3 model. https://developers.openai.com/api/docs/models/o3, 2026. Accessed 2026-05-03

  42. [42]

    o4-mini model

    OpenAI. o4-mini model. https://developers.openai.com/api/docs/models/o4-mini, 2026. Accessed 2026-05-03

  43. [43]

    Claude opus 4.6 system card

    Anthropic. Claude opus 4.6 system card. https://www.anthropic.com/claude-opus-4-6-system-card, February 2026. Accessed 2026-05-03

  44. [44]

    Model system cards

    Anthropic. Model system cards. https://www.anthropic.com/system-cards, 2026. Accessed 2026-05-03

  45. [45]

    Gemini 3 developer guide

    Google AI for Developers. Gemini 3 developer guide. https://ai.google.dev/gemini-api/docs/gemini-3, 2026. Accessed 2026-05-03

  46. [46]

    Gemini 2.5 pro

    Google AI for Developers. Gemini 2.5 pro. https://ai.google.dev/gemini-api/docs/models/gemini-2.5-pro, 2026. Accessed 2026-05-03

  47. [47]

    Gemini api models

    Google AI for Developers. Gemini api models. https://ai.google.dev/gemini-api/docs/models, 2026. Accessed 2026-05-03

  48. [48]

    Qwen2.5-vl-72b-instruct

    Qwen Team. Qwen2.5-vl-72b-instruct. https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct, 2025. Accessed 2026-05-03

  49. [49]

    Qwen2.5-VL

    Qwen Team. Qwen2.5-VL. https://qwenlm.github.io/blog/qwen2.5-vl/, January 2025. Accessed 2026-05-03

  50. [50]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

  51. [51]

    FOK: metacognitive assessor, forbidden from solving. Inputs: {problem, image}. Outputs: {domain, FOK, FOK_reason}

  52. [52]

    Solve: same agent, now solving. Inputs: {problem, image, FOK, FOK_reason}. Outputs: {reasoning, answer}

  53. [53]

    JOL: same agent, post-hoc self-rating. Inputs: {problem, image, FOK, FOK_reason, own reasoning, own answer} (all in-call). Outputs: {JOL_score, JOL_reason}

  54. [54]

    Retry_k: same agent, instructed to try a different method. Inputs: {problem, image, FOK, FOK_reason, history of (answer, JOL, JOL_reason)} without previous reasoning chains. Outputs: new {reasoning, answer, JOL_score, JOL_reason}

  55. [55]

    try a DIFFERENT approach and address the concerns raised in previous JOL reasons

    Select: independent judge agent, forbidden from producing new answers. Inputs: {problem, image, shuffled list of (answer, reasoning) for all attempts}. Outputs: {selected_index, justification}. A few design choices are worth highlighting: no reasoning leakage from Stage 1 to Stage 2. The FOK_reason from Stage 1 is short and intuition-level by construction (the system pro...

  56. [56]

    Attempt 1: Nghe An (incorrect)

    The solver assumes the defeat occurred during the late Lam Son uprising (c. 1424–1425) and guesses the province where those campaigns were concentrated

  57. [57]

    Attempt 2: Nam Dinh (correct)

    The solver re-anchors on the earlier phase of the occupation, identifies the Battle of Bo Co (1408) as Mu Sheng’s first major defeat, and correctly locates the engagement in the coastal Red-River-delta area corresponding to modern Nam Dinh

  58. [58]

    Attempt 3: Ninh Binh (incorrect)

    Now the solver fixes on the Battle of Bo Co but reasons geographically from the Day River estuary, which it places in Ninh Binh

  59. [59]

    Attempt 4: Thai Binh (incorrect)

    The solver explicitly notes that the previous three attempts disagreed and tries yet another distributary of the Red River delta. Aggregation and outcome: because the four candidates are split four ways, string-consensus does not fire and the question is not a code task, so the hybrid aggregator falls through to the select-...

  60. [60]

    Attempt 1: Req = 4R (incorrect)

    The solver misreads the diagram as a 4×2 grid of diamond cells and applies a series-of-bridges decomposition. The post-solve JOL (0.72) is below the high-confidence regime, signalling that the topology assumption is shaky

  61. [61]

    Attempt 2: Req = 2R (correct)

    The solver re-counts and now reads a 3×2 diamond lattice, exploits top–bottom symmetry to merge equipotential nodes, and computes Req = 2R. Post-solve JOL rises sharply, the retry signal drops below τ, and the harness stops. Outcome: final answer 2R, correct, two attempts. Takeaway: this is the directed-retry regime: the controller ...