pith. machine review for the scientific record.

arXiv:2605.14186 · v1 · submitted 2026-05-13 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 04:46 UTC · model grok-4.3

classification 💻 cs.LG
keywords metacognition · LLM reasoning · test-time scaling · self-monitoring · feeling of knowing · judgment of learning · benchmark improvement

The pith

Large language models can use their own pre- and post-solution self-assessments to control inference and raise accuracy on reasoning tasks without any training or fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that LLMs generate useful signals about whether they know an answer before trying and whether their solution is correct after trying. These signals are turned into a control system called a metacognitive harness that decides whether to accept a solution, retry with feedback, or combine several attempts. When applied to a fixed Claude model on text, code, and multimodal benchmarks, this raises average accuracy from 48.3 percent to 56.9 percent and beats existing top entries on key leaderboards. A sympathetic reader would care because it suggests that current models already hold the knowledge needed to scale their own performance at test time, but lack an explicit way to act on that knowledge.

Core claim

Inspired by the Nelson-Narens theory of metacognition, the work demonstrates that LLMs possess latent metacognitive ability in the form of feeling-of-knowing signals before solving and judgment-of-learning signals after each attempt. By separating these monitoring signals from the reasoning process and using them to guide decisions on trust, retry with compact feedback, and aggregation, the metacognitive harness improves a fixed base model across diverse benchmarks without parameter updates.

What carries the argument

The metacognitive harness, which elicits pre-solve feeling-of-knowing (FOK) and post-solve judgment-of-learning (JOL) signals from the LLM and uses them as control inputs to decide trust, retry, or aggregate.

If this is right

  • Substantially improves accuracy on text, code, and multimodal reasoning benchmarks using a fixed model.
  • Raises pooled accuracy from 48.3 to 56.9 on public benchmark snapshots.
  • Exceeds strongest leaderboard entries on HLE-Verified, LiveCodeBench v6, and R-Bench-V.
  • Requires no parameter updates or benchmark-specific fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar harnesses could be applied to other base models to test if metacognitive signals are a general property of strong LLMs.
  • The separation of monitoring from reasoning may enable more robust self-correction loops in future systems.
  • Explicit control interfaces might unlock further test-time scaling beyond current single-pass or simple sampling methods.

Load-bearing premise

The feeling-of-knowing and judgment-of-learning signals produced by the LLM are reliable, consistent, and free from systematic bias so that they can effectively direct the control decisions.

What would settle it

If the harness produced lower accuracy than the base model on a held-out set of problems where the self-monitoring signals do not correlate with actual success, that would falsify the claim that these signals can be harnessed effectively.

Figures

Figures reproduced from arXiv:2605.14186 by Peijia Qin, Pengtao Xie, Qi Cao, Shuhao Zhang, Yufan Wang.

Figure 1: LLMs exhibit metacognitive signals, but do not use them to control reasoning. (a) We directly prompt each LLM to report a scalar self-assessment in [0, 1] before answering, denoted as FOK (Feeling of Knowing), and after answering, denoted as JOL (Judgment of Learning). (b) These self-reported scores are meaningfully correlated with actual correctness: examples with higher FOK/JOL scores achieve higher accu…

Figure 2: Metacognitive harness. Inspired by the Nelson–Narens metacognition theory, we instantiate metacognition as a two-level control loop for language model reasoning. The meta level monitors the model's reasoning state through self-reported signals, including pre-solve feeling of knowing (FOK) and post-solve judgment of learning (JOL), while the object level performs test-time scaling actions such as solving, …

Figure 3: Three diagnosis cards illustrating the verdict rubric. Each card lists six graded rows ({FOK, JOL, Joint} × {AUROC, ECE}) and a final verdict. Sonnet-4.6 (left) is the only model in the panel that passes every row; Gemini3-Flash (middle) passes the discrimination rows but fails on calibration ECEs; Gemma 4 (right) fails discrimination on FOK and calibration on both raw ECEs. The remaining six models fall w…

Figure 4: SVM decision function. Kernel selection for the joint metacognition classifier on Sonnet-4.6. Each panel shows the decision surface of an SVM trained with StandardScaler and isotonic calibration; titles report out-of-fold AUROC under RepeatedStratifiedKFold (5 splits × 3 repeats). Background color is the predicted P(correct); the dashed line marks the 0.5 boundary. (A minimal sketch of this pipeline follows the figure list.)

Figure 5: Discussions of confidence scores. (a) The harness spends more attempts on low-JOL1 problems, where it also yields larger gains. (b) After harnessing, effort decreases with confidence for both FOK and JOL. (c) JOL varies much more across problems than within the same problem, motivating our choice to use confidence for retry/stop decisions but not for aggregation. Confidence scores are not directly actionab…

Figure 6: Per-model accuracy by confidence band. Each subplot shows one model's accuracy on the Low (bottom 30%), Medium (middle 40%), and High (top 30%) bands for FOK (light blue, pre-solve confidence) and JOL (deep blue, post-solve confidence). Accuracy generally increases with confidence for most models and signals, though the trend is not strictly monotonic in every case. The main paper reports two cross-model a…

Figure 7: Per-model reasoning length by confidence band. Each subplot shows reasoning length on the Low / Medium / High bands for FOK (light orange) and JOL (deep orange), normalised by the model's overall mean length (dotted line at 1.0). The direction varies across models (Anthropic models shorten on high-confidence items, several open-weight models lengthen), but in no case does low confidence consistently buy m…

Figure 8: Per-model metacognition diagnosis cards (extended). The six models not shown in …

Figure 9: Retrieval-stage metamemory in the Nelson–Narens framework. A modern schematic of the retrieval-stage monitoring–control process described by Nelson and Narens [17]. A preliminary feeling-of-knowing judgment gates whether search should be initiated; candidate retrieval and confidence evaluation then determine whether to output an answer, continue searching, or terminate with no answer. We use this cognitive…
original abstract

Large language models (LLMs) often expose useful signals of self-monitoring: before solving a problem, they can estimate whether they are likely to succeed, and after solving it, they can judge whether their answer is likely to be correct. However, these signals are typically measured or elicited in isolation, rather than used to control inference. In this work, we ask whether LLMs possess latent metacognitive ability that can be turned into effective test-time control. Inspired by the Nelson–Narens theory from cognitive psychology, we propose a metacognitive harness that separates monitoring from reasoning. For each problem, the model first reports a pre-solve feeling-of-knowing (FOK) signal; after each solve attempt, it reports a post-solve judgment-of-learning (JOL) signal. Rather than treating these signals as passive confidence estimates, the harness turns them into an explicit control interface for reasoning: it decides when to trust the current solution, when to retry with compact metacognitive feedback, and when to pass multiple attempts to a final aggregator. Across text, code, and multimodal reasoning benchmarks, our harness substantially improves a fixed Claude Sonnet-4.6 base model without parameter updates or benchmark-specific fine-tuning. On the evaluated public benchmark snapshots, it raises pooled accuracy from 48.3 to 56.9 and exceeds the strongest listed leaderboard entries on the three primary evaluation settings: HLE-Verified, LiveCodeBench v6, and R-Bench-V. These results suggest that strong LLMs may already possess useful metacognitive ability, but require an explicit control harness to act on it during reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a metacognitive harness, inspired by Nelson-Narens theory, that elicits pre-solve feeling-of-knowing (FOK) and post-solve judgment-of-learning (JOL) signals from an LLM to control test-time decisions: whether to trust a solution, retry with metacognitive feedback, or aggregate multiple attempts. Applied without parameter updates or fine-tuning to a fixed Claude Sonnet-4.6 model, the harness is reported to raise pooled accuracy from 48.3% to 56.9% across text, code, and multimodal benchmarks and to exceed listed leaderboard entries on HLE-Verified, LiveCodeBench v6, and R-Bench-V.

Significance. If the FOK/JOL signals demonstrably supply control value beyond equivalent multi-attempt compute, the result would show that strong LLMs already possess latent, actionable metacognitive monitoring that can be turned into a general test-time scaling mechanism without training. The parameter-free, cross-domain nature of the harness is a strength, but the current evidence does not yet isolate the metacognitive component from generic ensembling.

major comments (2)
  1. [Abstract] The reported lift from 48.3% to 56.9% pooled accuracy is presented as evidence that the metacognitive harness supplies unique control value, yet no ablation is described against a signal-agnostic baseline that performs the same average number of model calls and applies the identical final aggregator; without this comparison, the attribution to FOK/JOL control rather than to generic multi-attempt scaling remains unsupported.
  2. [Abstract, Results] Decision thresholds, prompt templates for eliciting FOK and JOL, and statistical controls (e.g., variance across runs, significance tests) are not specified, leaving the link between the elicited signals and the observed accuracy gains only partially supported and difficult to reproduce.
minor comments (1)
  1. [Abstract] The three primary evaluation settings are named, but no per-benchmark breakdown or error analysis is provided to show where the harness helps or hurts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and precise comments. The distinction between metacognitive control and generic multi-attempt scaling is central to the paper's claim, and we appreciate the opportunity to strengthen the evidence for it. We address each major comment below.

point-by-point responses
  1. Referee: [Abstract] The reported lift from 48.3% to 56.9% pooled accuracy is presented as evidence that the metacognitive harness supplies unique control value, yet no ablation is described against a signal-agnostic baseline that performs the same average number of model calls and applies the identical final aggregator; without this comparison, the attribution to FOK/JOL control rather than to generic multi-attempt scaling remains unsupported.

    Authors: We agree that the current manuscript does not contain a direct ablation against a compute-matched, signal-agnostic baseline, which leaves the unique contribution of the FOK/JOL-driven decisions incompletely isolated. In the revised manuscript we will add exactly this comparison: for each benchmark we will run the identical average number of model calls per problem, apply the same final aggregator, but replace the metacognitive decision logic (trust/retry/aggregate based on FOK and JOL) with either (a) random selection among attempts or (b) a fixed policy that always aggregates the maximum number of attempts (both controls are sketched after these responses). The resulting accuracy will be reported alongside the harness results. We expect the metacognitive policy to retain a measurable advantage; if it does not, we will revise the claims accordingly. revision: yes

  2. Referee: [Abstract, Results] Decision thresholds, prompt templates for eliciting FOK and JOL, and statistical controls (e.g., variance across runs, significance tests) are not specified, leaving the link between the elicited signals and the observed accuracy gains only partially supported and difficult to reproduce.

    Authors: We acknowledge the omissions. In the revision we will add: (1) the complete prompt templates used to elicit FOK (pre-solve) and JOL (post-solve) in an appendix; (2) the exact numerical thresholds and decision rules that map the elicited signals to the actions trust/retry/aggregate, including how thresholds were selected; (3) standard deviations and full per-run accuracies across at least three independent executions for all reported figures; and (4) statistical significance tests (paired t-tests on accuracy and McNemar's test on per-problem correctness) comparing the harness to the base model and to the new signal-agnostic baseline. These additions will make the causal link between the metacognitive signals and the observed gains fully reproducible. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper describes an empirical procedure: elicit pre-solve FOK and post-solve JOL signals from a fixed LLM, then apply an external rule-based harness to decide trust/retry/aggregate. No equations, fitted parameters, or self-citations are load-bearing; the claimed accuracy lift (48.3 to 56.9) is presented as an observed outcome on public benchmarks rather than a quantity that reduces to the inputs by construction. The control logic is independent of the evaluation data and does not rename or smuggle in prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLMs can produce usable metacognitive signals when prompted and that these signals can be turned into reliable control decisions without further training.

axioms (1)
  • domain assumption: LLMs produce reliable feeling-of-knowing and judgment-of-learning signals when explicitly prompted.
    This is invoked as the basis for the harness to function as an effective control interface.

pith-pipeline@v0.9.0 · 5613 in / 1342 out tokens · 40756 ms · 2026-05-15T04:46:21.059145+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 10 internal anchors

  1. [1]

    Harness engineering: Leveraging codex in an agent-first world

    Ryan Lopopolo. Harness engineering: Leveraging codex in an agent-first world. https://openai.com/index/harness-engineering/, February 2026. OpenAI Engineering Blog. Accessed: 2026-05-04

  2. [2]

    GPT-4 Technical Report

    Josh Achiam et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022

  4. [4]

    Scaling llm test-time compute optimally can be more effective than scaling model parameters for reasoning

    Charlie Snell, Jaeho Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters for reasoning. In International Conference on Learning Representations, 2025

  5. [5]

    Code Llama: Open Foundation Models for Code

    Baptiste Rozière et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023

  6. [6]

    Solving Quantitative Reasoning Problems with Language Models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858, 2022

  7. [7]

    PAL: Program-aided Language Models

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435, 2023

  8. [8]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023

  9. [9]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023

  10. [10]

    Effective harnesses for long-running agents

    Anthropic. Effective harnesses for long-running agents. https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents, November 2025. Anthropic Engineering Blog

  11. [11]

    The anatomy of an agent harness

    Vivek Trivedy. The anatomy of an agent harness. https://www.langchain.com/blog/the-anatomy-of-an-agent-harness, March 2026. LangChain Blog

  12. [12]

    Harness capabilities

    LangChain. Harness capabilities. https://docs.langchain.com/oss/python/deepagents/harness, 2026. LangChain Documentation

  13. [13]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Thomas Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022

  14. [14]

    Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms

    Miao Xiong, Zhixiong Hu, Xuming Lu, Yufei Li, Jiaxin Fu, Jiazhen He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. In International Conference on Learning Representations, 2024

  15. [15]

    Metacognitive capabilities of llms: An exploration in mathematical problem solving

    Aniket Didolkar, Anirudh Goyal, Nan Rosemary Ke, Siyuan Guo, Michal Valko, Timothy Lillicrap, Danilo Rezende, Yoshua Bengio, Michael Mozer, and Sanjeev Arora. Metacognitive capabilities of llms: An exploration in mathematical problem solving. In Advances in Neural Information Processing Systems, volume 37, 2024

  16. [16]

    Fact-level confidence calibration and self-correction

    Yifan Yuan, Bin Xu, Hao Tan, Fei Sun, Tong Xiao, Wen Li, Hua Shen, and Xueqi Cheng. Fact-level confidence calibration and self-correction. arXiv preprint arXiv:2411.13343, 2024

  17. [17]

    Metamemory: A theoretical framework and new findings

    Thomas O. Nelson and Louis Narens. Metamemory: A theoretical framework and new findings. In Gordon H. Bower, editor, Psychology of Learning and Motivation, volume 26, pages 125–173. Academic Press, 1990

  18. [18]

    Metacognition and cognitive monitoring: A new area of cognitive-developmental inquiry

    John H. Flavell. Metacognition and cognitive monitoring: A new area of cognitive-developmental inquiry. American Psychologist, 34(10):906–911, 1979

  19. [19]

    Memory and the feeling-of-knowing experience

    John T. Hart. Memory and the feeling-of-knowing experience. Journal of Educational Psychology, 56(4):208–216, 1965

  20. [20]

    What determines initial feeling of knowing? Familiarity with question terms, not with the answer

    Lynne M. Reder and Frank E. Ritter. What determines initial feeling of knowing? Familiarity with question terms, not with the answer. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18(3):435–451, 1992

  21. [21]

    Metacognition: Knowing about Knowing

    Janet Metcalfe and Arthur P. Shimamura, editors. Metacognition: Knowing about Knowing. MIT Press, 1994

  22. [22]

    On Verbalized Confidence Scores for LLMs

    Dong Yang, Yu-Hsuan H. Tsai, and Makoto Yamada. On verbalized confidence scores for llms. arXiv preprint arXiv:2412.14737, 2024

  23. [23]

    Tuning-free accountable intervention for llm deployment: A metacognitive approach

    Zhen Tan, Jie Peng, Song Wang, Lijie Hu, Tianlong Chen, and Huan Liu. Tuning-free accountable intervention for llm deployment: A metacognitive approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 25237–25245, 2025

  24. [24]

    Decoupling metacognition from cognition: A framework for quantifying metacognitive ability in llms

    Guoqing Wang, Wen Wu, Guangze Ye, Zhenxiao Cheng, Xi Chen, and Hong Zheng. Decoupling metacognition from cognition: A framework for quantifying metacognitive ability in llms. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 25353–25361, 2025

  25. [25]

    Large language models have intrinsic meta-cognition, but need a good lens

    Ziyang Ma, Qingyue Yuan, Zhenglin Wang, and Deyu Zhou. Large language models have intrinsic meta-cognition, but need a good lens. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3460–3477, 2025

  26. [26]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Sean Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, volume 36, 2023

  27. [27]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, volume 36, 2023

  28. [28]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  29. [29]

    Solving math word problems with process- and outcome-based feedback

    Jonathan Uesato, Nate Kushman, Ramana Kumar, H. Francis Song, Noah Yamamoto Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022

  30. [30]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International Conference on Learning Representations, 2024

  31. [31]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023

  32. [32]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Fei-Fei Li, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

  33. [33]

    Skywork-reward-v2: Scaling preference data curation via human-ai synergy

    Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, and Yahui Zhou. Skywork-reward-v2: Scaling preference data curation via human-ai synergy. arXiv preprint arXiv:2507.01352, 2025

  34. [34]

    Visualprm: An effective process reward model for multimodal reasoning

    Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, Lewei Lu, Haodong Duan, Yu Qiao, Jifeng Dai, and Wenhai Wang. Visualprm: An effective process reward model for multimodal reasoning. arXiv preprint arXiv:2503.10291, 2025

  35. [35]

    Learning to reason across parallel samples for llm reasoning

    Jianing Qi, Xi Ye, Hao Tang, Zhigang Zhu, and Eunsol Choi. Learning to reason across parallel samples for llm reasoning. arXiv preprint arXiv:2506.09014, 2025

  36. [36]

    Claude sonnet 4.6

    Anthropic. Claude sonnet 4.6. https://www.anthropic.com/claude/sonnet, 2026. Accessed 2026-05-03

  37. [37]

    Claude sonnet 4.6 system card

    Anthropic. Claude sonnet 4.6 system card. https://www.anthropic.com/claude-sonnet-4-6-system-card, February 2026. Accessed 2026-05-03

  38. [38]

    All models

    OpenAI. All models. https://developers.openai.com/api/docs/models/all, 2026. Accessed 2026-05-03

  39. [39]

    Gpt-5.2 model

    OpenAI. Gpt-5.2 model. https://developers.openai.com/api/docs/models/gpt-5.2, 2026. Accessed 2026-05-03

  40. [40]

    Gpt-5 mini model

    OpenAI. Gpt-5 mini model. https://developers.openai.com/api/docs/models/gpt-5-mini, 2026. Accessed 2026-05-03

  41. [41]

    o3 model

    OpenAI. o3 model. https://developers.openai.com/api/docs/models/o3, 2026. Accessed 2026-05-03

  42. [42]

    o4-mini model

    OpenAI. o4-mini model. https://developers.openai.com/api/docs/models/o4-mini, 2026. Accessed 2026-05-03

  43. [43]

    Claude opus 4.6 system card

    Anthropic. Claude opus 4.6 system card. https://www.anthropic.com/claude-opus-4-6-system-card, February 2026. Accessed 2026-05-03

  44. [44]

    Model system cards

    Anthropic. Model system cards. https://www.anthropic.com/system-cards, 2026. Accessed 2026-05-03

  45. [45]

    Gemini 3 developer guide

    Google AI for Developers. Gemini 3 developer guide. https://ai.google.dev/gemini-api/docs/gemini-3, 2026. Accessed 2026-05-03

  46. [46]

    Gemini 2.5 pro

    Google AI for Developers. Gemini 2.5 pro. https://ai.google.dev/gemini-api/docs/models/gemini-2.5-pro, 2026. Accessed 2026-05-03

  47. [47]

    Gemini api models

    Google AI for Developers. Gemini api models. https://ai.google.dev/gemini-api/docs/models, 2026. Accessed 2026-05-03

  48. [48]

    Qwen2.5-vl-72b-instruct

    Qwen Team. Qwen2.5-vl-72b-instruct. https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct, 2025. Accessed 2026-05-03

  49. [49]

    Qwen2.5-VL

    Qwen Team. Qwen2.5-VL. https://qwenlm.github.io/blog/qwen2.5-vl/, January 2025. Accessed 2026-05-03

  50. [50]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

  51. [51]

    FOK: metacognitive assessor, forbidden from solving. Inputs: {problem, image}. Outputs: {domain, FOK, FOK_reason}

  52. [52]

    Solve: same agent, now solving. Inputs: {problem, image, FOK, FOK_reason}. Outputs: {reasoning, answer}

  53. [53]

    JOL: same agent, post-hoc self-rating. Inputs: {problem, image, FOK, FOK_reason, own reasoning, own answer} (all in-call). Outputs: {JOL_score, JOL_reason}

  54. [54]

    Retry_k: same agent, instructed to try a different method. Inputs: {problem, image, FOK, FOK_reason, history of (answer, JOL, JOL_reason)} without previous reasoning chains. Outputs: new {reasoning, answer, JOL_score, JOL_reason}

  55. [55]

    try a DIFFERENT approach and address the concerns raised in previous JOL reasons

    Select: independent judge agent, forbidden from producing new answers. Inputs: {problem, image, shuffled list of (answer, reasoning) for all attempts}. Outputs: {selected_index, justification}. A few design choices are worth highlighting: no reasoning leakage from Stage 1 to Stage 2. The FOK_reason from Stage 1 is short and intuition-level by construction (the system pro...

  56. [56]

    Attempt 1: Nghe An (incorrect)

    The solver assumes the defeat occurred during the late Lam Son uprising (c. 1424–1425) and guesses the province where those campaigns were concentrated

  57. [57]

    Attempt 2: Nam Dinh (correct)

    The solver re-anchors on the earlier phase of the occupation, identifies the Battle of Bo Co (1408) as Mu Sheng’s first major defeat, and correctly locates the engagement in the coastal Red-River-delta area corresponding to modern Nam Dinh

  58. [58]

    Attempt 3: Ninh Binh (incorrect)

    Now the solver fixes on the Battle of Bo Co but reasons geographically from the Day River estuary, which it places in Ninh Binh

  59. [59]

    Attempt 4: Thai Binh (incorrect)

    The solver explicitly notes that the previous three attempts disagreed and tries yet another distributary of the Red River delta. Aggregation and outcome: because the four candidates are split four ways, string-consensus does not fire and the question is not a code task, so the hybrid aggregator falls through to the select-...

  60. [60]

    Attempt 1: Req = 4R (incorrect)

    The solver misreads the diagram as a 4×2 grid of diamond cells and applies a series-of-bridges decomposition. The post-solve JOL (0.72) is below the high-confidence regime, signalling that the topology assumption is shaky

  61. [61]

    Attempt 2: Req = 2R (correct)

    The solver re-counts and now reads a 3×2 diamond lattice, exploits top–bottom symmetry to merge equipotential nodes, and computes Req = 2R. Post-solve JOL rises sharply, the retry signal drops below τ, and the harness stops. Outcome: final answer 2R, correct, two attempts. Takeaway: this is the directed-retry regime: the controller ...