pith. sign in

arxiv: 2605.23384 · v1 · pith:XP2WRVK3new · submitted 2026-05-22 · 💻 cs.CL · cs.AI

Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals

Pith reviewed 2026-05-25 04:42 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords metacognitionreinforcement learningLLM reasoningreward designprocess supervisionknowledge coverageregulation fidelitytrajectory optimization
0
0 comments X

The pith

MaR improves LLM reasoning by rewarding explicit metacognitive knowledge and regulation signals in reasoning trajectories without instance-specific rubrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Metacognition-as-Reward (MaR) to supply process-level guidance during reinforcement learning for large language models. It extracts two general signals from model outputs: metacognitive knowledge that identifies task-relevant information and metacognitive regulation that plans and adjusts the reasoning steps. These signals combine with final-answer correctness into a single trajectory-level reward that optimizes entire rollouts. The method addresses the narrow scope of outcome-only rewards and the design burden of per-instance rubrics. A sympathetic reader would care because the framework aims to produce better intermediate reasoning behaviors at scale.

Core claim

MaR scaffolds model rollouts into explicit metacognitive components and optimizes them with a trajectory-level reward over task knowledge coverage, regulation fidelity, and final-answer correctness. This extends reward feedback to reasoning trajectories while grounding the reward signals in general metacognitive dimensions, producing up to 7.7 percent gains over base models and up to 11.0 percent gains over vanilla DAPO on 22 benchmarks.

What carries the argument

Scaffolding of model rollouts into metacognitive knowledge and regulation components followed by trajectory-level reward computation on coverage, fidelity, and correctness.

If this is right

  • MaR-trained models exhibit measurable improvements in reasoning process quality beyond final-answer accuracy.
  • The method generalizes to out-of-domain datasets where base models show smaller gains.
  • Smaller models augmented with MaR can surpass substantially larger frontier models on overall average scores.
  • Reward signals grounded in metacognition reduce dependence on hand-crafted instance rubrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the extracted signals remain stable across model scales, the same reward structure could transfer to agentic planning systems outside pure language models.
  • Process-level optimization of this form might reduce the sample complexity needed to reach high performance on multi-step reasoning tasks.
  • Future work could test whether the regulation component alone drives most of the observed gains or whether knowledge coverage is equally necessary.

Load-bearing premise

General metacognitive knowledge and regulation signals can be reliably extracted from model rollouts without instance-specific hand-crafted rubrics and that optimizing for them produces genuine improvements in reasoning quality rather than artifacts of the reward construction.

What would settle it

An ablation study in which removing the metacognitive reward terms causes the performance gains to disappear on the same 22 benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.23384 by Beier Zhu, Chaochao Lu, Hanwang Zhang, Lei Xu, Shengjie Zhao, Sirui Chen, Yutian Chen, Yu Wang, Yuying Zhao.

Figure 1
Figure 1. Figure 1: Overview of MaR. MaR follows a three-stage loop: the policy generates multiple structured rollouts consisting of metacognitive knowledge, metacognitive regulation, and the final answer; a grader scores each rollout along knowledge monitoring, regulation monitoring, and answer correct￾ness; and the resulting reward scores are used to optimize the policy. use multi-agent RL to separate high-level meta-thinki… view at source ↗
Figure 9
Figure 9. Figure 9: Figure 14 presents an example rollout [PITH_FULL_IMAGE:figures/full_fig_p004_9.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the evaluation datasets. Six in science, four in medical, three in long-context reasoning, seven in mathematical reasoning, and two in logical reasoning. These datasets con￾tain approximately 16K test exam￾ples in total, providing broad task cov￾erage and supporting reliable evalu￾ation. (1) Science datasets: GPQA￾Diamond (Rein et al., 2024), SuperG￾PQA (Du et al., 2025), BioProBench (Liu et al… view at source ↗
Figure 4
Figure 4. Figure 4: Metacognitive process scores on long-context reasoning tasks. An external grader scores the generated metacognitive process along three reward components: KMR, RMR, and CR. The green annotations indicate the absolute improvements (%) brought by MaR. and RMR by 10.7%, indicating that the model becomes better at identifying relevant knowledge and regulating its reasoning process. CR also improves by 9.8% on … view at source ↗
Figure 5
Figure 5. Figure 5: further analyzes the relationship among KMR, RMR, and CR by computing their pairwise Spearman correlations (Spearman, 1904). We find that: KMR RMR CR KMR RMR CR 1 0.321 0.624 0.321 1 0.541 0.624 0.541 1 DocQA KMR RMR CR 1 0.376 0.585 0.376 1 0.495 0.585 0.495 1 LongMIT KMR RMR CR 1 0.356 0.554 0.356 1 0.472 0.554 0.472 1 LongReward 0.4 0.6 0.8 1.0 [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Model generalizability across math and logical reasoning benchmarks. We take the Qwen3.5-4B/9B+MaR as the reference baseline in each panel. The green squares mark the Qwen3.5- 4B/9B+MaR’s absolute performance. The blue circles show the corresponding baseline model, po￾sitioned according to the performance difference. Negative radial values indicate that the baseline model performs worse than its MaR-traine… view at source ↗
Figure 7
Figure 7. Figure 7: Effect of grader choice. We train the Qwen3.5-9B with MaR using varying sizes of graders. The figure reports average accuracy on science and medical benchmarks, with dashed lines indicating the Qwen3.5-9B baseline. 65.5 66.0 66.5 67.0 67.5 Average Acc [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Prompt template for rollout generation. The prompt asks the policy πθ to generate the MK, MR, optional lookback and final answer. Grader Scoring Prompt Template You are a strict evaluator for scoring. Your task is to evaluate the actor's output and return exactly five values: - k: the number of gold knowledge units covered by the actor's metacognitive knowledge (MK) - r: the number of initially missing gol… view at source ↗
Figure 10
Figure 10. Figure 10: Prompt template for grader scoring. The prompt instructs the grader to assess whether each rollout from πθ covers the required metacognitive knowledge, follows its stated regulation plan, and produces a correct final answer. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Prompt template for data generation. The prompt asks the model to generate both gold knowledge and possible meta regulation. The latter is an auxiliary field and is not used during training. B DATASET DETAILS We present the details of the training set in Figures 12 and 13. Training Data Example (Medical) { "sample_id": "sample_11180", "prompt": [ { "role": "user", "content": "A one-year-old child presents… view at source ↗
Figure 12
Figure 12. Figure 12: Training data example from the medical domain. The example illustrates the input prompt and the corresponding metacognitive supervision signals used for training. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Training data example from the science domain. The example illustrates the input prompt and the corresponding metacognitive supervision signals used for training. C DETAILED TRAINING SETTINGS For RL training, we set both the maximum prompt length and response length to 4096 tokens, with an overlong buffer of 256 tokens enabled, no KL reward/loss. The lower and upper clip ratios are to 0.2 and 0.28, respec… view at source ↗
Figure 14
Figure 14. Figure 14: Example generation from the policy πθ. During training, πθ can organize its response into the required metacognitive format. For the given question, it extracts relevant metacognitive knowledge, provides a detailed reasoning plan in metacognitive regulation, triggers LOOKBACK under uncertainty, and produces a faithful and correct final answer. E BROADER IMPACTS This work may have positive societal impacts… view at source ↗
read the original abstract

Recent RL methods have substantially improved the reasoning abilities of LLMs. Existing reward designs mainly follow two paradigms: (1) Reinforcement learning with verifiable rewards (RLVR) derives outcome signals from executable checks or ground-truth answers, but provides limited guidance for intermediate reasoning behaviors. (2) Rubrics-as-reward (RaR) goes beyond final-answer checking by using natural-language rubrics to assess reasoning quality and task compliance, but often requires instance-specific rubrics and substantial design effort. To address these issues, we introduce Metacognition-as-Reward (MaR), a metacognition-inspired RL framework that guides LLM reasoning through two general process dimensions: i) metacognitive knowledge, which identifies task-relevant information without hand-crafted instance-specific rubrics, and ii) metacognitive regulation, which plans and adjusts the reasoning process to provide reward guidance beyond final-answer outcomes. MaR scaffolds model rollouts into explicit metacognitive components and optimizes them with a trajectory-level reward over task knowledge coverage, regulation fidelity, and final-answer correctness. In this way, MaR extends reward feedback to reasoning trajectories while grounding the reward signals in general metacognitive dimensions. Experiments on 22 benchmarks show that MaR consistently improves model performance, achieving up to a 7.7% gain over the base model and up to an 11.0% gain over vanilla DAPO. Notably, Qwen3.5-9B + MaR narrows the gap to frontier models, surpassing GPT-OSS-120B on overall average and outperforming stronger models on several individual benchmarks. Process-level analysis further shows substantial improvements in reasoning process quality. MaR also generalizes to out-of-domain datasets, where MaR-trained models improve over their corresponding base models on average.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Metacognition-as-Reward (MaR), an RL framework that scaffolds LLM rollouts into explicit metacognitive knowledge (task-relevant information identification without instance-specific rubrics) and regulation (planning/adjustment) components. It optimizes a trajectory-level reward combining knowledge coverage, regulation fidelity, and final-answer correctness, claiming consistent gains on 22 benchmarks (up to 7.7% over base models, 11.0% over vanilla DAPO), OOD generalization, and improved reasoning process quality, with Qwen3.5-9B + MaR surpassing GPT-OSS-120B on average.

Significance. If the metacognitive signals prove reliably extractable and yield genuine reasoning gains independent of reward-construction artifacts, MaR would provide a scalable, general alternative to RLVR and RaR paradigms, reducing design effort for process-level feedback in LLM reasoning. The reported OOD generalization and process improvements would strengthen the case for metacognition-inspired rewards.

major comments (3)
  1. [Abstract / §3 (Methods)] The abstract (and by extension the methods) provides no concrete mechanism, algorithm, or pseudocode for how metacognitive knowledge coverage and regulation fidelity are identified and scored from raw rollouts without instance-specific rubrics. This extraction step is load-bearing for the central claim of generality and for ruling out reward artifacts; without it, the 7.7%/11.0% gains cannot be attributed to the proposed signals rather than final-answer correctness or implicit heuristics.
  2. [Experiments / Results] No ablations, statistical significance tests, or variance reporting are referenced for the individual reward components (knowledge coverage, regulation fidelity) versus the composite reward. This leaves open whether the reported improvements on 22 benchmarks are driven by the metacognitive terms or would arise from any additional process signal.
  3. [Process-level analysis] The process-level analysis is asserted to show 'substantial improvements in reasoning process quality,' but no specific metrics, inter-annotator agreement, or comparison to human judgments of reasoning steps are described. This is required to substantiate that gains reflect better intermediate reasoning rather than optimization artifacts.
minor comments (2)
  1. [§3.3] Clarify the exact weighting or normalization used when combining the three reward terms into the trajectory-level objective.
  2. [Table 1 / Results] Add explicit comparison tables showing per-benchmark breakdowns against all baselines, including confidence intervals.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We address each of the major comments below, agreeing that additional details and analyses are needed to strengthen the presentation of our method and results. We will incorporate the suggested revisions in the updated version of the paper.

read point-by-point responses
  1. Referee: [Abstract / §3 (Methods)] The abstract (and by extension the methods) provides no concrete mechanism, algorithm, or pseudocode for how metacognitive knowledge coverage and regulation fidelity are identified and scored from raw rollouts without instance-specific rubrics. This extraction step is load-bearing for the central claim of generality and for ruling out reward artifacts; without it, the 7.7%/11.0% gains cannot be attributed to the proposed signals rather than final-answer correctness or implicit heuristics.

    Authors: We agree that the abstract is high-level and that the methods section would benefit from more explicit details on the extraction process. Although the manuscript describes scaffolding rollouts into metacognitive components, we will add a concrete algorithm description, pseudocode, and examples illustrating how knowledge coverage and regulation fidelity are identified and scored from raw rollouts in the revised §3. This will help demonstrate that the signals are general and not reliant on instance-specific rubrics or solely on final-answer correctness. revision: yes

  2. Referee: [Experiments / Results] No ablations, statistical significance tests, or variance reporting are referenced for the individual reward components (knowledge coverage, regulation fidelity) versus the composite reward. This leaves open whether the reported improvements on 22 benchmarks are driven by the metacognitive terms or would arise from any additional process signal.

    Authors: We acknowledge this limitation in the current experiments section. In the revision, we will include ablations that isolate the effects of each reward component, report statistical significance tests (e.g., paired t-tests), and provide variance across multiple random seeds to show that the improvements are attributable to the metacognitive signals rather than generic process rewards. revision: yes

  3. Referee: [Process-level analysis] The process-level analysis is asserted to show 'substantial improvements in reasoning process quality,' but no specific metrics, inter-annotator agreement, or comparison to human judgments of reasoning steps are described. This is required to substantiate that gains reflect better intermediate reasoning rather than optimization artifacts.

    Authors: We agree that the process-level analysis section requires more rigorous documentation. We will expand it to specify the metrics used for assessing reasoning process quality, include inter-annotator agreement scores if applicable, and add comparisons against human judgments of reasoning steps to better substantiate that the observed gains correspond to improved intermediate reasoning processes. revision: yes

Circularity Check

0 steps flagged

No circularity detected; derivation is self-contained

full rationale

The abstract and description present MaR as a new scaffolding and reward framework that extracts metacognitive knowledge/regulation signals from rollouts and combines them with final-answer correctness for trajectory-level optimization. No equations, fitting procedures, self-citations, or uniqueness theorems are referenced that would reduce any claimed prediction or result to the inputs by construction. The reported gains are framed as empirical outcomes on 22 benchmarks rather than derived quantities forced by the reward definition itself. This is the normal case of an independent proposal whose validity rests on external evaluation rather than internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.0 · 5877 in / 1099 out tokens · 21751 ms · 2026-05-25T04:42:37.435483+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

  1. [1]

    Each numbered item is exactly one gold knowledge unit

    Gold Knowledge ------------------------------------------------ These are the only references for evaluating k and r. Each numbered item is exactly one gold knowledge unit. ------------------------------------------------

  2. [2]

    - Do NOT count vague or partial matches

    k -- Covered Gold Knowledge ------------------------------------------------ A gold knowledge unit is covered if the actor's MK expresses the same semantic content, even if worded differently. - Do NOT count vague or partial matches. - Do NOT invent new gold knowledge units. - Coverage is binary: a unit is either covered or not. No partial credit. -------...

  3. [3]

    Step 2: Inspect the LOOKBACK section

    r -- Recovered Missing Gold Knowledge ------------------------------------------------ Step 1: Identify which gold knowledge units are absent from MK. Step 2: Inspect the LOOKBACK section. Count a unit as recovered ONLY if: - it was absent from MK, AND - the same semantic content appears in LOOKBACK. - Recovery is binary: a unit is either recovered or not...

  4. [4]

    Do NOT score whether the final answer is factually correct

    a -- Regulation-Answer Alignment ------------------------------------------------ Score ONLY the consistency between [Model Final Answer] and MR. Do NOT score whether the final answer is factually correct. Use the following as anchor points. Prefer anchor values when the case clearly fits one description. You may use intermediate values (e.g., 0.6, 0.8) o...

  5. [5]

    Do not set s = 1 merely because the response is brief or lacks one labeled section

    s -- Shortcut Flag ------------------------------------------------ Set s = 1 only if there is clear evidence that the actor bypasses its own visible metacognitive process and jumps directly to the final answer. Do not set s = 1 merely because the response is brief or lacks one labeled section. ------------------------------------------------

  6. [6]

    Set c = 1 if the actor's final answer is identical to or semantically equivalent to the Ground Truth

    c -- Final Answer Correctness ------------------------------------------------ Compare ONLY [Model Final Answer] with [Ground Truth]. Set c = 1 if the actor's final answer is identical to or semantically equivalent to the Ground Truth. Set c = 0 otherwise. ------------------------------------------------

  7. [7]

    k": <integer, 0 to {number_of_gold_knowledge}>,

    Additional rules ------------------------------------------------ - If there is no identifiable MK section, set k = 0. - If there is no valid LOOKBACK section, set r = 0. - If there is no identifiable MR or no identifiable final answer, score a conservatively. - Ground Truth must NOT be used to directly score k, r, a, or s. It may be used only for scoring...

  8. [8]

    gold_knowledge A set of atomic gold metacognitive knowledge (e.g., key facts / definitions / constraints / rules / procedures / exceptions) required for solving the task

  9. [9]

    possible_meta_regulation

    possible_meta_regulation One regulation description that captures a reasonable solving process. Important constraints: - Use the task as the primary source. - Use the reference answer only as auxiliary reference for relevance and necessity. - Do NOT simply paraphrase or decompose the reference answer into trivial answer-support bullets. - gold_knowledge u...

  10. [10]

    Analyze the functional groups present (ether vs alcohol)

  11. [11]

    Recall sodium's reactivity profile (reacts with protic compounds, not hydrocarbons/ethers)

  12. [12]

    Identify that diethyl ether has no acidic hydrogen for Na to deprotonate

  13. [13]

    Eliminate all options that require different reactants/mechanisms

  14. [14]

    Nothing happens

    Conclude "Nothing happens" is correct based on ether inertness * **If blocked:** Check if special conditions exist (answer confirms standard conditions apply, no special cases mentioned) * **Noticing:** The answer seems counterintuitive to students expecting a reaction - this is the teaching point of the question ### LOOKBACK: * **Seeking:** Is there any ...