
arxiv: 2603.16331 · v2 · pith:EBFZ7MYGnew · submitted 2026-03-17 · 💻 cs.LG

Decoding the Critique Mechanism in Large Reasoning Models

Pith reviewed 2026-05-25 06:46 UTC · model grok-4.3

classification 💻 cs.LG
keywords large reasoning models · chain-of-thought · self-correction · critique ability · latent steering · error recovery · test-time scaling

The pith

Large reasoning models recover from errors injected into their chain-of-thought via an internal hidden critique ability represented by a steerable critique vector.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large reasoning models can detect and correct mistakes they make during step-by-step reasoning. By deliberately inserting arithmetic errors into intermediate steps of the chain-of-thought, the authors find that models frequently arrive at the right final answer anyway, even though the error is never mentioned or fixed in the visible text. This pattern points to an internal detection process the authors call the hidden critique ability. They locate a corresponding direction in the model's latent space, the critique vector, that can be used to strengthen error detection and improve test-time scaling across model sizes without any retraining.
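
A minimal sketch of how the recovery pattern described above could be tallied, assuming each run has already been labelled with whether its visible reasoning and its final answer are correct. The function names, label strings, and toy records are hypothetical and are not taken from the paper's code.

```python
from collections import Counter

def classify(think_correct: bool, answer_correct: bool) -> str:
    """Map one run to one of four think/answer outcome labels."""
    think = "think_right" if think_correct else "think_wrong"
    answer = "answer_right" if answer_correct else "answer_wrong"
    return f"{think}/{answer}"

def outcome_proportions(records):
    """records: iterable of (think_correct, answer_correct) pairs."""
    counts = Counter(classify(t, a) for t, a in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Hypothetical toy data, NOT results from the paper.
toy = [(True, True), (False, True), (False, False), (False, True)]
props = outcome_proportions(toy)

# The "hidden recovery" case: wrong visible reasoning, right final answer.
hidden_recovery = props.get("think_wrong/answer_right", 0.0)
```

The `think_wrong/answer_right` bucket corresponds to the "× Think ✓ Answer" segment in the Figure 2 caption.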

Core claim

Large reasoning models possess a hidden critique ability that detects errors in intermediate reasoning steps and still produces correct final answers even when the error propagates through the entire chain-of-thought without any verbalized correction. Feature-space analysis isolates a highly interpretable critique vector that encodes this behavior. Steering latent representations along this vector measurably improves error detection and raises test-time scaling performance on multiple model scales and families at zero additional training cost.

What carries the argument

The critique vector, a direction in the model's latent feature space that encodes the hidden critique ability and can be added to steer self-correction.
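
To make "a direction that can be added" concrete, here is a rough sketch under a stated assumption: the direction is taken as the normalized difference between mean activations on error-injected and clean trajectories. The page only says the vector comes from feature-space analysis, so difference-of-means is a stand-in, and every name and number below is hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def critique_direction(error_acts, clean_acts):
    """ASSUMPTION: unit-norm difference of means (error minus clean)."""
    mu_err = mean_vector(error_acts)
    mu_clean = mean_vector(clean_acts)
    diff = [e - c for e, c in zip(mu_err, mu_clean)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("the two activation sets have identical means")
    return [d / norm for d in diff]

def steer(hidden, direction, alpha):
    """Shift a hidden state along the direction: h' = h + alpha * v."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Hypothetical 3-dimensional activations, for illustration only.
error_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
clean_acts = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
v = critique_direction(error_acts, clean_acts)
steered = steer([0.5, 0.5, 0.5], v, alpha=2.0)
```

A positive `alpha` pushes the state toward the error-trajectory side of the direction and a negative `alpha` pushes it away, which mirrors the positive and negative steering contrasted in the figure captions.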

If this is right

  • Steering with the critique vector raises the model's ability to detect its own errors during reasoning.
  • The same steering improves performance of test-time scaling methods without extra training.
  • The effect holds across multiple model scales and families.
  • The vector supplies a concrete way to control and strengthen the self-verification process in large reasoning models.
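
The Figure 3 caption reports three quantities for steering: error detection accuracy, correct solution accuracy, and F1. A small sketch of how they might be computed, assuming F1 is the harmonic mean of the two accuracies (the convention ProcessBench uses; the page itself does not spell this out). The counts are made up for illustration.

```python
def detection_metrics(err_hits, err_total, ok_hits, ok_total):
    """Return (error-detection accuracy, correct-solution accuracy, F1).

    err_hits: erroneous solutions the model correctly flags
    ok_hits:  correct solutions the model correctly accepts
    ASSUMPTION: F1 is the harmonic mean of the two accuracies.
    """
    err_acc = err_hits / err_total
    ok_acc = ok_hits / ok_total
    denom = err_acc + ok_acc
    f1 = 0.0 if denom == 0 else 2 * err_acc * ok_acc / denom
    return err_acc, ok_acc, f1

# Hypothetical counts, NOT numbers from the paper.
baseline = detection_metrics(50, 100, 90, 100)
steered = detection_metrics(80, 100, 80, 100)
```

With these toy counts, detection rises and acceptance of correct solutions falls, yet F1 still improves, which is the shape of trade-off the Figure 3 caption describes for increasing α.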

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the vector proves stable across domains, the same steering approach could be tested on non-arithmetic tasks such as code generation or logical deduction.
  • The separation between visible chain-of-thought and final-answer accuracy suggests final-answer generation may draw on internal states that are not fully reflected in the generated tokens.
  • Ablating or reversing the vector during inference could serve as a diagnostic to test whether other observed self-correction behaviors rely on the same latent direction.

Load-bearing premise

The observed recovery after error injection is produced by an internal critique mechanism rather than by statistical regularities in training data or other unmeasured factors.

What would settle it

Measure whether steering along the identified critique vector changes the rate at which models recover from injected arithmetic errors compared with an orthogonal direction or no steering.
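
One way the orthogonal-direction control could be built, sketched with the standard library only: draw a random vector, project out its component along the critique vector, and rescale it to the same norm so both interventions have equal magnitude. This is an illustrative construction, not a procedure described by the paper, and it needs at least two dimensions.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return dot(a, a) ** 0.5

def orthogonal_control(v, seed=0):
    """Random direction orthogonal to v, rescaled to the norm of v."""
    rng = random.Random(seed)
    r = [rng.gauss(0.0, 1.0) for _ in v]
    # One Gram-Schmidt step: remove the component of r along v.
    scale = dot(r, v) / dot(v, v)
    r = [ri - scale * vi for ri, vi in zip(r, v)]
    # Match the norm so steering strength is comparable.
    factor = norm(v) / norm(r)
    return [ri * factor for ri in r]

# Hypothetical stand-in for a critique vector.
v = [1.0, 2.0, 3.0, 4.0]
control = orthogonal_control(v)
```

Recovery rates would then be compared across three arms with the same coefficient: steering along `v`, steering along `control`, and no steering.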

Figures

Figures reproduced from arXiv: 2603.16331 by Heng Ji, Hoang Phan, Hung T. Q. Le, Khoa D. Doan, Quang H. Nguyen, Xiusi Chen.

Figure 1. Hidden self-correction despite erroneous reasoning in R1-32B. Left: original correct reasoning. Right: injected error (3 + 4 = 6) propagates to incorrect thinking conclusion ($20), yet the model recovers to the correct final answer ($18). We hypothesize this indicates implicit error detection beyond the observable CoT. Full generation details are in Appendix E.1.
Figure 2. Distribution of reasoning-result alignment across R1 and Qwen3 model variants. The proportions represent four distinct outcomes based on the correctness of the internal “thinking” process versus the correctness of the final “answer” on the GSM8K-Error and MATH500-Error benchmarks. In this paper, we analyze the self-correction mechanisms represented by the light blue segments: '× Think ✓ Answer'. …
Figure 4. Effect of different steering coefficients α evaluated on BIG-Bench Mistake. Left to right: Error detection accuracy, correct solution accuracy, and F1 score. The results illustrate similar effects to …
Figure 3. Effect of different steering coefficients α evaluated on ProcessBench. Left to right: Error detection accuracy, correct solution accuracy, and F1 score. Increasing α generally improves error detection and F1 scores while leading to a decline in correct solution accuracy. The dashed vertical line indicates the baseline performance at α = 0.
Figure 6. Qualitative example of negative steering on BIG-Bench arithmetic task with R1-32B. The problem contains a sign error in the calculation of term B (0 − (0 − −7) = −7, not 7). While the baseline model correctly critiques the reasoning and identifies the error, negative steering (coefficient −1.0) makes the model hallucinate that the incorrect calculation is valid.
Figure 5. Qualitative example of positive steering on BIG-Bench arithmetic task with R1-32B. There is a subtle arithmetic error introduced in the prompt (7 + 2 + 20 = 30, whereas it should be 29). While the baseline model hallucinates that the math is correct during verification, positive steering enables the model to successfully catch the addition error.
Figure 7. Impact of steering on model accuracy for test-time scaling. Left to right: Accuracy across three benchmarks: GSM8K-Error, MATH500-Error, and BIG-Bench Mistake. For increasing iterations of test-time scaling, positive steering consistently enhances the performance of all models.
Figure 8. Layer-wise steering results. The separation between the performance of positive and negative steering is minimal for early and late layers. For middle layers, this performance gap is the most noticeable.
Figure 9. Layer-wise linear probing results. Left to right: AUROC and ECE for GSM8K (in-distribution) and MATH500 (out-of-distribution). For both settings, the linear prober can achieve near-perfect AUROC and low ECE at several layers.
Figure 10. Complete example shown in …
Figure 11. Qualitative example of positive steering on a GSM8K question from ProcessBench with R1-32B model. The proposed solution makes an unjustified assumption about the work schedule (5 days vs 7 days). The baseline model accepts this assumption as “standard”, while the steered model critically analyzes the specific phrasing “entire week” and identifies the discrepancy.
Figure 12. Qualitative example on GSM8K task with Qwen3-4B model. The baseline model hallucinates a logic error, incorrectly arguing that the total budget ($48) should be used for the counterfactual, ignoring that the CD was still purchased. The positive steering helps the model successfully verify the correct logic: money saved equals the cost of the unpurchased item ($44).
Figure 13. Qualitative example of negative steering on ProcessBench-GSM8K with R1-32B model. The prompt contains a logical error in the first step (calculating total food prepared). The baseline model successfully debugs the setup (54 hotdogs vs 36), whereas negative steering suppresses this critical check and accepts the erroneous premise.
Figure 14. Qualitative example of positive steering on GSM8K-Error with R1-32B model. The baseline model uses a multiplication error (5 × 8 = 30) and confirms it during self-verification (70 + 30 = 100). Positive steering initially makes the same error but successfully catches it during the “Wait” phase, correcting the allowance to 40 and the final answer to 60.
Figure 15. Qualitative example of negative steering on ProcessBench-GSM8K with R1-32B model. The prompt context introduces an arithmetic error ($3+$2=$6 instead of $5). The baseline model initially adopts the error but triggers a self-correction after generating “Wait,” re-evaluating the costs. The negatively steered model accepts the erroneous premise and justifies the incorrect result.
Original abstract

Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothesize that such behaviors are beneficial only when the model has sufficiently strong "critique" ability to detect its own mistakes. This work systematically investigates how current LRMs recover from errors by inserting arithmetic mistakes in their intermediate reasoning steps. Notably, we discover a peculiar yet important phenomenon: despite the error propagating throughout the entire chain-of-thought (CoT) without any verbalized correction, the model still reaches the correct final answer after the thinking process finishes. This recovery implies the existence of an internal mechanism helping the model to detect errors and trigger self-correction, which we refer to as the hidden critique ability. Building on feature space analysis, we identify a highly interpretable critique vector representing this behavior. Extensive experiments across multiple model scales and families demonstrate that steering latent representations with this vector improves the model's error detection capability and enhances the performance of test-time scaling at no extra training cost. Our findings provide a valuable understanding of LRMs' critique behavior, suggesting a promising direction to control and improve their self-verification mechanism. Our code is available at: https://github.com/mail-research/lrm-critique-vectors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that Large Reasoning Models exhibit a 'hidden critique ability' allowing recovery of correct final answers even when arithmetic errors propagate through the entire chain-of-thought without verbalized correction. It supports this via error-injection experiments across model scales and families, identifies an interpretable 'critique vector' via feature-space analysis, and demonstrates that steering latent representations with this vector improves error detection and test-time scaling performance at no training cost. Code is released.

Significance. If the recovery phenomenon is shown to arise from internal critique of the CoT rather than alternative mechanisms, and if the critique vector proves robust and general, the work would provide a concrete mechanistic account of self-correction in LRMs together with a practical, training-free intervention. The public code release is a clear strength supporting reproducibility.

major comments (2)
  1. [Abstract] The central inference that 'recovery implies the existence of an internal mechanism helping the model to detect errors' is load-bearing for the hidden-critique claim, yet the abstract (and, by the provided description, the experiments) contains no controls that would establish that the model actually conditions its final answer on the injected CoT steps rather than recomputing directly from the question. Without such controls (e.g., ablation of CoT visibility or verification that the injected error measurably shifts the internal answer computation), the observed recovery is equally compatible with the alternative explanation that the model ignores the corrupted steps and re-derives the answer from the question.
  2. [Abstract] The feature-space analysis that isolates the 'critique vector' (mentioned in the abstract) is presented as highly interpretable, but the manuscript provides no quantitative test showing that steering along this vector specifically modulates error detection rather than a correlated but distinct latent direction; this weakens the claim that the vector directly represents the hypothesized hidden critique ability.
minor comments (1)
  1. [Abstract] The abstract refers to 'extensive experiments across multiple model scales and families' but does not specify the exact models, number of trials per condition, or statistical tests used to establish that recovery rates exceed chance; adding these details would improve clarity.
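
As an illustration of the kind of statistical detail the minor comment asks for, the sketch below computes a Wilson score interval for a recovery rate and compares its lower bound with a reference rate. The counts and the reference rate are hypothetical, and the paper may use a different test.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half

# Hypothetical: 60 recoveries out of 100 error-injected runs.
low, high = wilson_interval(60, 100)

# Hypothetical reference rate standing in for "chance".
chance_rate = 0.25
exceeds_chance = low > chance_rate
```

Reporting the interval alongside the number of trials per condition would let a reader judge whether a recovery rate is distinguishable from whatever baseline is chosen.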

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. The concerns about controls for the recovery phenomenon and specificity of the critique vector are well-taken and point to ways the evidence can be strengthened. We respond to each major comment below and will revise the manuscript to incorporate the suggested analyses.

Point-by-point responses
  1. Referee: [Abstract] The central inference that 'recovery implies the existence of an internal mechanism helping the model to detect errors' is load-bearing for the hidden-critique claim, yet the abstract (and, by the provided description, the experiments) contains no controls that would establish that the model actually conditions its final answer on the injected CoT steps rather than recomputing directly from the question. Without such controls (e.g., ablation of CoT visibility or verification that the injected error measurably shifts the internal answer computation), the observed recovery is equally compatible with the alternative explanation that the model ignores the corrupted steps and re-derives the answer from the question.

    Authors: We agree that explicit controls are necessary to establish that recovery depends on conditioning over the erroneous CoT rather than direct recomputation from the question. Our current experiments document that injected arithmetic errors propagate through the full CoT without verbalized correction yet the final answer is still correct; however, we did not perform the ablations suggested (e.g., masking CoT visibility or measuring internal representation shifts attributable to the injected error). In the revision we will add (i) an ablation that removes or masks the CoT after error injection and (ii) a check that the injected error produces a measurable change in the model's internal answer computation before the final token. These additions will directly address the alternative explanation. revision: yes

  2. Referee: [Abstract] The feature-space analysis that isolates the 'critique vector' (mentioned in the abstract) is presented as highly interpretable, but the manuscript provides no quantitative test showing that steering along this vector specifically modulates error detection rather than a correlated but distinct latent direction; this weakens the claim that the vector directly represents the hypothesized hidden critique ability.

    Authors: The critique vector was extracted via feature-space analysis as the consistent direction separating error-injection from clean trajectories across scales and families; steering experiments then showed downstream gains in error detection and test-time scaling. We acknowledge, however, that these results do not yet include quantitative specificity tests (e.g., comparison against random directions or other control vectors) that would rule out a merely correlated latent direction. In the revised manuscript we will add such controls, reporting the differential effect of the critique vector versus matched random and task-relevant control directions on error-detection metrics. revision: yes
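
The specificity control promised in the second response could be summarized with a simple empirical p-value: the share of control directions whose steering effect is at least as large as that of the critique vector. This is only a sketch of one possible analysis; the effect sizes below are invented, and the add-one smoothing is a conventional choice rather than anything stated in the rebuttal.

```python
def empirical_p_value(critique_effect, control_effects):
    """Fraction of controls with an effect >= the critique vector's effect.

    Uses add-one smoothing so the estimate is never exactly zero.
    """
    at_least = sum(e >= critique_effect for e in control_effects)
    return (at_least + 1) / (len(control_effects) + 1)

# Hypothetical gains in error-detection accuracy, NOT results from the paper.
critique_effect = 0.12
control_effects = [0.01, -0.02, 0.03, 0.00, 0.02, -0.01, 0.04, 0.01, 0.00]

p = empirical_p_value(critique_effect, control_effects)
```

With only nine control directions the smallest attainable value is 0.1, so a real analysis would need many more matched random directions to support a strong specificity claim.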

Circularity Check

0 steps flagged

No circularity: claim rests on empirical recovery observation and feature analysis, not self-referential definitions or fitted inputs.

full rationale

The paper's derivation begins with an experimental intervention (inserting arithmetic errors into CoT) and observes recovery to the correct final answer without verbalized correction; this is interpreted as evidence for a hidden critique ability, which is then located via feature-space analysis as a critique vector. No equations, parameters, or predictions are shown to reduce by construction to quantities defined in terms of the target result itself. The abstract and described methodology contain no self-citation load-bearing steps, no fitted-input-called-prediction patterns, and no ansatz smuggled via prior work. The central claim therefore remains an independent empirical interpretation rather than a tautological renaming or self-definition.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim depends on interpreting unverbalized recovery as evidence of an internal critique mechanism and on the extracted vector faithfully representing that mechanism; both rest on data-dependent analysis rather than external benchmarks.

free parameters (1)
  • critique vector direction
    The direction is identified from feature-space analysis comparing activations on error-injected versus normal trajectories, making it dependent on the specific experimental data.
axioms (1)
  • domain assumption: Recovery after error injection without verbal correction is caused by an internal critique mechanism rather than other factors such as data regularities or chance.
    This interpretive step is required to label the observed behavior as hidden critique ability.
invented entities (1)
  • critique vector (no independent evidence)
    purpose: To represent and allow steering of the hidden critique ability
    The vector is extracted from internal activations but has no independent falsifiable handle outside the steering experiments described.

pith-pipeline@v0.9.0 · 5783 in / 1342 out tokens · 31207 ms · 2026-05-25T06:46:19.078301+00:00 · methodology


