Decoding the Critique Mechanism in Large Reasoning Models
Pith reviewed 2026-05-25 06:46 UTC · model grok-4.3
The pith
Large reasoning models recover from errors injected into their chain-of-thought via an internal hidden critique ability represented by a steerable critique vector.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large reasoning models possess a hidden critique ability: they detect errors in intermediate reasoning steps and still produce correct final answers, even when the error propagates through the entire chain-of-thought without any verbalized correction. Feature-space analysis isolates a highly interpretable critique vector that encodes this behavior. Steering latent representations along this vector measurably improves error detection and raises test-time scaling performance across multiple model scales and families at zero additional training cost.
What carries the argument
The critique vector, a direction in the model's latent feature space that encodes the hidden critique ability and can be added to steer self-correction.
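To make "can be added to steer" concrete, here is a minimal sketch of additive activation steering on a single hidden state. It illustrates the general technique and is not taken from the paper's code. The names `steer`, `projection`, and `alpha`, and the toy three-dimensional vectors, are assumptions chosen for the example.

```python
from math import sqrt


def unit(vector):
    """Return the vector rescaled to unit length."""
    norm = sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def projection(hidden, direction):
    """Scalar projection of a hidden state onto the (normalized) direction."""
    return sum(h * d for h, d in zip(hidden, unit(direction)))


def steer(hidden, direction, alpha):
    """Additive steering: hidden + alpha * unit(direction).

    `alpha` is a hypothetical steering strength; the review does not
    state which values the paper uses.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden state and direction must have the same size")
    return [h + alpha * d for h, d in zip(hidden, unit(direction))]


# Toy values for illustration only, not real model activations.
hidden = [0.5, -1.0, 2.0]
critique_direction = [3.0, 0.0, 4.0]
steered = steer(hidden, critique_direction, alpha=2.0)
```

In this sketch the projection onto the direction grows by exactly `alpha`, while components orthogonal to the direction are left unchanged. In a real model the same addition would typically be applied to the residual stream at one or more layers.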
If this is right
- Steering with the critique vector raises the model's ability to detect its own errors during reasoning.
- The same steering improves performance of test-time scaling methods without extra training.
- The effect holds across multiple model scales and families.
- The vector supplies a concrete way to control and strengthen the self-verification process in large reasoning models.
Where Pith is reading between the lines
- If the vector proves stable across domains, the same steering approach could be tested on non-arithmetic tasks such as code generation or logical deduction.
- The separation between visible chain-of-thought and final-answer accuracy suggests final-answer generation may draw on internal states that are not fully reflected in the generated tokens.
- Ablating or reversing the vector during inference could serve as a diagnostic to test whether other observed self-correction behaviors rely on the same latent direction.
Load-bearing premise
The observed recovery after error injection is produced by an internal critique mechanism rather than by statistical regularities in training data or other unmeasured factors.
What would settle it
Measure whether steering along the identified critique vector changes the rate at which models recover from injected arithmetic errors compared with an orthogonal direction or no steering.
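One way to set up that comparison is sketched below, under stated assumptions. The control is a random direction with its component along the critique vector removed and its norm matched. Recovery rates are then compared across the three conditions. The function names and condition labels are hypothetical, and the paper may build its controls differently.

```python
import random
from math import sqrt


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthogonal_control(direction, seed=0):
    """Random direction orthogonal to `direction`, rescaled to the same norm."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in direction]
    # Remove the component of the noise that lies along `direction`.
    coeff = dot(noise, direction) / dot(direction, direction)
    residual = [n - coeff * d for n, d in zip(noise, direction)]
    residual_norm = sqrt(dot(residual, residual))
    if residual_norm == 0.0:
        raise ValueError("no orthogonal component; use a higher dimension")
    scale = sqrt(dot(direction, direction)) / residual_norm
    return [scale * r for r in residual]


def recovery_rate(recovered_flags):
    """Fraction of error-injected runs that still end with the correct answer."""
    return sum(recovered_flags) / len(recovered_flags)


def rate_differences(outcomes, baseline="no_steering"):
    """Recovery-rate difference of each condition relative to the baseline."""
    base = recovery_rate(outcomes[baseline])
    return {
        name: recovery_rate(flags) - base
        for name, flags in outcomes.items()
        if name != baseline
    }


# Toy direction for illustration only.
critique_direction = [3.0, 0.0, 4.0]
control_direction = orthogonal_control(critique_direction)
```

`rate_differences` expects a dictionary mapping condition names (for example `"no_steering"`, `"critique"`, `"orthogonal"`) to lists of per-run booleans. A clear positive difference for the critique condition, with none for the orthogonal control, would support the claim.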
Original abstract
Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothesize that such behaviors are beneficial only when the model has sufficiently strong "critique" ability to detect its own mistakes. This work systematically investigates how current LRMs recover from errors by inserting arithmetic mistakes in their intermediate reasoning steps. Notably, we discover a peculiar yet important phenomenon: despite the error propagating throughout the entire chain-of-thought (CoT) without any verbalized correction, the model still reaches the correct final answer after the thinking process finishes. This recovery implies the existence of an internal mechanism helping the model to detect errors and trigger self-correction, which we refer to as the hidden critique ability. Building on feature space analysis, we identify a highly interpretable critique vector representing this behavior. Extensive experiments across multiple model scales and families demonstrate that steering latent representations with this vector improves the model's error detection capability and enhances the performance of test-time scaling at no extra training cost. Our findings provide a valuable understanding of LRMs' critique behavior, suggesting a promising direction to control and improve their self-verification mechanism. Our code is available at: https://github.com/mail-research/lrm-critique-vectors.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Large Reasoning Models exhibit a 'hidden critique ability' allowing recovery of correct final answers even when arithmetic errors propagate through the entire chain-of-thought without verbalized correction. It supports this via error-injection experiments across model scales and families, identifies an interpretable 'critique vector' via feature-space analysis, and demonstrates that steering latent representations with this vector improves error detection and test-time scaling performance at no training cost. Code is released.
Significance. If the recovery phenomenon is shown to arise from internal critique of the CoT rather than alternative mechanisms, and if the critique vector proves robust and general, the work would provide a concrete mechanistic account of self-correction in LRMs together with a practical, training-free intervention. The public code release is a clear strength supporting reproducibility.
major comments (2)
- [Abstract] Abstract: the central inference that 'recovery implies the existence of an internal mechanism helping the model to detect errors' is load-bearing for the hidden-critique claim, yet the abstract (and, by the provided description, the experiments) contains no controls that would establish that the model actually conditions its final answer on the injected CoT steps rather than recomputing directly from the question. Without such controls (e.g., ablation of CoT visibility or verification that the injected error measurably shifts the internal answer computation), the observed recovery is compatible with the skeptic's alternative explanation.
- [Abstract] The feature-space analysis that isolates the 'critique vector' (mentioned in the abstract) is presented as highly interpretable, but the manuscript provides no quantitative test showing that steering along this vector specifically modulates error detection rather than a correlated but distinct latent direction; this weakens the claim that the vector directly represents the hypothesized hidden critique ability.
minor comments (1)
- [Abstract] The abstract refers to 'extensive experiments across multiple model scales and families' but does not specify the exact models, number of trials per condition, or statistical tests used to establish that recovery rates exceed chance; adding these details would improve clarity.
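As an illustration of the kind of statistical reporting this comment asks for, the sketch below computes a Wilson score interval for a recovery rate and checks whether it lies above a reference rate. The choice of the Wilson interval, the 95% level (`z=1.96`), and the notion of a "baseline" rate are assumptions for the example. The review does not say which test the paper uses.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score confidence interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (
        z * sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials))
        / denom
    )
    return center - half_width, center + half_width


def excludes_baseline(successes, trials, baseline, z=1.96):
    """True if the whole interval lies above the assumed baseline rate."""
    lower, _ = wilson_interval(successes, trials, z)
    return lower > baseline
```

For example, `wilson_interval(recovered, total)` could be reported per model and condition alongside the raw counts. The baseline would need to be defined explicitly, for instance the rate of correct answers when the chain-of-thought is masked.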
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. The concerns about controls for the recovery phenomenon and specificity of the critique vector are well-taken and point to ways the evidence can be strengthened. We respond to each major comment below and will revise the manuscript to incorporate the suggested analyses.
-
Referee: [Abstract] Abstract: the central inference that 'recovery implies the existence of an internal mechanism helping the model to detect errors' is load-bearing for the hidden-critique claim, yet the abstract (and, by the provided description, the experiments) contains no controls that would establish that the model actually conditions its final answer on the injected CoT steps rather than recomputing directly from the question. Without such controls (e.g., ablation of CoT visibility or verification that the injected error measurably shifts the internal answer computation), the observed recovery is compatible with the skeptic's alternative explanation.
Authors: We agree that explicit controls are necessary to establish that recovery depends on conditioning over the erroneous CoT rather than direct recomputation from the question. Our current experiments document that injected arithmetic errors propagate through the full CoT without verbalized correction yet the final answer is still correct; however, we did not perform the ablations suggested (e.g., masking CoT visibility or measuring internal representation shifts attributable to the injected error). In the revision we will add (i) an ablation that removes or masks the CoT after error injection and (ii) a check that the injected error produces a measurable change in the model's internal answer computation before the final token. These additions will directly address the alternative explanation. revision: yes
-
Referee: [Abstract] The feature-space analysis that isolates the 'critique vector' (mentioned in the abstract) is presented as highly interpretable, but the manuscript provides no quantitative test showing that steering along this vector specifically modulates error detection rather than a correlated but distinct latent direction; this weakens the claim that the vector directly represents the hypothesized hidden critique ability.
Authors: The critique vector was extracted via feature-space analysis as the consistent direction separating error-injection from clean trajectories across scales and families; steering experiments then showed downstream gains in error detection and test-time scaling. We acknowledge, however, that these results do not yet include quantitative specificity tests (e.g., comparison against random directions or other control vectors) that would rule out a merely correlated latent direction. In the revised manuscript we will add such controls, reporting the differential effect of the critique vector versus matched random and task-relevant control directions on error-detection metrics. revision: yes
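A common way to obtain a direction that separates two sets of trajectories is a difference of class means over hidden states. The sketch below shows that technique as an assumption. The rebuttal only says the vector is "the consistent direction separating error-injection from clean trajectories" and does not confirm this exact procedure. The function names and toy activations are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not rows:
        raise ValueError("need at least one activation vector")
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def difference_of_means(error_activations, clean_activations):
    """Direction pointing from the clean mean to the error-injected mean."""
    mu_error = mean_vector(error_activations)
    mu_clean = mean_vector(clean_activations)
    return [e - c for e, c in zip(mu_error, mu_clean)]


# Toy activations for illustration only, not taken from any model.
error_activations = [[1.0, 2.0], [3.0, 4.0]]
clean_activations = [[0.0, 0.0], [2.0, 2.0]]
candidate_direction = difference_of_means(error_activations, clean_activations)
```

A specificity control of the kind promised in the rebuttal would then steer with this candidate and with norm-matched random or task-relevant directions, and report the gap in error-detection metrics.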
Circularity Check
No circularity: claim rests on empirical recovery observation and feature analysis, not self-referential definitions or fitted inputs.
The paper's derivation begins with an experimental intervention (inserting arithmetic errors into CoT) and observes recovery to the correct final answer without verbalized correction; this is interpreted as evidence for a hidden critique ability, which is then located via feature-space analysis as a critique vector. No equations, parameters, or predictions are shown to reduce by construction to quantities defined in terms of the target result itself. The abstract and described methodology contain no self-citation load-bearing steps, no fitted-input-called-prediction patterns, and no ansatz smuggled via prior work. The central claim therefore remains an independent empirical interpretation rather than a tautological renaming or self-definition.
Axiom & Free-Parameter Ledger
free parameters (1)
- critique vector direction
axioms (1)
- domain assumption: Recovery after error injection without verbal correction is caused by an internal critique mechanism rather than other factors such as data regularities or chance.
invented entities (1)
- critique vector: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: we identify a highly interpretable critique vector representing this behavior... steering latent representations with this vector improves the model's error detection capability
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: the hidden critique ability... linear separability in the latent space
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A. System Prompt
A.1. GPT-5 Instruction Prompt
The following system prompt P_err was used to generate incorrect reasoning steps:
Given a question and its ground-truth answer, generate a na...
... Earnings from Selling Eggs: 9 eggs × $2/egg = $18/day
Final Answer: 18
Intervened Response
Prompt: Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the ...
1. Total Eggs Laid Per Day: Janet's ducks lay a total of 16 eggs/day.
2. Eggs Used by Janet:
- For Breakfast: Janet eats 3 eggs every morning.
- For Muffins: She uses 4 eggs daily.
Total Eggs Used: 3 + 4 = 7 eggs/day
3. Eggs Available for Sale: 16 − 7 = 9 eggs/day
4. Earnings from Selling Eggs: 9 eggs × $2/egg = $18
Final Answer: 18
Figure 10. Complete example shown in Figure 1 generated by DeepSeek-R1-Distill-Qwen-32B (R1-32B). As laid out in Section 3.1, we generate the arithmetic error with GPT-5 (highlighted by a box in the right column) and insert this error immediately after the ⟨think⟩ token, which causes the model t...
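The intervention described in the caption can be sketched as follows, as a rough illustration only. The prefix layout, the literal `<think>` string, and the function names are assumptions. The actual chat template and the wording of the injected error are not shown in this excerpt, so both are left as placeholders. The second function simply re-derives the ground-truth answer from the worked example above.

```python
def build_intervened_prefix(question, erroneous_step, think_token="<think>"):
    """Place an erroneous reasoning step immediately after the thinking token.

    The layout (question, newline, token, newline, step) is a hypothetical
    simplification of a real chat template.
    """
    return f"{question}\n{think_token}\n{erroneous_step}\n"


def daily_egg_earnings(laid=16, breakfast=3, muffins=4, price_per_egg=2):
    """Ground-truth arithmetic from the example: (16 - (3 + 4)) * 2."""
    eggs_for_sale = laid - (breakfast + muffins)
    return eggs_for_sale * price_per_egg


# Placeholders: the real question text and injected error are elided in the source.
prefix = build_intervened_prefix("<question text>", "<erroneous arithmetic step>")
expected_answer = daily_egg_earnings()
```

The model would then continue generating from `prefix`, and a run counts as a recovery when its final answer equals `expected_answer` even though the injected step was never explicitly corrected.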