Decoding the Critique Mechanism in Large Reasoning Models

Heng Ji; Hoang Phan; Hung T. Q. Le; Khoa D. Doan; Quang H. Nguyen; Xiusi Chen

arxiv: 2603.16331 · v2 · pith:EBFZ7MYGnew · submitted 2026-03-17 · 💻 cs.LG

Hoang Phan, Quang H. Nguyen, Hung T. Q. Le, Xiusi Chen, Heng Ji, Khoa D. Doan

Pith reviewed 2026-05-25 06:46 UTC · model grok-4.3

classification 💻 cs.LG

keywords large reasoning models · chain-of-thought · self-correction · critique ability · latent steering · error recovery · test-time scaling

The pith

Large reasoning models recover from errors injected into their chain-of-thought via an internal hidden critique ability represented by a steerable critique vector.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large reasoning models can detect and correct mistakes they make during step-by-step reasoning. By deliberately inserting arithmetic errors into intermediate steps of the chain-of-thought, the authors find that models frequently arrive at the right final answer anyway, even though the error is never mentioned or fixed in the visible text. This pattern points to an internal detection process the authors call the hidden critique ability. They locate a corresponding direction in the model's latent space, the critique vector, that can be used to strengthen error detection and improve test-time scaling across model sizes without any retraining.

Core claim

Large reasoning models possess a hidden critique ability that detects errors in intermediate reasoning steps and still produces correct final answers even when the error propagates through the entire chain-of-thought without any verbalized correction. Feature-space analysis isolates a highly interpretable critique vector that encodes this behavior. Steering latent representations along this vector measurably improves error detection and raises test-time scaling performance on multiple model scales and families at zero additional training cost.

What carries the argument

The critique vector, a direction in the model's latent feature space that encodes the hidden critique ability and can be added to steer self-correction.
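
A minimal sketch of what additive steering along a direction can look like, assuming the common activation-steering form h' = h + α·v applied to one hidden state. The helper name `steer`, the toy 4-dimensional vectors, and the unit-normalization step are illustrative assumptions, not the paper's implementation.

```python
import math

def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction). Hypothetical helper."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]

# Toy values (assumed): a hidden state and a stand-in "critique vector".
hidden = [0.5, -1.0, 2.0, 0.0]
critique_vector = [3.0, 0.0, 4.0, 0.0]  # Euclidean norm is 5

steered_pos = steer(hidden, critique_vector, alpha=1.0)   # push along the direction
steered_neg = steer(hidden, critique_vector, alpha=-1.0)  # push against it
print(steered_pos, steered_neg)
```

Setting `alpha=0` leaves the hidden state unchanged, which corresponds to the unsteered baseline.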

If this is right

Steering with the critique vector raises the model's ability to detect its own errors during reasoning.
The same steering improves performance of test-time scaling methods without extra training.
The effect holds across multiple model scales and families.
The vector supplies a concrete way to control and strengthen the self-verification process in large reasoning models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the vector proves stable across domains, the same steering approach could be tested on non-arithmetic tasks such as code generation or logical deduction.
The separation between visible chain-of-thought and final-answer accuracy suggests final-answer generation may draw on internal states that are not fully reflected in the generated tokens.
Ablating or reversing the vector during inference could serve as a diagnostic to test whether other observed self-correction behaviors rely on the same latent direction.

Load-bearing premise

The observed recovery after error injection is produced by an internal critique mechanism rather than by statistical regularities in training data or other unmeasured factors.

What would settle it

Measure whether steering along the identified critique vector changes the rate at which models recover from injected arithmetic errors compared with an orthogonal direction or no steering.
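
One way such a comparison could be scored is a two-proportion z-test on recovery rates between steering conditions. The sketch below assumes each trial is a binary "recovered or not" outcome; the counts are invented for illustration and the function name `two_proportion_z` is hypothetical.

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two recovery rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # identical degenerate rates: no evidence of a difference
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Invented counts: (recovered trials, total trials) per steering condition.
conditions = {
    "critique_vector": (70, 100),
    "orthogonal_direction": (50, 100),
    "no_steering": (50, 100),
}

z, p = two_proportion_z(*conditions["critique_vector"],
                        *conditions["orthogonal_direction"])
print(f"z = {z:.3f}, p = {p:.4f}")
```

A large gap against the orthogonal control, with no gap between the orthogonal control and no steering, would be the pattern that supports direction-specific effects.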

Figures

Figures reproduced from arXiv: 2603.16331 by Heng Ji, Hoang Phan, Hung T. Q. Le, Khoa D. Doan, Quang H. Nguyen, Xiusi Chen.

**Figure 1.** Hidden self-correction despite erroneous reasoning in R1-32B. Left: original correct reasoning. Right: injected error (3 + 4 = 6) propagates to incorrect thinking conclusion ($20), yet the model recovers to the correct final answer ($18). We hypothesize this indicates implicit error detection beyond the observable CoT. Full generation details are in Appendix E.1.
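
The arithmetic behind the two dollar figures can be checked directly. This sketch assumes the example is the GSM8K duck-egg question quoted in the paper's appendix (16 eggs laid per day, 3 eaten, 4 used for muffins, $2 per egg); the function name `earnings` is illustrative.

```python
EGGS_LAID = 16      # eggs per day, from the quoted problem
PRICE_PER_EGG = 2   # dollars per egg

def earnings(eggs_used):
    """Daily earnings given how many eggs are consumed before selling."""
    return (EGGS_LAID - eggs_used) * PRICE_PER_EGG

correct_total = earnings(3 + 4)  # correct sum: 7 eggs used
erroneous_total = earnings(6)    # injected error claims 3 + 4 = 6

print(correct_total, erroneous_total)
```

The erroneous chain yields $20 and the correct one $18, matching the thinking conclusion and the final answer reported in the caption.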

**Figure 2.** Distribution of reasoning-result alignment across R1 and Qwen3 model variants. The proportions represent four distinct outcomes based on the correctness of the internal "thinking" process versus the correctness of the final "answer" on the GSM8K-Error and MATH500-Error benchmarks. In this paper, we analyze the self-correction mechanisms represented by the light blue segments: '× Think ✓ Answer'.
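
The four-way breakdown in this figure amounts to tallying a 2×2 table of thinking correctness against answer correctness. A minimal sketch, using invented records and a hypothetical `outcome` helper:

```python
from collections import Counter

def outcome(think_correct, answer_correct):
    """Label one trial with the figure's '<mark> Think <mark> Answer' scheme."""
    think = "✓" if think_correct else "×"
    answer = "✓" if answer_correct else "×"
    return f"{think} Think {answer} Answer"

# Invented (think_correct, answer_correct) pairs, for illustration only.
records = [(True, True), (False, True), (False, False), (True, True), (False, True)]

counts = Counter(outcome(t, a) for t, a in records)
proportions = {label: n / len(records) for label, n in counts.items()}
print(proportions)
```

The '× Think ✓ Answer' bucket is the one the paper singles out: the visible reasoning ends wrong but the final answer is right.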

**Figure 3.** Effect of different steering coefficients α evaluated on ProcessBench. Left to right: Error detection accuracy, correct solution accuracy, and F1 score. Increasing α generally improves error detection and F1 scores while leading to a decline in correct solution accuracy. The dashed vertical line indicates the baseline performance at α = 0.

**Figure 4.** Effect of different steering coefficients α evaluated on BIG-Bench Mistake. Left to right: Error detection accuracy, correct solution accuracy, and F1 score. The results illustrate similar effects to …
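
The trade-off in these steering-coefficient sweeps is easier to read with the F1 definition in hand. The sketch below assumes the ProcessBench convention, where F1 is the harmonic mean of accuracy on erroneous samples and accuracy on correct samples; the captions do not confirm this definition, and the numbers are invented.

```python
def harmonic_f1(error_acc, correct_acc):
    """Harmonic mean of error-detection accuracy and correct-solution accuracy."""
    if error_acc + correct_acc == 0:
        return 0.0
    return 2 * error_acc * correct_acc / (error_acc + correct_acc)

# Invented accuracies: steering raises error detection, lowers correct-solution accuracy.
baseline = harmonic_f1(error_acc=0.40, correct_acc=0.90)
steered = harmonic_f1(error_acc=0.60, correct_acc=0.80)

print(f"baseline F1 = {baseline:.3f}, steered F1 = {steered:.3f}")
```

Because the harmonic mean is dominated by the smaller of the two accuracies, F1 can rise even while correct-solution accuracy declines, as long as the weaker error-detection side improves enough.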

**Figure 5.** Qualitative example of positive steering on BIG-Bench arithmetic task with R1-32B. There is a subtle arithmetic error introduced in the prompt (7 + 2 + 20 = 30, whereas it should be 29). While the baseline model hallucinates that the math is correct during verification, positive steering enables the model to successfully catch the addition error.

**Figure 6.** Qualitative example of negative steering on BIG-Bench arithmetic task with R1-32B. The problem contains a sign error in the calculation of term B (0 − (0 − −7) = −7, not 7). While the baseline model correctly critiques the reasoning and identifies the error, negative steering (Coefficient -1.0) makes the model hallucinate that the incorrect calculation is valid. Critique vector controls the detection abili…

**Figure 7.** Impact of steering on model accuracy for test-time scaling. Left to right: Accuracy across three benchmarks: GSM8K-Error, MATH500-Error, and BIG-Bench Mistake. For increasing iterations of test-time scaling, positive steering consistently enhances the performance of all models.

**Figure 8.** Layer-wise steering results. The separation between the performance of positive and negative steering is minimal for early and late layers. For middle layers, this performance gap is the most noticeable.

**Figure 9.** Layer-wise linear probing results. Left to right: AUROC and ECE for GSM8K (in-distribution) and MATH500 (out-of-distribution). For both settings, the linear probe can achieve near-perfect AUROC and low ECE at several layers.
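
For readers unfamiliar with the two probing metrics, here is a minimal pure-Python sketch of both. It assumes the probe outputs a probability that a trajectory contains an error, computes AUROC by pairwise ranking, and uses a binary-calibration form of ECE with equal-width bins; the paper's exact binning is not stated here, and the toy scores are invented.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ece(probs, labels, n_bins=10):
    """Expected calibration error: weighted gap between mean probability and positive rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = len(probs)
    error = 0.0
    for b in bins:
        if b:
            mean_prob = sum(p for p, _ in b) / len(b)
            pos_rate = sum(y for _, y in b) / len(b)
            error += len(b) / total * abs(mean_prob - pos_rate)
    return error

# Invented probe outputs: 1 = trajectory contains an error.
probs = [0.9, 0.8, 0.2, 0.1]
labels = [1, 1, 0, 0]
print(auroc(probs, labels), ece(probs, labels))
```

AUROC measures whether the probe ranks erroneous trajectories above clean ones; ECE measures whether its probabilities are trustworthy as probabilities. A probe can score well on one and poorly on the other.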

**Figure 10.** Complete example shown in Figure 1, generated by DeepSeek-R1-Distill-Qwen-32B (R1-32B).

**Figure 11.** Qualitative example of positive steering on a GSM8K question from ProcessBench with R1-32B model. The proposed solution makes an unjustified assumption about the work schedule (5 days vs 7 days). The baseline model accepts this assumption as "standard", while the steered model critically analyzes the specific phrasing "entire week" and identifies the discrepancy.

**Figure 12.** Qualitative example on GSM8K task with Qwen3-4B model. The baseline model hallucinates a logic error, incorrectly arguing that the total budget ($48) should be used for the counterfactual, ignoring that the CD was still purchased. The positive steering helps the model successfully verify the correct logic: money saved equals the cost of the unpurchased item ($44).

**Figure 13.** Qualitative example of negative steering on ProcessBench-GSM8K with R1-32B model. The prompt contains a logical error in the first step (calculating total food prepared). The baseline model successfully debugs the setup (54 hotdogs vs 36), whereas negative steering suppresses this critical check and accepts the erroneous premise.

**Figure 14.** Qualitative example of positive steering on GSM8K-Error with R1-32B model. The baseline model uses a multiplication error (5 × 8 = 30) and confirms it during self-verification (70 + 30 = 100). Positive steering initially makes the same error but successfully catches it during the "Wait" phase, correcting the allowance to 40 and the final answer to 60.

**Figure 15.** Qualitative example of negative steering on ProcessBench-GSM8K with R1-32B model. The prompt context introduces an arithmetic error ($3 + $2 = $6 instead of $5). The baseline model initially adopts the error but triggers a self-correction after generating "Wait," re-evaluating the costs. The negatively steered model accepts the erroneous premise and justifies the incorrect result.

Original abstract

Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothesize that such behaviors are beneficial only when the model has sufficiently strong "critique" ability to detect its own mistakes. This work systematically investigates how current LRMs recover from errors by inserting arithmetic mistakes in their intermediate reasoning steps. Notably, we discover a peculiar yet important phenomenon: despite the error propagating throughout the entire chain-of-thought (CoT) without any verbalized correction, the model still reaches the correct final answer after the thinking process finishes. This recovery implies the existence of an internal mechanism helping the model to detect errors and trigger self-correction, which we refer to as the hidden critique ability. Building on feature space analysis, we identify a highly interpretable critique vector representing this behavior. Extensive experiments across multiple model scales and families demonstrate that steering latent representations with this vector improves the model's error detection capability and enhances the performance of test-time scaling at no extra training cost. Our findings provide a valuable understanding of LRMs' critique behavior, suggesting a promising direction to control and improve their self-verification mechanism. Our code is available at: https://github.com/mail-research/lrm-critique-vectors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The error-injection experiments and critique vector extraction are the real contribution here, but the causal story about hidden critique needs tighter controls to rule out independent answer computation.

The paper's core observation is that large reasoning models can reach the right final answer after an arithmetic error is planted in the middle of their chain-of-thought, even when the mistake is never verbally fixed. From activations they pull a direction they call the critique vector and show that adding it improves error detection and test-time scaling on several model families. Code is released, which helps.

The multi-scale experiments are straightforward and the vector-steering result is concrete enough to replicate or build on. That combination of error propagation without verbal correction plus a steerable latent direction is not something I have seen laid out this way in the LRM papers cited. The work is therefore new on the empirical side.

The soft spot is exactly the one flagged in the stress-test note. Recovery after an injected error only implies an internal critique mechanism if the model is actually relying on the flawed intermediate steps to compute its answer. If instead the model largely ignores the provided CoT and solves the question directly, the correct final answer is unsurprising and does not require any hidden detection process. The abstract gives no description of controls that would force dependence on the injected chain or verify that the error actually altered the model's internal computation. If those checks are absent from the full paper, the interpretation stays under-supported.

This is the kind of paper that belongs in a reading group focused on mechanistic interpretability of reasoning. It is worth a referee's time because the experimental setup is simple, the steering result is actionable, and the claim can be sharpened or refuted with existing methods. I would send it out rather than desk-reject.

Referee Report

2 major / 1 minor

Summary. The paper claims that Large Reasoning Models exhibit a 'hidden critique ability' allowing recovery of correct final answers even when arithmetic errors propagate through the entire chain-of-thought without verbalized correction. It supports this via error-injection experiments across model scales and families, identifies an interpretable 'critique vector' via feature-space analysis, and demonstrates that steering latent representations with this vector improves error detection and test-time scaling performance at no training cost. Code is released.

Significance. If the recovery phenomenon is shown to arise from internal critique of the CoT rather than alternative mechanisms, and if the critique vector proves robust and general, the work would provide a concrete mechanistic account of self-correction in LRMs together with a practical, training-free intervention. The public code release is a clear strength supporting reproducibility.

major comments (2)

[Abstract] The central inference that 'recovery implies the existence of an internal mechanism helping the model to detect errors' is load-bearing for the hidden-critique claim, yet the abstract (and, by the provided description, the experiments) contains no controls that would establish that the model actually conditions its final answer on the injected CoT steps rather than recomputing directly from the question. Without such controls (e.g., ablation of CoT visibility or verification that the injected error measurably shifts the internal answer computation), the observed recovery is compatible with the skeptic's alternative explanation.
[Abstract] The feature-space analysis that isolates the 'critique vector' (mentioned in the abstract) is presented as highly interpretable, but the manuscript provides no quantitative test showing that steering along this vector specifically modulates error detection rather than a correlated but distinct latent direction; this weakens the claim that the vector directly represents the hypothesized hidden critique ability.

minor comments (1)

[Abstract] The abstract refers to 'extensive experiments across multiple model scales and families' but does not specify the exact models, number of trials per condition, or statistical tests used to establish that recovery rates exceed chance; adding these details would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. The concerns about controls for the recovery phenomenon and specificity of the critique vector are well-taken and point to ways the evidence can be strengthened. We respond to each major comment below and will revise the manuscript to incorporate the suggested analyses.

Referee: [Abstract] The central inference that 'recovery implies the existence of an internal mechanism helping the model to detect errors' is load-bearing for the hidden-critique claim, yet the abstract (and, by the provided description, the experiments) contains no controls that would establish that the model actually conditions its final answer on the injected CoT steps rather than recomputing directly from the question. Without such controls (e.g., ablation of CoT visibility or verification that the injected error measurably shifts the internal answer computation), the observed recovery is compatible with the skeptic's alternative explanation.

Authors: We agree that explicit controls are necessary to establish that recovery depends on conditioning over the erroneous CoT rather than direct recomputation from the question. Our current experiments document that injected arithmetic errors propagate through the full CoT without verbalized correction yet the final answer is still correct; however, we did not perform the ablations suggested (e.g., masking CoT visibility or measuring internal representation shifts attributable to the injected error). In the revision we will add (i) an ablation that removes or masks the CoT after error injection and (ii) a check that the injected error produces a measurable change in the model's internal answer computation before the final token. These additions will directly address the alternative explanation.
revision: yes

Referee: [Abstract] The feature-space analysis that isolates the 'critique vector' (mentioned in the abstract) is presented as highly interpretable, but the manuscript provides no quantitative test showing that steering along this vector specifically modulates error detection rather than a correlated but distinct latent direction; this weakens the claim that the vector directly represents the hypothesized hidden critique ability.

Authors: The critique vector was extracted via feature-space analysis as the consistent direction separating error-injection from clean trajectories across scales and families; steering experiments then showed downstream gains in error detection and test-time scaling. We acknowledge, however, that these results do not yet include quantitative specificity tests (e.g., comparison against random directions or other control vectors) that would rule out a merely correlated latent direction. In the revised manuscript we will add such controls, reporting the differential effect of the critique vector versus matched random and task-relevant control directions on error-detection metrics.
revision: yes
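
The control directions mentioned in this response could be constructed along the following lines. This is a sketch under stated assumptions: "matched" is taken to mean equal Euclidean norm, the orthogonal control is obtained by one Gram-Schmidt projection step, and all helper names and the toy vector are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def rescale(v, target_norm):
    """Scale v so that its Euclidean norm equals target_norm."""
    scale = target_norm / norm(v)
    return [x * scale for x in v]

def random_direction(dim, target_norm, rng):
    """Gaussian random direction rescaled to a target norm."""
    return rescale([rng.gauss(0.0, 1.0) for _ in range(dim)], target_norm)

def orthogonalize(v, ref):
    """Remove from v its component along ref (one Gram-Schmidt step)."""
    coef = dot(v, ref) / dot(ref, ref)
    return [x - coef * r for x, r in zip(v, ref)]

rng = random.Random(0)                  # fixed seed for reproducibility
critique_vector = [3.0, 0.0, 4.0, 0.0]  # toy stand-in, norm 5

random_control = random_direction(4, norm(critique_vector), rng)
orthogonal_control = rescale(orthogonalize(random_control, critique_vector),
                             norm(critique_vector))
print(random_control, orthogonal_control)
```

Matching the norm matters because a steering effect could otherwise be explained by perturbation size alone rather than by the direction itself.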

Circularity Check

0 steps flagged

No circularity: claim rests on empirical recovery observation and feature analysis, not self-referential definitions or fitted inputs.

The paper's derivation begins with an experimental intervention (inserting arithmetic errors into CoT) and observes recovery to the correct final answer without verbalized correction; this is interpreted as evidence for a hidden critique ability, which is then located via feature-space analysis as a critique vector. No equations, parameters, or predictions are shown to reduce by construction to quantities defined in terms of the target result itself. The abstract and described methodology contain no self-citation load-bearing steps, no fitted-input-called-prediction patterns, and no ansatz smuggled via prior work. The central claim therefore remains an independent empirical interpretation rather than a tautological renaming or self-definition.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on interpreting unverbalized recovery as evidence of an internal critique mechanism and on the extracted vector faithfully representing that mechanism; both rest on data-dependent analysis rather than external benchmarks.

free parameters (1)

critique vector direction
The direction is identified from feature-space analysis comparing activations on error-injected versus normal trajectories, making it dependent on the specific experimental data.
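
To make the data dependence concrete, here is a sketch of one common way to obtain such a direction: the difference between mean activations on the two sets of trajectories. Whether the paper uses exactly this estimator is an assumption; the function names and toy activations are invented.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def difference_of_means(error_acts, clean_acts):
    """Direction pointing from the clean-trajectory mean to the error-trajectory mean."""
    mu_err = mean_vector(error_acts)
    mu_clean = mean_vector(clean_acts)
    return [e - c for e, c in zip(mu_err, mu_clean)]

# Invented 3-dimensional activations at one layer.
error_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]  # error-injected trajectories
clean_acts = [[0.0, 2.0, 1.0], [0.0, 2.0, 1.0]]  # normal trajectories

direction = difference_of_means(error_acts, clean_acts)
print(direction)
```

Swapping in a different set of trajectories changes both means and therefore the direction, which is why the ledger counts it as a free parameter.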

axioms (1)

[domain assumption] Recovery after error injection without verbal correction is caused by an internal critique mechanism rather than other factors such as data regularities or chance.
This interpretive step is required to label the observed behavior as hidden critique ability.

invented entities (1)

critique vector [no independent evidence]
purpose: To represent and allow steering of the hidden critique ability
The vector is extracted from internal activations but has no independent falsifiable handle outside the steering experiments described.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "we identify a highly interpretable critique vector representing this behavior... steering latent representations with this vector improves the model's error detection capability"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "the hidden critique ability... linear separability in the latent space"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Reasoning Models Don't Always Say What They Think

URL https://openreview.net/forum? id=Wv9NMJoKww. Chen, R., Zhang, Z., Hong, J., Kundu, S., and Wang, Z. SEAL: Steerable reasoning calibration of large language models for free. InSecond Conference on Language Mod- eling, 2025c. URL https://openreview.net/ forum?id=klPszYDIRT. Chen, X., Xu, J., Liang, T., He, Z., Pang, J., Yu, D., Song, L., Liu, Q., Zhou, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1038/s41586-025-09422-z 2021

[2] [2]

Measuring Mathematical Problem Solving With the MATH Dataset

doi: 10.18653/v1/2025.acl-long.905. URL https: //aclanthology.org/2025.acl-long.905/. Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring math- ematical problem solving with the math dataset, 2021. URLhttps://arxiv.org/abs/2103.03874. Kamoi, R., Zhang, Y ., Zhang, N., Han, J., and Zhang, R. When ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2025.acl-long.905 2025

[3] [3]

Liu, R., Gao, J., Zhao, J., Zhang, K., Li, X., Qi, B., Ouyang, W., and Zhou, B

URL https://openreview.net/forum? id=v8L0pN6EOi. Liu, R., Gao, J., Zhao, J., Zhang, K., Li, X., Qi, B., Ouyang, W., and Zhou, B. Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling, 2025. URL https: //arxiv.org/abs/2502.06703. Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Cand...

work page arXiv 2025

[4] [5]

OpenAI o1 System Card

URL https://aclanthology.org/2025. emnlp-main.1025/. nostalgebraist. Interpreting gpt: The logit lens. https://www.lesswrong. com/posts/AcKRB8wDpdaN6v6ru/ interpreting-gpt-the-logit-lens , 2020. LessWrong. OpenAI, :, Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beu- tel, A., Carney, A., Iftimie, A., Karpe...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2024.acl-long.828 2025

[5] [6]

Stolfo, A., Balachandran, V ., Yousefi, S., Horvitz, E., and Nushi, B

URL https://openreview.net/forum? id=4FWAwZtd2n. Stolfo, A., Balachandran, V ., Yousefi, S., Horvitz, E., and Nushi, B. Improving instruction-following in language models through activation steering, 2025. URL https: //arxiv.org/abs/2410.12877. Sun, C.-E., Yan, G., and Weng, T.-W. ThinkEdit: Interpretable weight editing to mitigate overly short thinking i...

work page doi:10.18653/v1/2025.emnlp-main 2025

[6] [7]

Steering Language Models With Activation Engineering

URL https://aclanthology.org/2025.emnlp-main.861/. Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering language models with activation engineering, 2024. URL https://arxiv.org/abs/2308.10248. Turpin, M., Michael, J., Perez, E., and Bowman, S. R. Language models don’t always say what they think: Un-...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2024 2025

[7] [8]

Wu, T., Xiang, C., Wang, J

URL https://openreview.net/forum?id=s4xIeYimGQ. Wu, T., Xiang, C., Wang, J. T., Suh, G. E., and Mittal, P. Effectively controlling reasoning models through thinking intervention, 2025. URL https://arxiv.org/abs/2503.24370. Wu, Z., Zeng, Q., Zhang, Z., Tan, Z., Shen, C., and Jiang, M. Large language models can self-correct with key condition verificati...

work page doi:10.18653/v1/2024.emnlp-main 2025

[8] [9]

Qwen3 Technical Report

URL https://aclanthology.org/2024.emnlp-main.714/. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2025.findings-emnlp 2024

[9] [10]

findings-emnlp.370/

URL https://aclanthology.org/2025.findings-emnlp.370/. Yang, S., Wu, J., Chen, X., Xiao, Y., Yang, X., Wong, D. F., and Wang, D. Understanding aha moments: from external observations to internal mechanisms, 2025d. URL https://arxiv.org/abs/2504.02956. Yang, Z., Zhang, Y., Wang, Y., Xu, Z., Lin, J., and Sui, Z. Confidence v.s. critique: A decompositi...


[10] [11]

URL https://aclanthology.org/2025.acl-long.203/

doi: 10.18653/v1/2025.acl-long.203. URL https://aclanthology.org/2025.acl-long.203/. Zeng, Z., Cheng, Q., Yin, Z., Zhou, Y., and Qiu, X. Revisiting the test-time scaling of o1-like models: Do they truly possess test-time scaling capabilities? In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of t...

work page doi:10.18653/v1/2025.acl-long.203 2025

[11] [12]

Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought

doi: 10.18653/v1/2025.acl-long.232. URL https://aclanthology.org/2025.acl-long.232/. Zhang, A., Chen, Y., Pan, J., Zhao, C., Panda, A., Li, J., and He, H. Reasoning models know when they’re right: Probing hidden states for self-verification. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=O6I0Av7683. Zhang, D., Zhoub...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2025.acl-long.232 2025

[12] [13]

First, I need to

doi: 10.18653/v1/2025.acl-long.50. URL https://aclanthology.org/2025.acl-long.50/.

A. System Prompt

A.1. GPT-5 Instruction Prompt

The following system prompt P_err was used to generate incorrect reasoning steps:

GPT-5 Instruction Prompt

Given a question and its ground-truth answer, generate a na...

work page doi:10.18653/v1/2025.acl-long.50 2025

[13] [14]

She eats three for breakfast every morning and bakes muffins for her friends every day with four

Earnings from Selling Eggs: 9 eggs × $2/egg = $18/day

Final Answer: 18

Intervened Response

Prompt: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers’ market daily for $2 per fresh duck egg. How much in dollars does she make every day at the ...


[14] [15]

• For Muffins: She uses 4 eggs daily

1. Total Eggs Laid Per Day: Janet’s ducks lay a total of: 16 eggs/day
2. Eggs Used by Janet:
• For Breakfast: Janet eats 3 eggs every morning.
• For Muffins: She uses 4 eggs daily.
Total Eggs Used: 3 + 4 = 7 eggs/day
3. Eggs Available for Sale: 16 − 7 = 9 eggs/day
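
The arithmetic in this example response can be checked with a few lines of Python. This is only an illustrative sketch: the function name `daily_egg_earnings` and its parameter names are hypothetical and do not come from the paper; the input values (16 eggs laid, 3 for breakfast, 4 for muffins, $2 per egg) are taken from the prompt quoted in the example.

```python
# Illustrative check of the egg arithmetic in the example response.
# The function and variable names are hypothetical, not from the paper.

def daily_egg_earnings(laid, breakfast, muffins, price_per_egg):
    """Return (eggs used, eggs left for sale, daily earnings in dollars)."""
    used = breakfast + muffins          # eggs Janet consumes each day
    for_sale = laid - used              # eggs remaining to sell
    earnings = for_sale * price_per_egg # dollars earned at the market
    return used, for_sale, earnings


used, for_sale, earnings = daily_egg_earnings(
    laid=16, breakfast=3, muffins=4, price_per_egg=2
)
print(f"used={used}, for_sale={for_sale}, earnings=${earnings}")
```

Each returned value corresponds to one numbered step of the response, so a mismatch at any step would show where an injected arithmetic error enters the chain.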


[15] [16]

entire week

Earnings from Selling Eggs: 9 eggs × $2/egg = $18

Final Answer: 18

Figure 10. Complete example shown in Figure 1 generated by DeepSeek-R1-Distill-Qwen-32B (R1-32B). As laid out in Section 3.1, we generate the arithmetic error with GPT-5 (highlighted by a box in the right column) and insert this error immediately after the ⟨think⟩ token, which causes the model t...
