Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs

Can Jin; Chenxi Huang; Dexu Yu; Dimitris N. Metaxas; Guanyu Zhu; Hongwu Peng; Jiakang Li; Ronghao Chen; Xuanqi Lan; Yang Zhou; Youhua Li

arxiv: 2606.00726 · v1 · pith:TQLEP4SMnew · submitted 2026-05-30 · 💻 cs.AI

Jiakang Li, Guanyu Zhu, Can Jin, Chenxi Huang, Dexu Yu, Ronghao Chen, Yang Zhou, Hongwu Peng, Xuanqi Lan, Dimitris N. Metaxas, Youhua Li

Pith reviewed 2026-06-28 18:47 UTC · model grok-4.3

classification 💻 cs.AI

keywords Latent Reward Steering · Sparse Autoencoder · Reasoning LLMs · Inference-time steering · Cognitive behaviors · Reward gradients · Adaptive correction

The pith

A reward model trained only on final answer correctness can steer SAE latent states during inference to improve LLM reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Latent Reward Steering as a way to make cognitive behavior control in reasoning LLMs more adaptive across different states and tasks. Instead of using fixed lists of behaviors, it trains a reward model on full traces labeled by whether the final answer is right, then uses that model's gradients to adjust sparse autoencoder latent states that carry the behaviors. At inference time the method applies corrections only to states the reward and confidence signals mark as fragile. Experiments across multiple model backbones and benchmarks report gains over baselines, and later checks suggest the corrections implicitly encourage useful reasoning patterns that repair earlier mistakes. A reader would care because the approach avoids the rigidity of hand-specified steering directions that often fail to match the exact failure at hand.

Core claim

Latent Reward Steering trains a latent reward model on reasoning traces using only final-answer correctness labels. During generation it computes gradients from this model to supply correction directions for fragile SAE latent states and applies the corrections only when both reward and confidence gates indicate intervention is needed. This process yields higher benchmark scores than baselines on several reasoning LLMs and post-hoc inspection shows the adjustments tend to restore cognitive behaviors that resolve the original errors.
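The gated correction described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the logistic reward model, the names `reward`, `reward_grad`, and `steer`, the thresholds `tau_r` and `tau_c`, the step size `alpha`, and the choice to treat low confidence as fragile are all assumptions made for the example. The SAE encode and decode steps are omitted.

```python
import math

def reward(z, w, b):
    """Hypothetical latent reward model: logistic score of a latent state z."""
    s = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1.0 / (1.0 + math.exp(-s))

def reward_grad(z, w, b):
    """Gradient of the logistic reward with respect to z: r * (1 - r) * w."""
    r = reward(z, w, b)
    return [r * (1.0 - r) * wi for wi in w]

def steer(z, confidence, w, b, tau_r=0.5, tau_c=0.7, alpha=0.1):
    """Return (new_z, intervened). Steer only when both gates flag the state."""
    r = reward(z, w, b)
    # Assumed gate: intervene only if reward AND confidence are both low.
    if r >= tau_r or confidence >= tau_c:
        return list(z), False
    g = reward_grad(z, w, b)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(z), False
    # Take one normalized gradient-ascent step of size alpha on the reward.
    return [zi + alpha * gi / norm for zi, gi in zip(z, g)], True

# Toy reward-model parameters and a low-reward ("fragile") latent state.
w, b = [1.0, -2.0, 0.5], 0.0
fragile = [0.0, 1.0, 0.0]
z_new, hit = steer(fragile, confidence=0.4, w=w, b=b)
```

In this toy setup the fragile state is nudged along the reward gradient, while a confident or high-reward state is returned unchanged, which mirrors the selective intervention the summary describes.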

What carries the argument

The latent reward model trained on final-answer correctness that supplies gradients for correcting intermediate SAE latent states, gated by reward and confidence thresholds.

If this is right

Performance rises consistently across multiple reasoning LLM backbones and standard benchmarks.
The corrections promote cognitive behaviors that repair the specific reasoning errors present in the unsteered run.
Intervention occurs only on states the reward signal and confidence threshold jointly flag as fragile.
Steering directions emerge from data rather than from any pre-chosen list of behaviors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same gradient-steering idea could be tested on other internal representations besides sparse autoencoders.
If the reward model generalizes across domains, explicit behavior supervision might become less necessary for alignment.
Extending the approach to tasks where final-answer labels are noisier would test how far the core assumption holds.

Load-bearing premise

Signals derived from whether the final answer is correct can reliably indicate the quality of intermediate latent states inside the model.
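To make the premise concrete, the sketch below shows one way a reward model could be fitted when the only supervision is final-answer correctness: every intermediate latent state simply inherits the label of its trace. The logistic-regression form, the plain SGD loop, and the names `build_examples`, `train_reward_model`, and `score` are assumptions for illustration; the source does not specify the model architecture or training procedure.

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def build_examples(traces):
    """Give every intermediate latent state its trace's final-answer label."""
    return [(z, float(correct)) for states, correct in traces for z in states]

def train_reward_model(examples, dim, lr=0.5, epochs=200, seed=0):
    """Fit logistic-regression weights with plain SGD on (state, label) pairs."""
    rng = random.Random(seed)
    w, b = [0.0] * dim, 0.0
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for z, y in data:
            p = sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)
            err = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * err * zi for wi, zi in zip(w, z)]
            b -= lr * err
    return w, b

# Toy traces: (list of latent states, was the final answer correct?)
traces = [
    ([[1.0, 0.0], [0.9, 0.1]], True),
    ([[0.0, 1.0], [0.2, 0.8]], False),
]
w, b = train_reward_model(build_examples(traces), dim=2)

def score(z):
    """Estimated quality of a single latent state under the fitted model."""
    return sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)
```

The label propagation in `build_examples` is exactly where the premise bites: an early state in a trace that later goes wrong is labeled as bad even if the state itself was sound.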

What would settle it

Apply the method to a benchmark where the reward model has been shown to misjudge intermediate steps and measure whether accuracy still rises or falls compared with the unsteered baseline.
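A comparison of this kind could be scored per problem rather than only by aggregate accuracy. The sketch below counts answers the steered run fixes and answers it breaks, and runs an exact two-sided sign test on those discordant pairs (McNemar's exact test). The function name `paired_comparison` and the toy outcome lists are hypothetical; no numbers here come from the paper.

```python
from math import comb

def paired_comparison(baseline_correct, steered_correct):
    """Compare per-problem outcomes of an unsteered and a steered run."""
    assert len(baseline_correct) == len(steered_correct)
    n = len(baseline_correct)
    pairs = list(zip(baseline_correct, steered_correct))
    fixed = sum(1 for b, s in pairs if not b and s)   # wrong -> right
    broken = sum(1 for b, s in pairs if b and not s)  # right -> wrong
    delta = (sum(steered_correct) - sum(baseline_correct)) / n
    # Exact two-sided sign test on the discordant pairs.
    m = fixed + broken
    if m == 0:
        p_value = 1.0
    else:
        k = min(fixed, broken)
        tail = sum(comb(m, i) for i in range(k + 1)) / 2 ** m
        p_value = min(1.0, 2.0 * tail)
    return {"delta": delta, "fixed": fixed, "broken": broken, "p_value": p_value}

# Made-up outcomes for eight problems, for illustration only.
baseline = [True, False, False, True, False, True, False, False]
steered = [True, True, True, True, False, False, True, True]
result = paired_comparison(baseline, steered)
# result["fixed"] == 4, result["broken"] == 1, result["delta"] == 0.375
```

Reporting `broken` alongside `fixed` matters here because a misjudging reward model would show up as newly introduced errors even when net accuracy stays flat.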

Figures

Figures reproduced from arXiv: 2606.00726 by Can Jin, Chenxi Huang, Dexu Yu, Dimitris N. Metaxas, Guanyu Zhu, Hongwu Peng, Jiakang Li, Ronghao Chen, Xuanqi Lan, Yang Zhou, Youhua Li.

**Figure 2.** Framework of LRS. It first constructs SAE latent traces from reasoning trajectories, then trains a latent reward model to estimate intermediate-state quality. During inference, a reward and confidence gate identifies fragile states, and reward-guided latent correction updates the hidden state before decoding continues. … activation, enabling adaptive steering of fragile reasoning states without predefined b…

**Figure 3.** Latent reward scores distinguish successful and failed reasoning traces. We compare final-token reward …

**Figure 4.** LRS repairs an incomplete enumeration by recovering the missed valid partitions. … leading to an incorrect answer. In contrast, the LRS-steered trace receives early latent interventions when the reward model signal indicates that enumeration steps are fragile. After reward-guided optimization at the fragile enumeration stage, cognitive behavior relevant SAE latent dimensions associated with algebraic exec…

**Figure 5.** Post-hoc analysis shows that LRS-steered traces more often exhibit course correction, constraint grounding, and solution verification, with behavior labels used only for analysis. 4.5 Additional Analyses: Stability, Interpretability, and Efficiency. We include 3 supporting analyses to examine the stability, interpretability, and practical cost of LRS. Selective intervention and steering strength …

**Figure 6.** Steering-strength sensitivity on AIME24. Accuracy peaks under moderate updates rather than increasing monotonically, and the dashed line indicates standard decoding.

| Dim. | Max Act. | Interpreted Pattern |
| --- | --- | --- |
| z0 | 0.218 | Geometric / structural modeling |
| z1 | 0.287 | Theorem invocation / logical branching |
| z2 | 0.477 | Algebraic flow / step execution |
| z3 | 0.582 | Symbolic math / formatting |
| z4 | 0.552 | Variable initialization / constan… |

**Figure 7.** Zero-shot chain-of-thought prompt used as a …

**Figure 8.** Five-shot prompt template for math-style benchmarks.

**Figure 9.** Five-shot prompt template for GPQA-Diamond.

**Figure 10.** Five-shot prompt template for IneqMath.

**Figure 11.** Prompt used for post-hoc GPT-4o behavior annotation.

**Figure 12.** Qualitative example where LRS steers an incorrect reasoning trace toward verification and a correct answer.

**Figure 13.** AMC23 Q38: LRS corrects an overcount caused by including k in the selection pool.

**Figure 14.** AIME25 Q2: LRS recovers the missed valid partitions and the correct remainder.

**Figure 15.** AMC23 Q25: LRS fixes the imaginary-part constraint and obtains $|z|^2 = 50$.

**Figure 16.** GPQA Diamond Q39: LRS reframes the clue as chromatographic separation.

**Figure 17.** IneqMath Q84: LRS avoids a false simplification and applies AM-GM correctly.

Original abstract

Strong reasoning depends not only on model knowledge but also on how effectively cognitive behaviors are deployed during generation. Existing methods often rely on explicit behavior-level control, making them insufficiently adaptive when failures and required corrections vary across reasoning states, tasks, and models. To this end, we propose Latent Reward Steering (LRS), an adaptive inference-time framework that promotes cognitive behaviors by optimizing the sparse-autoencoder (SAE) latent states that implicitly carry them. Rather than relying on predefined cognitive behaviors or steering directions derived from them, LRS trains a latent reward model on reasoning traces by final answer correctness to estimate the quality of intermediate latent states. During inference, reward gradients provide state-specific correction directions for fragile latent states, while a reward and confidence gate restricts intervention to states the reward signal flags as fragile. Experiments on multiple reasoning LLM backbones and benchmarks show that LRS consistently improves performance over various baselines, and post-hoc analyses further indicate that LRS implicitly promotes good cognitive behaviors that fix the original reasoning errors. Code is available at: https://github.com/jiakanglee/Latent-Reward-Steering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LRS steers SAE latents at inference using gradients from a reward model trained only on final-answer correctness, but the abstract gives almost no numbers and the core generalization step looks untested.

The paper introduces Latent Reward Steering, which trains a reward model on complete reasoning traces labeled by final correctness, then uses its gradients to adjust SAE latents during generation when a gate flags the state as fragile. The claim is that this implicitly fixes cognitive errors without needing explicit behavior labels or retraining.

What is new is the combination of SAE latents as the steering target, a correctness-based reward, and the gated intervention that only acts on low-reward states. Releasing code is helpful. The post-hoc checks that steered traces more often show course correction, constraint grounding, and solution verification are a reasonable way to look for mechanism.

The main weakness is that the reward model sees only end-of-sequence labels yet is asked to score and correct partial latent states. Nothing in the abstract shows that the gradients actually point toward better intermediates rather than spurious directions that happen to match final-answer patterns. The reported gains are described as consistent but without baselines, effect sizes, or statistical tests, so it is impossible to tell how much the method moves the needle or whether the controls rule out confounds.

This is aimed at people working on inference-time control for reasoning models. A reader who already follows SAE work and reward-guided decoding might extract an idea or two, but the current write-up does not supply enough detail to evaluate the central claim.

I would send it to peer review so the experimental section can be checked properly, but the stress-test concern about intermediate generalization stands on the evidence given.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes Latent Reward Steering (LRS), an adaptive inference-time framework for reasoning LLMs. It trains a reward model on complete reasoning traces labeled solely by final-answer correctness to score intermediate SAE latent states. During inference, gradients from this model steer fragile latent states, with reward and confidence gates limiting interventions. Experiments across multiple backbones and benchmarks reportedly show performance gains over baselines, with post-hoc analyses suggesting implicit promotion of beneficial cognitive behaviors that correct original errors. The code is made available.

Significance. Should the core mechanism prove reliable, LRS could advance inference-time control of reasoning processes by implicitly targeting cognitive behaviors through latent space interventions without requiring explicit behavior definitions. This is potentially significant for improving LLM reasoning adaptively. The provision of code is a positive for enabling verification and extension.

major comments (2)

Abstract: the claim that LRS 'consistently improves performance over various baselines' and that post-hoc analyses show promotion of good behaviors is load-bearing for the central empirical claim, yet no quantitative details (effect sizes, baseline values, statistical tests, or controls) are provided to support it.
Method (latent reward model and steering): the reward model is fitted only to final-answer correctness labels and then used to score and supply gradients for intermediate SAE latent states; this assumption that the sparse end-of-sequence signal generalizes to partial trajectories without introducing new errors is not directly validated or ablated, undermining the claim that steering fixes original reasoning errors.
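The second comment asks for direct validation that a reward model trained on end-of-sequence labels is informative on partial trajectories. One simple check, sketched below under stated assumptions, reads the reward score at several relative positions in each trace and measures how well it separates eventually-correct from eventually-incorrect traces (AUC). The helper names `auc` and `auc_by_prefix`, the prefix fractions, and the toy scores are hypothetical and are not taken from the manuscript.

```python
def auc(scores, labels):
    """Chance that a correct trace outscores an incorrect one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect trace")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_prefix(trace_scores, labels, fractions=(0.25, 0.5, 1.0)):
    """AUC of the reward score read at several relative positions in each trace."""
    out = {}
    for f in fractions:
        # Index of the last step inside the first f-fraction of the trace.
        picked = [s[max(0, round(f * len(s)) - 1)] for s in trace_scores]
        out[f] = auc(picked, labels)
    return out

# Toy per-step reward scores for four traces and their final-answer labels.
trace_scores = [
    [0.5, 0.6, 0.8, 0.9],
    [0.4, 0.5, 0.7, 0.8],
    [0.5, 0.4, 0.3, 0.2],
    [0.6, 0.5, 0.4, 0.1],
]
labels = [True, True, False, False]
curve = auc_by_prefix(trace_scores, labels)
# curve == {0.25: 0.125, 0.5: 0.875, 1.0: 1.0}
```

An AUC near 0.5 (or below) at early prefixes, as in the first toy entry, would suggest the reward signal is not yet trustworthy at the point where steering is applied.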

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments, which help clarify the presentation of our empirical claims and the validation of our core methodological assumptions. We address each major comment below and indicate planned revisions.

Referee: Abstract: the claim that LRS 'consistently improves performance over various baselines' and that post-hoc analyses show promotion of good behaviors is load-bearing for the central empirical claim, yet no quantitative details (effect sizes, baseline values, statistical tests, or controls) are provided to support it.

Authors: We agree that the abstract would be strengthened by including concrete quantitative support for the performance claims. While the full manuscript contains tables reporting accuracy improvements, baseline comparisons, and statistical details across backbones and benchmarks, the abstract itself does not summarize these numbers. We will revise the abstract to include key effect sizes, representative baseline values, and mention of controls. revision: yes
Referee: Method (latent reward model and steering): the reward model is fitted only to final-answer correctness labels and then used to score and supply gradients for intermediate SAE latent states; this assumption that the sparse end-of-sequence signal generalizes to partial trajectories without introducing new errors is not directly validated or ablated, undermining the claim that steering fixes original reasoning errors.

Authors: This is a fair point on the need for explicit validation of the generalization assumption. Our post-hoc analyses show that interventions correlate with error correction and improved cognitive behaviors, but we did not include a dedicated ablation isolating the effect of applying the end-of-sequence-trained reward model to intermediate states. We will add such an ablation study (e.g., comparing steering with vs. without intermediate-state scoring and measuring introduced errors) to the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity; training and inference steps remain distinct

full rationale

The paper trains a latent reward model on complete reasoning traces labeled solely by final-answer correctness, then applies its gradients at inference time to steer SAE latent states. This separation between supervised training on end-of-sequence labels and inference-time correction does not reduce any claimed result to its own inputs by construction. No equations or self-citations are shown that make a prediction equivalent to a fitted parameter, nor is any uniqueness theorem or ansatz smuggled in. Post-hoc analyses are presented as downstream evidence rather than definitional. The reported gains are therefore measured against external benchmarks rather than against quantities the method itself defines.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on a trained reward model whose parameters are fitted to final-answer data and on the domain assumption that SAE latents encode cognitive behaviors.

free parameters (2)

Latent reward model weights
Trained on reasoning traces using final answer correctness as the supervision signal.
Intervention threshold and gate parameters
Chosen to restrict steering to fragile states flagged by the reward signal.

axioms (1)

Domain assumption: SAE latent states implicitly encode cognitive behaviors relevant to reasoning quality
Invoked to justify steering latents as a proxy for promoting better behaviors.

Reference graph

Works this paper leans on

85 extracted references · 44 canonical work pages · 15 internal anchors

[1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972
[2]

Publications Manual , year = "1983", publisher =

1983
[3]

and Kozen, Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
[5]

Dan Gusfield , title =. 1997

1997
[6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015
[7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
[8]

Advances in Neural Information Processing Systems , year =

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author =. Advances in Neural Information Processing Systems , year =
[9]

arXiv preprint arXiv:2601.20833 , year =

Base Models Know How to Reason, Thinking Models Learn When , author =. arXiv preprint arXiv:2601.20833 , year =

work page arXiv
[10]

Conference on Language Modeling (COLM) , year =

Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs , author =. Conference on Language Modeling (COLM) , year =
[11]

Conference on Language Modeling (COLM) , year =

SEAL: Steerable Reasoning Calibration of Large Language Models for Free , author =. Conference on Language Modeling (COLM) , year =
[12]

ICLR 2025 Workshop on Reasoning and Planning for LLMs , year =

Understanding Reasoning in Thinking Language Models via Steering Vectors , author =. ICLR 2025 Workshop on Reasoning and Planning for LLMs , year =

2025
[13]

arXiv preprint arXiv:2601.09269 , year =

RISER: Orchestrating Latent Reasoning Skills for Adaptive Activation Steering , author =. arXiv preprint arXiv:2601.09269 , year =

work page arXiv
[14]

arXiv preprint arXiv:2501.????? , year =

MetaScale: Test-Time Scaling with Evolving Meta-Thoughts , author =. arXiv preprint arXiv:2501.????? , year =
[15]

arXiv preprint arXiv:2501.04682 , year =

Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought , author =. arXiv preprint arXiv:2501.04682 , year =

work page arXiv
[16]

Steering Language Models With Activation Engineering

Steering Language Models with Activation Engineering , author =. arXiv preprint arXiv:2308.10248 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Representation Engineering: A Top-Down Approach to AI Transparency

Representation Engineering: A Top-Down Approach to AI Transparency , author =. arXiv preprint arXiv:2310.01405 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Base Models Know How to Reason, Thinking Models Learn When , author =
[19]

Large Language Models are Zero-Shot Reasoners

Large Language Models are Zero-Shot Reasoners , author =. arXiv preprint arXiv:2205.11916 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[20]

The Eleventh International Conference on Learning Representations , year =

Self-Consistency Improves Chain of Thought Reasoning in Language Models , author =. The Eleventh International Conference on Learning Representations , year =
[21]

Let's Verify Step by Step

Let's Verify Step by Step , author =. arXiv preprint arXiv:2305.20050 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[22]

STaR: Bootstrapping Reasoning With Reasoning

STaR: Bootstrapping Reasoning With Reasoning , author =. arXiv preprint arXiv:2203.14465 , year =

work page arXiv
[23]

Steering Llama 2 via Contrastive Activation Addition

Steering Llama 2 via Contrastive Activation Addition , author =. arXiv preprint arXiv:2312.06681 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Workshop on Reasoning and Planning for Large Language Models , year =

Understanding Reasoning in Thinking Language Models via Steering Vectors , author =. Workshop on Reasoning and Planning for Large Language Models , year =
[26]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Sparse Autoencoders Find Highly Interpretable Features in Language Models , author =. arXiv preprint arXiv:2309.08600 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[27]

2024 , note =

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet , author =. 2024 , note =

2024
[28]

arXiv preprint arXiv:2601.03595 , year =

Controllable LLM Reasoning via Sparse Autoencoder-Based Steering , author =. arXiv preprint arXiv:2601.03595 , year =

work page arXiv
[29]

arXiv preprint arXiv:2504.18587 , year =

Training Large Language Models to Reason via EM Policy Gradient , author =. arXiv preprint arXiv:2504.18587 , year =

work page arXiv
[30]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning , author =. arXiv preprint arXiv:2501.12948 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[31]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters , author =. arXiv preprint arXiv:2408.03314 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[32]

arXiv preprint arXiv:2512.24574 , year=

Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time , author=. arXiv preprint arXiv:2512.24574 , year=

work page arXiv
[33]

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin

Reasoning-finetuning repurposes latent representations in base models , author=. arXiv preprint arXiv:2507.12638 , year=

work page arXiv
[34]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

The lessons of developing process reward models in mathematical reasoning , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

2025
[35]

Large Language Models Cannot Self-Correct Reasoning Yet

Large language models cannot self-correct reasoning yet , author=. arXiv preprint arXiv:2310.01798 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[36]

Dissecting Failure Dynamics in Large Language Model Reasoning

Dissecting Failure Dynamics in Large Language Model Reasoning , author=. arXiv preprint arXiv:2604.14528 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[37]

arXiv preprint arXiv:2604.01639 , year=

Fragile Reasoning: A Mechanistic Analysis of LLM Sensitivity to Meaning-Preserving Perturbations , author=. arXiv preprint arXiv:2604.01639 , year=

work page arXiv
[38]

Advances in Neural Information Processing Systems , volume=

Chain-of-thought reasoning without prompting , author=. Advances in Neural Information Processing Systems , volume=
[39]

Advances in neural information processing systems , volume=

Tree of thoughts: Deliberate problem solving with large language models , author=. Advances in neural information processing systems , volume=
[40]

arXiv preprint arXiv:2501.02497 , year=

A survey of test-time compute: From intuitive inference to deliberate reasoning , author=. arXiv preprint arXiv:2501.02497 , year=

work page arXiv
[41]

arXiv preprint arXiv:2503.00177 , year=

Steering large language model activations in sparse spaces , author=. arXiv preprint arXiv:2503.00177 , year=

work page arXiv
[42]

arXiv preprint arXiv:2501.09929 , year=

Interpretable steering of large language models with feature guided activation additions , author=. arXiv preprint arXiv:2501.09929 , year=

work page arXiv
[43]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Sae-ssv: Supervised steering in sparse representation spaces for reliable control of language models , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

2025
[44]

Steering knowledge selection behaviours in LLMs via sae-based representation engineering , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2025
[45]

arXiv preprint arXiv:2602.10063 , year=

Chain of Mindset: Reasoning with Adaptive Cognitive Modes , author=. arXiv preprint arXiv:2602.10063 , year=

work page arXiv
[46]

arXiv preprint arXiv:2601.14290 , year=

Project Aletheia: Verifier-Guided Distillation of Backtracking for Small Language Models , author=. arXiv preprint arXiv:2601.14290 , year=

work page arXiv
[47]

arXiv preprint arXiv:2507.03704 , year=

Controlling thinking speed in reasoning models , author=. arXiv preprint arXiv:2507.03704 , year=

work page arXiv
[48]

Process reward models that think.arXiv preprint arXiv:2504.16828,

Process reward models that think , author=. arXiv preprint arXiv:2504.16828 , year=

work page arXiv
[49]

Autoregressive, Yet Revisable: In Decoding Revision for Secure Code Generation

Autoregressive, Yet Revisable: In Decoding Revision for Secure Code Generation , author=. arXiv preprint arXiv:2602.01187 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[50]

arXiv preprint arXiv:2501.15602 , year=

Rethinking external slow-thinking: From snowball errors to probability of correct reasoning , author=. arXiv preprint arXiv:2501.15602 , year=

work page arXiv
[51]

Findings of the Association for Computational Linguistics: ACL 2024 , pages=

LLMs cannot find reasoning errors, but can correct them given the error location , author=. Findings of the Association for Computational Linguistics: ACL 2024 , pages=

2024
[52]

Steered LLM Activations are Non-Surjective

Steered LLM Activations are Non-Surjective , author=. arXiv preprint arXiv:2604.09839 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[53]

International Conference on Machine Learning , pages=

Scaling laws for reward model overoptimization , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023
[54]

arXiv preprint arXiv:2410.12299 , year=

Semantics-adaptive activation intervention for llms via dynamic steering vectors , author=. arXiv preprint arXiv:2410.12299 , year=

work page arXiv
[55]

arXiv preprint arXiv:2510.13290 , year=

To steer or not to steer? mechanistic error reduction with abstention for language models , author=. arXiv preprint arXiv:2510.13290 , year=

work page arXiv
[56]

arXiv preprint arXiv:2602.04428 , year=

Fine-Grained Activation Steering: Steering Less, Achieving More , author=. arXiv preprint arXiv:2602.04428 , year=

work page arXiv
[57]

International Conference on Learning Representations , volume=

Let's verify step by step , author=. International Conference on Learning Representations , volume=
[58]

Measuring Mathematical Problem Solving With the MATH Dataset

Measuring mathematical problem solving with the math dataset , author=. arXiv preprint arXiv:2103.03874 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[59]

arXiv preprint arXiv:2507.12885 , year=

VAR-MATH: Probing True Mathematical Reasoning in LLMS via Symbolic Multi-Instance Benchmarks , author=. arXiv preprint arXiv:2507.12885 , year=

work page arXiv
[60]

Advances in Neural Information Processing Systems , volume=

Solving inequality proofs with large language models , author=. Advances in Neural Information Processing Systems , volume=
[61]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Gpqa: A graduate-level google-proof q&a benchmark , author=. arXiv preprint arXiv:2311.12022 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[62]

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Least-to-most prompting enables complex reasoning in large language models , author=. arXiv preprint arXiv:2205.10625 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[63]

International Conference on Learning Representations , volume=

Take a step back: Evoking reasoning via abstraction in large language models , author=. International Conference on Learning Representations , volume=
[64]

Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

Large language models are better reasoners with self-verification , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

2023
[65]

Advances in neural information processing systems , volume=

Self-refine: Iterative refinement with self-feedback , author=. Advances in neural information processing systems , volume=
[66]

Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[67] SelfCheck: Using LLMs to zero-shot check their own step-by-step reasoning. arXiv preprint arXiv:2308.00436.

[68] Chain-of-verification reduces hallucination in large language models. Findings of the Association for Computational Linguistics: ACL 2024.

[69] Latent thinking optimization: Your latent reasoning language model secretly encodes reward signals in its latent thoughts. arXiv preprint arXiv:2509.26314.

[70] Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winte... Language Models are Few-Shot Learners.

[71] UniCog: Uncovering Cognitive Abilities of LLMs through Latent Mind Space Analysis. arXiv preprint arXiv:2601.17897.

[72] Beyond Dense States: Elevating Sparse Transcoders to Active Operators for Latent Reasoning. arXiv preprint arXiv:2602.01695.

[73] Ren, Hangliang.

[74] Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model. Advances in Neural Information Processing Systems.

[75] MAA Invitational Competitions: American Invitational Mathematics Examination. n.d.

[76] 2024 AIME I Problems and Solutions. 2024.

[77] 2024 AIME II Problems and Solutions. 2024.

[78] 2025 AIME I Problems and Solutions. 2025.

[79] 2025 AIME II Problems and Solutions. 2025.

[80] AMC 2023 Dataset. 2025.

[81] APEER: Automatic prompt engineering enhances large language model reranking. Companion Proceedings of the ACM on Web Conference 2025.


[1] Alfred V. Aho and Jeffrey D. Ullman. 1972.

[2] Publications Manual. 1983.

[3] Ashok K. Chandra, Dexter C. Kozen, and Larry J. Stockmeyer. 1981. doi:10.1145/322234.322243.

[4] Galen Andrew and Jianfeng Gao. Scalable training of

[5] Dan Gusfield. 1997.

[6] Mohammad Sadegh Rasooli and Joel R. Tetreault. Computing Research Repository, 2015.

[7] Rie Kubota Ando and Tong Zhang. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research, December.

[8] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems.

[9] Base Models Know How to Reason, Thinking Models Learn When. arXiv preprint arXiv:2601.20833.

[10] Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs. Conference on Language Modeling (COLM).

[11] SEAL: Steerable Reasoning Calibration of Large Language Models for Free. Conference on Language Modeling (COLM).

[12] Understanding Reasoning in Thinking Language Models via Steering Vectors. ICLR 2025 Workshop on Reasoning and Planning for LLMs.

[13] RISER: Orchestrating Latent Reasoning Skills for Adaptive Activation Steering. arXiv preprint arXiv:2601.09269.

[14] MetaScale: Test-Time Scaling with Evolving Meta-Thoughts. arXiv preprint arXiv:2501.?????.

[15] Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought. arXiv preprint arXiv:2501.04682.

[16] Steering Language Models with Activation Engineering. arXiv preprint arXiv:2308.10248.

[17] Representation Engineering: A Top-Down Approach to AI Transparency. arXiv preprint arXiv:2310.01405.

[18] Base Models Know How to Reason, Thinking Models Learn When.

[19] Large Language Models are Zero-Shot Reasoners. arXiv preprint arXiv:2205.11916.

[20] Self-Consistency Improves Chain of Thought Reasoning in Language Models. The Eleventh International Conference on Learning Representations.

[21] Let's Verify Step by Step. arXiv preprint arXiv:2305.20050.

[22] STaR: Bootstrapping Reasoning With Reasoning. arXiv preprint arXiv:2203.14465.

[23] Steering Llama 2 via Contrastive Activation Addition. arXiv preprint arXiv:2312.06681.

[24] Understanding Reasoning in Thinking Language Models via Steering Vectors. Workshop on Reasoning and Planning for Large Language Models.

[26] Sparse Autoencoders Find Highly Interpretable Features in Language Models. arXiv preprint arXiv:2309.08600.

[27] Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. 2024.

[28] Controllable LLM Reasoning via Sparse Autoencoder-Based Steering. arXiv preprint arXiv:2601.03595.

[29] Training Large Language Models to Reason via EM Policy Gradient. arXiv preprint arXiv:2504.18587.

[30] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.

[31] Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters. arXiv preprint arXiv:2408.03314.

[32] Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time. arXiv preprint arXiv:2512.24574.

[33] Reasoning-finetuning repurposes latent representations in base models. arXiv preprint arXiv:2507.12638.

[34] The lessons of developing process reward models in mathematical reasoning. Findings of the Association for Computational Linguistics: ACL 2025.

[35] Large Language Models Cannot Self-Correct Reasoning Yet. arXiv preprint arXiv:2310.01798.

[36] Dissecting Failure Dynamics in Large Language Model Reasoning. arXiv preprint arXiv:2604.14528.

[37] Fragile Reasoning: A Mechanistic Analysis of LLM Sensitivity to Meaning-Preserving Perturbations. arXiv preprint arXiv:2604.01639.

[38] Chain-of-thought reasoning without prompting. Advances in Neural Information Processing Systems.

[39] Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems.

[40] A survey of test-time compute: From intuitive inference to deliberate reasoning. arXiv preprint arXiv:2501.02497.

[41] Steering large language model activations in sparse spaces. arXiv preprint arXiv:2503.00177.

[42] Interpretable steering of large language models with feature guided activation additions. arXiv preprint arXiv:2501.09929.

[43] SAE-SSV: Supervised steering in sparse representation spaces for reliable control of language models. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.

[44] Steering knowledge selection behaviours in LLMs via SAE-based representation engineering. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).

[45] Chain of Mindset: Reasoning with Adaptive Cognitive Modes. arXiv preprint arXiv:2602.10063.

[46] Project Aletheia: Verifier-Guided Distillation of Backtracking for Small Language Models. arXiv preprint arXiv:2601.14290.

[47] Controlling thinking speed in reasoning models. arXiv preprint arXiv:2507.03704.

[48] Process reward models that think. arXiv preprint arXiv:2504.16828.

[49] Autoregressive, Yet Revisable: In Decoding Revision for Secure Code Generation. arXiv preprint arXiv:2602.01187.

[50] Rethinking external slow-thinking: From snowball errors to probability of correct reasoning. arXiv preprint arXiv:2501.15602.

[51] LLMs cannot find reasoning errors, but can correct them given the error location. Findings of the Association for Computational Linguistics: ACL 2024.

[52] Steered LLM Activations are Non-Surjective. arXiv preprint arXiv:2604.09839.

[53] Scaling laws for reward model overoptimization. International Conference on Machine Learning, 2023.

[54] Semantics-adaptive activation intervention for LLMs via dynamic steering vectors. arXiv preprint arXiv:2410.12299.

[55] To steer or not to steer? Mechanistic error reduction with abstention for language models. arXiv preprint arXiv:2510.13290.

[56] Fine-Grained Activation Steering: Steering Less, Achieving More. arXiv preprint arXiv:2602.04428.

[57] Let's verify step by step. International Conference on Learning Representations.

[58] Measuring Mathematical Problem Solving With the MATH Dataset. arXiv preprint arXiv:2103.03874.

[59] VAR-MATH: Probing True Mathematical Reasoning in LLMs via Symbolic Multi-Instance Benchmarks. arXiv preprint arXiv:2507.12885.

[60] Solving inequality proofs with large language models. Advances in Neural Information Processing Systems.

[61] GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv preprint arXiv:2311.12022.

[62] Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. arXiv preprint arXiv:2205.10625.

[63] Take a step back: Evoking reasoning via abstraction in large language models. International Conference on Learning Representations.

[64] Large language models are better reasoners with self-verification. Findings of the Association for Computational Linguistics: EMNLP 2023.

[65] Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems.
