Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Pith reviewed 2026-05-08 11:37 UTC · model grok-4.3
The pith
Overthinking failures in medical LLMs are linearly decodable from hidden states yet resist correction by any tested fixed residual-stream steering vector.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Overthinking constitutes a stable behavioral regime in medical question answering in which models succeed under resampling yet fail in extended chain-of-thought, with Jaccard similarity at least 0.81. This regime is linearly decodable from residual-stream activations at 71.6 percent balanced accuracy. Five families of fixed residual-stream linear steering, spanning 29 configurations and 1,273 trials, yield performance deltas near zero, with identical null results on a second architecture (Qwen2.5-7B) and a second domain (MMLU-STEM). The probe nevertheless supports post-generation abstention at held-out AUROC 0.610, exceeding five uncertainty baselines.
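For readers who want the mechanics, here is a minimal sketch of the decoding setup. It assumes residual-stream activations have been cached as a feature matrix with a binary overthinking label per instance; the paper's actual layer choice, pooling, and probe hyperparameters are not given here, so `X_acts`, `y_ot`, and the regularization are illustrative placeholders.

```python
# Minimal sketch: linear decoding of the overthinking (OT) regime from
# cached residual-stream activations. X_acts is an (n, d_model) array of
# activations and y_ot a binary OT label per instance; both are random
# placeholders standing in for the paper's data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
X_acts = rng.normal(size=(1273, 4096))       # placeholder activations
y_ot = rng.integers(0, 2, size=1273)         # placeholder OT labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_acts, y_ot, test_size=0.3, stratify=y_ot, random_state=0)

probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
bal_acc = balanced_accuracy_score(y_te, probe.predict(X_te))
print(f"balanced accuracy: {bal_acc:.3f}")   # paper reports 71.6% on real data

# The probe's weight vector defines the candidate "OT direction".
ot_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```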
What carries the argument
The overthinking direction extracted by a linear probe on residual-stream activations, which overlaps 85-88 percent with task-critical directions and therefore resists correction by fixed steering vectors.
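The review does not define the 85-88 percent overlap metric precisely. One natural reading, sketched below under that assumption, is the fraction of the probe direction that lies inside a subspace spanned by task-critical directions; the paper's actual overlap and specificity-ratio definitions may differ.

```python
# Hedged sketch of one plausible overlap diagnostic: how much of the OT
# probe direction lies in the span of task-critical directions.
import numpy as np

def subspace_overlap(direction, task_dirs):
    """Fraction of `direction`'s norm captured by span(task_dirs).

    direction: (d,) vector, e.g. the OT probe direction.
    task_dirs: (k, d) matrix of task-critical directions.
    """
    q, _ = np.linalg.qr(task_dirs.T)          # orthonormal basis, (d, k)
    projected = q @ (q.T @ direction)         # projection onto the subspace
    return float(np.linalg.norm(projected) / np.linalg.norm(direction))

rng = np.random.default_rng(0)
task = rng.normal(size=(10, 512))
probe_dir = task[0] + 0.1 * rng.normal(size=512)  # mostly task-aligned
print(subspace_overlap(probe_dir, task))          # close to 1.0
```

An overlap of 0.85-0.88 in this sense would mean most of the OT direction is shared with task computation, leaving little OT-specific component for a fixed vector to push on.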
If this is right
- The same linear probe enables selective abstention that beats all tested uncertainty baselines on held-out data.
- Steering in the shared direction without targeting the failure regime reduces overall accuracy by 12.1 percentage points (the sketch after this list shows the style of intervention at issue).
- Erasing the overthinking direction via LEACE reduces accuracy by 3.6 percentage points, while random erasures produce no change.
- The probe-steering correlation per instance is near zero, showing that decodability does not translate into steerability.
- Results replicate across model architectures and across medical and STEM domains.
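For concreteness, a hedged sketch of what "fixed residual-stream linear steering" means in a transformers-style model: a constant vector added to one layer's hidden states on every forward pass. The hook point, the Llama-style module path `model.model.layers`, the layer index, and the scale alpha are all assumptions, not the paper's code; its 29 configurations presumably vary exactly these choices.

```python
# Sketch of fixed residual-stream steering: add alpha * v to one layer's
# hidden states on every forward pass via a PyTorch forward hook.
import torch

def add_steering_hook(model, layer_idx, direction, alpha):
    v = direction / direction.norm()

    def hook(module, inputs, output):
        # Decoder layers in Hugging Face models typically return a tuple
        # whose first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * v.to(hidden.dtype).to(hidden.device)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    layer = model.model.layers[layer_idx]     # Llama-style path (assumed)
    return layer.register_forward_hook(hook)

# Usage: handle = add_steering_hook(model, 16, ot_direction_tensor, -4.0)
# ... generate, measure the accuracy delta, then handle.remove()
```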
Where Pith is reading between the lines
- Failure modes that are linearly readable may still require non-linear or dynamic interventions for correction rather than fixed vectors.
- Reliability estimation via abstention could be deployed immediately in medical QA while correction methods are developed (a minimal abstention sketch follows this list).
- High overlap between failure and task directions suggests that future steering should isolate instance-specific components instead of using global vectors.
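A minimal abstention sketch, assuming the probe's predicted OT probability is used directly as a risk score; the threshold and the paper's five uncertainty baselines are not reproduced here.

```python
# Sketch: selective abstention from probe scores. Abstain on instances the
# probe flags as likely OT; report AUROC of the score against failure
# labels and accuracy on the answered (non-abstained) subset.
import numpy as np
from sklearn.metrics import roc_auc_score

def selective_abstention(probe, X, is_failure, answer_correct, threshold=0.5):
    p_ot = probe.predict_proba(X)[:, 1]       # OT probability as risk score
    auroc = roc_auc_score(is_failure, p_ot)   # paper reports 0.610 held-out
    answered = p_ot < threshold               # abstain above the threshold
    coverage = answered.mean()
    selective_acc = (answer_correct[answered].mean()
                     if answered.any() else float("nan"))
    return auroc, coverage, selective_acc
```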
Load-bearing premise
The five families of fixed linear steering vectors tested are broad enough that their failure to correct overthinking implies no linear residual-stream steering can succeed.
What would settle it
Discovery of one linear steering vector that, when added to the residual stream at inference time, raises accuracy on overthinking instances by at least 5 percentage points while leaving overall accuracy unchanged or improved.
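The criterion is mechanical enough to state as a check. A minimal sketch with the 5-point threshold from the text; the accuracy inputs are whatever is measured with and without the candidate vector applied at inference time.

```python
# Sketch of the settling criterion: a candidate steering vector settles the
# question if it lifts accuracy on OT instances by >= 5 percentage points
# without lowering overall accuracy.
def settles_question(acc_ot_base, acc_ot_steered,
                     acc_all_base, acc_all_steered,
                     min_ot_gain_pp=5.0):
    ot_gain_pp = 100.0 * (acc_ot_steered - acc_ot_base)
    overall_ok = acc_all_steered >= acc_all_base
    return ot_gain_pp >= min_ot_gain_pp and overall_ok

# Example: +6pp on OT instances, overall accuracy unchanged -> True
print(settles_question(0.40, 0.46, 0.72, 0.72))
```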
Original abstract
Can linearly decodable failure signals in LLM hidden states be leveraged to correct those failures? We investigate this classification-correction gap via Overthinking (OT)--a stable behavioral regime (Jaccard >= 0.81, 94% inter-annotator agreement) in medical QA where models answer correctly under resampling yet fail in extended chain-of-thought. OT is linearly decodable at 71.6% balanced accuracy (p < 10^{-16}). Yet five families of fixed linear steering (29 configurations, n=1,273) all yield Delta ~= 0, with identical null results cross-architecture (Qwen2.5-7B) and cross-domain (MMLU-STEM). Three convergent lines of evidence suggest representational entanglement: the OT direction has 85-88% overlap with task-critical computation (specificity ratio <= 0.152); non-targeted shared-direction steering damages accuracy (-12.1pp); and LEACE concept erasure damages accuracy (-3.6pp, p=0.01), while 10 random erasures produce Delta=+0.3pp. The per-instance probe-steering correlation is r=-0.002 (p=0.97). Positively, the same probe enables selective abstention (held-out AUROC=0.610, exceeding all five uncertainty baselines, p=0.009): decodable failure structure supports post-generation reliability estimation even when the fixed linear steering family cannot exploit it for correction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates the classification-correction gap for failure modes in LLMs by studying Overthinking (OT), a stable regime in medical QA where models succeed under resampling but fail in extended CoT (Jaccard >=0.81, 94% IAA). OT is linearly decodable at 71.6% balanced accuracy (p<10^{-16}). Across five families of fixed residual-stream linear steering (29 configurations, n=1,273), all yield Delta~0 with null results replicated cross-architecture (Qwen2.5-7B) and cross-domain (MMLU-STEM). Convergent evidence for entanglement includes 85-88% overlap with task directions (specificity ratio <=0.152), non-targeted steering damage (-12.1pp), LEACE erasure harming accuracy (-3.6pp, p=0.01) vs. random erasures (+0.3pp), and near-zero probe-steering correlation (r=-0.002). The probe nonetheless supports selective abstention (held-out AUROC=0.610 > baselines, p=0.009).
Significance. If the results hold, the work demonstrates that linear decodability of a failure signal does not imply correctability via fixed residual-stream linear interventions, providing a concrete empirical example of representational entanglement in a high-stakes domain. The convergent diagnostics (overlap, specificity, damage, erasure, correlation) and cross-checks make the negative steering result internally consistent rather than an artifact of underpowered methods. The positive abstention result shows practical utility for reliability estimation even when correction fails. This contributes to mechanistic interpretability by delineating limits of simple linear interventions.
Major comments (1)
- [Steering experiments] Steering experiments section: The central negative claim (no correction by fixed linear steering) rests on null results across 29 configurations in five families. For these nulls to bear the weight of the scoped conclusion, the manuscript should explicitly report statistical power, exact p-values or confidence intervals for Delta ~= 0 in each family, and whether the steering vectors were derived from data held out from the probe training set.
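One standard way to make "Delta ~= 0" statistically explicit is an equivalence test (TOST): two one-sided t-tests against pre-registered bounds of plus or minus delta. A minimal sketch with an illustrative bound; the manuscript would need to choose the bound on substantive grounds.

```python
# Sketch: TOST equivalence test for a per-family steering delta ~ 0.
# deltas: per-instance accuracy differences (steered minus baseline).
# bound: the smallest correction considered meaningful (assumed here).
import numpy as np
from scipy import stats

def tost_equivalence(deltas, bound=0.02):
    n = len(deltas)
    mean, se = np.mean(deltas), stats.sem(deltas)
    t_low = (mean + bound) / se               # H0: mean <= -bound
    t_high = (mean - bound) / se              # H0: mean >= +bound
    p_low = 1 - stats.t.cdf(t_low, df=n - 1)
    p_high = stats.t.cdf(t_high, df=n - 1)
    return mean, max(p_low, p_high)           # small p => equivalent to 0
```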
Minor comments (2)
- [Abstract] Abstract and methods: Define the specificity ratio (<=0.152) and the five steering families with a brief equation or pseudocode on first use; the current phrasing leaves the overlap metric ambiguous for readers unfamiliar with the LEACE baseline.
- [Results] Results: The per-instance correlation r=-0.002 (p=0.97) is reported but the exact number of instances and any multiple-comparison correction across the 29 configs should be stated to allow direct assessment of the null.
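If each of the 29 configurations is tested individually, a family-wise correction is straightforward to report. A sketch using Holm's method; the referee does not mandate any particular correction, so this is one reasonable choice.

```python
# Sketch: Holm correction across the 29 steering-configuration p-values.
from statsmodels.stats.multitest import multipletests

def holm_correct(pvals, alpha=0.05):
    reject, p_adj, _, _ = multipletests(pvals, alpha=alpha, method="holm")
    return reject, p_adj

# Usage: reject, p_adj = holm_correct(per_config_pvals)
```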
Simulated Author's Rebuttal
We thank the referee for their constructive and positive review, including the recommendation for minor revision. We address the major comment below and will incorporate the requested details into the revised manuscript.
Point-by-point responses
Referee: [Steering experiments] Steering experiments section: The central negative claim (no correction by fixed linear steering) rests on null results across 29 configurations in five families. For these nulls to bear the weight of the scoped conclusion, the manuscript should explicitly report statistical power, exact p-values or confidence intervals for Delta ~= 0 in each family, and whether the steering vectors were derived from data held out from the probe training set.
Authors: We agree that explicit reporting of statistical power, exact p-values, and confidence intervals will strengthen the presentation of the null results and make the central claim more robust. In the revised manuscript we will add these quantities for each of the five steering families (and, where relevant, per configuration). Post-hoc power analysis (based on the observed near-zero effect sizes and our total n=1,273) will be included to quantify the ability to detect small corrections if they existed. Regarding data partitioning: the steering vectors were derived from a held-out subset that was disjoint from the probe training data; we will state this explicitly in the methods and results sections to eliminate any ambiguity. These additions do not change the reported findings or conclusions.
Revision: yes
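The promised sensitivity analysis can be phrased as: given n = 1,273 and alpha = 0.05, what is the minimum detectable effect at 80 percent power? A hedged sketch using statsmodels; treating the per-instance deltas as a one-sample (paired) design is an assumption about the paper's setup.

```python
# Sketch: minimum detectable effect size (Cohen's d) for a one-sample /
# paired test at n = 1273, alpha = 0.05, power = 0.80.
from statsmodels.stats.power import TTestPower

mde = TTestPower().solve_power(effect_size=None, nobs=1273,
                               alpha=0.05, power=0.80,
                               alternative="two-sided")
print(f"minimum detectable d: {mde:.3f}")     # small d at this sample size
```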
Circularity Check
No significant circularity identified
Full rationale
The paper is an empirical study reporting experimental results on linear decodability of overthinking (OT) in medical QA, null effects from 29 fixed residual-stream steering configurations (n=1,273), entanglement diagnostics, and selective abstention performance. No derivation chain reduces a central claim to its own fitted inputs, self-citations, or built-in ansatzes; all load-bearing quantities (balanced accuracy, Delta values, AUROC, correlation coefficients) are measured directly on held-out data and cross-validated across architectures and domains without internal redefinition or statistical forcing.