Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Pith reviewed 2026-05-08 11:37 UTC · model grok-4.3
The pith
Overthinking failures in medical LLMs are linearly decodable from hidden states yet resist correction by any tested fixed residual-stream steering vector.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Overthinking constitutes a stable behavioral regime in medical question answering in which models succeed under resampling yet fail in extended chain-of-thought, with Jaccard similarity at least 0.81. This regime is linearly decodable from residual-stream activations at 71.6 percent balanced accuracy. Five families of fixed residual-stream linear steering, spanning 29 configurations and 1,273 trials, yield performance deltas near zero, with identical null results on a second architecture (Qwen2.5-7B) and a second domain (MMLU-STEM). The probe nevertheless supports post-generation abstention at held-out AUROC 0.610, exceeding five uncertainty baselines.
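For readers who want the mechanics, here is a minimal sketch of the decoding setup. It assumes residual-stream activations have been cached as a feature matrix with a binary overthinking label per instance; the paper's actual layer choice, pooling, and probe hyperparameters are not given here, so `X_acts`, `y_ot`, and the regularization are illustrative placeholders.

```python
# Minimal sketch: linear decoding of the overthinking (OT) regime from
# cached residual-stream activations. X_acts is an (n, d_model) array of
# activations and y_ot a binary OT label per instance; both are random
# placeholders standing in for the paper's data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
X_acts = rng.normal(size=(1273, 4096))       # placeholder activations
y_ot = rng.integers(0, 2, size=1273)         # placeholder OT labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_acts, y_ot, test_size=0.3, stratify=y_ot, random_state=0)

probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
bal_acc = balanced_accuracy_score(y_te, probe.predict(X_te))
print(f"balanced accuracy: {bal_acc:.3f}")   # paper reports 71.6% on real data

# The probe's weight vector defines the candidate "OT direction".
ot_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```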
What carries the argument
The overthinking direction extracted by a linear probe on residual-stream activations, which overlaps 85-88 percent with task-critical directions and therefore resists correction by fixed steering vectors.
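The review does not define the 85-88 percent overlap metric precisely. One natural reading, sketched below under that assumption, is the fraction of the probe direction that lies inside a subspace spanned by task-critical directions; the paper's actual overlap and specificity-ratio definitions may differ.

```python
# Hedged sketch of one plausible overlap diagnostic: how much of the OT
# probe direction lies in the span of task-critical directions.
import numpy as np

def subspace_overlap(direction, task_dirs):
    """Fraction of `direction`'s norm captured by span(task_dirs).

    direction: (d,) vector, e.g. the OT probe direction.
    task_dirs: (k, d) matrix of task-critical directions.
    """
    q, _ = np.linalg.qr(task_dirs.T)          # orthonormal basis, (d, k)
    projected = q @ (q.T @ direction)         # projection onto the subspace
    return float(np.linalg.norm(projected) / np.linalg.norm(direction))

rng = np.random.default_rng(0)
task = rng.normal(size=(10, 512))
probe_dir = task[0] + 0.1 * rng.normal(size=512)  # mostly task-aligned
print(subspace_overlap(probe_dir, task))          # close to 1.0
```

An overlap of 0.85-0.88 in this sense would mean most of the OT direction is shared with task computation, leaving little OT-specific component for a fixed vector to push on.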
If this is right
- The same linear probe enables selective abstention that beats all tested uncertainty baselines on held-out data.
- Steering in the shared direction without targeting the failure regime reduces overall accuracy by 12.1 percentage points (the sketch after this list shows the style of intervention at issue).
- Erasing the overthinking direction via LEACE reduces accuracy by 3.6 percentage points, while random erasures produce no change.
- The probe-steering correlation per instance is near zero, showing that decodability does not translate into steerability.
- Results replicate across model architectures and across medical and STEM domains.
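For concreteness, a hedged sketch of what "fixed residual-stream linear steering" means in a transformers-style model: a constant vector added to one layer's hidden states on every forward pass. The hook point, the Llama-style module path `model.model.layers`, the layer index, and the scale alpha are all assumptions, not the paper's code; its 29 configurations presumably vary exactly these choices.

```python
# Sketch of fixed residual-stream steering: add alpha * v to one layer's
# hidden states on every forward pass via a PyTorch forward hook.
import torch

def add_steering_hook(model, layer_idx, direction, alpha):
    v = direction / direction.norm()

    def hook(module, inputs, output):
        # Decoder layers in Hugging Face models typically return a tuple
        # whose first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * v.to(hidden.dtype).to(hidden.device)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    layer = model.model.layers[layer_idx]     # Llama-style path (assumed)
    return layer.register_forward_hook(hook)

# Usage: handle = add_steering_hook(model, 16, ot_direction_tensor, -4.0)
# ... generate, measure the accuracy delta, then handle.remove()
```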
Where Pith is reading between the lines
- Failure modes that are linearly readable may still require non-linear or dynamic interventions for correction rather than fixed vectors.
- Reliability estimation via abstention could be deployed immediately in medical QA while correction methods are developed (a minimal abstention sketch follows this list).
- High overlap between failure and task directions suggests that future steering should isolate instance-specific components instead of using global vectors.
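A minimal abstention sketch, assuming the probe's predicted OT probability is used directly as a risk score; the threshold and the paper's five uncertainty baselines are not reproduced here.

```python
# Sketch: selective abstention from probe scores. Abstain on instances the
# probe flags as likely OT; report AUROC of the score against failure
# labels and accuracy on the answered (non-abstained) subset.
import numpy as np
from sklearn.metrics import roc_auc_score

def selective_abstention(probe, X, is_failure, answer_correct, threshold=0.5):
    p_ot = probe.predict_proba(X)[:, 1]       # OT probability as risk score
    auroc = roc_auc_score(is_failure, p_ot)   # paper reports 0.610 held-out
    answered = p_ot < threshold               # abstain above the threshold
    coverage = answered.mean()
    selective_acc = (answer_correct[answered].mean()
                     if answered.any() else float("nan"))
    return auroc, coverage, selective_acc
```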
Load-bearing premise
The five families of fixed linear steering vectors tested are broad enough that their failure to correct overthinking implies no linear residual-stream steering can succeed.
What would settle it
Discovery of one linear steering vector that, when added to the residual stream at inference time, raises accuracy on overthinking instances by at least 5 percentage points while leaving overall accuracy unchanged or improved.
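The criterion is mechanical enough to state as a check. A minimal sketch with the 5-point threshold from the text; the accuracy inputs are whatever is measured with and without the candidate vector applied at inference time.

```python
# Sketch of the settling criterion: a candidate steering vector settles the
# question if it lifts accuracy on OT instances by >= 5 percentage points
# without lowering overall accuracy.
def settles_question(acc_ot_base, acc_ot_steered,
                     acc_all_base, acc_all_steered,
                     min_ot_gain_pp=5.0):
    ot_gain_pp = 100.0 * (acc_ot_steered - acc_ot_base)
    overall_ok = acc_all_steered >= acc_all_base
    return ot_gain_pp >= min_ot_gain_pp and overall_ok

# Example: +6pp on OT instances, overall accuracy unchanged -> True
print(settles_question(0.40, 0.46, 0.72, 0.72))
```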
Original abstract
Can linearly decodable failure signals in LLM hidden states be leveraged to correct those failures? We investigate this classification-correction gap via Overthinking (OT)--a stable behavioral regime (Jaccard >= 0.81, 94% inter-annotator agreement) in medical QA where models answer correctly under resampling yet fail in extended chain-of-thought. OT is linearly decodable at 71.6% balanced accuracy (p < 10^{-16}). Yet five families of fixed linear steering (29 configurations, n=1,273) all yield Delta ~= 0, with identical null results cross-architecture (Qwen2.5-7B) and cross-domain (MMLU-STEM). Three convergent lines of evidence suggest representational entanglement: the OT direction has 85-88% overlap with task-critical computation (specificity ratio <= 0.152); non-targeted shared-direction steering damages accuracy (-12.1pp); and LEACE concept erasure damages accuracy (-3.6pp, p=0.01), while 10 random erasures produce Delta=+0.3pp. The per-instance probe-steering correlation is r=-0.002 (p=0.97). Positively, the same probe enables selective abstention (held-out AUROC=0.610, exceeding all five uncertainty baselines, p=0.009): decodable failure structure supports post-generation reliability estimation even when the fixed linear steering family cannot exploit it for correction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates the classification-correction gap for failure modes in LLMs by studying Overthinking (OT), a stable regime in medical QA where models succeed under resampling but fail in extended CoT (Jaccard >=0.81, 94% IAA). OT is linearly decodable at 71.6% balanced accuracy (p<10^{-16}). Across five families of fixed residual-stream linear steering (29 configurations, n=1,273), all yield Delta~0 with null results replicated cross-architecture (Qwen2.5-7B) and cross-domain (MMLU-STEM). Convergent evidence for entanglement includes 85-88% overlap with task directions (specificity ratio <=0.152), non-targeted steering damage (-12.1pp), LEACE erasure harming accuracy (-3.6pp, p=0.01) vs. random erasures (+0.3pp), and near-zero probe-steering correlation (r=-0.002). The probe nonetheless supports selective abstention (held-out AUROC=0.610 > baselines, p=0.009).
Significance. If the results hold, the work demonstrates that linear decodability of a failure signal does not imply correctability via fixed residual-stream linear interventions, providing a concrete empirical example of representational entanglement in a high-stakes domain. The convergent diagnostics (overlap, specificity, damage, erasure, correlation) and cross-checks make the negative steering result internally consistent rather than an artifact of underpowered methods. The positive abstention result shows practical utility for reliability estimation even when correction fails. This contributes to mechanistic interpretability by delineating limits of simple linear interventions.
Major comments (1)
- [Steering experiments] Steering experiments section: The central negative claim (no correction by fixed linear steering) rests on null results across 29 configurations in five families. For these nulls to bear the weight of the scoped conclusion, the manuscript should explicitly report statistical power, exact p-values or confidence intervals for Delta ~= 0 in each family, and whether the steering vectors were derived from data held out from the probe training set.
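One standard way to make "Delta ~= 0" statistically explicit is an equivalence test (TOST): two one-sided t-tests against pre-registered bounds of plus or minus delta. A minimal sketch with an illustrative bound; the manuscript would need to choose the bound on substantive grounds.

```python
# Sketch: TOST equivalence test for a per-family steering delta ~ 0.
# deltas: per-instance accuracy differences (steered minus baseline).
# bound: the smallest correction considered meaningful (assumed here).
import numpy as np
from scipy import stats

def tost_equivalence(deltas, bound=0.02):
    n = len(deltas)
    mean, se = np.mean(deltas), stats.sem(deltas)
    t_low = (mean + bound) / se               # H0: mean <= -bound
    t_high = (mean - bound) / se              # H0: mean >= +bound
    p_low = 1 - stats.t.cdf(t_low, df=n - 1)
    p_high = stats.t.cdf(t_high, df=n - 1)
    return mean, max(p_low, p_high)           # small p => equivalent to 0
```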
Minor comments (2)
- [Abstract] Abstract and methods: Define the specificity ratio (<=0.152) and the five steering families with a brief equation or pseudocode on first use; the current phrasing leaves the overlap metric ambiguous for readers unfamiliar with the LEACE baseline.
- [Results] Results: The per-instance correlation r=-0.002 (p=0.97) is reported but the exact number of instances and any multiple-comparison correction across the 29 configs should be stated to allow direct assessment of the null.
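If each of the 29 configurations is tested individually, a family-wise correction is straightforward to report. A sketch using Holm's method; the referee does not mandate any particular correction, so this is one reasonable choice.

```python
# Sketch: Holm correction across the 29 steering-configuration p-values.
from statsmodels.stats.multitest import multipletests

def holm_correct(pvals, alpha=0.05):
    reject, p_adj, _, _ = multipletests(pvals, alpha=alpha, method="holm")
    return reject, p_adj

# Usage: reject, p_adj = holm_correct(per_config_pvals)
```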
Simulated Author's Rebuttal
We thank the referee for their constructive and positive review, including the recommendation for minor revision. We address the major comment below and will incorporate the requested details into the revised manuscript.
Point-by-point responses
Referee: [Steering experiments] Steering experiments section: The central negative claim (no correction by fixed linear steering) rests on null results across 29 configurations in five families. For these nulls to bear the weight of the scoped conclusion, the manuscript should explicitly report statistical power, exact p-values or confidence intervals for Delta ~= 0 in each family, and whether the steering vectors were derived from data held out from the probe training set.
Authors: We agree that explicit reporting of statistical power, exact p-values, and confidence intervals will strengthen the presentation of the null results and make the central claim more robust. In the revised manuscript we will add these quantities for each of the five steering families (and, where relevant, per configuration). Post-hoc power analysis (based on the observed near-zero effect sizes and our total n=1,273) will be included to quantify the ability to detect small corrections if they existed. Regarding data partitioning: the steering vectors were derived from a held-out subset that was disjoint from the probe training data; we will state this explicitly in the methods and results sections to eliminate any ambiguity. These additions do not change the reported findings or conclusions.
Revision: yes
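The promised sensitivity analysis can be phrased as: given n = 1,273 and alpha = 0.05, what is the minimum detectable effect at 80 percent power? A hedged sketch using statsmodels; treating the per-instance deltas as a one-sample (paired) design is an assumption about the paper's setup.

```python
# Sketch: minimum detectable effect size (Cohen's d) for a one-sample /
# paired test at n = 1273, alpha = 0.05, power = 0.80.
from statsmodels.stats.power import TTestPower

mde = TTestPower().solve_power(effect_size=None, nobs=1273,
                               alpha=0.05, power=0.80,
                               alternative="two-sided")
print(f"minimum detectable d: {mde:.3f}")     # small d at this sample size
```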
Circularity Check
No significant circularity identified
Full rationale
The paper is an empirical study reporting experimental results on linear decodability of overthinking (OT) in medical QA, null effects from 29 fixed residual-stream steering configurations (n=1,273), entanglement diagnostics, and selective abstention performance. No derivation chain reduces a central claim to its own fitted inputs, self-citations, or built-in ansatzes; all load-bearing quantities (balanced accuracy, Delta values, AUROC, correlation coefficients) are measured directly on held-out data and cross-validated across architectures and domains without internal redefinition or statistical forcing.