Consistency Training while Mitigating Obfuscation via Rate Matching
Pith reviewed 2026-06-28 14:23 UTC · model grok-4.3
The pith
Rate Matching Consistency Training reduces bias-following in language models comparably to standard methods while preserving verbalization of bias cues.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RMCT trains for consistency over selected behavioural properties by matching the rate at which the model exhibits a target behaviour across input perturbations, rather than requiring paired inputs with and without the extraneous feature or constraining how the behaviour is expressed, thereby reducing bias-following without inducing obfuscation of the bias cue.
What carries the argument
Rate Matching Consistency Training (RMCT), which matches the rate of exhibiting a target behaviour (such as following a bias cue) across input perturbations to enforce consistency without constraining expression of that behaviour.
If this is right
- Reductions in bias-following comparable to a standard consistency-training baseline on held-out bias types.
- Largely preserves the model's tendency to verbalise the bias cue.
- Extends consistency training to settings where the extraneous features cannot be removed.
- More data-efficient than the baseline while being less compute-efficient.
Where Pith is reading between the lines
- Rate matching may apply to other unwanted influences such as factual errors or unsafe suggestions.
- Preserved verbalization of cues could support human review of model reasoning traces.
- The approach might combine with other training signals that target different behavioural rates.
Load-bearing premise
That matching the rate of exhibiting a target behaviour across input perturbations will reduce the actual influence of extraneous features on the model's decisions.
What would settle it
Measure whether models trained with RMCT change their answers at lower rates than controls when bias cues are inserted or removed from otherwise identical prompts.
Figures
read the original abstract
Large language models are often influenced by extraneous input features, such as cues revealing a user's preferred answer. Consistency training reduces this influence by training models to behave similarly across inputs with and without the extraneous feature. However, existing methods train for consistency over entire responses or internal activations, which also constrains whether the model verbalises said extraneous features. We show this leads to obfuscation, where the model learns not to mention a cue while remaining influenced by it, which may undermine monitorability. To address this, we introduce Rate Matching Consistency Training (RMCT), which trains for consistency over selected behavioural properties without constraining how this behaviour is expressed. RMCT matches the rate at which the model exhibits a target behaviour (e.g., following a bias cue) across input perturbations, rather than requiring paired inputs with and without the extraneous feature, extending consistency training to settings where the extraneous features cannot be removed. We evaluate RMCT on sycophancy reduction in two open-weight language models, achieving reductions in bias-following comparable to a standard consistency-training baseline on held-out bias types, while largely preserving the model's tendency to verbalise the bias cue. Further, we find that RMCT is more data-efficient at the expense of being less compute-efficient in our experiments. Overall, RMCT shows that consistency training can improve behavioural robustness without directly trading off against monitorability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Rate Matching Consistency Training (RMCT), which extends consistency training by matching the rate at which models exhibit target behaviors (e.g., following bias cues) across input perturbations rather than enforcing consistency over full responses or activations. This aims to reduce influence of extraneous features like sycophancy cues while preserving verbalization of those cues. Evaluated on two open-weight LLMs for sycophancy reduction, RMCT achieves bias-following reductions comparable to standard consistency training on held-out bias types, is more data-efficient (though less compute-efficient), and largely preserves cue verbalization.
Significance. If the central claim holds—that rate matching reduces causal influence of extraneous features rather than merely adjusting output statistics—RMCT would meaningfully extend consistency training to non-removable feature settings while avoiding obfuscation trade-offs against monitorability. The empirical results on sycophancy provide a concrete starting point, and the data-efficiency finding is a clear strength worth highlighting.
major comments (2)
- [Abstract] Abstract and evaluation description: the claim that RMCT achieves 'reductions in bias-following comparable to a standard consistency-training baseline on held-out bias types' is load-bearing for the contribution, yet the abstract provides no effect sizes, exact metrics (e.g., bias-following rate deltas), statistical significance, or number of held-out types; without these in the results section the comparability cannot be assessed.
- [Method / Experiments] Method and evaluation sections: the core assumption that matching rates of target behavior across perturbations reduces the actual influence of the bias cue (rather than only its marginal frequency) is not directly tested. The sycophancy experiments allow feature removal, so they do not distinguish rate matching from internal de-biasing; an additional probe (e.g., activation patching or counterfactuals on non-removable cues) is needed to support the extension claim.
minor comments (2)
- [Abstract] The abstract states RMCT is 'more data-efficient at the expense of being less compute-efficient' but does not define the efficiency metrics (e.g., examples per epoch or FLOPs); add precise definitions and tables in the experimental section.
- [Method] Notation for 'rate matching' and 'target behaviour' should be formalized with an equation or pseudocode early in the method section to improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We agree that the abstract requires more quantitative detail and will revise accordingly. On the core methodological claim, we acknowledge the limitation of the current experimental setup and will add clarifying discussion, while noting that fully testing non-removable cues would require new experiments.
read point-by-point responses
-
Referee: [Abstract] Abstract and evaluation description: the claim that RMCT achieves 'reductions in bias-following comparable to a standard consistency-training baseline on held-out bias types' is load-bearing for the contribution, yet the abstract provides no effect sizes, exact metrics (e.g., bias-following rate deltas), statistical significance, or number of held-out types; without these in the results section the comparability cannot be assessed.
Authors: We agree that the abstract and results section should include specific quantitative details to substantiate the comparability claim. We will revise the abstract to report effect sizes (e.g., bias-following rate deltas), the exact metrics used, statistical significance where applicable, and the number of held-out bias types. The results section will be updated to explicitly present these values for transparency. revision: yes
-
Referee: [Method / Experiments] Method and evaluation sections: the core assumption that matching rates of target behavior across perturbations reduces the actual influence of the bias cue (rather than only its marginal frequency) is not directly tested. The sycophancy experiments allow feature removal, so they do not distinguish rate matching from internal de-biasing; an additional probe (e.g., activation patching or counterfactuals on non-removable cues) is needed to support the extension claim.
Authors: We acknowledge that the sycophancy experiments permit feature removal and therefore do not isolate whether rate matching reduces causal influence versus merely adjusting output statistics in non-removable settings. The current results demonstrate that RMCT achieves comparable bias reduction to standard consistency training while preserving cue verbalization, supporting the method's practical utility. However, we agree that probes such as activation patching on non-removable cues would provide stronger evidence for the extension claim. We will add a limitations paragraph clarifying this gap and identifying it as future work, but cannot perform the additional experiments in the current revision. revision: partial
- Requirement for additional empirical probes (activation patching or counterfactuals) on non-removable cues to directly test reduction of causal influence rather than marginal frequency
Circularity Check
No significant circularity; purely empirical method with independent experimental validation
full rationale
The paper introduces RMCT as a training procedure and evaluates it via held-out experiments on sycophancy reduction in open-weight models. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the method or claims. The central result (comparable bias reduction while preserving verbalization) rests on direct measurement against baselines and held-out bias types, not on any self-referential construction. This is the standard case of an empirical contribution whose validity is externally falsifiable.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
2024 , eprint =
Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought , author =. 2024 , eprint =
2024
-
[2]
arXiv preprint arXiv:2503.11926 , year=
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation , author=. arXiv preprint arXiv:2503.11926 , year=
-
[3]
2025 , eprint=
Output Supervision Can Obfuscate the Chain of Thought , author=. 2025 , eprint=
2025
-
[4]
Agarwal, Sandhini and Ahmad, Lama and Ai, Jason and Altman, Sam and Applebaum, Andy and Arbus, Edwin and Arora, Rahul K and Bai, Yu and Baker, Bowen and Bao, Haiming and others , journal=
-
[5]
Proceedings of the 43rd International Conference on Machine Learning , year =
Consistency Training Can Entrench Misalignment , author =. Proceedings of the 43rd International Conference on Machine Learning , year =
-
[6]
2026 , note =
Consistency Training Along the Transformer Stack , author =. 2026 , note =
2026
-
[7]
arXiv preprint arXiv:2407.21783 , year=
The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=
-
[8]
Liu, Jian and Cui, Leyang and Liu, Hanmeng and Huang, Dandan and Wang, Yile and Zhang, Yue , journal=
-
[9]
Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin , booktitle=
-
[10]
Phan, Long and Gatti, Alice and Han, Ziwen and Li, Nathaniel and Hu, Josephina and Zhang, Hugh and Zhang, Chen Bo Calvin and Shaaban, Mohamed and Ling, John and Shi, Sean and others , journal=
-
[11]
Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle=
-
[12]
Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, Y. K. and Wu, Yang and Guo, Daya , journal=
-
[13]
Chain-of-Thought Monitorability: A New and Fragile Opportunity for
Korbak, Tomek and Balesni, Mikita and Barnes, Elizabeth and Bengio, Yoshua and Benton, Joe and Bloom, Joseph and Chen, Mark and Cooney, Alan and Dafoe, Allan and Dragan, Anca and others , journal=. Chain-of-Thought Monitorability: A New and Fragile Opportunity for
-
[14]
Designing a Dashboard for Transparency and Control of Conversational
Chen, Yida and Wu, Aoyu and DePodesta, Trevor and Yeh, Catherine and Li, Kenneth and Marin, Nicholas Castillo and Patel, Oam and Riecke, Jan and Raval, Shivam and Seow, Olivia and Wattenberg, Martin and Viégas, Fernanda , journal=. Designing a Dashboard for Transparency and Control of Conversational
-
[15]
2025 , eprint =
Consistency Training Helps Stop Sycophancy and Jailbreaks , author =. 2025 , eprint =
2025
-
[16]
He, Horace and. Defeating Nondeterminism in. Thinking Machines Lab: Connectionism , year =. doi:10.64434/tml.20250910 , note =
-
[17]
2024 , eprint =
Frontier Models are Capable of In-context Scheming , author =. 2024 , eprint =
2024
-
[18]
arXiv preprint arXiv:2412.14093 , year=
Alignment faking in large language models , author=. arXiv preprint arXiv:2412.14093 , year=
-
[19]
and Ward, Francis Rhys , year =
van der Weij, Teun and Hofstätter, Felix and Jaffe, Ollie and Brown, Samuel F. and Ward, Francis Rhys , year =. 2406.07358 , archivePrefix =
-
[20]
Needham, Joe and Edkins, Giles and Pimpale, Govind and Bartsch, Henning and Hobbhahn, Marius , year =. Large. 2505.23836 , archivePrefix =
-
[21]
Nguyen, Jord and Hoang, Khiem and Attubato, Carlo Leonardo and Hofstätter, Felix , year =. Probing and. 2507.01786 , archivePrefix =
-
[22]
2025 , eprint=
Steering Evaluation-Aware Language Models to Act Like They Are Deployed , author=. 2025 , eprint=
2025
-
[23]
Schoen, Bronson and Nitishinskaya, Evgenia and Balesni, Mikita and Højmark, Axel and Hofstätter, Felix and Scheurer, Jérémy and Meinke, Alexander and Wolfe, Jason and van der Weij, Teun and Lloyd, Alex and Goldowsky-Dill, Nicholas and Fan, Angela and Matveiakin, Andrei and Shah, Rusheb and Williams, Marcus and Glaese, Amelia and Barak, Boaz and Zaremba, W...
-
[24]
Shah, Neil and Africa, David Demitri , year =
-
[25]
Wallace, Eric and Xiao, Kai and Leike, Reimar and Weng, Lilian and Heidecke, Johannes and Beutel, Alex , year =. The. 2404.13208 , archivePrefix =
-
[26]
2025 , month = nov, howpublished =
Mitigating the risk of prompt injections in browser use , author =. 2025 , month = nov, howpublished =
2025
-
[27]
Sharma, Mrinank and Tong, Meg and Mu, Jesse and Wei, Jerry and Kruthoff, Jorrit and Goodfriend, Scott and Ong, Euan and Peng, Alwin and Agarwal, Raj and Anil, Cem and Askell, Amanda and Bailey, Nathan and Benton, Joe and Bluemke, Emma and Bowman, Samuel R. and Christiansen, Eric and Cunningham, Hoagy and Dau, Andy and Gopal, Anjali and Gilson, Rob and Gra...
-
[28]
, year =
Taori, Rohan and Gulrajani, Ishaan and Zhang, Tianyi and Dubois, Yann and Li, Xuechen and Guestrin, Carlos and Liang, Percy and Hashimoto, Tatsunori B. , year =. Stanford
-
[29]
2026 , month = apr, howpublished =
Reproducing steering against evaluation awareness in a large open-weight model , author =. 2026 , month = apr, howpublished =
2026
-
[30]
2024 , eprint =
Analyzing the Generalization and Reliability of Steering Vectors , author =. 2024 , eprint =
2024
-
[31]
2023 , publisher =
Ruebsamen, Gene , title =. 2023 , publisher =
2023
-
[32]
Langley , title =
P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =
2000
-
[33]
T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980
1980
-
[34]
M. J. Kearns , title =
-
[35]
Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983
1983
-
[36]
R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000
2000
-
[37]
Suppressed for Anonymity , author=
-
[38]
Newell and P
A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981
1981
-
[39]
A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959
1959
-
[40]
, month = jul, year =
Imran, Sohaib and Kendiukhov, Ihor and Broerman, Matthew and Thomas, Aditya and Campanella, Riccardo and Lamb, Rob and Atkinson, Peter M. , month = jul, year =. Are
-
[41]
arXiv preprint arXiv:2309.13638 (2023)
McCoy, R. Thomas and Yao, Shunyu and Friedman, Dan and Hardy, Matthew and Griffiths, Thomas L. , month = sep, year =. Embers of. doi:10.48550/arXiv.2309.13638 , abstract =
-
[42]
and Nye, Maxwell and Andreas, Jacob , month = jun, year =
Li, Belinda Z. and Nye, Maxwell and Andreas, Jacob , month = jun, year =. Implicit. doi:10.48550/arXiv.2106.00737 , abstract =
-
[43]
Meinke, Alexander and Evans, Owain , month = dec, year =. Tell, don't show:. doi:10.48550/arXiv.2312.07779 , abstract =
-
[44]
Li, Daliang and Rawat, Ankit Singh and Zaheer, Manzil and Wang, Xin and Lukasik, Michal and Veit, Andreas and Yu, Felix and Kumar, Sanjiv , editor =. Large. Findings of the. 2023 , keywords =. doi:10.18653/v1/2023.findings-acl.112 , abstract =
-
[45]
Grosse, Roger and Bae, Juhan and Anil, Cem and Elhage, Nelson and Tamkin, Alex and Tajdini, Amirhossein and Steiner, Benoit and Li, Dustin and Durmus, Esin and Perez, Ethan and Hubinger, Evan and Lukošiūtė, Kamilė and Nguyen, Karina and Joseph, Nicholas and McCandlish, Sam and Kaplan, Jared and Bowman, Samuel R. , month = aug, year =. Studying. doi:10.485...
-
[46]
Borji, Ali , month = apr, year =. A. doi:10.48550/arXiv.2302.03494 , abstract =
-
[47]
Advances in Neural Information Processing Systems , author =
Faith and. Advances in Neural Information Processing Systems , author =. 2023 , keywords =
2023
-
[48]
Generative agents: Interactive simulacra of human behavior,
Park, Joon Sung and O'Brien, Joseph and Cai, Carrie Jun and Morris, Meredith Ringel and Liang, Percy and Bernstein, Michael S. , month = oct, year =. Generative. Proceedings of the 36th. doi:10.1145/3586183.3606763 , abstract =
-
[49]
Advances in Neural Information Processing Systems , author =
Judging. Advances in Neural Information Processing Systems , author =. 2023 , keywords =
2023
-
[50]
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Bubeck, Sébastien and Chandrasekaran, Varun and Eldan, Ronen and Gehrke, Johannes and Horvitz, Eric and Kamar, Ece and Lee, Peter and Lee, Yin Tat and Li, Yuanzhi and Lundberg, Scott and Nori, Harsha and Palangi, Hamid and Ribeiro, Marco Tulio and Zhang, Yi , month = apr, year =. Sparks of. doi:10.48550/arXiv.2303.12712 , abstract =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.12712
-
[51]
Huang, Jie and Chang, Kevin Chen-Chuan , editor =. Towards. Findings of the. 2023 , keywords =. doi:10.18653/v1/2023.findings-acl.67 , abstract =
-
[52]
Language
Deletang, Gregoire and Ruoss, Anian and Duquenne, Paul-Ambroise and Catt, Elliot and Genewein, Tim and Mattern, Christopher and Grau-Moya, Jordi and Wenliang, Li Kevin and Aitchison, Matthew and Orseau, Laurent and Hutter, Marcus and Veness, Joel , month = oct, year =. Language
-
[53]
Proceedings of the AAAI Conference on Artificial Intelligence , author =. 2024 , note =. doi:10.1609/aaai.v38i17.29936 , abstract =
-
[54]
Mirchandani, Suvir and Xia, Fei and Florence, Pete and Ichter, Brian and Driess, Danny and Arenas, Montserrat Gonzalez and Rao, Kanishka and Sadigh, Dorsa and Zeng, Andy , month = aug, year =. Large
-
[55]
Salewski, Leonard and Alaniz, Stephan and Rio-Torto, Isabel and Schulz, Eric and Akata, Zeynep , month = nov, year =. In-
-
[56]
Can language models learn analogical reasoning?
Petersen, Molly and van der Plas, Lonneke , editor =. Can language models learn analogical reasoning?. Proceedings of the 2023. 2023 , keywords =. doi:10.18653/v1/2023.emnlp-main.1022 , abstract =
-
[57]
Qin, Chengwei and Xia, Wenhan and Wang, Tan and Jiao, Fangkai and Hu, Yuchen and Ding, Bosheng and Chen, Ruirui and Joty, Shafiq , month = jun, year =. Relevant or. doi:10.48550/arXiv.2404.12728 , abstract =
-
[58]
and Lu, Hongjing , month = sep, year =
Webb, Taylor and Holyoak, Keith J. and Lu, Hongjing , month = sep, year =. Emergent analogical reasoning in large language models , volume =. Nature Human Behaviour , publisher =. doi:10.1038/s41562-023-01659-w , abstract =
-
[59]
Untrained neural networks can demonstrate memorization-independent abstract reasoning , volume =
Barak, Tomer and Loewenstein, Yonatan , month = nov, year =. Untrained neural networks can demonstrate memorization-independent abstract reasoning , volume =. Scientific Reports , publisher =. doi:10.1038/s41598-024-78530-z , abstract =
-
[60]
Mahowald, Kyle and Ivanova, Anna A. and Blank, Idan A. and Kanwisher, Nancy and Tenenbaum, Joshua B. and Fedorenko, Evelina , month = mar, year =. Dissociating language and thought in large language models , url =. doi:10.48550/arXiv.2301.06627 , abstract =
-
[61]
and Wong, Catherine and Feng, Jiahai and Wei, Megan and Tenenbaum, Joshua B
Collins, Katherine M. and Wong, Catherine and Feng, Jiahai and Wei, Megan and Tenenbaum, Joshua B. , month = may, year =. Structured, flexible, and robust: benchmarking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks , shorttitle =. doi:10.48550/arXiv.2205.05718 , abstract =
-
[62]
: Larger and more instructable language models become less reliable
Zhou, Lexin and Schellaert, Wout and Martínez-Plumed, Fernando and Moros-Daval, Yael and Ferri, Cèsar and Hernández-Orallo, José , month = oct, year =. Larger and more instructable language models become less reliable , volume =. Nature , publisher =. doi:10.1038/s41586-024-07930-y , abstract =
-
[63]
Messeri, Lisa and Crockett, M. J. , month = mar, year =. Artificial intelligence and illusions of understanding in scientific research , volume =. Nature , publisher =. doi:10.1038/s41586-024-07146-0 , abstract =
-
[64]
Implicit meta-learning may lead language models to trust more reliable sources , url =
Krasheninnikov, Dmitrii and Krasheninnikov, Egor and Mlodozeniec, Bruno and Maharaj, Tegan and Krueger, David , month = jul, year =. Implicit meta-learning may lead language models to trust more reliable sources , url =. doi:10.48550/arXiv.2310.15047 , abstract =
-
[65]
Berglund, Lukas and Stickland, Asa Cooper and Balesni, Mikita and Kaufmann, Max and Tong, Meg and Korbak, Tomasz and Kokotajlo, Daniel and Evans, Owain , month = sep, year =. Taken out of context:. doi:10.48550/arXiv.2309.00667 , abstract =
-
[66]
Evaluating the
Liu, Hanmeng and Ning, Ruoxi and Teng, Zhiyang and Liu, Jian and Zhou, Qiji and Zhang, Yue , month = may, year =. Evaluating the
-
[67]
Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel M. and Wu, Jeffrey and W...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2005.14165 2005
-
[68]
Abductive
Bhagavatula, Chandra and Bras, Ronan Le and Malaviya, Chaitanya and Sakaguchi, Keisuke and Holtzman, Ari and Rashkin, Hannah and Downey, Doug and Yih, Wen-tau and Choi, Yejin , month = sep, year =. Abductive
-
[69]
Computational Linguistics , author =
Influences and. Computational Linguistics , author =. 2013 , keywords =. doi:10.1162/COLI_a_00171 , abstract =
-
[70]
and Xu, Yan and Fung, Pascale , month = nov, year =
Bang, Yejin and Cahyawijaya, Samuel and Lee, Nayeon and Dai, Wenliang and Su, Dan and Wilie, Bryan and Lovenia, Holy and Ji, Ziwei and Yu, Tiezheng and Chung, Willy and Do, Quyet V. and Xu, Yan and Fung, Pascale , month = nov, year =. A
-
[71]
Wu, Zhaofeng and Qiu, Linlu and Ross, Alexis and Akyürek, Ekin and Chen, Boyuan and Wang, Bailin and Kim, Najoung and Andreas, Jacob and Kim, Yoon , month = mar, year =. Reasoning or. doi:10.48550/arXiv.2307.02477 , abstract =
-
[72]
Sun, Wangtao and Xu, Haotian and Yu, Xuanqing and Chen, Pei and He, Shizhu and Zhao, Jun and Liu, Kang , editor =. Proceedings of the 62nd. 2024 , keywords =. doi:10.18653/v1/2024.acl-long.150 , abstract =
-
[73]
Gabriel, Iason and Manzini, Arianna and Keeling, Geoff and Hendricks, Lisa Anne and Rieser, Verena and Iqbal, Hasan and Tomašev, Nenad and Ktena, Ira and Kenton, Zachary and Rodriguez, Mikel and El-Sayed, Seliem and Brown, Sasha and Akbulut, Canfer and Trask, Andrew and Hughes, Edward and Bergman, A. Stevie and Shelby, Renee and Marchal, Nahema and Griffi...
-
[74]
Anwar, Usman and Saparov, Abulhair and Rando, Javier and Paleka, Daniel and Turpin, Miles and Hase, Peter and Lubana, Ekdeep Singh and Jenner, Erik and Casper, Stephen and Sourbut, Oliver and Edelman, Benjamin L. and Zhang, Zhaowei and Günther, Mario and Korinek, Anton and Hernandez-Orallo, Jose and Hammond, Lewis and Bigelow, Eric and Pan, Alexander and ...
-
[75]
Xu, Shusheng and Fu, Wei and Gao, Jiaxuan and Ye, Wenjie and Liu, Weilin and Mei, Zhiyu and Wang, Guangju and Yu, Chao and Wu, Yi , month = apr, year =. Is. doi:10.48550/arXiv.2404.10719 , abstract =
-
[76]
and Feng, Shi , month = apr, year =
Panickssery, Arjun and Bowman, Samuel R. and Feng, Shi , month = apr, year =
-
[77]
Tajwar, Fahim and Singh, Anikait and Sharma, Archit and Rafailov, Rafael and Schneider, Jeff and Xie, Tengyang and Ermon, Stefano and Finn, Chelsea and Kumar, Aviral , month = apr, year =. Preference. doi:10.48550/arXiv.2404.14367 , abstract =
-
[78]
How to use and interpret activation patching
Heimersheim, Stefan and Nanda, Neel , month = apr, year =. How to use and interpret activation patching , url =. doi:10.48550/arXiv.2404.15255 , abstract =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.15255
-
[79]
and Hajishirzi, Hannaneh and Khashabi, Daniel , month = oct, year =
Wang, Yizhong and Mishra, Swaroop and Alipoormolabashi, Pegah and Kordi, Yeganeh and Mirzaei, Amirreza and Arunkumar, Anjana and Ashok, Arjun and Dhanasekaran, Arut Selvan and Naik, Atharva and Stap, David and Pathak, Eshaan and Karamanolakis, Giannis and Lai, Haizhi Gary and Purohit, Ishan and Mondal, Ishani and Anderson, Jacob and Kuznia, Kirby and Dosh...
-
[80]
Dong, Qingxiu and Li, Lei and Dai, Damai and Zheng, Ce and Wu, Zhiyong and Chang, Baobao and Sun, Xu and Xu, Jingjing and Li, Lei and Sui, Zhifang , month = jun, year =. A
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.