MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks
Pith reviewed 2026-05-07 05:26 UTC · model grok-4.3
The pith
MASCing applies steering masks to expert routing gates in Mixture-of-Experts models to reconfigure specific behaviors at inference time without retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MASCing trains an LSTM surrogate to capture cross-layer routing dependencies, optimizes a steering matrix that locates the expert circuits tied to a target behavior, and then applies the resulting masks to the routing gates during inference to enhance or suppress that behavior while leaving general capabilities intact.
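The paper does not spell out its mask semantics at this level of detail, but the core move is easy to sketch. Below is a minimal, hypothetical PyTorch rendering of mask application: an additive per-expert offset is applied to the router logits before top-k selection, so a large negative entry effectively removes an expert from the behavior's circuit while a positive entry promotes it. All names and shapes are illustrative assumptions, not the authors' implementation.

```python
import torch

def masked_routing(router_logits: torch.Tensor,
                   steering_mask: torch.Tensor,
                   top_k: int = 2) -> torch.Tensor:
    """Apply an additive steering mask to MoE router logits before top-k
    expert selection. router_logits: (tokens, num_experts);
    steering_mask: (num_experts,), where a large negative entry effectively
    disables an expert and a positive entry promotes it."""
    steered = router_logits + steering_mask            # override expert preference
    topk_vals, topk_idx = steered.topk(top_k, dim=-1)  # keep only top-k experts
    gates = torch.zeros_like(steered)
    gates.scatter_(-1, topk_idx, torch.softmax(topk_vals, dim=-1))
    return gates                                       # sparse gate weights

# Example: suppress expert 3 and promote expert 5 in an 8-expert layer.
logits = torch.randn(4, 8)                             # 4 tokens, 8 experts
mask = torch.zeros(8)
mask[3], mask[5] = -1e9, 2.0
print(masked_routing(logits, mask))
```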
What carries the argument
The steering matrix: optimized via the LSTM surrogate, it produces the behavior-specific masks applied to the Mixture-of-Experts routing gates.
If this is right
- The same base model can be rapidly reconfigured for different safety goals without repeated full training runs.
- Average defense success rate against multi-turn jailbreaks rises from 52.5 percent to 83.9 percent.
- Average generation success rate for adult-content requests rises from 52.6 percent to 82.0 percent.
- General language utility and inference cost remain essentially unchanged across the tested models.
Where Pith is reading between the lines
- The same steering approach could be used to boost or suppress non-safety behaviors such as domain-specific reasoning or stylistic preferences.
- Masks might be switched dynamically at runtime based on conversation context, allowing one model to serve multiple user profiles.
- The method could be combined with other inference-time interventions to achieve finer-grained control over model outputs.
Load-bearing premise
The LSTM surrogate sufficiently models the routing dependencies so that the derived masks change only the intended behavior and leave overall model performance and safety profile unchanged.
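To make the premise concrete, here is a hypothetical sketch of such a surrogate: an LSTM that consumes the per-layer routing logits as a sequence, one step per MoE layer (which is where cross-layer dependencies enter), and emits a scalar behavior score. The architecture, sizes, and sigmoid output head are assumptions for illustration, not details published at this point in the paper.

```python
import torch
import torch.nn as nn

class RoutingSurrogate(nn.Module):
    """Hypothetical LSTM surrogate: reads the per-layer routing logits as a
    sequence (one step per MoE layer, capturing cross-layer dependencies)
    and predicts a scalar behavior score, e.g. refusal probability."""
    def __init__(self, num_experts: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(num_experts, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, routing_logits: torch.Tensor) -> torch.Tensor:
        # routing_logits: (batch, num_moe_layers, num_experts)
        _, (h_n, _) = self.lstm(routing_logits)
        return torch.sigmoid(self.head(h_n[-1])).squeeze(-1)  # (batch,)

surrogate = RoutingSurrogate(num_experts=8)
scores = surrogate(torch.randn(16, 24, 8))  # 16 samples, 24 MoE layers
```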
What would settle it
Apply the learned masks to a previously unseen Mixture-of-Experts model on a third behavior task: no improvement in the target behavior, a clear drop in general accuracy, or new refusals would each falsify the claim of reliable, low-side-effect control.
Original abstract
Mixture-of-Experts (MoE) architectures in Large Language Models (LLMs) have significantly reduced inference costs through sparse activation. However, this sparse activation paradigm also introduces new safety challenges. Since only a subset of experts is engaged for each input, model behavior becomes coupled to routing decisions, yielding a difficult-to-control mechanism that can vary across safety-relevant scenarios. At the same time, adapting model behavior through full fine-tuning or retraining is costly, especially when developers need to rapidly configure the same model for different safety objectives. We present MASCing (MoE Activation Steering Configuration), the first framework that enables flexible reconfiguration of MoE behavior across diverse safety scenarios without retraining. MASCing uses an LSTM-based surrogate model to capture cross-layer routing dependencies and map routing logits to downstream behaviors. It then optimizes a steering matrix to identify behavior-relevant expert circuits and, at inference time, applies steering masks to the routing gates to override expert selection. This enables targeted enhancement or suppression of specific behaviors while preserving general language utility. To demonstrate its reconfigurability, we apply MASCing to two different safety-related objectives and observe consistent gains with negligible overhead across seven open-source MoE models. For multi-turn jailbreak defense, it improves the average defense success rate from 52.5% to 83.9%, with gains of up to 89.2%. For adult-content generation, MASCing enables models to comply with such requests that would otherwise be refused, increasing the average generation success rate from 52.6% to 82.0%, with gains of up to 93.0%. These results establish MASCing as a practical, lightweight, and flexible framework for scenario-specific safety reconfiguration in MoE models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MASCing, a framework for reconfiguring Mixture-of-Experts (MoE) LLM behavior across safety scenarios without retraining. It trains an LSTM surrogate to capture cross-layer routing dependencies, optimizes a steering matrix to identify behavior-relevant expert circuits, and applies steering masks to routing gates at inference time. Demonstrated on seven open-source MoE models, it reports average improvements in multi-turn jailbreak defense success from 52.5% to 83.9% (up to 89.2%) and in adult-content generation success from 52.6% to 82.0% (up to 93.0%), while claiming to preserve general language utility with negligible overhead.
Significance. If the surrogate accurately captures routing dynamics and the masks deliver targeted control without degrading utility, MASCing offers a practical, lightweight alternative to fine-tuning for scenario-specific safety reconfiguration in sparse MoE models. The multi-model empirical evaluation and dual-objective demonstration (defense and compliance) would make it a useful contribution to AI safety tooling for efficient, reconfigurable control.
Major comments (3)
- [LSTM surrogate model and optimization procedure] The section describing the LSTM surrogate model and steering matrix optimization provides no quantitative validation of surrogate fidelity (e.g., held-out prediction accuracy on routing logits, correlation with actual downstream behaviors, or ablation replacing the surrogate with direct model queries). This is load-bearing for the central claim, as the reported gains (52.5%→83.9%, 52.6%→82.0%) depend on the surrogate correctly identifying real expert circuits rather than proxy artifacts.
- [Experimental results and evaluation] The experimental results section reports specific percentage improvements but omits key details required to assess robustness: number of trials or test instances per scenario, statistical significance tests, precise baseline definitions (what exactly yields the 52.5% and 52.6% figures), and controls for confounds such as prompt sensitivity or model-specific routing variance. These omissions prevent verification that the gains are attributable to MASCing.
- [Utility preservation and overhead analysis] The claim that steering masks preserve general language utility and introduce negligible overhead lacks supporting metrics or ablations (e.g., perplexity on held-out text, accuracy on standard benchmarks like MMLU before/after masking, or safety side-effect checks). This directly affects the practical significance, as the weakest assumption in the approach is that targeted routing overrides do not degrade or destabilize unrelated capabilities.
Minor comments (2)
- [Abstract and introduction] The abstract and introduction could include a short table or sentence summarizing the seven models tested (sizes, architectures) to help readers gauge scope without searching the full text.
- [Methods] Notation for the steering matrix, routing logits, and mask application would benefit from an early dedicated equation or pseudocode block to improve readability for readers unfamiliar with MoE routing mechanics.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have reviewed each major comment carefully and provide point-by-point responses below. We agree that several clarifications and additional analyses will strengthen the manuscript and will incorporate them in the revised version.
Point-by-point responses
Referee: The section describing the LSTM surrogate model and steering matrix optimization provides no quantitative validation of surrogate fidelity (e.g., held-out prediction accuracy on routing logits, correlation with actual downstream behaviors, or ablation replacing the surrogate with direct model queries). This is load-bearing for the central claim, as the reported gains (52.5%→83.9%, 52.6%→82.0%) depend on the surrogate correctly identifying real expert circuits rather than proxy artifacts.
Authors: We agree that direct quantitative validation of the LSTM surrogate is important to support the central claims. The current manuscript relies on downstream performance as indirect evidence but does not report explicit fidelity metrics. In the revision we will add a new subsection with held-out prediction accuracy of the surrogate on routing logits, Pearson/Spearman correlations between surrogate predictions and observed behavior changes, and an ablation comparing surrogate-optimized masks to masks obtained via direct model queries (where feasible given compute constraints). These additions will address concerns about proxy artifacts. revision: yes
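For concreteness, the promised fidelity metrics could look like the following sketch; the function name and the 0.5 decision threshold are assumptions, not the authors' protocol.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def surrogate_fidelity(pred_scores: np.ndarray,
                       true_behavior: np.ndarray,
                       threshold: float = 0.5) -> dict:
    """Held-out fidelity of the surrogate: classification accuracy against
    observed model behavior (1.0 = behavior occurred), plus Pearson and
    Spearman correlation between predicted scores and outcomes."""
    acc = float(((pred_scores >= threshold) == (true_behavior >= 0.5)).mean())
    return {
        "holdout_accuracy": acc,
        "pearson_r": float(pearsonr(pred_scores, true_behavior)[0]),
        "spearman_rho": float(spearmanr(pred_scores, true_behavior)[0]),
    }
```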
Referee: The experimental results section reports specific percentage improvements but omits key details required to assess robustness: number of trials or test instances per scenario, statistical significance tests, precise baseline definitions (what exactly yields the 52.5% and 52.6% figures), and controls for confounds such as prompt sensitivity or model-specific routing variance. These omissions prevent verification that the gains are attributable to MASCing.
Authors: We acknowledge the need for greater experimental transparency. The 52.5% and 52.6% figures represent the unmodified baseline models on the respective multi-turn jailbreak defense and adult-content generation tasks. In the revised manuscript we will report the exact number of test instances and trials per scenario, include statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals), and describe the prompt construction process together with controls for prompt sensitivity and model-specific routing variance. This will make the attribution of gains to MASCing verifiable. revision: yes
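A paired bootstrap over the shared test instances, as the response proposes, might look like this sketch; the resampling scheme and confidence level are illustrative assumptions.

```python
import numpy as np

def bootstrap_delta_ci(baseline: np.ndarray, steered: np.ndarray,
                       n_boot: int = 10_000, alpha: float = 0.05,
                       seed: int = 0) -> tuple[float, float]:
    """Paired bootstrap CI for the gain in success rate. baseline and steered
    are 0/1 outcomes for the same test instances, so resampling indices keeps
    the pairing intact."""
    rng = np.random.default_rng(seed)
    n = len(baseline)
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample instances with replacement
        deltas[b] = steered[idx].mean() - baseline[idx].mean()
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```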
Referee: The claim that steering masks preserve general language utility and introduce negligible overhead lacks supporting metrics or ablations (e.g., perplexity on held-out text, accuracy on standard benchmarks like MMLU before/after masking, or safety side-effect checks). This directly affects the practical significance, as the weakest assumption in the approach is that targeted routing overrides do not degrade or destabilize unrelated capabilities.
Authors: We agree that explicit utility and overhead metrics are required. The manuscript currently states negligible overhead based on inference-time measurements but does not provide supporting ablations. In the revision we will add perplexity on held-out text, accuracy on a representative MMLU subset before and after masking, and evaluations on unrelated safety and capability benchmarks to check for side effects. These results will quantify that the steering masks preserve general language utility while achieving the targeted safety reconfigurations. revision: yes
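A minimal way to run the proposed before/after utility check is to compute held-out perplexity twice, once with the steering masks disabled and once enabled. The sketch below assumes a causal-LM callable that returns logits; it is not tied to any specific library API.

```python
import math
import torch

@torch.no_grad()
def perplexity(model, token_ids: torch.Tensor) -> float:
    """Perplexity of a causal LM on a held-out token sequence.
    `model` maps (1, seq) token ids to (1, seq, vocab) logits; call this once
    with steering masks disabled and once enabled, then compare the values."""
    logits = model(token_ids)                         # (1, seq, vocab)
    logp = torch.log_softmax(logits[:, :-1], dim=-1)  # predict next token
    nll = -logp.gather(-1, token_ids[:, 1:, None]).squeeze(-1)
    return math.exp(nll.mean().item())
```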
Circularity Check
No circularity: empirical gains measured on real models after surrogate-guided optimization
Full rationale
The paper describes an empirical pipeline: an LSTM surrogate is trained to approximate routing-to-behavior mappings, a steering matrix is optimized against that surrogate, and the resulting masks are applied directly to seven real MoE models whose defense and compliance rates are then measured. The reported deltas (52.5% → 83.9%, 52.6% → 82.0%) are therefore external performance numbers on the target models, not quantities that reduce by construction to the surrogate's fitted parameters or to any self-citation. No equations, uniqueness theorems, or ansatzes are shown to be self-referential; the central claim rests on observable behavior change rather than on a closed derivation loop.
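As a sketch of that surrogate-guided step (not the paper's exact objective, which is not reproduced here), the steering matrix can be optimized with gradient descent against the frozen surrogate; the MSE-to-target loss and the L1 sparsity penalty below are illustrative assumptions.

```python
import torch

def optimize_steering(surrogate, routing_logits: torch.Tensor,
                      target: float = 0.0, l1_weight: float = 1e-3,
                      steps: int = 200) -> torch.Tensor:
    """Optimize a steering matrix against a frozen surrogate. routing_logits:
    (batch, num_moe_layers, num_experts) collected from the real model; the
    L1 penalty keeps the mask sparse so only behavior-relevant experts move."""
    for p in surrogate.parameters():   # freeze the surrogate
        p.requires_grad_(False)
    steer = torch.zeros(routing_logits.shape[1:], requires_grad=True)
    opt = torch.optim.Adam([steer], lr=0.05)
    for _ in range(steps):
        pred = surrogate(routing_logits + steer)  # predicted behavior score
        loss = (pred - target).pow(2).mean() + l1_weight * steer.abs().sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return steer.detach()  # per-layer, per-expert routing offsets
```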
Axiom & Free-Parameter Ledger
Free parameters (2)
- steering matrix
- LSTM model weights
Axioms (1)
- Domain assumption: the routing logits in MoE layers can be mapped to downstream behaviors via a surrogate model.
Invented entities (1)
- Steering masks (no independent evidence)
Reference graph
Works this paper leans on
- [1] Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. 2024. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219 [cs.CL]. https://arxiv.org/abs/2404.14219
- [2] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. Refusal in Language Models is Mediated by a Single Direction. Advances in Neural Information Processing Systems 37 (2024), 136037–136083.
- [3] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, et al. 2023. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread (2023). https://transformer-circuits.pub/2023/monosemantic-features/index.html
- [5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems.
- [6] Jianhui Chen, Xiaozhi Wang, Zijun Yao, Yushi Bai, Lei Hou, and Juanzi Li. 2026. Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=AAXMcAyNF6
- [7] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG]. https://arxiv.org/abs/2107.03374
- [8] Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. 2025. Persona Vectors: Monitoring and Controlling Character Traits in Language Models. arXiv preprint arXiv:2507.21509 (2025).
- [9] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168 [cs.LG]. https://arxiv.org/abs/2110.14168
- [10] Hoagy Cunningham, Jerry Wei, Zihan Wang, Andrew Persic, Alwin Peng, Jordan Abderrachid, Raj Agarwal, Bobby Chen, Andy Dau, Alek Dimitriev, Logan Howard, Yijin Hua, Rob Gilson, Mu Lin, Christopher Liu, Vladimir Mikulik, Rohit Mittapalli, Clare O’Hara, Jin Pan, Nikhil Saxena, Alex Silverstein, Yue Song, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez, et al. Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=eNvsH5Ye2V
- [12] Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, et al. 2024. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. arXiv:2401.06066 [cs.CL]. https://arxiv.org/abs/2401.06066
- [13] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. 2021. A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread (2021).
- [14] Mohsen Fayyaz, Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Ryan A. Rossi, Trung Bui, Hinrich Schuetze, and Nanyun Peng. 2026. Steering MoE LLMs via Expert (De)Activation. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=v5Yl9V8rJs
- [15] William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. J. Mach. Learn. Res. 23, 1, Article 120 (Jan. 2022), 39 pages.
- [16] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI]. https://arxiv.org/abs/2407.21783
- [17] Guardian. 2025. OpenAI will allow verified adults to use ChatGPT to generate erotic content. https://www.theguardian.com/technology/2025/oct/14/openai-chatgpt-adult-erotic-content
- [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In 2015 IEEE International Conference on Computer Vision (ICCV). 1026–1034. doi:10.1109/ICCV.2015.123
- [19] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding. In International Conference on Learning Representations. https://openreview.net/forum?id=d7KBjmI3GmQ
- [20] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. 2022. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 1, 2 (2022), 3.
- [21] Tim Tian Hua, Andrew Qin, Samuel Marks, and Neel Nanda. 2026. Steering Evaluation-Aware Language Models To Act Like They Are Deployed. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=1TdRdf0fkw
- [22] Hunyuan Team, Tencent. 2025. Hunyuan-A13B Technical Report. https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/report/Hunyuan_A13B_Technical_Report.pdf
- [23] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of Experts. arXiv:2401.04088 [cs.LG]. https://arxiv.org/abs/2401.04088
- [24] Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG]. https://arxiv.org/abs/1412.6980
- [25] Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large Language Models are Zero-Shot Reasoners. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 22199–22213. https://proceedings.neurips.cc/paper_files/p...
- [26]
- [27] ZhengLin Lai, Mengyao Liao, Bingzhe Wu, Dong Xu, Zebin Zhao, Zhihang Yuan, Chao Fan, and Jianqiang Li. 2025. SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=VwsXmcMyg5
- [28] Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. 2025. Programming Refusal with Conditional Activation Steering. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=Oi47wc10sm
- [29]
- [30] Shen Li, Liuyi Yao, Lan Zhang, and Yaliang Li. 2025. Safety Layers in Aligned Large Language Models: The Key to LLM Security. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=kUH1yPMAn7
- [31]
- [32] Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, and Jack Lindsey.
- [33]
- [34]
- [35] OpenAI. 2025. Introducing GPT-OSS. https://openai.com/index/introducing-gpt-oss/
- [36] OpenErotica. 2024. erotica-analysis: A Dataset for Erotica Literature Analysis. https://huggingface.co/datasets/openerotica/erotica-analysis. Accessed: April 5, 2026.
- [37] Kiho Park, Yo Joong Choe, and Victor Veitch. 2024. The Linear Representation Hypothesis and the Geometry of Large Language Models. In Forty-first International Conference on Machine Learning. https://openreview.net/forum?id=UGpGkLzwpP
- [38] David Patterson, Joseph Gonzalez, Urs Hölzle, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David R. So, Maud Texier, and Jeff Dean. 2022. The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink. Computer 55, 7 (2022), 18–28. doi:10.1109/MC.2022.3148714
- [39] Qwen Team. 2024. Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters. https://qwenlm.github.io/blog/qwen-moe/
- [40] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv:1701.06538 [cs.LG]. https://arxiv.org/abs/1701.06538
- [41] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024. "Do Anything Now": Characterizing and Evaluating In-the-Wild Jailbreak Prompts on Large Language Models. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security. 1671–1685.
- [42]
- [43] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How Does LLM Safety Training Fail? Advances in Neural Information Processing Systems 36 (2023), 80079–80110.
- [44] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent Abilities of Large Language Models. Transactions on Machine Learning Research (2022). https://openreview.net/fo...
- [45] Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, and Johannes Gasteiger. 2025. The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=80IwJqlXs8
- [46]
- [47] Lichao Wu, Sasha Behrouzi, Mohamadreza Rostami, Maximilian Thang, Stjepan Picek, and Ahmad-Reza Sadeghi. 2026. NeuroStrike: Neuron-Level Attacks on Aligned LLMs. Network and Distributed System Security (NDSS) Symposium (2026).
- [48] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL]. https://arxiv.org/abs/2505.09388
- [49]
- [50] Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Dong Wang, Ilia Kulikov, Kyunghyun Cho, Yuandong Tian, Jason E Weston, and Xian Li. 2025. NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions. arXiv:2502.13124 [cs.CL]. https://arxiv.org/abs/2502.13124
- [51]
- [52] Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi. 2026. LLMs Encode Harmfulness and Refusal Separately. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=zLkpt30ngy
- [53] Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, and Yongbin Li. 2025. On the Role of Attention Heads in Large Language Model Safety. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=h0Ak8A5yqw
- [54] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. 2023. Representation Engineering: A Top-Down Approach to AI Transparency. arXiv preprint arXiv:2310.01405 (2023).
- [56] Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043 [cs.CL]