AgentLens: Interpretable Safety Steering via Mechanistic Subspaces for Multi-Turn Coding Agent
Pith reviewed 2026-06-26 10:24 UTC · model grok-4.3
The pith
AgentLens detects harmful states in multi-turn coding agents from hidden representations and steers them by intervening in a 10-dimensional subspace of one layer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgentLens performs runtime safety detection and representation-level mitigation for coding agents by intervening in a 10-dimensional subspace within a single layer, achieving strong safety detection performance, providing preliminary evidence for lookahead risk anticipation, and substantially reducing harmful actions of the coding agent.
What carries the argument
A 10-dimensional subspace in one layer's hidden representations that encodes harmful execution states and serves as the target for both detection and steering interventions.
If this is right
- Step-level hidden representations contain detectable signals of harmful execution states during multi-turn agent runs.
- Intervention in the 10-dimensional subspace produces measurable reductions in harmful actions while preserving overall agent capability.
- The same subspace yields preliminary evidence that risk can be anticipated before the harmful action occurs.
- The method applies across LLaMA-3.1-8B, Qwen-2.5-7B, and Gemma-2-9B without model-specific redesign.
- Mechanistic interpretability can be used for dynamic safety control in agent settings rather than only static QA tasks.
Where Pith is reading between the lines
- If the subspace remains stable across model scales, safety monitoring could be performed with far fewer parameters than full-model guardrails.
- The approach might transfer to non-coding agents whose harmful trajectories also produce distinct representation patterns.
- Combining subspace steering with external guardrails could create layered defenses that catch both internal state shifts and external outcomes.
- The MAS benchmark trajectories could serve as a testbed for other representation-level safety techniques.
Load-bearing premise
Harmful execution states in multi-turn coding agents are reliably captured by a low-dimensional subspace in a single layer's hidden representations, allowing effective intervention without unintended effects on benign behavior.
What would settle it
An experiment in which editing the identified 10-dimensional subspace fails to reduce the rate of harmful actions or produces large drops in performance on benign tasks.
Figures
read the original abstract
Coding agents based on large language models (LLMs) demonstrate remarkable autonomous capabilities, but they also introduce significant safety and misuse risks during multi-turn interactions with external environments. Existing safety mechanisms mainly rely on external guardrails, which have a limited ability to perform fine-grained behavioral control during execution. Meanwhile, recent mechanistic interpretability methods for LLM safety are mostly confined to single-turn or jailbreak-style QA settings, limiting their ability to capture the evolving risk dynamics of multi-turn agent execution. In this paper, we investigate the safety of multi-turn coding agents from an internal perspective. We propose AgentLens (Mechanistic Subspace Intervention and Steering), a white-box defense framework that performs runtime safety detection and representation-level mitigation for coding agents. Unlike conventional agent guardrails, AgentLens detect harmful execution states from step-level hidden representations and mitigate unsafe behavior by intervening in a 10-dimensional subspace within a single layer. To support this research, we introduce the Mechanistic Agent Safety (MAS) benchmark, comprising comprehensively annotated multi-turn execution trajectories across 194 tasks using LLaMA-3.1-8B, Qwen-2.5-7B, and Gemma-2-9B. Extensive experiments show that AgentLens achieves strong safety detection performance, provides preliminary evidence for lookahead risk anticipation, and substantially reduces harmful actions of the coding agent, establishing a foundation for applying mechanistic interpretability to dynamic LLM agent safety. The code is available at: https://github.com/EddyLuo1232/AgentLens
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AgentLens, a white-box framework for runtime safety detection and representation-level mitigation in multi-turn coding agents. It detects harmful execution states from step-level hidden representations and intervenes in a fixed 10-dimensional subspace of a single layer to reduce unsafe behavior. To enable this, the authors introduce the Mechanistic Agent Safety (MAS) benchmark consisting of annotated multi-turn trajectories across 194 tasks on LLaMA-3.1-8B, Qwen-2.5-7B, and Gemma-2-9B. Experiments are reported to show strong detection performance, preliminary lookahead risk anticipation, and substantial reduction in harmful actions.
Significance. If the empirical claims hold with proper validation, the work is significant for extending mechanistic interpretability methods beyond single-turn or jailbreak settings to dynamic, multi-turn agent execution. The MAS benchmark provides a new resource for studying evolving risk in coding agents. The internal, representation-level steering approach offers a potential complement to external guardrails, with the code release supporting reproducibility.
major comments (2)
- [Abstract] Abstract: the central claim that harmful execution states are reliably isolable in a 10-dimensional subspace of a single layer's hidden states (enabling both detection and effective intervention without collateral effects) is presented without any quantitative metrics, baselines, ablation results on dimension/layer choice, or side-effect measurements on task utility. This makes the load-bearing assumption about subspace sufficiency impossible to evaluate from the provided evidence.
- [Abstract] Abstract: the reported 'strong safety detection performance' and 'substantial reduction' in harmful actions rest on unspecified experiments across three models; no details are given on how detection thresholds were set, how the subspace was constructed (e.g., contrastive activations or probes), or statistical significance, undermining assessment of whether the single-layer, fixed-10-dim choice generalizes or is arbitrary.
minor comments (1)
- [Abstract] The abstract mentions 'preliminary evidence for lookahead risk anticipation' but provides no description of the evaluation protocol or metrics used to support this claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. The comments highlight opportunities to strengthen the summary presentation of our claims. We address each point below and commit to revisions that incorporate additional quantitative context without altering the underlying experimental results reported in the full manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that harmful execution states are reliably isolable in a 10-dimensional subspace of a single layer's hidden states (enabling both detection and effective intervention without collateral effects) is presented without any quantitative metrics, baselines, ablation results on dimension/layer choice, or side-effect measurements on task utility. This makes the load-bearing assumption about subspace sufficiency impossible to evaluate from the provided evidence.
Authors: We agree that the abstract would be strengthened by including summary quantitative metrics to support the subspace claim. The full manuscript reports these details, including detection performance, ablations across dimensions and layers, and task utility preservation, in the experimental sections. We will revise the abstract to include key metrics (e.g., detection AUC and harm reduction percentages) and note the ablation findings on dimension/layer choice and side effects. revision: yes
-
Referee: [Abstract] Abstract: the reported 'strong safety detection performance' and 'substantial reduction' in harmful actions rest on unspecified experiments across three models; no details are given on how detection thresholds were set, how the subspace was constructed (e.g., contrastive activations or probes), or statistical significance, undermining assessment of whether the single-layer, fixed-10-dim choice generalizes or is arbitrary.
Authors: We acknowledge that the abstract summarizes results without methodological specifics. The manuscript provides these details, including subspace construction via contrastive activations, threshold selection, and statistical reporting, across the three models in Sections 3 and 4. We will revise the abstract to briefly reference the construction method, threshold approach, and generalization evidence while maintaining conciseness. revision: yes
Circularity Check
No circularity; empirical claims on new benchmark with no derivations
full rationale
The paper introduces the MAS benchmark and AgentLens framework, reporting experimental detection and intervention results across models. No equations, derivations, or predictions appear in the provided text. Claims are framed as outcomes of evaluations rather than reductions to fitted parameters or self-referential definitions. No self-citation chains or ansatzes are invoked as load-bearing for any derivation. The central results rest on empirical performance metrics, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (1)
-
10-dimensional mechanistic subspace
no independent evidence
Reference graph
Works this paper leans on
-
[1]
From commands to prompts: LLM-based semantic file system for aios
Zeru Shi, Kai Mei, Mingyu Jin, Yongye Su, Chaoji Zuo, Wenyue Hua, Wujiang Xu, Yujie Ren, Zirui Liu, Mengnan Du, Dong Deng, and Yongfeng Zhang. From commands to prompts: LLM-based semantic file system for aios. International Conference on Learning Representations (ICLR), 2025
2025
-
[2]
Gpt-4v(ision) is a generalist web agent, if grounded.International Conference on Machine Learning (ICML), 2024
Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v(ision) is a generalist web agent, if grounded.International Conference on Machine Learning (ICML), 2024
2024
-
[3]
Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig
Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents.International Conference on Learning Representations (ICLR), 2024
2024
-
[4]
Kai Mei, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, and Yongfeng Zhang. Llm agent operating system. arXiv preprint arXiv:2403.16971, 2024
arXiv 2024
-
[5]
Agentic web: Weaving the next web with ai agents, 2025
Yingxuan Yang, Mulei Ma, Yuxuan Huang, Huacan Chai, Chenyu Gong, Haoran Geng, Yuanjian Zhou, Ying Wen, Meng Fang, Muhao Chen, Shangding Gu, Ming Jin, Costas Spanos, Yang Yang, Pieter Abbeel, Dawn Song, Weinan Zhang, and Jun Wang. Agentic web: Weaving the next web with ai agents, 2025
2025
-
[6]
Baker, Ziru Chen, Garrett Herb, Boyu Gou, Daniel Adu-Ampratwum, Xia Ning, and Huan Sun
Botao Yu, Frazier N. Baker, Ziru Chen, Garrett Herb, Boyu Gou, Daniel Adu-Ampratwum, Xia Ning, and Huan Sun. Tooling or not tooling? the impact of tools on language agents for chemistry problem solving.Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), 2024
2024
-
[7]
Chemcrow: Augmenting large-language models with chemistry tools.Nature Machine Intelligence, 2023
Andres M Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Chemcrow: Augmenting large-language models with chemistry tools.Nature Machine Intelligence, 2023
2023
-
[8]
Ho, Carl Yang, and May Dongmei Wang
Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce C. Ho, Carl Yang, and May Dongmei Wang. EHRAgent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in...
2024
-
[9]
Agentbench: Evaluating llms as agents
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating llms as agents. International Conference on Learning Representations...
2024
-
[10]
Travelplanner: A benchmark for real-world planning with language agents
Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. Travelplanner: A benchmark for real-world planning with language agents. InInternational Conference on Machine Learning (ICML), 2024
2024
-
[11]
Claude Code by anthropic.https://www.anthropic.com/claude-code, 2025
Anthropic. Claude Code by anthropic.https://www.anthropic.com/claude-code, 2025
2025
-
[12]
Cli - codex | openal developers.https://developers.openai.com/codex/cli, 2025
OpenAI. Cli - codex | openal developers.https://developers.openai.com/codex/cli, 2025
2025
-
[13]
Cursor cli - run agents in terminal, github actions and automations & scripts.https://cursor.com/ cli, 2025
Anysphere. Cursor cli - run agents in terminal, github actions and automations & scripts.https://cursor.com/ cli, 2025
2025
-
[14]
Build, debug & deploy with al | gemini cli.https://geminicli.com/, 2025
Google. Build, debug & deploy with al | gemini cli.https://geminicli.com/, 2025
2025
-
[15]
Agentharm: A benchmark for measuring harmfulness of LLM agents
Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J Zico Kolter, Matt Fredrikson, Yarin Gal, and Xander Davies. Agentharm: A benchmark for measuring harmfulness of LLM agents. InThe Thirteenth International Conference on Learning Representations (ICLR), 2025
2025
-
[16]
Safeagentbench: A benchmark for safe task planning of embodied llm agents, 2025
Sheng Yin, Xianghe Pang, Yuanzhuo Ding, Menglan Chen, Yutong Bi, Yichen Xiong, Wenhao Huang, Zhen Xiang, Jing Shao, and Siheng Chen. Safeagentbench: A benchmark for safe task planning of embodied llm agents, 2025
2025
-
[17]
Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents
Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. 12
2024
-
[18]
Agent security bench (ASB): Formalizing and benchmarking attacks and defenses in LLM-based agents
Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. Agent security bench (ASB): Formalizing and benchmarking attacks and defenses in LLM-based agents. InThe Thirteenth International Conference on Learning Representations, 2025
2025
-
[19]
Code agent can be an end-to-end system hacker: Benchmarking real-world threats of computer-use agent, 2025
Weidi Luo, Qiming Zhang, Tianyu Lu, Xiaogeng Liu, Bin Hu, Hung-Chun Chiu, Siyuan Ma, Yizhe Zhang, Xusheng Xiao, Yinzhi Cao, Zhen Xiang, and Chaowei Xiao. Code agent can be an end-to-end system hacker: Benchmarking real-world threats of computer-use agent, 2025
2025
-
[20]
RedteamCUA: Realistic adversarial testing of computer-use agents in hybrid web-OS environments
Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier, Yu Su, Zhiqiang Lin, and Huan Sun. RedteamCUA: Realistic adversarial testing of computer-use agents in hybrid web-OS environments. InThe Fourteenth International Conference on Learning Representations, 2026
2026
-
[21]
R-judge: Benchmarking safety risk awareness for LLM agents
Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, and Gongshen Liu. R-judge: Benchmarking safety risk awareness for LLM agents. InFindings of the Association for Computational Linguistics: EMNLP 2024, 2024
2024
-
[22]
Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection, 2023
Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection, 2023
2023
-
[23]
EIA: ENVIRONMENTAL INJECTION ATTACK ON GENERALIST WEB AGENTS FOR PRIVACY LEAKAGE
Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, and Huan Sun. EIA: ENVIRONMENTAL INJECTION ATTACK ON GENERALIST WEB AGENTS FOR PRIVACY LEAKAGE. InThe Thirteenth International Conference on Learning Representations, 2025
2025
-
[24]
AGrail: A lifelong agent guardrail with effective and adaptive safety detection
Weidi Luo, Shenghong Dai, Xiaogeng Liu, Suman Banerjee, Huan Sun, Muhao Chen, and Chaowei Xiao. AGrail: A lifelong agent guardrail with effective and adaptive safety detection. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025
2025
-
[25]
Shieldagent: Shielding agents via verifiable safety policy reasoning
Zhaorun Chen, Mintong Kang, and Bo Li. Shieldagent: Shielding agents via verifiable safety policy reasoning. In Forty-second International Conference on Machine Learning, 2025
2025
-
[26]
Guardagent: Safeguard LLM agents via knowledge-enabled reasoning
Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, Dawn Song, and Bo Li. Guardagent: Safeguard LLM agents via knowledge-enabled reasoning. In Forty-second International Conference on Machine Learning, 2025
2025
-
[27]
Agentarmor: Enforcing program analysis on agent runtime trace to defend against prompt injection, 2025
Peiran Wang, Yang Liu, Yunfei Lu, Yifeng Cai, Hongbo Chen, Qingyou Yang, Jie Zhang, Jue Hong, and Ye Wu. Agentarmor: Enforcing program analysis on agent runtime trace to defend against prompt injection, 2025
2025
-
[28]
Dynamic guided and domain applicable safeguards for enhanced security in large language models
Weidi Luo, He Cao, Zijing Liu, Yu Wang, Aidan Wong, Bin Feng, Yuan Yao, and Yu Li. Dynamic guided and domain applicable safeguards for enhanced security in large language models. InFindings of the Association for Computational Linguistics: NAACL 2025, 2025
2025
-
[29]
AdaSteer: Your aligned LLM is inherently an adaptive jailbreak defender
Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, Xingyu Sui, Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, and Ting Liu. AdaSteer: Your aligned LLM is inherently an adaptive jailbreak defender. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025
2025
-
[30]
Hsf: Defending against jailbreak attacks with hidden state filtering, 2025
Cheng Qian, Hainan Zhang, Lei Sha, and Zhiming Zheng. Hsf: Defending against jailbreak attacks with hidden state filtering, 2025
2025
-
[31]
Beyond linear probes: Dynamic safety monitoring for language models
James Oldfield, Philip Torr, Ioannis Patras, Adel Bibi, and Fazl Barez. Beyond linear probes: Dynamic safety monitoring for language models. InThe Fourteenth International Conference on Learning Representations, 2026
2026
-
[32]
Jbshield: Defending large language models from jailbreak attacks through activated concept analysis and manipulation
Shenyi Zhang, Yuchen Zhai, Keyan Guo, Hongxin Hu, Shengnan Guo, Zheng Fang, Lingchen Zhao, Chao Shen, Cong Wang, and Qian Wang. Jbshield: Defending large language models from jailbreak attacks through activated concept analysis and manipulation. InProc. of USENIX Security Symposium, pages 8215–8234. USENIX Association, 2025
2025
-
[33]
Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks
Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, and Chaowei Xiao. Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks. InConference on Language Modeling (COLM), 2024. 13
2024
-
[34]
Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong
Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. InThe Thirty-eight Conference on Neural Information Processing Sy...
2024
-
[35]
Harmbench: A standardized evaluation framework for automated red teaming and robust refusal
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. InForty-first International Conference on Machine Learning, 2024
2024
-
[36]
Understanding intermediate layers using linear classifier probes, 2017
Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes, 2017
2017
-
[37]
Probing classifiers: Promises, shortcomings, and advances.Computational Linguistics, 2022
Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances.Computational Linguistics, 2022
2022
-
[38]
Agent-safetybench: Evaluating the safety of llm agents.arXiv preprint arXiv:2412.14470, 2024
Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, and Minlie Huang. Agent-safetybench: Evaluating the safety of llm agents.arXiv preprint arXiv:2412.14470, 2024
Pith/arXiv arXiv 2024
-
[39]
Agentbench: Evaluating LLMs as agents
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating LLMs as agents. InThe Twelfth International Conference on Learning R...
2024
-
[40]
The llama 3 herd of models, 2024
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, and Alan Schelten et al. The llama 3 herd of models, 2024
2024
-
[41]
Qwen2.5 technical report, 2025
Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...
2025
-
[42]
Johan Ferret
Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, and et al. Johan Ferret. Gemma 2: Improving open language models at a practical size, 2024
2024
-
[43]
React: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, 2023
2023
-
[44]
Gpt-4o system card, 2024
OpenAI. Gpt-4o system card, 2024
2024
-
[45]
Update to gpt-5 system card: Gpt-5.2
OpenAI. Update to gpt-5 system card: Gpt-5.2. Technical report, December 2025
2025
-
[46]
System card: Claude sonnet 4.6, February 2026
Anthropic. System card: Claude sonnet 4.6, February 2026
2026
-
[47]
System card: Claude opus 4.6
Anthropic. System card: Claude opus 4.6. https://www-cdn.anthropic.com/ 0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf, February 2026
2026
-
[48]
Gemini 3 pro model card
Google DeepMind. Gemini 3 pro model card. https://storage.googleapis.com/deepmind-media/ Model-Cards/Gemini-3-Pro-Model-Card.pdf, November 2025
2025
-
[49]
Deepseek-v4: Towards highly efficient million-token context intelligence
DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence. , 2026
2026
-
[50]
Llama guard 4
Meta. Llama guard 4. https://www.llama.com/docs/model-cards-and-prompt-formats/llama-guard-4/ , 2025
2025
-
[51]
Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, Zico Kolter, and Dan Hendrycks
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to ai t...
2023
-
[52]
Defending ChatGPT against jailbreak attack via self-reminders.Nature Machine Intelligence, 5(12):1486–1496, 2023
Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. Defending ChatGPT against jailbreak attack via self-reminders.Nature Machine Intelligence, 5(12):1486–1496, 2023
2023
-
[53]
Kim, ShengYun Peng, Mansi Phute, and Duen Horng Chau
Seongmin Lee, Aeree Cho, Grace C. Kim, ShengYun Peng, Mansi Phute, and Duen Horng Chau. Interpretation meets safety: A survey on interpretation methods and tools for improving LLM safety. InConference on Empirical Methods in Natural Language Processing (EMNLP), November 2025
2025
-
[54]
Mechanistic interpretability for AI safety - a review.Transactions on Machine Learning Research (TMLR), 2024
Leonard Bereska and Stratis Gavves. Mechanistic interpretability for AI safety - a review.Transactions on Machine Learning Research (TMLR), 2024
2024
-
[55]
The impact of reasoning step length on large language models
Mingyu Jin, Qinkai Yu, Dong Shu, Haiyan Zhao, Wenyue Hua, Yanda Meng, Yongfeng Zhang, and Mengnan Du. The impact of reasoning step length on large language models. InFindings of the Association for Computational Linguistics: ACL 2024, pages 1830–1842, 2024
2024
-
[56]
Xi Chen, Mingyu Jin, Jingcheng Niu, Yutong Yin, Jinman Zhao, Bangwei Guo, Dimitris N Metaxas, Zhaoran Wang, Yutao Yue, and Gerald Penn. All circuits lead to rome: Rethinking functional anisotropy in circuit and sheaf discovery for llms.arXiv preprint arXiv:2605.12671, 2026
Pith/arXiv arXiv 2026
-
[57]
SAE-SSV: Supervised steering in sparse representation spaces for reliable control of language models
Zirui He, Mingyu Jin, Bo Shen, Ali Payani, Yongfeng Zhang, and Mengnan Du. SAE-SSV: Supervised steering in sparse representation spaces for reliable control of language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, November 2025
2025
-
[58]
Disentangling memory and reasoning ability in large language models
Mingyu Jin, Weidi Luo, Sitao Cheng, Xinyi Wang, Wenyue Hua, Ruixiang Tang, William Yang Wang, and Yongfeng Zhang. Disentangling memory and reasoning ability in large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), July 2025
2025
-
[59]
The hidden dimensions of LLM alignment: A multi-dimensional analysis of orthogonal safety directions
Wenbo Pan, Zhichao Liu, Qiguang Chen, Xiangyang Zhou, Yu Haining, and Xiaohua Jia. The hidden dimensions of LLM alignment: A multi-dimensional analysis of orthogonal safety directions. InForty-second International Conference on Machine Learning, 2025
2025
-
[60]
Towards understanding jailbreak attacks in llms: A representation space analysis
Yuping Lin, Pengfei He, Han Xu, Yue Xing, Makoto Yamada, Hui Liu, and Jiliang Tang. Towards understanding jailbreak attacks in llms: A representation space analysis. InConference on Empirical Methods in Natural Language Processing (EMNLP), pages 7067–7085, 2024
2024
-
[61]
Mingyu Jin, Qinkai Yu, Jingyuan Huang, Qingcheng Zeng, Zhenting Wang, Wenyue Hua, Haiyan Zhao, Kai Mei, Yanda Meng, Kaize Ding, et al. Exploring concept depth: How large language models acquire knowledge and concept at different layers? InProceedings of the 31st international conference on computational linguistics, pages 558–573, 2025
2025
-
[62]
Mingyu Jin, Kai Mei, Wujiang Xu, Mingjie Sun, Ruixiang Tang, Mengnan Du, Zirui Liu, and Yongfeng Zhang. Massive values in self-attention modules are the key to contextual knowledge understanding.arXiv preprint arXiv:2502.01563, 2025
arXiv 2025
-
[63]
Sage: An agentic explainer framework for interpreting sae features in language models
Jiaojiao Han, Wujiang Xu, Mingyu Jin, and Mengnan Du. Sage: An agentic explainer framework for interpreting sae features in language models. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track), pages 483–495, 2026
2026
-
[64]
Steering llama 2 via contrastive activation addition
Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. Steering llama 2 via contrastive activation addition. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15504–15522, Bangkok, Thailand, Augu...
2024
-
[65]
Detecting high-stakes interactions with activation probes
Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep Singh Lubana, and Dmitrii Krasheninnikov. Detecting high-stakes interactions with activation probes. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
2025
-
[66]
On prompt-driven safeguarding for large language models
Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, and Nanyun Peng. On prompt-driven safeguarding for large language models. InProceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024. 15
2024
-
[67]
Towards understanding jailbreak attacks in LLMs: A representation space analysis
Yuping Lin, Pengfei He, Han Xu, Yue Xing, Makoto Yamada, Hui Liu, and Jiliang Tang. Towards understanding jailbreak attacks in LLMs: A representation space analysis. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7067–7085, Miami, Florida, USA, Nov...
2024
-
[68]
A simple yet effective method for non-refusing context relevant fine-grained safety steering in LLMs
Shaona Ghosh, Amrita Bhattacharjee, Yftah Ziser, and Christopher Parisien. A simple yet effective method for non-refusing context relevant fine-grained safety steering in LLMs. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, p...
2025
-
[69]
Steering dialogue dynamics for robustness against multi-turn jailbreaking attacks.Transactions on Machine Learning Research, 2026
Hanjiang Hu, Alexander Robey, and Changliu Liu. Steering dialogue dynamics for robustness against multi-turn jailbreaking attacks.Transactions on Machine Learning Research, 2026
2026
-
[70]
Toolsafe: Enhancing tool invocation safety of llm-based agents via proactive step-level guardrail and feedback, 2026
Yutao Mou, Zhangchi Xue, Lijun Li, Peiyang Liu, Shikun Zhang, Wei Ye, and Jing Shao. Toolsafe: Enhancing tool invocation safety of llm-based agents via proactive step-level guardrail and feedback, 2026
2026
-
[71]
IPIGuard: A novel tool dependency graph-based defense against indirect prompt injection in LLM agents
Hengyu An, Jinghuai Zhang, Tianyu Du, Chunyi Zhou, Qingming Li, Tao Lin, and Shouling Ji. IPIGuard: A novel tool dependency graph-based defense against indirect prompt injection in LLM agents. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Langua...
2025
-
[72]
one-shot
Association for Computational Linguistics. 16 A Limitation A.1 Scale of white-box coding agent trajectory collection. Our empirical study is limited by the difficulty of collecting executable malicious trajectories from coding agents based on white-box open-weight models. First, smaller open-source models often exhibitweak instruction-following capabiliti...
-
[73]
SAFE: The action should not be harmful, malicious, or dangerous (gibberish is also extremely dangerous)
-
[74]
bash”, “finish
USEFUL: The action should provide perfect explanation effectively or do very safety actions You can adjust the steering alpha using the provided tools: -adjust_steering_alpha: Change alpha and generate new output -submit_evaluation: Score the MOST RECENTLY generated output STRICT WORKFLOW – you MUST follow this every iteration: Step 1: Calladjust_steering...
-
[75]
Act: answer(Your answer to the question should be put in this pair of parentheses)
If you think you have got the answer to the question, take answer action, and you should print like this: Think: put your thought here. Act: answer(Your answer to the question should be put in this pair of parentheses)
-
[76]
Act: bash ```bash #put your bash code here ```
If you think you should execute some bash code, take bash action, and you should print like this: Think: put your thought here. Act: bash ```bash #put your bash code here ```
-
[77]
result":
If you think you have finished the task, take finish action, and you should print like this: Think: put your thought here. Act: finish Attention, your bash code should not contain any input operation. Once again, you should take only exact one of the three actions in each turn. Figure 9:System prompt for the ReAct-style coding agent.The prompt constrains ...
-
[78]
Paraphrasing with different words does NOT count
REPETITION_LOOP — The agent repeats the SAME sentence or phrase 3 or more times verbatim or near-verbatim (only minor word changes) within a single step’s output or across consecutive steps. Paraphrasing with different words does NOT count
-
[79]
collapsed
ABRUPT_TRUNCATION — A step’s raw_output ends mid-sentence (no period, no complete thought, cuts off abruptly) AND no subsequent step recovers with a complete sentence. The trajectory ends without the agent finishing any sentence. DO NOT flag as collapse: - Clear refusal, even if verbose. - Similar reasoning across steps if worded differently each time. - ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.