pith. sign in

arxiv: 2606.22673 · v1 · pith:YY5I5FDZnew · submitted 2026-06-21 · 💻 cs.AI · cs.SE

AgentLens: Interpretable Safety Steering via Mechanistic Subspaces for Multi-Turn Coding Agent

Pith reviewed 2026-06-26 10:24 UTC · model grok-4.3

classification 💻 cs.AI cs.SE
keywords mechanistic interpretabilityLLM agent safetysubspace interventionmulti-turn coding agentswhite-box defenserepresentation steeringMAS benchmark
0
0 comments X

The pith

AgentLens detects harmful states in multi-turn coding agents from hidden representations and steers them by intervening in a 10-dimensional subspace of one layer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines safety risks that arise when LLM-based coding agents interact over multiple turns with external environments. It introduces AgentLens, a white-box method that identifies harmful execution states directly from step-level hidden representations and then mitigates them by editing a low-dimensional subspace inside a single layer. This internal intervention is positioned as a finer-grained alternative to external guardrails. On a new benchmark of annotated multi-turn trajectories, the approach yields strong detection performance and substantially fewer harmful actions across three different base models. The central effort is to extend mechanistic interpretability techniques from single-turn settings to the evolving risk patterns of agent execution.

Core claim

AgentLens performs runtime safety detection and representation-level mitigation for coding agents by intervening in a 10-dimensional subspace within a single layer, achieving strong safety detection performance, providing preliminary evidence for lookahead risk anticipation, and substantially reducing harmful actions of the coding agent.

What carries the argument

A 10-dimensional subspace in one layer's hidden representations that encodes harmful execution states and serves as the target for both detection and steering interventions.

If this is right

  • Step-level hidden representations contain detectable signals of harmful execution states during multi-turn agent runs.
  • Intervention in the 10-dimensional subspace produces measurable reductions in harmful actions while preserving overall agent capability.
  • The same subspace yields preliminary evidence that risk can be anticipated before the harmful action occurs.
  • The method applies across LLaMA-3.1-8B, Qwen-2.5-7B, and Gemma-2-9B without model-specific redesign.
  • Mechanistic interpretability can be used for dynamic safety control in agent settings rather than only static QA tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the subspace remains stable across model scales, safety monitoring could be performed with far fewer parameters than full-model guardrails.
  • The approach might transfer to non-coding agents whose harmful trajectories also produce distinct representation patterns.
  • Combining subspace steering with external guardrails could create layered defenses that catch both internal state shifts and external outcomes.
  • The MAS benchmark trajectories could serve as a testbed for other representation-level safety techniques.

Load-bearing premise

Harmful execution states in multi-turn coding agents are reliably captured by a low-dimensional subspace in a single layer's hidden representations, allowing effective intervention without unintended effects on benign behavior.

What would settle it

An experiment in which editing the identified 10-dimensional subspace fails to reduce the rate of harmful actions or produces large drops in performance on benign tasks.

Figures

Figures reproduced from arXiv: 2606.22673 by Chaowei Xiao, Jie Cai, Jingcheng Niu, Mingyu Jin, Qiming Zhang, Weidi Luo, Yihao Quan, Zhen Xiang.

Figure 1
Figure 1. Figure 1: Illustration of multi-turn coding agent workflow and potential risks. AgentLens de￾tects current-step risk from internal represen￾tations and applies steering to refuse harmful execution. To fill this gap, we propose AgentLens, a mecha￾nistic subspace intervention and steering framework for multi-turn coding agents. Rather than treating safety as a one-time task-level decision, AgentLens operates during ex… view at source ↗
Figure 2
Figure 2. Figure 2: The AgentLens framework consists of two stages: a safety detection stage that identifies harmful execution states at each step, and a mitigation stage that steers the coding agent’s internal representations to suppress harmful actions. Stage 1: Safety Detection. At interaction step t, the coding agent conditions on the accumulated context history before generating its next action. AgentLens extracts the st… view at source ↗
Figure 3
Figure 3. Figure 3: Test accuracy of Probe for three models on all layers. To examine whether safety-relevant information in multi￾turn coding agent trajectories is accessible through a sim￾ple linear readout, we train a layer-wise logistic probe f ℓ θ on the last-token hidden state h ℓ t at each interaction step and evaluate it on held-out tasks. We use this experiment as a diagnostic for linear decodability: high held-out a… view at source ↗
Figure 4
Figure 4. Figure 4: Negative Steering Case Study [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: AGrail failure mode analysis on Qwen-2.5-7B. Left: high-risk concepts frequently retained in AGrail’s retrieval memory. Right: non-exclusive error clusters on the Qwen current-step split. The two false negatives reveal the complementary limitation. Their harmful behavior is distributed across code-generation and setup steps, so the immediate action can resemble ordinary file construction while the maliciou… view at source ↗
Figure 6
Figure 6. Figure 6: Safe Steering on Qwen-2.5-7b Blocks Malicious File Injection. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: System prompt for the LLM controller used in adaptive steering. The prompt defines the controller’s safety–utility objective, tool-use protocol, alpha interpretation, and stopping condition. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Prompt for the LLM-as-a-judge current-step detector. The judge assigns a binary label based on whether the current agent output violates confidentiality, integrity, or availability. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: System prompt for the ReAct-style coding agent. The prompt constrains each turn to exactly one Think field and one Act field, with actions restricted to bash, finish, or answer. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Prompt for the LLM-as-a-judge lookahead detector. The judge predicts whether the current step directly indicates that the next agent action will violate confidentiality, integrity, or availability. Prompt of LLM-as-a-Judge for Post-Steering Safety Evaluation You are a safety evaluator for AI agent outputs. You will be given an agent’s output trajectory. Your job: decide if the agent’s response is SAFE or … view at source ↗
Figure 11
Figure 11. Figure 11: Prompt for the post-steering safety judge. The judge determines whether a steered trajectory has become safe by checking for either safety awareness or the absence of concrete harmful action. 28 [PITH_FULL_IMAGE:figures/full_fig_p028_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Prompt for the output-collapse judge. The judge measures collapse rate by flagging only repetition loops or unrecovered abrupt truncation in steered agent trajectories. Layer 0 Layer 1 Layer 2 Layer 3 Layer 4 Layer 5 Layer 6 Layer 7 Layer 8 Layer 9 Layer 10 Layer 11 Layer 12 Layer 13 Layer 14 Layer 15 Layer 16 Layer 17 Layer 18 Layer 19 Layer 20 Layer 21 Layer 22 Layer 23 Layer 24 Layer 25 Layer 26 Layer … view at source ↗
Figure 13
Figure 13. Figure 13: Visualization of Linear Assumption for LLaMA-3.1-8b. 29 [PITH_FULL_IMAGE:figures/full_fig_p029_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Visualization of Linear Assumption for Qwen-2.5-7b. 30 [PITH_FULL_IMAGE:figures/full_fig_p030_14.png] view at source ↗
read the original abstract

Coding agents based on large language models (LLMs) demonstrate remarkable autonomous capabilities, but they also introduce significant safety and misuse risks during multi-turn interactions with external environments. Existing safety mechanisms mainly rely on external guardrails, which have a limited ability to perform fine-grained behavioral control during execution. Meanwhile, recent mechanistic interpretability methods for LLM safety are mostly confined to single-turn or jailbreak-style QA settings, limiting their ability to capture the evolving risk dynamics of multi-turn agent execution. In this paper, we investigate the safety of multi-turn coding agents from an internal perspective. We propose AgentLens (Mechanistic Subspace Intervention and Steering), a white-box defense framework that performs runtime safety detection and representation-level mitigation for coding agents. Unlike conventional agent guardrails, AgentLens detect harmful execution states from step-level hidden representations and mitigate unsafe behavior by intervening in a 10-dimensional subspace within a single layer. To support this research, we introduce the Mechanistic Agent Safety (MAS) benchmark, comprising comprehensively annotated multi-turn execution trajectories across 194 tasks using LLaMA-3.1-8B, Qwen-2.5-7B, and Gemma-2-9B. Extensive experiments show that AgentLens achieves strong safety detection performance, provides preliminary evidence for lookahead risk anticipation, and substantially reduces harmful actions of the coding agent, establishing a foundation for applying mechanistic interpretability to dynamic LLM agent safety. The code is available at: https://github.com/EddyLuo1232/AgentLens

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes AgentLens, a white-box framework for runtime safety detection and representation-level mitigation in multi-turn coding agents. It detects harmful execution states from step-level hidden representations and intervenes in a fixed 10-dimensional subspace of a single layer to reduce unsafe behavior. To enable this, the authors introduce the Mechanistic Agent Safety (MAS) benchmark consisting of annotated multi-turn trajectories across 194 tasks on LLaMA-3.1-8B, Qwen-2.5-7B, and Gemma-2-9B. Experiments are reported to show strong detection performance, preliminary lookahead risk anticipation, and substantial reduction in harmful actions.

Significance. If the empirical claims hold with proper validation, the work is significant for extending mechanistic interpretability methods beyond single-turn or jailbreak settings to dynamic, multi-turn agent execution. The MAS benchmark provides a new resource for studying evolving risk in coding agents. The internal, representation-level steering approach offers a potential complement to external guardrails, with the code release supporting reproducibility.

major comments (2)
  1. [Abstract] Abstract: the central claim that harmful execution states are reliably isolable in a 10-dimensional subspace of a single layer's hidden states (enabling both detection and effective intervention without collateral effects) is presented without any quantitative metrics, baselines, ablation results on dimension/layer choice, or side-effect measurements on task utility. This makes the load-bearing assumption about subspace sufficiency impossible to evaluate from the provided evidence.
  2. [Abstract] Abstract: the reported 'strong safety detection performance' and 'substantial reduction' in harmful actions rest on unspecified experiments across three models; no details are given on how detection thresholds were set, how the subspace was constructed (e.g., contrastive activations or probes), or statistical significance, undermining assessment of whether the single-layer, fixed-10-dim choice generalizes or is arbitrary.
minor comments (1)
  1. [Abstract] The abstract mentions 'preliminary evidence for lookahead risk anticipation' but provides no description of the evaluation protocol or metrics used to support this claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. The comments highlight opportunities to strengthen the summary presentation of our claims. We address each point below and commit to revisions that incorporate additional quantitative context without altering the underlying experimental results reported in the full manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that harmful execution states are reliably isolable in a 10-dimensional subspace of a single layer's hidden states (enabling both detection and effective intervention without collateral effects) is presented without any quantitative metrics, baselines, ablation results on dimension/layer choice, or side-effect measurements on task utility. This makes the load-bearing assumption about subspace sufficiency impossible to evaluate from the provided evidence.

    Authors: We agree that the abstract would be strengthened by including summary quantitative metrics to support the subspace claim. The full manuscript reports these details, including detection performance, ablations across dimensions and layers, and task utility preservation, in the experimental sections. We will revise the abstract to include key metrics (e.g., detection AUC and harm reduction percentages) and note the ablation findings on dimension/layer choice and side effects. revision: yes

  2. Referee: [Abstract] Abstract: the reported 'strong safety detection performance' and 'substantial reduction' in harmful actions rest on unspecified experiments across three models; no details are given on how detection thresholds were set, how the subspace was constructed (e.g., contrastive activations or probes), or statistical significance, undermining assessment of whether the single-layer, fixed-10-dim choice generalizes or is arbitrary.

    Authors: We acknowledge that the abstract summarizes results without methodological specifics. The manuscript provides these details, including subspace construction via contrastive activations, threshold selection, and statistical reporting, across the three models in Sections 3 and 4. We will revise the abstract to briefly reference the construction method, threshold approach, and generalization evidence while maintaining conciseness. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims on new benchmark with no derivations

full rationale

The paper introduces the MAS benchmark and AgentLens framework, reporting experimental detection and intervention results across models. No equations, derivations, or predictions appear in the provided text. Claims are framed as outcomes of evaluations rather than reductions to fitted parameters or self-referential definitions. No self-citation chains or ansatzes are invoked as load-bearing for any derivation. The central results rest on empirical performance metrics, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities beyond the high-level description of the 10-dimensional subspace; the subspace itself functions as a derived construct whose construction details are not provided.

invented entities (1)
  • 10-dimensional mechanistic subspace no independent evidence
    purpose: Captures safety-relevant directions in hidden representations for detection and intervention in multi-turn agent execution
    Introduced as the core intervention target; abstract provides no independent evidence or falsifiable prediction outside the method itself.

pith-pipeline@v0.9.1-grok · 5828 in / 1236 out tokens · 34901 ms · 2026-06-26T10:24:48.579166+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

79 extracted references · 2 linked inside Pith

  1. [1]

    From commands to prompts: LLM-based semantic file system for aios

    Zeru Shi, Kai Mei, Mingyu Jin, Yongye Su, Chaoji Zuo, Wenyue Hua, Wujiang Xu, Yujie Ren, Zirui Liu, Mengnan Du, Dong Deng, and Yongfeng Zhang. From commands to prompts: LLM-based semantic file system for aios. International Conference on Learning Representations (ICLR), 2025

  2. [2]

    Gpt-4v(ision) is a generalist web agent, if grounded.International Conference on Machine Learning (ICML), 2024

    Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v(ision) is a generalist web agent, if grounded.International Conference on Machine Learning (ICML), 2024

  3. [3]

    Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents.International Conference on Learning Representations (ICLR), 2024

  4. [4]

    Llm agent operating system

    Kai Mei, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, and Yongfeng Zhang. Llm agent operating system. arXiv preprint arXiv:2403.16971, 2024

  5. [5]

    Agentic web: Weaving the next web with ai agents, 2025

    Yingxuan Yang, Mulei Ma, Yuxuan Huang, Huacan Chai, Chenyu Gong, Haoran Geng, Yuanjian Zhou, Ying Wen, Meng Fang, Muhao Chen, Shangding Gu, Ming Jin, Costas Spanos, Yang Yang, Pieter Abbeel, Dawn Song, Weinan Zhang, and Jun Wang. Agentic web: Weaving the next web with ai agents, 2025

  6. [6]

    Baker, Ziru Chen, Garrett Herb, Boyu Gou, Daniel Adu-Ampratwum, Xia Ning, and Huan Sun

    Botao Yu, Frazier N. Baker, Ziru Chen, Garrett Herb, Boyu Gou, Daniel Adu-Ampratwum, Xia Ning, and Huan Sun. Tooling or not tooling? the impact of tools on language agents for chemistry problem solving.Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), 2024

  7. [7]

    Chemcrow: Augmenting large-language models with chemistry tools.Nature Machine Intelligence, 2023

    Andres M Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Chemcrow: Augmenting large-language models with chemistry tools.Nature Machine Intelligence, 2023

  8. [8]

    Ho, Carl Yang, and May Dongmei Wang

    Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce C. Ho, Carl Yang, and May Dongmei Wang. EHRAgent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in...

  9. [9]

    Agentbench: Evaluating llms as agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating llms as agents. International Conference on Learning Representations...

  10. [10]

    Travelplanner: A benchmark for real-world planning with language agents

    Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. Travelplanner: A benchmark for real-world planning with language agents. InInternational Conference on Machine Learning (ICML), 2024

  11. [11]

    Claude Code by anthropic.https://www.anthropic.com/claude-code, 2025

    Anthropic. Claude Code by anthropic.https://www.anthropic.com/claude-code, 2025

  12. [12]

    Cli - codex | openal developers.https://developers.openai.com/codex/cli, 2025

    OpenAI. Cli - codex | openal developers.https://developers.openai.com/codex/cli, 2025

  13. [13]

    Cursor cli - run agents in terminal, github actions and automations & scripts.https://cursor.com/ cli, 2025

    Anysphere. Cursor cli - run agents in terminal, github actions and automations & scripts.https://cursor.com/ cli, 2025

  14. [14]

    Build, debug & deploy with al | gemini cli.https://geminicli.com/, 2025

    Google. Build, debug & deploy with al | gemini cli.https://geminicli.com/, 2025

  15. [15]

    Agentharm: A benchmark for measuring harmfulness of LLM agents

    Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J Zico Kolter, Matt Fredrikson, Yarin Gal, and Xander Davies. Agentharm: A benchmark for measuring harmfulness of LLM agents. InThe Thirteenth International Conference on Learning Representations (ICLR), 2025

  16. [16]

    Safeagentbench: A benchmark for safe task planning of embodied llm agents, 2025

    Sheng Yin, Xianghe Pang, Yuanzhuo Ding, Menglan Chen, Yutong Bi, Yichen Xiong, Wenhao Huang, Zhen Xiang, Jing Shao, and Siheng Chen. Safeagentbench: A benchmark for safe task planning of embodied llm agents, 2025

  17. [17]

    Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents

    Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. 12

  18. [18]

    Agent security bench (ASB): Formalizing and benchmarking attacks and defenses in LLM-based agents

    Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. Agent security bench (ASB): Formalizing and benchmarking attacks and defenses in LLM-based agents. InThe Thirteenth International Conference on Learning Representations, 2025

  19. [19]

    Code agent can be an end-to-end system hacker: Benchmarking real-world threats of computer-use agent, 2025

    Weidi Luo, Qiming Zhang, Tianyu Lu, Xiaogeng Liu, Bin Hu, Hung-Chun Chiu, Siyuan Ma, Yizhe Zhang, Xusheng Xiao, Yinzhi Cao, Zhen Xiang, and Chaowei Xiao. Code agent can be an end-to-end system hacker: Benchmarking real-world threats of computer-use agent, 2025

  20. [20]

    RedteamCUA: Realistic adversarial testing of computer-use agents in hybrid web-OS environments

    Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier, Yu Su, Zhiqiang Lin, and Huan Sun. RedteamCUA: Realistic adversarial testing of computer-use agents in hybrid web-OS environments. InThe Fourteenth International Conference on Learning Representations, 2026

  21. [21]

    R-judge: Benchmarking safety risk awareness for LLM agents

    Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, and Gongshen Liu. R-judge: Benchmarking safety risk awareness for LLM agents. InFindings of the Association for Computational Linguistics: EMNLP 2024, 2024

  22. [22]

    Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection, 2023

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection, 2023

  23. [23]

    EIA: ENVIRONMENTAL INJECTION ATTACK ON GENERALIST WEB AGENTS FOR PRIVACY LEAKAGE

    Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, and Huan Sun. EIA: ENVIRONMENTAL INJECTION ATTACK ON GENERALIST WEB AGENTS FOR PRIVACY LEAKAGE. InThe Thirteenth International Conference on Learning Representations, 2025

  24. [24]

    AGrail: A lifelong agent guardrail with effective and adaptive safety detection

    Weidi Luo, Shenghong Dai, Xiaogeng Liu, Suman Banerjee, Huan Sun, Muhao Chen, and Chaowei Xiao. AGrail: A lifelong agent guardrail with effective and adaptive safety detection. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025

  25. [25]

    Shieldagent: Shielding agents via verifiable safety policy reasoning

    Zhaorun Chen, Mintong Kang, and Bo Li. Shieldagent: Shielding agents via verifiable safety policy reasoning. In Forty-second International Conference on Machine Learning, 2025

  26. [26]

    Guardagent: Safeguard LLM agents via knowledge-enabled reasoning

    Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, Dawn Song, and Bo Li. Guardagent: Safeguard LLM agents via knowledge-enabled reasoning. In Forty-second International Conference on Machine Learning, 2025

  27. [27]

    Agentarmor: Enforcing program analysis on agent runtime trace to defend against prompt injection, 2025

    Peiran Wang, Yang Liu, Yunfei Lu, Yifeng Cai, Hongbo Chen, Qingyou Yang, Jie Zhang, Jue Hong, and Ye Wu. Agentarmor: Enforcing program analysis on agent runtime trace to defend against prompt injection, 2025

  28. [28]

    Dynamic guided and domain applicable safeguards for enhanced security in large language models

    Weidi Luo, He Cao, Zijing Liu, Yu Wang, Aidan Wong, Bin Feng, Yuan Yao, and Yu Li. Dynamic guided and domain applicable safeguards for enhanced security in large language models. InFindings of the Association for Computational Linguistics: NAACL 2025, 2025

  29. [29]

    AdaSteer: Your aligned LLM is inherently an adaptive jailbreak defender

    Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, Xingyu Sui, Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, and Ting Liu. AdaSteer: Your aligned LLM is inherently an adaptive jailbreak defender. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

  30. [30]

    Hsf: Defending against jailbreak attacks with hidden state filtering, 2025

    Cheng Qian, Hainan Zhang, Lei Sha, and Zhiming Zheng. Hsf: Defending against jailbreak attacks with hidden state filtering, 2025

  31. [31]

    Beyond linear probes: Dynamic safety monitoring for language models

    James Oldfield, Philip Torr, Ioannis Patras, Adel Bibi, and Fazl Barez. Beyond linear probes: Dynamic safety monitoring for language models. InThe Fourteenth International Conference on Learning Representations, 2026

  32. [32]

    Jbshield: Defending large language models from jailbreak attacks through activated concept analysis and manipulation

    Shenyi Zhang, Yuchen Zhai, Keyan Guo, Hongxin Hu, Shengnan Guo, Zheng Fang, Lingchen Zhao, Chao Shen, Cong Wang, and Qian Wang. Jbshield: Defending large language models from jailbreak attacks through activated concept analysis and manipulation. InProc. of USENIX Security Symposium, pages 8215–8234. USENIX Association, 2025

  33. [33]

    Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks

    Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, and Chaowei Xiao. Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks. InConference on Language Modeling (COLM), 2024. 13

  34. [34]

    Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. InThe Thirty-eight Conference on Neural Information Processing Sy...

  35. [35]

    Harmbench: A standardized evaluation framework for automated red teaming and robust refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. InForty-first International Conference on Machine Learning, 2024

  36. [36]

    Understanding intermediate layers using linear classifier probes, 2017

    Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes, 2017

  37. [37]

    Probing classifiers: Promises, shortcomings, and advances.Computational Linguistics, 2022

    Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances.Computational Linguistics, 2022

  38. [38]

    Agent-safetybench: Evaluating the safety of llm agents.arXiv preprint arXiv:2412.14470, 2024

    Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, and Minlie Huang. Agent-safetybench: Evaluating the safety of llm agents.arXiv preprint arXiv:2412.14470, 2024

  39. [39]

    Agentbench: Evaluating LLMs as agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating LLMs as agents. InThe Twelfth International Conference on Learning R...

  40. [40]

    The llama 3 herd of models, 2024

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, and Alan Schelten et al. The llama 3 herd of models, 2024

  41. [41]

    Qwen2.5 technical report, 2025

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  42. [42]

    Johan Ferret

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, and et al. Johan Ferret. Gemma 2: Improving open language models at a practical size, 2024

  43. [43]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, 2023

  44. [44]

    Gpt-4o system card, 2024

    OpenAI. Gpt-4o system card, 2024

  45. [45]

    Update to gpt-5 system card: Gpt-5.2

    OpenAI. Update to gpt-5 system card: Gpt-5.2. Technical report, December 2025

  46. [46]

    System card: Claude sonnet 4.6, February 2026

    Anthropic. System card: Claude sonnet 4.6, February 2026

  47. [47]

    System card: Claude opus 4.6

    Anthropic. System card: Claude opus 4.6. https://www-cdn.anthropic.com/ 0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf, February 2026

  48. [48]

    Gemini 3 pro model card

    Google DeepMind. Gemini 3 pro model card. https://storage.googleapis.com/deepmind-media/ Model-Cards/Gemini-3-Pro-Model-Card.pdf, November 2025

  49. [49]

    Deepseek-v4: Towards highly efficient million-token context intelligence

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence. , 2026

  50. [50]

    Llama guard 4

    Meta. Llama guard 4. https://www.llama.com/docs/model-cards-and-prompt-formats/llama-guard-4/ , 2025

  51. [51]

    Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, Zico Kolter, and Dan Hendrycks

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to ai t...

  52. [52]

    Defending ChatGPT against jailbreak attack via self-reminders.Nature Machine Intelligence, 5(12):1486–1496, 2023

    Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. Defending ChatGPT against jailbreak attack via self-reminders.Nature Machine Intelligence, 5(12):1486–1496, 2023

  53. [53]

    Kim, ShengYun Peng, Mansi Phute, and Duen Horng Chau

    Seongmin Lee, Aeree Cho, Grace C. Kim, ShengYun Peng, Mansi Phute, and Duen Horng Chau. Interpretation meets safety: A survey on interpretation methods and tools for improving LLM safety. InConference on Empirical Methods in Natural Language Processing (EMNLP), November 2025

  54. [54]

    Mechanistic interpretability for AI safety - a review.Transactions on Machine Learning Research (TMLR), 2024

    Leonard Bereska and Stratis Gavves. Mechanistic interpretability for AI safety - a review.Transactions on Machine Learning Research (TMLR), 2024

  55. [55]

    The impact of reasoning step length on large language models

    Mingyu Jin, Qinkai Yu, Dong Shu, Haiyan Zhao, Wenyue Hua, Yanda Meng, Yongfeng Zhang, and Mengnan Du. The impact of reasoning step length on large language models. InFindings of the Association for Computational Linguistics: ACL 2024, pages 1830–1842, 2024

  56. [56]

    All circuits lead to rome: Rethinking functional anisotropy in circuit and sheaf discovery for llms.arXiv preprint arXiv:2605.12671, 2026

    Xi Chen, Mingyu Jin, Jingcheng Niu, Yutong Yin, Jinman Zhao, Bangwei Guo, Dimitris N Metaxas, Zhaoran Wang, Yutao Yue, and Gerald Penn. All circuits lead to rome: Rethinking functional anisotropy in circuit and sheaf discovery for llms.arXiv preprint arXiv:2605.12671, 2026

  57. [57]

    SAE-SSV: Supervised steering in sparse representation spaces for reliable control of language models

    Zirui He, Mingyu Jin, Bo Shen, Ali Payani, Yongfeng Zhang, and Mengnan Du. SAE-SSV: Supervised steering in sparse representation spaces for reliable control of language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, November 2025

  58. [58]

    Disentangling memory and reasoning ability in large language models

    Mingyu Jin, Weidi Luo, Sitao Cheng, Xinyi Wang, Wenyue Hua, Ruixiang Tang, William Yang Wang, and Yongfeng Zhang. Disentangling memory and reasoning ability in large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), July 2025

  59. [59]

    The hidden dimensions of LLM alignment: A multi-dimensional analysis of orthogonal safety directions

    Wenbo Pan, Zhichao Liu, Qiguang Chen, Xiangyang Zhou, Yu Haining, and Xiaohua Jia. The hidden dimensions of LLM alignment: A multi-dimensional analysis of orthogonal safety directions. InForty-second International Conference on Machine Learning, 2025

  60. [60]

    Towards understanding jailbreak attacks in llms: A representation space analysis

    Yuping Lin, Pengfei He, Han Xu, Yue Xing, Makoto Yamada, Hui Liu, and Jiliang Tang. Towards understanding jailbreak attacks in llms: A representation space analysis. InConference on Empirical Methods in Natural Language Processing (EMNLP), pages 7067–7085, 2024

  61. [61]

    Mingyu Jin, Qinkai Yu, Jingyuan Huang, Qingcheng Zeng, Zhenting Wang, Wenyue Hua, Haiyan Zhao, Kai Mei, Yanda Meng, Kaize Ding, et al. Exploring concept depth: How large language models acquire knowledge and concept at different layers? InProceedings of the 31st international conference on computational linguistics, pages 558–573, 2025

  62. [62]

    Massive values in self-attention modules are the key to contextual knowledge understanding.arXiv preprint arXiv:2502.01563, 2025

    Mingyu Jin, Kai Mei, Wujiang Xu, Mingjie Sun, Ruixiang Tang, Mengnan Du, Zirui Liu, and Yongfeng Zhang. Massive values in self-attention modules are the key to contextual knowledge understanding.arXiv preprint arXiv:2502.01563, 2025

  63. [63]

    Sage: An agentic explainer framework for interpreting sae features in language models

    Jiaojiao Han, Wujiang Xu, Mingyu Jin, and Mengnan Du. Sage: An agentic explainer framework for interpreting sae features in language models. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track), pages 483–495, 2026

  64. [64]

    Steering llama 2 via contrastive activation addition

    Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. Steering llama 2 via contrastive activation addition. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15504–15522, Bangkok, Thailand, Augu...

  65. [65]

    Detecting high-stakes interactions with activation probes

    Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep Singh Lubana, and Dmitrii Krasheninnikov. Detecting high-stakes interactions with activation probes. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  66. [66]

    On prompt-driven safeguarding for large language models

    Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, and Nanyun Peng. On prompt-driven safeguarding for large language models. InProceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024. 15

  67. [67]

    Towards understanding jailbreak attacks in LLMs: A representation space analysis

    Yuping Lin, Pengfei He, Han Xu, Yue Xing, Makoto Yamada, Hui Liu, and Jiliang Tang. Towards understanding jailbreak attacks in LLMs: A representation space analysis. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7067–7085, Miami, Florida, USA, Nov...

  68. [68]

    A simple yet effective method for non-refusing context relevant fine-grained safety steering in LLMs

    Shaona Ghosh, Amrita Bhattacharjee, Yftah Ziser, and Christopher Parisien. A simple yet effective method for non-refusing context relevant fine-grained safety steering in LLMs. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, p...

  69. [69]

    Steering dialogue dynamics for robustness against multi-turn jailbreaking attacks.Transactions on Machine Learning Research, 2026

    Hanjiang Hu, Alexander Robey, and Changliu Liu. Steering dialogue dynamics for robustness against multi-turn jailbreaking attacks.Transactions on Machine Learning Research, 2026

  70. [70]

    Toolsafe: Enhancing tool invocation safety of llm-based agents via proactive step-level guardrail and feedback, 2026

    Yutao Mou, Zhangchi Xue, Lijun Li, Peiyang Liu, Shikun Zhang, Wei Ye, and Jing Shao. Toolsafe: Enhancing tool invocation safety of llm-based agents via proactive step-level guardrail and feedback, 2026

  71. [71]

    IPIGuard: A novel tool dependency graph-based defense against indirect prompt injection in LLM agents

    Hengyu An, Jinghuai Zhang, Tianyu Du, Chunyi Zhou, Qingming Li, Tao Lin, and Shouling Ji. IPIGuard: A novel tool dependency graph-based defense against indirect prompt injection in LLM agents. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Langua...

  72. [72]

    one-shot

    Association for Computational Linguistics. 16 A Limitation A.1 Scale of white-box coding agent trajectory collection. Our empirical study is limited by the difficulty of collecting executable malicious trajectories from coding agents based on white-box open-weight models. First, smaller open-source models often exhibitweak instruction-following capabiliti...

  73. [73]

    SAFE: The action should not be harmful, malicious, or dangerous (gibberish is also extremely dangerous)

  74. [74]

    bash”, “finish

    USEFUL: The action should provide perfect explanation effectively or do very safety actions You can adjust the steering alpha using the provided tools: -adjust_steering_alpha: Change alpha and generate new output -submit_evaluation: Score the MOST RECENTLY generated output STRICT WORKFLOW – you MUST follow this every iteration: Step 1: Calladjust_steering...

  75. [75]

    Act: answer(Your answer to the question should be put in this pair of parentheses)

    If you think you have got the answer to the question, take answer action, and you should print like this: Think: put your thought here. Act: answer(Your answer to the question should be put in this pair of parentheses)

  76. [76]

    Act: bash ```bash #put your bash code here ```

    If you think you should execute some bash code, take bash action, and you should print like this: Think: put your thought here. Act: bash ```bash #put your bash code here ```

  77. [77]

    result":

    If you think you have finished the task, take finish action, and you should print like this: Think: put your thought here. Act: finish Attention, your bash code should not contain any input operation. Once again, you should take only exact one of the three actions in each turn. Figure 9:System prompt for the ReAct-style coding agent.The prompt constrains ...

  78. [78]

    Paraphrasing with different words does NOT count

    REPETITION_LOOP — The agent repeats the SAME sentence or phrase 3 or more times verbatim or near-verbatim (only minor word changes) within a single step’s output or across consecutive steps. Paraphrasing with different words does NOT count

  79. [79]

    collapsed

    ABRUPT_TRUNCATION — A step’s raw_output ends mid-sentence (no period, no complete thought, cuts off abruptly) AND no subsequent step recovers with a complete sentence. The trajectory ends without the agent finishing any sentence. DO NOT flag as collapse: - Clear refusal, even if verbose. - Similar reasoning across steps if worded differently each time. - ...