TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling
Pith reviewed 2026-06-29 18:06 UTC · model grok-4.3
The pith
TRACES models hidden representations of observer LLMs to estimate risk drift in partial agent trajectories using only trajectory-level labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TRACES induces latent mechanism features from the hidden representations of an observer LLM at each step, models the temporal evolution of those features to estimate whether a partial trajectory is drifting toward unsafe behavior, and is trained solely with trajectory-level supervision while still producing accurate dense prefix-level risk estimates; across multiple agent safety benchmarks this yields better full-trajectory safety prediction and stronger proactive risk discrimination than prior approaches.
What carries the argument
Trajectory-state modeling that learns prefix-level risk states by inducing and tracking temporal evolution of latent features in observer-LLM hidden representations.
If this is right
- Full-trajectory safety prediction improves because prefix-level risk signals accumulate into a stronger final verdict.
- Proactive discrimination of risky prefixes becomes possible without step-level annotations.
- Risk states extracted this way can be fed back to train safer agents.
- The same weak-supervision approach applies to any multi-turn agent whose intermediate states are observable via an external model.
Where Pith is reading between the lines
- The same representation-tracking idea could be tested on non-LLM sequential agents such as robotic planners or game-playing systems.
- Real-time deployment would require checking whether the observer model can run in parallel without adding unacceptable latency.
- If the latent features turn out to be task-specific rather than mechanism-general, the method might need retraining per domain.
Load-bearing premise
Hidden representations of an observer LLM contain inducible features whose changes over time reliably signal drift toward unsafe behavior, and trajectory-level labels alone suffice to learn accurate step-level risk estimates from them.
What would settle it
On a held-out agent safety benchmark, a version of TRACES trained only on trajectory labels produces prefix-level risk scores no more accurate at early risk discrimination than a simple baseline that waits for the final outcome.
Figures
read the original abstract
LLM agents increasingly operate through multi-turn tool use and environment interaction, where safety risks often emerge from intermediate steps long before they surface in the final outcome. Reactive auditing is therefore insufficient: post-hoc diagnosis frequently misses the chance to flag risks while they are unfolding. We propose TRACES, a representation-based proactive auditor that learns prefix-level trajectory risk states from the hidden representations of an observer LLM. TRACES induces latent mechanism features from step representations and models their temporal evolution to estimate whether a partial trajectory is drifting toward unsafe behavior. To sidestep the cost and ambiguity of step-level risk annotation, TRACES is trained with weak trajectory-level supervision while still producing dense prefix-level risk estimates. Across multiple agent safety benchmarks, TRACES improves both full-trajectory safety prediction and proactive risk discrimination. Our analyses further suggest that these risk states can help train a safer agent, highlighting the broader potential of proactive auditing for long-horizon agent safety.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes TRACES, a representation-based proactive auditor for multi-turn LLM agents. It extracts hidden representations from an observer LLM, induces latent mechanism features, and models their temporal evolution to produce dense prefix-level risk estimates indicating drift toward unsafe behavior. Training uses only weak trajectory-level supervision to avoid expensive step-level annotations. Evaluations across multiple agent safety benchmarks report improvements in both full-trajectory safety prediction and proactive risk discrimination, with additional analyses suggesting utility for training safer agents.
Significance. If the central empirical claims hold under scrutiny, the work would be significant for LLM agent safety research. Proactive auditing via temporal modeling of latent states addresses a genuine gap between reactive post-hoc diagnosis and the need for early intervention in long-horizon tool-use trajectories. The weak-supervision design is practically attractive for scalability, and the suggestion that risk states can inform safer agent training points to downstream utility beyond auditing.
major comments (1)
- [Abstract / Evaluation] The core claim that prefix-level risk estimates capture genuine early drift (rather than correlation with final trajectory labels) rests on the untested assumption that observer hidden states contain inducible latent mechanism features whose temporal evolution can be recovered from weak supervision alone. No ablation or diagnostic (e.g., comparison of risk trajectories on safe vs. unsafe prefixes of ultimately unsafe trajectories, or training with outcome-shuffled labels) is described that would rule out leakage or smoothing artifacts. This directly affects the interpretation of the reported gains in proactive risk discrimination.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for stronger diagnostics to support the interpretation of prefix-level risk estimates as capturing genuine early drift. We agree that the current manuscript does not include the suggested ablations, and we will add them to address this concern directly.
read point-by-point responses
-
Referee: [Abstract / Evaluation] The core claim that prefix-level risk estimates capture genuine early drift (rather than correlation with final trajectory labels) rests on the untested assumption that observer hidden states contain inducible latent mechanism features whose temporal evolution can be recovered from weak supervision alone. No ablation or diagnostic (e.g., comparison of risk trajectories on safe vs. unsafe prefixes of ultimately unsafe trajectories, or training with outcome-shuffled labels) is described that would rule out leakage or smoothing artifacts. This directly affects the interpretation of the reported gains in proactive risk discrimination.
Authors: We acknowledge that the manuscript lacks explicit diagnostics to rule out the possibility that prefix-level estimates arise from label leakage or smoothing rather than recoverable temporal dynamics in the latent features. To strengthen the claim, we will add two analyses in the revised version: (1) a comparison of risk trajectories specifically on safe versus unsafe prefixes drawn from ultimately unsafe trajectories, and (2) a control experiment retraining the model with outcome-shuffled labels to quantify performance degradation. These additions will be presented in a new subsection under Experiments, with quantitative results and visualizations. We believe this will clarify that the gains in proactive discrimination reflect the intended mechanism rather than artifacts. revision: yes
Circularity Check
No circularity identified; derivation self-contained with no reducible steps exhibited
full rationale
The provided abstract and description contain no equations, fitting procedures, or self-citations that allow exhibition of any reduction by construction (e.g., no parameter fitted to trajectory labels then renamed as prefix prediction, no uniqueness theorem imported, no ansatz smuggled via citation). The training regime uses weak supervision explicitly to produce dense estimates, but without quoted model equations or evaluation details showing the estimates are forced by the labels, no load-bearing circular step can be identified. The central claim of inducing latent mechanism features from observer hidden states remains independent of the supervision in the text given. This is the normal honest outcome when no specific reduction is quotable.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Probing latent subspaces in llm for ai secu- rity: Identifying and manipulating adversarial states. Preprint, arXiv:2503.09066. Aarya Doshi, Yining Hong, Congying Xu, Eunsuk Kang, Alexandros Kapravelos, and Christian Kästner
-
[2]
arXiv preprint arXiv:2601.08012 , year=
Towards verifiably safe tool use for llm agents. arXiv preprint arXiv:2601.08012. Joshua Engels, Eric J. Michaud, Isaac Liao, Wes Gurnee, and Max Tegmark. 2025. Not all language model features are one-dimensionally linear.Preprint, arXiv:2405.14860. Yunhao Feng, Yifan Ding, Yingshui Tan, Xingjun Ma, Yige Li, Yutao Wu, Yifeng Gao, Kun Zhai, and Yan- ming G...
-
[3]
The Linear Representation Hypothesis and the Geometry of Large Language Models
The linear representation hypothesis and the geometry of large language models.Preprint, arXiv:2311.03658. Qwen Team. 2024. Qwq-32b. https://huggingface. co/Qwen/QwQ-32B. Hugging Face model card, ac- cessed 2026-05-20. Noah Shinn, Federico Cassano, Edward Berman, Ash- win Gopinath, Karthik Narasimhan, and Shunyu Yao
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
Reflexion: Language Agents with Verbal Reinforcement Learning
Reflexion: Language agents with verbal rein- forcement learning.Preprint, arXiv:2303.11366. Haoyu Wang, Christopher M. Poskitt, and Jun Sun
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents
Agentspec: Customizable runtime enforce- ment for safe and reliable llm agents.Preprint, arXiv:2503.18666. Haoyu Wang, Christopher M. Poskitt, Jiali Wei, and Jun Sun. 2026a. Probguard: Probabilistic run- time monitoring for llm agent safety.Preprint, arXiv:2508.00500. Jiacheng Wang, Jinchang Hou, Fabian Wang, Ping Jian, Chenfu Bao, and Zhonghou Lv. 2026b....
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
Guardagent: Safeguard llm agents by a guard agent via knowledge-enabled reasoning.Preprint, arXiv:2406.09187. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Day- iheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. Qwe...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
10 Boxuan Zhang, Jianing Zhu, Zeru Shi, Dongfang Liu, and Ruixiang Tang
Representation bending for large language model safety.Preprint, arXiv:2504.01550. 10 Boxuan Zhang, Jianing Zhu, Zeru Shi, Dongfang Liu, and Ruixiang Tang. 2026. Agentforesight: Online auditing for early failure prediction in multi-agent systems.Preprint, arXiv:2605.08715. Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Y...
-
[8]
is a family of LLM-based safeguard models for classifying safety risks in human– AI conversations. The original LlamaGuard frames moderation as an instruction-following task, where the model receives a safety taxon- omy, task-specific instructions, and a conversa- tion, and outputs a binary safe/unsafe decision together with violated risk categories when ...
-
[9]
safe" or
is an open-source safety framework for large-model applications. It targets practi- cal deployment scenarios where LLM systems may face unsafe content, data leakage, prompt- injection attacks, and other application-level risks. Unlike single-model guard classifiers, JoySafety provides a modular safety architecture with multiple atomic detection capabiliti...
2025
-
[10]
safe" or
label: "safe" or "unsafe"
-
[11]
label":
risk_score: a continuous number between 0 and 1, where 0 means clearly safe, 1 means clearly unsafe, and intermediate values indicate uncertainty or emerging risk. Important: - Evaluate only the observed trajectory or prefix. - Do not assume future unsafe behavior unless there is evidence in the observed prefix. - For a partial trajectory, the risk_score ...
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.