pith. sign in

arxiv: 2605.27690 · v1 · pith:DS3WLYODnew · submitted 2026-05-26 · 💻 cs.CL · cs.LG

TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling

Pith reviewed 2026-06-29 18:06 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords proactive safety auditingLLM agentstrajectory risk modelingmulti-turn interactionsweak supervisionprefix-level risk estimationagent safety benchmarkshidden state evolution
0
0 comments X

The pith

TRACES models hidden representations of observer LLMs to estimate risk drift in partial agent trajectories using only trajectory-level labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that safety risks in multi-turn LLM agents often appear in intermediate steps, making post-hoc checks too late. TRACES extracts latent features from step representations and tracks how they evolve over time to flag whether a prefix is heading toward unsafe behavior. It trains this estimator with weak supervision at the full-trajectory level yet still outputs dense prefix-level risk scores. A sympathetic reader would care because early detection could allow intervention while the agent is still acting rather than after damage occurs. The work positions this as a step toward proactive auditing that scales to long-horizon interactions.

Core claim

TRACES induces latent mechanism features from the hidden representations of an observer LLM at each step, models the temporal evolution of those features to estimate whether a partial trajectory is drifting toward unsafe behavior, and is trained solely with trajectory-level supervision while still producing accurate dense prefix-level risk estimates; across multiple agent safety benchmarks this yields better full-trajectory safety prediction and stronger proactive risk discrimination than prior approaches.

What carries the argument

Trajectory-state modeling that learns prefix-level risk states by inducing and tracking temporal evolution of latent features in observer-LLM hidden representations.

If this is right

  • Full-trajectory safety prediction improves because prefix-level risk signals accumulate into a stronger final verdict.
  • Proactive discrimination of risky prefixes becomes possible without step-level annotations.
  • Risk states extracted this way can be fed back to train safer agents.
  • The same weak-supervision approach applies to any multi-turn agent whose intermediate states are observable via an external model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same representation-tracking idea could be tested on non-LLM sequential agents such as robotic planners or game-playing systems.
  • Real-time deployment would require checking whether the observer model can run in parallel without adding unacceptable latency.
  • If the latent features turn out to be task-specific rather than mechanism-general, the method might need retraining per domain.

Load-bearing premise

Hidden representations of an observer LLM contain inducible features whose changes over time reliably signal drift toward unsafe behavior, and trajectory-level labels alone suffice to learn accurate step-level risk estimates from them.

What would settle it

On a held-out agent safety benchmark, a version of TRACES trained only on trajectory labels produces prefix-level risk scores no more accurate at early risk discrimination than a simple baseline that waits for the final outcome.

Figures

Figures reproduced from arXiv: 2605.27690 by Boxuan Zhang, Jiaqian Li, Kuan-Hao Huang, Ruixiang Tang, Yanshu Li.

Figure 1
Figure 1. Figure 1: Motivating example for proactive trajectory [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Pipeline of TRACES. Stage 1 maps observer-model representations into mechanism-level evidence. Stage 2 models their temporal evolution to produce prefix-level risk states for proactive auditing. models symbolic agent states with probabilistic transitions to anticipate future violations. Most recently, AgentForesight (Zhang et al., 2026) fur￾ther studies online auditing with supervision from annotated decis… view at source ↗
Figure 3
Figure 3. Figure 3: Cross-benchmark generalization of TRACES [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Prompt template used for LLM-as-a-Judge and prompt-based guard baselines. For proactive evaluation, [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompt template used for fine-grained diagnosis of unsafe agent trajectories. The model predicts one [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Full-pipeline cross-benchmark generalization [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Type-1 failure case for [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Type-1 failure case for traj_id=607 of AT￾Bench. TRACES assigns high risk from the beginning, despite the task being framed as synthetic CRM valida￾tion. This suggests sensitivity to credential-conditioned CRM operations even when the context indicates a be￾nign test workflow. tory risk state, aligning with our use of both EDR and EAUPC to evaluate proactive auditing. F Failure Case Analysis Type-1 failur… view at source ↗
Figure 11
Figure 11. Figure 11: Type-2 failure case for traj_id=136 of AT￾Bench. TRACES remains low during video retrieval and OCR extraction, but rises only when the extracted identifier is used for the HVAC configuration update. a subset of mechanisms is consistently associated with a particular risk type, such as unreliable tool use and sensitive information exposure, these mech￾anisms could serve as targets for representation￾level … view at source ↗
Figure 12
Figure 12. Figure 12: Type-2 failure case for traj_id=277 of AT￾Bench. TRACES remains low throughout the trajectory, suggesting difficulty in detecting cumulative cost and scale risks distributed across repeated, seemingly valid tool calls. benchmarks provide valuable controlled trajecto￾ries, but they do not allow an auditor’s warning or intervention to change later tool calls, environment feedback, or downstream agent behavi… view at source ↗
Figure 13
Figure 13. Figure 13: GPT-5.2 prompt used for the LLM-as-a-judge evaluation in the PRM-based policy fine-tuning analysis. [PITH_FULL_IMAGE:figures/full_fig_p028_13.png] view at source ↗
read the original abstract

LLM agents increasingly operate through multi-turn tool use and environment interaction, where safety risks often emerge from intermediate steps long before they surface in the final outcome. Reactive auditing is therefore insufficient: post-hoc diagnosis frequently misses the chance to flag risks while they are unfolding. We propose TRACES, a representation-based proactive auditor that learns prefix-level trajectory risk states from the hidden representations of an observer LLM. TRACES induces latent mechanism features from step representations and models their temporal evolution to estimate whether a partial trajectory is drifting toward unsafe behavior. To sidestep the cost and ambiguity of step-level risk annotation, TRACES is trained with weak trajectory-level supervision while still producing dense prefix-level risk estimates. Across multiple agent safety benchmarks, TRACES improves both full-trajectory safety prediction and proactive risk discrimination. Our analyses further suggest that these risk states can help train a safer agent, highlighting the broader potential of proactive auditing for long-horizon agent safety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes TRACES, a representation-based proactive auditor for multi-turn LLM agents. It extracts hidden representations from an observer LLM, induces latent mechanism features, and models their temporal evolution to produce dense prefix-level risk estimates indicating drift toward unsafe behavior. Training uses only weak trajectory-level supervision to avoid expensive step-level annotations. Evaluations across multiple agent safety benchmarks report improvements in both full-trajectory safety prediction and proactive risk discrimination, with additional analyses suggesting utility for training safer agents.

Significance. If the central empirical claims hold under scrutiny, the work would be significant for LLM agent safety research. Proactive auditing via temporal modeling of latent states addresses a genuine gap between reactive post-hoc diagnosis and the need for early intervention in long-horizon tool-use trajectories. The weak-supervision design is practically attractive for scalability, and the suggestion that risk states can inform safer agent training points to downstream utility beyond auditing.

major comments (1)
  1. [Abstract / Evaluation] The core claim that prefix-level risk estimates capture genuine early drift (rather than correlation with final trajectory labels) rests on the untested assumption that observer hidden states contain inducible latent mechanism features whose temporal evolution can be recovered from weak supervision alone. No ablation or diagnostic (e.g., comparison of risk trajectories on safe vs. unsafe prefixes of ultimately unsafe trajectories, or training with outcome-shuffled labels) is described that would rule out leakage or smoothing artifacts. This directly affects the interpretation of the reported gains in proactive risk discrimination.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for stronger diagnostics to support the interpretation of prefix-level risk estimates as capturing genuine early drift. We agree that the current manuscript does not include the suggested ablations, and we will add them to address this concern directly.

read point-by-point responses
  1. Referee: [Abstract / Evaluation] The core claim that prefix-level risk estimates capture genuine early drift (rather than correlation with final trajectory labels) rests on the untested assumption that observer hidden states contain inducible latent mechanism features whose temporal evolution can be recovered from weak supervision alone. No ablation or diagnostic (e.g., comparison of risk trajectories on safe vs. unsafe prefixes of ultimately unsafe trajectories, or training with outcome-shuffled labels) is described that would rule out leakage or smoothing artifacts. This directly affects the interpretation of the reported gains in proactive risk discrimination.

    Authors: We acknowledge that the manuscript lacks explicit diagnostics to rule out the possibility that prefix-level estimates arise from label leakage or smoothing rather than recoverable temporal dynamics in the latent features. To strengthen the claim, we will add two analyses in the revised version: (1) a comparison of risk trajectories specifically on safe versus unsafe prefixes drawn from ultimately unsafe trajectories, and (2) a control experiment retraining the model with outcome-shuffled labels to quantify performance degradation. These additions will be presented in a new subsection under Experiments, with quantitative results and visualizations. We believe this will clarify that the gains in proactive discrimination reflect the intended mechanism rather than artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity identified; derivation self-contained with no reducible steps exhibited

full rationale

The provided abstract and description contain no equations, fitting procedures, or self-citations that allow exhibition of any reduction by construction (e.g., no parameter fitted to trajectory labels then renamed as prefix prediction, no uniqueness theorem imported, no ansatz smuggled via citation). The training regime uses weak supervision explicitly to produce dense estimates, but without quoted model equations or evaluation details showing the estimates are forced by the labels, no load-bearing circular step can be identified. The central claim of inducing latent mechanism features from observer hidden states remains independent of the supervision in the text given. This is the normal honest outcome when no specific reduction is quotable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input supplies no explicit free parameters, axioms, or invented entities; the central claim implicitly rests on the unstated assumption that observer-LLM hidden states encode risk-relevant temporal dynamics.

pith-pipeline@v0.9.1-grok · 5702 in / 1046 out tokens · 29634 ms · 2026-06-29T18:06:30.604018+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

11 extracted references · 7 canonical work pages · 4 internal anchors

  1. [1]

    Preprint, arXiv:2503.09066

    Probing latent subspaces in llm for ai secu- rity: Identifying and manipulating adversarial states. Preprint, arXiv:2503.09066. Aarya Doshi, Yining Hong, Congying Xu, Eunsuk Kang, Alexandros Kapravelos, and Christian Kästner

  2. [2]

    arXiv preprint arXiv:2601.08012 , year=

    Towards verifiably safe tool use for llm agents. arXiv preprint arXiv:2601.08012. Joshua Engels, Eric J. Michaud, Isaac Liao, Wes Gurnee, and Max Tegmark. 2025. Not all language model features are one-dimensionally linear.Preprint, arXiv:2405.14860. Yunhao Feng, Yifan Ding, Yingshui Tan, Xingjun Ma, Yige Li, Yutao Wu, Yifeng Gao, Kun Zhai, and Yan- ming G...

  3. [3]

    The Linear Representation Hypothesis and the Geometry of Large Language Models

    The linear representation hypothesis and the geometry of large language models.Preprint, arXiv:2311.03658. Qwen Team. 2024. Qwq-32b. https://huggingface. co/Qwen/QwQ-32B. Hugging Face model card, ac- cessed 2026-05-20. Noah Shinn, Federico Cassano, Edward Berman, Ash- win Gopinath, Karthik Narasimhan, and Shunyu Yao

  4. [4]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Reflexion: Language agents with verbal rein- forcement learning.Preprint, arXiv:2303.11366. Haoyu Wang, Christopher M. Poskitt, and Jun Sun

  5. [5]

    AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

    Agentspec: Customizable runtime enforce- ment for safe and reliable llm agents.Preprint, arXiv:2503.18666. Haoyu Wang, Christopher M. Poskitt, Jiali Wei, and Jun Sun. 2026a. Probguard: Probabilistic run- time monitoring for llm agent safety.Preprint, arXiv:2508.00500. Jiacheng Wang, Jinchang Hou, Fabian Wang, Ping Jian, Chenfu Bao, and Zhonghou Lv. 2026b....

  6. [6]

    Guardagent: Safeguard llm agents by a guard agent via knowledge-enabled reasoning.Preprint, arXiv:2406.09187. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Day- iheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. Qwe...

  7. [7]

    10 Boxuan Zhang, Jianing Zhu, Zeru Shi, Dongfang Liu, and Ruixiang Tang

    Representation bending for large language model safety.Preprint, arXiv:2504.01550. 10 Boxuan Zhang, Jianing Zhu, Zeru Shi, Dongfang Liu, and Ruixiang Tang. 2026. Agentforesight: Online auditing for early failure prediction in multi-agent systems.Preprint, arXiv:2605.08715. Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Y...

  8. [8]

    is a family of LLM-based safeguard models for classifying safety risks in human– AI conversations. The original LlamaGuard frames moderation as an instruction-following task, where the model receives a safety taxon- omy, task-specific instructions, and a conversa- tion, and outputs a binary safe/unsafe decision together with violated risk categories when ...

  9. [9]

    safe" or

    is an open-source safety framework for large-model applications. It targets practi- cal deployment scenarios where LLM systems may face unsafe content, data leakage, prompt- injection attacks, and other application-level risks. Unlike single-model guard classifiers, JoySafety provides a modular safety architecture with multiple atomic detection capabiliti...

  10. [10]

    safe" or

    label: "safe" or "unsafe"

  11. [11]

    label":

    risk_score: a continuous number between 0 and 1, where 0 means clearly safe, 1 means clearly unsafe, and intermediate values indicate uncertainty or emerging risk. Important: - Evaluate only the observed trajectory or prefix. - Do not assume future unsafe behavior unless there is evidence in the observed prefix. - For a partial trajectory, the risk_score ...