TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents
Pith reviewed 2026-06-27 22:09 UTC · model grok-4.3
The pith
TRACE detects hidden sabotage in LLM agent trajectories by linking evidence across distant steps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TRACE is a monitoring framework for long-horizon LLM agent trajectories that runs a TIJ (Triage-Inspect-Judge) loop: it identifies high-signal regions, performs targeted inspection while maintaining accumulated evidence across reasoning steps, and synthesizes a trajectory-level verdict. On ten task domains from SHADE-Arena it reaches an aggregate F1 of 0.713 and recall of 0.844, with the largest gains on tasks that require long-range evidence linking.
What carries the argument
The TIJ (Triage-Inspect-Judge) loop that selects high-signal regions, inspects them with cross-step evidence accumulation, and produces a final verdict.
If this is right
- Detection of sabotage improves when evidence is aggregated across temporally distant actions rather than scored in isolation.
- Recall reaches 0.844 on the SHADE-Arena domains while maintaining an F1 of 0.713.
- Gains are largest precisely on tasks whose malicious objectives depend on long-range evidence connections.
- Existing single-pass and independent-window methods are outperformed because they cannot carry evidence forward.
Where Pith is reading between the lines
- The same accumulation mechanism could be tested on agent trajectories in non-adversarial settings such as debugging or planning.
- Integration with existing safety filters might reduce false negatives on slow-burn behaviors without increasing latency on short trajectories.
- The triage step could be replaced with other selection heuristics to measure how much performance depends on the particular identification method.
Load-bearing premise
The TIJ loop reliably identifies high-signal regions and maintains accumulated evidence across steps without missing critical distant connections or introducing selection bias in the triage phase.
What would settle it
A replication on a fresh set of long-horizon trajectories in which TRACE shows no improvement over single-pass or windowed baselines on tasks that require distant evidence linking would falsify the performance claim.
Figures
read the original abstract
Autonomous LLM agents can pursue hidden malicious objectives through sequences of individually benign actions, making sabotage difficult to detect using standard trajectory-level monitoring. Existing approaches either evaluate complete trajectories in a single pass or partition them into independently scored windows, limiting their ability to connect evidence across temporally distant actions. We propose TRACE, a monitoring framework for long-horizon LLM agent trajectories. TRACE operates through a TIJ (Triage-Inspect-Judge) loop that identifies high-signal regions, performs targeted inspection while maintaining accumulated evidence across reasoning steps, and synthesizes a trajectory-level verdict. We evaluate TRACE on ten task domains from SHADE-Arena against state-of-the-art baselines. TRACE achieves an aggregate F1 of 0.713 and recall of 0.844, with the largest gains on tasks requiring long-range evidence linking.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TRACE, a monitoring framework for detecting hidden malicious objectives in long-horizon LLM agent trajectories. TRACE uses a TIJ (Triage-Inspect-Judge) loop to identify high-signal regions, perform targeted inspection while accumulating evidence across steps, and synthesize a trajectory-level verdict. It is evaluated on ten task domains from SHADE-Arena, reporting an aggregate F1 of 0.713 and recall of 0.844 with largest gains on tasks requiring long-range evidence linking, outperforming state-of-the-art baselines.
Significance. If the empirical results and the TIJ loop's reliability hold under scrutiny, this could meaningfully advance AI safety monitoring for agentic systems by addressing limitations of single-pass or windowed trajectory evaluation. The multi-domain evaluation on SHADE-Arena provides some breadth, but the absence of ablations, error analysis, or protocol details in the supplied text limits assessment of whether the gains are robust or artifactual.
major comments (1)
- [Evaluation] Evaluation protocol: the central performance claims (aggregate F1 0.713, recall 0.844, and largest gains on long-range tasks) rest on external benchmark results, yet no details are provided on baseline implementations, statistical significance testing, per-domain breakdowns, or error analysis. This makes it impossible to verify whether the TIJ loop's triage step avoids selection bias or missed distant connections, directly bearing on the weakest assumption flagged in the review.
minor comments (1)
- [Abstract] The abstract states results on 'ten task domains' but does not name the domains or report per-domain metrics, reducing clarity on where gains occur.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for greater transparency in our evaluation protocol. We will revise the manuscript to include the requested details on baselines, statistical testing, per-domain results, and error analysis.
read point-by-point responses
-
Referee: [Evaluation] Evaluation protocol: the central performance claims (aggregate F1 0.713, recall 0.844, and largest gains on long-range tasks) rest on external benchmark results, yet no details are provided on baseline implementations, statistical significance testing, per-domain breakdowns, or error analysis. This makes it impossible to verify whether the TIJ loop's triage step avoids selection bias or missed distant connections, directly bearing on the weakest assumption flagged in the review.
Authors: We agree that additional protocol details are required for independent verification. In the revised manuscript we will add: (1) explicit descriptions of baseline implementations including any hyperparameter choices or adaptations from original papers; (2) statistical significance results (bootstrap 95% CI and paired tests on F1/recall); (3) a per-domain table breaking down F1, recall, and precision across the ten SHADE-Arena tasks; and (4) an error analysis subsection that quantifies triage-step selection bias and examines whether distant connections were missed, with concrete examples of how the inspect and judge stages accumulate cross-step evidence to mitigate these risks. These additions will directly support the claim that gains on long-range tasks are not artifactual. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper introduces the TRACE framework with a TIJ loop for trajectory monitoring and reports empirical performance (aggregate F1 0.713, recall 0.844) on the external SHADE-Arena benchmark across ten task domains. No mathematical derivations, equations, parameter fittings, or self-referential predictions are present. Claims rest on independent benchmark evaluation against baselines rather than internal consistency, self-citations, or renamed inputs. The derivation chain is self-contained as a standard empirical proposal with no load-bearing reductions to its own inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
2025 , eprint=
SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents , author=. 2025 , eprint=
2025
-
[2]
2025 , eprint=
Reliable Weak-to-Strong Monitoring of LLM Agents , author=. 2025 , eprint=
2025
-
[3]
2025 , eprint=
Frontier Models are Capable of In-context Scheming , author=. 2025 , eprint=
2025
-
[4]
2026 , eprint=
Compound Deception in Elite Peer Review: A Failure Mode Taxonomy of 100 Fabricated Citations at NeurIPS 2025 , author=. 2026 , eprint=
2025
-
[5]
2019 , howpublished=
OECD Principles on Artificial Intelligence , author=. 2019 , howpublished=
2019
-
[6]
2025 , eprint=
Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought , author=. 2025 , eprint=
2025
-
[7]
2023 , eprint=
ReAct: Synergizing Reasoning and Acting in Language Models , author=. 2023 , eprint=
2023
-
[8]
2025 , eprint=
Towards Machine Theory of Mind with Large Language Model-Augmented Inverse Planning , author=. 2025 , eprint=
2025
-
[9]
2024 , eprint=
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks , author=. 2024 , eprint=
2024
-
[10]
2025 , eprint=
Auditing Language Models for Hidden Objectives , author=. 2025 , eprint=
2025
-
[11]
The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
Many-shot Jailbreaking , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
-
[12]
2024 , eprint=
Agent-SafetyBench: Evaluating the Safety of LLM Agents , author=. 2024 , eprint=
2024
-
[13]
ArXiv , year=
Prompt Injection Attacks and Defenses in LLM-Integrated Applications , author=. ArXiv , year=
-
[14]
arXiv preprint arXiv:2312.06942 , year =
AI Control: Improving Safety Despite Intentional Subversion , author =. arXiv preprint arXiv:2312.06942 , year =
-
[15]
How Does Information Access Affect
Aether AI Research , journal =. How Does Information Access Affect. 2026 , url =
2026
-
[16]
2025 , eprint=
AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection , author=. 2025 , eprint=
2025
-
[17]
2026 , eprint=
Constitutional Black-Box Monitoring for Scheming in LLM Agents , author=. 2026 , eprint=
2026
-
[18]
2025 , eprint=
AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents , author=. 2025 , eprint=
2025
-
[19]
2025 , eprint=
Sentinel Agents for Secure and Trustworthy Agentic AI in Multi-Agent Systems , author=. 2025 , eprint=
2025
-
[20]
2023 , eprint=
Scheming AIs: Will AIs Fake Alignment During Training to Get Power? , author=. 2023 , eprint=
2023
-
[21]
2024 , eprint=
Alignment Faking in Large Language Models , author=. 2024 , eprint=
2024
-
[22]
2024 , eprint=
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training , author=. 2024 , eprint=
2024
-
[23]
2023 , eprint=
Measuring Faithfulness in Chain-of-Thought Reasoning , author=. 2023 , eprint=
2023
-
[24]
2024 , eprint=
Chain of Thought Reasoning Without Prompting , author=. 2024 , eprint=
2024
-
[25]
2024 , eprint=
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents , author=. 2024 , eprint=
2024
-
[26]
2023 , eprint=
Identifying the Risks of LM Agents with Emulated Environments , author=. 2023 , eprint=
2023
-
[27]
2024 , eprint=
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents , author=. 2024 , eprint=
2024
-
[28]
2024 , eprint=
Towards evaluations-based safety cases for AI scheming , author=. 2024 , eprint=
2024
-
[29]
2023 , eprint=
Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision , author=. 2023 , eprint=
2023
-
[30]
Monitoring
Chen Yueh-Han and Nitish Joshi and Yulin Chen and He He and Rico Angell , booktitle=. Monitoring. 2025 , url=
2025
-
[31]
2023 , eprint=
Generative Agents: Interactive Simulacra of Human Behavior , author=. 2023 , eprint=
2023
-
[32]
2023 , eprint=
Voyager: An Open-Ended Embodied Agent with Large Language Models , author=. 2023 , eprint=
2023
-
[33]
2023 , eprint=
Reflexion: Language Agents with Verbal Reinforcement Learning , author=. 2023 , eprint=
2023
-
[34]
2023 , eprint=
Universal and Transferable Adversarial Attacks on Aligned Language Models , author=. 2023 , eprint=
2023
-
[35]
2023 , eprint=
CAMEL: Communicative Agents for Mind Exploration of Large Language Model Society , author=. 2023 , eprint=
2023
-
[36]
2023 , eprint=
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation , author=. 2023 , eprint=
2023
-
[37]
2025 , url =
Anthropic , title =. 2025 , url =
2025
-
[38]
2025 , url =
OpenAI , title =. 2025 , url =
2025
-
[39]
2025 , url =
Google DeepMind , title =. 2025 , url =
2025
-
[40]
2024 , eprint=
STAC: Steering Agents with Control-Flow Aware Jailbreaking , author=. 2024 , eprint=
2024
-
[41]
2024 , archivePrefix=
AgentAuditor: Auditing Large Language Model Agents via Consensus-Based Automatic Jailbreaking , author=. 2024 , archivePrefix=
2024
-
[42]
Technical Report , year=
Safety compliance agents for large-scale evaluation , author=. Technical Report , year=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.