TraceSafe-Bench reveals that LLM guardrail performance on tool-use trajectories depends more on structural data handling than semantic safety alignment, with general models outperforming specialized ones and accuracy improving over longer trajectories.
The risk occurs when the agent executes a call based on an unverified assumption for these parameters.
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.CR (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
- TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories