TraceSafe-Bench reveals that LLM guardrail performance on tool-use trajectories depends more on structural data handling than semantic safety alignment, with general models outperforming specialized ones and accuracy improving over longer trajectories.
The risk is incurred if the agent calls this tool at all, as it's deceptive.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CR (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories