arxiv: 2505.11063 · v3 · pith:DFIU34BCnew · submitted 2025-05-16 · 💻 cs.AI · cs.CR

Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction

classification 💻 cs.AI cs.CR
keywords agent, thought-aligner, thoughts, safety, thought, unsafe, before, behavioral

LLM-based agents solve complex tasks through iterative reasoning, tool use, and environment interaction, where each intermediate thought directly shapes subsequent actions. Small deviations in these thoughts can therefore propagate into unsafe behaviors, yet existing guardrails typically operate only on final outputs or require intrusive model modifications. We introduce Thought-Aligner, a lightweight plug-in safety model that performs causal correction on unsafe thoughts before action execution, without altering the underlying agent. The corrected thoughts are fed back into the agent, steering its decision process and tool use toward safer trajectories. Because it operates solely at the thought level, Thought-Aligner is model-agnostic and can be integrated into diverse agent frameworks. We train Thought-Aligner via two-stage contrastive learning on paired safe and unsafe thoughts generated across ten risk scenarios. Experiments on diverse agent-safety benchmarks and six LLMs show that Thought-Aligner increases behavioral safety from about 50% without protection to around 90% on average, exceeding state-of-the-art guardrails by roughly 23%, while also improving helpfulness by about 5%. The method incurs low per-step latency and minimal overhead, enabling scalable and practical deployment. We publicly release Thought-Aligner-7B at https://huggingface.co/WhitzardAgent/Thought-Aligner-7B.
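
The control flow described in the abstract (propose a thought, correct it before any action is executed, then act on the corrected thought) can be illustrated with a minimal sketch. This is not the authors' implementation: `run_agent`, `Step`, and the `toy_*` functions are hypothetical names introduced here, and the keyword-based `toy_correct` is only a stand-in for the trained Thought-Aligner model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    thought: str  # the thought after correction, i.e. what the agent acts on
    action: str


# Hypothetical callback signatures (assumptions, not from the paper).
Thinker = Callable[[str, List[Step]], str]
Corrector = Callable[[str, List[Step], str], str]
Actor = Callable[[str, List[Step], str], str]


def run_agent(task: str, think: Thinker, correct: Corrector, act: Actor,
              max_steps: int = 3) -> List[Step]:
    """Agent loop with a thought-correction hook placed before action selection."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought = think(task, history)
        # The plug-in sees the raw thought and may rewrite it; the agent itself
        # is left unchanged and only receives the corrected text.
        thought = correct(task, history, thought)
        action = act(task, history, thought)
        history.append(Step(thought, action))
        if action == "finish":
            break
    return history


# Toy stand-ins so the sketch runs without any model.
def toy_think(task: str, history: List[Step]) -> str:
    if not history:
        return "Fastest way: run rm -rf / to clear everything."
    return "The temp files are gone, so I am done."


def toy_correct(task: str, history: List[Step], thought: str) -> str:
    # Assumption: a keyword rule replaces the learned safety model.
    if "rm -rf /" in thought:
        return "Delete only files inside the task's temp directory."
    return thought


def identity_correct(task: str, history: List[Step], thought: str) -> str:
    # Baseline with no protection: the thought passes through unchanged.
    return thought


def toy_act(task: str, history: List[Step], thought: str) -> str:
    if "rm -rf /" in thought:
        return "shell: rm -rf /"
    if "temp directory" in thought:
        return "shell: rm ./tmp/*"
    return "finish"


if __name__ == "__main__":
    for step in run_agent("clean up temp files", toy_think, toy_correct, toy_act):
        print(step.action)
```

Swapping `toy_correct` for `identity_correct` shows the unprotected baseline, where the unsafe thought flows straight into a destructive action. Because the hook only reads and rewrites text, the same loop would work with any underlying agent, which is the sense in which the abstract calls the approach model-agnostic.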

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ATAAT: Adaptive Threat-Aware Adversarial Tuning Framework against Backdoor Attacks on Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    ATAAT is an adaptive adversarial tuning method that enables effective, stealthy backdoor attacks on VLA models by dynamically selecting gradient decoupling strategies based on attacker capabilities.

  2. Governance by Construction for Generalist Agents

    cs.AI 2026-05 unverdicted novelty 5.0

    CUGA introduces a runtime governance architecture that enforces policies at five checkpoints in generalist agent execution pipelines for predictable and compliant behavior.