BarrierSteer: LLM Safety via Learning Barrier Steering
Pith reviewed 2026-05-25 06:59 UTC · model grok-4.3
The pith
Treating hidden-state safety classifiers as control barrier functions steers LLM generations away from unsafe outputs at inference time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BarrierSteer embeds learned nonlinear safety constraints into the LLM's latent representation space by treating hidden-state safety classifiers as Control Barrier Functions (CBFs). This enables constraint-guided steering of unsafe latent trajectories during generation, with theoretical guarantees that are conditional on the barriers capturing the intended safety property. Model utility is preserved by merging multiple constraints efficiently, without modifying the model's parameters.
What carries the argument
Control Barrier Functions applied to hidden-state safety classifiers, which enforce that the generation trajectory in latent space remains within safe regions by correcting unsafe directions at each step.
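A minimal sketch of what "correcting unsafe directions at each step" could look like for a single barrier. It assumes the hidden state moves by an additive input u and solves the smallest-norm correction satisfying grad_b(h)^T u + alpha * b(h) >= 0 in closed form. The names `steer_step`, `toy_barrier`, and `toy_barrier_grad`, the linear toy barrier, and the value of `alpha` are all hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

def steer_step(h, barrier, barrier_grad, alpha=1.0):
    """Minimal-norm correction u with grad_b(h)^T u + alpha * b(h) >= 0.

    Closed-form solution of: min ||u||^2  s.t.  grad_b(h) . u >= -alpha * b(h).
    Returns (h + u, u). All names here are hypothetical.
    """
    b = float(barrier(h))
    g = np.asarray(barrier_grad(h), dtype=float)
    required = -alpha * b            # the constraint reads g . u >= required
    g_norm_sq = float(g @ g)
    if required <= 0.0 or g_norm_sq == 0.0:
        # Either u = 0 already satisfies the constraint, or the gradient
        # vanishes and no correction direction is available.
        return h, np.zeros_like(h)
    u = (required / g_norm_sq) * g   # smallest push along the barrier gradient
    return h + u, u

# Toy linear barrier (an assumption for illustration): safe when h[0] <= 1.
def toy_barrier(h):
    return 1.0 - h[0]

def toy_barrier_grad(h):
    return np.array([-1.0, 0.0])

h_unsafe = np.array([3.0, 2.0])
h_new, u = steer_step(h_unsafe, toy_barrier, toy_barrier_grad, alpha=0.5)
print(h_new, u)
```

When the barrier value is already non-negative the function leaves the hidden state untouched, which is one way an inference-time method of this kind can avoid disturbing safe generations.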
If this is right
- Adversarial attack success rates decrease substantially while safe task performance is preserved.
- Multiple safety constraints can be combined without retraining the underlying model.
- The approach applies across different LLM families and datasets at inference time only.
- Computational cost remains low because constraint enforcement occurs in latent space rather than through output sampling or fine-tuning.
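To illustrate why enforcing a small number of constraints in latent space can be cheap, here is a sketch of a smallest-norm correction under two linear constraints, found by enumerating the possible active sets. This is a generic textbook-style approach chosen for illustration. It is not claimed to be the paper's constraint-merging rule or its closed-form solution, and the function name and inputs are hypothetical.

```python
import numpy as np

def min_norm_two_constraints(g1, g2, c1, c2, tol=1e-9):
    """Smallest-norm w with g1 . w >= c1 and g2 . w >= c2.

    Enumerates the four KKT active sets (none, only 1, only 2, both)
    and returns the feasible candidate of least norm.
    """
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    G11, G12, G22 = float(g1 @ g1), float(g1 @ g2), float(g2 @ g2)

    candidates = [np.zeros_like(g1)]                 # no constraint active
    if G11 > tol and c1 > 0:
        candidates.append((c1 / G11) * g1)           # only constraint 1 active
    if G22 > tol and c2 > 0:
        candidates.append((c2 / G22) * g2)           # only constraint 2 active
    det = G11 * G22 - G12 * G12
    if det > tol:                                    # both active: solve 2x2 system
        lam1 = (G22 * c1 - G12 * c2) / det
        lam2 = (G11 * c2 - G12 * c1) / det
        if lam1 >= 0 and lam2 >= 0:
            candidates.append(lam1 * g1 + lam2 * g2)

    feasible = [w for w in candidates
                if g1 @ w >= c1 - tol and g2 @ w >= c2 - tol]
    if not feasible:
        raise ValueError("constraints appear to be infeasible")
    return min(feasible, key=lambda w: float(w @ w))

# Hypothetical gradients and right-hand sides.
w_both = min_norm_two_constraints([1.0, 0.0], [0.0, 1.0], 1.0, 2.0)
w_one = min_norm_two_constraints([1.0, 0.0], [1.0, 1.0], 1.0, 0.5)
print(w_both, w_one)
```

The cost is a handful of dot products and one 2x2 solve per step, independent of how the two constraints were obtained.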
Where Pith is reading between the lines
- The same latent-space steering could apply to other sequence models such as those for code or images if similar classifiers exist.
- Layering this inference-time method with training-time alignment techniques might create redundant safety checks.
- Updating safety rules would require only retraining the classifiers rather than the full model, allowing faster response to new risks.
Load-bearing premise
The learned safety classifiers in the hidden states must actually detect and represent the intended safety properties for the steering guarantees to apply.
What would settle it
A test set of generations where the method is applied yet unsafe outputs still appear at rates similar to the baseline, even after the classifiers were trained on the same safety data.
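The comparison described above can be sketched as a simple attack-success-rate calculation. The flags below are made-up placeholders, not results from the paper, and the helper names are hypothetical. A real evaluation would also need a fixed judge for labeling outputs as unsafe.

```python
def attack_success_rate(unsafe_flags):
    """Fraction of generations flagged unsafe (1/True = attack succeeded)."""
    flags = list(unsafe_flags)
    if not flags:
        raise ValueError("need at least one generation")
    return sum(bool(f) for f in flags) / len(flags)

def asr_reduction(baseline_flags, steered_flags):
    """Return (baseline ASR, steered ASR, absolute reduction)."""
    base = attack_success_rate(baseline_flags)
    steered = attack_success_rate(steered_flags)
    return base, steered, base - steered

# Placeholder labels for the same four prompts, with and without steering.
baseline = [1, 1, 0, 1]
steered = [0, 1, 0, 0]
print(asr_reduction(baseline, steered))
```

A reduction near zero on a held-out attack set, despite classifiers trained on matching safety data, would be the kind of evidence that counts against the claim.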
Original abstract
Despite the strong performance of large language models (LLMs) across diverse tasks, their susceptibility to adversarial attacks and unsafe content generation remains a significant obstacle to deployment, particularly in high-stakes settings. Addressing this challenge requires safety mechanisms that are both practically effective and theoretically grounded. In this paper, we introduce BarrierSteer, a novel inference-time framework that improves response safety by embedding learned nonlinear safety constraints directly into the model's latent representation space. BarrierSteer treats hidden-state safety classifiers as Control Barrier Functions (CBFs), enabling constraint-guided steering of unsafe latent trajectories during generation. By composing multiple safety constraints through efficient constraint merging without modifying the underlying LLM parameters, BarrierSteer preserves model utility. We provide theoretical results showing that applying CBFs in the latent space yields a principled, modular, and computationally efficient approach for steering with respect to learned safety constraints, with guarantees conditional on the learned barriers capturing the intended safety property. Our extensive experimental results across multiple model families and datasets demonstrate that BarrierSteer substantially reduces adversarial attack success rates and unsafe generations, outperforming the existing method. The code is available in our GitHub repository: https://github.com/thanhquangtran/BarrierSteer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces BarrierSteer, an inference-time framework that treats hidden-state safety classifiers as Control Barrier Functions (CBFs) to steer LLM latent trajectories for improved response safety. It claims that this yields a principled, modular, and efficient approach with theoretical guarantees conditional on the learned barriers capturing the intended safety property, while preserving model utility through constraint composition without parameter modification. Extensive experiments across model families and datasets are reported to show substantial reductions in adversarial attack success rates and unsafe generations, outperforming prior methods, with code released.
Significance. If the conditional guarantees hold after verification, the work would provide a modular way to enforce multiple learned safety constraints at inference time. The public GitHub repository supporting reproducibility is a positive contribution. However, the absence of derivation details and verification steps for the CBF conditions limits the immediate impact of the theoretical component.
major comments (2)
- [Abstract] Abstract: the assertion of 'theoretical results' and 'guarantees conditional on the learned barriers capturing the intended safety property' is presented without any derivation details, error analysis, or description of how the conditional guarantees are established.
- [Theoretical results] Theoretical results section: the CBF treatment requires that the learned classifiers satisfy the standard differential inequality (existence of class-K α such that L_f h + L_g h u + α(h) ≥ 0 inside the safe set), but the manuscript describes no training enforcement or post-training verification of this Lie-derivative condition on the latent dynamics.
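For reference, the condition named in the second major comment can be written out as below. This uses the standard control-affine CBF setup with state x and barrier h, which is an assumption about the intended setting. The paper's own notation may differ (its appendix fragments appear to use h for the hidden state and b_k for the barriers).

```latex
% Safe set defined by a barrier h; control-affine dynamics (standard setup, assumed).
\[
  \mathcal{C} = \{\, x : h(x) \ge 0 \,\}, \qquad \dot{x} = f(x) + g(x)\,u .
\]
% h is a control barrier function if, for some class-K function \alpha,
\[
  \sup_{u}\ \bigl[\, L_f h(x) + L_g h(x)\,u + \alpha\bigl(h(x)\bigr) \,\bigr] \ge 0
  \quad \text{for all } x \in \mathcal{C},
\]
% where the Lie derivatives are
\[
  L_f h(x) = \nabla h(x)^{\top} f(x), \qquad L_g h(x) = \nabla h(x)^{\top} g(x).
\]
% Special case of a pure steering input (f = 0, g = I, an assumption):
\[
  \nabla h(x)^{\top} u + \alpha\bigl(h(x)\bigr) \ge 0 .
\]
```

Verifying this inequality for a learned classifier requires access to its gradient and to a model of the latent dynamics, which is why the comment asks how it is enforced or checked.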
minor comments (2)
- [Abstract] Abstract: the claim of 'outperforming the existing method' should name the specific baseline(s) for clarity.
- [Experiments] Experiments: tables reporting attack success rates should include exact numerical values, standard deviations, and the precise list of model families and datasets used.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity on the theoretical components while preserving the conditional nature of the guarantees as originally presented.
Point-by-point responses
- Referee: [Abstract] Abstract: the assertion of 'theoretical results' and 'guarantees conditional on the learned barriers capturing the intended safety property' is presented without any derivation details, error analysis, or description of how the conditional guarantees are established.
  Authors: We agree that the abstract would benefit from additional context. The derivation of the latent-space CBF steering law and the conditional guarantees (under the assumption that the learned classifiers satisfy the barrier conditions) are provided in the Theoretical Results section. In the revised manuscript, we will update the abstract to reference this section explicitly and briefly note that the guarantees are conditional on the classifiers capturing the safety property, with empirical support from the reported experiments.
  Revision: yes
- Referee: [Theoretical results] Theoretical results section: the CBF treatment requires that the learned classifiers satisfy the standard differential inequality (existence of class-K α such that L_f h + L_g h u + α(h) ≥ 0 inside the safe set), but the manuscript describes no training enforcement or post-training verification of this Lie-derivative condition on the latent dynamics.
  Authors: The framework assumes the learned safety classifiers act as candidate barrier functions and derives the control law under the standard CBF condition, which is common practice when the barrier is provided rather than synthesized. We do not enforce the Lie-derivative inequality during classifier training, as the classifiers are trained as safety predictors. We acknowledge the absence of explicit post-training verification steps. In revision, we will add a subsection outlining how the condition can be checked approximately via sampled latent trajectories and note any associated error considerations, while retaining the conditional framing already stated in the paper.
  Revision: partial
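One way the approximate check on sampled latent trajectories could look is sketched below. The rebuttal does not specify a procedure, so this uses a discrete-time analogue of the barrier condition, b(h_{t+1}) >= (1 - gamma) * b(h_t), as an assumption. The function name, `gamma`, and the sample values are hypothetical.

```python
import numpy as np

def discrete_cbf_violation_rate(barrier_values, gamma=0.5, tol=0.0):
    """Fraction of sampled transitions violating b[t+1] >= (1 - gamma) * b[t].

    barrier_values: array of shape (num_trajectories, num_steps) holding the
    classifier's barrier value at each generation step (hypothetical input).
    """
    b = np.asarray(barrier_values, dtype=float)
    if b.ndim != 2 or b.shape[1] < 2:
        raise ValueError("expected shape (num_trajectories, num_steps >= 2)")
    # Slack of the discrete-time condition for every consecutive pair of steps.
    slack = b[:, 1:] - (1.0 - gamma) * b[:, :-1]
    return float((slack < -tol).mean())

# Two made-up trajectories of barrier values.
samples = [[1.0, 0.6, 0.4],
           [1.0, 0.2, 0.3]]
rate = discrete_cbf_violation_rate(samples, gamma=0.5)
print(rate)
```

A sampled check of this kind only bounds the violation rate on the trajectories that were drawn, so it supports rather than replaces the conditional framing of the guarantees.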
Circularity Check
No significant circularity; claims explicitly conditional on external validation of learned barriers
The paper's theoretical results are stated as conditional on the learned barriers capturing the intended safety property (abstract). This conditioning prevents any reduction of the guarantee to the training step by construction. No self-definitional loops, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the derivation chain. Standard CBF theory is applied to latent space under the stated condition; the learning of classifiers is treated as a separate data-driven step whose correctness is not claimed to be proven within the paper. This is the most common honest non-finding for papers that correctly scope their guarantees.
Axiom & Free-Parameter Ledger
free parameters (1)
- parameters of hidden-state safety classifiers
axioms (1)
- domain assumption: hidden states of an LLM during generation can be modeled as a dynamical system to which control barrier functions apply
Reference graph
Works this paper leans on
-
[1]
Anwar, U., Saparov, A., Rando, J., Paleka, D., Turpin, M., Hase, P., Lubana, E. S., Jenner, E., Casper, S., Sourbut, O., et al. Foundational challenges in assuring alignment and safety of large language models. arXiv:2404.09932.
-
[2]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv:2204.05862.
-
[3]
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv:2107.03374.
-
[4]
AI safety in generative AI large language models: A survey
Chua, J., Li, Y., Yang, S., Wang, C., and Yao, L. AI safety in generative AI large language models: A survey. arXiv:2407.18369.
-
[5]
Training Verifiers to Solve Math Word Problems
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv:2110.14168.
-
[6]
Safe RLHF: Safe Reinforcement Learning from Human Feedback
Dai, J., Pan, X., Sun, R., Ji, J., Xu, X., Liu, M., Wang, Y., and Yang, Y. Safe RLHF: Safe reinforcement learning from human feedback. arXiv:2310.12773.
-
[7]
Towards guaranteed safe ai: A framework for ensuring robust and reliable ai systems
Dalrymple, D., Skalse, J., Bengio, Y., Russell, S., Tegmark, M., Seshia, S., Omohundro, S., Szegedy, C., Goldhaber, B., Ammann, N., et al. Towards guaranteed safe AI: A framework for ensuring robust and reliable AI systems. arXiv:2405.06624.
-
[8]
Google. PaLM 2 technical report. arXiv:2305.10403.
-
[9]
Gradient-based adversarial attacks against text transformers
Guo, C., Sablayrolles, A., Jégou, H., and Kiela, D. Gradient-based adversarial attacks against text transformers. arXiv:2104.13733.
-
[10]
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Han, S., Rao, K., Ettinger, A., Jiang, L., Lin, B. Y., Lambert, N., Choi, Y., and Dziri, N. WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. arXiv:2406.18495.
-
[11]
Measuring Massive Multitask Language Understanding
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv:2009.03300.
-
[12]
Hung, C.-C., Rim, W. B., Frost, L., Bruckner, L., and Lawrence, C. Walking a tightrope – evaluating large language models in high-risk domains. arXiv:2311.14966.
-
[13]
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Liu, W., Xiao, W., and Belta, C. Learning robust and correct controllers from signal temporal logic specifications using BarrierNet. In 2023 62nd IEEE Conference on Decision and Control (CDC), pp. 7049–7054. IEEE, 2023a.
Liu, X., Xu, N., Chen, M., and Xiao, C. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. arXiv:2310.0445...
-
[14]
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., and Hendrycks, D. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv:2402.04249, 2024a.
Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., et al. H...
-
[15]
OpenAI. GPT-4 technical report. arXiv:2303.08774.
-
[16]
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. arXiv:2305.18290.
-
[17]
Shen, X., Chen, Z., Backes, M., Shen, Y., and Zhang, Y. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp. 1671–1685, 2024.
-
[18]
Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E., and Singh, S. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. arXiv:2010.15980.
-
[19]
Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 2023.
-
[20]
Reward Constrained Policy Optimization
Tessler, C., Mankowitz, D. J., and Mannor, S. Reward constrained policy optimization. arXiv:1805.11074.
-
[21]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288.
-
[22]
Steering Language Models With Activation Engineering
Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering language models with activation engineering. arXiv:2308.10248.
-
[23]
Universal adversarial triggers for attacking and analyzing NLP
Wallace, E., Feng, S., Kandpal, N., Gardner, M., and Singh, S. Universal adversarial triggers for attacking and analyzing NLP. arXiv:1908.07125.
-
[24]
Guardians and offenders: A survey on harmful content generation and safety mitigation of LLM
Zhang, C., Zhu, C., Xiong, J., Xu, X., Li, L., Liu, Y., and Lu, Z. Guardians and offenders: A survey on harmful content generation and safety mitigation of LLM. arXiv:2508.05775.
-
[25]
A Survey of Large Language Models
Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. A survey of large language models. arXiv:2303.18223.
-
[26]
Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization
Zhou, Z., Liu, J., Shao, J., Yue, X., Yang, C., Ouyang, W., and Qiao, Y. Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 10586–10613, 2024.
-
[27]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv:2307.15043.
-
[28]
A. Closed-form of BARRIERSTEER. In order to efficiently enforce the proposed BARRIERSTEER, we first merge many constraints into no more than two constraints, as shown in (7) in Sec. 4.3. Here, we explicitly show the closed-form solution to the QP (6) following (Luenberger, 1997). We consider the cas...
-
[29]
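[Ch. 3], the unique solution to (10) is given by
\[
  w^{*} = \lambda_1(h_{t-1})\,\hat{g}_1(h_{t-1}) + \lambda_2(h_{t-1})\,\hat{g}_2(h_{t-1}),
\]
where
\[
  \lambda_1(h_{t-1}) =
  \begin{cases}
    0 & \text{if } G_{21}(h_{t-1})\,[\hat{h}_2(h_{t-1})]_{+} - G_{22}(h_{t-1})\,\hat{h}_1(h_{t-1}) < 0, \\[4pt]
    \dfrac{[\hat{h}_1(h_{t-1})]_{+}}{G_{11}(h_{t-1})} & \text{if } G_{12}(h_{t-1})\,[\hat{h}_1(h_{t-1})]_{+} - G_{11}(h_{t-1})\,\hat{h}_2(h_{t-1}) < 0, \\[8pt]
    \dfrac{[\,G_{22}(h_{t-1})\,\hat{h}_1(h_{t-1}) - G_{21}(h_{t-1})\,\hat{h}_2(h_{t-1})\,]_{+}}{G_{11}(h_{t-1})\,G_{22}(h_{t-1}) - G_{12}(h_{t-1})\,G_{21}(h_{t-1})} & \text{otherwise.}
  \end{cases}
\]
λ2(h_{t−1}) = [ĥ2(h_t...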
-
[30]
ensures the state $h$ stays in the safe set $C$ for all $t \ge 0$. If, on the other hand, the initial hidden state $h_0$ is not in the safe set $C$, then $b_k(h_0) \le 0$. We define a new function $V(h_t) = -b_k(h_t)$ with $V(h_0) \ge 0$, and equation (11) can be rewritten as
\[
  \dot{V}(h_t) + \alpha V(h_t) = \nabla V(h_t)^{\top} u + \alpha V(h_t) \le 0. \tag{13}
\]
Suppose we have $\dot{V}(h_t) + \alpha V(h_t) = 0$, the solution to the above equation i...
-
[31]
For the Safety Polytope (SaP) baseline, we implement the framework as specified by the authors, utilizing a Concept Encoder that projects hidden states into a 16,384-dimensional sparse representation space. We fix the number of facets to K = 30, as empirical results suggest diminishing returns beyond this count (Chen et al., 2025). Model-specific hyperpara...
-
[32]
across all considered adversarial attack vectors. To ensure statistical significance, results for learning-based methodologies are reported as averages over 5 independent trials with associated standard deviations. In contrast, prompt-based and heuristic baselines are evaluated deterministically, as their output remains invariant under fixed decoding para...