BarrierSteer: LLM Safety via Learning Barrier Steering

Arun Verma; Bryan Kian Hsiang Low; Daniela Rus; Kiwan Wong; Thanh Q. Tran; Wei Xiao

arxiv: 2602.20102 · v2 · pith:GS2YOPBInew · submitted 2026-02-23 · 💻 cs.LG · cs.AI

Thanh Q. Tran, Arun Verma, Kiwan Wong, Bryan Kian Hsiang Low, Daniela Rus, Wei Xiao

Pith reviewed 2026-05-25 06:59 UTC · model grok-4.3

keywords: LLM safety, control barrier functions, latent space steering, inference-time intervention, adversarial robustness, safety constraints, generative models

The pith

Treating hidden-state safety classifiers as control barrier functions steers LLM generations away from unsafe outputs at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method that applies mathematical safety constraints directly to the internal hidden states of large language models during text generation. It frames learned safety classifiers as control barrier functions that guide the model's latent trajectory away from unsafe regions without altering the model parameters. Multiple constraints can be merged efficiently, so different safety rules are handled in a modular way. Theoretical results establish steering guarantees, provided the classifiers correctly identify unsafe states. Experiments across model families show lower adversarial attack success rates and fewer unsafe generations than existing approaches.

Core claim

BarrierSteer embeds learned nonlinear safety constraints into the LLM's latent representation space by treating hidden-state safety classifiers as Control Barrier Functions. This enables constraint-guided steering of unsafe latent trajectories during generation, with theoretical guarantees that are conditional on the barriers capturing the intended safety property. Model utility is preserved because multiple constraints are merged efficiently and no model parameters are modified.

What carries the argument

Control Barrier Functions applied to hidden-state safety classifiers, which enforce that the generation trajectory in latent space remains within safe regions by correcting unsafe directions at each step.
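
The correction step described here can be sketched for a single barrier. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `cbf_steer`, the decay rate `gamma`, the discrete-time form of the barrier condition, and the toy affine barrier are all hypothetical. The sketch takes the unsteered next hidden state `h_nom` and applies the smallest-norm shift that satisfies a first-order (linearised) barrier condition.

```python
import numpy as np

def cbf_steer(h_prev, h_nom, b, grad_b, gamma=0.5):
    """Return a steered hidden state near h_nom.

    Enforces the linearised discrete-time barrier condition
        b(h_nom) + grad_b(h_nom) @ w >= (1 - gamma) * b(h_prev)
    with the smallest-norm correction w (a projection onto a half-space).
    """
    g = grad_b(h_nom)
    shortfall = (1.0 - gamma) * b(h_prev) - b(h_nom)
    if shortfall <= 0.0:
        return h_nom.copy()          # condition already holds, leave the state alone
    denom = float(g @ g)
    if denom == 0.0:
        return h_nom.copy()          # zero gradient: no first-order direction to steer along
    return h_nom + (shortfall / denom) * g


# Toy affine barrier (hypothetical): safe when the first coordinate is >= -1.
a = np.array([1.0, 0.0])
b = lambda h: float(a @ h) + 1.0
grad_b = lambda h: a

h_prev = np.array([0.0, 0.0])        # b = 1.0, inside the safe set
h_nom = np.array([-3.0, 2.0])        # b = -2.0, the unsteered update leaves the safe set
h_new = cbf_steer(h_prev, h_nom, b, grad_b, gamma=0.5)
```

For an affine barrier the linearisation is exact, so the steered state meets the condition with equality. For a nonlinear learned classifier the same step is only a first-order approximation, and the paper's guarantees remain conditional on the barrier itself being correct.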

If this is right

Adversarial attack success rates decrease substantially while safe task performance is preserved.
Multiple safety constraints can be combined without retraining the underlying model.
The approach applies across different LLM families and datasets at inference time only.
Computational cost remains low because constraint enforcement occurs in latent space rather than through output sampling or fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same latent-space steering could apply to other sequence models such as those for code or images if similar classifiers exist.
Layering this inference-time method with training-time alignment techniques might create redundant safety checks.
Updating safety rules would require only retraining the classifiers rather than the full model, allowing faster response to new risks.

Load-bearing premise

The learned safety classifiers in the hidden states must actually detect and represent the intended safety properties for the steering guarantees to apply.

What would settle it

A test set of generations where the method is applied yet unsafe outputs still appear at rates similar to the baseline, even after the classifiers were trained on the same safety data.

Figures

Figures reproduced from arXiv: 2602.20102 by Arun Verma, Bryan Kian Hsiang Low, Daniela Rus, Kiwan Wong, Thanh Q. Tran, Wei Xiao.

**Figure 1.** BARRIERSTEER for Safe LLMs. This method efficiently steers the hidden states of LLMs within nonlinear safe sets learned from demonstrations, thereby ensuring the generation of safe language responses at inference time.

**Figure 2.** Overview of BARRIERSTEER for safe LLM generation. BARRIERSTEER is a three-stage pipeline: (i) extracting intermediate latent representations from a pre-trained LLM and constructing an LLM-specific safety dataset with binary safety labels; (ii) learning expressive, non-linear safety constraints in the latent space; and (iii) enforcing these constraints at inference time via CBF-based steering to pr…

Original abstract

Despite the strong performance of large language models (LLMs) across diverse tasks, their susceptibility to adversarial attacks and unsafe content generation remains a significant obstacle to deployment, particularly in high-stakes settings. Addressing this challenge requires safety mechanisms that are both practically effective and theoretically grounded. In this paper, we introduce BarrierSteer, a novel inference-time framework that improves response safety by embedding learned nonlinear safety constraints directly into the model's latent representation space. BarrierSteer treats hidden-state safety classifiers as Control Barrier Functions (CBFs), enabling constraint-guided steering of unsafe latent trajectories during generation. By composing multiple safety constraints through efficient constraint merging without modifying the underlying LLM parameters, BarrierSteer preserves model utility. We provide theoretical results showing that applying CBFs in the latent space yields a principled, modular, and computationally efficient approach for steering with respect to learned safety constraints, with guarantees conditional on the learned barriers capturing the intended safety property. Our extensive experimental results across multiple model families and datasets demonstrate that BarrierSteer substantially reduces adversarial attack success rates and unsafe generations, outperforming the existing method. The code is available in our GitHub repository: https://github.com/thanhquangtran/BarrierSteer

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BarrierSteer, an inference-time framework that treats hidden-state safety classifiers as Control Barrier Functions (CBFs) to steer LLM latent trajectories for improved response safety. It claims that this yields a principled, modular, and efficient approach with theoretical guarantees conditional on the learned barriers capturing the intended safety property, while preserving model utility through constraint composition without parameter modification. Extensive experiments across model families and datasets are reported to show substantial reductions in adversarial attack success rates and unsafe generations, outperforming prior methods, with code released.

Significance. If the conditional guarantees hold after verification, the work would provide a modular way to enforce multiple learned safety constraints at inference time. The public GitHub repository supporting reproducibility is a positive contribution. However, the absence of derivation details and verification steps for the CBF conditions limits the immediate impact of the theoretical component.

major comments (2)

[Abstract] Abstract: the assertion of 'theoretical results' and 'guarantees conditional on the learned barriers capturing the intended safety property' is presented without any derivation details, error analysis, or description of how the conditional guarantees are established.
[Theoretical results] Theoretical results section: the CBF treatment requires that the learned classifiers satisfy the standard differential inequality (existence of class-K α such that L_f h + L_g h u + α(h) ≥ 0 inside the safe set), but the manuscript describes no training enforcement or post-training verification of this Lie-derivative condition on the latent dynamics.
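
For reference, the condition the referee invokes can be written out in full. This is the textbook definition for a control-affine system, stated with b for the barrier and h for the hidden state (the notation of the appendix excerpts quoted in the reference graph) instead of the referee's h for the barrier. Whether the paper uses exactly this form, and the reduction to single-integrator latent dynamics in the last line, are assumptions inferred from those excerpts.

```latex
% Safe set and control-affine latent dynamics (assumed form)
\mathcal{C} = \{\, h : b(h) \ge 0 \,\}, \qquad \dot h = f(h) + g(h)\,u

% b is a control barrier function if there is a class-K function \alpha with
\sup_{u}\,\bigl[\, L_f b(h) + L_g b(h)\,u + \alpha\bigl(b(h)\bigr) \,\bigr] \ge 0
\quad \text{for all } h \in \mathcal{C},
\qquad L_f b = \nabla b^\top f, \quad L_g b = \nabla b^\top g

% Special case f = 0, g = I, \alpha(s) = \alpha s (single-integrator latent dynamics)
\nabla b(h)^\top u + \alpha\, b(h) \ge 0
```

In the special case the steering input u enters directly, so the condition can always be met wherever the gradient of b is nonzero. The referee's concern is the separate question of whether a classifier trained only as a safety predictor yields a b for which this set actually corresponds to safe generations.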

minor comments (2)

[Abstract] Abstract: the claim of 'outperforming the existing method' should name the specific baseline(s) for clarity.
[Experiments] Experiments: tables reporting attack success rates should include exact numerical values, standard deviations, and the precise list of model families and datasets used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity on the theoretical components while preserving the conditional nature of the guarantees as originally presented.

Referee: [Abstract] Abstract: the assertion of 'theoretical results' and 'guarantees conditional on the learned barriers capturing the intended safety property' is presented without any derivation details, error analysis, or description of how the conditional guarantees are established.

Authors: We agree that the abstract would benefit from additional context. The derivation of the latent-space CBF steering law and the conditional guarantees (under the assumption that the learned classifiers satisfy the barrier conditions) are provided in the Theoretical Results section. In the revised manuscript, we will update the abstract to reference this section explicitly and briefly note that the guarantees are conditional on the classifiers capturing the safety property, with empirical support from the reported experiments.

revision: yes

Referee: [Theoretical results] Theoretical results section: the CBF treatment requires that the learned classifiers satisfy the standard differential inequality (existence of class-K α such that L_f h + L_g h u + α(h) ≥ 0 inside the safe set), but the manuscript describes no training enforcement or post-training verification of this Lie-derivative condition on the latent dynamics.

Authors: The framework assumes the learned safety classifiers act as candidate barrier functions and derives the control law under the standard CBF condition, as is usual when the barrier is given rather than synthesized. We do not enforce the Lie-derivative inequality during classifier training, as the classifiers are trained as safety predictors. We acknowledge the absence of explicit post-training verification steps. In revision, we will add a subsection outlining how the condition can be checked approximately via sampled latent trajectories and note any associated error considerations, while retaining the conditional framing already stated in the paper.

revision: partial
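
One way such an approximate check over sampled latent trajectories might look is sketched below. It is an illustration only: the function name `cbf_violation_rate`, the discrete-time form of the condition, the rate `gamma`, and the toy barrier are assumptions, not anything described in the paper or the rebuttal. The check counts how often a step that starts inside the safe set fails the decay condition b(h_{t+1}) >= (1 - gamma) * b(h_t).

```python
import numpy as np

def cbf_violation_rate(trajs, b, gamma=0.5):
    """Empirical check of a discrete-time barrier condition.

    trajs: array of shape (n_traj, n_steps, dim) holding sampled hidden states.
    b:     vectorised barrier mapping (..., dim) -> (...).
    Returns (fraction of checked steps that violate the condition, worst margin).
    """
    vals = b(trajs)                          # (n_traj, n_steps)
    prev, nxt = vals[:, :-1], vals[:, 1:]
    margin = nxt - (1.0 - gamma) * prev      # >= 0 means the condition holds
    inside = prev >= 0.0                     # only steps that start in the safe set
    if not inside.any():
        return 0.0, float("nan")
    checked = margin[inside]
    return float((checked < 0.0).mean()), float(checked.min())


# Toy data (hypothetical): the barrier is the first coordinate of a 1-D state.
b = lambda h: h[..., 0]
trajs = np.array([
    [[1.0], [0.6], [0.4]],   # decays slowly enough: both steps pass
    [[1.0], [0.2], [0.2]],   # first step drops too fast: one violation
])
rate, worst = cbf_violation_rate(trajs, b, gamma=0.5)
```

A low violation rate on sampled trajectories is evidence, not proof. It says nothing about unsampled regions of latent space, which is why the conditional framing of the guarantees still matters.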

Circularity Check

0 steps flagged

No significant circularity; claims explicitly conditional on external validation of learned barriers

The paper's theoretical results are stated as conditional on the learned barriers capturing the intended safety property (abstract). This conditioning prevents any reduction of the guarantee to the training step by construction. No self-definitional loops, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the derivation chain. Standard CBF theory is applied to latent space under the stated condition; the learning of classifiers is treated as a separate data-driven step whose correctness is not claimed to be proven within the paper. This is the most common honest non-finding for papers that correctly scope their guarantees.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach depends on treating learned hidden-state classifiers as valid CBFs and on the assumption that those classifiers correctly encode safety; no new physical entities are introduced.

free parameters (1)

parameters of hidden-state safety classifiers
Classifiers are learned from data to serve as nonlinear safety constraints; their parameters are fitted rather than derived from first principles.

axioms (1)

domain assumption: Hidden states of an LLM during generation can be modeled as a dynamical system to which control barrier functions apply.
The paper directly states that it treats hidden-state safety classifiers as CBFs for steering latent trajectories.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 16 internal anchors

[1] Anwar, U., Saparov, A., Rando, J., Paleka, D., Turpin, M., Hase, P., Lubana, E. S., Jenner, E., Casper, S., Sourbut, O., et al. Foundational challenges in assuring alignment and safety of large language models. arXiv:2404.09932.

[2] Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv:2204.05862.

[3] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv:2107.03374.

[4] Chua, J., Li, Y., Yang, S., Wang, C., and Yao, L. AI safety in generative AI large language models: A survey. arXiv:2407.18369.

[5] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv:2110.14168.

[6] Dai, J., Pan, X., Sun, R., Ji, J., Xu, X., Liu, M., Wang, Y., and Yang, Y. Safe RLHF: Safe reinforcement learning from human feedback. arXiv:2310.12773.

[7] Dalrymple, D., Skalse, J., Bengio, Y., Russell, S., Tegmark, M., Seshia, S., Omohundro, S., Szegedy, C., Goldhaber, B., Ammann, N., et al. Towards guaranteed safe AI: A framework for ensuring robust and reliable AI systems. arXiv:2405.06624.

[8] Google. PaLM 2 technical report. arXiv:2305.10403.

[9] Guo, C., Sablayrolles, A., Jégou, H., and Kiela, D. Gradient-based adversarial attacks against text transformers. arXiv:2104.13733.

[10] Han, S., Rao, K., Ettinger, A., Jiang, L., Lin, B. Y., Lambert, N., Choi, Y., and Dziri, N. WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. arXiv:2406.18495.

[11] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv:2009.03300.

[12] Hung, C.-C., Rim, W. B., Frost, L., Bruckner, L., and Lawrence, C. Walking a tightrope – evaluating large language models in high-risk domains. arXiv:2311.14966.

[13] Liu, W., Xiao, W., and Belta, C. Learning robust and correct controllers from signal temporal logic specifications using BarrierNet. In 2023 62nd IEEE Conference on Decision and Control (CDC), pp. 7049–7054. IEEE, 2023a. Liu, X., Xu, N., Chen, M., and Xiao, C. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. arXiv:2310.0445...

[14] Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., and Hendrycks, D. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv:2402.04249, 2024a. Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., et al. H...

[15] OpenAI. GPT-4 technical report. arXiv:2303.08774.

[16] Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. arXiv:2305.18290.

[17] Shen, X., Chen, Z., Backes, M., Shen, Y., and Zhang, Y. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp. 1671–1685.

[18] Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E., and Singh, S. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. arXiv:2010.15980.

[19] Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html

[20] Tessler, C., Mankowitz, D. J., and Mannor, S. Reward constrained policy optimization. arXiv:1805.11074.

[21] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288.

[22] Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering language models with activation engineering. arXiv:2308.10248.

[23] Wallace, E., Feng, S., Kandpal, N., Gardner, M., and Singh, S. Universal adversarial triggers for attacking and analyzing NLP. arXiv:1908.07125.

[24] Zhang, C., Zhu, C., Xiong, J., Xu, X., Li, L., Liu, Y., and Lu, Z. Guardians and offenders: A survey on harmful content generation and safety mitigation of LLM. arXiv:2508.05775.

[25] Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. A survey of large language models. arXiv:2303.18223.

[26] Zhou, Z., Liu, J., Shao, J., Yue, X., Yang, C., Ouyang, W., and Qiao, Y. Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 10586–10613.

[27] Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv:2307.15043.

[28] A. Closed-form of BARRIERSTEER. In order to efficiently enforce the proposed BARRIERSTEER, we first merge many constraints into no more than two constraints, as shown in (7) in Sec. 4.3. Here, we explicitly show the closed-form solution to the QP (6) following (Luenberger, 1997). We consider the cas...
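[29] [Ch. 3], the unique solution to (10) is given by $w^* = \lambda_1(h_{t-1})\,\hat g_1(h_{t-1}) + \lambda_2(h_{t-1})\,\hat g_2(h_{t-1})$, where

$$
\lambda_1(h_{t-1}) =
\begin{cases}
0 & \text{if } G_{21}(h_{t-1})\,[\hat h_2(h_{t-1})]_+ - G_{22}(h_{t-1})\,\hat h_1(h_{t-1}) < 0, \\[4pt]
\dfrac{[\hat h_1(h_{t-1})]_+}{G_{11}(h_{t-1})} & \text{if } G_{12}(h_{t-1})\,[\hat h_1(h_{t-1})]_+ - G_{11}(h_{t-1})\,\hat h_2(h_{t-1}) < 0, \\[4pt]
\dfrac{[G_{22}(h_{t-1})\,\hat h_1(h_{t-1}) - G_{21}(h_{t-1})\,\hat h_2(h_{t-1})]_+}{G_{11}(h_{t-1})\,G_{22}(h_{t-1}) - G_{12}(h_{t-1})\,G_{21}(h_{t-1})} & \text{otherwise,}
\end{cases}
$$

λ2(h_{t−1}) = { [ĥ2(h_t...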
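
A two-constraint minimum-norm problem of the kind the excerpt above solves in closed form can also be handled by checking the four possible active sets directly. The sketch below is a generic KKT enumeration, not the paper's formula: it assumes the constraints have the form g_i · w >= h_i, and the paper's own sign conventions and case conditions may differ. The function name `min_norm_two_constraints` and the example vectors are hypothetical.

```python
import numpy as np

def min_norm_two_constraints(g1, g2, h1, h2, tol=1e-9):
    """Solve  min 0.5 * ||w||^2  s.t.  g1 @ w >= h1,  g2 @ w >= h2.

    Tries each active set (none, {1}, {2}, {1, 2}) and returns the first
    candidate w = lam[0] * g1 + lam[1] * g2 that satisfies the KKT conditions.
    The objective is strictly convex, so a KKT point is the unique minimiser.
    """
    gs = np.stack([g1, g2]).astype(float)
    G = gs @ gs.T                                   # 2x2 Gram matrix, G[i, j] = g_i . g_j
    h = np.array([h1, h2], dtype=float)

    candidates = [np.zeros(2)]                      # no constraint active
    if G[0, 0] > tol:
        candidates.append(np.array([h[0] / G[0, 0], 0.0]))   # only constraint 1 active
    if G[1, 1] > tol:
        candidates.append(np.array([0.0, h[1] / G[1, 1]]))   # only constraint 2 active
    det = G[0, 0] * G[1, 1] - G[0, 1] * G[1, 0]
    if abs(det) > tol:
        candidates.append(np.linalg.solve(G, h))    # both constraints active

    for lam in candidates:
        w = lam @ gs
        # KKT check: non-negative multipliers and both constraints satisfied
        if np.all(lam >= -tol) and np.all(gs @ w >= h - tol):
            return w, lam
    raise ValueError("constraints appear infeasible or degenerate")


w_a, lam_a = min_norm_two_constraints(np.array([1.0, 0.0]), np.array([1.0, 1.0]), 1.0, 3.0)
w_b, lam_b = min_norm_two_constraints(np.array([1.0, 0.0]), np.array([0.0, 1.0]), 1.0, 2.0)
```

In the first call only the second constraint ends up active; in the second call both are. With at most two constraints the work per step is a handful of dot products and one 2x2 solve, which is consistent with the page's point that merging constraints keeps enforcement cheap.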
[30] ensures the state $h$ stays in the safe set $C$ for all $t \ge 0$. If, on the other hand, the initial hidden state $h_0$ is not in the safe set $C$, then $b_k(h_0) \le 0$. We define a new function $V(h_t) = -b_k(h_t)$ with $V(h_0) \ge 0$, and equation (11) can be rewritten as

$$\dot V(h_t) + \alpha V(h_t) = \nabla V(h_t)^\top u + \alpha V(h_t) \le 0. \tag{13}$$

Suppose we have $\dot V(h_t) + \alpha V(h_t) = 0$; the solution to the above equation i...
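
The excerpt above breaks off before its conclusion. A standard way to finish an argument of this form is the comparison step below. It is offered as a sketch and not as the paper's own text, and it assumes α > 0 is a constant and that inequality (13) holds along the whole steered trajectory.

```latex
% Equality case: a linear ODE with an explicit solution
\dot V(h_t) + \alpha V(h_t) = 0 \;\Longrightarrow\; V(h_t) = V(h_0)\,e^{-\alpha t}

% Inequality case (13), by the comparison lemma
\dot V(h_t) + \alpha V(h_t) \le 0 \;\Longrightarrow\; V(h_t) \le V(h_0)\,e^{-\alpha t}

% Substituting V = -b_k
b_k(h_t) \;\ge\; b_k(h_0)\,e^{-\alpha t} \;\longrightarrow\; 0 \quad \text{as } t \to \infty
```

Under these assumptions a trajectory that starts outside the safe set, where b_k(h_0) <= 0, has its barrier value bounded below by a quantity that rises to zero exponentially, so the state is driven toward the safe set at rate α.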
[31] For the Safety Polytope (SaP) baseline, we implement the framework as specified by the authors, utilizing a Concept Encoder that projects hidden states into a 16,384-dimensional sparse representation space. We fix the number of facets to K = 30, as empirical results suggest diminishing returns beyond this count (Chen et al., 2025). Model-specific hyperpara...

[32] across all considered adversarial attack vectors. To ensure statistical significance, results for learning-based methodologies are reported as averages over 5 independent trials with associated standard deviations. In contrast, prompt-based and heuristic baselines are evaluated deterministically, as their output remains invariant under fixed decoding para...
