Recognition: 2 theorem links · Lean Theorem
Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering
Pith reviewed 2026-05-10 18:57 UTC · model grok-4.3
The pith
Gradient-Controlled Decoding uses dual anchor tokens to cut false refusals in LLM safety filters while blocking more attacks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that Gradient-Controlled Decoding provides a training-free safety guardrail by using dual anchors, an acceptance token 'Sure' and a refusal token 'Sorry', to tighten the classification boundary during gradient evaluation of the prompt. When a prompt is deemed risky, the method injects one or two refusal tokens before resuming normal autoregressive generation, which guarantees that the output cannot begin with harmful content irrespective of the sampling strategy employed. Experiments demonstrate that this dual-anchor approach reduces false positives by 52% relative to the single-anchor GradSafe baseline while maintaining comparable recall on ToxicChat, XSTest-v2, and AdvBench.
What carries the argument
Dual-anchor gradient steering, where gradients guide evaluation toward both an acceptance token 'Sure' and a refusal token 'Sorry' to refine safety classification before any generation begins.
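To make the mechanism concrete, here is a minimal sketch of what dual-anchor gradient scoring could look like in a GradSafe-style setup, assuming PyTorch and Hugging Face transformers. The probe parameter (the LM head weight), the cosine comparison against reference gradients from known-unsafe prompts, and the averaged combination of the two anchors are all assumptions; the exact formulation is not given in the material quoted here.

```python
import torch
import torch.nn.functional as F

def anchor_gradient(model, tok, prompt: str, anchor: str) -> torch.Tensor:
    """Gradient of the loss for emitting `anchor` as the first response
    token, taken w.r.t. a fixed probe parameter (here the LM head, an
    assumed choice)."""
    model.zero_grad()
    ids = tok(prompt, return_tensors="pt").input_ids
    anchor_id = tok(anchor, add_special_tokens=False).input_ids[0]
    logits = model(ids).logits[0, -1]  # next-token logits after the prompt
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([anchor_id]))
    loss.backward()
    return model.lm_head.weight.grad.detach().flatten().clone()

def risk_score(model, tok, prompt: str,
               ref_sure: torch.Tensor, ref_sorry: torch.Tensor) -> float:
    """Similarity of the prompt's anchor gradients to reference gradients
    precomputed from known-unsafe prompts; averaging the two anchors is an
    assumed combination rule."""
    g_sure = anchor_gradient(model, tok, prompt, "Sure")
    g_sorry = anchor_gradient(model, tok, prompt, "Sorry")
    return 0.5 * (F.cosine_similarity(g_sure, ref_sure, dim=0)
                  + F.cosine_similarity(g_sorry, ref_sorry, dim=0)).item()
```

The reference gradients would be computed once offline, which is consistent with the paper's claim that only 20 demonstration templates are needed.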
If this is right
- Reduces false positives by 52% versus GradSafe at comparable recall on ToxicChat, XSTest-v2, and AdvBench.
- Lowers attack success rate by up to 10% compared to the strongest decoding-only baseline.
- Adds 15-20 ms of average latency on V100 hardware.
- Transfers successfully to LLaMA-2-7B, Mixtral-8x7B, and Qwen-2-7B using only 20 demonstration templates.
- Guarantees first-token safety through preset refusal injection, independent of later sampling choices (see the sketch after this list).
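As an illustration of why the first-token guarantee holds, here is a minimal sketch of the mitigation stage, assuming a Hugging Face-style generate API. Appending the refusal tokens to the prompt before decoding resumes means the response necessarily opens with them under any sampling configuration; the prefix wording follows the abstract ("Sorry, I can't..."), and defaulting to two injected tokens is an assumption.

```python
import torch

def generate_with_refusal_prefix(model, tok, prompt: str, flagged: bool,
                                 n_refusal_tokens: int = 2,
                                 **sampling_kwargs) -> str:
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    ids = prompt_ids
    if flagged:
        # Preset injection: force the response to open with refusal tokens
        # before any sampling happens.
        refusal_ids = tok("Sorry, I can't", add_special_tokens=False,
                          return_tensors="pt").input_ids[:, :n_refusal_tokens]
        ids = torch.cat([ids, refusal_ids], dim=-1)
    out = model.generate(ids, max_new_tokens=128, **sampling_kwargs)
    # Everything after the original prompt is the response; when flagged,
    # it begins with the injected refusal tokens by construction.
    return tok.decode(out[0, prompt_ids.shape[-1]:], skip_special_tokens=True)
```

Note that the guarantee is structural: when the prompt is flagged, the sampler never gets to choose the first response token.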
Where Pith is reading between the lines
- The low template requirement could enable rapid safety adaptation when new models are released without full retraining.
- Similar dual-anchor steering might extend to controlling other generation properties such as factual accuracy using different target tokens.
- Layering this guardrail with existing alignment methods could produce compounded safety gains at low added cost.
Load-bearing premise
The specific choice of anchor tokens Sure and Sorry combined with preset injection of refusal prefixes will reliably prevent emission of harmful content regardless of sampling strategy, model, and prompt distribution.
What would settle it
A test case where an adversarial prompt still produces harmful output after the refusal prefix injection on one of the evaluated models, such as LLaMA-2-7B.
original abstract
Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently over-refuse benign queries and degrade user experience. Previous work on jailbreak and prompt-injection detection, such as GradSafe, detects unsafe prompts with a single "accept all" anchor token, but its threshold is brittle and it offers no deterministic guarantee that harmful content will not be emitted once decoding begins. We introduce Gradient-Controlled Decoding (GCD), a training-free guardrail that combines an acceptance anchor token ("Sure") and a refusal anchor token ("Sorry"), tightening the decision boundary and significantly lowering false positives. In the mitigation stage, if a prompt is flagged, GCD preset-injects one or two refusal tokens ("Sorry, I can't...") before autoregressive decoding resumes, guaranteeing first-token safety regardless of sampling strategy. On ToxicChat, XSTest-v2, and AdvBench, GCD reduces false positives by 52% vs. GradSafe at comparable recall, lowers attack success rate by up to 10% vs. the strongest decoding-only baseline, adds 15-20 ms of latency on average on V100 instances, transfers to LLaMA-2-7B, Mixtral-8x7B, and Qwen-2-7B, and requires only 20 demonstration templates.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Gradient-Controlled Decoding (GCD), a training-free guardrail for LLMs that uses dual anchor tokens ('Sure' for acceptance and 'Sorry' for refusal) to tighten the decision boundary for unsafe prompt detection, improving on single-anchor methods like GradSafe. When a prompt is flagged, GCD injects one or two refusal prefix tokens ('Sorry, I can't...') before resuming autoregressive decoding to guarantee first-token safety independent of sampling. On ToxicChat, XSTest-v2, and AdvBench, it reports a 52% reduction in false positives versus GradSafe at comparable recall, up to 10% lower attack success rate versus the strongest decoding-only baseline, 15-20 ms average added latency on V100, transfer to LLaMA-2-7B/Mixtral-8x7B/Qwen-2-7B, and effectiveness with only 20 demonstration templates.
Significance. If the empirical results and safety guarantee are substantiated, GCD would offer a lightweight, deployable engineering solution for balancing LLM safety against over-refusal, with the training-free design and cross-model transferability as notable strengths. The dual-anchor approach directly targets brittleness in prior gradient-based detectors while adding deterministic prefix injection for early safety.
major comments (2)
- Abstract: the central claim that preset injection of refusal prefixes 'guarantees first-token safety regardless of sampling strategy' is load-bearing for the safety contribution, yet the manuscript provides no ablations or results on post-injection continuations under high-temperature sampling, nucleus sampling, or prompts designed to elicit harmful follow-through after the injected prefix.
- Abstract: the reported 52% false-positive reduction versus GradSafe at 'comparable recall' is presented without the underlying recall values, dataset statistics, error bars, or threshold-selection procedure, preventing assessment of whether the dual-anchor tightening actually improves the operating point or merely shifts it.
minor comments (2)
- The description of how gradients are computed and combined with the two anchors during decoding lacks an explicit algorithmic outline or pseudocode, which would aid reproducibility; a hedged outline follows this list.
- The paper could clarify the exact criteria for injecting one versus two refusal tokens and whether this choice is deterministic or prompt-dependent.
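In the spirit of these comments, a hedged end-to-end outline, composed from the detection and mitigation sketches above, might look as follows. The threshold tau and the one-versus-two-token choice are deliberately left as explicit parameters because the quoted material does not state the paper's criteria.

```python
def gcd_guarded_generate(model, tok, prompt: str, ref_sure, ref_sorry,
                         tau: float = 0.5, n_refusal_tokens: int = 2,
                         **sampling_kwargs) -> str:
    # Phase 1: gradient-based detection with dual anchors (assumed scoring
    # rule; see risk_score above).
    flagged = risk_score(model, tok, prompt, ref_sure, ref_sorry) > tau
    # Phase 2: preset refusal injection, then normal autoregressive decoding.
    return generate_with_refusal_prefix(model, tok, prompt, flagged,
                                        n_refusal_tokens=n_refusal_tokens,
                                        **sampling_kwargs)
```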
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and outline revisions that will strengthen the clarity and substantiation of our claims without altering the core contributions.
point-by-point responses
- Referee: Abstract: the central claim that preset injection of refusal prefixes 'guarantees first-token safety regardless of sampling strategy' is load-bearing for the safety contribution, yet the manuscript provides no ablations or results on post-injection continuations under high-temperature sampling, nucleus sampling, or prompts designed to elicit harmful follow-through after the injected prefix.
Authors: The manuscript's safety claim is narrowly scoped to first-token safety: by deterministically injecting the refusal prefix tokens before autoregressive decoding begins, the initial output token is guaranteed to be a refusal token irrespective of any subsequent sampling strategy. This prevents the model from emitting harmful content at the very start of generation. We agree that empirical validation of continuation behavior would further support the practical utility of the approach. In the revised manuscript we will add targeted ablations that measure attack success rate and safety metrics on post-injection generations under high-temperature sampling, nucleus sampling, and prompts engineered to override the prefix, thereby addressing the concern directly (a sketch of such an ablation follows these responses).
Revision: yes
- Referee: Abstract: the reported 52% false-positive reduction versus GradSafe at 'comparable recall' is presented without the underlying recall values, dataset statistics, error bars, or threshold-selection procedure, preventing assessment of whether the dual-anchor tightening actually improves the operating point or merely shifts it.
Authors: The full manuscript reports the underlying recall values, per-dataset statistics, error bars across runs, and the threshold-selection procedure (based on validation-set tuning to match recall) in the experimental section and associated tables. We acknowledge that the abstract would benefit from greater transparency on these points to allow immediate evaluation of the operating-point improvement. We will revise the abstract to include the key recall figures and a concise description of the threshold methodology.
Revision: yes
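A minimal sketch of what such a continuation ablation could look like, reusing the hypothetical helpers sketched earlier; the sampling grid and the is_harmful judge are placeholders, not the paper's protocol.

```python
# Hypothetical post-injection ablation: force the refusal prefix, then
# measure harmful follow-through under aggressive sampling settings.
sampling_grid = [
    {"do_sample": True, "temperature": 1.5},  # high-temperature sampling
    {"do_sample": True, "top_p": 0.9},        # nucleus sampling
]

def continuation_attack_rate(model, tok, adversarial_prompts, is_harmful) -> float:
    """Fraction of (prompt, sampling-setting) pairs whose continuation is
    judged harmful despite the injected refusal prefix."""
    failures = 0
    for prompt in adversarial_prompts:
        for kwargs in sampling_grid:
            text = generate_with_refusal_prefix(model, tok, prompt,
                                                flagged=True, **kwargs)
            failures += int(is_harmful(text))
    return failures / (len(adversarial_prompts) * len(sampling_grid))
```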
Circularity Check
No circularity: explicit engineering construction with independent empirical claims
full rationale
The paper defines GCD as a training-free procedure that selects fixed lexical anchors ('Sure', 'Sorry') and injects refusal prefixes before resuming autoregressive decoding. No equations, fitted parameters, or predictions are introduced whose values are derived from the same evaluation data used to report improvements on ToxicChat, XSTest-v2, or AdvBench. The central claims rest on the observable behavior of the chosen anchors and injection rule under the stated sampling conditions, which are externally verifiable and not tautological. The citation to GradSafe is to prior independent work and does not supply the load-bearing justification for the dual-anchor design or the reported false-positive reduction. The method therefore contains no self-definitional, fitted-input, or self-citation-load-bearing steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- Anchor tokens
- Demonstration templates
axioms (1)
- domain assumption: LLM next-token distributions can be meaningfully steered by initial tokens and gradient signals.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tagged unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
Passage: "We introduce Gradient-Controlled Decoding (GCD), a training-free guardrail that combines an acceptance anchor token (Sure) and refusal anchor token (Sorry) tightening the decision boundary"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat (tagged unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
Passage: "GCD preset-injects one or two refusal tokens (Sorry, I can't...) before autoregressive decoding resumes, guaranteeing first-token safety"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering
Introduction The widespread adoption of large language models (LLMs) in various applications has amplified concerns about adversarial manipulations like prompt injection and jailbreaks (Carlini et al., 2023; Zou et al., 2023). Existing safety pipelines, whether fine-tuning models on refusal corpora or using rule-based filters, rely on static guardrails. Thes...
arXiv 2023
- [2] "Sure" and
Gradient-Controlled Decoding This section introduces the notations and background concepts upon which the rest of the paper builds. We divide our approach in two phases: first is the gradient-based detection, and second is controlled decoding based on these detection outputs. 2.1. Gradient based detection To identify safety-critical parameters, we follow...
- [3] "Sure" (compliance) and "Sorry"
Results and Analysis 3.1. Main Results Throughout this section, over-refusal refers to the rate at which a guardrail incorrectly blocks a benign query, measured as the False Positive Rate (FP%). Figure 2: Precision-Recall curves for the "Sure" (compliance) and "Sorry" (refusal) gradient anchors on ToxicChat (left pair) and XSTest (right pair). The operating point m...
2024
- [4] "Our method ensures that safe prompts are accurately identified, enhancing both the reliability and user experience of LLMs"
Conclusion This study introduces a significant improvement in the safety mechanisms of large language models (LLMs) by effectively reducing false positives (FPs) in prompt classification. Our method ensures that safe prompts are accurately identified, enhancing both the reliability and user experience of LLMs. The approach is lightweight, requiring neithe...
- [5] "The reliance on tailored template prompts for specific tasks, particularly in security and privacy, may limit generalizability"
Limitations and Future Work Despite its advantages, the method has limitations that warrant further exploration. The reliance on tailored template prompts for specific tasks, particularly in security and privacy, may limit generalizability. Also, the computation for an incoming prompt to get the gradients during inference adds to the latency and runtime me...
- [6]
Bibliographical References Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901. Zouying Cao, Yifei Yang, and Hai Zhao. 2025. Scans: Miti...
- [7] TruthfulQA: Measuring How Models Mimic Human Falsehoods
Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958. Zi Lin, Zihan Wang, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, and Jingbo Shang. 2023. Toxicchat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation. arXiv preprint arXiv:2310.17389. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carr...
arXiv 2023
- [8] Red Teaming Language Models with Language Models
Red teaming language models with language models. arXiv preprint arXiv:2202.03286. Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson
- [9] Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693. Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing sys...
arXiv 2023