Please refuse to answer me! Mitigating Over-Refusal in Large Language Models via Adaptive Contrastive Decoding
Pith reviewed 2026-05-10 06:28 UTC · model grok-4.3
The pith
Adaptive contrastive decoding reduces over-refusal in LLMs by favoring ignored non-refusal tokens while keeping safety intact on malicious queries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When LLMs suffer from over-refusal, non-refusal tokens remain present in the next-token candidate list but the model fails to select them and instead generates refusal tokens. AdaCD first contrasts the output distributions produced with and without an extreme safety system prompt to refine the refusal token distribution. It then applies an adaptive contrastive decoding strategy that either incorporates or removes the refined refusal distribution on the fly, raising the selection probability of non-refusal tokens for safe queries. Experiments on five benchmark datasets show that this training-free method lowers the refusal ratio for over-refusal queries by 10.35 percent on average while still raising the refusal ratio for malicious queries by 0.13 percent.
What carries the argument
Adaptive Contrastive Decoding (AdaCD), a mechanism that isolates refusal behavior by subtracting safety-prompted output distributions from normal ones and then dynamically scales that difference during token selection.
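A minimal sketch of the contrast step, not the authors' implementation. It assumes next-token logits are available for the same context with and without the safety system prompt. The function names, the toy vocabulary, the logit values, and the weight `alpha` are all hypothetical; this summary does not state the paper's actual refinement or scaling rule.

```python
import math

def log_softmax(logits):
    # Numerically stable log-probabilities.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def refusal_shift(plain_logits, safety_logits):
    # Per-token change in log-probability caused by the safety prompt.
    # Assumption: this shift is used as a proxy for the refusal signal.
    plain = log_softmax(plain_logits)
    safe = log_softmax(safety_logits)
    return [s - p for s, p in zip(safe, plain)]

def contrast(plain_logits, safety_logits, alpha):
    # alpha > 0 subtracts the refusal shift; alpha < 0 adds it back.
    plain = log_softmax(plain_logits)
    shift = refusal_shift(plain_logits, safety_logits)
    return softmax([p - alpha * d for p, d in zip(plain, shift)])

# Toy four-token vocabulary and made-up logits for one decoding step.
vocab = ["Sorry", "I", "Sure", "Here"]
plain_logits = [2.0, 1.0, 1.8, 0.5]
safety_logits = [4.0, 1.0, 0.5, -0.5]

baseline = softmax(plain_logits)
adjusted = contrast(plain_logits, safety_logits, alpha=0.5)
print(vocab[baseline.index(max(baseline))], vocab[adjusted.index(max(adjusted))])
```

In this toy case the non-refusal token "Sure" is already a close second in the unmodified distribution, which mirrors the observation that such tokens are present but not selected; removing part of the prompt-induced shift is enough to move it to the top.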
If this is right
- Harmless user queries receive direct answers instead of refusals.
- Refusal behavior on malicious queries remains high or improves slightly.
- No model retraining or parameter updates are required to achieve the change.
- The same procedure can be applied at inference time to any safety-aligned LLM.
Where Pith is reading between the lines
- The contrastive comparison of prompted distributions might be reused to surface and correct other suppressed token preferences beyond refusal.
- Deployment systems could expose the strength of the safety prompt as a tunable knob for users who want stricter or looser behavior.
- The method could be combined with existing prompt-engineering techniques to target specific categories of over-refusal without affecting overall capability.
Load-bearing premise
The pattern that non-refusal tokens are present yet ignored holds across different models and query types, and the adaptive adjustment does not introduce new unintended biases.
What would settle it
Apply AdaCD to a fresh set of over-refusal and malicious queries and measure the refusal rate on each. The claim fails if the refusal rate on safe queries does not drop below the baseline, or if the refusal rate on malicious queries falls below it.
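The measurement itself can be sketched as follows. This is an illustration, not the paper's evaluation code: the prefix-matching detector and the `REFUSAL_MARKERS` list are hypothetical stand-ins for whatever refusal classifier the benchmarks use, and the responses are made up.

```python
# Hypothetical refusal prefixes; a real evaluation would likely use a
# longer list or a trained classifier.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "i am unable")

def is_refusal(response):
    # Treat a response as a refusal if it opens with a known marker.
    text = response.strip().lower()
    return any(text.startswith(marker) for marker in REFUSAL_MARKERS)

def refusal_ratio(responses):
    if not responses:
        raise ValueError("need at least one response")
    return sum(is_refusal(r) for r in responses) / len(responses)

def ratio_change(baseline_responses, method_responses):
    # Negative values mean the method refuses less often than the baseline.
    return refusal_ratio(method_responses) - refusal_ratio(baseline_responses)

# Made-up responses to four harmless queries.
baseline_safe = [
    "I'm sorry, but I can't help with that.",
    "Sure, here is how to kill a Python process.",
    "I cannot assist with this request.",
    "Here are some tips.",
]
method_safe = [
    "Sure, here is an overview.",
    "Sure, here is how to kill a Python process.",
    "I cannot assist with this request.",
    "Here are some tips.",
]

print(refusal_ratio(baseline_safe), ratio_change(baseline_safe, method_safe))
```

The same two functions would be run separately on the malicious-query set, so that both sides of the safety-helpfulness tradeoff are reported.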
Original abstract
Safety-aligned large language models (LLMs) often generate refusal responses to harmless queries due to the over-refusal problem. However, existing methods for mitigating over-refusal cannot maintain a low refusal ratio for harmless queries while keeping a high refusal ratio for malicious ones. In this paper, we analyze how system prompts with varying safety levels affect LLM refusal behaviors when facing over-refusal queries. A key observation is that, when LLMs suffer from the over-refusal issue, non-refusal tokens remain present in the next-token candidate list, but the model systematically fails to select them, despite the generation of refusal tokens. Based on this observation, we propose a training-free and model-agnostic approach, Adaptive Contrastive Decoding (AdaCD), to mitigate over-refusal while maintaining LLM safety. First, AdaCD compares the output distributions of the LLM with or without an extreme safety system prompt to refine the refusal token distribution. Second, we introduce an adaptive contrastive decoding strategy that dynamically incorporates or removes the refusal token distribution, adaptively boosting the probability of selecting refusal or non-refusal tokens. Experimental results on five benchmark datasets show that, on average, AdaCD reduces the refusal ratio for over-refusal queries by 10.35%, yet still increases the refusal ratio for malicious queries by 0.13%. Code is available at https://github.com/OutdoorManofML/AdaCD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that safety-aligned LLMs exhibit over-refusal on harmless queries because non-refusal tokens remain in the next-token candidate list but are systematically ignored. It proposes Adaptive Contrastive Decoding (AdaCD), a training-free and model-agnostic method: first, it refines the refusal token distribution by contrasting the LLM's output distributions with and without an extreme safety system prompt; second, it applies an adaptive contrastive decoding step that dynamically incorporates or removes this refusal distribution to boost non-refusal tokens on over-refusal queries while preserving refusal on malicious queries. Experiments across five benchmark datasets report an average 10.35% reduction in refusal ratio for over-refusal queries and a 0.13% increase for malicious queries.
Significance. If the adaptive adjustment rule generalizes, AdaCD provides a lightweight, inference-only intervention that improves the safety-helpfulness tradeoff without retraining or fine-tuning. The training-free and model-agnostic design, combined with public code, is a practical strength that could be integrated into existing decoding pipelines. The reported gains are modest but directly address a known limitation of current alignment techniques.
major comments (1)
- AdaCD algorithm (Section 3, adaptive contrastive decoding step): The decision rule for dynamically incorporating or removing the refusal token distribution is not explicitly defined. The text states only that the strategy is 'adaptive' and 'dynamically incorporates or removes' the distribution, without specifying the criterion (e.g., probability gap threshold, entropy cutoff, token rank, or query-type classifier). This is load-bearing for the central claim, as the headline result (10.35% over-refusal reduction with only +0.13% malicious refusal increase) depends on this rule not being post-hoc tuned to the five evaluation sets.
minor comments (2)
- Abstract and experimental section: The five benchmark datasets are not named or described in terms of query counts, sources, or construction criteria, making it harder to assess generality.
- Results presentation: Include variance, confidence intervals, or statistical significance tests for the reported average improvements to confirm the margins are reliable.
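One way to produce the interval requested in the last comment is a paired percentile bootstrap over per-query outcomes. The sketch below assumes each query is scored 1 (refused) or 0 (answered) under both the baseline and the method; the function name, the resample count, and the toy data are hypothetical and do not come from the paper.

```python
import random

def bootstrap_ci(baseline, method, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for mean(method) - mean(baseline), paired by query."""
    if len(baseline) != len(method) or not baseline:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(baseline)
    diffs = []
    for _ in range(n_resamples):
        # Resample query indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(method[i] - baseline[i] for i in idx) / n)
    diffs.sort()
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return diffs[lo_idx], diffs[hi_idx]

# Toy data: 100 harmless queries, 30 refused at baseline, 18 with the method.
baseline = [1] * 30 + [0] * 70
method = [1] * 18 + [0] * 82

low, high = bootstrap_ci(baseline, method)
print(low, high)
```

If the whole interval lies below zero, the drop in refusal ratio is unlikely to be resampling noise on that dataset; an interval that straddles zero, as might happen for a change as small as 0.13%, would call for more queries before drawing a conclusion.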
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and for highlighting the potential of AdaCD as a lightweight intervention for the safety-helpfulness tradeoff. We respond to the major comment on the AdaCD algorithm below and commit to revisions that enhance the clarity and reproducibility of our method.
Point-by-point responses
Referee: AdaCD algorithm (Section 3, adaptive contrastive decoding step): The decision rule for dynamically incorporating or removing the refusal token distribution is not explicitly defined. The text states only that the strategy is 'adaptive' and 'dynamically incorporates or removes' the distribution, without specifying the criterion (e.g., probability gap threshold, entropy cutoff, token rank, or query-type classifier). This is load-bearing for the central claim, as the headline result (10.35% over-refusal reduction with only +0.13% malicious refusal increase) depends on this rule not being post-hoc tuned to the five evaluation sets.
Authors: We acknowledge the validity of this observation. The manuscript describes the adaptive contrastive decoding as dynamically adjusting the incorporation of the refusal token distribution to boost non-refusal tokens on safe queries while preserving refusal on malicious ones, but it does not provide the explicit decision criterion or pseudocode for this step. This lack of detail could indeed raise questions about whether the rule was tuned specifically to the evaluation datasets. To address this, we will revise Section 3 to include a precise definition of the adaptive rule, including the specific criterion (e.g., based on the difference in refusal probabilities or a query-dependent threshold) used in our implementation, along with the full algorithm in pseudocode form. We will also discuss how this rule was determined during development to demonstrate it is not post-hoc tuned to the test sets. These changes will strengthen the central claim by making the method fully transparent and reproducible.
Revision: yes
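To make concrete what an explicit criterion could look like, here is a purely illustrative rule of the "difference in refusal probabilities" kind mentioned in the response. It is not the authors' rule, which the manuscript does not specify; `choose_weight`, `gap_threshold`, `strength`, the refusal token ids, and the probabilities are all assumptions made for this sketch.

```python
def refusal_mass(probs, refusal_ids):
    # Total probability assigned to tokens treated as refusal openers.
    return sum(probs[i] for i in refusal_ids)

def choose_weight(plain_probs, safety_probs, refusal_ids,
                  gap_threshold=0.3, strength=0.5):
    """Hypothetical decision rule for the sign of the contrast weight.

    Large gap: refusal is mostly induced by the safety prompt, so return a
    positive weight (remove the refusal distribution).
    Small gap: the model refuses on its own, so return a negative weight
    (incorporate the refusal distribution).
    """
    gap = (refusal_mass(safety_probs, refusal_ids)
           - refusal_mass(plain_probs, refusal_ids))
    return strength if gap >= gap_threshold else -strength

refusal_ids = [0]  # assumed index of a refusal token such as "Sorry"

# Made-up next-token probabilities for a borderline harmless query.
benign_plain = [0.45, 0.05, 0.40, 0.10]
benign_safety = [0.90, 0.04, 0.04, 0.02]

# Made-up probabilities for a clearly malicious query.
malicious_plain = [0.92, 0.03, 0.03, 0.02]
malicious_safety = [0.97, 0.01, 0.01, 0.01]

print(choose_weight(benign_plain, benign_safety, refusal_ids))
print(choose_weight(malicious_plain, malicious_safety, refusal_ids))
```

Whatever the real rule is, the referee's concern applies to its free parameters: values like `gap_threshold` here would need to be fixed on held-out data rather than on the five evaluation sets.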
Circularity Check
No significant circularity; derivation proceeds from external observation to independent method.
Full rationale
The paper begins with an empirical observation on token distributions under varying safety prompts, then defines AdaCD as a two-stage contrastive procedure that refines and adaptively applies refusal distributions. No equation or step reduces the reported refusal-ratio changes to a fitted parameter renamed as prediction, nor does any load-bearing claim rest on self-citation chains or imported uniqueness theorems. The adaptive rule is described procedurally rather than defined in terms of the target metrics, and the five benchmarks are treated as external evaluation sets. This keeps the central performance claims independent of the method's own inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLM next-token distributions contain both refusal and non-refusal tokens even under over-refusal conditions
- domain assumption: Comparing distributions with and without an extreme safety prompt isolates refusal-related probability mass