
arxiv: 2604.17132 · v1 · submitted 2026-04-18 · 💻 cs.CL

Please refuse to answer me! Mitigating Over-Refusal in Large Language Models via Adaptive Contrastive Decoding

Pith reviewed 2026-05-10 06:28 UTC · model grok-4.3

classification 💻 cs.CL
keywords over-refusal · large language models · contrastive decoding · safety alignment · refusal mitigation · adaptive decoding · LLM safety

The pith

Adaptive contrastive decoding reduces over-refusal in LLMs by favoring ignored non-refusal tokens while keeping safety intact on malicious queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Safety-aligned large language models frequently refuse harmless queries even when appropriate answers are available in their token predictions. The paper identifies that non-refusal tokens stay in the candidate list but are systematically passed over under strong safety prompting. Adaptive Contrastive Decoding compares the model's output distribution under normal conditions against an extreme safety prompt to isolate and adjust the refusal component. It then applies a dynamic contrastive step that boosts non-refusal probabilities only when the query is harmless. If this holds, models would deliver direct answers to ordinary requests without retraining and without weakening protection against harmful inputs.

Core claim

When LLMs suffer from over-refusal, non-refusal tokens remain present in the next-token candidate list, but the model fails to select them and instead generates refusal tokens. AdaCD first contrasts the output distributions produced with and without an extreme safety system prompt to refine the refusal token distribution. It then applies an adaptive contrastive decoding strategy that either incorporates or removes the refined refusal distribution on the fly, raising the selection probability of non-refusal tokens for safe queries. Experiments on five benchmark datasets show that this training-free method lowers the refusal ratio for over-refusal queries by 10.35 percent on average while still increasing the refusal ratio for malicious queries by 0.13 percent.
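The observation behind the claim can be illustrated on a toy next-token distribution. The sketch below is only an illustration under stated assumptions, not the authors' code: the vocabulary, the logits, the `REFUSAL_TOKENS` set, and the helper names `softmax` and `top_k` are all hypothetical.

```python
import math

# Hypothetical toy vocabulary and logits; none of this is taken from the paper.
VOCAB = ["I", "Sorry", "Sure", "To", "cannot"]
REFUSAL_TOKENS = {"Sorry", "cannot"}  # assumed refusal markers


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return the k most probable (token, probability) pairs."""
    ranked = sorted(zip(VOCAB, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Assumed logits under a strong safety prompt: a refusal token wins the argmax.
logits_safety = [1.0, 3.0, 2.5, 2.0, 0.5]
candidates = top_k(softmax(logits_safety), k=3)

greedy_token = candidates[0][0]
non_refusal_in_list = [tok for tok, _ in candidates if tok not in REFUSAL_TOKENS]

print("greedy token:", greedy_token)
print("non-refusal candidates:", non_refusal_in_list)
```

In this toy case greedy decoding emits a refusal token even though two non-refusal tokens sit in the top-3 candidate list, which is the pattern the paper reports.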

What carries the argument

Adaptive Contrastive Decoding (AdaCD), a mechanism that isolates refusal behavior by contrasting the output distributions produced with and without an extreme safety prompt, and then incorporates or removes that refusal component during token selection.
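One way to read this mechanism is sketched below. It is a minimal illustration under several assumptions, not the paper's algorithm: the contrast is taken in log-probability space with the sign convention "safety-prompted minus normal", the scale `alpha` and the explicit `mode` argument are hypothetical, and the `alpha` here should not be confused with the hyperparameter α that the paper ablates.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def refusal_delta(base_logits, safety_logits):
    """Assumed refusal component: per-token log-prob shift caused by the safety prompt."""
    base = log_softmax(base_logits)
    safe = log_softmax(safety_logits)
    return [s - b for s, b in zip(safe, base)]


def contrastive_scores(base_logits, safety_logits, alpha, mode):
    """Add ('incorporate') or subtract ('remove') the scaled refusal component.

    `alpha` and `mode` are hypothetical parameters for this sketch.
    """
    if mode not in ("incorporate", "remove"):
        raise ValueError("mode must be 'incorporate' or 'remove'")
    sign = 1.0 if mode == "incorporate" else -1.0
    base = log_softmax(base_logits)
    delta = refusal_delta(base_logits, safety_logits)
    return [b + sign * alpha * d for b, d in zip(base, delta)]


def argmax(scores):
    """Index of the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Toy vocabulary: index 0 = "Sorry" (refusal), index 1 = "Sure" (non-refusal).
base_logits = [2.0, 1.8, 0.0]
safety_logits = [4.0, 1.0, 0.0]

removed = contrastive_scores(base_logits, safety_logits, alpha=1.0, mode="remove")
kept = contrastive_scores(base_logits, safety_logits, alpha=1.0, mode="incorporate")
print(argmax(removed), argmax(kept))
```

With these made-up logits, removing the refusal component flips the top token from the refusal token to the non-refusal one, while incorporating it keeps the refusal on top.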

If this is right

  • Harmless user queries receive direct answers instead of refusals.
  • Refusal behavior on malicious queries remains high or improves slightly.
  • No model retraining or parameter updates are required to achieve the change.
  • The same procedure can be applied at inference time to any safety-aligned LLM.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The contrastive comparison of prompted distributions might be reused to surface and correct other suppressed token preferences beyond refusal.
  • Deployment systems could expose the strength of the safety prompt as a tunable knob for users who want stricter or looser behavior.
  • The method could be combined with existing prompt-engineering techniques to target specific categories of over-refusal without affecting overall capability.

Load-bearing premise

The pattern that non-refusal tokens are present yet ignored holds across different models and query types, and the adaptive adjustment does not introduce new unintended biases.

What would settle it

Apply AdaCD to a fresh set of over-refusal and malicious queries and measure the refusal rate on each. The claim fails if the refusal rate on safe queries does not drop below the baseline, or if the refusal rate on malicious queries falls.
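The test described above can be sketched as a small evaluation harness. This is an illustration only: the keyword list `REFUSAL_MARKERS`, the helper names, and the example responses are all made up, and a keyword check is a crude stand-in for the judge-based evaluation a real study would likely use.

```python
# Hypothetical refusal markers; real evaluations often use a judge model instead.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "i am unable")


def is_refusal(response):
    """Crude keyword check for whether a response is a refusal."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_ratio(responses):
    """Fraction of responses flagged as refusals."""
    if not responses:
        raise ValueError("responses must be non-empty")
    return sum(is_refusal(r) for r in responses) / len(responses)


def claim_survives(safe_before, safe_after, mal_before, mal_after):
    """True if refusals drop on safe queries and do not drop on malicious ones."""
    safe_improved = refusal_ratio(safe_after) < refusal_ratio(safe_before)
    safety_kept = refusal_ratio(mal_after) >= refusal_ratio(mal_before)
    return safe_improved and safety_kept


# Made-up responses for illustration only.
safe_before = ["I'm sorry, I can't help with that.", "Sure, use `kill -9 <pid>`."]
safe_after = ["Sure, here is how.", "Sure, use `kill -9 <pid>`."]
mal_before = ["I cannot assist with that.", "I'm sorry, no."]
mal_after = ["I cannot assist with that.", "I'm sorry, no."]

print(refusal_ratio(safe_before), refusal_ratio(safe_after))
print(claim_survives(safe_before, safe_after, mal_before, mal_after))
```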

Figures

Figures reproduced from arXiv: 2604.17132 by Feng Xia, Lixin Cui, Lu Bai, Yupeng Qi, Ziyu Lyu.

Figure 1
Figure 1: Over-refusal example. Here, “kill” refers …
Figure 2
Figure 2: Refusal ratio under various safety level system …
Figure 3
Figure 3: AdaCD has two components: (a) Refusal Token Distribution Extraction: extracting the refusal token distribution from prompted and unprompted LLMs under our extreme system prompt; (b) Adaptive Decoding Mode Switch: using the agreement ratio and adaptive confidence constraint to adjust the selection of refusal tokens. … token selection patterns similar to that distribution. Our core insight is to identify an ex…
Figure 5
Figure 5: Ablation analysis of λ on refusal ratio. 5 Conclusion: In this paper, we propose AdaCD to mitigate over-refusal while maintaining LLMs’ safety. The motivation behind AdaCD stems from our observation that, although refusal tokens have higher probabilities of being selected, non-refusal tokens still frequently appear among the candidate tokens. To adaptively adjust the selection probabilities of these toke…
Figure 6
Figure 6: Refusal ratio evaluated by GPT-4. D Additional Ablation Study. D.1 Ablation Analysis of α: We conduct a further ablation analysis on the hyperparameter α of AdaCD. The experiments are performed using the Llama3 model. To investigate the effects of α, we set α = 3.5, 4.0, 4.5, 5.0 …
Figure 7
Figure 7: Ablation analysis of hyperparameter λ on refusal ratio.
Figure 8
Figure 8 depicts the tokens of Llama3 during inference on the ORBench dataset with the highest and lowest probabilities in ∆P1, where low-probability tokens correspond to negative logits. It is worth noting that we did not use Qwen3 for visualization because its pretraining involved a relatively large amount of Chinese corpora. In our experiments, we found that this resulted in many Chinese refusal tokens, such as “拒绝” (“refuse”)…
Original abstract

Safety-aligned large language models (LLMs) often generate refusal responses to harmless queries due to the over-refusal problem. However, existing methods for mitigating over-refusal cannot maintain a low refusal ratio for harmless queries while keeping a high refusal ratio for malicious ones. In this paper, we analyze how system prompts with varying safety levels affect LLM refusal behaviors when facing over-refusal queries. A key observation is that, when LLMs suffer from the over-refusal issue, non-refusal tokens remain present in the next-token candidate list, but the model systematically fails to select them, despite the generation of refusal tokens. Based on this observation, we propose a training-free and model-agnostic approach, Adaptive Contrastive Decoding (AdaCD), to mitigate over-refusal while maintaining LLM safety. First, AdaCD compares the output distributions of the LLM with or without an extreme safety system prompt to refine the refusal token distribution. Second, we introduce an adaptive contrastive decoding strategy that dynamically incorporates or removes the refusal token distribution, adaptively boosting the probability of selecting refusal or non-refusal tokens. Experimental results on five benchmark datasets show that, on average, AdaCD reduces the refusal ratio for over-refusal queries by 10.35%, yet still increases the refusal ratio for malicious queries by 0.13%. Code is available at https://github.com/OutdoorManofML/AdaCD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that safety-aligned LLMs exhibit over-refusal on harmless queries because non-refusal tokens remain in the next-token candidate list but are systematically ignored. It proposes Adaptive Contrastive Decoding (AdaCD), a training-free and model-agnostic method: first, it refines the refusal token distribution by contrasting the LLM's output distributions with and without an extreme safety system prompt; second, it applies an adaptive contrastive decoding step that dynamically incorporates or removes this refusal distribution to boost non-refusal tokens on over-refusal queries while preserving refusal on malicious queries. Experiments across five benchmark datasets report an average 10.35% reduction in refusal ratio for over-refusal queries and a 0.13% increase for malicious queries.

Significance. If the adaptive adjustment rule generalizes, AdaCD provides a lightweight, inference-only intervention that improves the safety-helpfulness tradeoff without retraining or fine-tuning. The training-free and model-agnostic design, combined with public code, is a practical strength that could be integrated into existing decoding pipelines. The reported gains are modest but directly address a known limitation of current alignment techniques.

major comments (1)
  1. AdaCD algorithm (Section 3, adaptive contrastive decoding step): The decision rule for dynamically incorporating or removing the refusal token distribution is not explicitly defined. The text states only that the strategy is 'adaptive' and 'dynamically incorporates or removes' the distribution, without specifying the criterion (e.g., probability gap threshold, entropy cutoff, token rank, or query-type classifier). This is load-bearing for the central claim, as the headline result (10.35% over-refusal reduction with only +0.13% malicious refusal increase) depends on this rule not being post-hoc tuned to the five evaluation sets.
minor comments (2)
  1. Abstract and experimental section: The five benchmark datasets are not named or described in terms of query counts, sources, or construction criteria, making it harder to assess generality.
  2. Results presentation: Include variance, confidence intervals, or statistical significance tests for the reported average improvements to confirm the margins are reliable.
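The second minor comment asks for uncertainty estimates. One common option is a percentile bootstrap over queries, sketched below. The function name `bootstrap_ci`, its parameters, and the 0/1 refusal flags are assumptions for illustration; the data are synthetic and are not results from the paper.

```python
import random


def bootstrap_ci(before, after, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the paired drop in refusal ratio.

    `before` and `after` are 0/1 refusal flags for the same queries.
    """
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    n = len(before)
    diffs = []
    for _ in range(n_resamples):
        # Resample query indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(before[i] - after[i] for i in idx) / n)
    diffs.sort()
    tail = (1.0 - level) / 2.0
    lower = diffs[int(tail * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    point = sum(b - a for b, a in zip(before, after)) / n
    return point, lower, upper


# Synthetic flags: 40 queries, 12 refusals before and 8 after.
before = [1] * 12 + [0] * 28
after = [1] * 8 + [0] * 32
point, lower, upper = bootstrap_ci(before, after)
print(f"drop={point:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Resampling whole queries keeps each before/after pair intact, which matters when the same prompts are scored under both decoding strategies.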

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and for highlighting the potential of AdaCD as a lightweight intervention for the safety-helpfulness tradeoff. We respond to the major comment on the AdaCD algorithm below and commit to revisions that enhance the clarity and reproducibility of our method.

Point-by-point responses
  1. Referee: AdaCD algorithm (Section 3, adaptive contrastive decoding step): The decision rule for dynamically incorporating or removing the refusal token distribution is not explicitly defined. The text states only that the strategy is 'adaptive' and 'dynamically incorporates or removes' the distribution, without specifying the criterion (e.g., probability gap threshold, entropy cutoff, token rank, or query-type classifier). This is load-bearing for the central claim, as the headline result (10.35% over-refusal reduction with only +0.13% malicious refusal increase) depends on this rule not being post-hoc tuned to the five evaluation sets.

    Authors: We acknowledge the validity of this observation. The manuscript describes the adaptive contrastive decoding as dynamically adjusting the incorporation of the refusal token distribution to boost non-refusal tokens on safe queries while preserving refusal on malicious ones, but it does not provide the explicit decision criterion or pseudocode for this step. This lack of detail could indeed raise questions about whether the rule was tuned specifically to the evaluation datasets. To address this, we will revise Section 3 to include a precise definition of the adaptive rule, including the specific criterion (e.g., based on the difference in refusal probabilities or a query-dependent threshold) used in our implementation, along with the full algorithm in pseudocode form. We will also discuss how this rule was determined during development to demonstrate it is not post-hoc tuned to the test sets. These changes will strengthen the central claim by making the method fully transparent and reproducible.

    Revision: yes
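To make the exchange concrete, the sketch below shows what an explicit switching criterion of the kind discussed might look like. It is purely hypothetical and is not the rule used in AdaCD: `choose_mode`, `gap_threshold`, and `refusal_ids` are invented for illustration, and the criterion shown is a simple gap in refusal probability mass, with an entropy helper included because the referee lists an entropy cutoff as another candidate.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def refusal_mass(probs, refusal_ids):
    """Total probability assigned to tokens flagged as refusals."""
    return sum(probs[i] for i in refusal_ids)


def entropy(probs):
    """Shannon entropy in nats; zero-probability tokens contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def choose_mode(base_logits, safety_logits, refusal_ids, gap_threshold=0.3):
    """Hypothetical rule: remove the refusal component when the safety prompt,
    rather than the query itself, accounts for most of the refusal mass."""
    base_mass = refusal_mass(softmax(base_logits), refusal_ids)
    safety_mass = refusal_mass(softmax(safety_logits), refusal_ids)
    gap = safety_mass - base_mass
    return "remove" if gap > gap_threshold else "incorporate"


refusal_ids = [0]  # assumed: index 0 is a refusal token such as "Sorry"

# Harmless-looking case: the model refuses mainly because of the safety prompt.
mode_harmless = choose_mode([1.0, 2.0, 0.0], [4.0, 1.0, 0.0], refusal_ids)
# Malicious-looking case: the model already refuses without the safety prompt.
mode_malicious = choose_mode([4.0, 1.0, 0.0], [5.0, 1.0, 0.0], refusal_ids)
print(mode_harmless, mode_malicious)
print(round(entropy(softmax([0.0, 0.0])), 4))
```

A rule this simple would misfire on a harmful query that the unprompted model fails to refuse, since the gap would then be large and the refusal component would be removed. That failure mode is one reason the exact criterion and its thresholds matter for the headline numbers.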

Circularity Check

0 steps flagged

No significant circularity; derivation proceeds from external observation to independent method.

full rationale

The paper begins with an empirical observation on token distributions under varying safety prompts, then defines AdaCD as a two-stage contrastive procedure that refines and adaptively applies refusal distributions. No equation or step reduces the reported refusal-ratio changes to a fitted parameter renamed as prediction, nor does any load-bearing claim rest on self-citation chains or imported uniqueness theorems. The adaptive rule is described procedurally rather than defined in terms of the target metrics, and the five benchmarks are treated as external evaluation sets. This keeps the central performance claims independent of the method's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions about next-token prediction in autoregressive LLMs and the existence of distinguishable refusal vs non-refusal tokens in the vocabulary; no new free parameters or invented entities are introduced in the abstract description.

axioms (2)
  • domain assumption: LLM next-token distributions contain both refusal and non-refusal tokens even under over-refusal conditions.
    Central observation stated in the abstract that enables the contrastive adjustment.
  • domain assumption: Comparing distributions with and without an extreme safety prompt isolates refusal-related probability mass.
    Key step in refining the refusal token distribution.

pith-pipeline@v0.9.0 · 5565 in / 1391 out tokens · 40718 ms · 2026-05-10T06:28:36.072935+00:00 · methodology
