pith. machine review for the scientific record.

arxiv: 2605.09549 · v1 · submitted 2026-05-10 · 💻 cs.LG

Recognition: no theorem link

When Adaptation Fails: A Gradient-Based Diagnosis of Collapsed Gating in Vision-Language Prompt Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:15 UTC · model grok-4.3

classification 💻 cs.LG
keywords: adaptive prompting, vision-language models, prompt learning, gating mechanisms, gradient imbalance, few-shot learning, CLIP, adaptation failure

The pith

Adaptive gates in few-shot vision-language prompt learning collapse to constant outputs and match fixed prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines adaptive prompting in frozen few-shot setups using CLIP-style models and finds that gates and prompt selectors frequently stop varying with the input. These components produce nearly identical outputs across examples, send almost no training signal through gradients, and deliver no consistent gain over non-adaptive baselines. Controlled tests across datasets and architectures trace the pattern to two issues: large differences in gradient sizes between components and progressive loss of gate variability. The results indicate that adding adaptive layers does not automatically improve parameter-efficient learning in this regime.

Core claim

Adaptive prompting mechanisms have been proposed to enhance vision-language models by dynamically tailoring prompts to inputs. However, in frozen few-shot prompt learning with CLIP-style backbones, adaptive gates and prompt-selection modules often collapse: they produce nearly constant outputs, contribute negligible gradient signals, and frequently fail to outperform fixed prompts. The underlying causes identified through systematic experiments are gradient magnitude imbalance and gate degradation.

What carries the argument

Gradient magnitude imbalance and gate degradation within adaptive gating and prompt-selection modules.
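Both signals can be made concrete. A minimal pure-Python sketch (not the authors' code; toy numbers chosen in the spirit of the paper's reported 2-3 order-of-magnitude gap) of the two diagnostics:

```python
# Sketch of the paper's two collapse diagnostics for a hypothetical gated
# prompt module: (1) the gap between prompt and gate gradient norms, and
# (2) near-zero variance of gate outputs across inputs (gate degradation).
import math

def l2_norm(grads):
    """L2 norm of a flat list of per-parameter gradients."""
    return math.sqrt(sum(g * g for g in grads))

def magnitude_gap(prompt_grads, gate_grads, eps=1e-12):
    """Orders of magnitude separating prompt and gate gradient norms."""
    return math.log10(l2_norm(prompt_grads) / (l2_norm(gate_grads) + eps))

def gate_collapsed(gate_outputs, tol=1e-3):
    """A gate whose outputs barely vary across inputs is effectively constant."""
    mean = sum(gate_outputs) / len(gate_outputs)
    var = sum((o - mean) ** 2 for o in gate_outputs) / len(gate_outputs)
    return var < tol

# Toy values: prompt gradients orders larger than gate gradients,
# gate outputs nearly identical across four inputs.
gap = magnitude_gap([0.5, 0.1], [1.5e-3, 5e-4])
collapsed = gate_collapsed([0.981, 0.979, 0.980, 0.982])
```

On numbers like these the gap lands between 2 and 3 orders of magnitude and the variance check flags collapse, which is the qualitative pattern the paper reports.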

If this is right

  • Indiscriminately adding architectural complexity to parameter-efficient prompt learning should be re-examined.
  • Prompt-level adaptive gating is effective only under conditions that avoid gradient imbalance and degradation.
  • Fixed prompts remain competitive when adaptive mechanisms lose their ability to vary.
  • Training dynamics rather than module design determine whether adaptation succeeds in this regime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar collapse patterns may appear in other parameter-efficient fine-tuning methods that rely on learned selectors or gates.
  • Techniques that normalize or re-scale gradients during training could prevent the identified failure modes.
  • The findings may not hold when backbones are unfrozen or when more shots are available for adaptation.
  • Design of future adaptive modules should prioritize stable gradient flow over added expressivity.

Load-bearing premise

That gradient magnitude imbalance and gate degradation are the primary causes of collapse across frozen few-shot setups rather than artifacts of the specific architectures or datasets.

What would settle it

An experiment that equalizes gradient magnitudes across prompt and gate components and checks whether the gates retain input-dependent variation and outperform fixed prompts on multiple datasets.
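A minimal sketch of the equalization half of that experiment, assuming per-group rescaling of gradients to a shared target norm before each optimizer step (the target value and grouping here are illustrative, not from the paper):

```python
# Rescale each parameter group's gradient vector to a common L2 norm,
# removing the magnitude imbalance between prompt and gate parameters.
import math

def equalize(grad_groups, target_norm=1.0, eps=1e-12):
    """Return a copy of grad_groups with every group's L2 norm set to target_norm."""
    out = {}
    for name, grads in grad_groups.items():
        norm = math.sqrt(sum(g * g for g in grads))
        scale = target_norm / (norm + eps)
        out[name] = [g * scale for g in grads]
    return out

groups = {"prompt": [0.5, 0.1], "gate": [1.5e-3, 5e-4]}
balanced = equalize(groups)
prompt_norm = math.sqrt(sum(g * g for g in balanced["prompt"]))
gate_norm = math.sqrt(sum(g * g for g in balanced["gate"]))
```

The second half of the test would then track whether gate outputs still vary across inputs after training under balanced gradients, and whether accuracy beats fixed prompts on multiple datasets.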

Figures

Figures reproduced from arXiv: 2605.09549 by Xinhe Wang, Yunxuan Fang, Ziwei Zhang.

Figure 1. Gradient cancellation rates for length and depth gate parameters on ImageNet; each bar averages over three random seeds.

Table II. Performance under different gating strategies (accuracy, %); all gating strategies achieve nea…

  Gate Strategy        | ImageNet | Caltech101 | EuroSAT
  Fixed (All-on)       | 75.31    | 96.49      | 79.41
  Random               | 74.18    | 96.17      | 73.82
  Per-Layer            | 75.31    | 96.49      | 79.41
  Adaptive (Per-Token) | 75.31    | 96.49      | 79.41
Figure 2. Gradient norm comparison: gradient norms of prompt parameters and gate parameters measured across training iterations; shaded regions indicate variation across random seeds. Gate parameters receive gradients 2-3 orders of magnitude smaller than prompt parameters across all datasets.

  Model           | Prompt Grad Norm      | Gate Grad Norm        | Magnitude Gap
  AdaptiveBiMaPLe | 5.11×10⁻¹ ± 2.56×10⁻² | 1.59×10⁻³ ± 8.32×10⁻⁵ | 2.60 ± 1.87×10⁻²
  AdaptiveB…
Figure 3. Effective prompt lengths and depth activation probabilities during training; both quantities remain nearly constant throughout. The stability of overall accuracy despite collapsed gates can be explained by the fact that CLIP's frozen features and MaPLe-style deep prompts already dominate the optimization signal. Once gates saturate, the model effectively reduces to a fixed-prompt variant with …
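The effective-length quantity tracked in Figure 3 follows the appendix's definition L_eff = Σ_i σ(g_i). A pure-Python sketch (toy logits, not the paper's trained values):

```python
# Effective prompt length as defined in the paper's appendix:
# L_eff = sum over tokens of sigmoid(gate logit). Saturated gates
# (large positive logits) pin L_eff near the full token count, i.e.
# the model behaves like a fixed prompt with every token switched on.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def effective_length(gate_logits):
    return sum(sigmoid(g) for g in gate_logits)

# Four saturated gates: L_eff stays just under 4, and stays there as
# long as the logits do not move, matching the flat curves in Figure 3.
L_eff = effective_length([6.0, 5.5, 7.0, 6.2])
```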
read the original abstract

Adaptive prompting mechanisms have been proposed to enhance vision-language models by dynamically tailoring prompts to inputs. However, in frozen few-shot prompt learning with CLIP-style backbones, we systematically observe that adaptive gates and prompt-selection modules often collapse: they produce nearly constant outputs, contribute negligible gradient signals, and frequently fail to outperform fixed prompts. To further explore this issue, we present a systematic diagnostic study to uncover the underlying causes and conditions of adaptation failure. Through controlled experiments across datasets and multiple prompt learning architectures, we identify two recurring failure modes: gradient magnitude imbalance and gate degradation. Our findings invite a re-examination of indiscriminately adding architectural complexity in parameter-efficient learning and clarify when prompt-level adaptive gating is, and is not, effective in this regime.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that in frozen few-shot prompt learning with CLIP-style vision-language backbones, adaptive gates and prompt-selection modules frequently collapse, producing nearly constant outputs, negligible gradient signals, and no performance gain over fixed prompts. Through controlled experiments across multiple datasets and prompt-learning architectures, it identifies two recurring failure modes—gradient magnitude imbalance and gate degradation—as the underlying causes, and calls for re-examination of adding adaptive complexity in parameter-efficient learning.

Significance. If the empirical patterns hold, the work provides a useful diagnostic framework for understanding when adaptive prompting succeeds or fails in the frozen CLIP regime. It offers concrete gradient-based evidence that could steer the community away from over-complex gating modules toward simpler or more carefully conditioned designs, particularly valuable given the prevalence of prompt-tuning methods.

minor comments (3)
  1. Abstract: the statement that adaptive modules 'frequently fail to outperform fixed prompts' would be strengthened by reporting the exact fraction of runs or datasets where this occurs, rather than the qualitative term 'frequently'.
  2. The description of the two failure modes (gradient magnitude imbalance and gate degradation) is clear in the abstract but would benefit from a short table or figure in the main text that directly contrasts the gradient norms and output variance of adaptive vs. fixed baselines.
  3. The manuscript would be improved by an explicit statement of the precise few-shot shot counts, learning rates, and prompt lengths used in the controlled experiments, to facilitate exact reproduction.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our work, as well as the recommendation for minor revision. We appreciate the recognition that the diagnostic framework could help steer the community toward more careful design of adaptive modules in prompt learning.

Circularity Check

0 steps flagged

Empirical observational study with no derivation chain or circular steps

full rationale

The paper is a diagnostic empirical study based on controlled experiments documenting collapsed gating via observed near-constant outputs and negligible gradients in CLIP-style prompt learning. No mathematical derivation, first-principles prediction, or model is claimed that could reduce by construction to fitted parameters, self-definitions, or self-citations. The two failure modes are presented as recurring patterns from training dynamics across datasets and architectures, with the work framed as an invitation to re-examine rather than a constructed result. The analysis is self-contained as observation and carries no circularity burden.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical diagnostic study with no explicit mathematical derivations, free parameters, or newly postulated entities; it relies on standard assumptions of gradient-based optimization in neural networks.

axioms (1)
  • standard math: Standard assumptions of gradient-based optimization and training dynamics in neural networks hold for the prompt-learning setups tested. The diagnosis of gradient imbalance and gate degradation presupposes typical back-propagation behavior.

pith-pipeline@v0.9.0 · 5427 in / 1185 out tokens · 45673 ms · 2026-05-12T03:15:48.860263+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages

  1. [1] A. Radford et al., "Learning transferable visual models from natural language supervision," in ICML, 2021.
  2. [2] M. Jia et al., "Visual prompt tuning," in ECCV, 2022.
  3. [3] K. Zhou et al., "Learning to prompt for vision-language models," IJCV, 2022.
  4. [4] K. Zhou et al., "Conditional prompt learning for vision-language models," in CVPR, 2022.
  5. [5] M. U. Khattak et al., "MaPLe: Multi-modal prompt learning," in CVPR, 2023.
  6. [6] J. Deng et al., "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
  7. [7] L. Fei-Fei et al., "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," in CVPR Workshop, 2004.
  8. [8] P. Helber et al., "EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification," JSTARS, 2019.
  9. [9] P. Gao et al., "CLIP-Adapter: Better vision-language models with feature adapters," IJCV, 2024.
  10. [10] M. Li et al., "Vision-language model fine-tuning via simple parameter-efficient modification," in EMNLP, 2024, pp. 14394–14410.
  11. [11] Y. Guo et al., "A parameter-efficient and fine-grained prompt learning for vision-language models," in ACL, 2025, pp. 31346–31359.
  12. [12] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, 1997.
  13. [13] A. Vaswani et al., "Attention is all you need," NeurIPS, 2017.
  14. [14] X. Ye et al., "ATP-LLaVA: Adaptive token pruning for large vision language models," in CVPR, 2025.
  15. [15] J. Li et al., "Adaptive gating in mixture-of-experts based language models," EMNLP, 2023.
  16. [16] Q. Zou et al., "PromptHash: Affinity-prompted collaborative cross-modal learning for adaptive hashing retrieval," in CVPR, 2025, pp. 19649–19658.
  17. [17] G. Wang et al., "DEGAP: Dual event-guided adaptive prefixes for templated-based event argument extraction with slot querying," in COLING, 2025, pp. 7598–7613.
  18. [18] S. Zhai et al., "Stabilizing transformer training by preventing attention entropy collapse," in ICML, 2023.
  19. [19] A. Noci et al., "Signal propagation in transformers: Theoretical perspectives and the role of rank collapse," NeurIPS, 2022.
  20. [20] M. Pezeshki et al., "Gradient starvation: A learning proclivity in neural networks," NeurIPS, 2021.
  21. [21] A. Kossale et al., "Mode collapse in generative adversarial networks: An overview," in ICOA, IEEE, 2022, pp. 1–6.
  22. [22] Y. He et al., "Soft filter pruning for accelerating deep convolutional neural networks," IJCAI, 2018.

  23. [23] BiMaPLe: Bidirectional Cross-Modal Prompt Coupling. BiMaPLe extends MaPLe by introducing bidirectional coupling between textual and visual prompts. For each transformer layer d, we maintain textual prompts P_t^(d) ∈ R^(N×D_t) and visual prompts P_v^(d) ∈ R^(N×D_v). Two lightweight mapping networks translate prompts between modalities: P̃_v^(d) = f_(L→V)^(d)(P_t^(d)) …

  24. [24] AdaptiveBiMaPLe: Prompt Length and Depth Gating. AdaptiveBiMaPLe augments BiMaPLe with two gating mechanisms. (a) Length gating: for each token p_i^(d) at layer d, we introduce a learnable scalar gate g_i^(d), giving p̃_i^(d) = σ(g_i^(d)) · p_i^(d); the effective prompt length is measured as L_eff^(d) = Σ_{i=1..N_max} σ(g_i^(d)). (b) Depth gating: for each insertion depth …
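The length-gating rule quoted in this excerpt is a per-token sigmoid rescaling; a toy re-implementation (1-D lists standing in for token embeddings, not the authors' code):

```python
# Length gating per the excerpt: each prompt token p_i is scaled by its
# sigmoid gate, p̃_i = σ(g_i) · p_i. A gate logit of 0 halves the token;
# a large positive logit passes it through almost unchanged.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_tokens(tokens, gate_logits):
    """Scale each token (a list of floats) by its scalar sigmoid gate."""
    return [[sigmoid(g) * x for x in tok] for tok, g in zip(tokens, gate_logits)]

tokens = [[1.0, -2.0], [0.5, 0.5]]
gated = gate_tokens(tokens, [0.0, 10.0])  # σ(0) = 0.5, σ(10) ≈ 1
```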

  25. [25] Cross-Model Gating Variants for CoOp and CoCoOp. To test whether the observed failure mode is architecture-specific, we implemented prompt-level gating in CoOp and CoCoOp. (a) CoOp-gating: for learnable context tokens C = [C_1, …, C_M], we introduce per-token scalar gates: C̃_i = σ(g_i) · C_i, with L_eff = Σ_{i=1..M} σ(g_i). (b) CoCoOp-gating: for instance-conditioned …

  26. [26] Datasets, Backbone, and Optimization. We follow the standard few-shot prompt-learning protocol used in prior CLIP-based work. All experiments use a frozen CLIP ViT-B/16 backbone and are trained on base classes in the 16-shot setting, with evaluation on both base and novel classes. We report results on three classification benchmarks of different scales an…

  27. [27] Training Horizon, Variance, and Reproducibility. All methods in the main paper follow the standard training schedules used by the original prompt-learning baselines. In response to reviewer concerns about training horizon, we additionally extended representative runs up to 10 epochs. We consistently observed that gate gradients decay early and then remain …

  28. [28] Standard Ablations. We perform two types of standard ablations on ImageNet. (a) Regularization weights: we vary λ_sparse, λ_smooth, and λ_cyc around their default values; across all settings, the harmonic mean changes by less than 0.5%, indicating that the final performance is largely insensitive to these hyperparameters. (b) Gating configurations: we compare…

  29. [29] Gradient Balancing Attempts. To test whether the observed failure can be rescued by standard optimization tricks, we explored several gradient-balancing strategies: learning-rate scaling (assigning a 50× larger learning rate to gate parameters), gradient clipping (max norm 1.0), and alternative initialization (zero, uniform, and biased i…). None of these strategies restored meaningful gate diversity or improved performance beyond the fixed-prompt baselines.

  30. [30] Attempts to Revive Adaptive Gating. We also tested two more direct repair strategies motivated by the diagnosed failure modes. (a) Gradient equilibrium mechanism: we introduced a scaling factor intended to offset sigmoid-induced attenuation, g̃_i = α_i · ∂L/∂g_i, with α_i = 1 / (σ(g_i)(1 − σ(g_i)) + ε), ε = 10⁻⁸, and a maximum gradient scale of 10.0. (b) Entropy regulari…

  31. [31] EuroSAT Controlled Variants. To understand why AdaptiveBiMaPLe performs relatively better on EuroSAT than on ImageNet or Caltech101, we evaluated controlled variants designed to separate adaptive behavior from parameter-count and regularization effects: ParamMatched (remove adaptive gates while adding comparable trainable parameters); AlwaysFrozen (keep g…). The main paper reports that ParamMatched and ExplicitReg achieve comparable or higher performance than the original adaptive model.

  32. [32] On Gradient Ratio and Seed Variance. Reviewer comments asked what magnitude ratio should be considered "healthy" for adaptive behavior. We do not claim a single universal threshold. Instead, we consider gradients healthy when gate parameters receive signals comparable in scale to other trainable parameters and continue to exhibit sustained variation over…

  33. [33] Relation to MoE-Style Gating. Our conclusions should not be conflated with results on gating in mixture-of-experts (MoE) systems. In MoE architectures, gating routes among high-capacity expert networks whose parameters are fully trainable and receive strong gradient signals through deep activation paths. In contrast, our setting studies prompt-level gating…