When Adaptation Fails: A Gradient-Based Diagnosis of Collapsed Gating in Vision-Language Prompt Learning
Pith reviewed 2026-05-12 03:15 UTC · model grok-4.3
The pith
Adaptive gates in few-shot vision-language prompt learning collapse to nearly constant outputs and fail to outperform fixed prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Adaptive prompting mechanisms have been proposed to enhance vision-language models by dynamically tailoring prompts to inputs. However, in frozen few-shot prompt learning with CLIP-style backbones, adaptive gates and prompt-selection modules often collapse: they produce nearly constant outputs, contribute negligible gradient signals, and frequently fail to outperform fixed prompts. The underlying causes identified through systematic experiments are gradient magnitude imbalance and gate degradation.
What carries the argument
Gradient magnitude imbalance and gate degradation within adaptive gating and prompt-selection modules.
If this is right
- Indiscriminately adding architectural complexity to parameter-efficient prompt learning should be re-examined.
- Prompt-level adaptive gating is effective only under conditions that avoid gradient imbalance and degradation.
- Fixed prompts remain competitive when adaptive mechanisms lose their ability to vary.
- Training dynamics rather than module design determine whether adaptation succeeds in this regime.
Where Pith is reading between the lines
- Similar collapse patterns may appear in other parameter-efficient fine-tuning methods that rely on learned selectors or gates.
- Techniques that normalize or re-scale gradients during training could prevent the identified failure modes.
- The findings may not hold when backbones are unfrozen or when more shots are available for adaptation.
- Design of future adaptive modules should prioritize stable gradient flow over added expressivity.
Load-bearing premise
That gradient magnitude imbalance and gate degradation are the primary causes of collapse across frozen few-shot setups rather than artifacts of the specific architectures or datasets.
What would settle it
An experiment that equalizes gradient magnitudes across prompt and gate components and checks whether the gates retain input-dependent variation and outperform fixed prompts on multiple datasets.
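The settling experiment above needs an operational definition of "collapsed." A minimal numpy sketch of one such check is below; the function name, inputs, and thresholds are illustrative placeholders, not quantities taken from the paper:

```python
import numpy as np

def gate_health(gate_outputs, gate_grad, prompt_grad,
                var_floor=1e-4, ratio_floor=1e-2):
    """Flag a collapsed gate: near-zero output variance across inputs,
    or gate gradients far smaller in norm than prompt gradients."""
    output_var = float(np.var(gate_outputs))              # input-dependence of the gate
    grad_ratio = float(np.linalg.norm(gate_grad) /
                       (np.linalg.norm(prompt_grad) + 1e-12))  # gradient-scale balance
    return {"output_var": output_var,
            "grad_ratio": grad_ratio,
            "collapsed": output_var < var_floor or grad_ratio < ratio_floor}

# A gate that emits ~0.731 for every input has collapsed to a constant,
# and its gradients are orders of magnitude below the prompt gradients.
report = gate_health(gate_outputs=np.full(128, 0.731),
                     gate_grad=np.array([1e-6, -2e-6]),
                     prompt_grad=np.ones(512))
```

A gate passing this check after gradient magnitudes are equalized across components would be evidence against the paper's causal story; a gate still failing it would support it.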
read the original abstract
Adaptive prompting mechanisms have been proposed to enhance vision-language models by dynamically tailoring prompts to inputs. However, in frozen few-shot prompt learning with CLIP-style backbones, we systematically observe that adaptive gates and prompt-selection modules often collapse: they produce nearly constant outputs, contribute negligible gradient signals, and frequently fail to outperform fixed prompts. To further explore this issue, we present a systematic diagnostic study to uncover the underlying causes and conditions of adaptation failure. Through controlled experiments across datasets and multiple prompt learning architectures, we identify two recurring failure modes: gradient magnitude imbalance and gate degradation. Our findings invite a re-examination of indiscriminately adding architectural complexity in parameter-efficient learning and clarify when prompt-level adaptive gating is, and is not, effective in this regime.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that in frozen few-shot prompt learning with CLIP-style vision-language backbones, adaptive gates and prompt-selection modules frequently collapse, producing nearly constant outputs, negligible gradient signals, and no performance gain over fixed prompts. Through controlled experiments across multiple datasets and prompt-learning architectures, it identifies two recurring failure modes—gradient magnitude imbalance and gate degradation—as the underlying causes, and calls for re-examination of adding adaptive complexity in parameter-efficient learning.
Significance. If the empirical patterns hold, the work provides a useful diagnostic framework for understanding when adaptive prompting succeeds or fails in the frozen CLIP regime. It offers concrete gradient-based evidence that could steer the community away from over-complex gating modules toward simpler or more carefully conditioned designs, particularly valuable given the prevalence of prompt-tuning methods.
minor comments (3)
- Abstract: the statement that adaptive modules 'frequently fail to outperform fixed prompts' would be strengthened by reporting the exact fraction of runs or datasets where this occurs, rather than the qualitative term 'frequently'.
- The description of the two failure modes (gradient magnitude imbalance and gate degradation) is clear in the abstract but would benefit from a short table or figure in the main text that directly contrasts the gradient norms and output variance of adaptive vs. fixed baselines.
- The manuscript would be improved by an explicit statement of the precise few-shot shot counts, learning rates, and prompt lengths used in the controlled experiments, to facilitate exact reproduction.
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our work, as well as the recommendation for minor revision. We appreciate the recognition that the diagnostic framework could help steer the community toward more careful design of adaptive modules in prompt learning.
Circularity Check
Empirical observational study with no derivation chain or circular steps
full rationale
The paper is a diagnostic empirical study based on controlled experiments documenting collapsed gating via observed near-constant outputs and negligible gradients in CLIP-style prompt learning. No mathematical derivation, first-principles prediction, or model is claimed that could reduce by construction to fitted parameters, self-definitions, or self-citations. The two failure modes are presented as recurring patterns from training dynamics across datasets and architectures, with the work framed as an invitation to re-examine rather than a constructed result. The analysis is self-contained as observation and carries no circularity burden.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math — Standard assumptions of gradient-based optimization and training dynamics in neural networks hold for the prompt-learning setups tested.
Reference graph
Works this paper leans on
-
[1]
Learning transferable visual models from natural language supervision,
A. Radford et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021
work page 2021
- [2]
-
[3]
Learning to prompt for vision-language models,
K. Zhou et al., “Learning to prompt for vision-language models,” IJCV, 2022
work page 2022
-
[4]
Conditional prompt learning for vision-language models,
K. Zhou et al., “Conditional prompt learning for vision-language models,” in CVPR, 2022
work page 2022
-
[5]
MaPLe: Multi-modal prompt learning,
M. U. Khattak et al., “MaPLe: Multi-modal prompt learning,” in CVPR, 2023
work page 2023
-
[6]
Imagenet: A large-scale hierarchical image database,
J. Deng et al., “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009
work page 2009
-
[7]
L. Fei-Fei et al., “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories,” in CVPR Workshop, 2004
work page 2004
-
[8]
EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification,
P. Helber et al., “EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification,” JSTARS, 2019
work page 2019
-
[9]
Clip-adapter: Better vision-language models with feature adapters,
P. Gao et al., “Clip-adapter: Better vision-language models with feature adapters,” IJCV, 2024
work page 2024
-
[10]
Vision-language model fine-tuning via simple parameter-efficient modification,
M. Li et al., “Vision-language model fine-tuning via simple parameter-efficient modification,” in EMNLP, 2024, pp. 14394–14410
work page 2024
-
[11]
A parameter-efficient and fine-grained prompt learning for vision-language models,
Y. Guo et al., “A parameter-efficient and fine-grained prompt learning for vision-language models,” in ACL, 2025, pp. 31346–31359
work page 2025
-
[12]
S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, 1997
work page 1997
-
[13]
A. Vaswani et al., “Attention is all you need,” NeurIPS, 2017
work page 2017
-
[14]
ATP-LLaVA: Adaptive token pruning for large vision language models,
X. Ye et al., “ATP-LLaVA: Adaptive token pruning for large vision language models,” in CVPR, 2025
work page 2025
-
[15]
Adaptive gating in mixture-of-experts based language models,
J. Li et al., “Adaptive gating in mixture-of-experts based language models,” EMNLP, 2023
work page 2023
-
[16]
PromptHash: Affinity-prompted collaborative cross-modal learning for adaptive hashing retrieval,
Q. Zou et al., “PromptHash: Affinity-prompted collaborative cross-modal learning for adaptive hashing retrieval,” in CVPR, 2025, pp. 19649–19658
work page 2025
-
[17]
G. Wang et al., “Degap: Dual event-guided adaptive prefixes for templated-based event argument extraction with slot querying,” in COLING, 2025, pp. 7598–7613
work page 2025
-
[18]
Stabilizing transformer training by preventing attention entropy collapse,
S. Zhai et al., “Stabilizing transformer training by preventing attention entropy collapse,” in ICML, 2023
work page 2023
-
[19]
Signal propagation in transformers: Theoretical perspectives and the role of rank collapse,
A. Noci et al., “Signal propagation in transformers: Theoretical perspectives and the role of rank collapse,” NeurIPS, 2022
work page 2022
-
[20]
Gradient starvation: A learning proclivity in neural networks,
M. Pezeshki et al., “Gradient starvation: A learning proclivity in neural networks,” NeurIPS, 2021
work page 2021
-
[21]
Mode collapse in generative adversarial networks: An overview,
A. Kossale et al., “Mode collapse in generative adversarial networks: An overview,” in ICOA. IEEE, 2022, pp. 1–6
work page 2022
-
[22]
Soft filter pruning for accelerating deep convolutional neural networks,
Y. He et al., “Soft filter pruning for accelerating deep convolutional neural networks,” IJCAI, 2018
work page 2018
APPENDIX A. Supplementary Material Overview: This supplementary material provides supporting details for the camera-ready version of the paper. Its purpose is not to introduce a substantially different manuscript, but to document the technical details, ad...
-
[23]
BiMaPLe: Bidirectional Cross-Modal Prompt Coupling: BiMaPLe extends MaPLe by introducing bidirectional coupling between textual and visual prompts. For each transformer layer d, we maintain textual prompts P_t^(d) ∈ ℝ^{N×D_t} and visual prompts P_v^(d) ∈ ℝ^{N×D_v}. Two lightweight mapping networks translate prompts between modalities: P̃_v^(d) = f_{L→V}^(d)(P_t^(d)) ...
-
[24]
AdaptiveBiMaPLe: Prompt Length and Depth Gating: AdaptiveBiMaPLe augments BiMaPLe with two gating mechanisms. a) Length gating: For each token p_i^(d) at layer d, we introduce a learnable scalar gate g_i^(d): p̃_i^(d) = σ(g_i^(d)) · p_i^(d). The effective prompt length is measured as L_eff^(d) = Σ_{i=1}^{N_max} σ(g_i^(d)). b) Depth gating: For each insertion depth...
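The length-gating rule above can be sketched in a few lines of numpy (names and shapes are illustrative; the actual module presumably operates on the prompt tensors inside the frozen CLIP stack):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_prompts(prompts, gate_logits):
    """Per-token scalar gating: p̃_i = σ(g_i) · p_i,
    with effective prompt length L_eff = Σ_i σ(g_i)."""
    gates = sigmoid(gate_logits)        # (N,) one scalar gate per token
    gated = gates[:, None] * prompts    # broadcast gate over the embedding dim
    l_eff = float(gates.sum())          # soft count of "active" tokens
    return gated, l_eff

prompts = np.ones((4, 8))                          # N=4 tokens, D=8 dims
gated, l_eff = gate_prompts(prompts, np.zeros(4))  # σ(0) = 0.5 for every gate
# l_eff → 2.0: four half-open gates
```

Note that when the logits stop moving, every input receives the same gate vector, which is exactly the constant-output collapse the paper diagnoses.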
-
[25]
Cross-Model Gating Variants for CoOp and CoCoOp: To test whether the observed failure mode is architecture-specific, we implemented prompt-level gating in CoOp and CoCoOp. a) CoOp-gating: For learnable context tokens C = [C_1, …, C_M], we introduce per-token scalar gates: C̃_i = σ(g_i) · C_i, L_eff = Σ_{i=1}^{M} σ(g_i). b) CoCoOp-gating: For instance-conditioned ...
-
[26]
Datasets, Backbone, and Optimization: We follow the standard few-shot prompt-learning protocol used in prior CLIP-based work. All experiments use a frozen CLIP ViT-B/16 backbone and are trained on base classes in the 16-shot setting, with evaluation on both base and novel classes. We report results on three classification benchmarks of different scales an...
-
[27]
Training Horizon, Variance, and Reproducibility: All methods in the main paper follow the standard training schedules used by the original prompt-learning baselines. In response to reviewer concerns about training horizon, we additionally extended representative runs up to 10 epochs. We consistently observed that gate gradients decay early and then remain ...
-
[28]
Standard Ablations: We perform two types of standard ablations on ImageNet. a) Regularization weights: We vary λ_sparse, λ_smooth, and λ_cyc around their default values. Across all settings, the harmonic mean changes by less than 0.5%, indicating that the final performance is largely insensitive to these hyperparameters. b) Gating configurations: We compare...
-
[29]
Gradient Balancing Attempts: To test whether the observed failure can be rescued by standard optimization tricks, we explored several gradient-balancing strategies: • Learning-rate scaling: assigning a 50× larger learning rate to gate parameters; • Gradient clipping: clipping gradients with max norm 1.0; • Alternative initialization: zero, uniform, and biased i...
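The first two strategies can be combined in a single update step. This is a hypothetical pure-numpy rendering, not the paper's code; the `gate_lr_scale` value and the name-based parameter grouping are assumptions for illustration:

```python
import numpy as np

def sgd_step(params, grads, base_lr=1e-3, gate_lr_scale=50.0, max_norm=1.0):
    """One SGD step with two of the tricks tried in the paper:
    a 50x larger learning rate for gate parameters and global-norm clipping."""
    # Global gradient clipping with max norm 1.0 (as torch's clip_grad_norm_ does).
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads.values()))
    clip = min(1.0, max_norm / (total + 1e-12))
    for name, g in grads.items():
        lr = base_lr * (gate_lr_scale if name.startswith("gate") else 1.0)
        params[name] = params[name] - lr * clip * g
    return params

params = {"gate_logits": np.zeros(4), "prompt": np.zeros(8)}
grads = {"gate_logits": np.full(4, 1e-5), "prompt": np.full(8, 1e-1)}
params = sgd_step(params, grads)
```

Even with the 50× scale, gate gradients four orders of magnitude smaller than prompt gradients still produce near-zero updates, which is consistent with the paper's report that these tricks do not rescue the gates.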
-
[30]
Attempts to Revive Adaptive Gating: We also tested two more direct repair strategies motivated by the diagnosed failure modes. a) Gradient equilibrium mechanism: We introduced a scaling factor intended to offset sigmoid-induced attenuation: g̃_i = α_i · ∂L/∂g_i, where α_i = 1 / (σ(g_i)(1 − σ(g_i)) + ε), with ε = 10⁻⁸ and a maximum gradient scale of 10.0. b) Entropy regulari...
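The gradient equilibrium mechanism can be sketched directly from the stated formula (numpy; the function name is illustrative, the constants ε = 10⁻⁸ and cap 10.0 are the ones quoted above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def equilibrium_scale(gate_logits, raw_grads, eps=1e-8, max_scale=10.0):
    """Rescale gate gradients by α_i = 1 / (σ(g_i)(1 − σ(g_i)) + ε),
    capped at max_scale, to offset sigmoid-induced attenuation."""
    s = sigmoid(gate_logits)
    alpha = np.minimum(1.0 / (s * (1.0 - s) + eps), max_scale)
    return alpha * raw_grads

# At g = 0 the sigmoid derivative is 0.25, so α ≈ 4; at a saturated
# gate (g = 6) the true compensation would be ≈ 405, but the cap holds it at 10.
scaled = equilibrium_scale(np.array([0.0, 6.0]), np.array([1.0, 1.0]))
```

The cap illustrates why this repair can fail: once a gate saturates, the attenuation grows exponentially while the compensation is bounded.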
-
[31]
EuroSAT Controlled Variants: To understand why AdaptiveBiMaPLe performs relatively better on EuroSAT than on ImageNet or Caltech101, we evaluated controlled variants designed to separate adaptive behavior from parameter-count and regularization effects: • ParamMatched: remove adaptive gates while adding comparable trainable parameters; • AlwaysFrozen: keep g...
-
[32]
On Gradient Ratio and Seed Variance: Reviewer comments asked what magnitude ratio should be considered “healthy” for adaptive behavior. We do not claim a single universal threshold. Instead, we consider gradients healthy when gate parameters receive signals comparable in scale to other trainable parameters and continue to exhibit sustained variation over...
-
[33]
Relation to MoE-Style Gating: Our conclusions should not be conflated with results on gating in mixture-of-experts (MoE) systems. In MoE architectures, gating routes among high-capacity expert networks whose parameters are fully trainable and receive strong gradient signals through deep activation paths. In contrast, our setting studies prompt-level gating...
discussion (0)