arxiv: 2604.03980 · v1 · submitted 2026-04-05 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Gram-Anchored Prompt Learning for Vision-Language Models via Second-Order Statistics

Minglei Chen , Weilong Wang , Jiang Duan , Ye Deng

Authors on Pith no claims yet

Pith reviewed 2026-05-14 22:15 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords prompt learningvision-language modelsGram matricessecond-order statisticsdomain adaptationparameter-efficient learning

0 comments

The pith

Anchoring text prompts to second-order Gram matrices lets vision-language models adapt more robustly to domain shifts than first-order features alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Gram-Anchored Prompt Learning (GAPL) that augments standard first-order prompt alignment in vision-language models with a second-order statistical stream based on Gram matrices. This adds global structural consistency to the usual local semantic alignment, addressing the claim that spatially entangled first-order features are too easily disrupted by domain shifts and noise. A sympathetic reader would care because the approach promises more reliable parameter-efficient adaptation without retraining the full model when data distributions change across tasks or environments.

Core claim

By introducing an additional second-order statistical stream via Gram matrices that augments the standard first-order spatial interaction and anchoring prompts to these second-order priors, the approach enables language representations to dynamically adapt to statistical distribution shifts across diverse domains.

What carries the argument

Gram matrices capturing second-order statistics of visual features, used to anchor prompts and supply global structural consistency alongside local first-order interactions.

If this is right

Prompts anchored to Gram matrices enable dynamic adaptation of language representations to distribution shifts.
The method combines local semantic alignment with global structural consistency for more robust VLM adaptation.
Experiments demonstrate effectiveness of the second-order stream on various downstream benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same Gram-anchoring idea could be tested in other parameter-efficient tuning settings outside vision-language models.
One could measure whether the consistency benefit scales with the severity of domain shift in controlled synthetic experiments.
Future variants might combine Gram matrices with additional higher-order statistics to further strengthen the prior.

Load-bearing premise

First-order spatial features are highly susceptible to domain shifts and local noise while second-order Gram matrices supply sufficient global structural consistency to overcome those limitations.

What would settle it

An experiment in which GAPL shows no performance gain or outright underperforms standard first-order prompt methods on benchmarks with documented domain shifts would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.03980 by Jiang Duan, Minglei Chen, Weilong Wang, Ye Deng.

**Figure 2.** Figure 2: Overview of the proposed Gram-Anchored Prompt Learning (GAPL) framework. Learnable prompt tokens are inserted into the text encoder, [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Detailed architecture of the Gram-based Style Modulator and the Contextual Modulator. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Visual specificity comparison (query point on the dog’s ear). [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: t-SNE visualization of latent manifolds across four domains for [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Quantitative analysis of domain alignment. We visualize the [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

read the original abstract

Parameter-efficient prompt learning has become the de facto standard for adapting Vision-Language Models (VLMs) to downstream tasks. Existing approaches predominantly focus on aligning text prompts with first-order visual features (i.e., spatial feature maps). While effective for fine-grained semantic discrimination, we argue that relying solely on first-order information is insufficient for robust adaptation, as these spatially entangled features are highly susceptible to domain shifts and local noise. In this work, we propose \textbf{Gram-Anchored Prompt Learning (GAPL)} for Vision-Language Models via Second-Order Statistics, a framework that synergizes local semantic alignment with global structural consistency. Methodologically, we introduce an additional second-order statistical stream via \textbf{Gram matrices} that augments the standard first-order spatial interaction. By anchoring prompts to these second-order priors, our approach enables language representations to dynamically adapt to statistical distribution shifts across diverse domains. Extensive experiments indicate the effectiveness of the second-order features, and show compelling performances of GAPL on various benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GAPL adds a Gram-matrix second-order stream to standard prompt tuning for VLMs, but the gains look incremental and the abstract gives almost no implementation or result details.

read the letter

The core move is to keep the usual first-order prompt alignment but add a parallel second-order path that uses Gram matrices to capture global structure. The authors argue this helps prompts stay stable when domains shift or local noise appears. That specific pairing is not in the cited prior work, so the named framework is new even if both pieces are established techniques elsewhere. The motivation is straightforward and the abstract explains it plainly: first-order features get tangled in spatial details that break under distribution change, while second-order stats supply a more consistent prior to anchor the language side. If the experiments back that up, it could be a useful tweak for people doing domain adaptation with frozen VLMs. The main weakness is that the abstract supplies no equations, no ablation tables, and no concrete numbers on the benchmarks it claims to beat. Without seeing how the second-order stream is actually fused or how large the reported lifts are, it is difficult to know whether the extra term is doing real work or just riding along with standard prompt tuning. The assumption that Gram matrices will reliably overcome first-order fragility is plausible on paper but needs the full results to evaluate. This is the kind of paper that belongs in a reading group for people already running prompt experiments on VLMs; they can decide whether the extra stream is worth coding up. I would send it to peer review because the idea is clear, the motivation is reasonable, and the authors appear to have run the necessary experiments even if the abstract hides the details.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Gram-Anchored Prompt Learning (GAPL) for Vision-Language Models. It argues that first-order spatial features are susceptible to domain shifts and local noise, and introduces a second-order stream based on Gram matrices to supply global structural consistency. Prompts are anchored to these second-order priors to enable dynamic adaptation across domains. The work claims that this augmentation yields effective performance on diverse benchmarks.

Significance. If the experimental results hold, the contribution is of moderate significance for parameter-efficient adaptation of VLMs. Gram matrices are a standard tool for capturing feature correlations, and their use to anchor prompts represents a reasonable extension from texture and style modeling in computer vision. The separation of local semantic alignment from global structural consistency is a clear conceptual contribution. The work would benefit from explicit quantification of gains over strong first-order baselines to establish the incremental value of the second-order term.

major comments (2)

[§3.2] §3.2: The precise mechanism by which the Gram-matrix stream is combined with the first-order prompt objective is underspecified. It is unclear whether the second-order term enters as an additive regularizer, a feature concatenation, or a separate optimization branch; without the explicit loss equation the parameter-efficiency claim cannot be verified.
[§4.2, Table 2] §4.2, Table 2: The ablation isolating the Gram-matrix component reports only aggregate accuracy; the per-domain delta relative to the first-order-only ablation is not shown, leaving open whether the reported gains are driven by the second-order prior or by additional hyper-parameter tuning.

minor comments (2)

[Abstract] Abstract: The abstract asserts effectiveness on benchmarks but supplies no numerical results or dataset names; adding one or two representative accuracy figures would improve immediate readability.
[§2] §2: The related-work discussion of first-order prompt methods (CoOp, CoCoOp) is adequate but would benefit from a short table contrasting their loss formulations with the proposed second-order term.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive assessment of our work. We address each major comment below and will revise the manuscript accordingly to improve clarity and strengthen the evidence.

read point-by-point responses

Referee: [§3.2] §3.2: The precise mechanism by which the Gram-matrix stream is combined with the first-order prompt objective is underspecified. It is unclear whether the second-order term enters as an additive regularizer, a feature concatenation, or a separate optimization branch; without the explicit loss equation the parameter-efficiency claim cannot be verified.

Authors: We thank the referee for highlighting this lack of clarity. The second-order Gram-matrix stream is incorporated as an additive regularizer to the standard first-order prompt learning objective. The total loss is defined as L_total = L_first-order + λ L_Gram, where L_Gram is the alignment loss between the learnable prompt embeddings and the second-order Gram matrix statistics computed from the visual features. This formulation introduces no extra trainable parameters beyond the prompts themselves, preserving parameter efficiency. We will add the explicit loss equation and a brief derivation in §3.2 of the revised manuscript. revision: yes
Referee: [§4.2, Table 2] §4.2, Table 2: The ablation isolating the Gram-matrix component reports only aggregate accuracy; the per-domain delta relative to the first-order-only ablation is not shown, leaving open whether the reported gains are driven by the second-order prior or by additional hyper-parameter tuning.

Authors: We agree that per-domain deltas would provide more transparent evidence. All ablations were performed with identical hyper-parameters and training protocols to isolate the contribution of the Gram-matrix term. In the revised version we will augment Table 2 (or add a supplementary table) with per-domain accuracy deltas between the full GAPL model and the first-order-only baseline, confirming that the observed improvements are attributable to the second-order prior rather than hyper-parameter variation. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces Gram-Anchored Prompt Learning as a methodological augmentation that adds a second-order Gram matrix stream to standard first-order prompt alignment in VLMs. No equations are presented that define a prediction or result in terms of a fitted parameter derived from the same quantity, nor are there self-citations used as load-bearing uniqueness theorems that reduce the central claim to prior author work by construction. The argument rests on the independent assumption that second-order statistics supply global consistency, supported by claimed experimental results on benchmarks rather than any definitional equivalence or renaming of known patterns. The derivation chain is therefore self-contained with no reductions to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; full text required for ledger construction.

pith-pipeline@v0.9.0 · 5475 in / 1069 out tokens · 34183 ms · 2026-05-14T22:15:29.371176+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we construct a Gram-anchored stream that extracts compact descriptors from Gram matrices and uses them to modulate the prompted text representations
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Gram matrices capture global texture and appearance statistics that are relatively stable across style changes

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 2 internal anchors

[1]

Goldfish

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P . Mishkin, J. Clark (a) CLS (b) Contextual (c) Style Categories(color) Zebra Banana Goldfish Ladybug Mushroom Domains(shape) ImageNet ImageNet-Sketch ImageNet-R ImageNet-V2 Fig. 5. t-SNE visualization of latent manifolds across four domains for five selected classes....

work page 2021
[2]

Learning to prompt for vision-language models,

K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,”International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022

work page 2022
[3]

Conditional prompt learning for vision- language models,

——, “Conditional prompt learning for vision- language models,” inProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, 2022, pp. 16 816–16 825

work page 2022
[4]

Visual prompt tuning,

M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim, “Visual prompt tuning,” inEuropean conference on computer vision. Springer, 2022, pp. 709–727

work page 2022
[5]

Self-regulating prompts: Foundational model adaptation without forgetting,

M. U. Khattak, S. T. Wasim, M. Naseer, S. Khan, M.- H. Yang, and F. S. Khan, “Self-regulating prompts: Foundational model adaptation without forgetting,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 15 190–15 200

work page 2023
[6]

Tcp:textual-based class- aware prompt tuning for visual-language model,

H. Yao, R. Zhang, and C. Xu, “Tcp:textual-based class- aware prompt tuning for visual-language model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 23 438–23 448

work page 2024
[7]

Image style transfer using convolutional neural networks,

L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” inPro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. IEEE TRANSACTIONS ON PATTERN ANAL YSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, MONTH 2026 9

work page 2016
[8]

Deep coral: Correlation align- ment for deep domain adaptation,

B. Sun and K. Saenko, “Deep coral: Correlation align- ment for deep domain adaptation,” inEuropean confer- ence on computer vision. Springer, 2016, pp. 443–450

work page 2016
[9]

Scaling up vi- sual and vision-language representation learning with noisy text supervision,

C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up vi- sual and vision-language representation learning with noisy text supervision,” inInternational conference on machine learning. PMLR, 2021, pp. 4904–4916

work page 2021
[10]

Lit: Zero-shot transfer with locked-image text tuning,

X. Zhai, X. Wang, B. Mustafa, A. Steiner, D. Keysers, A. Kolesnikov, and L. Beyer, “Lit: Zero-shot transfer with locked-image text tuning,” inProceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, 2022, pp. 18 123–18 133

work page 2022
[11]

Visual-language prompt tuning with knowledge-guided context optimization,

H. Yao, R. Zhang, and C. Xu, “Visual-language prompt tuning with knowledge-guided context optimization,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 6757–6767

work page 2023
[12]

Plot: Prompt learning with optimal transport for vision-language models,

G. Chen, W. Yao, X. Song, X. Li, Y. Rao, and K. Zhang, “Plot: Prompt learning with optimal transport for vision-language models,” inThe Eleventh International Conference on Learning Representations

work page
[13]

Gallop: Learning global and local prompts for vision-language models,

M. Lafon, E. Ramzi, C. Rambour, N. Audebert, and N. Thome, “Gallop: Learning global and local prompts for vision-language models,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 264–282

work page 2024
[14]

Hierarchical cross-modal prompt learning for vision- language models,

H. Zheng, S. Yang, Z. He, J. Yang, and Z. Huang, “Hierarchical cross-modal prompt learning for vision- language models,” inProceedings of the IEEE/CVF Inter- national Conference on Computer Vision (ICCV), October 2025, pp. 1891–1901

work page 2025
[15]

Maple: Multi-modal prompt learning,

M. U. Khattak, H. Rasheed, M. Maaz, S. Khan, and F. S. Khan, “Maple: Multi-modal prompt learning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 19 113–19 122

work page 2023
[16]

Multi-modal adapter for vision-language models,

D. Seputis, S. Mihailov, S. Chatterjee, and Z. Xiao, “Multi-modal adapter for vision-language models,” arXiv preprint arXiv:2409.02958, 2024

work page arXiv 2024
[17]

Consistency-guided prompt learning for vision-language models,

S. Roy and A. Etemad, “Consistency-guided prompt learning for vision-language models,”arXiv preprint arXiv:2306.01195, 2023

work page arXiv 2023
[18]

Mmrl: Multi-modal representation learning for vision-language models,

Y. Guo and X. Gu, “Mmrl: Multi-modal representation learning for vision-language models,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 25 015–25 025

work page 2025
[19]

Dualcoop: Fast adap- tation to multi-label recognition with limited annota- tions,

X. Sun, P . Hu, and K. Saenko, “Dualcoop: Fast adap- tation to multi-label recognition with limited annota- tions,”Advances in Neural Information Processing Sys- tems, vol. 35, pp. 30 569–30 582, 2022

work page 2022
[20]

Extract free dense labels from clip,

C. Zhou, C. C. Loy, and B. Dai, “Extract free dense labels from clip,” inEuropean conference on computer vision. Springer, 2022, pp. 696–712

work page 2022
[21]

Imagenet: A large-scale hierarchical image database,

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255

work page 2009
[22]

Learning gener- ative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,

L. Fei-Fei, R. Fergus, and P . Perona, “Learning gener- ative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” in2004 conference on computer vision and pattern recognition workshop. IEEE, 2004, pp. 178–178

work page 2004
[23]

Cats and dogs,

O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawa- har, “Cats and dogs,” in2012 IEEE conference on com- puter vision and pattern recognition. IEEE, 2012, pp. 3498–3505

work page 2012
[24]

3d object representations for fine-grained categorization,

J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” inPro- ceedings of the IEEE international conference on computer vision workshops, 2013, pp. 554–561

work page 2013
[25]

Automated flower classification over a large number of classes,

M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in2008 Sixth Indian conference on computer vision, graphics & image processing. IEEE, 2008, pp. 722–729

work page 2008
[26]

Food- 101–mining discriminative components with random forests,

L. Bossard, M. Guillaumin, and L. Van Gool, “Food- 101–mining discriminative components with random forests,” inEuropean conference on computer vision. Springer, 2014, pp. 446–461

work page 2014
[27]

Fine-Grained Visual Classification of Aircraft

S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of air- craft,”arXiv preprint arXiv:1306.5151, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[28]

Sun database: Large-scale scene recognition from abbey to zoo,

J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Tor- ralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in2010 IEEE computer society con- ference on computer vision and pattern recognition. IEEE, 2010, pp. 3485–3492

work page 2010
[29]

Describing textures in the wild,

M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” inPro- ceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 3606–3613

work page 2014
[30]

Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,

P . Helber, B. Bischke, A. Dengel, and D. Borth, “Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,”IEEE Journal of Selected T opics in Applied Earth Observations and Remote Sensing, vol. 12, no. 7, pp. 2217–2226, 2019

work page 2019
[31]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,”arXiv preprint arXiv:1212.0402, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012
[32]

Do imagenet classifiers generalize to imagenet?

B. Recht, R. Roelofs, L. Schmidt, and V . Shankar, “Do imagenet classifiers generalize to imagenet?” inInter- national conference on machine learning. PMLR, 2019, pp. 5389–5400

work page 2019
[33]

Learning robust global representations by penalizing local pre- dictive power,

H. Wang, S. Ge, Z. Lipton, and E. P . Xing, “Learning robust global representations by penalizing local pre- dictive power,”Advances in neural information processing systems, vol. 32, 2019

work page 2019
[34]

Natural adversarial examples,

D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song, “Natural adversarial examples,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 15 262–15 271

work page 2021
[35]

The many faces of robustness: A critical analysis of out-of-distribution generalization,

D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo et al., “The many faces of robustness: A critical analysis of out-of-distribution generalization,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 8340–8349

work page 2021
[36]

Dept: Decoupled prompt tuning,

J. Zhang, S. Wu, L. Gao, H. T. Shen, and J. Song, “Dept: Decoupled prompt tuning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 12 924–12 933

work page 2024
[37]

Visualizing data using t-sne,

L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,”Journal of machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008

work page 2008