pith. machine review for the scientific record.

arxiv: 2605.04504 · v1 · submitted 2026-05-06 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

Recognition: 2 theorem links · Lean Theorem

SpecPL: Disentangling Spectral Granularity for Prompt Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:46 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG
keywords prompt learning · vision-language models · spectral granularity · counterfactual supervision · disentanglement · fine-grained discrimination · VAE decomposition
0 comments

The pith

SpecPL improves prompt learning by disentangling spectral granularity in visual signals for better fine-grained discrimination.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing prompt learning methods for vision-language models focus on text tokens but overlook the spectral detail in images needed for precise classification. SpecPL tackles this by decomposing images into low-frequency semantic content and high-frequency granular features using a frozen variational autoencoder. It anchors text prompts to the semantic part and applies counterfactual training that permutes the granular part across images, forcing the model to distinguish granularity from semantic content. This yields stronger performance across benchmarks while maintaining stability, and it works as a simple plug-in to current techniques.
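Figure 2 below describes the split as sliding average pooling over frozen-VAE spatial latents. The following is only a minimal sketch of that decomposition in PyTorch; the kernel size, tensor shapes, and function name are assumptions rather than details taken from the paper:

```python
import torch
import torch.nn.functional as F

def spatial_spectral_proxy(latents: torch.Tensor, kernel_size: int = 3):
    """Split frozen-VAE spatial latents (B, C, H, W) into a low-frequency
    base band (semantic invariants) and a high-frequency residual detail
    band (instance granules) via sliding average pooling."""
    # Low-frequency base: a local average acts as a simple low-pass filter.
    base = F.avg_pool2d(latents, kernel_size, stride=1, padding=kernel_size // 2)
    # High-frequency detail: the residual left after removing the base band.
    detail = latents - base
    return base, detail

# Dummy latents stand in for the output of a frozen VAE encoder.
latents = torch.randn(4, 16, 32, 32)
base, detail = spatial_spectral_proxy(latents)
assert torch.allclose(base + detail, latents)
```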

Core claim

We introduce SpecPL to address modality asymmetry in prompt learning for VLMs. By leveraging a frozen VAE to decompose visual signals into semantic low-frequency bands and granular high-frequency details, and anchoring text representations to low-frequency invariants with a Visual Semantic Bank, the method uses counterfactual granule training via permutation of high-frequency signals. This compels explicit distinction of granularity from semantic invariance, revitalizing text-oriented baselines and achieving 81.51% harmonic-mean accuracy on 11 benchmarks.

What carries the argument

The spectral decomposition via a frozen VAE, combined with Counterfactual Granule Supervision that permutes high-frequency signals to drive fine-grained learning.
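The mechanics of the permutation are not spelled out in the abstract. A minimal sketch, assuming the counterfactual is formed by recombining one sample's base band with another sample's detail band from the same batch (the helper name and batch-wise permutation are assumptions), is:

```python
import torch

def counterfactual_granules(base: torch.Tensor, detail: torch.Tensor) -> torch.Tensor:
    """Pair each sample's low-frequency base band with another sample's
    high-frequency detail band, yielding counterfactual features whose
    semantics are unchanged but whose granularity is swapped."""
    perm = torch.randperm(detail.size(0), device=detail.device)
    return base + detail[perm]

# The counterfactual granule loss (weighted by λ3 = 0.1 per Figure 7) would
# then contrast these swapped features against the originals; the exact loss
# form is not reproduced here.
```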

If this is right

  • Serves as a universal plug-and-play booster for methods like CoOp and MaPLe.
  • Achieves a new state-of-the-art harmonic-mean accuracy of 81.51% across 11 benchmarks.
  • Bridges the stability-generalization trade-off through spectral disentanglement and counterfactual supervision.
  • Enables better fine-grained discrimination by explicitly separating granularity from semantic content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The spectral approach could be adapted to other frequency-based decompositions in multimodal learning tasks.
  • Future work might explore dynamic rather than frozen decomposers to further optimize the separation.
  • This suggests that explicit supervision on different frequency bands may uncover more about how models encode invariance versus detail.

Load-bearing premise

The frozen VAE reliably decomposes visual signals into semantic low-frequency bands and granular high-frequency details, and permuting the high-frequency signals compels the model to distinguish granularity from semantic invariance without introducing artifacts or instability.

What would settle it

A significant drop in performance on fine-grained tasks when the high-frequency permutation is omitted or when the VAE decomposition fails to align with semantic content would falsify the effectiveness of the counterfactual supervision mechanism.

Figures

Figures reproduced from arXiv: 2605.04504 by Feiyang Huang, Jingtao Zhou, Lai-Man Po, Xirui Kang.

Figure 1
Figure 1: Breaking the Modality Asymmetry. (a) Existing Methods suffer from severe modality asymmetry: optimization is heavily concentrated on the textual side (often augmented by noisy external knowledge like LLMs), while visual representations remain holistic and static (indicated by the red bolt). (b) Ours (SpecPL) bridges this gap via spectral disentanglement. We introduce a Spatial-Spectral Proxy to decompose … view at source ↗
Figure 2
Figure 2: SpecPL framework (train vs. inference). Given an input image, a frozen pretrained VAE encoder provides spatial latents that are factorized by a lightweight Spatial–Spectral Proxy (sliding average pooling) into a low-frequency base band (semantic invariants) and a high-frequency residual detail band (instance granules). Two lightweight projection heads map the two bands into the CLIP embedding space, produc… view at source ↗
Figure 3
Figure 3: Visualization of SpecPL’s spatial–spectral proxy decomposition. For each example, we show the original image, the full representation (ALL), and the disentangled Base (low-frequency proxy) and Detail (high-frequency proxy) components. The Base captures stable semantic invariants and global structure (e.g., shape/topology), while the Detail highlights fine-grained textures and local variations that provide … view at source ↗
Figure 4
Figure 4: Mean base–novel generalization gap (%). G(%) = 100 × (Base − Novel)/Base averaged over 11 datasets (lower is better). We also report relative gap reduction (%) w.r.t. each baseline. Spectral Diagnostic of Base–Detail Separability. We evaluate the suitability of a frozen VAE as a granularity teacher by measuring Base/Detail spectral separability under the same Route-A decomposition. We quantify mixing by s… view at source ↗
Figure 5
Figure 5: shows substantially lower overlap for VAE than CLIP (0.270 vs. 0.488), supporting cleaner Base/Detail separation in the VAE manifold view at source ↗
Figure 6
Figure 6: Hyperparameter sensitivity across five datasets. We report Base, Novel, and HM. The ablations quantify the contribution of each SpecPL component, and the sensitivity plots vary key hyperparameters (e.g., bank size, bank temperature, and shared input modality). view at source ↗
Figure 7
Figure 7: Training convergence on DTD. We plot epoch-wise optimization curves for CoOp + SpecPL under the 16-shot base-to-novel setting. The leftmost panel summarizes the total objective L = L_cls + λ1·L_sem + λ2·L_fg + λ3·L_cfg, and the remaining panels show the individual loss terms. The counterfactual granule loss is shown before weighting; its contribution to the total objective is scaled by λ3 = 0.1. view at source ↗
Figure 8
Figure 8: Few-shot HM under the base-to-novel protocol on 11 datasets. HM curves for CoOp and CoOp + SpecPL under 1/2/4/8/16-shot settings view at source ↗
Figure 9
Figure 9: Few-shot Base/Novel/HM performance under the base-to-novel protocol on 11 datasets. We report full Base, Novel, and HM curves to show how SpecPL affects both base-class fitting and novel-class transfer. view at source ↗
Figure 10
Figure 10: Conventional all-class few-shot performance on 11 datasets. Unlike the base-to-novel setting in Figures 8 and 9, this protocol samples few-shot training data from all categories and evaluates on the full label space. We report all-class performance curves for CoOp and CoOp + SpecPL under 1/2/4/8/16-shot settings. view at source ↗
Figure 11
Figure 11: Spectral energy curves across all datasets. Each subplot reports the overlap scores (VAE/CLIP) in the title; one shared legend is shown on top. view at source ↗
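For readability, the two quantities quoted in the captions above can be written out. The loss-term subscripts follow the Figure 7 caption; the individual terms are not defined in the captions themselves:

```latex
% Mean base-novel generalization gap from the Figure 4 caption (lower is better):
G(\%) = 100 \times \frac{\mathrm{Base} - \mathrm{Novel}}{\mathrm{Base}}

% Total training objective from the Figure 7 caption, with \lambda_3 = 0.1
% weighting the counterfactual granule term:
\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda_1 \mathcal{L}_{\mathrm{sem}}
            + \lambda_2 \mathcal{L}_{\mathrm{fg}} + \lambda_3 \mathcal{L}_{\mathrm{cfg}}
```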
read the original abstract

Existing prompt learning for VLMs exhibits a modality asymmetry, predominantly optimizing text tokens while still relying on frozen visual encoder as holistic extractor and neglecting the spectral granularity essential for fine-grained discrimination. To bridge this, we introduce Disentangling Spectral Granularity for Prompt Learning (SpecPL), which approaches prompt learning from a novel spectral perspective via Counterfactual Granule Supervision. Specifically, we leverage a frozen VAE to decompose visual signals into semantic low-frequency bands and granular high-frequency details. A frozen Visual Semantic Bank anchors text representations to universal low-frequency invariants, mitigating overfitting. Crucially, fine-grained discrimination is driven by counterfactual granule training: by permuting high-frequency signals, we compel the model to explicitly distinguish visual granularity from semantic invariance. Uniquely, SpecPL serves as a universal plug-and-play booster, revitalizing text-oriented baselines like CoOp and MaPLe via visual-side guidance. Experiments on 11 benchmarks demonstrate competitive state-of-the-art performance, achieving a new performance ceiling of 81.51% harmonic-mean accuracy. These results validate that spectral disentanglement with counterfactual supervision effectively bridges the gap in the stability-generalization trade-off. Code is released at https://github.com/Mlrac1e/SpecPL-Prompt-Learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce SpecPL for disentangling spectral granularity in prompt learning for VLMs. It leverages a frozen VAE to decompose visual signals into semantic low-frequency bands and granular high-frequency details, uses a Visual Semantic Bank to anchor invariants, and employs counterfactual supervision by permuting high-frequency signals to force distinction between granularity and semantic invariance. The method is presented as a plug-and-play addition to baselines like CoOp and MaPLe, with experiments on 11 benchmarks showing competitive SOTA performance and a new 81.51% harmonic mean accuracy, validating the bridging of stability-generalization trade-off.

Significance. Should the core assumptions prove correct and the experimental results be robustly supported, this could represent a meaningful advance in prompt learning by incorporating visual spectral information to mitigate overfitting and improve fine-grained performance. The universal applicability as a booster and code release add to its potential impact. However, the significance is tempered by the need to confirm that the VAE-based decomposition achieves the intended separation without introducing confounding factors.

major comments (2)
  1. The abstract states competitive SOTA results and a new 81.51% harmonic mean but provides no information on baselines, ablations, statistical significance, error bars, or data splits; this makes it impossible to evaluate the central claim of bridging the stability-generalization trade-off from the given text.
  2. The assumption that a frozen VAE reliably decomposes visual signals into semantic low-frequency bands and granular high-frequency details, with permuting the high-frequency signals compelling the model to distinguish granularity from semantic invariance, is not justified; VAEs are typically optimized for reconstruction rather than explicit frequency-semantic separation, and the manuscript must demonstrate that this does not lead to semantic leakage or artifacts.
minor comments (1)
  1. The term 'harmonic-mean accuracy' should be clarified: it presumably denotes the standard harmonic mean of base and novel accuracy used in the prompt-learning literature, but the abstract does not say which metric is meant.
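If 'harmonic-mean accuracy' follows the usual base-to-novel convention in this literature (an assumption; the abstract does not define it), the per-dataset metric would be:

```latex
% Standard base-to-novel harmonic mean (assumed definition, not confirmed by
% the abstract): Base = accuracy on seen classes, Novel = accuracy on unseen classes.
\mathrm{HM} = \frac{2 \cdot \mathrm{Base} \cdot \mathrm{Novel}}{\mathrm{Base} + \mathrm{Novel}}
```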

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, providing clarifications from the full manuscript and committing to targeted revisions where appropriate to strengthen the presentation and validation of our claims.

read point-by-point responses
  1. Referee: The abstract states competitive SOTA results and a new 81.51% harmonic mean but provides no information on baselines, ablations, statistical significance, error bars, or data splits; this makes it impossible to evaluate the central claim of bridging the stability-generalization trade-off from the given text.

    Authors: We acknowledge that the abstract's brevity precludes inclusion of full experimental metadata. The complete manuscript reports all baselines (CoOp, MaPLe, and recent prompt-learning methods), component ablations, multi-run statistics with error bars, and standard data splits across the 11 benchmarks in Sections 4 and 5, with the 81.51% harmonic mean computed directly from these results. To improve accessibility, we will revise the abstract to include a brief reference to the evaluation setup and the magnitude of improvement over prior approaches. revision: partial

  2. Referee: The assumption that a frozen VAE reliably decomposes visual signals into semantic low-frequency bands and granular high-frequency details, with permuting the high-frequency signals compelling the model to distinguish granularity from semantic invariance, is not justified; VAEs are typically optimized for reconstruction rather than explicit frequency-semantic separation, and the manuscript must demonstrate that this does not lead to semantic leakage or artifacts.

    Authors: We agree that explicit validation of the decomposition is necessary. While VAEs are reconstruction-focused, their latent representations have been shown in prior frequency-analysis studies to encode semantic content preferentially in lower-frequency components. Our counterfactual high-frequency permutation is intended to isolate granularity effects while the Visual Semantic Bank anchors low-frequency invariants. In the revised manuscript we will add dedicated analyses, including frequency-band visualizations, semantic similarity metrics between original and permuted features, and reconstruction comparisons, to confirm minimal semantic leakage and the effectiveness of the separation. revision: yes
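The 'semantic similarity metrics between original and permuted features' promised here are not specified. One minimal realization, assuming the comparison happens on CLIP-space embeddings (the function name and inputs below are hypothetical), would be:

```python
import torch
import torch.nn.functional as F

def semantic_retention(orig_feats: torch.Tensor, permuted_feats: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between embeddings of the original features and
    the detail-permuted (counterfactual) features; values near 1 indicate the
    low-frequency semantic content survives the high-frequency swap."""
    return F.cosine_similarity(orig_feats, permuted_feats, dim=-1).mean()
```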

Circularity Check

0 steps flagged

No circularity: method relies on external frozen components and benchmark validation

full rationale

The paper introduces SpecPL as a plug-and-play approach that uses a frozen VAE for low/high-frequency decomposition of visual signals, a frozen Visual Semantic Bank to anchor invariants, and high-frequency permutation for counterfactual granule supervision. No equations, derivations, or fitted parameters are described that reduce the claimed harmonic-mean accuracy gains or stability-generalization bridging to inputs by construction. Performance is validated externally via experiments on 11 benchmarks rather than internal self-referential loops. No load-bearing self-citations or ansatzes imported from prior author work are evident in the provided description.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Based solely on the abstract, the central claim rests on the domain assumption that a frozen VAE can separate semantic low-frequency content from granular high-frequency details and that counterfactual permutation of the latter will produce useful supervision signals. No free parameters or invented entities with independent evidence are explicitly quantified in the provided text.

axioms (2)
  • domain assumption A frozen VAE can decompose visual signals into semantic low-frequency bands and granular high-frequency details
    Invoked to justify the spectral decomposition step in the method description
  • ad hoc to paper Permuting high-frequency signals creates effective counterfactual examples that force distinction between granularity and semantic invariance
    Core mechanism of the counterfactual granule supervision
invented entities (2)
  • Counterfactual Granule Supervision no independent evidence
    purpose: Drive fine-grained discrimination by compelling the model to distinguish visual granularity from semantic invariance
    New training paradigm introduced to address the modality asymmetry
  • Visual Semantic Bank no independent evidence
    purpose: Anchor text representations to universal low-frequency invariants to mitigate overfitting
    Frozen component used to stabilize text prompts

pith-pipeline@v0.9.0 · 5531 in / 1644 out tokens · 34980 ms · 2026-05-08T17:46:31.840443+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 7 canonical work pages · 3 internal anchors
