pith. sign in

arxiv: 2605.18464 · v2 · pith:XMIEAJQZnew · submitted 2026-05-18 · 💻 cs.CV

PERL: Parameter Efficient Reasoning in CLIP Latent Space

Pith reviewed 2026-05-20 11:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords parameter-efficient adaptationCLIPlatent reasoningfew-shot learningvision-language modelsrecurrent refinementbase-to-novel generalization
0
0 comments X

The pith

A recurrent reasoning module with roughly 6K parameters adapts frozen CLIP models for few-shot tasks while keeping open-vocabulary performance intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether adaptation in contrastively trained vision-language models can arise from repeated reasoning steps over latent representations instead of from larger numbers of added parameters. It presents PERL, which freezes the original CLIP weights and adds one small shared module that creates a reasoning token at each step and feeds it into an intermediate encoder layer. The process repeats across several refinement steps to sharpen higher-level semantic features for the target task. A reader would care because the method reaches competitive accuracy on novel classes and transfer benchmarks while using far fewer trainable weights than prompt-based or adapter-based alternatives.

Core claim

PERL augments a frozen CLIP model with a compact shared reasoning module applied recurrently across refinement steps. At each step the module produces a latent reasoning token conditioned on the current representation and injects it into an intermediate encoder layer, progressively refining higher-level semantic representations while preserving the pretrained multimodal structure.

What carries the argument

a compact shared reasoning module applied recurrently across refinement steps that generates a latent reasoning token and injects it into an intermediate encoder layer

If this is right

  • Strong accuracy on novel classes alongside competitive transfer to new datasets
  • Only about 6K trainable parameters required under few-shot fast adaptation
  • Up to 817 times fewer parameters than the largest compared adaptation methods
  • Best overall parameter-performance trade-off across 15 benchmarks including out-of-distribution ImageNet variants
  • Iterative latent reasoning acts as a complementary mechanism to simply scaling parameter count

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same recurrent-injection pattern could be tested on other frozen multimodal encoders to check whether the low-parameter benefit generalizes beyond CLIP
  • Because the module is shared and small, the approach may allow on-device adaptation where memory for extra weights is limited
  • Future work could measure how many refinement steps are needed before returns diminish, turning the number of iterations into a controllable hyperparameter

Load-bearing premise

Repeated application of the shared reasoning module improves task-specific representations without eroding the frozen CLIP model's original multimodal alignment or creating instability in the latent space.

What would settle it

A measurable drop in zero-shot accuracy on standard CLIP evaluation sets after several refinement steps would show that the recurrent injection harms the pretrained alignment.

Figures

Figures reproduced from arXiv: 2605.18464 by Concetto Spampinato, Daniela Giordano, Matteo Pennisi, Salvatore Calcagno, Simone Carnemolla.

Figure 1
Figure 1. Figure 1: Novel-class accuracy versus trainable parameters for representa￾tive CLIP adaptation methods. PERL achieves competitive novel-class gener￾alization with only 6K trainable parame￾ters, suggesting that iterative latent rea￾soning can complement parameter scal￾ing in CLIP adaptation. We instantiate this idea in PERL (Parameter-Efficient Reasoning in CLIP Latent Space), a lightweight adaptation framework in wh… view at source ↗
Figure 2
Figure 2. Figure 2: PERL architecture. PERL adds iterative latent reasoning to a frozen CLIP model. Each encoder is split into frozen base Bm and refinement Rm modules; a small trainable projector φm, shared across steps, recurrently maps the pooled state into a reasoning token z (k) m appended before the refinement blocks. This trades extra compute for adaptation capacity without adding trainable parameters. Refined CLS/EOS … view at source ↗
Figure 3
Figure 3. Figure 3: Ablation on reasoning configurations. Base, Novel, and Harmonic Mean accuracy w.r.t the injection layer (left) and number of reasoning steps (right). Performance peaks at layer 7 with 4 reasoning iterations [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Per-sample prediction drift. KL di￾vergence from the step-0 distribution at each rea￾soning step, grouped by setting and transition type (Correct-to-Correct, Wrong-to-Correct). Larger values indicate stronger drift. a single corrective update. At the same time, the decreasing Jacobian norm indicates that predictions become progressively more stable and less sensitive to input perturbations. We next analyze… view at source ↗
Figure 6
Figure 6. Figure 6: Wrong-to-correct transition under reasoning. Confidence trajectories (left) and contribu￾tion maps (right) show evidence reallocating from the initial class to semantically relevant regions of the corrected class, reshaping both predictions and visual support. 5 Conclusion This paper investigated whether iterative latent reasoning can serve as a parameter-efficient adaptation mechanism for frozen contrasti… view at source ↗
Figure 7
Figure 7. Figure 7: Evaluation metrics across varying numbers of reasoning steps in the three settings: base-to [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Accuracy and Jacobian norm of the target logit with respect to the input in the base-to-novel [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Accuracy and Jacobian norm of the target logit with respect to the input in the cross-dataset [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Accuracy and Jacobian norm of the target logit with respect to the input in the out-of [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Target confidence and Brier score in the base-to-novel evaluation settings. We report each [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Target confidence and Brier score in the cross-dataset evaluation settings. We report each [PITH_FULL_IMAGE:figures/full_fig_p024_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Target confidence and Brier score in the out-of-distribution evaluation settings. We report [PITH_FULL_IMAGE:figures/full_fig_p025_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Examples of wrong-to-correct refinements through reasoning. The model progressively shifts its confidence from an incorrect initial prediction to the correct class across reasoning steps. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Examples of wrong-to-correct refinements through reasoning. The model progressively shifts its confidence from an incorrect initial prediction to the correct class across reasoning steps. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Examples of wrong-to-correct refinements through reasoning. The model progressively shifts its confidence from an incorrect initial prediction to the correct class across reasoning steps. 28 [PITH_FULL_IMAGE:figures/full_fig_p028_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Examples of wrong-to-correct refinements through reasoning. The model progressively shifts its confidence from an incorrect initial prediction to the correct class across reasoning steps. 29 [PITH_FULL_IMAGE:figures/full_fig_p029_17.png] view at source ↗
read the original abstract

Contrastively trained vision-language models such as CLIP provide strong zero-shot transfer by aligning images and text in a shared embedding space. However, adapting these models to downstream tasks without degrading their open-vocabulary generalization remains challenging. Existing parameter-efficient adaptation methods typically improve task specialization through learned prompts, adapters, or multimodal transformations, where adaptation capacity is primarily expressed through additional trainable parameters. Inspired by recent latent reasoning methods in language models, we investigate a complementary perspective: can adaptation emerge from iterative reasoning on latent representations rather than from increasing parameter count alone? We introduce PERL (Parameter-Efficient Reasoning in CLIP Latent Space), a lightweight adaptation framework that augments a frozen CLIP model with a compact shared reasoning module applied recurrently across refinement steps. At each step, PERL generates a latent reasoning token conditioned on the current representation and injects it into an intermediate encoder layer, progressively refining higher-level semantic representations while preserving CLIP's pretrained multimodal structure. Across 15 benchmarks spanning base-to-novel generalization, cross-dataset transfer, and out-of-distribution ImageNet variants, PERL achieves the best parameter-performance trade-off among the compared methods under a fast-adaptation few-shot setting, combining strong novel-class accuracy and competitive transfer performance with only about 6K trainable parameters, up to 817x fewer than the largest compared approach. Overall, our results suggest that iterative latent reasoning provides a complementary adaptation mechanism to parameter scaling in discriminative vision-language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PERL, a parameter-efficient adaptation framework for frozen CLIP vision-language models. It augments the model with a compact shared reasoning module (~6K trainable parameters) that is applied recurrently: at each refinement step the module generates a latent reasoning token conditioned on the current representation and injects it into an intermediate encoder layer, with the goal of progressively refining higher-level semantics while preserving the pretrained multimodal alignment. The central empirical claim is that this approach achieves the best parameter-performance trade-off across 15 benchmarks (base-to-novel generalization, cross-dataset transfer, and ImageNet OOD variants) in a fast-adaptation few-shot setting, outperforming or matching prior methods while using up to 817× fewer parameters.

Significance. If the performance claims and the preservation of alignment are substantiated, the work would demonstrate that iterative latent-space reasoning can serve as a complementary adaptation mechanism to parameter scaling in discriminative vision-language models, enabling strong generalization with extremely low parameter budgets.

major comments (2)
  1. [Method] Method section (description of recurrent injection): the claim that recurrent application of the shared reasoning module 'progressively refines higher-level semantic representations while preserving CLIP's pretrained multimodal structure' is load-bearing for the central contribution, yet the manuscript provides no direct verification (pre/post cosine similarities between image and text embeddings, or zero-shot transfer ablations before and after refinement steps). Downstream accuracy numbers alone cannot distinguish true reasoning from task-specific drift.
  2. [Experiments] Experiments section (benchmark results): the reported best parameter-performance trade-off rests on comparisons with prior methods, but without an ablation that isolates the effect of the number of recurrent steps (e.g., 1-step vs. 3-step vs. 5-step) on both novel-class accuracy and zero-shot transfer capability, it remains unclear whether additional steps improve or erode the pretrained alignment.
minor comments (2)
  1. [Method] The notation for the latent reasoning token and its injection point should be formalized with an equation or pseudocode block to make the recurrent update rule unambiguous.
  2. [Experiments] Table captions and axis labels in the main results figure should explicitly state the number of shots and the exact parameter counts for each baseline to facilitate direct comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We address each of the major comments below, providing clarifications and indicating where we will incorporate revisions to strengthen the paper.

read point-by-point responses
  1. Referee: [Method] Method section (description of recurrent injection): the claim that recurrent application of the shared reasoning module 'progressively refines higher-level semantic representations while preserving CLIP's pretrained multimodal structure' is load-bearing for the central contribution, yet the manuscript provides no direct verification (pre/post cosine similarities between image and text embeddings, or zero-shot transfer ablations before and after refinement steps). Downstream accuracy numbers alone cannot distinguish true reasoning from task-specific drift.

    Authors: We agree that direct verification of alignment preservation would strengthen the central claim. Although our competitive zero-shot transfer results across multiple benchmarks provide indirect evidence that the pretrained multimodal structure is largely maintained, we acknowledge that explicit metrics such as pre- and post-refinement cosine similarities between image and text embeddings, as well as zero-shot ablations at different refinement steps, would more convincingly distinguish iterative reasoning from potential task-specific drift. We will add these analyses to the revised manuscript, including quantitative measurements of embedding similarities and corresponding zero-shot performance evaluations. revision: yes

  2. Referee: [Experiments] Experiments section (benchmark results): the reported best parameter-performance trade-off rests on comparisons with prior methods, but without an ablation that isolates the effect of the number of recurrent steps (e.g., 1-step vs. 3-step vs. 5-step) on both novel-class accuracy and zero-shot transfer capability, it remains unclear whether additional steps improve or erode the pretrained alignment.

    Authors: We appreciate this suggestion for a more granular ablation study. The current experiments focus on the performance achieved with our selected number of recurrent steps, but to better isolate the contribution of iterative refinement, we will include an ablation varying the number of steps (specifically comparing 1, 3, and 5 steps) and report the impacts on novel-class accuracy as well as zero-shot transfer performance. This will help clarify whether additional steps enhance or potentially degrade the alignment. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method proposal with independent benchmark validation

full rationale

The paper introduces PERL as a lightweight recurrent reasoning module for adapting frozen CLIP models and supports its claims exclusively through empirical results on 15 benchmarks covering base-to-novel generalization, cross-dataset transfer, and OOD variants. No mathematical derivation, first-principles prediction, or uniqueness theorem is presented that reduces to fitted inputs or self-citations by construction. The recurrent injection mechanism is a design choice whose effectiveness is measured by downstream accuracy and parameter count, not by any self-referential loop where outputs are defined as inputs. This is a standard empirical contribution whose central claims remain independent of the paper's own definitions or prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The method depends on the design choice of a compact shared module and the assumption that recurrent token injection refines semantics; no free parameters are explicitly fitted beyond the module size itself, and no new physical entities are postulated.

axioms (1)
  • domain assumption The base CLIP model remains frozen throughout adaptation to preserve its pretrained multimodal structure.
    Stated in the abstract as the foundation for maintaining open-vocabulary generalization.
invented entities (1)
  • latent reasoning token no independent evidence
    purpose: Generated at each step to condition and refine the current representation when injected into an intermediate encoder layer.
    Core mechanism introduced to enable iterative refinement without increasing parameter count.

pith-pipeline@v0.9.0 · 5798 in / 1435 out tokens · 48660 ms · 2026-05-20T11:41:14.878630+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 4 internal anchors

  1. [1]

    Deep equilibrium models.Advances in neural information processing systems, 32, 2019

    Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models.Advances in neural information processing systems, 32, 2019

  2. [2]

    Food-101–mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. InEuropean conference on computer vision, pages 446–461. Springer, 2014

  3. [3]

    Describing textures in the wild

    Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014

  4. [4]

    Uni- versal transformers

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Uni- versal transformers. InInternational Conference on Learning Representations

  5. [5]

    Imagenet: A large- scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  6. [6]

    Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories

    Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE, 2004

  7. [7]

    Clip-adapter: Better vision-language models with feature adapters.International journal of computer vision, 132(2):581–595, 2024

    Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters.International journal of computer vision, 132(2):581–595, 2024

  8. [8]

    Looped transformers as programmable computers

    Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. InInternational Conference on Machine Learning, pages 11398–11442. PMLR, 2023

  9. [9]

    Mmrl: Multi-modal representation learning for vision- language models

    Yuncheng Guo and Xiaodong Gu. Mmrl: Multi-modal representation learning for vision- language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 25015–25025, 2025

  10. [10]

    Mmrl++: Parameter-efficient and interaction-aware repre- sentation learning for vision-language models, 2025

    Yuncheng Guo and Xiaodong Gu. Mmrl++: Parameter-efficient and interaction-aware repre- sentation learning for vision-language models, 2025. URL https://arxiv.org/abs/2505. 10088

  11. [11]

    Training large language models to reason in a continuous latent space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason E Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. InSecond Conference on Language Modeling

  12. [12]

    Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019

  13. [13]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. InProceedings of the IEEE/CVF international conference on computer vision, pages 8340–8349, 2021

  14. [14]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15262–15271, 2021

  15. [15]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. InInternational conference on machine learning, pages 2790–2799. PMLR, 2019

  16. [16]

    Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022. 11

  17. [17]

    Fast quiet-star: Thinking without thought tokens.CoRR, abs/2505.17746, 2025

    Wei Huang, Yizhe Xiong, Xin Ye, Zhijie Deng, Hui Chen, Zijia Lin, and Guiguang Ding. Fast quiet-star: Thinking without thought tokens.CoRR, abs/2505.17746, 2025

  18. [18]

    Maple: Multi-modal prompt learning

    Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fa- had Shahbaz Khan. Maple: Multi-modal prompt learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19113–19122, June 2023

  19. [19]

    Self-regulating prompts: Foundational model adaptation without forgetting

    Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15190–15200, October 2023

  20. [20]

    3d object representations for fine- grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine- grained categorization. InProceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013

  21. [21]

    Fine-Grained Visual Classification of Aircraft

    Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine- grained visual classification of aircraft.arXiv preprint arXiv:1306.5151, 2013

  22. [22]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008

  23. [23]

    Cats and dogs

    Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012

  24. [24]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machin...

  25. [25]

    Do imagenet classifiers generalize to imagenet? InInternational conference on machine learning, pages 5389–5400

    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? InInternational conference on machine learning, pages 5389–5400. PMLR, 2019

  26. [26]

    Reasoning with latent thoughts: On the power of looped transformers

    Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J Reddi. Reasoning with latent thoughts: On the power of looped transformers. InThe Thirteenth International Conference on Learning Representations

  27. [27]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild.arXiv preprint arXiv:1212.0402, 2012

  28. [28]

    Learning robust global repre- sentations by penalizing local predictive power.Advances in neural information processing systems, 32, 2019

    Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global repre- sentations by penalizing local predictive power.Advances in neural information processing systems, 32, 2019

  29. [29]

    Sun database: Large-scale scene recognition from abbey to zoo

    Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE, 2010

  30. [30]

    Progressive visual prompt learning with contrastive feature re-formation,

    Chen Xu, Yuhan Zhu, Haocheng Shen, Fengyuan Shi, Boheng Chen, Yixuan Liao, Xiaoxin Chen, and Limin Wang. Progressive visual prompt learning with contrastive feature re-formation,

  31. [31]

    URLhttps://arxiv.org/abs/2304.08386

  32. [32]

    Mma: Multi-modal adapter for vision-language models

    Lingxiao Yang, Ru-Yuan Zhang, Yanchen Wang, and Xiaohua Xie. Mma: Multi-modal adapter for vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23826–23837, June 2024. 12

  33. [33]

    Visual-language prompt tuning with knowledge- guided context optimization

    Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge- guided context optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6757–6767, June 2023

  34. [34]

    Tcp:textual-based class-aware prompt tuning for visual-language model

    Hantao Yao, Rui Zhang, and Changsheng Xu. Tcp:textual-based class-aware prompt tuning for visual-language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23438–23448, June 2024

  35. [35]

    Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

    Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D Goodman. Quiet-star: Language models can teach themselves to think before speaking.arXiv preprint arXiv:2403.09629, 2024

  36. [36]

    Dept: Decoupled prompt tuning

    Ji Zhang, Shihan Wu, Lianli Gao, Heng Tao Shen, and Jingkuan Song. Dept: Decoupled prompt tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12924–12933, 2024

  37. [37]

    Conditional prompt learning for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16816–16825, June 2022

  38. [38]

    International Journal of Computer Vision , author =

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models.Int. J. Comput. Vis., 130(9):2337–2348, 2022. doi: 10.1007/ S11263-022-01653-1. URLhttps://doi.org/10.1007/s11263-022-01653-1

  39. [39]

    Scaling Latent Reasoning via Looped Language Models

    Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, et al. Scaling latent reasoning via looped language models.arXiv preprint arXiv:2510.25741, 2025. 13 Technical appendices and supplementary material A Implementation Details We use OpenAI CLIP with a ViT-B/16 visual backbone (vision hiddendv =768,...