
arxiv: 2606.25758 · v1 · pith:ZCC2UMY5new · submitted 2026-06-24 · 💻 cs.CV

Dual Distribution Estimation for Zero-shot Noisy Test-Time Adaptation with VLMs

Pith reviewed 2026-06-25 21:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords test-time adaptation · zero-shot learning · out-of-distribution detection · vision-language models · Gaussian distribution estimation · noisy test-time adaptation · distribution modeling

The pith

Training-free Gaussian modeling of frozen VLM features raises harmonic-mean accuracy for noisy test-time adaptation on ImageNet by 3.70 percent and cuts FPR95 for OOD detection by 6.20 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that noisy test-time adaptation of vision-language models can be handled by shifting from instance-level discriminative training to explicit modeling of class-wise Gaussian distributions on frozen model features. It introduces Positive Feature Distribution Estimation (PFDE), which builds inclusion and exclusion Gaussians that yield calibrated contrastive scores for better in-distribution (ID) classification, paired with Negative Label Distribution Estimation (NLDE), which identifies highly discriminative labels and reduces spurious correlations for stronger out-of-distribution (OOD) filtering. This matters because real-world streams mix ID and OOD samples, and prior zero-shot approaches rely on test-time discriminative training, which produces overconfident misclassifications and degrades inference efficiency. The resulting method remains fully training-free and zero-shot, delivering simultaneous gains in accuracy and detection on large-scale benchmarks while preserving online scalability.

Core claim

DDE shifts the zero-shot NTTA paradigm from instance-level learning to training-free Gaussian distribution modeling. PFDE explicitly models class-wise inclusion and exclusion Gaussian distributions from test-batch features to formulate a calibrated contrastive score that robustly enhances ID accuracy. NLDE improves OOD identification by explicitly modeling the negative label distribution to mine highly discriminative labels and mitigate spurious correlations. On the large-scale ImageNet benchmark this yields a 3.70 percent improvement in harmonic mean accuracy and a 6.20 percent reduction in FPR95 for OOD detection while ensuring highly scalable and efficient online inference.

What carries the argument

Dual Distribution Estimation (DDE) via Positive Feature Distribution Estimation (PFDE) and Negative Label Distribution Estimation (NLDE), which fit class-wise Gaussians to frozen VLM test-batch features to produce contrastive inclusion/exclusion scores and mined negative labels.
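The inclusion/exclusion idea can be pictured with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the function names, the shrinkage constant, the use of zero-shot pseudo-labels to split the batch, and the choice to score each class by a log-density difference between a Gaussian fitted to samples assigned to the class (inclusion) and one fitted to the remaining samples (exclusion).

```python
import numpy as np

def fit_gaussian(x, shrink=0.1):
    """Mean and shrunk covariance of the rows of x, shape (n, d)."""
    mu = x.mean(axis=0)
    centered = x - mu
    cov = centered.T @ centered / len(x)
    d = x.shape[1]
    # Shrink toward a scaled identity so the covariance stays invertible
    # even when a class has few samples (hypothetical constant).
    cov = (1 - shrink) * cov + shrink * (np.trace(cov) / d + 1e-6) * np.eye(d)
    return mu, cov

def log_density(x, mu, cov):
    """Log of the Gaussian density N(x; mu, cov) for each row of x."""
    d = x.shape[1]
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    return -0.5 * (maha + logdet + d * np.log(2 * np.pi))

def contrastive_scores(feats, pseudo_labels, num_classes):
    """scores[i, c] = log p_inclusion_c(x_i) - log p_exclusion_c(x_i)."""
    scores = np.full((len(feats), num_classes), -np.inf)
    for c in range(num_classes):
        included = feats[pseudo_labels == c]
        excluded = feats[pseudo_labels != c]
        if len(included) < 2 or len(excluded) < 2:
            continue  # too few samples to fit; the class stays unscored
        scores[:, c] = (log_density(feats, *fit_gaussian(included))
                        - log_density(feats, *fit_gaussian(excluded)))
    return scores
```

In this sketch a sample is assigned to the class with the highest score, and a low maximum score could be read as a hint that the sample is OOD; how the paper calibrates and thresholds its score is not stated on this page.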

If this is right

  • Enables highly scalable and efficient online inference without retraining or post-hoc tuning.
  • Maintains robustness in data-scarce scenarios while remaining zero-shot.
  • Simultaneously improves in-distribution classification accuracy and out-of-distribution detection.
  • Avoids overconfident misclassifications that arise from test-time discriminative training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the Gaussian modeling holds, the same dual-estimation structure could replace instance-level training loops in other online adaptation settings.
  • A testable extension is to measure how performance changes when test-batch size shrinks below the point where reliable Gaussian fits become possible.
  • The approach implies that single-batch feature statistics alone suffice for reliable ID/OOD separation, which would simplify deployment in streaming environments where labeled data never arrives.
  • If the contrastive scores prove stable across domains, similar distribution estimation could be explored for non-vision modalities that also rely on frozen encoders.
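The batch-size question raised in the list above can be probed with a small experiment. The sketch below is an assumption-laden illustration rather than anything taken from the paper: it draws synthetic features, uses hypothetical helper names, and compares the rank of the plain sample covariance with that of a covariance shrunk toward a scaled identity. A sample covariance built from n points has rank at most n - 1, so it cannot be inverted once the batch is smaller than the feature dimension.

```python
import numpy as np

def sample_cov(x):
    """Maximum-likelihood covariance of the rows of x."""
    centered = x - x.mean(axis=0)
    return centered.T @ centered / len(x)

def shrunk_cov(x, alpha=0.2):
    """Blend the sample covariance with a scaled identity (alpha is hypothetical)."""
    cov = sample_cov(x)
    d = cov.shape[0]
    return (1 - alpha) * cov + alpha * (np.trace(cov) / d) * np.eye(d)

def rank_by_batch_size(dim=32, sizes=(8, 16, 64, 256), seed=0):
    """Map batch size -> (rank of sample covariance, rank of shrunk covariance)."""
    rng = np.random.default_rng(seed)
    report = {}
    for n in sizes:
        x = rng.normal(size=(n, dim))  # synthetic stand-in for VLM features
        report[n] = (int(np.linalg.matrix_rank(sample_cov(x))),
                     int(np.linalg.matrix_rank(shrunk_cov(x))))
    return report
```

Real VLM features have hundreds of dimensions, so the per-class sample count needed for a full-rank fit is correspondingly larger than in this toy setting.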

Load-bearing premise

Test-batch features from a frozen VLM can be reliably modeled as class-wise Gaussians whose inclusion/exclusion contrastive scores separate ID from OOD samples without any labeled supervision or post-hoc tuning.

What would settle it

Apply DDE to ImageNet with controlled mixtures of ID and OOD samples and check whether harmonic-mean accuracy gains of 3.70 percent and FPR95 reductions of 6.20 percent disappear relative to prior zero-shot NTTA baselines.
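Both quantities named in this check are simple to compute once per-sample scores are available. The sketch below assumes that the harmonic mean combines ID classification accuracy with OOD detection accuracy, which is a common convention in NTTA evaluation but is not spelled out on this page, and that a higher score means "more ID-like". Function names are hypothetical.

```python
import numpy as np

def harmonic_mean(acc_id, acc_ood):
    """Harmonic mean of two accuracies in [0, 1]."""
    if acc_id + acc_ood == 0:
        return 0.0
    return 2 * acc_id * acc_ood / (acc_id + acc_ood)

def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted when about 95% of ID samples are accepted.

    Assumes higher score = more ID-like.
    """
    # The 5th percentile of ID scores is the threshold that keeps ~95% of ID.
    threshold = np.percentile(np.asarray(id_scores), 5)
    return float(np.mean(np.asarray(ood_scores) >= threshold))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a method cannot score well by excelling at ID classification while ignoring OOD filtering, or the reverse.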

Original abstract

While test-time adaptation (TTA) empowers vision-language models to adapt without costly retraining, it remains highly vulnerable to out-of-distribution (OOD) outliers prevalent in real-world applications. This discrepancy motivates Noisy TTA (NTTA), an online task to filter noisy OOD samples on the fly while maximizing in-distribution (ID) classification accuracy. Existing zero-shot NTTA approaches typically rely on test-time discriminative training, leading to overconfident misclassifications and significantly degraded inference efficiency. To address these limitations, we propose a novel framework named Dual Distribution Estimation (DDE), shifting the zero-shot NTTA paradigm from instance-level learning to training-free Gaussian distribution modeling. DDE incorporates two novel modules: Positive Feature Distribution Estimation (PFDE) and Negative Label Distribution Estimation (NLDE). PFDE explicitly models class-wise inclusion and exclusion Gaussian distributions to formulate a calibrated contrastive score, robustly enhancing ID accuracy. In parallel, NLDE improves OOD identification by explicitly modeling the negative label distribution to mine highly discriminative labels, effectively mitigating spurious correlations. Extensive experiments show that on the large-scale ImageNet benchmark, DDE achieves an improvement of 3.70% in harmonic mean accuracy and reduces the FPR95 for OOD detection by 6.20%, while ensuring highly scalable and efficient online inference. Furthermore, DDE is zero-shot and training-free, demonstrating remarkable robustness in data-scarce scenarios. Codes are available at https://github.com/ZhuWenjie98/DDE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Dual Distribution Estimation (DDE) for zero-shot noisy test-time adaptation (NTTA) with vision-language models. It introduces Positive Feature Distribution Estimation (PFDE) to explicitly model class-wise inclusion/exclusion Gaussian distributions from frozen VLM test-batch features for a calibrated contrastive score, and Negative Label Distribution Estimation (NLDE) to model negative label distributions for mining discriminative labels and mitigating spurious correlations. The central empirical claim is a 3.70% gain in harmonic mean accuracy and 6.20% reduction in FPR95 on ImageNet, with the method positioned as training-free, post-hoc-tuning-free, and scalable for online inference.

Significance. If the Gaussian modeling assumptions hold without supervision, the shift from instance-level discriminative training to explicit dual distribution estimation could meaningfully improve robustness and efficiency in real-world NTTA settings, particularly for data-scarce or OOD-contaminated batches. The training-free nature and reported scalability are notable strengths if supported by reproducible code and ablations.

major comments (2)
  1. The PFDE module's reliance on modeling test-batch features as class-wise Gaussians (whose inclusion/exclusion contrastive scores drive ID accuracy) rests on unsupervised assignment of samples to classes; this creates a potential feedback loop when initial zero-shot logits misassign OOD-contaminated samples, directly undermining the claimed 3.70% harmonic mean gain. No derivation details, sensitivity analysis, or robustness checks against this circularity are evident from the abstract.
  2. The abstract reports quantitative gains (3.70% harmonic mean, 6.20% FPR95) but provides no error bars, statistical significance tests, ablation evidence on PFDE/NLDE components, or full experimental protocol, making it impossible to verify whether the improvements are load-bearing or sensitive to post-hoc choices.
minor comments (1)
  1. The abstract should explicitly name the VLM backbone, test-batch sizes, and any clustering or logit-thresholding steps used for initial class assignment in PFDE/NLDE.
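For concreteness, the kind of initial assignment step this comment asks the authors to specify might look like the following. It is a sketch under assumptions, not the paper's procedure: it uses CLIP-style cosine similarity between image and text embeddings, a temperature of 0.01 (equivalent to a logit scale of 100), and a hypothetical confidence threshold `min_conf` below which a sample is left unassigned with label -1.

```python
import numpy as np

def zero_shot_pseudo_labels(image_feats, text_feats, temperature=0.01, min_conf=0.5):
    """Assign each image to its most similar class prompt, or -1 if unsure."""
    image_feats = np.asarray(image_feats, dtype=float)
    text_feats = np.asarray(text_feats, dtype=float)
    # L2-normalise so the dot product is a cosine similarity.
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    conf = probs.max(axis=1)
    labels = np.where(conf >= min_conf, probs.argmax(axis=1), -1)
    return labels, conf
```

Reporting the threshold and temperature actually used would let readers judge how many OOD samples leak into the class-wise fits.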

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive feedback on our work. We address each major comment below with clarifications from the manuscript and commitments to revisions where appropriate.

Point-by-point responses
  1. Referee: The PFDE module's reliance on modeling test-batch features as class-wise Gaussians (whose inclusion/exclusion contrastive scores drive ID accuracy) rests on unsupervised assignment of samples to classes; this creates a potential feedback loop when initial zero-shot logits misassign OOD-contaminated samples, directly undermining the claimed 3.70% harmonic mean gain. No derivation details, sensitivity analysis, or robustness checks against this circularity are evident from the abstract.

    Authors: The manuscript details in Section 3.2 that PFDE fits class-wise Gaussians directly on the frozen VLM features of the test batch after initial zero-shot logit-based pseudo-labeling, then derives a calibrated contrastive score from the inclusion/exclusion distributions. This is not a closed feedback loop because the Gaussian parameters are estimated once per batch in a training-free manner and the contrastive score explicitly down-weights outliers via the exclusion component; NLDE further mitigates spurious assignments by mining discriminative negative labels. Derivation of the calibrated score appears in Equations (3)–(5). We agree that sensitivity analysis to initial assignment errors is not present and will add it (varying pseudo-label noise levels on ImageNet) in the revision.

    revision: partial

  2. Referee: The abstract reports quantitative gains (3.70% harmonic mean, 6.20% FPR95) but provides no error bars, statistical significance tests, ablation evidence on PFDE/NLDE components, or full experimental protocol, making it impossible to verify whether the improvements are load-bearing or sensitive to post-hoc choices.

    Authors: The full manuscript provides the experimental protocol in Section 4.1, component ablations in Section 4.3 (showing PFDE and NLDE each contribute to the harmonic-mean gain), and results on multiple datasets beyond the abstract. However, the abstract itself omits error bars and significance tests. We will add per-run standard deviations, paired t-test results, and expanded ablation tables to both the abstract and main results in the revised version.

    revision: yes
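The per-run statistics promised in this response are straightforward to compute. The sketch below is illustrative only: the function name is hypothetical, no numbers from the paper are used, and it assumes each run of the method is paired with a baseline run under the same seed or data order. It returns the mean difference, the sample standard deviation of the differences, and the paired t statistic with n - 1 degrees of freedom.

```python
import math
import statistics

def paired_t_statistic(method_scores, baseline_scores):
    """Return (mean difference, std of differences, t statistic) over paired runs."""
    if len(method_scores) != len(baseline_scores):
        raise ValueError("runs must be paired one-to-one")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    if len(diffs) < 2:
        raise ValueError("need at least two paired runs")
    mean_diff = statistics.fmean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if std_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t_stat = mean_diff / (std_diff / math.sqrt(len(diffs)))
    return mean_diff, std_diff, t_stat
```

Turning the t statistic into a p-value needs the Student t distribution; if SciPy is available, `scipy.stats.ttest_rel` performs the whole test in one call.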

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained distribution estimation

Full rationale

The paper's core contribution is explicit Gaussian modeling (PFDE for class-wise inclusion/exclusion and NLDE for negative labels) directly from unlabeled test-batch VLM features to produce contrastive scores. This is the method itself rather than any prediction or result that reduces by construction to fitted inputs or prior self-citations. No equations or steps in the abstract or description exhibit self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations. The approach is presented as training-free and zero-shot, with performance claims tied to empirical benchmarks rather than internal tautologies. This matches the default case of an honest non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The approach rests on the assumption that VLM features admit useful Gaussian modeling and that negative label distributions mined from test data are discriminative; no free parameters or invented physical entities are stated.

axioms (2)
  • domain assumption · Test-batch features from a frozen VLM follow class-conditional Gaussian distributions that can be estimated without labels.
    Invoked to justify PFDE module construction.
  • domain assumption · Negative label distributions mined from the test batch reduce spurious correlations for OOD detection.
    Invoked to justify NLDE module.
invented entities (2)
  • Positive Feature Distribution Estimation (PFDE) · no independent evidence
    purpose: Models class-wise inclusion and exclusion Gaussians to produce calibrated contrastive scores.
    New module introduced to replace instance-level discriminative training.
  • Negative Label Distribution Estimation (NLDE) · no independent evidence
    purpose: Models negative label distribution to identify highly discriminative labels and mitigate spurious correlations.
    New module introduced for OOD identification.
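To make the second axiom easier to reason about, here is one way "mining discriminative negative labels" could be operationalised. This is not the paper's NLDE, whose details are not given on this page. It is a hypothetical stand-in that treats each candidate negative label's similarity scores on tentatively-OOD and tentatively-ID samples as two Gaussians and ranks labels by a Fisher-style separation ratio. All names and the `top_k` parameter are assumptions.

```python
import numpy as np

def mine_negative_labels(sims, is_ood_guess, top_k=2, eps=1e-8):
    """Rank candidate negative labels by how well they separate OOD from ID.

    sims: (n_samples, n_negative_labels) similarity of each sample to each label.
    is_ood_guess: boolean array (n_samples,), a tentative ID/OOD split.
    """
    sims = np.asarray(sims, dtype=float)
    is_ood_guess = np.asarray(is_ood_guess, dtype=bool)
    ood = sims[is_ood_guess]
    ind = sims[~is_ood_guess]
    # Per-label Gaussian summary: mean gap over pooled spread.
    gap = ood.mean(axis=0) - ind.mean(axis=0)
    spread = np.sqrt(ood.var(axis=0) + ind.var(axis=0) + eps)
    discriminability = gap / spread
    keep = np.argsort(-discriminability)[:top_k]
    return keep, discriminability
```

A label that fires equally on both groups gets a ratio near zero and is dropped, which is one reading of how spurious correlations might be filtered; whether the tentative split is accurate enough for this to work is exactly what the axiom assumes.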

pith-pipeline@v0.9.1-grok · 5812 in / 1352 out tokens · 23607 ms · 2026-06-25T21:00:20.648096+00:00 · methodology


Reference graph

Works this paper leans on

69 extracted references · 5 linked inside Pith

  1. [1]

    Noisy test-time adaptation in vision-language models.arXiv preprint arXiv:2502.14604, 2025

    Chentao Cao, Zhun Zhong, Zhanke Zhou, Tongliang Liu, Yang Liu, Kun Zhang, and Bo Han. Noisy test-time adaptation in vision-language models.arXiv preprint arXiv:2502.14604, 2025

  2. [2]

    Dual memory networks: A versatile adaptation approach for vision-language models

    Yabin Zhang, Wenjie Zhu, Hui Tang, Zhiyuan Ma, Kaiyang Zhou, and Lei Zhang. Dual memory networks: A versatile adaptation approach for vision-language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 28718–28728, 2024

  3. [3]

    Efficienttest-timeadaptationof vision-language models

    AdilbekKarmanov,DayanGuan,ShijianLu,AbdulmotalebElSaddik,andEricXing. Efficienttest-timeadaptationof vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14162–14171, 2024

  4. [4]

    Multi-cache enhanced prototype learning for test-time generalization of vision-language models

    Xinyu Chen, Haotian Zhai, Can Zhang, Xiupeng Shi, and Ruirui Li. Multi-cache enhanced prototype learning for test-time generalization of vision-language models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2281–2291, 2025

  5. [5]

    Dual prototype evolving for test-time generalization of vision-language models.Advances in Neural Information Processing Systems, 37:32111–32136, 2024

    Ce Zhang, Simon Stepputtis, Katia Sycara, and Yaqi Xie. Dual prototype evolving for test-time generalization of vision-language models.Advances in Neural Information Processing Systems, 37:32111–32136, 2024

  6. [6]

    Adaneg: Adaptive negative proxy guided ood detection with vision-language models

    Yabin Zhang and Lei Zhang. Adaneg: Adaptive negative proxy guided ood detection with vision-language models. Advances in Neural Information Processing Systems, 37:38744–38768, 2024

  7. [7]

    Oodd: Test-time out-of-distribution detection with dynamic dictionary

    Yifeng Yang, Lin Zhu, Zewen Sun, Hengyu Liu, Qinying Gu, and Nanyang Ye. Oodd: Test-time out-of-distribution detection with dynamic dictionary. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 30630–30639, 2025

  8. [8]

    On the robustness of open-world test-time training: Self-training with dynamic prototype expansion

    Yushu Li, Xun Xu, Yongyi Su, and Kui Jia. On the robustness of open-world test-time training: Self-training with dynamic prototype expansion. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11836–11846, 2023

  9. [9]

    Tent: Fullytest-timeadaptation by entropy minimization.arXiv preprint arXiv:2006.10726, 2020

    DequanWang,EvanShelhamer,ShaotengLiu,BrunoOlshausen,andTrevorDarrell. Tent: Fullytest-timeadaptation by entropy minimization.arXiv preprint arXiv:2006.10726, 2020

  10. [10]

    Towards stable test-time adaptation in dynamic wild world.arXiv preprint arXiv:2302.12400, 2023

    Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Zhiquan Wen, Yaofo Chen, Peilin Zhao, and Mingkui Tan. Towards stable test-time adaptation in dynamic wild world.arXiv preprint arXiv:2302.12400, 2023

  11. [11]

    On the efficacy of small self-supervised contrastive models without distillation signals

    Haizhou Shi, Youcai Zhang, Siliang Tang, Wenjie Zhu, Yaqian Li, Yandong Guo, and Yueting Zhuang. On the efficacy of small self-supervised contrastive models without distillation signals. InProceedings of the AAAI conference on artificial intelligence, volume 36, pages 2225–2234, 2022

  12. [12]

    Topocl: Topological contrastive learning for medical imaging

    Guangyu Meng, Pengfei Gu, Peixian Liang, John P Lalor, Erin Wolf Chambers, and Danny Z Chen. Topocl: Topological contrastive learning for medical imaging. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 42681–42690, 2026

  13. [13]

    Instruct where the model fails: Generative data augmentation via guided self-contrastive fine-tuning

    Weijian Ma, Ruoxin Chen, Keyue Zhang, Shuang Wu, and Shouhong Ding. Instruct where the model fails: Generative data augmentation via guided self-contrastive fine-tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 5991–5999, 2025

  14. [14]

    The norm must go on: Dynamic unsupervised domain adaptation by normalization

    M Jehanzeb Mirza, Jakub Micorek, Horst Possegger, and Horst Bischof. The norm must go on: Dynamic unsupervised domain adaptation by normalization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14765–14775, 2022

  15. [15]

    Delta: degradation-free fully test-time adaptation.arXiv preprint arXiv:2301.13018, 2023

    Bowen Zhao, Chen Chen, and Shu-Tao Xia. Delta: degradation-free fully test-time adaptation.arXiv preprint arXiv:2301.13018, 2023

  16. [16]

    Just shift it: Test-time prototype shifting for zero-shot generalization with vision-language models

    Elaine Sui, Xiaohan Wang, and Serena Yeung-Levy. Just shift it: Test-time prototype shifting for zero-shot generalization with vision-language models. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 825–835. IEEE, 2025

  17. [17]

    Enhancing zero-shot vision models by label-free prompt distribution learning and bias correcting.Advances in Neural Information Processing Systems, 37:2001–2025, 2024

    Xingyu Zhu, Beier Zhu, Yi Tan, Shuo Wang, Yanbin Hao, and Hanwang Zhang. Enhancing zero-shot vision models by label-free prompt distribution learning and bias correcting.Advances in Neural Information Processing Systems, 37:2001–2025, 2024

  18. [18]

    Dota: Distributional test-time adaptation of vision-language models.arXiv preprint arXiv:2409.19375, 2024

    Zongbo Han, Jialong Yang, Guangyu Wang, Junfan Li, Qianli Xu, Mike Zheng Shou, and Changqing Zhang. Dota: Distributional test-time adaptation of vision-language models.arXiv preprint arXiv:2409.19375, 2024. Visual Computing Lab·The Hong Kong Polytechnic University 11 / 17

  19. [19]

    Cliptta: Robust contrastive vision-language test-time adaptation.arXiv preprint arXiv:2507.14312, 2025

    Marc Lafon, Gustavo Adolfo Vargas Hakim, Clément Rambour, Christian Desrosier, and Nicolas Thome. Cliptta: Robust contrastive vision-language test-time adaptation.arXiv preprint arXiv:2507.14312, 2025

  20. [20]

    Bayesian test-time adaptation for vision-language models

    Lihua Zhou, Mao Ye, Shuaifeng Li, Nianxin Li, Xiatian Zhu, Lei Deng, Hongbin Liu, and Zhen Lei. Bayesian test-time adaptation for vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29999–30009, 2025

  21. [21]

    Protodcs: Towards robust and efficient open-set test-time adaptation for vision-language models.arXiv preprint arXiv:2602.23653, 2026

    Wei Luo, Yangfan Ou, Jin Deng, Zeshuai Deng, Xiquan Yan, Zhiquan Wen, and Mingkui Tan. Protodcs: Towards robust and efficient open-set test-time adaptation for vision-language models.arXiv preprint arXiv:2602.23653, 2026

  22. [22]

    Model-free test time adaptation for out-of-distribution detection.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    YiFan Zhang, Xue Wang, Tian Zhou, Kun Yuan, Zhang Zhang, Liang Wang, and Rong Jin. Model-free test time adaptation for out-of-distribution detection.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  23. [23]

    Re-calibrating feature attributions for model interpretation

    Peiyu Yang, Naveed Akhtar, Zeyi Wen, Mubarak Shah, and Ajmal Saeed Mian. Re-calibrating feature attributions for model interpretation. InInternational Conference on Learning Representations, 2023

  24. [24]

    Backdoor-based explainable ai benchmark for high fidelity evaluation of attribution methods.arXiv preprint arXiv:2405.02344, 2024

    Peiyu Yang, Naveed Akhtar, Jiantong Jiang, and Ajmal Mian. Backdoor-based explainable ai benchmark for high fidelity evaluation of attribution methods.arXiv preprint arXiv:2405.02344, 2024

  25. [25]

    Out-of-distribution detection: A task-oriented survey of recent advances.ACM Computing Surveys, 58(2):1–39, 2025

    Shuo Lu, Yingsheng Wang, Lijun Sheng, Lingxiao He, Aihua Zheng, and Jian Liang. Out-of-distribution detection: A task-oriented survey of recent advances.ACM Computing Surveys, 58(2):1–39, 2025

  26. [26]

    Beyond the static world: Continual category discovery under visual drift

    Wei Feng, Yiwen Jiang, Sijin Zhou, and Zongyuan Ge. Beyond the static world: Continual category discovery under visual drift. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25032–25042, 2026

  27. [27]

    Seeing through the shift: Causality-inspired robust generalized category discovery

    Wei Feng, Yiwen Jiang, Sijin Zhou, Zhuang Qi, Zhongxing Xu, Zhonghua Wang, Feilong Tang, and Zongyuan Ge. Seeing through the shift: Causality-inspired robust generalized category discovery. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17766–17775, 2026

  28. [28]

    Generalized category discovery under domain shift: A frequency domain perspective

    Wei Feng and Zongyuan Ge. Generalized category discovery under domain shift: A frequency domain perspective. Advances in Neural Information Processing Systems, 38:111721–111749, 2026

  29. [29]

    Physics-informed unsupervised domain adaptation framework for cross-machine bearing fault diagnosis.Advanced Engineering Informatics, 62:102774, 2024

    Jia Ning, Weiguo Huang, Chuancang Ding, Jun Wang, and Zhongkui Zhu. Physics-informed unsupervised domain adaptation framework for cross-machine bearing fault diagnosis.Advanced Engineering Informatics, 62:102774, 2024

  30. [30]

    A physics-guided memory enhancement and causality-inspired generalization framework for continual fault diagnosis

    Jia Ning, Weiguo Huang, Panpan Guo, Chuancang Ding, Yifan Huangfu, Changqing Shen, and Zhongkui Zhu. A physics-guided memory enhancement and causality-inspired generalization framework for continual fault diagnosis. Knowledge-Based Systems, 325:114044, 2025. Corresponding author: Weiguo Huang

  31. [31]

    Cinematte: Background matting for virtual production and beyond

    Yuanjian He, Chen Zhang, Fasheng Chen, and Jiangbo Cao. Cinematte: Background matting for virtual production and beyond. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8725–8735, 2026

  32. [32]

    Advancing ultrasoundmedicalcontinuouslearningwithtask-specificgeneralizationandadaptability

    Chunzheng Zhu, Jianxin Lin, Guanghua Tan, Ningbo Zhu, Kenli Li, Chunlian Wang, and Shengli Li. Advancing ultrasoundmedicalcontinuouslearningwithtask-specificgeneralizationandadaptability. In2024IEEEInternational Conference on Bioinformatics and Biomedicine (BIBM), pages 3019–3025. IEEE, 2024

  33. [33]

    Medeyes: Learning dynamic visual focus for medical progressive diagnosis

    Chunzheng Zhu, Yangfang Lin, Shen Chen, Yijun Wang, and Jianxin Lin. Medeyes: Learning dynamic visual focus for medical progressive diagnosis. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 13916–13924, 2026

  34. [34]

    Grasp: Plan-guided graph retrieval with adaptive fusion and reranking on semi-structured knowledge bases.arXiv preprint arXiv:2605.30237, 2026

    Yicheng Tao, Yiqun Wang, Xiangchen Song, Xin Luo, Kai Liu, and Jie Liu. Grasp: Plan-guided graph retrieval with adaptive fusion and reranking on semi-structured knowledge bases.arXiv preprint arXiv:2605.30237, 2026

  35. [35]

    Modalimmune: Immunity driven unlearning via self destructive training.arXiv preprint arXiv:2602.16197, 2026

    RongFu,WeiZhiTang,ZimingWang,JiaYeeTan,ZijianZhang,ZhaoluKang,MugeQi,ShuningZhang,andSimon Fong. Modalimmune: Immunity driven unlearning via self destructive training.arXiv preprint arXiv:2602.16197, 2026

  36. [36]

    Sppo: Sequence-level ppo for long-horizon reasoning tasks, 2026

    Tianyi Wang, Yixia Li, Long Li, Yibiao Chen, Shaohan Huang, Yun Chen, Peng Li, Yang Liu, and Guanhua Chen. Sppo: Sequence-level ppo for long-horizon reasoning tasks, 2026. URL https://arxiv.org/abs/2604.08865

  37. [37]

    Regulating model reliance on non-robust features by smoothing input marginal density

    Peiyu Yang, Naveed Akhtar, Mubarak Shah, and Ajmal Mian. Regulating model reliance on non-robust features by smoothing input marginal density. InEuropean Conference on Computer Vision, pages 329–347. Springer, 2024. Visual Computing Lab·The Hong Kong Polytechnic University 12 / 17

  38. [38]

    Lapt: Label-driven automated prompt tuning for ood detection with vision-language models

    Yabin Zhang, Wenjie Zhu, Chenhang He, and Lei Zhang. Lapt: Label-driven automated prompt tuning for ood detection with vision-language models. InEuropean conference on computer vision, pages 271–288. Springer, 2024

  39. [39]

    Knowledge regularized negative feature tuning of vision-language models for out-of-distribution detection

    Wenjie Zhu, Yabin Zhang, Xin Jin, Wenjun Zeng, and Lei Zhang. Knowledge regularized negative feature tuning of vision-language models for out-of-distribution detection. InProceedings of the 33rd ACM International Conference on Multimedia, pages 3565–3574, 2025

  40. [40]

    Ants: Adaptive negative textual space shaping for ood detection via test-time mllm understanding and reasoning

    Wenjie Zhu, Yabin Zhang, Xin Jin, Wenjun Zeng, and Lei Zhang. Ants: Adaptive negative textual space shaping for ood detection via test-time mllm understanding and reasoning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20–30, 2026

  41. [41]

    Activationmatters: Test-timeactivatednegativelabelsforooddetectionwithvision-languagemodels.arXivpreprint arXiv:2603.25250, 2026

    Yabin Zhang, Maya Varma, Yunhe Gao, Jean-Benoit Delbrouck, Jiaming Liu, Chong Wang, and Curtis Langlotz. Activationmatters: Test-timeactivatednegativelabelsforooddetectionwithvision-languagemodels.arXivpreprint arXiv:2603.25250, 2026

  42. [42]

    Cross-modal proxy evolving for ood detection with vision-language models.arXiv preprint arXiv:2601.08476, 2026

    Hao Tang, Yu Liu, Shuanglin Yan, Fei Shen, Shengfeng He, and Jing Qin. Cross-modal proxy evolving for ood detection with vision-language models.arXiv preprint arXiv:2601.08476, 2026

  43. [43]

    Quantized-tinyllava: a new multimodal foundation model enables efficient split learning.arXiv preprint arXiv:2511.23402, 2025

    Jiajun Guo, Xin Luo, Jiayin Zheng, Yiqun Wang, Kai-Wei Chang, Wei Wang, and Jie Liu. Quantized-tinyllava: a new multimodal foundation model enables efficient split learning.arXiv preprint arXiv:2511.23402, 2025

  44. [44]

    Future- sightdrive: Thinking visually with spatio-temporal cot for autonomous driving.arXiv preprint arXiv:2505.17685, 2025

    Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Future- sightdrive: Thinking visually with spatio-temporal cot for autonomous driving.arXiv preprint arXiv:2505.17685, 2025

  45. [45]

    Reversible primitive– composition alignment for continual vision–language learning

    Canran Xiao, Tianxiang Xu, Siyuan Ma, Yiyang Jiang, Haoyu Gao, and Yuhan Wu. Reversible primitive– composition alignment for continual vision–language learning. InThe Fourteenth International Conference on Learning Representations, 2026

  46. [46]

    Pi-cca: Prompt-invariant cca certificates for replay-free continual multimodal learning

    Jiayu Zhang, Chuangxin Zhao, Canran Xiao, Ruibo Duan, Wenyi Mo, Haoyu Gao, and Wenshuo Wang. Pi-cca: Prompt-invariant cca certificates for replay-free continual multimodal learning. InThe Fourteenth International Conference on Learning Representations, 2026

  47. [47]

    Thinkingwithblueprints: Assist- ingvision-languagemodelsinspatialreasoningviastructuredobjectrepresentation.arXivpreprintarXiv:2601.01984, 2026

    WeijianMa,ShizhaoSun,TianyuYu,RuiyuWang,Tat-SengChua,andJiangBian. Thinkingwithblueprints: Assist- ingvision-languagemodelsinspatialreasoningviastructuredobjectrepresentation.arXivpreprintarXiv:2601.01984, 2026

  48. [48]

    Mmt-ard: Multimodal multi-teacher adversarial distillation for robust vision-language models.arXiv preprint arXiv:2511.17448, 2025

    YuqiLi,JunhaoDong, ChuanguangYang, ShipingWen,PiotrKoniusz, TingwenHuang,YingliTian, andYew-Soon Ong. Mmt-ard: Multimodal multi-teacher adversarial distillation for robust vision-language models.arXiv preprint arXiv:2511.17448, 2025

  49. [49]

    Ammkd: Adaptive multimodal multi-teacher distillation for lightweight vision-language models.arXiv preprint arXiv:2509.00039, 2025

    Yuqi Li, Chuanguang Yang, Junhao Dong, Zhengtao Yao, Haoyan Xu, Zeyu Dong, Hansheng Zeng, Zhulin An, and Yingli Tian. Ammkd: Adaptive multimodal multi-teacher distillation for lightweight vision-language models.arXiv preprint arXiv:2509.00039, 2025

  50. [50]

    Discriminant analysis by gaussian mixtures.Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):155–176, 1996

    Trevor Hastie and Robert Tibshirani. Discriminant analysis by gaussian mixtures.Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):155–176, 1996

  51. [51]

    Delving into out-of-distribution detection with vision-language representations.Advances in neural information processing systems, 35:35087–35102, 2022

    Yifei Ming, Ziyang Cai, Jiuxiang Gu, Yiyou Sun, Wei Li, and Yixuan Li. Delving into out-of-distribution detection with vision-language representations.Advances in neural information processing systems, 35:35087–35102, 2022

  52. [52]

    Negative label guided ood detection with pretrained vision-language models.arXiv preprint arXiv:2403.20078, 2024

    Xue Jiang, Feng Liu, Zhen Fang, Hong Chen, Tongliang Liu, Feng Zheng, and Bo Han. Negative label guided ood detection with pretrained vision-language models.arXiv preprint arXiv:2403.20078, 2024

  53. [53]

    Springer, 2006

    Christopher M Bishop and Nasser M Nasrabadi.Pattern recognition and machine learning, volume 4. Springer, 2006

  54. [54]

    Regularized discriminant analysis.Journal of the American statistical association, 84(405): 165–175, 1989

    Jerome H Friedman. Regularized discriminant analysis.Journal of the American statistical association, 84(405): 165–175, 1989

  55. [55]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  56. [56]

    Test-time prompt tuning for zero-shot generalization in vision-language models. Advances in Neural Information Processing Systems, 35:14274–14289, 2022

    Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models. Advances in Neural Information Processing Systems, 35:14274–14289, 2022

  57. [57]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

  58. [58]

    Learning robust global representations by penalizing local predictive power. Advances in neural information processing systems, 32, 2019

    Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. Advances in neural information processing systems, 32, 2019

  59. [59]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15262–15271, 2021

  60. [60]

    Do imagenet classifiers generalize to imagenet? In International conference on machine learning, pages 5389–5400

    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In International conference on machine learning, pages 5389–5400. PMLR, 2019

  61. [61]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8340–8349, 2021

  62. [62]

    The caltech-ucsd birds-200-2011 dataset

    Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, Serge Belongie, et al. The caltech-ucsd birds-200-2011 dataset. Technical report

  63. [63]

    3d object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013

  64. [64]

    Food-101–mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In European conference on computer vision, pages 446–461. Springer, 2014

  65. [65]

    Cats and dogs

    Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012

  66. [66]

    The inaturalist species classification and detection dataset

    Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778, 2018

  67. [67]

    Sun database: Large-scale scene recognition from abbey to zoo

    Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE, 2010

  68. [68]

    Describing textures in the wild

    Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014

  69. [69]

    Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464, 2017

    Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464, 2017

    Supplementary Material

    Table S1. Hyper-parameter settings for the pro...