Dual Distribution Estimation for Zero-shot Noisy Test-Time Adaptation with VLMs

Lei Zhang; Liang Xu; Wenjie Zhu; Wenjun Zeng; Xin Jin; Yabin Zhang

arxiv: 2606.25758 · v1 · pith:ZCC2UMY5new · submitted 2026-06-24 · 💻 cs.CV

Wenjie Zhu, Yabin Zhang, Liang Xu, Xin Jin, Wenjun Zeng, Lei Zhang

Pith reviewed 2026-06-25 21:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords test-time adaptation · zero-shot learning · out-of-distribution detection · vision-language models · Gaussian distribution estimation · noisy test-time adaptation · distribution modeling

The pith

Training-free Gaussian modeling of frozen VLM features raises harmonic-mean accuracy on ImageNet by 3.70 percent in noisy test-time adaptation and lowers FPR95 for OOD detection by 6.20 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that noisy test-time adaptation of vision-language models can be handled by shifting from instance-level discriminative training to explicit modeling of class-wise Gaussian distributions on frozen model features. It introduces Positive Feature Distribution Estimation to build inclusion and exclusion Gaussians that yield calibrated contrastive scores for better in-distribution classification, paired with Negative Label Distribution Estimation to identify highly discriminative labels and reduce spurious correlations for stronger out-of-distribution filtering. This matters because real-world streams mix in-distribution and out-of-distribution samples, and prior zero-shot approaches either degrade efficiency or produce overconfident errors. The resulting method remains fully training-free and zero-shot, delivering simultaneous gains in accuracy and detection on large-scale benchmarks while preserving online scalability.

Core claim

DDE shifts the zero-shot NTTA paradigm from instance-level learning to training-free Gaussian distribution modeling. PFDE explicitly models class-wise inclusion and exclusion Gaussian distributions from test-batch features to formulate a calibrated contrastive score that robustly enhances ID accuracy. NLDE improves OOD identification by explicitly modeling the negative label distribution to mine highly discriminative labels and mitigate spurious correlations. On the large-scale ImageNet benchmark this yields a 3.70 percent improvement in harmonic mean accuracy and a 6.20 percent reduction in FPR95 for OOD detection while ensuring highly scalable and efficient online inference.

What carries the argument

Dual Distribution Estimation (DDE) via Positive Feature Distribution Estimation (PFDE) and Negative Label Distribution Estimation (NLDE), which fit class-wise Gaussians to frozen VLM test-batch features to produce contrastive inclusion/exclusion scores and mined negative labels.
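
The inclusion/exclusion idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the summary does not give the paper's equations, so the diagonal covariance, the variance floor `eps`, the use of a log-density difference as the contrastive score, and every function name below are assumptions. The pseudo-labels stand in for whatever zero-shot assignment the method actually uses.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def fit_diag_gaussian(rows: List[Vector], eps: float = 1e-3) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and variance; eps (assumed) keeps variances positive."""
    n, dim = len(rows), len(rows[0])
    mean = [sum(r[d] for r in rows) / n for d in range(dim)]
    var = [sum((r[d] - mean[d]) ** 2 for r in rows) / n + eps for d in range(dim)]
    return mean, var


def log_density(x: Vector, mean: List[float], var: List[float]) -> float:
    """Log-density of a diagonal Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def contrastive_scores(features: List[Vector], pseudo_labels: List[int]) -> Dict[int, List[float]]:
    """For each class c: log p(x | assigned to c) - log p(x | not assigned to c)."""
    scores: Dict[int, List[float]] = {}
    for c in sorted(set(pseudo_labels)):
        inside = [f for f, y in zip(features, pseudo_labels) if y == c]
        outside = [f for f, y in zip(features, pseudo_labels) if y != c]
        if not inside or not outside:
            continue  # cannot fit both Gaussians for this class
        inclusion = fit_diag_gaussian(inside)
        exclusion = fit_diag_gaussian(outside)
        scores[c] = [log_density(f, *inclusion) - log_density(f, *exclusion) for f in features]
    return scores


# Toy batch: two well-separated 2-D clusters with hypothetical pseudo-labels.
features = [(0.0, 0.1), (0.1, -0.1), (-0.1, 0.0), (3.0, 3.1), (3.1, 2.9), (2.9, 3.0)]
pseudo_labels = [0, 0, 0, 1, 1, 1]
scores = contrastive_scores(features, pseudo_labels)
predicted = [max(scores, key=lambda c: scores[c][i]) for i in range(len(features))]
print(predicted)  # prints [0, 0, 0, 1, 1, 1]
```

A sample scores high for a class when it is likely under that class's inclusion Gaussian and unlikely under its exclusion Gaussian. A sample whose best score stays low for every class is a natural OOD candidate, though how the paper thresholds this is not stated in the summary.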

If this is right

Enables highly scalable and efficient online inference without retraining or post-hoc tuning.
Maintains robustness in data-scarce scenarios while remaining zero-shot.
Simultaneously improves in-distribution classification accuracy and out-of-distribution detection.
Avoids overconfident misclassifications that arise from test-time discriminative training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the Gaussian modeling holds, the same dual-estimation structure could replace instance-level training loops in other online adaptation settings.
A testable extension is to measure how performance changes when test-batch size shrinks below the point where reliable Gaussian fits become possible.
The approach implies that single-batch feature statistics alone suffice for reliable ID/OOD separation, which would simplify deployment in streaming environments where labeled data never arrives.
If the contrastive scores prove stable across domains, similar distribution estimation could be explored for non-vision modalities that also rely on frozen encoders.
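
The batch-size extension above can be probed with a simple simulation. The sketch below only measures how the error of an estimated class mean grows as the number of samples shrinks. It assumes isotropic standard-normal features in a hypothetical 16-dimensional space and is not an experiment from the paper.

```python
import math
import random


def mean_estimation_error(n: int, dim: int = 16, trials: int = 200, seed: int = 0) -> float:
    """Average Euclidean distance between the sample mean of n standard-normal
    feature vectors and the true mean (the origin)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sums = [0.0] * dim
        for _ in range(n):
            for d in range(dim):
                sums[d] += rng.gauss(0.0, 1.0)
        total += math.sqrt(sum((s / n) ** 2 for s in sums))
    return total / trials


errors = {n: mean_estimation_error(n) for n in (2, 8, 32, 128)}
for n, err in errors.items():
    # sqrt(dim / n) is the root-mean-square error predicted by theory.
    print(f"n={n:4d}  mean error={err:.3f}  sqrt(dim/n)={math.sqrt(16 / n):.3f}")
```

The error falls roughly as one over the square root of the sample count, so classes that receive only a handful of samples in a batch get much noisier Gaussian parameters. Covariance estimates degrade faster still, which this sketch does not cover.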

Load-bearing premise

Test-batch features from a frozen VLM can be reliably modeled as class-wise Gaussians whose inclusion/exclusion contrastive scores separate ID from OOD samples without any labeled supervision or post-hoc tuning.

What would settle it

Apply DDE to ImageNet with controlled mixtures of ID and OOD samples and check whether harmonic-mean accuracy gains of 3.70 percent and FPR95 reductions of 6.20 percent disappear relative to prior zero-shot NTTA baselines.
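
The two headline metrics can be computed as sketched below. The definitions are the conventional ones and are assumed rather than taken from the paper: the harmonic mean combines an ID classification accuracy with an OOD detection accuracy, and FPR95 is the share of OOD samples still accepted as ID when the threshold retains 95% of ID samples. Function names and toy scores are hypothetical.

```python
from typing import List


def harmonic_mean(acc_id: float, acc_ood: float) -> float:
    """Harmonic mean of two accuracies in [0, 1]; 0 if either is 0."""
    if acc_id <= 0.0 or acc_ood <= 0.0:
        return 0.0
    return 2.0 * acc_id * acc_ood / (acc_id + acc_ood)


def fpr_at_95_tpr(id_scores: List[float], ood_scores: List[float]) -> float:
    """Fraction of OOD samples accepted as ID at the threshold that keeps
    at least 95% of ID samples. Higher score means "more ID-like"."""
    ranked = sorted(id_scores, reverse=True)
    keep = (95 * len(ranked) + 99) // 100      # integer ceil of 0.95 * n
    threshold = ranked[keep - 1]
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)


# Toy scores, not results from the paper.
id_scores = [i / 20 for i in range(1, 21)]     # 0.05, 0.10, ..., 1.00
ood_scores = [0.02, 0.08, 0.3, 0.6]
fpr95 = fpr_at_95_tpr(id_scores, ood_scores)
acc_h = harmonic_mean(0.8, 0.6)
print(fpr95)             # prints 0.5
print(round(acc_h, 4))   # prints 0.6857
```

The harmonic mean penalizes a method that trades one accuracy for the other, so a gain in it implies the two quantities did not simply move in opposite directions.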

Original abstract

While test-time adaptation (TTA) empowers vision-language models to adapt without costly retraining, it remains highly vulnerable to out-of-distribution (OOD) outliers prevalent in real-world applications. This discrepancy motivates Noisy TTA (NTTA), an online task to filter noisy OOD samples on the fly while maximizing in-distribution (ID) classification accuracy. Existing zero-shot NTTA approaches typically rely on test-time discriminative training, leading to overconfident misclassifications and significantly degraded inference efficiency. To address these limitations, we propose a novel framework named Dual Distribution Estimation (DDE), shifting the zero-shot NTTA paradigm from instance-level learning to training-free Gaussian distribution modeling. DDE incorporates two novel modules: Positive Feature Distribution Estimation (PFDE) and Negative Label Distribution Estimation (NLDE). PFDE explicitly models class-wise inclusion and exclusion Gaussian distributions to formulate a calibrated contrastive score, robustly enhancing ID accuracy. In parallel, NLDE improves OOD identification by explicitly modeling the negative label distribution to mine highly discriminative labels, effectively mitigating spurious correlations. Extensive experiments show that on the large-scale ImageNet benchmark, DDE achieves an improvement of 3.70% in harmonic mean accuracy and reduces the FPR95 for OOD detection by 6.20%, while ensuring highly scalable and efficient online inference. Furthermore, DDE is zero-shot and training-free, demonstrating remarkable robustness in data-scarce scenarios. Code is available at https://github.com/ZhuWenjie98/DDE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DDE shifts NTTA to training-free Gaussian modeling on test batches but the class assignment step for those distributions looks vulnerable to OOD contamination.

The main point is that this paper replaces test-time discriminative training with explicit class-wise Gaussian fits on unlabeled test features. PFDE models inclusion and exclusion distributions per class to produce a contrastive score, and NLDE mines negative labels from the same features to cut spurious correlations. On ImageNet it reports a 3.70% harmonic mean accuracy gain and 6.20% FPR95 reduction while staying online and training-free.

The work is new in packaging these two distribution estimates together for the noisy TTA case. It correctly flags the efficiency and overconfidence problems in prior zero-shot methods and releases code, which helps anyone wanting to check the implementation.

The soft spot is the assignment process itself. Without labels the means and covariances must be built from initial zero-shot logits or clustering on the mixed batch. OOD samples can easily pull those estimates off, and the method then uses the resulting scores to decide what counts as ID. That loop is exactly what the stress-test flags, and the abstract gives no ablations on assignment robustness or error bars on the reported gains. The Gaussian assumption is presented as reliable but rests on the frozen VLM features separating cleanly, which is the weakest link when the setting is noisy.
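
The contamination concern can be made concrete with a small arithmetic sketch. It assumes a plain sample mean and uses hypothetical 2-D features. The paper's actual estimator is not described in the abstract, so this only shows how an unweighted class mean would move if OOD samples were pseudo-labelled into the class.

```python
from typing import List, Sequence


def class_mean(rows: List[Sequence[float]]) -> List[float]:
    """Unweighted mean of the feature vectors assigned to one class."""
    dim = len(rows[0])
    return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]


id_rows = [(1.0, 0.0)] * 8    # hypothetical ID features pseudo-labelled as class c
ood_rows = [(0.0, 1.0)] * 2   # hypothetical OOD features misassigned to class c

clean = class_mean(id_rows)
contaminated = class_mean(id_rows + ood_rows)
shift = [c - k for c, k in zip(contaminated, clean)]

# The shift equals the contamination rate times (OOD mean - ID mean).
rate = len(ood_rows) / (len(id_rows) + len(ood_rows))
expected = [rate * (o - i) for o, i in zip(class_mean(ood_rows), clean)]
print(shift, expected)
```

With 20% of the assigned samples being OOD, the estimated mean moves 20% of the way toward the OOD cluster. Any score computed from that mean then inherits the bias, which is why an ablation over contamination rates would be informative.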

The paper is for people working on practical VLM robustness in streaming or noisy data. A reader who needs an efficient online baseline might pick up the PFDE/NLDE framing, though they will want to verify the distribution quality on their own data.

It deserves peer review because the framing is distinct, the benchmark numbers are concrete, and the code is public. Referees can check whether the assignment step holds up under realistic OOD rates.

Referee Report

2 major / 1 minor

Summary. The paper proposes Dual Distribution Estimation (DDE) for zero-shot noisy test-time adaptation (NTTA) with vision-language models. It introduces Positive Feature Distribution Estimation (PFDE) to explicitly model class-wise inclusion/exclusion Gaussian distributions from frozen VLM test-batch features for a calibrated contrastive score, and Negative Label Distribution Estimation (NLDE) to model negative label distributions for mining discriminative labels and mitigating spurious correlations. The central empirical claim is a 3.70% gain in harmonic mean accuracy and 6.20% reduction in FPR95 on ImageNet, with the method positioned as training-free, post-hoc-tuning-free, and scalable for online inference.

Significance. If the Gaussian modeling assumptions hold without supervision, the shift from instance-level discriminative training to explicit dual distribution estimation could meaningfully improve robustness and efficiency in real-world NTTA settings, particularly for data-scarce or OOD-contaminated batches. The training-free nature and reported scalability are notable strengths if supported by reproducible code and ablations.

major comments (2)

The PFDE module's reliance on modeling test-batch features as class-wise Gaussians (whose inclusion/exclusion contrastive scores drive ID accuracy) rests on unsupervised assignment of samples to classes; this creates a potential feedback loop when initial zero-shot logits misassign OOD-contaminated samples, directly undermining the claimed 3.70% harmonic mean gain. No derivation details, sensitivity analysis, or robustness checks against this circularity are evident from the abstract.
The abstract reports quantitative gains (3.70% harmonic mean, 6.20% FPR95) but provides no error bars, statistical significance tests, ablation evidence on PFDE/NLDE components, or full experimental protocol, making it impossible to verify whether the improvements are load-bearing or sensitive to post-hoc choices.

minor comments (1)

The abstract should explicitly name the VLM backbone, test-batch sizes, and any clustering or logit-thresholding steps used for initial class assignment in PFDE/NLDE.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive feedback on our work. We address each major comment below with clarifications from the manuscript and commitments to revisions where appropriate.

Referee: The PFDE module's reliance on modeling test-batch features as class-wise Gaussians (whose inclusion/exclusion contrastive scores drive ID accuracy) rests on unsupervised assignment of samples to classes; this creates a potential feedback loop when initial zero-shot logits misassign OOD-contaminated samples, directly undermining the claimed 3.70% harmonic mean gain. No derivation details, sensitivity analysis, or robustness checks against this circularity are evident from the abstract.

Authors: The manuscript details in Section 3.2 that PFDE fits class-wise Gaussians directly on the frozen VLM features of the test batch after initial zero-shot logit-based pseudo-labeling, then derives a calibrated contrastive score from the inclusion/exclusion distributions. This is not a closed feedback loop because the Gaussian parameters are estimated once per batch in a training-free manner and the contrastive score explicitly down-weights outliers via the exclusion component; NLDE further mitigates spurious assignments by mining discriminative negative labels. Derivation of the calibrated score appears in Equations (3)–(5). We agree that sensitivity analysis to initial assignment errors is not present and will add it (varying pseudo-label noise levels on ImageNet) in the revision. revision: partial

Referee: The abstract reports quantitative gains (3.70% harmonic mean, 6.20% FPR95) but provides no error bars, statistical significance tests, ablation evidence on PFDE/NLDE components, or full experimental protocol, making it impossible to verify whether the improvements are load-bearing or sensitive to post-hoc choices.

Authors: The full manuscript provides the experimental protocol in Section 4.1, component ablations in Section 4.3 (showing PFDE and NLDE each contribute to the harmonic-mean gain), and results on multiple datasets beyond the abstract. However, the abstract itself omits error bars and significance tests. We will add per-run standard deviations, paired t-test results, and expanded ablation tables to both the abstract and main results in the revised version. revision: yes
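
The per-run standard deviations and paired t-test mentioned in the response could be computed along these lines. The run values are placeholders chosen for illustration, not numbers from the paper, and the function name is hypothetical. The test pairs runs that share a seed or data order.

```python
import math
import statistics
from typing import List, Tuple


def paired_t_statistic(method: List[float], baseline: List[float]) -> Tuple[float, float, int]:
    """Mean paired difference, its t statistic, and the degrees of freedom."""
    diffs = [m - b for m, b in zip(method, baseline)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)          # sample standard deviation (n - 1)
    t_stat = mean_diff / (std_diff / math.sqrt(n))
    return mean_diff, t_stat, n - 1


# Placeholder harmonic-mean accuracies over five paired runs (NOT paper results).
baseline_runs = [60.0, 61.0, 59.0, 60.5, 59.5]
method_runs = [63.0, 65.0, 62.0, 64.5, 63.5]

mean_diff, t_stat, dof = paired_t_statistic(method_runs, baseline_runs)
print(f"method: {statistics.mean(method_runs):.2f} +/- {statistics.stdev(method_runs):.2f}")
print(f"mean gain = {mean_diff:.2f}, t = {t_stat:.2f}, dof = {dof}")
```

The t statistic is compared against a Student t distribution with `dof` degrees of freedom to obtain a p-value. With only a few runs the test has little power, so reporting the raw per-run spread alongside it is worthwhile.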

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained distribution estimation

The paper's core contribution is explicit Gaussian modeling (PFDE for class-wise inclusion/exclusion and NLDE for negative labels) directly from unlabeled test-batch VLM features to produce contrastive scores. This is the method itself rather than any prediction or result that reduces by construction to fitted inputs or prior self-citations. No equations or steps in the abstract or description exhibit self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations. The approach is presented as training-free and zero-shot, with performance claims tied to empirical benchmarks rather than internal tautologies. This matches the default case of an honest non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The approach rests on the assumption that VLM features admit useful Gaussian modeling and that negative label distributions mined from test data are discriminative; no free parameters or invented physical entities are stated.

axioms (2)

domain assumption: Test-batch features from a frozen VLM follow class-conditional Gaussian distributions that can be estimated without labels.
Invoked to justify PFDE module construction.
domain assumption: Negative label distributions mined from the test batch reduce spurious correlations for OOD detection.
Invoked to justify NLDE module.

invented entities (2)

Positive Feature Distribution Estimation (PFDE) · no independent evidence
purpose: Models class-wise inclusion and exclusion Gaussians to produce calibrated contrastive scores.
New module introduced to replace instance-level discriminative training.
Negative Label Distribution Estimation (NLDE) · no independent evidence
purpose: Models negative label distribution to identify highly discriminative labels and mitigate spurious correlations.
New module introduced for OOD identification.

Reference graph

Works this paper leans on

69 extracted references · 5 linked inside Pith

[1]

Noisy test-time adaptation in vision-language models.arXiv preprint arXiv:2502.14604, 2025

Chentao Cao, Zhun Zhong, Zhanke Zhou, Tongliang Liu, Yang Liu, Kun Zhang, and Bo Han. Noisy test-time adaptation in vision-language models.arXiv preprint arXiv:2502.14604, 2025

arXiv 2025
[2]

Dual memory networks: A versatile adaptation approach for vision-language models

Yabin Zhang, Wenjie Zhu, Hui Tang, Zhiyuan Ma, Kaiyang Zhou, and Lei Zhang. Dual memory networks: A versatile adaptation approach for vision-language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 28718–28728, 2024

2024
[3]

Efficienttest-timeadaptationof vision-language models

AdilbekKarmanov,DayanGuan,ShijianLu,AbdulmotalebElSaddik,andEricXing. Efficienttest-timeadaptationof vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14162–14171, 2024

2024
[4]

Multi-cache enhanced prototype learning for test-time generalization of vision-language models

Xinyu Chen, Haotian Zhai, Can Zhang, Xiupeng Shi, and Ruirui Li. Multi-cache enhanced prototype learning for test-time generalization of vision-language models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2281–2291, 2025

2025
[5]

Dual prototype evolving for test-time generalization of vision-language models.Advances in Neural Information Processing Systems, 37:32111–32136, 2024

Ce Zhang, Simon Stepputtis, Katia Sycara, and Yaqi Xie. Dual prototype evolving for test-time generalization of vision-language models.Advances in Neural Information Processing Systems, 37:32111–32136, 2024

2024
[6]

Adaneg: Adaptive negative proxy guided ood detection with vision-language models

Yabin Zhang and Lei Zhang. Adaneg: Adaptive negative proxy guided ood detection with vision-language models. Advances in Neural Information Processing Systems, 37:38744–38768, 2024

2024
[7]

Oodd: Test-time out-of-distribution detection with dynamic dictionary

Yifeng Yang, Lin Zhu, Zewen Sun, Hengyu Liu, Qinying Gu, and Nanyang Ye. Oodd: Test-time out-of-distribution detection with dynamic dictionary. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 30630–30639, 2025

2025
[8]

On the robustness of open-world test-time training: Self-training with dynamic prototype expansion

Yushu Li, Xun Xu, Yongyi Su, and Kui Jia. On the robustness of open-world test-time training: Self-training with dynamic prototype expansion. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11836–11846, 2023

2023
[9]

Tent: Fullytest-timeadaptation by entropy minimization.arXiv preprint arXiv:2006.10726, 2020

DequanWang,EvanShelhamer,ShaotengLiu,BrunoOlshausen,andTrevorDarrell. Tent: Fullytest-timeadaptation by entropy minimization.arXiv preprint arXiv:2006.10726, 2020

Pith/arXiv arXiv 2006
[10]

Towards stable test-time adaptation in dynamic wild world.arXiv preprint arXiv:2302.12400, 2023

Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Zhiquan Wen, Yaofo Chen, Peilin Zhao, and Mingkui Tan. Towards stable test-time adaptation in dynamic wild world.arXiv preprint arXiv:2302.12400, 2023

arXiv 2023
[11]

On the efficacy of small self-supervised contrastive models without distillation signals

Haizhou Shi, Youcai Zhang, Siliang Tang, Wenjie Zhu, Yaqian Li, Yandong Guo, and Yueting Zhuang. On the efficacy of small self-supervised contrastive models without distillation signals. InProceedings of the AAAI conference on artificial intelligence, volume 36, pages 2225–2234, 2022

2022
[12]

Topocl: Topological contrastive learning for medical imaging

Guangyu Meng, Pengfei Gu, Peixian Liang, John P Lalor, Erin Wolf Chambers, and Danny Z Chen. Topocl: Topological contrastive learning for medical imaging. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 42681–42690, 2026

2026
[13]

Instruct where the model fails: Generative data augmentation via guided self-contrastive fine-tuning

Weijian Ma, Ruoxin Chen, Keyue Zhang, Shuang Wu, and Shouhong Ding. Instruct where the model fails: Generative data augmentation via guided self-contrastive fine-tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 5991–5999, 2025

2025
[14]

The norm must go on: Dynamic unsupervised domain adaptation by normalization

M Jehanzeb Mirza, Jakub Micorek, Horst Possegger, and Horst Bischof. The norm must go on: Dynamic unsupervised domain adaptation by normalization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14765–14775, 2022

2022
[15]

Delta: degradation-free fully test-time adaptation.arXiv preprint arXiv:2301.13018, 2023

Bowen Zhao, Chen Chen, and Shu-Tao Xia. Delta: degradation-free fully test-time adaptation.arXiv preprint arXiv:2301.13018, 2023

arXiv 2023
[16]

Just shift it: Test-time prototype shifting for zero-shot generalization with vision-language models

Elaine Sui, Xiaohan Wang, and Serena Yeung-Levy. Just shift it: Test-time prototype shifting for zero-shot generalization with vision-language models. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 825–835. IEEE, 2025

2025
[17]

Enhancing zero-shot vision models by label-free prompt distribution learning and bias correcting.Advances in Neural Information Processing Systems, 37:2001–2025, 2024

Xingyu Zhu, Beier Zhu, Yi Tan, Shuo Wang, Yanbin Hao, and Hanwang Zhang. Enhancing zero-shot vision models by label-free prompt distribution learning and bias correcting.Advances in Neural Information Processing Systems, 37:2001–2025, 2024

2001
[18]

Dota: Distributional test-time adaptation of vision-language models.arXiv preprint arXiv:2409.19375, 2024

Zongbo Han, Jialong Yang, Guangyu Wang, Junfan Li, Qianli Xu, Mike Zheng Shou, and Changqing Zhang. Dota: Distributional test-time adaptation of vision-language models.arXiv preprint arXiv:2409.19375, 2024. Visual Computing Lab·The Hong Kong Polytechnic University 11 / 17

arXiv 2024
[19]

Cliptta: Robust contrastive vision-language test-time adaptation.arXiv preprint arXiv:2507.14312, 2025

Marc Lafon, Gustavo Adolfo Vargas Hakim, Clément Rambour, Christian Desrosier, and Nicolas Thome. Cliptta: Robust contrastive vision-language test-time adaptation.arXiv preprint arXiv:2507.14312, 2025

arXiv 2025
[20]

Bayesian test-time adaptation for vision-language models

Lihua Zhou, Mao Ye, Shuaifeng Li, Nianxin Li, Xiatian Zhu, Lei Deng, Hongbin Liu, and Zhen Lei. Bayesian test-time adaptation for vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29999–30009, 2025

2025
[21]

Protodcs: Towards robust and efficient open-set test-time adaptation for vision-language models.arXiv preprint arXiv:2602.23653, 2026

Wei Luo, Yangfan Ou, Jin Deng, Zeshuai Deng, Xiquan Yan, Zhiquan Wen, and Mingkui Tan. Protodcs: Towards robust and efficient open-set test-time adaptation for vision-language models.arXiv preprint arXiv:2602.23653, 2026

arXiv 2026
[22]

Model-free test time adaptation for out-of-distribution detection.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

YiFan Zhang, Xue Wang, Tian Zhou, Kun Yuan, Zhang Zhang, Liang Wang, and Rong Jin. Model-free test time adaptation for out-of-distribution detection.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025
[23]

Re-calibrating feature attributions for model interpretation

Peiyu Yang, Naveed Akhtar, Zeyi Wen, Mubarak Shah, and Ajmal Saeed Mian. Re-calibrating feature attributions for model interpretation. InInternational Conference on Learning Representations, 2023

2023
[24]

Backdoor-based explainable ai benchmark for high fidelity evaluation of attribution methods.arXiv preprint arXiv:2405.02344, 2024

Peiyu Yang, Naveed Akhtar, Jiantong Jiang, and Ajmal Mian. Backdoor-based explainable ai benchmark for high fidelity evaluation of attribution methods.arXiv preprint arXiv:2405.02344, 2024

arXiv 2024
[25]

Out-of-distribution detection: A task-oriented survey of recent advances.ACM Computing Surveys, 58(2):1–39, 2025

Shuo Lu, Yingsheng Wang, Lijun Sheng, Lingxiao He, Aihua Zheng, and Jian Liang. Out-of-distribution detection: A task-oriented survey of recent advances.ACM Computing Surveys, 58(2):1–39, 2025

2025
[26]

Beyond the static world: Continual category discovery under visual drift

Wei Feng, Yiwen Jiang, Sijin Zhou, and Zongyuan Ge. Beyond the static world: Continual category discovery under visual drift. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25032–25042, 2026

2026
[27]

Seeing through the shift: Causality-inspired robust generalized category discovery

Wei Feng, Yiwen Jiang, Sijin Zhou, Zhuang Qi, Zhongxing Xu, Zhonghua Wang, Feilong Tang, and Zongyuan Ge. Seeing through the shift: Causality-inspired robust generalized category discovery. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17766–17775, 2026

2026
[28]

Generalized category discovery under domain shift: A frequency domain perspective

Wei Feng and Zongyuan Ge. Generalized category discovery under domain shift: A frequency domain perspective. Advances in Neural Information Processing Systems, 38:111721–111749, 2026

2026
[29]

Physics-informed unsupervised domain adaptation framework for cross-machine bearing fault diagnosis.Advanced Engineering Informatics, 62:102774, 2024

Jia Ning, Weiguo Huang, Chuancang Ding, Jun Wang, and Zhongkui Zhu. Physics-informed unsupervised domain adaptation framework for cross-machine bearing fault diagnosis.Advanced Engineering Informatics, 62:102774, 2024

2024
[30]

A physics-guided memory enhancement and causality-inspired generalization framework for continual fault diagnosis

Jia Ning, Weiguo Huang, Panpan Guo, Chuancang Ding, Yifan Huangfu, Changqing Shen, and Zhongkui Zhu. A physics-guided memory enhancement and causality-inspired generalization framework for continual fault diagnosis. Knowledge-Based Systems, 325:114044, 2025. Corresponding author: Weiguo Huang

2025
[31]

Cinematte: Background matting for virtual production and beyond

Yuanjian He, Chen Zhang, Fasheng Chen, and Jiangbo Cao. Cinematte: Background matting for virtual production and beyond. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8725–8735, 2026

2026
[32]

Advancing ultrasoundmedicalcontinuouslearningwithtask-specificgeneralizationandadaptability

Chunzheng Zhu, Jianxin Lin, Guanghua Tan, Ningbo Zhu, Kenli Li, Chunlian Wang, and Shengli Li. Advancing ultrasoundmedicalcontinuouslearningwithtask-specificgeneralizationandadaptability. In2024IEEEInternational Conference on Bioinformatics and Biomedicine (BIBM), pages 3019–3025. IEEE, 2024

2024
[33]

Medeyes: Learning dynamic visual focus for medical progressive diagnosis

Chunzheng Zhu, Yangfang Lin, Shen Chen, Yijun Wang, and Jianxin Lin. Medeyes: Learning dynamic visual focus for medical progressive diagnosis. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 13916–13924, 2026

2026
[34]

Grasp: Plan-guided graph retrieval with adaptive fusion and reranking on semi-structured knowledge bases.arXiv preprint arXiv:2605.30237, 2026

Yicheng Tao, Yiqun Wang, Xiangchen Song, Xin Luo, Kai Liu, and Jie Liu. Grasp: Plan-guided graph retrieval with adaptive fusion and reranking on semi-structured knowledge bases.arXiv preprint arXiv:2605.30237, 2026

Pith/arXiv arXiv 2026
[35]

Modalimmune: Immunity driven unlearning via self destructive training.arXiv preprint arXiv:2602.16197, 2026

RongFu,WeiZhiTang,ZimingWang,JiaYeeTan,ZijianZhang,ZhaoluKang,MugeQi,ShuningZhang,andSimon Fong. Modalimmune: Immunity driven unlearning via self destructive training.arXiv preprint arXiv:2602.16197, 2026

Pith/arXiv arXiv 2026
[36]

Sppo: Sequence-level ppo for long-horizon reasoning tasks, 2026

Tianyi Wang, Yixia Li, Long Li, Yibiao Chen, Shaohan Huang, Yun Chen, Peng Li, Yang Liu, and Guanhua Chen. Sppo: Sequence-level ppo for long-horizon reasoning tasks, 2026. URL https://arxiv.org/abs/2604.08865

Pith/arXiv arXiv 2026
[37]

Regulating model reliance on non-robust features by smoothing input marginal density

Peiyu Yang, Naveed Akhtar, Mubarak Shah, and Ajmal Mian. Regulating model reliance on non-robust features by smoothing input marginal density. InEuropean Conference on Computer Vision, pages 329–347. Springer, 2024. Visual Computing Lab·The Hong Kong Polytechnic University 12 / 17

2024
[38]

Lapt: Label-driven automated prompt tuning for ood detection with vision-language models

Yabin Zhang, Wenjie Zhu, Chenhang He, and Lei Zhang. Lapt: Label-driven automated prompt tuning for ood detection with vision-language models. InEuropean conference on computer vision, pages 271–288. Springer, 2024

2024
[39]

Knowledge regularized negative feature tuning of vision-language models for out-of-distribution detection

Wenjie Zhu, Yabin Zhang, Xin Jin, Wenjun Zeng, and Lei Zhang. Knowledge regularized negative feature tuning of vision-language models for out-of-distribution detection. InProceedings of the 33rd ACM International Conference on Multimedia, pages 3565–3574, 2025

2025
[40]

Ants: Adaptive negative textual space shaping for ood detection via test-time mllm understanding and reasoning

Wenjie Zhu, Yabin Zhang, Xin Jin, Wenjun Zeng, and Lei Zhang. Ants: Adaptive negative textual space shaping for ood detection via test-time mllm understanding and reasoning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20–30, 2026

2026
[41]

Activationmatters: Test-timeactivatednegativelabelsforooddetectionwithvision-languagemodels.arXivpreprint arXiv:2603.25250, 2026

Yabin Zhang, Maya Varma, Yunhe Gao, Jean-Benoit Delbrouck, Jiaming Liu, Chong Wang, and Curtis Langlotz. Activationmatters: Test-timeactivatednegativelabelsforooddetectionwithvision-languagemodels.arXivpreprint arXiv:2603.25250, 2026

arXiv 2026
[42]

Cross-modal proxy evolving for ood detection with vision-language models.arXiv preprint arXiv:2601.08476, 2026

Hao Tang, Yu Liu, Shuanglin Yan, Fei Shen, Shengfeng He, and Jing Qin. Cross-modal proxy evolving for ood detection with vision-language models.arXiv preprint arXiv:2601.08476, 2026

arXiv 2026
[43]

Quantized-tinyllava: a new multimodal foundation model enables efficient split learning.arXiv preprint arXiv:2511.23402, 2025

Jiajun Guo, Xin Luo, Jiayin Zheng, Yiqun Wang, Kai-Wei Chang, Wei Wang, and Jie Liu. Quantized-tinyllava: a new multimodal foundation model enables efficient split learning.arXiv preprint arXiv:2511.23402, 2025

arXiv 2025
[44]

Future- sightdrive: Thinking visually with spatio-temporal cot for autonomous driving.arXiv preprint arXiv:2505.17685, 2025

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Future- sightdrive: Thinking visually with spatio-temporal cot for autonomous driving.arXiv preprint arXiv:2505.17685, 2025

Pith/arXiv arXiv 2025
[45]

Reversible primitive– composition alignment for continual vision–language learning

Canran Xiao, Tianxiang Xu, Siyuan Ma, Yiyang Jiang, Haoyu Gao, and Yuhan Wu. Reversible primitive– composition alignment for continual vision–language learning. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[46]

Pi-cca: Prompt-invariant cca certificates for replay-free continual multimodal learning

Jiayu Zhang, Chuangxin Zhao, Canran Xiao, Ruibo Duan, Wenyi Mo, Haoyu Gao, and Wenshuo Wang. Pi-cca: Prompt-invariant cca certificates for replay-free continual multimodal learning. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[47]

Thinkingwithblueprints: Assist- ingvision-languagemodelsinspatialreasoningviastructuredobjectrepresentation.arXivpreprintarXiv:2601.01984, 2026

WeijianMa,ShizhaoSun,TianyuYu,RuiyuWang,Tat-SengChua,andJiangBian. Thinkingwithblueprints: Assist- ingvision-languagemodelsinspatialreasoningviastructuredobjectrepresentation.arXivpreprintarXiv:2601.01984, 2026

arXiv 2026
[48]

Mmt-ard: Multimodal multi-teacher adversarial distillation for robust vision-language models.arXiv preprint arXiv:2511.17448, 2025

YuqiLi,JunhaoDong, ChuanguangYang, ShipingWen,PiotrKoniusz, TingwenHuang,YingliTian, andYew-Soon Ong. Mmt-ard: Multimodal multi-teacher adversarial distillation for robust vision-language models.arXiv preprint arXiv:2511.17448, 2025

arXiv 2025
[49]

Ammkd: Adaptive multimodal multi-teacher distillation for lightweight vision-language models

Yuqi Li, Chuanguang Yang, Junhao Dong, Zhengtao Yao, Haoyan Xu, Zeyu Dong, Hansheng Zeng, Zhulin An, and Yingli Tian. Ammkd: Adaptive multimodal multi-teacher distillation for lightweight vision-language models. arXiv preprint arXiv:2509.00039, 2025

arXiv 2025
[50]

Discriminant analysis by gaussian mixtures

Trevor Hastie and Robert Tibshirani. Discriminant analysis by gaussian mixtures. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):155–176, 1996

1996
[51]

Delving into out-of-distribution detection with vision-language representations

Yifei Ming, Ziyang Cai, Jiuxiang Gu, Yiyou Sun, Wei Li, and Yixuan Li. Delving into out-of-distribution detection with vision-language representations. Advances in Neural Information Processing Systems, 35:35087–35102, 2022

2022
[52]

Negative label guided ood detection with pretrained vision-language models

Xue Jiang, Feng Liu, Zhen Fang, Hong Chen, Tongliang Liu, Feng Zheng, and Bo Han. Negative label guided ood detection with pretrained vision-language models. arXiv preprint arXiv:2403.20078, 2024

arXiv 2024
[53]

Pattern recognition and machine learning

Christopher M Bishop and Nasser M Nasrabadi. Pattern recognition and machine learning, volume 4. Springer, 2006

2006
[54]

Regularized discriminant analysis

Jerome H Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165–175, 1989

1989
[55]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

2021
[56]

Test-time prompt tuning for zero-shot generalization in vision-language models

Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models. Advances in Neural Information Processing Systems, 35:14274–14289, 2022

2022
[57]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

2009
[58]

Learning robust global representations by penalizing local predictive power

Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. Advances in Neural Information Processing Systems, 32, 2019

2019
[59]

Natural adversarial examples

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15262–15271, 2021

2021
[60]

Do imagenet classifiers generalize to imagenet?

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In International conference on machine learning, pages 5389–5400. PMLR, 2019

2019
[61]

The many faces of robustness: A critical analysis of out-of-distribution generalization

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8340–8349, 2021

2021
[62]

The caltech-ucsd birds-200-2011 dataset

Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, Serge Belongie, et al. The caltech-ucsd birds-200-2011 dataset. Technical report

2011
[63]

3d object representations for fine-grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013

2013
[64]

Food-101 – mining discriminative components with random forests

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In European conference on computer vision, pages 446–461. Springer, 2014

2014
[65]

Cats and dogs

Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012

2012
[66]

The inaturalist species classification and detection dataset

Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778, 2018

2018
[67]

Sun database: Large-scale scene recognition from abbey to zoo

Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE, 2010

2010
[68]

Describing textures in the wild

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014

2014
[69]

Places: A 10 million image database for scene recognition

Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2017

2017

Supplementary Material

Table S1. Hyper-parameter settings for the pro...
