pith. machine review for the scientific record.

arxiv: 2604.18867 · v1 · submitted 2026-04-20 · 💻 cs.CV · cs.AI · cs.LG

Recognition: unknown

Hierarchically Robust Zero-shot Vision-language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:21 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords adversarial robustness · zero-shot classification · vision-language models · hierarchical embeddings · adversarial fine-tuning · margin sizes · multi-level alignment

The pith

Hierarchical embeddings and multi-level alignments make zero-shot vision-language models robust to adversarial attacks at different class levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing robust fine-tuning for vision-language models sacrifices natural zero-shot performance and fails against attacks on superclasses. Instead, by building embeddings hierarchically and aligning image and text features robustly at multiple levels, the models can defend at several class levels while keeping natural performance. A key theoretical result links the depth at which an embedding is placed in the hierarchy to the maximum viable margin against adversaries. The approach also aligns over multiple class trees sharing the same leaves to add semantic variety.
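
The leaf-versus-superclass distinction is easy to make concrete. Below is a minimal, hypothetical PGD sketch against a CLIP-style zero-shot classifier, where the attack level is set simply by which prompt bank (leaf names or superclass names) supplies `text_feats`; this illustrates the attack setting the paper studies, not the authors' code, and `encode_image` is an assumed stand-in for any image encoder.

```python
import torch
import torch.nn.functional as F

def pgd_attack(encode_image, text_feats, images, labels,
               eps=4/255, alpha=1/255, steps=10):
    """l_inf PGD against a zero-shot classifier whose logits are cosine
    similarities between image embeddings and class-text embeddings.
    Passing superclass prompts (e.g., 'mammal') instead of leaf prompts
    (e.g., 'cat') as text_feats crafts superclass-level adversaries."""
    text_feats = F.normalize(text_feats, dim=-1)
    adv = images + torch.empty_like(images).uniform_(-eps, eps)  # random start
    adv = adv.clamp(0, 1).detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        img_feats = F.normalize(encode_image(adv), dim=-1)
        logits = 100.0 * img_feats @ text_feats.T  # CLIP-style scaling
        loss = F.cross_entropy(logits, labels)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv + alpha * grad.sign()                 # ascend the loss
            adv = images + (adv - images).clamp(-eps, eps)  # project to eps-ball
            adv = adv.clamp(0, 1)                           # keep a valid image
        adv = adv.detach()
    return adv
```

Swapping the leaf prompt bank for a superclass bank is all it takes to reproduce the evaluation gap Figure 1 highlights: models hardened only against leaf-level adversaries have never seen perturbations optimized against "mammal" rather than "cat".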

Core claim

We propose a novel adversarial fine-tuning framework based on hierarchical embeddings and several levels of adversarially robust alignment of image-text modalities. We provide a theoretical connection between the depth of embedding in the hierarchy and the maximum viable margin size. Our model naturally realizes several margin sizes, boosting generalization of adversaries for robustification. As various trees with different parent labels can share the same leaf labels, we also consider aligning over multiple trees to boost semantic variety.

What carries the argument

Hierarchical embeddings placed at desired depths with multi-level adversarially robust image-text alignment, theoretically tied to margin size.

Load-bearing premise

That visual embeddings can be reliably placed at desired hierarchy depths and that alignment across multiple trees with shared leaf labels boosts semantic variety without inconsistencies or degrading base performance.

What would settle it

If experiments show no gain in robustness to superclass attacks compared to standard methods, or if clean accuracy drops, the approach would not hold.

Figures

Figures reproduced from arXiv: 2604.18867 by Hao Zhu, Junhao Dong, Piotr Koniusz, Yew-Soon Ong, Yifei Zhang.

Figure 1
Figure 1. Figure 1: Motivation (ImageNet [6] Val.) Fig. 1a: robust accuracy on superclasses (one level above the leaf classes). TeCoA [32], PMG [42], and FARE [38] use adversaries derived from base classes and thus perform poorly on superclasses. Fig. 1b: transfer attack success rates on base classes. For our model, adversarial samples generated on superclasses were used. In contrast, TeCoA [32], PMG [42], and FARE [38], the… view at source ↗
Figure 2
Figure 2. Figure 2: Understanding Hyperbolic Geometry. Fig. 2a: Trees can be embedded in the Poincaré ball with low distortion (∼ 1+ϵ) [37]. Thus, the distance between tree nodes d̃r(u, v) is the analogue of the Riemannian distance dr(u, v) between hyperbolic embeddings. The norm of the vector measured in the Poincaré ball is a proxy to the hierarchical level of a node in the tree (Fig. 2b) (we use “reciprocal”: L is the… view at source ↗
Figure 3
Figure 3. Figure 3: Our pipeline. Text & image CLIP encoders are used. For text, we look up each class in a hierarchical tree (e.g., WordNet for ImageNet) and extract the path from the leaf category to the root, e.g., animal←mammal←canine←wolf. For each category level, we form one text prompt and encode it with CLIP. The exponential map elevates embeddings from the Euclidean space into the hyperbolic space (Poincaré ball).… view at source ↗
Figure 4
Figure 4. Figure 4: Our pipeline. (Panel A: Pipeline.) view at source ↗
Figure 5
Figure 5. Figure 5: The hierarchy of a sub-branch in ImageNet with the root node of “Feline”. The leaf nodes represent the original base classes. view at source ↗
Figure 6
Figure 6. Figure 6: The hierarchy of CIFAR-10. The leaf nodes represent the original base classes. view at source ↗
Figure 7
Figure 7. Figure 7: Hyper-parameter (λ1 and λ2) sensitivity of our method on average clean and (Auto-Attack) robust accuracy (%). (Panel (a) sweeps ζvic and panel (b) sweeps ζdes, each against clean accuracy (%) and robust accuracy (%).) view at source ↗
Figure 8
Figure 8. Figure 8: Hyper-parameter (ζvic and ζgap) sensitivity of our adversarial fine-tuning method on average clean and (Auto-Attack) robust accuracy (%) across 15 datasets in the zero-shot setting. Experimental settings for medical CLIP extension. To expand our empirical analyses for robust medical imaging, we utilize a CLIP model pre-trained specifically on radiology datasets following the CheXzero framework [93] with t… view at source ↗
Figure 9
Figure 9. Figure 9: Diverse types of taxonomies w.r.t. base categories for different hierarchies. view at source ↗
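
The hyperbolic machinery referenced in Figures 2 and 3 reduces to a few textbook formulas. The sketch below implements the standard unit-curvature Poincaré-ball operations the captions name (exponential map at the origin, Riemannian distance, norm as a hierarchy-level proxy); it is generic geometry, not the paper's implementation.

```python
import torch

def expmap0(v: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Exponential map at the origin of the unit-curvature Poincare ball:
    lifts a Euclidean (tangent-space) vector into the ball, as Figure 3
    describes for CLIP embeddings."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(norm) * v / norm

def poincare_dist(u: torch.Tensor, v: torch.Tensor,
                  eps: float = 1e-8) -> torch.Tensor:
    """Riemannian distance d_r(u, v) in the ball; for low-distortion tree
    embeddings this tracks the tree distance between nodes (Fig. 2a)."""
    sq = (u - v).pow(2).sum(dim=-1)
    denom = ((1 - u.pow(2).sum(dim=-1)) *
             (1 - v.pow(2).sum(dim=-1))).clamp_min(eps)
    return torch.acosh((1 + 2 * sq / denom).clamp_min(1 + eps))

def level_proxy(u: torch.Tensor) -> torch.Tensor:
    """Norm of an embedding in the ball: a proxy for its hierarchy level
    (Fig. 2b); roots sit near the origin, leaves near the boundary."""
    return u.norm(dim=-1)
```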
read the original abstract

Vision-Language Models (VLMs) can perform zero-shot classification but are susceptible to adversarial attacks. While robust fine-tuning improves their robustness, existing approaches align fixed text embeddings with an image embedding, sacrificing natural performance and robustness. A robustness degradation also occurs when a model faces adversarial attacks targeting superclasses (parent classes, e.g., mammal) in addition to their base (leaf) classes (e.g., cat). Thus, to enhance adversarial robustness and leverage the inherent hierarchical properties of class space, we propose a novel adversarial fine-tuning framework based on hierarchical embeddings and several levels of adversarially robust alignment of image-text modalities. Additional mechanisms place visual embeddings at the desired depth of hierarchy, and we provide a theoretical connection between the depth of embedding in the hierarchy and the maximum viable margin size. Our model naturally realizes several margin sizes, boosting generalization of adversaries for robustification. As various trees with different parent labels can share the same leaf labels, we also consider aligning over multiple trees to boost semantic variety. Experiments across several datasets are performed.
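
As a concrete reading of the text branch sketched in Figure 3, one prompt per hierarchy level can be encoded independently with CLIP before any hyperbolic lifting. The snippet below uses OpenAI's `clip` package; the leaf-to-root path and the prompt template are illustrative stand-ins, not the paper's exact choices.

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

model, _ = clip.load("ViT-B/32", device="cpu")

# Hypothetical leaf-to-root path, as in Figure 3:
# animal <- mammal <- canine <- wolf
path = ["wolf", "canine", "mammal", "animal"]
prompts = [f"a photo of a {name}" for name in path]

with torch.no_grad():
    tokens = clip.tokenize(prompts)           # (levels, 77)
    level_feats = model.encode_text(tokens)   # (levels, embed_dim)
    level_feats = level_feats / level_feats.norm(dim=-1, keepdim=True)

# Each row is one hierarchy level; lifting the rows into the Poincare ball
# (e.g., via an exponential map) and aligning image features against every
# level is the multi-level alignment the abstract describes.
```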

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes a novel adversarial fine-tuning framework for zero-shot vision-language models (VLMs) that uses hierarchical embeddings and multi-level adversarially robust alignment of image-text modalities. It introduces additional mechanisms to place visual embeddings at chosen hierarchy depths, claims a theoretical connection between embedding depth and maximum viable margin size (allowing the model to naturally realize several margin sizes for better adversary generalization), and considers alignment over multiple trees sharing leaf labels to increase semantic variety. The approach targets robustness degradation on both leaf classes and superclasses while aiming to preserve natural performance, with experiments reported across several datasets.

Significance. If the placement mechanisms and depth-margin theoretical link can be rigorously derived and shown to preserve zero-shot capability, the work could offer a principled, hierarchy-aware route to multi-scale adversarial robustness in VLMs that avoids the performance-robustness trade-offs of standard fine-tuning. The multi-tree alignment idea, if consistent, would add a low-cost way to increase semantic coverage. These elements address a genuine gap in current robust VLM literature, but their impact hinges on verifiable implementation details and empirical gains that are not yet substantiated in the provided description.

major comments (3)
  1. Abstract: The central claim of a 'theoretical connection between the depth of embedding in the hierarchy and the maximum viable margin size' is asserted without any derivation, equation, or proof sketch. This link is load-bearing for the assertion that the model 'naturally realizes several margin sizes' and boosts adversary generalization; the full manuscript must supply the derivation (including any assumptions on embedding placement) to allow verification that the relation holds beyond the leaf level and is not circular.
  2. Abstract: The statement that 'additional mechanisms place visual embeddings at the desired depth of hierarchy' is presented without describing the concrete procedure (loss term, projection operator, constraint, or regularization). This mechanism is essential to the framework's feasibility, stability, and preservation of zero-shot performance; without it, it is impossible to assess whether the placement is reliable or introduces inconsistencies when aligning across multiple trees with shared leaf labels.
  3. Abstract: The multi-tree alignment claim ('aligning over multiple trees to boost semantic variety') lacks any consistency guarantee or method for handling shared leaf labels across trees with differing parent labels. This is load-bearing for the semantic-variety benefit and could otherwise degrade base-task performance or introduce label conflicts; the manuscript must provide the alignment objective and any cross-tree consistency analysis.
minor comments (1)
  1. Abstract: The description of experiments is limited to 'across several datasets' with no mention of specific datasets, baselines, attack types, or metrics; adding these would improve clarity even at the abstract level.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the abstract requires additional supporting details to substantiate the core claims, and we will revise the manuscript to include the requested derivations, procedural descriptions, and consistency analysis. Our responses to each major comment are provided below.

read point-by-point responses
  1. Referee: Abstract: The central claim of a 'theoretical connection between the depth of embedding in the hierarchy and the maximum viable margin size' is asserted without any derivation, equation, or proof sketch. This link is load-bearing for the assertion that the model 'naturally realizes several margin sizes' and boosts adversary generalization; the full manuscript must supply the derivation (including any assumptions on embedding placement) to allow verification that the relation holds beyond the leaf level and is not circular.

    Authors: We agree that a self-contained derivation is essential. The full manuscript contains a theoretical analysis deriving the depth-margin relationship from the hierarchical embedding constraints and the adversarial objective; we will add an explicit proof sketch (with equations) and expand the stated assumptions in Section 3. The derivation starts from the placement of embeddings at tree depths and uses margin bounds induced by the hierarchy levels; it is grounded in the loss formulation rather than being circular. We will also verify and state its extension to superclass attacks. revision: yes

  2. Referee: Abstract: The statement that 'additional mechanisms place visual embeddings at the desired depth of hierarchy' is presented without describing the concrete procedure (loss term, projection operator, constraint, or regularization). This mechanism is essential to the framework's feasibility, stability, and preservation of zero-shot performance; without it, it is impossible to assess whether the placement is reliable or introduces inconsistencies when aligning across multiple trees with shared leaf labels.

    Authors: We will provide the concrete implementation details in the revised methods section. The placement uses a depth-conditioned projection operator combined with a regularization term in the overall loss that enforces embedding proximity to the target hierarchy depth. We will include the exact mathematical formulation, stability analysis, and empirical checks confirming that zero-shot performance is preserved and that the mechanism remains consistent under multi-tree alignment. revision: yes (an illustrative sketch of such a regularizer follows this list)

  3. Referee: Abstract: The multi-tree alignment claim ('aligning over multiple trees to boost semantic variety') lacks any consistency guarantee or method for handling shared leaf labels across trees with differing parent labels. This is load-bearing for the semantic-variety benefit and could otherwise degrade base-task performance or introduce label conflicts; the manuscript must provide the alignment objective and any cross-tree consistency analysis.

    Authors: We will expand the multi-tree alignment description to include the full objective function and a cross-tree consistency constraint that reconciles shared leaf embeddings while respecting differing parent labels (via label-aware averaging in the alignment loss). A consistency analysis will be added showing that semantic variety increases without label conflicts or degradation on base tasks. This will be supported by additional ablation results in the experiments section. revision: yes
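
The rebuttal's second response names a depth-conditioned projection with a proximity regularizer but gives no formula. Purely as a hedged illustration of what such a term could look like (our guess, under the norm-as-depth convention of Figure 2b, not the authors' loss), one could penalize the gap between an embedding's ball norm and a radius assigned to its target depth:

```python
import torch

def depth_placement_penalty(emb: torch.Tensor,
                            depth: torch.Tensor,
                            max_depth: int) -> torch.Tensor:
    """Hypothetical regularizer: pull the Poincare-ball norm of each
    embedding toward a radius that grows with its hierarchy depth
    (root near the origin, leaves near the boundary).
    emb:   (batch, dim) embeddings inside the unit ball.
    depth: (batch,) integer depth of each embedding's target node."""
    target_radius = torch.tanh(depth.float() / max_depth)  # in [0, tanh(1))
    return ((emb.norm(dim=-1) - target_radius) ** 2).mean()
```

Such a penalty would be added to the alignment loss with a weight to tune; whether the paper's actual mechanism is a projection, a hard constraint, or a penalty like this is exactly what the referee asks the revision to state.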

Circularity Check

0 steps flagged

No circularity: theoretical link and hierarchical placement presented as derived properties without reduction to inputs

full rationale

The paper's core claims rest on proposing a framework with hierarchical embeddings, multi-level robust alignment, and a stated theoretical connection between embedding depth and margin size, plus mechanisms for placing embeddings at chosen depths. No equations, loss terms, or derivations are shown in the provided text that reduce the claimed connection or placement procedure to a self-definition, fitted parameter renamed as prediction, or self-citation chain. The multi-tree alignment is described as an additional consideration without invoking uniqueness theorems or ansatzes from prior self-work. The derivation chain is therefore testable against external benchmarks and does not exhibit the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method likely involves unspecified hyperparameters for alignment strengths, hierarchy depths, and margin sizes, but these cannot be audited without the full text.

pith-pipeline@v0.9.0 · 5490 in / 1174 out tokens · 44205 ms · 2026-05-10T04:21:17.668051+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

56 extracted references · 5 canonical work pages · 1 internal anchor

  1. [1]

    Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures

    James Bergstra, Daniel Yamins, and David Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International conference on machine learning, pages 115–123. PMLR, 2013

  2. [2]

    Food-101–mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pages 446–461. Springer, 2014

  3. [3]

    Padchest: A large chest x-ray image dataset with multi-label annotated reports

    Aurelia Bustos, Antonio Pertusa, Jose-Maria Salinas, and Maria De La Iglesia-Vaya. Padchest: A large chest x-ray image dataset with multi-label annotated reports. Medical Image Analysis, 66:101797, 2020

  4. [4]

    Hyperbolic geometry

    James W Cannon, William J Floyd, Richard Kenyon, Walter R Parry, et al. Hyperbolic geometry. Flavors of Geometry, 31(59-115):2, 1997

  5. [5]

    Towards evaluating the robustness of neural networks

    Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE symposium on security and privacy (SP), pages 39–57. IEEE, 2017

  6. [6]

    Describing textures in the wild

    Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014

  7. [7]

    An analysis of single-layer networks in unsupervised feature learning

    Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011

  8. [8]

    Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks

    Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International conference on machine learning, pages 2206–2216. PMLR, 2020

  9. [9]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

  10. [10]

    Hyperbolic image-text representations

    Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, and Shanmukha Ramakrishna Vedantam. Hyperbolic image-text representations. In International Conference on Machine Learning, pages 7694–7731. PMLR, 2023

  11. [11]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, ICLR, 2021

  12. [12]

    Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories

    Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE, 2004

  13. [13]

    Caltech-256 object category dataset

    Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. 2007

  14. [14]

    Gradient-based adversarial attacks against text transformers

    Chuan Guo, Alexandre Sablayrolles, Hervé Jégou, and Douwe Kiela. Gradient-based adversarial attacks against text transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 5747–5757, 2021

  15. [15]

    Hypercolumns for object segmentation and fine-grained localization

    Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 447–456, 2015

  16. [16]

    Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification

    Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019

  17. [17]

    A token-wise cnn-based method for sentence compression

    Weiwei Hou, Hanna Suominen, Piotr Koniusz, Sabrina Caldwell, and Tom Gedeon. A token-wise cnn-based method for sentence compression. In International Conference on Neural Information Processing (ICONIP), pages 668–679. Springer, Cham, 2020

  18. [18]

    Adversarial attacks on foundational vision models

    Nathan Inkawhich, Gwendolyn McDonald, and Ryan Luley. Adversarial attacks on foundational vision models. arXiv preprint arXiv:2308.14597, 2023

  19. [19]

    Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison

    Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, pages 590–597, 2019

  20. [20]

    Visual prompt tuning

    Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pages 709–727. Springer, 2022

  21. [21]

    Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports

    Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1):317, 2019

  22. [22]

    Hyperbolic image embeddings

    Valentin Khrulkov, Leyla Mirvakhabova, Evgeniya Ustinova, Ivan Oseledets, and Victor Lempitsky. Hyperbolic image embeddings. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6418–6428, 2020

  23. [23]

    3d object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013

  24. [24]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

  25. [25]

    Carzero: Cross-attention alignment for radiology zero-shot classification

    Haoran Lai, Qingsong Yao, Zihang Jiang, Rongsheng Wang, Zhiyang He, Xiaodong Tao, and S Kevin Zhou. Carzero: Cross-attention alignment for radiology zero-shot classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11137–11146, 2024

  26. [26]

    Biobert: a pre-trained biomedical language representation model for biomedical text mining

    Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020

  27. [27]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022

  28. [28]

    BERT-ATTACK: adversarial attack against BERT using BERT

    Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. BERT-ATTACK: adversarial attack against BERT using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 6193–6202, 2020

  29. [29]

    One prompt word is enough to boost adversarial robustness for pre-trained vision-language models

    Lin Li, Haoyan Guan, Jianing Qiu, and Michael Spratling. One prompt word is enough to boost adversarial robustness for pre-trained vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024

  30. [30]

    Feature pyramid networks for object detection

    Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017

  31. [31]

    Openkd: Opening prompt diversity for zero- and few-shot keypoint detection

    Changsheng Lu, Zheyuan Liu, and Piotr Koniusz. Openkd: Opening prompt diversity for zero- and few-shot keypoint detection. In Computer Vision – ECCV 2024, pages 148–165, Cham, 2025. Springer Nature Switzerland

  32. [32]

    Set-level guidance attack: Boosting adversarial transferability of vision-language pre-training models

    Dong Lu, Zhiqiang Wang, Teng Wang, Weili Guan, Hongchang Gao, and Feng Zheng. Set-level guidance attack: Boosting adversarial transferability of vision-language pre-training models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 102–111, 2023

  33. [33]

    Towards deep learning models resistant to adversarial attacks

    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR, 2018

  34. [34]

    Fine-Grained Visual Classification of Aircraft

    Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013

  35. [35]

    Understanding zero-shot adversarial robustness for large-scale models

    Chengzhi Mao, Scott Geng, Junfeng Yang, Xin Wang, and Carl Vondrick. Understanding zero-shot adversarial robustness for large-scale models. In The Eleventh International Conference on Learning Representations, ICLR, 2023

  36. [36]

    Wordnet: a lexical database for English

    George A Miller. Wordnet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995

  37. [37]

    PACE: marrying the generalization of PArameter-efficient fine-tuning with consistency regularization

    Yao Ni, Shan Zhang, and Piotr Koniusz. PACE: marrying the generalization of PArameter-efficient fine-tuning with consistency regularization. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  38. [38]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008

  39. [39]

    ChatGPT [large language model]

    OpenAI. ChatGPT [large language model]. https://chatgpt.com, 2024

  40. [40]

    Cats and dogs

    Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012

  41. [41]

    Xplainer: From x-ray observations to explainable zero-shot diagnosis

    Chantal Pellegrini, Matthias Keicher, Ege Özsoy, Petra Jiraskova, Rickmer Braren, and Nassir Navab. Xplainer: From x-ray observations to explainable zero-shot diagnosis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 420–429. Springer, 2023

  42. [42]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  43. [43]

    Robust clip: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models

    Christian Schlarmann, Naman Deep Singh, Francesco Croce, and Matthias Hein. Robust clip: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models. arXiv preprint arXiv:2402.12336, 2024

  44. [44]

    Expert-level detection of pathologies from unannotated chest x-ray images via self-supervised learning

    Ekin Tiu, Ellie Talius, Pujan Patel, Curtis P Langlotz, Andrew Y Ng, and Pranav Rajpurkar. Expert-level detection of pathologies from unannotated chest x-ray images via self-supervised learning. Nature Biomedical Engineering, 6(12):1399–1406, 2022

  45. [45]

    Rotation equivariant cnns for digital pathology

    Bastiaan S Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling. Rotation equivariant cnns for digital pathology. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part II 11, pages 210–218. Springer, 2018

  46. [46]

    Pre-trained model guided fine-tuning for zero-shot adversarial robustness

    Sibo Wang, Jie Zhang, Zheng Yuan, and Shiguang Shan. Pre-trained model guided fine-tuning for zero-shot adversarial robustness. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024

  47. [47]

    Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases

    Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2097–2106, 2017

  48. [48]

    Sun database: Large-scale scene recognition from abbey to zoo

    Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE, 2010

  49. [49]

    Coca: Contrastive captioners are image-text foundation models

    Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research, 2022

  50. [50]

    Towards adversarial attack on vision-language pre-training models

    Jiaming Zhang, Qi Yi, and Jitao Sang. Towards adversarial attack on vision-language pre-training models. In Proceedings of the 30th ACM International Conference on Multimedia, pages 5005–5013, 2022

  51. [51]

    Less is more: Extreme gradient boost rank-1 adaption for efficient finetuning of llms

    Yifei Zhang, Hao Zhu, Aiwei Liu, Han Yu, Piotr Koniusz, and Irwin King. Less is more: Extreme gradient boost rank-1 adaption for efficient finetuning of llms. arXiv preprint arXiv:2410.19694, 2024

  52. [52]

    Crossspectra: Exploiting cross-layer smoothness for parameter-efficient fine-tuning

    Yifei Zhang, Hao Zhu, Junhao Dong, Haoran Shi, Ziqiao Meng, Piotr Koniusz, and Han Yu. Crossspectra: Exploiting cross-layer smoothness for parameter-efficient fine-tuning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  53. [53]

    On evaluating adversarial robustness of large vision-language models

    Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. Advances in Neural Information Processing Systems, 36:54111–54138, 2023

  54. [54]

    Clip in medical imaging: A comprehensive survey

    Zihao Zhao, Yuxiao Liu, Han Wu, Yonghao Li, Sheng Wang, Lin Teng, Disheng Liu, Xiang Li, Zhiming Cui, Qian Wang, et al. Clip in medical imaging: A comprehensive survey. arXiv preprint arXiv:2312.07353, 2023

  55. [55]

    Learning to prompt for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision (IJCV), 2022

  56. [56]

    Bilora: Almost-orthogonal parameter spaces for continual learning

    Hao Zhu, Yifei Zhang, Junhao Dong, and Piotr Koniusz. Bilora: Almost-orthogonal parameter spaces for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 25613–25622, 2025