pith. sign in

arxiv: 2407.17491 · v4 · submitted 2024-07-04 · 💻 cs.CV · cs.LG

Robust Adaptation of Foundation Models with Black-Box Visual Prompting

Pith reviewed 2026-05-23 23:21 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords black-box adaptationvisual promptingparameter-efficient transfer learningfoundation modelsrandomized smoothinggradient estimationdomain adaptationrobustness
0
0 comments X

The pith

BlackVIP adapts pre-trained models to new tasks and domains using only input-dependent visual prompts without accessing model parameters or caching activations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BlackVIP to perform parameter-efficient transfer learning on foundation models that are available only as black boxes. It consists of a Coordinator that generates input-specific visual prompts and SPSA-GC that estimates gradients through simultaneous perturbation to update the Coordinator. Experiments across 19 datasets show that this approach achieves robust adaptation under diverse shifts while using far less memory than gradient-based methods that require full model access. A theoretical link is drawn between visual prompting and the certified robustness guarantees of randomized smoothing, with empirical results supporting improved generalization.

Core claim

BlackVIP enables adaptation of black-box pre-trained models by letting a Coordinator design input-dependent visual prompts whose effect on the model output is optimized via SPSA-GC gradient estimates; the method matches or exceeds white-box prompting baselines on 19 datasets while requiring only query access and minimal memory, and the generalization of such prompting is connected to the certified robustness of randomized smoothing.

What carries the argument

The Coordinator module that produces input-dependent visual prompts, updated via SPSA-GC gradient estimates on the black-box model outputs.

If this is right

  • Adaptation becomes feasible for proprietary or API-only models without internal access.
  • Memory footprint drops because no intermediate activations need to be stored.
  • A single trained Coordinator can be reused across multiple downstream tasks on the same model.
  • The randomized-smoothing connection supplies a route to certified robustness bounds for prompted models.
  • BlackVIP-SE trades some performance for substantially lower per-example runtime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same black-box prompting pattern could be tested on non-vision modalities where only output queries are available.
  • If the Coordinator generalizes across model families, one prompt generator might serve multiple unrelated foundation models.
  • The smoothing link suggests that increasing the number of prompt queries per example might directly improve certified robustness radius.

Load-bearing premise

The Coordinator can produce visual prompts that meaningfully steer the unknown model, and SPSA-GC can produce sufficiently accurate gradient estimates from output queries alone.

What would settle it

On a held-out domain-shift dataset, run BlackVIP and a memory-intensive white-box baseline; if BlackVIP requires comparable or higher memory or yields lower accuracy than the baseline while using only black-box queries, the claim of robust low-memory adaptation fails.

Figures

Figures reproduced from arXiv: 2407.17491 by Changdae Oh, Geunyoung Jung, Gyeongdeok Seo, Hosik Choi, Jiyoung Jung, Kyungwoo Song, Zhi-Qi Cheng.

Figure 1
Figure 1. Figure 1: For transfer learning of large-scale pre-trained models (PTM), [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: We propose an input-dependent prompt designer (Coordinator) and a new zeroth-order optimization algorithm (SPSA-GC) for Coordinator training. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Optimizer comparison for (left) loss curve and noise sensitivity analysis of 100-Dimensional Rosenbrock optimization problem and (Right) optimization [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Grad-CAM analysis on CLEVR, Pets, and UCF101. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Query efficiency. (x-axis) A number of queries and cost for achieving [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Relationship between intrinsic dimensionality estimates (with varying [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: (a) Empirical verification on the normality assumption, (b) Illustration for the decision boundaries and generalization behavior of randomized smoothing [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Examples of y = 7 subset in Biased-MNIST [70] with ρ = 0.9. (Top) The train set is constructed with the spurious correlation between the background color and digit class (e.g., y = 7 occurs 90% with a pink background and 10% with other random colors in this case). (Bottom) The test set is constructed with a reversed correlation to that of the train set (e.g., y = 7 occurs 10% with a pink background and 90%… view at source ↗
Figure 9
Figure 9. Figure 9: Examples of the Loc-MNIST dataset. The real digit from MNIST is [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 11
Figure 11. Figure 11: (Left) Embedding visualization with t-SNE [104] on the prompt [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Classification accuracy for given queries and corresponding budget ($ USD) of different black-box visual prompting method. [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Grad-CAM on CLEVR. Compared to baseline methods, BlackVIP extends the attention of models to broad areas of the image for effective reasoning [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Grad-CAM on UCF101. Compared to baseline methods, BlackVIP concentrates the attention of models on local areas of the image for effective [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Grad-CAM on OxfordPets. Compared to baseline methods, BlackVIP effectively adapts the model to focus on the target object rather than spurious [PITH_FULL_IMAGE:figures/full_fig_p022_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Grad-CAM on SVHN. Compared to baseline methods, BlackVIP effectively adapts the model to focus on the target digit rather than spurious features [PITH_FULL_IMAGE:figures/full_fig_p022_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Grad-CAM on EuroSAT. Compared to baseline methods, BlackVIP extends the attention of models to broad areas of the image for effective [PITH_FULL_IMAGE:figures/full_fig_p023_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Grad-CAM on StanfordCars. Compared to baseline methods, BlackVIP concentrates the attention of models on an object or local areas of an image [PITH_FULL_IMAGE:figures/full_fig_p023_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Grad-CAM on Biased-MNIST. While baseline methods attend to the background rather than digit shape, our BlackVIP can bypass this spurious [PITH_FULL_IMAGE:figures/full_fig_p024_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Grad-CAM on Loc-MNIST. Compared to baseline methods, BlackVIP effectively adapts the model to aim at edge-located true digit corresponding [PITH_FULL_IMAGE:figures/full_fig_p024_20.png] view at source ↗
read the original abstract

With a surge of large-scale pre-trained models, parameter-efficient transfer learning (PETL) of large models has garnered significant attention. While promising, they commonly rely on two optimistic assumptions: 1) full access to the parameters of a PTM, and 2) sufficient memory capacity to cache all intermediate activations for gradient computation. However, in most real-world applications, PTMs serve as black-box APIs or proprietary software without full parameter accessibility. Besides, it is hard to meet a large memory requirement for modern PTMs. This work proposes black-box visual prompting (BlackVIP), which efficiently adapts the PTMs without knowledge of their architectures or parameters. BlackVIP has two components: 1) Coordinator and 2) simultaneous perturbation stochastic approximation with gradient correction (SPSA-GC). The Coordinator designs input-dependent visual prompts, which allow the target PTM to adapt in the wild. SPSA-GC efficiently estimates the gradient of PTM to update Coordinator. Besides, we introduce a variant, BlackVIP-SE, which significantly reduces the runtime and computational cost of BlackVIP. Extensive experiments on 19 datasets demonstrate that BlackVIPs enable robust adaptation to diverse domains and tasks with minimal memory requirements. We further provide a theoretical analysis on the generalization of visual prompting methods by presenting their connection to the certified robustness of randomized smoothing, and presenting an empirical support for improved robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes BlackVIP for parameter-efficient black-box adaptation of pre-trained models (PTMs) via visual prompting. It introduces a Coordinator module that generates input-dependent visual prompts and SPSA-GC (simultaneous perturbation stochastic approximation with gradient correction) to estimate gradients without access to PTM parameters or activations. A variant BlackVIP-SE is also presented for reduced cost. Experiments on 19 datasets are reported to show robust adaptation across domains and tasks with low memory use. A theoretical connection is drawn between visual prompting generalization and the certified robustness guarantees of randomized smoothing, with empirical support for improved robustness.

Significance. If the central claims hold, the work would be significant for enabling adaptation of large foundation models in black-box API settings where parameter access and memory are constrained, extending PETL methods beyond white-box assumptions. The randomized-smoothing connection, if rigorously established, provides a novel theoretical lens on prompting generalization and robustness that could inform future work.

major comments (2)
  1. [§3] §3 (SPSA-GC definition): The central claim that the Coordinator learns effective input-dependent prompts depends on SPSA-GC producing usable gradient estimates, yet the manuscript provides no direct verification of estimate quality (e.g., cosine similarity to finite-difference or white-box gradients on a surrogate model). SPSA variance scales with dimension and perturbation size; without quantifying how the correction term mitigates this for the prompt parameterization, the reported adaptation results on 19 datasets cannot be confidently attributed to successful optimization.
  2. [Experiments] Experiments section (ablation studies): No ablation isolates the contribution of the gradient-correction term in SPSA-GC versus plain SPSA. If the correction does not materially reduce noise for the Coordinator's input-dependent design, the method reduces to standard zeroth-order optimization whose reliability in high-dimensional prompt spaces is known to be limited; this directly affects the robustness-adaptation claim.
minor comments (2)
  1. The abstract states experiments on 19 datasets but the main text should explicitly list them with domain/task breakdown and baseline comparisons in a single table for clarity.
  2. Notation for the Coordinator's prompt parameterization and the SPSA perturbation schedule should be introduced earlier and used consistently to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the validation of SPSA-GC and the need for targeted ablations. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (SPSA-GC definition): The central claim that the Coordinator learns effective input-dependent prompts depends on SPSA-GC producing usable gradient estimates, yet the manuscript provides no direct verification of estimate quality (e.g., cosine similarity to finite-difference or white-box gradients on a surrogate model). SPSA variance scales with dimension and perturbation size; without quantifying how the correction term mitigates this for the prompt parameterization, the reported adaptation results on 19 datasets cannot be confidently attributed to successful optimization.

    Authors: We acknowledge that the manuscript does not include direct verification of SPSA-GC estimate quality such as cosine similarity to surrogate gradients. The current evidence for effective optimization rests on consistent performance gains across the 19 datasets and the design of the correction term to reduce variance in high-dimensional prompt spaces. To address the concern, we will add an analysis section using a surrogate model to report cosine similarities, variance metrics, and the effect of the correction term in the revised version. revision: yes

  2. Referee: [Experiments] Experiments section (ablation studies): No ablation isolates the contribution of the gradient-correction term in SPSA-GC versus plain SPSA. If the correction does not materially reduce noise for the Coordinator's input-dependent design, the method reduces to standard zeroth-order optimization whose reliability in high-dimensional prompt spaces is known to be limited; this directly affects the robustness-adaptation claim.

    Authors: We agree that an explicit ablation isolating the gradient-correction term versus plain SPSA would better substantiate its contribution. The manuscript currently demonstrates overall method performance but does not include this direct comparison. We will add the requested ablation study in the experiments section of the revision to quantify noise reduction and its impact on adaptation results. revision: yes

Circularity Check

0 steps flagged

No circularity: new algorithmic components and presented connection

full rationale

The paper introduces BlackVIP with two explicitly new components—the input-dependent Coordinator and the SPSA-GC estimator—without defining any quantity in terms of itself or renaming a fitted parameter as a prediction. The theoretical analysis consists of presenting a connection between visual prompting and randomized smoothing certification; this is an external linkage offered for generalization insight rather than a derivation that reduces to the paper's own fitted values or self-citations. No load-bearing self-citation chains, uniqueness theorems imported from the same authors, or ansatzes smuggled via prior work appear in the provided claims. The method is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Review based on abstract only; no specific free parameters, axioms, or invented entities with independent evidence can be identified without the full text. The Coordinator and SPSA-GC are new method components introduced by the paper.

invented entities (2)
  • Coordinator no independent evidence
    purpose: Designs input-dependent visual prompts for black-box adaptation
    Introduced as a core component of BlackVIP in the abstract
  • SPSA-GC no independent evidence
    purpose: Estimates gradients of the PTM for updating the Coordinator
    New variant of simultaneous perturbation stochastic approximation presented in the abstract

pith-pipeline@v0.9.0 · 5798 in / 1169 out tokens · 22140 ms · 2026-05-23T23:21:31.224567+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

111 extracted references · 111 canonical work pages · 6 internal anchors

  1. [1]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning . PMLR, 2021, pp. 8748–8763

  2. [2]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al. , “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774 , 2023

  3. [3]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” Ad- vances in neural information processing systems , vol. 36, pp. 34 892– 34 916, 2023

  4. [4]

    Prefix-Tuning: Optimizing Continuous Prompts for Generation

    X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” arXiv preprint arXiv:2101.00190 , 2021. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 11

  5. [5]

    Visual prompt tuning,

    M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim, “Visual prompt tuning,” arXiv preprint arXiv:2203.12119 , 2022

  6. [6]

    Visual prompting: Modifying pixel space to adapt pre-trained models,

    H. Bahng, A. Jahanian, S. Sankaranarayanan, and P. Isola, “Visual prompting: Modifying pixel space to adapt pre-trained models,” arXiv preprint arXiv:2203.17274, 2022

  7. [7]

    Maple: Multi-modal prompt learning,

    M. U. Khattak, H. Rasheed, M. Maaz, S. Khan, and F. S. Khan, “Maple: Multi-modal prompt learning,” in Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, 2023, pp. 19 113– 19 122

  8. [8]

    An image is worth 16x16 words: Transformers for image recognition at scale,

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations , 2020

  9. [9]

    Prompting visual- language models for efficient video understanding,

    C. Ju, T. Han, K. Zheng, Y . Zhang, and W. Xie, “Prompting visual- language models for efficient video understanding,” arXiv preprint arXiv:2112.04478, 2021

  10. [10]

    Learning to prompt for vision-language models,

    K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision , vol. 130, no. 9, pp. 2337–2348, 2022

  11. [11]

    Conditional prompt learning for vision-language models,

    K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Conditional prompt learning for vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2022

  12. [12]

    Unified vision and language prompt learning,

    Y . Zang, W. Li, K. Zhou, C. Huang, and C. C. Loy, “Unified vision and language prompt learning,” arXiv preprint arXiv:2210.07225 , 2022

  13. [13]

    Multivariate stochastic approximation using a simultaneous perturbation gradient approximation,

    J. Spall, “Multivariate stochastic approximation using a simultaneous perturbation gradient approximation,” IEEE Transactions on Automatic Control, vol. 37, no. 3, pp. 332–341, 1992

  14. [14]

    Blackvip: Black-box visual prompting for robust transfer learning,

    C. Oh, H. Hwang, H.-y. Lee, Y . Lim, G. Jung, J. Jung, H. Choi, and K. Song, “Blackvip: Black-box visual prompting for robust transfer learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2023, pp. 24 224–24 235

  15. [15]

    Adaptformer: Adapting vision transformers for scalable visual recog- nition,

    S. Chen, C. Ge, Z. Tong, J. Wang, Y . Song, J. Wang, and P. Luo, “Adaptformer: Adapting vision transformers for scalable visual recog- nition,” arXiv preprint arXiv:2205.13535 , 2022

  16. [16]

    Vision transformer adapter for dense predictions,

    Z. Chen, Y . Duan, W. Wang, J. He, T. Lu, J. Dai, and Y . Qiao, “Vision transformer adapter for dense predictions,” in The Eleventh International Conference on Learning Representations , 2023

  17. [17]

    Clip-adapter: Better vision-language models with feature adapters,

    P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y . Zhang, H. Li, and Y . Qiao, “Clip-adapter: Better vision-language models with feature adapters,” arXiv preprint arXiv:2110.04544 , 2021

  18. [18]

    Tip-adapter: Training-free adaption of clip for few-shot classification,

    R. Zhang, W. Zhang, R. Fang, P. Gao, K. Li, J. Dai, Y . Qiao, and H. Li, “Tip-adapter: Training-free adaption of clip for few-shot classification,” in European conference on computer vision . Springer, 2022

  19. [19]

    Black box few-shot adaptation for vision-language models,

    Y . Ouali, A. Bulat, B. Matinez, and G. Tzimiropoulos, “Black box few-shot adaptation for vision-language models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2023

  20. [20]

    Contrastive adapters for foundation model group robustness,

    M. Zhang and C. R ´e, “Contrastive adapters for foundation model group robustness,” Advances in Neural Information Processing Sys- tems, vol. 35, pp. 21 682–21 697, 2022

  21. [21]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems , vol. 30, 2017

  22. [22]

    Learning to prompt for vision-language models,

    K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision (IJCV), 2022

  23. [23]

    Prompt-aligned gradient for prompt tuning,

    B. Zhu, Y . Niu, Y . Han, Y . Wu, and H. Zhang, “Prompt-aligned gradient for prompt tuning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2023, pp. 15 659–15 669

  24. [24]

    Prompt pre-training with twenty-thousand classes for open-vocabulary visual recognition,

    S. Ren, A. Zhang, Y . Zhu, S. Zhang, S. Zheng, M. Li, A. J. Smola, and X. Sun, “Prompt pre-training with twenty-thousand classes for open-vocabulary visual recognition,” Advances in Neural Information Processing Systems, vol. 36, 2023

  25. [25]

    Diversity-aware meta visual prompting,

    Q. Huang, X. Dong, D. Chen, W. Zhang, F. Wang, G. Hua, and N. Yu, “Diversity-aware meta visual prompting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2023, pp. 10 878–10 887

  26. [26]

    Understanding and improving visual prompting: A label-mapping perspective,

    A. Chen, Y . Yao, P.-Y . Chen, Y . Zhang, and S. Liu, “Understanding and improving visual prompting: A label-mapping perspective,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 133–19 143

  27. [27]

    Fine-grained vi- sual prompting,

    L. Yang, Y . Wang, X. Li, X. Wang, and J. Yang, “Fine-grained vi- sual prompting,” Advances in Neural Information Processing Systems , vol. 36, 2023

  28. [28]

    Lst: Ladder side-tuning for parameter and memory efficient transfer learning,

    Y .-L. Sung, J. Cho, and M. Bansal, “Lst: Ladder side-tuning for parameter and memory efficient transfer learning,” Advances in Neural Information Processing Systems , vol. 35, pp. 12 991–13 005, 2022

  29. [29]

    Make your pre-trained model reversible: From parameter to memory efficient fine-tuning,

    B. Liao, S. Tan, and C. Monz, “Make your pre-trained model reversible: From parameter to memory efficient fine-tuning,” arXiv preprint arXiv:2306.00477, 2023

  30. [30]

    Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization,

    J. Kim, J. H. Lee, S. Kim, J. Park, K. M. Yoo, S. J. Kwon, and D. Lee, “Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization,” Advances in Neural Information Processing Systems, vol. 36, 2023

  31. [31]

    Transfer learning without knowing: Reprogramming black-box machine learning models with scarce data and limited resources,

    Y .-Y . Tsai, P.-Y . Chen, and T.-Y . Ho, “Transfer learning without knowing: Reprogramming black-box machine learning models with scarce data and limited resources,” in International Conference on Machine Learning. PMLR, 2020, pp. 9614–9624

  32. [32]

    Black-box tuning for language-model-as-a-service,

    T. Sun, Y . Shao, H. Qian, X. Huang, and X. Qiu, “Black-box tuning for language-model-as-a-service,” in Proceedings of ICML , 2022

  33. [33]

    Bbtv2: Towards a gradient-free future with large language models,

    T. Sun, Z. He, H. Qian, Y . Zhou, X. Huang, and X. Qiu, “Bbtv2: Towards a gradient-free future with large language models,” in Pro- ceedings of EMNLP , 2022

  34. [34]

    Rlprompt: Optimizing discrete text prompts with reinforcement learning,

    M. Deng, J. Wang, C.-P. Hsieh, Y . Wang, H. Guo, T. Shu, M. Song, E. P. Xing, and Z. Hu, “Rlprompt: Optimizing discrete text prompts with reinforcement learning,” arXiv preprint arXiv:2205.12548 , 2022

  35. [35]

    Completely derandomized self- adaptation in evolution strategies,

    N. Hansen and A. Ostermeier, “Completely derandomized self- adaptation in evolution strategies,” Evolutionary computation , vol. 9, no. 2, pp. 159–195, 2001

  36. [36]

    Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (cma-es),

    N. Hansen, S. D. M ¨uller, and P. Koumoutsakos, “Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (cma-es),” Evolutionary computation, vol. 11, no. 1, pp. 1–18, 2003

  37. [37]

    A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications,

    S. Liu, P.-Y . Chen, B. Kailkhura, G. Zhang, A. O. Hero III, and P. K. Varshney, “A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications,” IEEE Signal Processing Magazine , vol. 37, no. 5, pp. 43–54, 2020

  38. [38]

    Analysis and improve- ment of policy gradient estimation,

    T. Zhao, H. Hachiya, G. Niu, and M. Sugiyama, “Analysis and improve- ment of policy gradient estimation,” Advances in Neural Information Processing Systems, vol. 24, 2011

  39. [39]

    An overview of the simultaneous perturbation method for efficient optimization,

    J. C. Spall, “An overview of the simultaneous perturbation method for efficient optimization,” Johns Hopkins apl technical digest , vol. 19, no. 4, pp. 482–492, 1998

  40. [40]

    Adversarial reprogramming of neural networks,

    G. F. Elsayed, I. Goodfellow, and J. Sohl-Dickstein, “Adversarial reprogramming of neural networks,” arXiv preprint arXiv:1806.11146, 2018

  41. [41]

    Explaining and Harnessing Adversarial Examples

    I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572 , 2014

  42. [42]

    The limitations of deep learning in adversarial settings,

    N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, “The limitations of deep learning in adversarial settings,” in IEEE European Symposium on Security and Privacy (EuroS&P) , 2016

  43. [43]

    Adversarial attacks and defenses in images, graphs and text: A review,

    H. Xu, Y . Ma, H.-C. Liu, D. Deb, H. Liu, J.-L. Tang, and A. K. Jain, “Adversarial attacks and defenses in images, graphs and text: A review,” International Journal of Automation and Computing, vol. 17, no. 2, pp. 151–178, 2020

  44. [44]

    Cross-modal adversarial reprogramming,

    P. Neekhara, S. Hussain, J. Du, S. Dubnov, F. Koushanfar, and J. McAuley, “Cross-modal adversarial reprogramming,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 2427–2435

  45. [45]

    Model reprogramming: Resource-efficient cross-domain machine learning,

    P.-Y . Chen, “Model reprogramming: Resource-efficient cross-domain machine learning,” arXiv preprint arXiv:2202.10629 , 2022

  46. [46]

    Reprogramming large pretrained language mod- els for antibody sequence infilling,

    I. Melnyk, V . Chenthamarakshan, P.-Y . Chen, P. Das, A. Dhurandhar, I. Padhi, and D. Das, “Reprogramming large pretrained language mod- els for antibody sequence infilling,” arXiv preprint arXiv:2210.07144 , 2022

  47. [47]

    Emerging properties in self-supervised vision trans- formers,

    M. Caron, H. Touvron, I. Misra, H. J ´egou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision trans- formers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2021, pp. 9650–9660

  48. [48]

    Masked autoencoders are scalable vision learners,

    K. He, X. Chen, S. Xie, Y . Li, P. Doll ´ar, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2022

  49. [49]

    Bench- marking detection transfer learning with vision transformers,

    Y . Li, S. Xie, X. Chen, P. Dollar, K. He, and R. Girshick, “Bench- marking detection transfer learning with vision transformers,” arXiv preprint arXiv:2111.11429, 2021

  50. [50]

    Self-supervised learning is more robust to dataset imbalance,

    H. Liu, J. Z. HaoChen, A. Gaidon, and T. Ma, “Self-supervised learning is more robust to dataset imbalance,” in International Conference on Learning Representations, 2022. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 12

  51. [51]

    A survey of self-supervised and few-shot object detection,

    G. Huang, I. Laradji, D. V ´azquez, S. Lacoste-Julien, and P. Rodriguez, “A survey of self-supervised and few-shot object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence , 2022

  52. [52]

    Unleashing vanilla vision transformer with masked image modeling for object detection,

    Y . Fang, S. Yang, S. Wang, Y . Ge, Y . Shan, and X. Wang, “Unleashing vanilla vision transformer with masked image modeling for object detection,” arXiv preprint arXiv:2204.02964 , 2022

  53. [53]

    Towards understanding why mask- reconstruction pretraining helps in downstream tasks,

    J. Pan, P. Zhou, and S. Yan, “Towards understanding why mask- reconstruction pretraining helps in downstream tasks,” arXiv preprint arXiv:2206.03826, 2022

  54. [54]

    Pytorch image models,

    R. Wightman, “Pytorch image models,” https://github.com/rwightman/ pytorch-image-models, 2019

  55. [55]

    What is being transferred in transfer learning?

    B. Neyshabur, H. Sedghi, and C. Zhang, “What is being transferred in transfer learning?” Advances in neural information processing systems, vol. 33, pp. 512–523, 2020

  56. [56]

    A one-measurement form of simultaneous perturbation stochastic approximation,

    J. C. Spall, “A one-measurement form of simultaneous perturbation stochastic approximation,” Automatica, vol. 33, no. 1, 1997

  57. [57]

    Spall, Introduction to Stochastic Search and Optimization , 1st ed

    J. Spall, Introduction to Stochastic Search and Optimization , 1st ed. USA: John Wiley & Sons, Inc., 2003

  58. [58]

    Robust neural network tracking controller using simultaneous perturbation stochastic approx- imation,

    Q. Song, J. C. Spall, Y . C. Soh, and J. Ni, “Robust neural network tracking controller using simultaneous perturbation stochastic approx- imation,” IEEE Transactions on Neural Networks , vol. 19, no. 5, pp. 817–835, 2008

  59. [59]

    Simultaneous per- turbation stochastic approximation for automatic speech recognition,

    D. Stein, J. Schwenninger, and M. Stadtschnitzer, “Simultaneous per- turbation stochastic approximation for automatic speech recognition,” in Proc. Interspeech 2013 , 2013, pp. 622–626

  60. [60]

    Simultaneous perturba- tion stochastic approximation for few-shot learning,

    A. Boiarov, O. Granichin, and O. Granichina, “Simultaneous perturba- tion stochastic approximation for few-shot learning,” in 2020 European Control Conference (ECC), 2020, pp. 350–355

  61. [61]

    Adaptive stochastic approximation by the simultaneous per- turbation method,

    J. Spall, “Adaptive stochastic approximation by the simultaneous per- turbation method,” IEEE Transactions on Automatic Control , vol. 45, no. 10, pp. 1839–1853, 2000

  62. [62]

    Accelerated second-order stochastic optimization using only function measurements,

    J. C. Spall, “Accelerated second-order stochastic optimization using only function measurements,” Proceedings of the 36th IEEE Confer- ence on Decision and Control , vol. 2, pp. 1417–1424 vol.2, 1997

  63. [63]

    A method for solving the convex programming problem with convergence rate o(1/k2),

    Y . Nesterov, “A method for solving the convex programming problem with convergence rate o(1/k2),” Proceedings of the USSR Academy of Sciences, vol. 269, pp. 543–547, 1983

  64. [64]

    On the importance of initialization and momentum in deep learning,

    I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in Proceedings of the 30th International Conference on Machine Learning , ser. Proceedings of Machine Learning Research, vol. 28, no. 3. PMLR, 17–19 Jun 2013

  65. [65]

    Stochastic first-and zeroth-order methods for nonconvex stochastic programming,

    S. Ghadimi and G. Lan, “Stochastic first-and zeroth-order methods for nonconvex stochastic programming,” SIAM journal on optimization , vol. 23, no. 4, pp. 2341–2368, 2013

  66. [66]

    Generative pretraining for black-box optimization,

    S. M. Mashkaria, S. Krishnamoorthy, and A. Grover, “Generative pretraining for black-box optimization,” in International Conference on Machine Learning . PMLR, 2023, pp. 24 173–24 197

  67. [67]

    ZIP: An efficient zeroth-order prompt tuning for black-box vision-language models,

    S. Park, J. Jeong, Y . Kim, J. Lee, and N. Lee, “ZIP: An efficient zeroth-order prompt tuning for black-box vision-language models,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/ forum?id=2OegVbwvY2

  68. [68]

    Wilds: A benchmark of in-the-wild distribution shifts,

    P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsub- ramani, W. Hu, M. Yasunaga, R. L. Phillips, I. Gao et al., “Wilds: A benchmark of in-the-wild distribution shifts,” in International confer- ence on machine learning . PMLR, 2021, pp. 5637–5664

  69. [69]

    Zeroth-order stochastic variance reduction for nonconvex optimiza- tion,

    S. Liu, B. Kailkhura, P.-Y . Chen, P. Ting, S. Chang, and L. Amini, “Zeroth-order stochastic variance reduction for nonconvex optimiza- tion,” Advances in Neural Information Processing Systems , vol. 31, 2018

  70. [70]

    Learning de- biased representations with biased representations,

    H. Bahng, S. Chun, S. Yun, J. Choo, and S. J. Oh, “Learning de- biased representations with biased representations,” in International Conference on Machine Learning . PMLR, 2020, pp. 528–539

  71. [71]

    Gradient-based learning applied to document recognition,

    Y . LeCun, L. Bottou, Y . Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE , vol. 86, no. 11, pp. 2278–2324, 1998

  72. [72]

    De- scribing textures in the wild,

    M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “De- scribing textures in the wild,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition , 2014, pp. 3606–3613

  73. [73]

    Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,

    P. Helber, B. Bischke, A. Dengel, and D. Borth, “Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 12, no. 7, pp. 2217–2226, 2019

  74. [74]

    Remote sensing image scene classifi- cation: Benchmark and state of the art,

    G. Cheng, J. Han, and X. Lu, “Remote sensing image scene classifi- cation: Benchmark and state of the art,” Proceedings of the IEEE , vol. 105, no. 10, pp. 1865–1883, 2017

  75. [75]

    Clevr: A diagnostic dataset for compositional language and elementary visual reasoning,

    J. Johnson, B. Hariharan, L. Van Der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick, “Clevr: A diagnostic dataset for compositional language and elementary visual reasoning,” in Pro- ceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2901–2910

  76. [76]

    The intrinsic dimension of images and its impact on learning,

    P. Pope, C. Zhu, A. Abdelkader, M. Goldblum, and T. Goldstein, “The intrinsic dimension of images and its impact on learning,” in International Conference on Learning Representations , 2021

  77. [77]

    Hastie, R

    T. Hastie, R. Tibshirani, J. H. Friedman, and J. H. Friedman, The elements of statistical learning: data mining, inference, and prediction . Springer, 2009, vol. 2

  78. [78]

    What does a platypus look like? generating customized prompts for zero-shot image classifi- cation,

    S. Pratt, I. Covert, R. Liu, and A. Farhadi, “What does a platypus look like? generating customized prompts for zero-shot image classifi- cation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 691–15 701

  79. [79]

    Language models as black-box optimizers for vision-language models,

    S. Liu, S. Yu, Z. Lin, D. Pathak, and D. Ramanan, “Language models as black-box optimizers for vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, 2024, pp. 12 687–12 697

  80. [80]

    Maximum likelihood estimation of intrinsic dimension,

    E. Levina and P. Bickel, “Maximum likelihood estimation of intrinsic dimension,” in Advances in Neural Information Processing Systems , L. Saul, Y . Weiss, and L. Bottou, Eds., vol. 17. MIT Press,

Showing first 80 references.