Robust Adaptation of Foundation Models with Black-Box Visual Prompting
Pith reviewed 2026-05-23 23:21 UTC · model grok-4.3
The pith
BlackVIP adapts pre-trained models to new tasks and domains using only input-dependent visual prompts without accessing model parameters or caching activations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BlackVIP adapts black-box pre-trained models by having a Coordinator generate input-dependent visual prompts, which are optimized through SPSA-GC gradient estimates of their effect on the model output. The method matches or exceeds white-box prompting baselines on 19 datasets while requiring only query access and minimal memory. The generalization of such prompting is connected to the certified robustness of randomized smoothing.
What carries the argument
The Coordinator module that produces input-dependent visual prompts, updated via SPSA-GC gradient estimates on the black-box model outputs.
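To make the query-only update concrete, here is a minimal sketch of a plain two-sided SPSA gradient estimate in NumPy. It is not the paper's SPSA-GC: the gradient-correction term is not specified on this page, so it is left out. The names `spsa_gradient` and `toy_black_box`, the step size, and the perturbation scale `c` are all illustrative assumptions, and the quadratic loss merely stands in for the loss of a frozen model on prompted inputs.

```python
import numpy as np


def spsa_gradient(loss_fn, theta, c=1e-2, rng=None):
    """Two-sided SPSA estimate of the gradient of loss_fn at theta (plain SPSA, no correction)."""
    rng = np.random.default_rng() if rng is None else rng
    # Rademacher perturbation: each coordinate is +1 or -1 with equal probability.
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    # Exactly two queries to the black box, regardless of the dimension of theta.
    loss_plus = loss_fn(theta + c * delta)
    loss_minus = loss_fn(theta - c * delta)
    # Dividing by delta elementwise equals multiplying by it, because delta is +/-1.
    return (loss_plus - loss_minus) / (2.0 * c) * delta


def toy_black_box(theta):
    # Hypothetical stand-in for the loss of a frozen model on prompted inputs.
    return float(np.sum((theta - 1.0) ** 2))


rng = np.random.default_rng(0)
theta = np.zeros(10)  # stand-in for the Coordinator parameters
for _ in range(500):
    # Plain gradient descent driven only by loss queries.
    theta = theta - 0.01 * spsa_gradient(toy_black_box, theta, rng=rng)
final_loss = toy_black_box(theta)
```

Each step costs two forward queries and stores no activations, which is the property the memory claim rests on. The price is a noisy estimate whose variance grows with the parameter dimension.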
If this is right
- Adaptation becomes feasible for proprietary or API-only models without internal access.
- Memory footprint drops because no intermediate activations need to be stored.
- A single trained Coordinator can be reused across multiple downstream tasks on the same model.
- The randomized-smoothing connection supplies a route to certified robustness bounds for prompted models.
- BlackVIP-SE trades some performance for substantially lower per-example runtime.
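For readers unfamiliar with the randomized-smoothing side of that connection, the sketch below computes the standard L2 certified radius for a Gaussian-smoothed classifier, R = (sigma / 2) * (Phi^-1(p_A) - Phi^-1(p_B)). This is the generic bound from the smoothing literature and may not be the exact form used in the paper's analysis. The function name `certified_radius` and the example probabilities are assumptions for illustration.

```python
from statistics import NormalDist


def certified_radius(p_a_lower, p_b_upper, sigma):
    """L2 radius within which a Gaussian-smoothed classifier keeps its top prediction.

    p_a_lower: lower bound on the top-class probability under noise (must be in (0, 1)).
    p_b_upper: upper bound on the runner-up class probability under noise (must be in (0, 1)).
    sigma:     standard deviation of the Gaussian noise added to the input.
    """
    if p_a_lower <= p_b_upper:
        # The top class is not separated from the runner-up, so nothing is certified.
        return 0.0
    inv_cdf = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return 0.5 * sigma * (inv_cdf(p_a_lower) - inv_cdf(p_b_upper))


# Illustrative numbers only, not results from the paper.
radius = certified_radius(p_a_lower=0.9, p_b_upper=0.1, sigma=0.5)
```

The radius grows with the noise level and with the gap between the two class probabilities, which is why tighter probability estimates translate into larger certificates.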
Where Pith is reading between the lines
- The same black-box prompting pattern could be tested on non-vision modalities where only output queries are available.
- If the Coordinator generalizes across model families, one prompt generator might serve multiple unrelated foundation models.
- The smoothing link suggests that increasing the number of prompt queries per example might directly enlarge the certified robustness radius.
Load-bearing premise
The Coordinator can produce visual prompts that meaningfully steer the unknown model, and SPSA-GC can produce sufficiently accurate gradient estimates from output queries alone.
What would settle it
On a held-out domain-shift dataset, run BlackVIP and a memory-intensive white-box baseline; if BlackVIP requires comparable or higher memory or yields lower accuracy than the baseline while using only black-box queries, the claim of robust low-memory adaptation fails.
Original abstract
With a surge of large-scale pre-trained models, parameter-efficient transfer learning (PETL) of large models has garnered significant attention. While promising, they commonly rely on two optimistic assumptions: 1) full access to the parameters of a PTM, and 2) sufficient memory capacity to cache all intermediate activations for gradient computation. However, in most real-world applications, PTMs serve as black-box APIs or proprietary software without full parameter accessibility. Besides, it is hard to meet a large memory requirement for modern PTMs. This work proposes black-box visual prompting (BlackVIP), which efficiently adapts the PTMs without knowledge of their architectures or parameters. BlackVIP has two components: 1) Coordinator and 2) simultaneous perturbation stochastic approximation with gradient correction (SPSA-GC). The Coordinator designs input-dependent visual prompts, which allow the target PTM to adapt in the wild. SPSA-GC efficiently estimates the gradient of PTM to update Coordinator. Besides, we introduce a variant, BlackVIP-SE, which significantly reduces the runtime and computational cost of BlackVIP. Extensive experiments on 19 datasets demonstrate that BlackVIPs enable robust adaptation to diverse domains and tasks with minimal memory requirements. We further provide a theoretical analysis on the generalization of visual prompting methods by presenting their connection to the certified robustness of randomized smoothing, and presenting an empirical support for improved robustness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes BlackVIP for parameter-efficient black-box adaptation of pre-trained models (PTMs) via visual prompting. It introduces a Coordinator module that generates input-dependent visual prompts and SPSA-GC (simultaneous perturbation stochastic approximation with gradient correction) to estimate gradients without access to PTM parameters or activations. A variant BlackVIP-SE is also presented for reduced cost. Experiments on 19 datasets are reported to show robust adaptation across domains and tasks with low memory use. A theoretical connection is drawn between visual prompting generalization and the certified robustness guarantees of randomized smoothing, with empirical support for improved robustness.
Significance. If the central claims hold, the work would be significant for enabling adaptation of large foundation models in black-box API settings where parameter access and memory are constrained, extending PETL methods beyond white-box assumptions. The randomized-smoothing connection, if rigorously established, provides a novel theoretical lens on prompting generalization and robustness that could inform future work.
major comments (2)
- [§3] §3 (SPSA-GC definition): The central claim that the Coordinator learns effective input-dependent prompts depends on SPSA-GC producing usable gradient estimates, yet the manuscript provides no direct verification of estimate quality (e.g., cosine similarity to finite-difference or white-box gradients on a surrogate model). SPSA variance scales with dimension and perturbation size; without quantifying how the correction term mitigates this for the prompt parameterization, the reported adaptation results on 19 datasets cannot be confidently attributed to successful optimization.
- [Experiments] Experiments section (ablation studies): No ablation isolates the contribution of the gradient-correction term in SPSA-GC versus plain SPSA. If the correction does not materially reduce noise for the Coordinator's input-dependent design, the method reduces to standard zeroth-order optimization whose reliability in high-dimensional prompt spaces is known to be limited; this directly affects the robustness-adaptation claim.
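The estimate-quality check requested in the first major comment could look roughly like the sketch below: compare SPSA estimates against an analytic gradient on a surrogate whose gradient is known, and report cosine similarity. The surrogate function, the dimension, the perturbation scale, and the number of averaged estimates are all hypothetical choices. The estimator is plain SPSA, since the gradient-correction term is not specified here.

```python
import numpy as np


def spsa_gradient(loss_fn, theta, c, rng):
    # Plain two-sided SPSA with a Rademacher perturbation (no gradient correction).
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    diff = loss_fn(theta + c * delta) - loss_fn(theta - c * delta)
    return diff / (2.0 * c) * delta


def cosine(u, v):
    # Cosine similarity with a small constant guarding against division by zero.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))


def surrogate_loss(theta):
    # Hypothetical smooth surrogate with a known analytic gradient.
    return float(np.sum(theta ** 2) + np.sum(np.sin(theta)))


def surrogate_grad(theta):
    return 2.0 * theta + np.cos(theta)


rng = np.random.default_rng(0)
dim = 100
theta = rng.normal(size=dim)
true_grad = surrogate_grad(theta)

# One estimate costs two queries; averaging 200 of them costs 400 queries.
single = spsa_gradient(surrogate_loss, theta, 1e-3, rng)
averaged = np.mean(
    [spsa_gradient(surrogate_loss, theta, 1e-3, rng) for _ in range(200)], axis=0
)
cos_single = cosine(single, true_grad)
cos_averaged = cosine(averaged, true_grad)
```

A single estimate is expected to align only weakly with the true gradient in 100 dimensions, while the averaged estimate should align much better. Running the same comparison with and without the correction term would give the ablation requested in the second major comment.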
minor comments (2)
- The abstract states experiments on 19 datasets but the main text should explicitly list them with domain/task breakdown and baseline comparisons in a single table for clarity.
- Notation for the Coordinator's prompt parameterization and the SPSA perturbation schedule should be introduced earlier and used consistently to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the validation of SPSA-GC and the need for targeted ablations. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [§3] §3 (SPSA-GC definition): The central claim that the Coordinator learns effective input-dependent prompts depends on SPSA-GC producing usable gradient estimates, yet the manuscript provides no direct verification of estimate quality (e.g., cosine similarity to finite-difference or white-box gradients on a surrogate model). SPSA variance scales with dimension and perturbation size; without quantifying how the correction term mitigates this for the prompt parameterization, the reported adaptation results on 19 datasets cannot be confidently attributed to successful optimization.
Authors: We acknowledge that the manuscript does not include direct verification of SPSA-GC estimate quality such as cosine similarity to surrogate gradients. The current evidence for effective optimization rests on consistent performance gains across the 19 datasets and the design of the correction term to reduce variance in high-dimensional prompt spaces. To address the concern, we will add an analysis section using a surrogate model to report cosine similarities, variance metrics, and the effect of the correction term in the revised version. revision: yes
- Referee: [Experiments] Experiments section (ablation studies): No ablation isolates the contribution of the gradient-correction term in SPSA-GC versus plain SPSA. If the correction does not materially reduce noise for the Coordinator's input-dependent design, the method reduces to standard zeroth-order optimization whose reliability in high-dimensional prompt spaces is known to be limited; this directly affects the robustness-adaptation claim.
Authors: We agree that an explicit ablation isolating the gradient-correction term versus plain SPSA would better substantiate its contribution. The manuscript currently demonstrates overall method performance but does not include this direct comparison. We will add the requested ablation study in the experiments section of the revision to quantify noise reduction and its impact on adaptation results. revision: yes
Circularity Check
No circularity: new algorithmic components and presented connection
Full rationale
The paper introduces BlackVIP with two explicitly new components—the input-dependent Coordinator and the SPSA-GC estimator—without defining any quantity in terms of itself or renaming a fitted parameter as a prediction. The theoretical analysis consists of presenting a connection between visual prompting and randomized smoothing certification; this is an external linkage offered for generalization insight rather than a derivation that reduces to the paper's own fitted values or self-citations. No load-bearing self-citation chains, uniqueness theorems imported from the same authors, or ansatzes smuggled via prior work appear in the provided claims. The method is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
invented entities (2)
- Coordinator: no independent evidence
- SPSA-GC: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: BlackVIP has two components: 1) Coordinator and 2) simultaneous perturbation stochastic approximation with gradient correction (SPSA-GC). ... theoretical analysis on the generalization of visual prompting methods by presenting their connection to the certified robustness of randomized smoothing
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: We further provide a theoretical analysis on the generalization of visual prompting methods by presenting their connection to the certified robustness of randomized smoothing
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.