Low-Cost Hard-Label Adversarial Attack with Theoretical Foundations
Pith reviewed 2026-05-25 07:09 UTC · model grok-4.3
The pith
Existing sign-flipping hard-label attacks approximate the true gradient sign, which directly yields a zero-query initialization and Pattern-Driven Optimization that lower query counts while raising success rates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We establish a unified theoretical framework showing that existing sign-flipping hard-label attacks can be understood as approximating the true gradient sign. Guided by this, we propose a novel attack framework featuring a zero-query initialization strategy and a Pattern-Driven Optimization (PDO) algorithm. We provide theoretical guarantees that our initialization yields higher cosine similarity to the true gradient sign than random baselines, and our PDO module achieves significantly lower query complexity than baseline search methods.
What carries the argument
The unified theoretical framework that recasts sign-flipping hard-label attacks as approximations to the true gradient sign, which then supplies both the zero-query initialization and the Pattern-Driven Optimization (PDO) routine.
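To make the gradient-sign reading concrete, here is a minimal sketch on a toy linear classifier. It is not the paper's PDO routine or its initialization. It is a generic hierarchical sign-flipping search, and every name in it (`HardLabelOracle`, `boundary_radius`, `sign_flip_search`) is hypothetical. The linear model, the all-ones starting direction, and the L-infinity perturbation `r * d` are assumptions chosen so that the true gradient sign is known in closed form.

```python
import numpy as np

class HardLabelOracle:
    """Toy linear classifier exposing only its top-1 label; counts queries."""
    def __init__(self, w, b):
        self.w, self.b, self.queries = np.asarray(w, float), float(b), 0

    def label(self, x):
        self.queries += 1
        return int(self.w @ x + self.b > 0)

def boundary_radius(oracle, x0, y0, d, hi, steps=40):
    """Bisect for a small r in (0, hi] with label(x0 + r*d) != y0.
    Assumes the label already differs from y0 at radius hi."""
    lo = 0.0
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if oracle.label(x0 + mid * d) != y0:
            hi = mid
        else:
            lo = mid
    return hi

def sign_flip_search(oracle, x0, r_max=1e3):
    """Flip blocks of signs, halving the block size each pass; keep a flip
    only if the candidate is still misclassified at the current radius."""
    n = x0.size
    y0 = oracle.label(x0)
    d = np.ones(n)                      # assumed starting direction
    if oracle.label(x0 + r_max * d) == y0:
        d = -d                          # for a linear model the opposite ray crosses
    r = boundary_radius(oracle, x0, y0, d, r_max)
    block = n
    while block >= 1:
        for start in range(0, n, block):
            cand = d.copy()
            cand[start:start + block] *= -1.0
            if oracle.label(x0 + r * cand) != y0:
                d, r = cand, boundary_radius(oracle, x0, y0, cand, r)
        block //= 2
    return d, r

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy setup (assumption): score = w.x + b, clean input x0 is labelled 1.
w = np.linspace(-1.05, 2.0, 64)
oracle = HardLabelOracle(w, b=1.0)
x0 = np.zeros(64)
d, r = sign_flip_search(oracle, x0)

# With loss = -score for class 1, the gradient is -w, so its sign is -sign(w).
true_sign = -np.sign(w)
print("cosine at start:", cosine(np.ones(64), true_sign))
print("cosine at end:  ", cosine(d, true_sign))
print("queries used:   ", oracle.queries)
```

In this toy setting, each accepted flip shrinks the boundary radius, which for a linear model means the direction has moved closer to the gradient sign. Real networks are not linear, so the sketch illustrates the framing only and says nothing about the paper's guarantees.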
If this is right
- The zero-query initialization achieves measurably higher cosine similarity to the true gradient sign than random baselines.
- Pattern-Driven Optimization reduces query complexity compared with baseline search methods.
- The full attack achieves higher success rates than prior hard-label methods under low query budgets on CIFAR-10, ImageNet, and ObjectNet.
- The attack maintains high success on adversarially trained models, commercial APIs, CLIP, ImageNet-C, PathMNIST, and segmentation tasks.
- The attack evades the stateful defense Blacklight at a 0% detection rate.
Where Pith is reading between the lines
- Defense designers may need to monitor or randomize initial perturbations rather than only the optimization trajectory.
- The same gradient-sign view could be applied to other label-only or score-only threat models beyond images.
- Lower per-example query counts make rate-limited production APIs more vulnerable than previously measured.
- The framework supplies a concrete way to compare future hard-label methods by their cosine similarity to the true sign rather than by success rate alone.
Load-bearing premise
That modeling existing attacks as gradient-sign approximations produces initialization and optimization choices that improve performance on real neural networks.
What would settle it
An experiment in which the proposed initialization fails to produce higher cosine similarity to the true gradient sign than random initialization, or in which PDO fails to reduce query complexity relative to standard search baselines on the same models.
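A minimal sketch of the cosine-similarity half of such a check follows. It assumes white-box access to a reference gradient for evaluation only, models the random baseline as the sign pattern of a Gaussian direction, and uses a synthetic stand-in for the gradient. The function names and the three-sigma threshold are illustrative choices, not the paper's protocol.

```python
import numpy as np

def sign_cosine(a, b):
    """Cosine similarity between the sign patterns of two vectors."""
    sa, sb = np.sign(a), np.sign(b)
    return float(sa @ sb / (np.linalg.norm(sa) * np.linalg.norm(sb)))

def random_baseline(true_grad, trials=200, seed=0):
    """Mean and standard deviation of sign_cosine for random directions."""
    rng = np.random.default_rng(seed)
    sims = [sign_cosine(rng.standard_normal(true_grad.size), true_grad)
            for _ in range(trials)]
    return float(np.mean(sims)), float(np.std(sims))

def beats_random(init, true_grad, sigmas=3.0):
    """True if init is closer to the gradient sign than the baseline by `sigmas` std."""
    mean, std = random_baseline(true_grad)
    return sign_cosine(init, true_grad) > mean + sigmas * std

# Synthetic stand-in (assumption): a CIFAR-10-sized gradient, 32*32*3 = 3072 entries.
rng = np.random.default_rng(1)
true_grad = rng.standard_normal(3072)

# Hypothetical initialization that agrees with the gradient sign on about 60% of entries.
init = np.sign(true_grad)
wrong = rng.choice(init.size, size=int(0.4 * init.size), replace=False)
init[wrong] *= -1.0

baseline_mean, baseline_std = random_baseline(true_grad)
print("init cosine:", sign_cosine(init, true_grad))
print("random baseline mean/std:", baseline_mean, baseline_std)
print("beats random:", beats_random(init, true_grad))
```

For two sign vectors that agree on a fraction p of coordinates the cosine is 2p - 1, so 60% agreement gives roughly 0.2, while random directions concentrate around 0. The claim would be in trouble if the proposed initialization landed inside the baseline band on real models.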
Original abstract
Hard-label black-box attacks, relying solely on top-1 predictions, represent one of the most challenging yet practically relevant threat models. Despite recent progress, existing approaches face two key limitations: (1) they overlook the critical role of initialization, focusing primarily on optimization strategies; and (2) they rely heavily on empirical heuristics without theoretical guarantees. To bridge this gap, we establish a unified theoretical framework showing that existing sign-flipping hard-label attacks can be understood as approximating the true gradient sign. Guided by this principled analysis, we propose a novel attack framework featuring a zero-query initialization strategy and a Pattern-Driven Optimization (PDO) algorithm. We provide theoretical guarantees that our initialization yields higher cosine similarity to the true gradient sign than random baselines, and our PDO module achieves significantly lower query complexity than baseline search methods. Extensive experiments across CIFAR-10, ImageNet, and ObjectNet (covering standard and adversarially trained models, commercial APIs, and CLIP models) demonstrate that our method consistently outperforms SOTA hard-label attacks in both success rate and efficiency, particularly under low query budgets. Furthermore, our method demonstrates robust generalization across corrupted data (ImageNet-C), biomedical images (PathMNIST), and dense prediction tasks such as segmentation. Notably, it bypasses the stateful defense Blacklight, achieving a 0% detection rate.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to establish a unified theoretical framework interpreting existing sign-flipping hard-label black-box attacks as approximations to the true gradient sign. Guided by this analysis, it proposes a zero-query initialization strategy and a Pattern-Driven Optimization (PDO) algorithm, asserting theoretical guarantees that the initialization yields higher cosine similarity to the true gradient sign than random baselines and that PDO achieves significantly lower query complexity than baseline search methods. Extensive experiments on CIFAR-10, ImageNet, ObjectNet, commercial APIs, CLIP models, corrupted data, biomedical images, and segmentation tasks demonstrate consistent outperformance over SOTA hard-label attacks in success rate and efficiency, including bypassing the Blacklight defense.
Significance. If the unified framework provides rigorous, non-heuristic guarantees that transfer to real networks and the empirical gains are driven by the theory rather than implementation details, this could advance principled design of low-query hard-label attacks and improve understanding of existing methods in black-box adversarial robustness evaluation.
major comments (2)
- [Abstract] Abstract: the central claim that the unified theoretical framework recasts sign-flipping attacks as gradient-sign approximations and directly yields the stated cosine-similarity and query-complexity guarantees requires explicit derivation and error analysis; without these, it is unclear whether the modeling assumptions hold beyond restricted regimes such as locally linear boundaries.
- [Abstract] Abstract: the guarantee that the zero-query initialization achieves strictly higher cosine similarity than random baselines is load-bearing for the novelty claim; if this follows only from the modeling choice rather than a parameter-free derivation, the practical improvements on real networks may not be theoretically grounded.
minor comments (1)
- [Experiments] The experimental section should report variance, statistical significance, and exact query-budget protocols to support the 'consistent outperformance' claim across all listed datasets and models.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the theoretical claims. We address the two major comments point-by-point below and will revise the manuscript to improve clarity and rigor.
- Referee: [Abstract] Abstract: the central claim that the unified theoretical framework recasts sign-flipping attacks as gradient-sign approximations and directly yields the stated cosine-similarity and query-complexity guarantees requires explicit derivation and error analysis; without these, it is unclear whether the modeling assumptions hold beyond restricted regimes such as locally linear boundaries.
  Authors: We agree the abstract is too condensed. Section 3 of the manuscript already derives the unified framework by modeling the hard-label query as a sign approximation to the gradient of the underlying loss, with sign-flipping attacks recovered as special cases. To directly address the request, we will add an appendix containing the full step-by-step derivation, explicit error bounds under the locally linear boundary assumption, and a discussion of the regimes (including non-linear boundaries) where the approximation remains useful. This will make the modeling assumptions and their scope transparent.
  revision: yes
- Referee: [Abstract] Abstract: the guarantee that the zero-query initialization achieves strictly higher cosine similarity than random baselines is load-bearing for the novelty claim; if this follows only from the modeling choice rather than a parameter-free derivation, the practical improvements on real networks may not be theoretically grounded.
  Authors: The cosine-similarity guarantee is obtained directly from the framework via a parameter-free comparison of expected inner products: our initialization is constructed from the sign pattern implied by the theoretical model, while the random baseline is drawn uniformly from the sphere; the strict inequality follows from the geometry of the model without additional tunable parameters. We will revise the abstract and add a short clarifying paragraph in Section 3 to state this derivation explicitly and emphasize its parameter-free character. The extensive experiments across real networks, APIs, and non-image domains already provide empirical support that the theoretical advantage translates to practice.
  revision: yes
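The "comparison of expected inner products" mentioned in the rebuttal can be illustrated with a standard calculation. The block below is a generic sketch, not the paper's derivation; the symbols $g$, $s$, $u$, $v$, and $p$ are introduced here for illustration, and the assumption that $g$ has no zero entries is ours.

```latex
Let $g \in \mathbb{R}^d$ be the true gradient, with no zero entries, and let
$s = \operatorname{sign}(g)/\sqrt{d}$, so that $\|s\|_2 = 1$.

Random baseline: for $u$ uniform on the unit sphere $S^{d-1}$, symmetry
($u$ and $-u$ have the same law) and $\mathbb{E}[uu^\top] = I/d$ give
\[
  \mathbb{E}\,\langle u, s\rangle = 0,
  \qquad
  \mathbb{E}\,\langle u, s\rangle^{2} = s^\top \frac{I}{d}\, s = \frac{1}{d}.
\]

Sign-pattern initialization: for $v \in \{\pm 1\}^d/\sqrt{d}$ that agrees with
$\operatorname{sign}(g)$ on a fraction $p$ of the coordinates,
\[
  \langle v, s\rangle = \frac{pd - (1-p)d}{d} = 2p - 1 .
\]

Hence $\langle v, s\rangle > \mathbb{E}\,\langle u, s\rangle$ whenever $p > 1/2$,
and the gap exceeds the baseline's typical magnitude $1/\sqrt{d}$ once
$2p - 1 \gg 1/\sqrt{d}$.
```

Whether the paper's initialization actually attains $p > 1/2$ on real networks is exactly what the referee asks the authors to establish.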
Circularity Check
No circularity: framework and guarantees presented as independent theoretical analysis
The paper constructs a unified theoretical framework interpreting sign-flipping attacks as gradient-sign approximations, then states separate theoretical guarantees for zero-query initialization (higher cosine similarity) and PDO (lower query complexity). No quoted equations, definitions, or self-citations in the abstract or description reduce these guarantees to fitted inputs, self-definitional loops, or load-bearing prior self-work by construction. The derivation chain remains self-contained, with empirical validation on external datasets serving as an independent check rather than an internal fit.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Existing sign-flipping hard-label attacks can be understood as approximating the true gradient sign.
invented entities (1)
- Pattern-Driven Optimization (PDO) algorithm: no independent evidence