On the Extreme Variance of Certified Local Robustness Across Model Seeds
Pith reviewed 2026-05-16 12:49 UTC · model grok-4.3
The pith
Models differing only by random seed show extreme variance in certified robustness that exceeds typical reported gains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models that differ only in random seeds during training exhibit extreme variance in their certified robustness, with a standard deviation that is statistically larger than the marginal robustness improvements reported in recent machine learning papers. In addition, certified robustness generalization to unseen data varies significantly across datasets, falling short of the dependability expectations for safety-critical tasks.
What carries the argument
The standard deviation of certified robustness across models trained with different random seeds, compared against improvements reported in the literature.
If this is right
- Machine learning results in certified robustness are likely unconvincing due to extreme variance in certified robustness.
- A lucky model seed in a test set cannot be guaranteed to maintain its higher certified robustness under a different test set.
- Researchers should increase the reporting of confidence intervals for certified robustness.
- Verifiers of neural networks should be more comprehensive by using large-scale, diverse, and unseen data.
Where Pith is reading between the lines
- This implies that robustness benchmarks should require multiple random seeds to establish reliable performance.
- Verification methods may need adjustments to account for seed variability in addition to model parameters.
- Future work could explore whether averaging over seeds or using seed-robust training can reduce this variance.
Load-bearing premise
The observed variance in certified robustness is driven primarily by the choice of random seed rather than by the verification method or other training details.
What would settle it
A study showing that the standard deviation of certified robustness across seeds is smaller than or equal to the average improvements claimed in recent papers would falsify the main claim.
Figures
read the original abstract
Robustness verification of neural networks, referring to formally proving that neural networks satisfy robustness properties, is of crucial importance in safety-critical applications, where model failures can result in loss of human life or million-dollar damages. However, the dependability of verification results may be questioned due to sources of randomness in machine learning, and although this has been widely investigated for accuracy, its impact on robustness verification remains unknown. In this paper, we demonstrate a concerning result: Models that differ only in random seeds during training exhibit extreme variance in their certified robustness, with a standard deviation that is statistically larger than the marginal robustness improvements reported in recent machine learning papers. In addition, we also show that certified robustness generalization to unseen data varies significantly across datasets, falling short of the dependability expectations for safety-critical tasks. Our findings are major concerns because: (i) machine learning results in certified robustness are likely unconvincing due to extreme variance in certified robustness, and (ii) a ``lucky'' model seed in a test set cannot be guaranteed to maintain its higher certified robustness under a different test set. In light of these results, we urge researchers to increase the reporting of confidence intervals for certified robustness, and we urge those verifying neural networks to be more comprehensive in verification by using large-scale, diverse, and unseen data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that neural networks trained with different random seeds exhibit extreme variance in their certified local robustness, with the standard deviation across seeds being statistically larger than the marginal robustness improvements reported in recent machine learning papers. It further demonstrates that certified robustness generalization to unseen data varies significantly across datasets, falling short of expectations for safety-critical tasks, and recommends reporting confidence intervals and using large-scale diverse verification data.
Significance. If substantiated with proper controls, this result would be significant for the field of certified robustness in machine learning. It would highlight that many reported improvements in certified robustness may be within the variance induced by random seeds, urging the community to adopt more rigorous statistical reporting practices such as confidence intervals. The empirical focus on seed variance as a source of unreliability in verification results addresses an important gap, provided the comparisons are properly controlled.
major comments (2)
- The abstract asserts a demonstration and statistical comparison of seed variance to marginal improvements but provides no details on methods, number of models trained, dataset sizes, verification tools used, or how error bars were computed; this absence prevents evaluation of whether the central claim is supported.
- The claim that seed-induced standard deviation is larger than marginal improvements from recent papers is undermined by the lack of matched controls; the baselines may differ in architecture, dataset, perturbation radius, or verifier, making the cross-paper statistical comparison potentially biased and not isolating the effect of seeds.
minor comments (2)
- The manuscript should include specific details on the number of seeds used, the exact certified robustness metric (e.g., percentage of certified points), and the statistical test used to claim 'statistically larger'.
- Ensure all figures showing variance include error bars or confidence intervals as recommended in the conclusion.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for greater experimental transparency and careful controls in our comparisons. We have revised the manuscript to address these points directly while preserving the core empirical findings on seed-induced variance in certified robustness.
read point-by-point responses
-
Referee: The abstract asserts a demonstration and statistical comparison of seed variance to marginal improvements but provides no details on methods, number of models trained, dataset sizes, verification tools used, or how error bars were computed; this absence prevents evaluation of whether the central claim is supported.
Authors: We agree that the original abstract was insufficiently detailed for evaluating the central claims. In the revised manuscript we have expanded the abstract to specify that we trained 10 models per configuration across multiple datasets (MNIST, CIFAR-10), used the CROWN verifier for certification, and computed standard deviations with bootstrap-derived confidence intervals. A new dedicated experimental-setup subsection now provides the full protocol, including seed ranges, perturbation radii, and statistical procedures, allowing readers to assess the support for the reported variance magnitudes. revision: yes
-
Referee: The claim that seed-induced standard deviation is larger than marginal improvements from recent papers is undermined by the lack of matched controls; the baselines may differ in architecture, dataset, perturbation radius, or verifier, making the cross-paper statistical comparison potentially biased and not isolating the effect of seeds.
Authors: We acknowledge that cross-paper comparisons carry risks of confounding factors. The revised manuscript now restricts the literature comparison to papers using comparable settings (CIFAR-10, similar epsilon values, and standard convolutional architectures) and includes an explicit table noting architectural and verifier differences. The primary evidence for extreme seed variance, however, derives from our own controlled experiments that hold architecture, dataset, radius, and verifier fixed; the literature comparison is presented only as contextual magnitude, with added caveats on its limitations. revision: partial
Circularity Check
No significant circularity in empirical variance measurements
full rationale
The paper reports direct empirical measurements of certified robustness variance across random seeds on trained models, with no mathematical derivations, predictions, or first-principles results that reduce to fitted parameters or self-referential definitions. Claims rest on observed standard deviations from experiments rather than any self-definitional loop, fitted-input renaming, or load-bearing self-citation chain. External comparisons to marginal improvements in other papers are not self-citations and do not create circularity by construction. The analysis is self-contained as standard empirical reporting.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Certified robustness metrics are comparable across models that differ only by random seed
Reference graph
Works this paper leans on
-
[1]
Provably robust deep learning via adversarially trained smoothed classifiers,
H. Salman, J. Li, I. Razenshteyn, P. Zhang, H. Zhang, S. Bubeck, and G. Yang, “Provably robust deep learning via adversarially trained smoothed classifiers,”Advances in neural information processing sys- tems, vol. 32, 2019
work page 2019
-
[2]
Towards stable and efficient training of verifiably robust neural networks
H. Zhang, H. Chen, C. Xiao, S. Gowal, R. Stanforth, B. Li, D. Boning, and C.-J. Hsieh, “Towards stable and efficient training of verifiably robust neural networks,”arXiv preprint arXiv:1906.06316, 2019
-
[3]
Adversarial training and provable de- fenses: Bridging the gap,
M. Balunovic and M. Vechev, “Adversarial training and provable de- fenses: Bridging the gap,” inInternational Conference on Learning Representations, 2020
work page 2020
-
[4]
Towards certifying l-infinity robustness using neural networks with l-inf-dist neurons,
B. Zhang, T. Cai, Z. Lu, D. He, and L. Wang, “Towards certifying l-infinity robustness using neural networks with l-inf-dist neurons,” in International Conference on Machine Learning. PMLR, 2021, pp. 12 368–12 379
work page 2021
-
[5]
Boosting the certified robustness of l-infinity distance nets,
B. Zhang, D. Jiang, D. He, and L. Wang, “Boosting the certified robustness of l-infinity distance nets,”arXiv preprint arXiv:2110.06850, 2021
-
[6]
Double bubble, toil and trouble: enhancing certified robustness through transi- tivity,
A. Cullen, P. Montague, S. Liu, S. Erfani, and B. Rubinstein, “Double bubble, toil and trouble: enhancing certified robustness through transi- tivity,”Advances in Neural Information Processing Systems, vol. 35, pp. 19 099–19 112, 2022
work page 2022
-
[7]
M. N. Mueller, F. Eckert, M. Fischer, and M. Vechev, “Certified training: Small boxes are all you need,”arXiv preprint arXiv:2210.04871, 2022
-
[8]
A recipe for improved certifiable robustness: Capacity and data
K. Hu, K. Leino, Z. Wang, and M. Fredrikson, “A recipe for improved certifiable robustness,”arXiv preprint arXiv:2310.02513, 2023
-
[9]
On the scalability of certified adversarial robustness with generated data,
T. Altstidl, D. Dobre, A. Kosmala, B. Eskofier, G. Gidel, and L. Schwinn, “On the scalability of certified adversarial robustness with generated data,”Advances in Neural Information Processing Systems, vol. 37, pp. 102 255–102 278, 2024
work page 2024
-
[10]
Enhancing certi- fied robustness via block reflector orthogonal layers and logit annealing loss,
B.-H. Lai, P.-H. Huang, B.-H. Kung, and S.-T. Chen, “Enhancing certi- fied robustness via block reflector orthogonal layers and logit annealing loss,”arXiv preprint arXiv:2505.15174, 2025
-
[11]
Lipnext: Scaling up lipschitz- based certified robustness to billion-parameter models,
K. Hu, H. Hu, and M. Fredrikson, “Lipnext: Scaling up lipschitz- based certified robustness to billion-parameter models,”arXiv preprint arXiv:2601.18513, 2026
-
[12]
Reluplex: An efficient smt solver for verifying deep neural networks,
G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer, “Reluplex: An efficient smt solver for verifying deep neural networks,” inInternational conference on computer aided verification. Springer, 2017, pp. 97–117
work page 2017
-
[13]
Autopilot and full self-driving capability
Tesla, “Autopilot and full self-driving capability.” [Online]. Available: https://www.tesla.com/en gb/support/autopilot
-
[14]
Airborne collision avoidance system,
E. Williams, “Airborne collision avoidance system,” inProceedings of the 9th Australian workshop on Safety critical systems and software- Volume 47, 2004, pp. 97–110
work page 2004
-
[15]
R. B. Vilim and T. C. Esselman, “Advanced physics-based fluid system performance monitoring to support nuclear power plant operations. final crada report.” Argonne National Laboratory, Tech. Rep., 2020
work page 2020
-
[16]
Artificial intelligence-enabled medical devices
FDA, “Artificial intelligence-enabled medical devices.” [On- line]. Available: https://www.fda.gov/medical-devices/software-medical- device-samd/artificial-intelligence-enabled-medical-devices
-
[17]
G. HealthCare, “Ge healthcare drives growth with investment in ai-enabled medical devices and tops fda’s list of ai authorizations for 4th year with 100,” 7 2025. [Online]. Available: https://www.gehealthcare.com/middle-east/about/newsroom/press- releases/ge-healthcare-drives-growth-with-investment-in-ai-enabled- medical-devices-and-tops-fdas-list-of-ai-a...
work page 2025
-
[18]
Fda clearance for biograph one positron emission tomography/magnetic resonance imaging scanner,
S. Healthineers, “Fda clearance for biograph one positron emission tomography/magnetic resonance imaging scanner,” 1 2026. [Online]. Available: https://www.siemens-healthineers.com/en-us/press- room/press-releases/biograph-one-fda-clearance
work page 2026
-
[19]
Philips deviceguide gets fda clearance,
Philips, “Philips deviceguide gets fda clearance,” 3 2026. [Online]. Available: https://www.philips.com/a- w/about/news/archive/standard/news/press/2026/fda-clears-philips-ai- solution-that-provides-real-time-guidance-during-complex-minimally- invasive-heart-valve-repair.html
work page 2026
-
[20]
Verifying robustness of neural networks in vision-based end-to-end autonomous driving,
C. Bernardeschi, G. Lami, F. Merola, and F. Rossi, “Verifying robustness of neural networks in vision-based end-to-end autonomous driving,” IEEE Access, 2025
work page 2025
-
[21]
Parallel verification of neural networks applied to medical imaging
J. Andreasen, D. M. Lopez, T. T. Johnson, E. Begoli, and Y . K. Dodia, “Parallel verification of neural networks applied to medical imaging.”
-
[22]
Formal verification of a neural network based prognostics system for aircraft equipment,
D. Kirov, S. F. Rollini, L. Di Guglielmo, and D. Cofer, “Formal verification of a neural network based prognostics system for aircraft equipment,” inInternational Conference on Bridging the Gap between AI and Reality. Springer, 2023, pp. 225–240
work page 2023
-
[23]
Verification of neural network behaviour: Formal guarantees for power system applications,
A. Venzke and S. Chatzivasileiadis, “Verification of neural network behaviour: Formal guarantees for power system applications,”IEEE Transactions on Smart Grid, vol. 12, no. 1, pp. 383–397, 2020
work page 2020
-
[24]
B. Eryılmaz, O. A. Koras ¸, J. Schl ¨otterer, and C. Seifert, “Investigating the impact of randomness on reproducibility in computer vision: A study on applications in civil engineering and medicine,” in2024 IEEE 6th International Conference on Cognitive Machine Intelligence (CogMI). IEEE, 2024, pp. 265–274
work page 2024
-
[25]
On the variance of neural network training with respect to test sets and distributions,
K. Jordan, “On the variance of neural network training with respect to test sets and distributions,”arXiv preprint arXiv:2304.01910, 2023
-
[26]
doi:10.48550/arXiv.2109.08203 , pubstate =
D. Picard, “Torch. manual seed (3407) is all you need: On the influence of random seeds in deep learning architectures for computer vision,” arXiv preprint arXiv:2109.08203, 2021
-
[27]
Assessing the macro and micro effects of random seeds on fine-tuning large language models,
N. T. Bui, G. K. Savova, and L. Wang, “Assessing the macro and micro effects of random seeds on fine-tuning large language models,” inProceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, 2025, pp. 41–46
work page 2025
-
[28]
Ml-ready dataset for identification of frost in martian hirise images - jpl open repository,
W. et al., “Ml-ready dataset for identification of frost in martian hirise images - jpl open repository,” 6 2023
work page 2023
-
[29]
L. Schader, W. Song, R. Kempker, and D. Benkeser, “Don’t let your analysis go to seed: on the impact of random seed on machine learning- based causal inference,”Epidemiology, vol. 35, no. 6, pp. 764–778, 2024
work page 2024
-
[30]
Exploiting verified neural networks via floating point numerical error,
K. Jia and M. Rinard, “Exploiting verified neural networks via floating point numerical error,” inInternational Static Analysis Symposium. Springer, 2021, pp. 191–205
work page 2021
-
[31]
Floating-point neural network verification at the software level,
E. Manino, B. Farias, R. S. Menezes, F. Shmarov, and L. C. Cordeiro, “Floating-point neural network verification at the software level,”arXiv preprint arXiv:2510.23389, 2025
-
[32]
K. Kaulenet al., “The 6th international verification of neural networks competition (vnn-comp 2025): Summary and results,”arXiv preprint arXiv:2512.19007, 2025
-
[33]
Fast and effective robustness certification,
G. Singh, T. Gehr, M. Mirman, M. P”uschel, and M. Vechev, “Fast and effective robustness certification,” inAdvances in neural information processing systems, vol. 31, 2018
work page 2018
-
[34]
The impact of noise and brightness on object detection methods,
J. A. Rodriguez-Rodriguez, E. Lopez-Rubio, J. A. Angel-Ruiz, and M. A. Molina-Cabello, “The impact of noise and brightness on object detection methods,”Sensors, vol. 24, no. 3, p. 821, 2024
work page 2024
-
[35]
Generalisation in humans and deep neural networks,
R. Geirhos, C. R. Temme, J. Rauber, H. H. Sch ¨utt, M. Bethge, and F. A. Wichmann, “Generalisation in humans and deep neural networks,” Advances in neural information processing systems, vol. 31, 2018
work page 2018
-
[36]
“everyone wants to do the model work, not the data work
N. Sambasivan, S. Kapania, H. Highfill, D. Akrong, P. Paritosh, and L. M. Aroyo, ““everyone wants to do the model work, not the data work”: Data cascades in high-stakes ai,” inproceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 2021, pp. 1–15
work page 2021
-
[37]
Cleanml: A study for evaluating the impact of data cleaning on ml classification tasks,
P. Li, X. Rao, J. Blase, Y . Zhang, X. Chu, and C. Zhang, “Cleanml: A study for evaluating the impact of data cleaning on ml classification tasks,” in2021 IEEE 37th international conference on data engineering (ICDE). IEEE, 2021, pp. 13–24
work page 2021
-
[38]
A branch and bound framework for stronger adversarial attacks of relu networks,
H. Zhang, S. Wang, K. b. Xu, L. Li, L. Bo, S. Jana, C.-J. Hsieh, and C.-J. Hsieh, “A branch and bound framework for stronger adversarial attacks of relu networks,” inInternational Conference on Machine Learning. PMLR, 2022, pp. 26 591–26 604
work page 2022
-
[39]
The marabou framework for verification and analysis of deep neural networks,
G. Katz, D. A. Huang, D. Ibeling, K. Julian, C. Lazarus, R. Lim, P. Shah, S. Thakoor, H. Wu, A. Zelji ´cet al., “The marabou framework for verification and analysis of deep neural networks,” inInternational conference on computer aided verification. Springer, 2019, pp. 443– 452
work page 2019
-
[40]
Formal security analysis of neural networks using symbolic intervals,
S. Wang, K. Pei, J. Whitehouse, J. Yang, and S. Jana, “Formal security analysis of neural networks using symbolic intervals,” in27th USENIX Security Symposium (USENIX Security 18), 2018, pp. 1599–1614
work page 2018
-
[41]
Efficient neural network verification via adaptive refinement and adversarial search,
P. Henriksen and A. Lomuscio, “Efficient neural network verification via adaptive refinement and adversarial search,” inECAI 2020: 24th European Conference on Artificial Intelligence, 29 August–8 September 2020, Santiago de Compostela, Spain–Including 10th Conference on Prestigious Applications of Artificial Intelligence (PAIS 2020). SAGE Publications 1 O...
work page 2020
-
[42]
Improved geometric path enumeration for verifying relu neural networks,
S. Bak, H.-D. Tran, K. Hobbs, and T. T. Johnson, “Improved geometric path enumeration for verifying relu neural networks,” inInternational conference on computer aided verification. Springer, 2020, pp. 66–96
work page 2020
-
[43]
Neuralsat: A high- performance verification tool for deep neural networks,
H. Duong, T. Nguyen, and M. B. Dwyer, “Neuralsat: A high- performance verification tool for deep neural networks,” inInternational Conference on Computer Aided Verification. Springer, 2025, pp. 409– 423
work page 2025
-
[44]
D. Zhou, J. Chavez, H. Chen, G. A. Hanasusanto, and H. Zhang, “Clip- and-verify: Linear constraint-driven domain clipping for accelerating neural network verification,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
work page 2025
-
[45]
K. Xu, H. Zhang, S. Wang, Y . Wang, S. Jana, X. Lin, and C.-J. Hsieh, “Fast and complete: Enabling complete neural network verification with rapid and massively parallel incomplete verifiers,” inInternational Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=nVZtXBI6LNn
work page 2021
-
[46]
S. Wang, H. Zhang, K. Xu, X. Lin, S. Jana, C.-J. Hsieh, and J. Z. Kolter, “Beta-crown: Efficient bound propagation with per-neuron split constraints for complete and incomplete neural network verification,” Advances in Neural Information Processing Systems, vol. 34, 2021
work page 2021
-
[47]
Mnist handwritten digit database,
Y . LeCun, C. Cortes, and C. Burges, “Mnist handwritten digit database,” AT&T Labs, Tech. Rep., 2010, available at http://yann.lecun.com/exdb/mnist
work page 2010
-
[48]
Critically assessing the state of the art in neural network verification,
M. Koeniget al., “Critically assessing the state of the art in neural network verification,”Journal of Machine Learning Research, vol. 25, no. 12, pp. 1–53, 2024
work page 2024
-
[49]
Suds challenge datasheet for ds v5.pdf - jpl open repository,
W. et al., “Suds challenge datasheet for ds v5.pdf - jpl open repository,” 6 2023
work page 2023
-
[50]
S. Diniega, G. Doran, S. Lu, M. Wronkiewicz, J. M. Widmer, U. Reb- bapragada, and R. Agrawal, “Holistic mapping of the present-day martian seasonal co2 frost. i. frost detection within global visible, thermal, and spectral data sets,”The Planetary Science Journal, vol. 6, no. 9, p. 209, 2025
work page 2025
-
[51]
NVIDIA, “Nvidia tesla v100.” [Online]. Available: https://www.nvidia.com/en-gb/data-center/tesla-v100/
- [52]
-
[53]
Deep reinforcement learning that matters,
P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” inProceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018
work page 2018
-
[54]
Accounting for variance in machine learning benchmarks,
X. Bouthillier, P. Delaunay, M. Bronzi, A. Trofimov, B. Nichyporuk, J. Szeto, N. Mohammadi Sepahvand, E. Raff, K. Madan, V . V oletiet al., “Accounting for variance in machine learning benchmarks,”Proceedings of machine learning and systems, vol. 3, pp. 747–769, 2021
work page 2021
-
[55]
Underspecification presents challenges for credibility in modern machine learning,
D. et al., “Underspecification presents challenges for credibility in modern machine learning,”Journal of Machine Learning Research, vol. 23, no. 226, pp. 1–61, 2022
work page 2022
-
[56]
Shortcut learning in deep neural networks,
R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann, “Shortcut learning in deep neural networks,”Nature Machine Intelligence, vol. 2, no. 11, pp. 665–673, 2020
work page 2020
-
[57]
Adversarial examples are not bugs, they are features,
A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry, “Adversarial examples are not bugs, they are features,”Advances in neural information processing systems, vol. 32, 2019
work page 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.