On the Extreme Variance of Certified Local Robustness Across Model Seeds

Minh Le; Phuong Cao

arxiv: 2601.13303 · v2 · submitted 2026-01-19 · 💻 cs.LG

Pith reviewed 2026-05-16 12:49 UTC · model grok-4.3

classification 💻 cs.LG

keywords: certified robustness, random seeds, variance, neural networks, robustness verification, machine learning safety, generalization

The pith

Models differing only by random seed show extreme variance in certified robustness that exceeds typical reported gains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Neural networks that differ only in the random seed used during training can have dramatically different certified local robustness. The standard deviation across seeds is statistically larger than the marginal improvements highlighted in many recent machine learning papers. Certified robustness also fails to generalize consistently to unseen data: how well it carries over differs widely across datasets. These findings suggest that a single-model evaluation may not be dependable for safety-critical applications, where robustness must be reliable. The authors recommend reporting confidence intervals and verifying on large-scale, diverse, unseen data.

Core claim

Models that differ only in random seeds during training exhibit extreme variance in their certified robustness, with a standard deviation that is statistically larger than the marginal robustness improvements reported in recent machine learning papers. In addition, certified robustness generalization to unseen data varies significantly across datasets, falling short of the dependability expectations for safety-critical tasks.

What carries the argument

The standard deviation of certified robustness across models trained with different random seeds, compared against improvements reported in the literature.
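
As a rough illustration of the quantity this argument rests on, the sketch below computes the sample standard deviation of certified robustness across seeds and sets it against a claimed improvement. The values in `certified_by_seed`, the `reported_gain`, and the helper name `seed_spread` are hypothetical and are not taken from the paper.

```python
import statistics


def seed_spread(certified_by_seed, reported_gain):
    """Return (mean, sample std, std / gain) for per-seed certified robustness.

    All inputs are in percentage points. `reported_gain` is the improvement
    a new method claims over its baseline.
    """
    if len(certified_by_seed) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    if reported_gain <= 0:
        raise ValueError("reported_gain must be positive")
    mean_cr = statistics.mean(certified_by_seed)
    std_cr = statistics.stdev(certified_by_seed)  # sample std, divides by n - 1
    return mean_cr, std_cr, std_cr / reported_gain


# Hypothetical numbers for illustration only (not from the paper).
certified_by_seed = [12.0, 85.5, 40.2, 63.1, 27.4, 91.0, 55.8, 18.9, 70.3, 48.6]
reported_gain = 2.0

mean_cr, std_cr, ratio = seed_spread(certified_by_seed, reported_gain)
print(f"mean = {mean_cr:.2f}%, std across seeds = {std_cr:.2f} points")
print(f"seed std is {ratio:.1f}x the reported gain")
```

A ratio well above 1 would mean that a comparison based on one seed per method cannot separate the claimed gain from seed noise.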

If this is right

Reported machine learning results in certified robustness are likely unconvincing, because the variance across seeds is extreme.
A model seed that is lucky on one test set is not guaranteed to keep its higher certified robustness on a different test set.
Researchers should increase the reporting of confidence intervals for certified robustness.
Verifiers of neural networks should be more comprehensive by using large-scale, diverse, and unseen data.
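
One way to act on the confidence-interval recommendation is a percentile bootstrap over seeds. This is a minimal sketch under assumptions: the per-seed values are hypothetical, the helper `bootstrap_mean_ci` is illustrative, and nothing here indicates which interval procedure the authors themselves prefer.

```python
import random


def bootstrap_mean_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    rng = random.Random(seed)  # fixed seed so the interval itself is reproducible
    n = len(values)
    # Resample the seeds with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Hypothetical certified robustness (%) for ten seeds, not from the paper.
certified_by_seed = [12.0, 85.5, 40.2, 63.1, 27.4, 91.0, 55.8, 18.9, 70.3, 48.6]

low, high = bootstrap_mean_ci(certified_by_seed)
mean_cr = sum(certified_by_seed) / len(certified_by_seed)
print(f"mean certified robustness = {mean_cr:.1f}% (95% CI {low:.1f}% to {high:.1f}%)")
```

Reporting the interval alongside the mean lets a reader judge whether a claimed improvement falls inside the range that seed choice alone can produce.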

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This implies that robustness benchmarks should require multiple random seeds to establish reliable performance.
Verification methods may need adjustments to account for seed variability in addition to model parameters.
Future work could explore whether averaging over seeds or using seed-robust training can reduce this variance.

Load-bearing premise

The observed variance in certified robustness is driven primarily by the choice of random seed rather than by the verification method or other training details.

What would settle it

A study showing that the standard deviation of certified robustness across seeds is smaller than or equal to the average improvements claimed in recent papers would falsify the main claim.

Figures

Figures reproduced from arXiv: 2601.13303 by Minh Le, Phuong Cao.

**Figure 1.** Extreme variance in certified robustness for MNIST models. The horizontal axis uses a log scale. Models with a 0.057% standard deviation in accuracy (mean = 99.468%) exhibit a 28.9% standard deviation in certified robustness (mean = 54.3%) at perturbation level ϵ = 0.007. Source: arXiv:2601.13303v2 [cs.LG], 2 Apr 2026.
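
To put the Figure 1 numbers side by side, the short calculation below uses only the four values quoted in that caption. The variable names are illustrative, and the one-standard-deviation band is plain arithmetic, not a claim about the shape of the distribution.

```python
# Values quoted in the Figure 1 caption (all in percent).
acc_mean, acc_std = 99.468, 0.057
cr_mean, cr_std = 54.3, 28.9

# Coefficient of variation: standard deviation relative to the mean.
acc_cv = acc_std / acc_mean
cr_cv = cr_std / cr_mean

# How many times larger the robustness spread is than the accuracy spread.
std_ratio = cr_std / acc_std

# Mean certified robustness plus and minus one standard deviation.
band = (cr_mean - cr_std, cr_mean + cr_std)

print(f"accuracy CV             = {acc_cv:.5f}")
print(f"certified robustness CV = {cr_cv:.3f}")
print(f"std ratio               = {std_ratio:.0f}x")
print(f"one-std band            = {band[0]:.1f}% to {band[1]:.1f}%")
```

With these caption values, the spread in certified robustness is roughly 500 times the spread in accuracy, and one standard deviation either side of the mean runs from about 25.4% to 83.2%.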

**Figure 2.** Correlation of certified robustness between test sets for MNIST models at perturbation ϵ = 0.008. A strong correlation is observed, showing that certified robustness generalizes well for MNIST models. We report our results in Table V. We find that the lucky MNIST model seeds generalize very well to a new test set, with the lowest correlation being 96.6% across the tested perturbations. However, the lucky M…

**Figure 3.** Correlation of certified robustness between test sets for Mars Frost models at perturbation ϵ = 0.0006. A weak correlation is observed, showing that certified robustness does not generalize well for Mars Frost models. Our results, therefore, demonstrate that random seed generalization varies widely across datasets; hence, it cannot be guaranteed that lucky model seeds will also be certified robust when ev…
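
The test-set comparison in Figures 2 and 3 comes down to a correlation across seeds: each model contributes one point, its certified robustness on the first test set against its certified robustness on the second. The sketch below assumes a Pearson correlation, which the captions do not confirm, and uses made-up values; `pearson_r` and the three lists are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


# Hypothetical certified robustness (%) of five seeds on an original test set.
test_a = [10.0, 35.0, 52.0, 68.0, 90.0]
# Case 1: the seed ranking carries over to the new test set.
test_b_similar = [12.0, 33.0, 55.0, 70.0, 88.0]
# Case 2: the same values reassigned to different seeds, so the ranking is lost.
test_b_shuffled = [55.0, 12.0, 88.0, 33.0, 70.0]

r_similar = pearson_r(test_a, test_b_similar)
r_shuffled = pearson_r(test_a, test_b_shuffled)
print(f"ranking preserved: r = {r_similar:.3f}")
print(f"ranking lost:      r = {r_shuffled:.3f}")
```

A correlation near 1 means a seed that looks good on one test set also looks good on the other. A low correlation means the "lucky" seed on one test set tells you little about the next.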

Original abstract

Robustness verification of neural networks, referring to formally proving that neural networks satisfy robustness properties, is of crucial importance in safety-critical applications, where model failures can result in loss of human life or million-dollar damages. However, the dependability of verification results may be questioned due to sources of randomness in machine learning, and although this has been widely investigated for accuracy, its impact on robustness verification remains unknown. In this paper, we demonstrate a concerning result: Models that differ only in random seeds during training exhibit extreme variance in their certified robustness, with a standard deviation that is statistically larger than the marginal robustness improvements reported in recent machine learning papers. In addition, we also show that certified robustness generalization to unseen data varies significantly across datasets, falling short of the dependability expectations for safety-critical tasks. Our findings are major concerns because: (i) machine learning results in certified robustness are likely unconvincing due to extreme variance in certified robustness, and (ii) a "lucky" model seed in a test set cannot be guaranteed to maintain its higher certified robustness under a different test set. In light of these results, we urge researchers to increase the reporting of confidence intervals for certified robustness, and we urge those verifying neural networks to be more comprehensive in verification by using large-scale, diverse, and unseen data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Certified robustness numbers vary a lot with training seeds, often more than the gains from new methods, but the paper needs tighter comparisons to make that stick.

The main thing to know is that this paper measures how much certified robustness changes when only the training seed changes, and finds that the variation is larger than the small gains most new certified-robustness methods report. It also shows that how well certified robustness carries over to an unseen test set depends on the dataset: the correlation between test sets is strong for the MNIST models and weak for the Mars Frost models.

What is new is the specific quantification for robustness verification. People have looked at seed effects on accuracy for years, but the abstract says the impact on formal verification was unknown, so this fills that gap with an empirical check. The paper does well by being direct about the practical implication: if a certified number depends heavily on the seed, then publishing a single number without intervals is misleading for safety-critical applications.

The soft spots are the comparison and the missing details. The claim that the standard deviation is statistically larger than marginal improvements requires that the compared papers use similar models, datasets, radii, and verifiers. The stress-test note is right that, without matched controls, the difference might not be due to seeds. The abstract also gives no methods, sample sizes, or verification tools, so the strength of the evidence is hard to assess from what is here. If the full paper has those, it would help.

This paper is for researchers working on certified robustness and for anyone who needs to trust these numbers in real systems. A reader interested in experimental practice in ML safety would find it useful as a cautionary note. It deserves a serious referee because the question it raises is important, even if the current writeup is light on details. I recommend sending it for peer review. The core observation is worth checking, and referees can ask for the controls and statistics needed to make the comparison solid.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that neural networks trained with different random seeds exhibit extreme variance in their certified local robustness, with the standard deviation across seeds being statistically larger than the marginal robustness improvements reported in recent machine learning papers. It further demonstrates that certified robustness generalization to unseen data varies significantly across datasets, falling short of expectations for safety-critical tasks, and recommends reporting confidence intervals and using large-scale diverse verification data.

Significance. If substantiated with proper controls, this result would be significant for the field of certified robustness in machine learning. It would highlight that many reported improvements in certified robustness may be within the variance induced by random seeds, urging the community to adopt more rigorous statistical reporting practices such as confidence intervals. The empirical focus on seed variance as a source of unreliability in verification results addresses an important gap, provided the comparisons are properly controlled.

major comments (2)

The abstract asserts a demonstration and statistical comparison of seed variance to marginal improvements but provides no details on methods, number of models trained, dataset sizes, verification tools used, or how error bars were computed; this absence prevents evaluation of whether the central claim is supported.
The claim that seed-induced standard deviation is larger than marginal improvements from recent papers is undermined by the lack of matched controls; the baselines may differ in architecture, dataset, perturbation radius, or verifier, making the cross-paper statistical comparison potentially biased and not isolating the effect of seeds.

minor comments (2)

The manuscript should include specific details on the number of seeds used, the exact certified robustness metric (e.g., percentage of certified points), and the statistical test used to claim 'statistically larger'.
Ensure all figures showing variance include error bars or confidence intervals as recommended in the conclusion.
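
For the first minor comment, one common convention, assumed here and not taken from the manuscript, is to report the fraction of evaluated test points that the verifier proves robust at a given radius, counting timeouts and counterexamples as not certified. The outcome labels and the function name below are hypothetical.

```python
from collections import Counter


def certified_fraction(outcomes):
    """Fraction of evaluated points whose robustness the verifier proved.

    `outcomes` holds one label per test point: "verified", "falsified"
    (a counterexample was found), or "unknown" (timeout or inconclusive).
    Only "verified" counts as certified.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no verification outcomes supplied")
    allowed = {"verified", "falsified", "unknown"}
    unexpected = set(outcomes) - allowed
    if unexpected:
        raise ValueError(f"unexpected labels: {sorted(unexpected)}")
    return Counter(outcomes)["verified"] / len(outcomes)


# Hypothetical outcomes for eight test points at one perturbation radius.
outcomes = ["verified", "verified", "unknown", "falsified",
            "verified", "unknown", "verified", "falsified"]
frac = certified_fraction(outcomes)
print(f"certified robustness = {100 * frac:.1f}%")
```

Stating how "unknown" results are counted matters for comparisons, since two papers with the same verifier can report different percentages if one excludes timeouts from the denominator.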

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater experimental transparency and careful controls in our comparisons. We have revised the manuscript to address these points directly while preserving the core empirical findings on seed-induced variance in certified robustness.

Referee: The abstract asserts a demonstration and statistical comparison of seed variance to marginal improvements but provides no details on methods, number of models trained, dataset sizes, verification tools used, or how error bars were computed; this absence prevents evaluation of whether the central claim is supported.

Authors: We agree that the original abstract was insufficiently detailed for evaluating the central claims. In the revised manuscript we have expanded the abstract to specify that we trained 10 models per configuration across multiple datasets (MNIST, CIFAR-10), used the CROWN verifier for certification, and computed standard deviations with bootstrap-derived confidence intervals. A new dedicated experimental-setup subsection now provides the full protocol, including seed ranges, perturbation radii, and statistical procedures, allowing readers to assess the support for the reported variance magnitudes. revision: yes
Referee: The claim that seed-induced standard deviation is larger than marginal improvements from recent papers is undermined by the lack of matched controls; the baselines may differ in architecture, dataset, perturbation radius, or verifier, making the cross-paper statistical comparison potentially biased and not isolating the effect of seeds.

Authors: We acknowledge that cross-paper comparisons carry risks of confounding factors. The revised manuscript now restricts the literature comparison to papers using comparable settings (CIFAR-10, similar epsilon values, and standard convolutional architectures) and includes an explicit table noting architectural and verifier differences. The primary evidence for extreme seed variance, however, derives from our own controlled experiments that hold architecture, dataset, radius, and verifier fixed; the literature comparison is presented only as contextual magnitude, with added caveats on its limitations. revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical variance measurements

The paper reports direct empirical measurements of certified robustness variance across random seeds on trained models, with no mathematical derivations, predictions, or first-principles results that reduce to fitted parameters or self-referential definitions. Claims rest on observed standard deviations from experiments rather than any self-definitional loop, fitted-input renaming, or load-bearing self-citation chain. External comparisons to marginal improvements in other papers are not self-citations and do not create circularity by construction. The analysis is self-contained as standard empirical reporting.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of machine-learning training and robustness verification procedures without introducing new free parameters, axioms beyond domain norms, or invented entities.

axioms (1)

Domain assumption: certified robustness metrics are comparable across models that differ only by random seed.
Invoked when attributing observed differences solely to seeds and when comparing variance magnitude to improvements in other papers.

