arXiv: 2512.20865 · v2 · submitted 2025-12-24 · 💻 cs.LG · cs.SY · eess.SY

Robustness Certificates for Neural Networks against Adversarial Attacks

Pith reviewed 2026-05-16 20:29 UTC · model grok-4.3

classification 💻 cs.LG · cs.SY · eess.SY
keywords neural network robustness · data poisoning attacks · barrier certificates · PAC bounds · adversarial robustness · dynamical systems · safety verification

The pith

Barrier certificates certify a robust radius for neural networks under training-time data poisoning and test-time adversarial attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes a formal framework for certifying the robustness of neural networks to adversarial poisoning attacks on training data. It models gradient descent training as a discrete-time dynamical system and adapts barrier certificates from control theory to ensure the final model satisfies safety properties despite bounded perturbations. Sufficient conditions are provided for a robust radius, and probably approximately correct bounds are computed via scenario convex programming on sampled poisoned trajectories to generalize beyond the samples. The method unifies certification for both training-time poisoning and test-time adversarial attacks without requiring knowledge of the specific attack or contamination level.

Core claim

Gradient-based training is modeled as a discrete-time dynamical system. Barrier certificates parameterized by neural networks are trained on a finite set of poisoned trajectories to certify a robustness radius within which the terminal model remains safe under worst-case ℓ_p-norm poisoning. PAC bounds provide the generalization guarantee, and the same construction extends to test-time attacks.

What carries the argument

Neural-network barrier certificates for the discrete-time dynamical system model of training, verified via scenario convex programming for PAC bounds.
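For orientation, the textbook form of a barrier certificate for a discrete-time system is sketched below. This is the standard control-theory statement, not a transcription of the paper's conditions, which may add margins or handle the perturbation level differently. The symbols $f$, $W$, $\Theta$, $\Theta_0$ (initial set), and $\Theta_u$ (unsafe set) are notation assumed here for illustration.

```latex
\begin{align*}
\theta(t+1) &= f\bigl(\theta(t), w(t)\bigr), \qquad w(t) \in W, \\
B(\theta) &\le 0 \quad \forall\, \theta \in \Theta_0, \\
B(\theta) &> 0 \quad \forall\, \theta \in \Theta_u, \\
B\bigl(f(\theta, w)\bigr) - B(\theta) &\le 0 \quad \forall\, \theta \in \Theta,\ \forall\, w \in W.
\end{align*}
```

If all three inequalities hold, $B$ is non-increasing along every trajectory, so $B(\theta(t)) \le B(\theta(0)) \le 0$ for all $t$, and no trajectory starting in $\Theta_0$ can enter $\Theta_u$, where $B$ is strictly positive.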

If this is right

  • Certified radii generalize beyond the sampled poisoned trajectories, with PAC confidence.
  • The framework handles both training and inference attacks in one approach.
  • No prior assumptions on model class or attack details are needed.
  • Experiments confirm non-trivial radii on standard image datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach might extend to other training algorithms beyond gradient descent if modeled similarly.
  • Integrating with existing verification tools could tighten the bounds in practice.
  • It opens possibilities for runtime monitoring using the same certificate concept.

Load-bearing premise

Gradient-based training trajectories can be sufficiently sampled to train a barrier certificate that validly bounds the worst-case poisoning effects.

What would settle it

Finding a poisoning attack within the certified ℓ_p radius that causes the model to violate the safety property on a new dataset would falsify the certification.

Figures

Figures reproduced from arXiv: 2512.20865 by Debarghya Ghoshdastidar, Mahalakshmi Sabanayagam, Majid Zamani, Sara Taheri.

Figure 1: Overview of our proposed framework against train-time attacks. (Left) Data Generation: The model h_θ is trained on multiple poisoned datasets with varying perturbation levels to generate a set of parameter trajectories. The terminal parameters are labeled as safe or unsafe based on the test accuracy degradation. (Right) Certification: A Neural Network-based Barrier Certificate (NNBC) B is learned from these…
Figure 2: Certified accuracy (g*_p) versus perturbation magnitude (δ) on different settings and poisoning scenarios. Each figure reports the terminal test accuracy g(θ(t_∞)), the empirical robust radius δ_emp, and the certified robust radius δ*_cert obtained using our proposed framework. The confidence level is fixed at 1 − β, β = 10^-4, across all settings, with the corresponding violation probabilities being (a)…
Figure 3: Comparing the result of our framework and RAB on SVHN under test-time BDA with ℓ∞ attack. Our results consistently yield higher certified robustness than RAB.
… radius δ_cert consistently close to the empirical robust radius δ_emp. While δ_cert ≤ δ_emp holds by construction, tightness is largely dictated by how regularly the model degrades under poisoning: when the test accuracy g(θ) decreases smoothly rather th…
Figure 4: Our proposed framework. The left panel illustrates the data generation process under either train-time poisoning attacks or test-time evasion attacks. For each perturbation level, the model h_θ is trained on perturbed datasets to produce two disjoint sets of parameter vectors: θ and θ̂. A safety criterion function is then applied to each parameter vector to label it as safe or unsafe. The set θ is used to tra…
Figure 5: Certified accuracy versus perturbation magnitude δ under different poisoning scenarios and datasets. Each subplot shows the test accuracy g, empirical robust radius δ_emp, and certified robust radius δ*_cert under the proposed framework. The confidence level is fixed at 99.99%. Violation probabilities are: ϵ = (a) 0.005, (b) 0.011, (c) 0.045, (d) 0.003, (e) 0.006, (f) 0.006, (g) 0.006, (h) 0.004, (i) 0.011…
Figure 6: Violation rate ϵ vs the number of scenarios N̂ used for solving the SCP at different confidence levels.
F.5.4. Effect of sampling density on robust-radius curves. In some configurations, the empirical test-accuracy curve appears to fall below the certified robust-radius curves, which is theoretically inconsistent. This artifact is caused by an insufficient sampling of α thresholds when computing δ_cert, lead…
Figure 7: Effect of α-sampling density on empirical and certified robust-radius curves for MNIST, MLP, PGD, train-time poisoning.
F.5.5. On seemingly extreme poisoning ratios. In several experiments we deliberately consider very large corruption ratios (e.g., 0.5–1), even though such levels are rare in practice. This is intentional and reflects what our framework actually certifies: a bound on the per-sample pertur…
Original abstract

The increasing use of machine learning in safety-critical domains amplifies the risk of adversarial threats, especially data poisoning attacks that corrupt training data to degrade performance or induce unsafe behavior. Most existing defenses lack formal guarantees or rely on restrictive assumptions about the model class, attack type, extent of poisoning, or point-wise certification, limiting their practical reliability. This paper introduces a principled formal robustness certification framework that models gradient-based training as a discrete-time dynamical system (dt-DS) and formulates poisoning robustness as a formal safety verification problem. By adapting the concept of barrier certificates (BCs) from control theory, we introduce sufficient conditions to certify a robust radius ensuring that the terminal model remains safe under worst-case ℓ_p-norm-based poisoning. To make this practical, we parameterize BCs as neural networks trained on finite sets of poisoned trajectories. We further derive probably approximately correct (PAC) bounds by solving a scenario convex program (SCP), which yields a confidence lower bound on the certified robustness radius generalizing beyond the training set. Importantly, our framework also extends to certification against test-time attacks, making it the first unified framework to provide formal guarantees in both training and test-time attack settings. Experiments on MNIST, SVHN, and CIFAR-10 show that our approach certifies non-trivial perturbation budgets while being model-agnostic and requiring no prior knowledge of the attack or contamination level.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce a unified formal robustness certification framework for neural networks against data poisoning attacks. It models gradient-based training as a discrete-time dynamical system, adapts barrier certificates from control theory to provide sufficient conditions for a certified robust radius under worst-case l_p-norm poisoning, parameterizes the certificates as neural networks trained on finite sampled poisoned trajectories, and derives PAC bounds via scenario convex programming (SCP) that generalize beyond the training set. The framework is claimed to extend to test-time attacks and is evaluated on MNIST, SVHN, and CIFAR-10, yielding non-trivial certified perturbation budgets while remaining model-agnostic.

Significance. If the central claims hold, the work would be significant as the first unified framework providing formal sufficient conditions and PAC guarantees for both training-time poisoning and test-time attacks using barrier certificates. It bridges control theory with ML robustness certification in a model-agnostic way, with potential impact on safety-critical applications. The use of scenario optimization for PAC bounds on the radius is a strength if the sampling-to-worst-case gap can be closed.

major comments (2)
  1. [Abstract and barrier certificate construction] The sufficient conditions for certifying the robust radius (Abstract; derivation of barrier inequalities for the dt-DS) require the learned neural-network barrier to satisfy the discrete-time barrier conditions for every poisoning inside the l_p ball. Training on randomly sampled trajectories within the radius yields only probabilistic coverage over the sampling distribution; nothing forces validity on the actual worst-case poisoning that maximizes violation, so the claimed deterministic certification against worst-case l_p poisoning does not follow.
  2. [PAC bounds via SCP] The PAC lower bound on the certified radius is obtained by solving the scenario convex program on the finite set of sampled trajectories (SCP derivation). Because the bound is with respect to the sampling distribution rather than the true adversarial poisoning, the generalization claim beyond the training set holds only probabilistically over samples and does not deterministically bound the worst-case instance advertised in the abstract.
minor comments (2)
  1. [Experimental setup] Clarify the precise procedure for generating and sampling poisoned trajectories, including how the l_p ball is discretized and the number of samples relative to the SCP confidence parameters.
  2. [Modeling section] The manuscript should include a short discussion of how the discrete-time dynamical system abstraction exactly preserves the gradient updates of standard optimizers (e.g., SGD with momentum).
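The gap raised in major comment 1, between what random sampling inside a norm ball observes and what the worst case can do, can be illustrated numerically. The sketch below is a toy, not an experiment from the paper: it assumes a linear functional `w @ d` as a stand-in for the effect of a perturbation `d`, and all values (`dim`, `radius`, `n_samples`) are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, radius, n_samples = 100, 0.1, 10_000
w = rng.normal(size=dim)  # hypothetical sensitivity direction

# Exact worst case of w @ d over ||d||_inf <= radius, attained at d = radius * sign(w).
worst_case = radius * np.abs(w).sum()
worst_d = radius * np.sign(w)

# Largest effect seen among uniform random samples from the same ball.
samples = rng.uniform(-radius, radius, size=(n_samples, dim))
sampled_max = (samples @ w).max()

print(f"worst case over the ball : {worst_case:.3f}")
print(f"max over random samples  : {sampled_max:.3f}")
```

In high dimension the maximizing corner of the ball is essentially never hit by uniform sampling, which is why a certificate fitted to sampled trajectories supports a statement about the sampling distribution rather than about the supremum.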

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments correctly identify a distinction between deterministic sufficient conditions and the probabilistic guarantees obtained via sampling and scenario optimization. We respond to each major comment below and will make revisions to clarify the nature of our certificates.

Point-by-point responses
  1. Referee: [Abstract and barrier certificate construction] The sufficient conditions for certifying the robust radius (Abstract; derivation of barrier inequalities for the dt-DS) require the learned neural-network barrier to satisfy the discrete-time barrier conditions for every poisoning inside the l_p ball. Training on randomly sampled trajectories within the radius yields only probabilistic coverage over the sampling distribution; nothing forces validity on the actual worst-case poisoning that maximizes violation, so the claimed deterministic certification against worst-case l_p poisoning does not follow.

    Authors: We agree that the barrier-certificate theorem yields deterministic sufficient conditions only when the inequalities hold for every poisoning inside the l_p ball. Our construction learns a neural-network barrier from finite sampled trajectories and applies scenario convex programming to obtain PAC bounds on the violation probability under the sampling measure. The resulting certificate therefore holds with high probability over the choice of samples rather than deterministically for the worst-case poisoning. We will revise the abstract and the statement of the main theorem to replace language suggesting deterministic worst-case certification with explicit reference to the PAC guarantee, making the probabilistic character of the result clear.
    revision: yes

  2. Referee: [PAC bounds via SCP] The PAC lower bound on the certified radius is obtained by solving the scenario convex program on the finite set of sampled trajectories (SCP derivation). Because the bound is with respect to the sampling distribution rather than the true adversarial poisoning, the generalization claim beyond the training set holds only probabilistically over samples and does not deterministically bound the worst-case instance advertised in the abstract.

    Authors: The referee is correct: the PAC bound produced by the scenario program is with respect to the distribution from which trajectories are drawn and therefore controls the measure of violating poisonings rather than the worst-case instance. This is the standard guarantee delivered by scenario optimization; it does not close the gap to a deterministic bound on the supremum violation. We will add a short discussion section clarifying this distinction, the relationship to the sampling distribution, and the practical interpretation of the certified radius as a high-confidence lower bound rather than a deterministic worst-case guarantee.
    revision: yes
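To make the relationship between sample size, confidence, and violation probability concrete, the sketch below evaluates the classical scenario-approach bound of Campi and Garatti for a convex program with `d` decision variables and `n` i.i.d. scenarios: the probability that the solution's violation level exceeds ε is at most Σ_{i=0}^{d−1} C(n, i) ε^i (1 − ε)^{n−i}. Whether the paper uses exactly this bound, and what `d` is in its SCP, is not stated on this page; both are assumptions here, and the function names are invented for illustration.

```python
import math

def binomial_tail(n, d, eps):
    """sum_{i=0}^{d-1} C(n, i) eps^i (1 - eps)^(n - i), evaluated in log space."""
    total = 0.0
    for i in range(d):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(eps) + (n - i) * math.log1p(-eps))
        total += math.exp(log_term)
    return total

def scenario_epsilon(n, d, beta, iters=60):
    """Smallest eps (up to bisection tolerance) with binomial_tail(n, d, eps) <= beta.

    Assumes d <= n. The tail is decreasing in eps, so bisection on (0, 1) applies.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if binomial_tail(n, d, mid) <= beta:
            hi = mid
        else:
            lo = mid
    return hi

# Violation level guaranteed with confidence 1 - beta, for a hypothetical d = 5.
for n in (500, 1000, 5000):
    print(n, scenario_epsilon(n, d=5, beta=1e-4))
```

For `d = 1` the bound reduces to (1 − ε)^n ≤ β, so ε = 1 − β^(1/n), which gives a quick sanity check on the bisection. In every case ε measures probability mass under the sampling distribution, which is the point the authors concede above.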

Circularity Check

0 steps flagged

No significant circularity; derivation uses external control theory and scenario optimization

full rationale

The paper models training as a discrete-time dynamical system and adapts barrier certificates from control theory to certify a robust radius via neural-network parameterization and scenario convex programming on sampled trajectories. The PAC bounds are derived from the SCP solution on finite samples rather than being tautological with the fitted barrier or any self-citation chain. No load-bearing self-citations, self-definitional reductions, or fitted inputs renamed as predictions appear in the central claims. The framework remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework rests on the modeling of gradient descent as an exact discrete-time dynamical system and on the existence of a neural-network barrier certificate that can be trained to separate safe and unsafe trajectories; no new physical entities are postulated.

free parameters (2)
  • barrier network architecture and training hyperparameters
    The neural network used to represent the barrier certificate is trained on finite poisoned trajectories; its weights and architecture are fitted parameters.
  • scenario convex program sample size and confidence parameters
    The PAC bound depends on the number of scenarios drawn and the desired confidence level chosen for the convex program.
axioms (2)
  • domain assumption: Gradient-based training dynamics can be represented exactly as a discrete-time dynamical system whose state transition is independent of the specific loss landscape details beyond the gradient step.
    Invoked when the paper states that training is modeled as dt-DS to apply barrier-certificate theory.
  • domain assumption: A neural-network parameterization of the barrier certificate is sufficiently expressive to certify the desired safety property for the sampled trajectories.
    Required for the practical training step described in the abstract.
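The first axiom is straightforward for plain gradient descent, where the state is the parameter vector alone. For optimizers with memory, the state has to be augmented. A sketch for heavy-ball SGD with momentum follows, using one common convention for the update; the symbols $\eta$ (learning rate), $\mu$ (momentum coefficient), $v$ (velocity), and $\tilde{D}_t$ (the possibly poisoned batch at step $t$) are assumed notation, and the paper's own treatment is not shown on this page.

```latex
\begin{align*}
v(t+1) &= \mu\, v(t) + \nabla_\theta L\bigl(\theta(t); \tilde{D}_t\bigr), \\
\theta(t+1) &= \theta(t) - \eta\, v(t+1), \\
x(t) &:= \bigl(\theta(t), v(t)\bigr), \qquad x(t+1) = F\bigl(x(t), \tilde{D}_t\bigr).
\end{align*}
```

Under this convention the pair $(\theta, v)$ evolves as a first-order discrete-time system, so a barrier certificate would in principle be a function of the augmented state $x$ rather than of $\theta$ alone.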

pith-pipeline@v0.9.0 · 5568 in / 1598 out tokens · 24703 ms · 2026-05-16T20:29:22.186723+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    models gradient-based training as a discrete-time dynamical system (dt-DS) and formulates poisoning robustness as a formal safety verification problem. By adapting the concept of barrier certificates (BCs) from control theory, we introduce sufficient conditions to certify a robust radius

  • IndisputableMonolith/Foundation/Atomicity.lean atomic_tick unclear

    Relation between the paper passage and the cited Recognition theorem.

    We further derive probably approximately correct (PAC) bounds by solving a scenario convex program (SCP), which yields a confidence lower bound on the certified robustness radius

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
