arxiv: 2604.21854 · v1 · submitted 2026-04-23 · 💻 cs.AI

Bounding the Black Box: A Statistical Certification Framework for AI Risk Regulation

Natan Levy , Gadi Perl This is my paper

Pith reviewed 2026-05-09 21:29 UTC · model grok-4.3

classification 💻 cs.AI

keywords AI risk regulationstatistical certificationblack-box verificationfailure rate boundingRoMAgRoMAconformity assessmentEU AI Act

0 comments

The pith

Regulators can certify opaque AI systems by statistically bounding their failure rates without internal access.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a two-stage certification framework to address the absence of quantitative methods for verifying AI safety under regulations like the EU AI Act. Stage One requires authorities to formally set an acceptable failure probability threshold and define the operational input domain. Stage Two applies the RoMA and gRoMA statistical tools to sample the system's behavior and compute an auditable upper bound on its true failure rate. This bound holds using only external queries to the model, independent of its internal structure or complexity. The result converts regulatory safety demands into a practical, engineering-style certificate that developers can produce and auditors can review.

Core claim

By having a competent authority fix an acceptable failure probability δ and an operational domain ε, then applying RoMA or gRoMA to draw black-box samples from ε, one obtains a rigorous statistical upper bound on the probability that the deployed AI system fails on any input from that domain.

What carries the argument

The RoMA and gRoMA statistical verification tools that produce auditable upper bounds on failure probability from black-box sampling alone.

If this is right

Developers must generate and submit statistical certificates to meet conformity assessment obligations.
Accountability for meeting quantitative risk thresholds moves upstream to the model providers.
The same certificate format can be reused across different model architectures without modification.
Regulators gain a concrete, numerical way to enforce safety thresholds that existing legal frameworks already recognize.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Regulators could apply different δ thresholds to different risk categories while using the same verification procedure.
Collecting samples during live operation rather than pre-deployment testing could tighten the bounds over time.
The framework might extend to bounding other quantities such as fairness violations or privacy leakage if analogous sampling tools are defined.

Load-bearing premise

The statistical procedures can produce a valid upper bound on failure probability for arbitrary high-dimensional input domains and complex models without additional assumptions on the data-generating process or the form of the failure region.

What would settle it

An empirical case in which the actual observed failure rate on the operational domain exceeds the upper bound computed by RoMA or gRoMA for the chosen sample size and confidence level.

read the original abstract

Artificial intelligence now decides who receives a loan, who is flagged for criminal investigation, and whether an autonomous vehicle brakes in time. Governments have responded: the EU AI Act, the NIST Risk Management Framework, and the Council of Europe Convention all demand that high-risk systems demonstrate safety before deployment. Yet beneath this regulatory consensus lies a critical vacuum: none specifies what ``acceptable risk'' means in quantitative terms, and none provides a technical method for verifying that a deployed system actually meets such a threshold. The regulatory architecture is in place; the verification instrument is not. This gap is not theoretical. As the EU AI Act moves into full enforcement, developers face mandatory conformity assessments without established methodologies for producing quantitative safety evidence - and the systems most in need of oversight are opaque statistical inference engines that resist white-box scrutiny. This paper provides the missing instrument. Drawing on the aviation certification paradigm, we propose a two-stage framework that transforms AI risk regulation into engineering practice. In Stage One, a competent authority formally fixes an acceptable failure probability $\delta$ and an operational input domain $\varepsilon$ - a normative act with direct civil liability implications. In Stage Two, the RoMA and gRoMA statistical verification tools compute a definitive, auditable upper bound on the system's true failure rate, requiring no access to model internals and scaling to arbitrary architectures. We demonstrate how this certificate satisfies existing regulatory obligations, shifts accountability upstream to developers, and integrates with the legal frameworks that exist today.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper packages standard black-box statistical bounds into a two-stage regulatory workflow for AI failure rates, but the abstract gives no derivation or validation for the named RoMA and gRoMA tools.

read the letter

The main takeaway is that this work tries to turn vague regulatory demands for AI safety into a concrete process: regulators pick the acceptable failure probability and the operating domain, then developers use sampling-based checks to produce an auditable upper bound without needing model internals. That separation is a useful way to handle the political versus technical split. The paper correctly identifies the enforcement problem with the EU AI Act and similar rules, where conformity assessments are required but no method exists for opaque systems. Framing it around aviation-style certification makes the idea concrete and shows how existing distribution-free bounds could plug into stage two. The stress-test note is right that simple i.i.d. concentration inequalities already deliver valid upper bounds on failure probability for any failure region, so the central claim does not collapse on its own terms. What is missing is any visible math. The abstract asserts that RoMA and gRoMA deliver definitive bounds but shows neither the procedure nor a proof sketch, nor any discussion of sample sizes needed when failures are rare or inputs are high-dimensional. If these are just binomial or Hoeffding bounds under new names, the contribution is mostly the regulatory wrapper; if they add something, that needs to be shown. The assumption that operational data can be sampled i.i.d. from the right distribution is standard but worth stating explicitly, since real deployments often have distribution shift. This paper is for regulators and governance researchers who want a checklist-style approach rather than for theorists looking for new concentration results. A reader focused on implementation would find the framing helpful even if the statistics are familiar. It deserves peer review because the regulatory gap is real and the two-stage structure is coherent; referees can check whether the statistical tools add anything beyond existing methods and whether the liability implications hold up under scrutiny.

Referee Report

2 major / 3 minor

Summary. The paper proposes a two-stage certification framework for high-risk AI systems to meet regulatory requirements such as the EU AI Act. Stage One requires a competent authority to fix an acceptable failure probability δ and an operational input domain ε. Stage Two introduces RoMA and gRoMA as black-box statistical verification procedures that, from i.i.d. samples drawn from ε, compute an auditable upper bound on the system's true failure probability; the bound is claimed to be definitive, architecture-agnostic, and sufficient to discharge existing regulatory obligations.

Significance. If the statistical procedures are correctly derived and the sampling assumptions hold, the framework supplies a concrete, regulator-friendly mechanism for converting qualitative safety mandates into quantitative, auditable certificates. Its reliance on distribution-free concentration inequalities (rather than model-specific analysis) is a genuine strength, as it applies to arbitrary architectures and does not require white-box access.

major comments (2)

§3 (RoMA derivation): the manuscript must explicitly state the precise concentration inequality (Hoeffding, binomial, or other) used to obtain the upper bound and the exact definition of the failure indicator function; without this, it is impossible to verify that the claimed 'definitive' bound is not merely a restatement of a standard result under unstated i.i.d. sampling from ε.
§4.2 (gRoMA generalization): the extension to gRoMA is presented as handling 'arbitrary' failure regions, yet the text provides no proof that the bound remains valid when the failure set has measure zero or when sampling from ε is only approximate; this assumption is load-bearing for the claim that the method scales to complex, high-dimensional domains.

minor comments (3)

Abstract and §2: the terms RoMA and gRoMA are introduced without an immediate expansion or reference to their full definitions; move the acronym expansions to first use.
Figure 2: the caption does not indicate whether the plotted bounds are theoretical or empirical; add a note clarifying the source of the curves.
§5 (regulatory mapping): the discussion of liability implications is brief; a short table mapping each regulatory requirement (EU AI Act Art. X, NIST RMF, etc.) to the corresponding output of the certificate would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful and constructive review, which we found helpful in strengthening the clarity of the statistical derivations. We appreciate the positive assessment of the framework's regulatory utility and address each major comment below with targeted revisions.

read point-by-point responses

Referee: §3 (RoMA derivation): the manuscript must explicitly state the precise concentration inequality (Hoeffding, binomial, or other) used to obtain the upper bound and the exact definition of the failure indicator function; without this, it is impossible to verify that the claimed 'definitive' bound is not merely a restatement of a standard result under unstated i.i.d. sampling from ε.

Authors: We agree that explicit identification of the inequality and indicator improves verifiability. The revised §3 will state that RoMA applies Hoeffding's inequality to the empirical failure rate under i.i.d. sampling from ε. The failure indicator is defined as I(x) = 1 if the black-box system violates the safety specification on input x ∈ ε and I(x) = 0 otherwise. The upper bound on the true failure probability p is the smallest value satisfying the Hoeffding tail bound P(empirical rate ≥ observed + t) ≤ exp(−2nt²) ≤ δ, yielding a distribution-free, architecture-agnostic certificate. This is not a restatement but a direct application tailored to the regulatory δ and ε fixed in Stage One. revision: yes
Referee: §4.2 (gRoMA generalization): the extension to gRoMA is presented as handling 'arbitrary' failure regions, yet the text provides no proof that the bound remains valid when the failure set has measure zero or when sampling from ε is only approximate; this assumption is load-bearing for the claim that the method scales to complex, high-dimensional domains.

Authors: We accept that a short proof sketch is warranted for rigor. In the revision we will add to §4.2 that for any measurable failure set F ⊆ ε (including measure-zero sets), the probability mass is zero by definition, so the empirical rate is zero and the bound collapses to zero without affecting validity. For approximate sampling, the framework presupposes exact i.i.d. draws from the regulator-specified ε; we will include a remark noting that any approximation error can be absorbed into a slightly inflated δ via a union bound, preserving the concentration property. This keeps the method scalable while clarifying the assumptions. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's abstract and described framework set normative parameters (δ, ε) in Stage One as external inputs, then apply RoMA/gRoMA in Stage Two to produce an upper bound via black-box sampling. No equations, fitted parameters, or derivations are exhibited that reduce the claimed bound to a quantity defined by the authors' own choices or prior self-citations. The approach is consistent with standard distribution-free concentration inequalities (e.g., binomial or Hoeffding) applied to i.i.d. samples, which are independent of the paper's internal definitions. No load-bearing self-citation chains or ansatz smuggling are present in the provided text.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 2 invented entities

The framework rests on the existence of statistical procedures that can produce finite-sample upper bounds on failure probability for arbitrary models and input domains; these procedures are introduced in the paper but not derived here.

free parameters (2)

delta
Acceptable failure probability chosen by the competent authority; treated as an exogenous normative input rather than fitted.
epsilon
Operational input domain also fixed by the authority.

axioms (1)

domain assumption The failure indicator is a measurable function of the input-output pair and the statistical test can be run on i.i.d. samples from the operational distribution.
Implicit in the claim that black-box testing suffices for any architecture.

invented entities (2)

RoMA no independent evidence
purpose: Statistical verification tool that returns an upper bound on failure rate
Newly named procedure whose correctness is asserted but not shown in the abstract.
gRoMA no independent evidence
purpose: Generalized version of RoMA for broader settings
Newly named procedure whose correctness is asserted but not shown in the abstract.

pith-pipeline@v0.9.0 · 5559 in / 1398 out tokens · 29851 ms · 2026-05-09T21:29:29.626708+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Right-to-Act: A Pre-Execution Non-Compensatory Decision Protocol for AI Systems
cs.AI 2026-04 unverdicted novelty 4.0

Introduces Right-to-Act as a non-compensatory pre-execution protocol that blocks AI decision realization unless all structural conditions are met, shifting focus from decision optimization to admissibility governance.

Reference graph

Works this paper leans on

35 extracted references · 3 canonical work pages · cited by 1 Pith paper

[1]

European Parliament and Council of the European Union,Regulation (EU) 2024/1689 laying down har- monised rules on artificial intelligence (Artificial Intel- ligence Act), Official Journal of the European Union, OJ L 2024/1689, 12 July 2024, 2024

2024
[2]

Artificial Intelligence Risk Management Framework (AI RMF 1.0),

National Institute of Standards and Technology, “Artifi- cial intelligence risk management framework (AI RMF 1.0),” U.S. Department of Commerce, National Institute of Standards and Technology, Tech. Rep. NIST AI 100- 1, 2023.DOI: 10.6028/NIST.AI.100-1

work page doi:10.6028/nist.ai.100-1 2023
[3]

Navigating the Complexities of AI Regulation in China,

B. Li and A. Zhou, “Navigating the Complexities of AI Regulation in China,”Reed Smith Perspectives, Aug. 2024, Reed Smith In-depth, 2024-166. Accessed: Nov. 20, 2024. [Online]. Available: https : / / www . reedsmith . com / en / perspectives / 2024 / 08 / navigating - the-complexities-of-ai-regulation-in-china

2024
[4]

Beyond Benchmarks: On the False Promise of AI Regulation,

G. Stanovsky, R. Keydar, G. Perl, and T. Habba, “Beyond Benchmarks: On the False Promise of AI Regulation,”AI Regulation, vol. 6, 2025

2025
[5]

Truly risk-based regulation of AI,

M. Ebers, “Truly risk-based regulation of AI,”European Journal of Risk Regulation, 2024, Advance publication

2024
[6]

Binary Governance: Lessons from the EU AI Act’s Risk Classification of AI Systems,

M. E. Kaminski, “Binary Governance: Lessons from the EU AI Act’s Risk Classification of AI Systems,”Boston University Law Review, vol. 103, no. 4, pp. 1529–1579, 2023

2023
[7]

RoMA: a Method for Neural Network Robustness Measurement and Assessment,

N. Levy and G. Katz, “RoMA: a Method for Neural Network Robustness Measurement and Assessment,” in Proc. 29th Int. Conf. on Neural Information Processing (ICONIP), 2021

2021
[8]

gRoMA: a Tool for Measuring Deep Neural Networks Global Robust- ness,

N. Levy, R. Yerushalmi, and G. Katz, “gRoMA: a Tool for Measuring Deep Neural Networks Global Robust- ness,”arXiv preprint arXiv:2301.02288, 2023

work page arXiv 2023
[9]

Statistical Run- time Verification for LLMs via Robustness Estimation,

N. Levy, A. Ashrov, and G. Katz, “Statistical Run- time Verification for LLMs via Robustness Estimation,” inInternational Conference on Runtime Verification, Springer, 2025, pp. 457–476

2025
[10]

European Commission,Guidelines on the application of Article 55 of Regulation (EU) 2024/1689 to general- purpose AI models, European Commission Staff Docu- ment, 2025

2024
[11]

The EU AI act: A Political not a risk- based approach,

S. Wachter, “The EU AI act: A Political not a risk- based approach,”Yale Journal of Law and Technology, vol. 26, 2024

2024
[12]

Trustworthy ar- tificial intelligence and the European Union AI act: On the conflation of trustworthiness and conformity with ethical principles,

J. Laux, S. Wachter, and B. Mittelstadt, “Trustworthy ar- tificial intelligence and the European Union AI act: On the conflation of trustworthiness and conformity with ethical principles,”Regulation & Governance, vol. 18, no. 1, 2024

2024
[13]

9 of the Cyberspace Administration of China, Effective 1 March 2022, 2022

Cyberspace Administration of China,Provisions on the Management of Algorithmic Recommendations in Inter- net Information Services, Order No. 9 of the Cyberspace Administration of China, Effective 1 March 2022, 2022

2022
[14]

Cyberspace Administration of China,Provisions on the Management of Deep Synthesis Internet Information Services, Cyberspace Administration of China, Effective 10 January 2023, 2023

2023
[15]

Cyberspace Administration of China,Interim measures for the management of generative artificial intelligence services, Cyberspace Administration of China, Effective 15 August 2023, 2023

2023
[16]

Arti- ficial Intelligence Risk Management Framework: Gen- erative Artificial Intelligence Profile,

National Institute of Standards and Technology, “Arti- ficial Intelligence Risk Management Framework: Gen- erative Artificial Intelligence Profile,” U.S. Department of Commerce, National Institute of Standards and Tech- nology, Tech. Rep. NIST AI 600-1, 2024.DOI: 10.6028/ NIST.AI.600-1

2024
[17]

Executive Office of the President,Executive Order 14179: Removing Barriers to American Leadership in Artificial Intelligence, Federal Register, Signed 23 January 2025, 90 Fed. Reg. 8741, 2025

2025
[18]

225 (CETS No

Council of Europe,Framework Convention on Artificial Intelligence and Human Rights, Democracy and the Rule of Law, Council of Europe Treaty Series No. 225 (CETS No. 225), Opened for signature 5 September 2024; entered into force 1 November 2025, 2024

2024
[19]

European Commission,Proposal for a Regulation of the European Parliament and of the Council Amending Regulations on Digital Legislation (Digital Omnibus), COM(2025) 836 final, 2025

2025
[20]

ARP4754A/ED-79A- Guidelines for Development of Civil Aircraft and Systems-Enhancements, Novelties and Key Topics,

A. Landi and M. Nicholson, “ARP4754A/ED-79A- Guidelines for Development of Civil Aircraft and Systems-Enhancements, Novelties and Key Topics,” SAE Int. Journal of Aerospace, vol. 4, pp. 871–879, 2011

2011
[21]

New Challenges in Certification for Air- craft Software,

J. Rushby, “New Challenges in Certification for Air- craft Software,” inProceedings of the ninth ACM in- ternational conference on Embedded software, 2011, pp. 211–218

2011
[22]

Federal Aviation Administration,RTCA, Inc., Docu- ment RTCA/DO-178B, https : / / nla . gov . au / nla . cat - vn4510326, 1993

1993
[23]

Toward the Certification of Safety- Related Systems Using ML Techniques: the ACAS-XU Experience,

C. Gabreau, A. Gauffriau, F. De Grancey, J.-B. Ginestet, and C. Pagetti, “Toward the Certification of Safety- Related Systems Using ML Techniques: the ACAS-XU Experience,” in11th European Congress on Embedded Real Time Software and Systems (ERTS 2022), 2022

2022
[24]

Reluplex: a Calculus for Reasoning about Deep Neural Networks,

G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer, “Reluplex: a Calculus for Reasoning about Deep Neural Networks,”Formal Methods in Sys- tem Design (FMSD), 2021

2021
[25]

Statistical Certification of Acceptable Robustness for Neural Net- works,

C. Huang, Z. Hu, X. Huang, and K. Pei, “Statistical Certification of Acceptable Robustness for Neural Net- works,” inICANN, 2021, pp. 79–90

2021
[26]

Certified Adver- sarial Robustness via Randomized Smoothing,

J. Cohen, E. Rosenfeld, and Z. Kolter, “Certified Adver- sarial Robustness via Randomized Smoothing,” inProc. 36th Int. Conf. on Machine Learning (ICML), 2019

2019
[27]

S. Webb, T. Rainforth, Y . Teh, and M. Pawan Kumar, A Statistical Approach to Assessing Neural Network Robustness, http://arxiv.org/abs/1811.07209, 2018

work page arXiv 2018
[28]

DEM: a Method for Certifying Deep Neural Network Classi- fier Outputs in Aerospace,

G. Katz, N. Levy, I. Refaeli, and R. Yerushalmi, “DEM: a Method for Certifying Deep Neural Network Classi- fier Outputs in Aerospace,” in2024 AIAA DATC/IEEE 43rd Digital Avionics Systems Conference (DASC), IEEE, 2024, pp. 1–8

2024
[29]

Anderson-Darling Tests of Goodness- of-Fit,

T. W. Anderson, “Anderson-Darling Tests of Goodness- of-Fit,”Int. Encyclopedia of Statistical Science, vol. 1, pp. 52–54, 2011

2011
[30]

An Analysis of Transformations Revisited, Rebutted,

G. E. Box and D. Cox, “An Analysis of Transformations Revisited, Rebutted,”Journal of the American Statisti- cal Association, vol. 77, no. 377, pp. 209–210, 1982

1982
[31]

Probability Inequalities for Sums of Bounded Random Variables,

W. Hoeffding, “Probability Inequalities for Sums of Bounded Random Variables,”Journal of the American statistical association, vol. 58, no. 301, pp. 13–30, 1963

1963
[32]

Standard IEC 62304-Medical Device Soft- ware Lifecycle Processes,

P. Jordan, “Standard IEC 62304-Medical Device Soft- ware Lifecycle Processes,” in2006 IET Seminar on Software for Medical devices, IET, 2006, pp. 41–47

2006
[33]

Brat and G

G. Brat and G. Pai,Runtime Assurance of Aeronautical Products: Preliminary Recommendations, 2023

2023
[34]

A Systematic Review of Autonomous Emergency Braking System: Impact Factor, Technol- ogy, and Performance Evaluation,

L. Yang et al., “A Systematic Review of Autonomous Emergency Braking System: Impact Factor, Technol- ogy, and Performance Evaluation,”Journal of advanced transportation, vol. 2022, no. 1, p. 1 188 089, 2022

2022
[35]

A Decision-Making Strategy for Vehicle Au- tonomous Braking in Emergency via Deep Reinforce- ment Learning,

Y . Fu, C. Li, F. R. Yu, T. H. Luan, and Y . Zhang, “A Decision-Making Strategy for Vehicle Au- tonomous Braking in Emergency via Deep Reinforce- ment Learning,”IEEE transactions on vehicular tech- nology, vol. 69, no. 6, pp. 5876–5888, 2020

2020