pith. sign in

arxiv: 2605.22964 · v1 · pith:6QGMGU6Dnew · submitted 2026-05-21 · 💻 cs.LG

Certification from Examples is Hard for Circuits and Transformers under Minimal Overparametrization

Pith reviewed 2026-05-25 05:59 UTC · model grok-4.3

classification 💻 cs.LG
keywords exact certificationoverparametrizationthreshold circuitstransformerscertificate sizeapproximate certificationneural network verification
0
0 comments X

The pith

Adding one extra gate forces exponential certificate sizes for exact certification of threshold circuits and transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that exact certification, which requires proving a learned hypothesis equals the target function, can demand exponentially many labeled examples even under minimal overparametrization. For threshold circuits of depth at least 2, a single additional gate creates this exponential barrier in the input dimension. An analogous result applies to log-precision transformers when only constant architectural overhead is permitted. This matters because deployed networks on reasoning tasks need exactness guarantees to avoid hidden inconsistencies that average accuracy misses, yet slight extra capacity can render the certification process intractable.

Core claim

Exact certification asks for the smallest set of labeled examples sufficient to prove a hypothesis is identical to the target function. The paper proves that threshold circuits of depth greater than or equal to 2 with one extra gate require certificate sizes exponential in the input dimension. It establishes the same hardness for log-precision transformers under only constant architectural overhead. It further shows that approximate certification allowing polynomially many mistakes still needs exponential certificates, while constant relative-error bounds can conceal exponentially many mistakes.

What carries the argument

Certificate size, the minimal number of labeled examples needed to certify that a hypothesis equals the target function exactly.

If this is right

  • Threshold circuits of depth at least 2 with one extra gate require certificate sizes exponential in input dimension for exact certification.
  • Log-precision transformers with constant architectural overhead require exponential certificate sizes.
  • Approximate certification that allows polynomially many mistakes still demands exponentially large certificates.
  • Constant relative-error guarantees can conceal exponentially many mistakes in the hypothesis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Certification procedures may need to measure overparametrization more carefully to remain tractable in practice.
  • Training methods that constrain models closer to minimal capacity could yield smaller certificates.
  • Alternative notions of agreement short of exact identity might make certification feasible for overparametrized models.

Load-bearing premise

The hardness depends on defining minimal overparametrization as precisely one extra gate or constant overhead, and on requiring proof of exact identity rather than approximate agreement.

What would settle it

A threshold circuit of depth 2 with one extra gate for which a polynomial number of examples suffices to certify exact identity to the target function.

Figures

Figures reproduced from arXiv: 2605.22964 by Artur Back de Luca, Kimon Fountoulakis.

Figure 1
Figure 1. Figure 1: Same-depth deceiver construction in [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Surviving deceivers under uniformly sampled certificate candidates. [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Remaining constructed threshold-circuit deceivers under optimally chosen certificate candidates. [PITH_FULL_IMAGE:figures/full_fig_p033_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of validation and test errors across accepted models. Each point is an accepted addition [PITH_FULL_IMAGE:figures/full_fig_p035_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Easy-versus-hard certification in the two finite-class experiments. Left: the restricted threshold [PITH_FULL_IMAGE:figures/full_fig_p038_5.png] view at source ↗
read the original abstract

As state-of-the-art neural networks are deployed on reasoning and algorithmic tasks, exactness guarantees become increasingly important. However, high average-case accuracy can still mask inconsistent behaviors. This motivates exact certification, which asks for the smallest set of labeled examples needed to certify that a learned hypothesis equals the target. We show that while some hypotheses are easy to certify, even minimal overparametrization can make certification exponentially hard across several hypothesis classes. For threshold circuits of depth $\ge 2$, adding a single extra gate can force certificate sizes exponential in the input dimension. We show an analogous hardness result for log-precision Transformers with only constant architectural overhead. We also characterize approximate certification, showing that allowing only polynomially many mistakes still requires exponentially large certificates, whereas constant relative-error guarantees can hide exponentially many mistakes. Empirically, we study certification for constructed circuits and trained Transformers for recognizing binary addition. While the constructed circuits instantiate the exponential barrier for certification, the trained Transformer analysis shows that imperfect models can evade detection by large uniformly sampled certificate candidates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proves that exact certification from examples is hard for threshold circuits of depth ≥2 and for log-precision Transformers under minimal overparametrization: adding one extra gate (or constant overhead) forces certificate sizes exponential in the input dimension for exact identity to the target function. It further shows that allowing polynomially many mistakes still requires exponential certificates, while constant relative-error guarantees can conceal exponentially many mistakes. Empirical results on constructed circuits and trained Transformers for binary addition illustrate the barrier and show that imperfect models can evade large uniform certificate candidates.

Significance. If the lower bounds hold, the results are significant because they establish concrete computational barriers to exact certification even under minimal overparametrization, with direct implications for verification of neural networks on reasoning tasks. The separation between exact, polynomially approximate, and constant-relative-error regimes is a useful contribution. The empirical instantiation on binary addition provides concrete examples of both the hardness and practical evasion.

minor comments (3)
  1. [Abstract, §2] Abstract and §2: the precise definition of 'minimal overparametrization' (one extra gate or constant overhead) and the model of exact certification (identity vs. approximate agreement) should be stated with a forward reference to the formal definitions used in the hardness proofs.
  2. [§5] §5 (empirical section): clarify whether the certificate candidates for the trained Transformer are drawn uniformly at random and how the observed evasion relates to the theoretical exponential lower bound (e.g., does it only apply to perfect models?).
  3. Notation for certificate size (e.g., the function mapping hypothesis to minimal certificate length) should be introduced once and used consistently; a short table summarizing the exact vs. approximate regimes would improve readability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the significance of the lower bounds, and recommendation of minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper establishes theoretical hardness lower bounds showing that minimal overparametrization forces exponential certificate sizes for exact certification in depth-2 threshold circuits and log-precision transformers. These results rely on explicit constructions and reductions rather than any fitted parameters, self-referential definitions, or load-bearing self-citations that collapse the claimed results to the inputs by construction. The derivation chain is self-contained as standard complexity-theoretic arguments with no reduction of predictions to fitted quantities or ansatzes smuggled via prior work by the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract only; no free parameters, axioms, or invented entities are extractable from the provided text.

pith-pipeline@v0.9.0 · 5712 in / 1027 out tokens · 36896 ms · 2026-05-25T05:59:24.736648+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages

  1. [1]

    Learning regular sets from queries and counterexamples.Information and computation, 75(2):87–106, 1987

    Dana Angluin. Learning regular sets from queries and counterexamples.Information and computation, 75(2):87–106, 1987

  2. [2]

    Queries and concept learning.Machine learning, 2(4):319–342, 1988

    Dana Angluin. Queries and concept learning.Machine learning, 2(4):319–342, 1988

  3. [3]

    Negative results for equivalence queries.Machine Learning, 5(2):121–150, 1990

    Dana Angluin. Negative results for equivalence queries.Machine Learning, 5(2):121–150, 1990

  4. [4]

    On exact specification by examples

    Martin Anthony, Graham Brightwell, Dave Cohen, and John Shawe-Taylor. On exact specification by examples. InFifth annual workshop on computational learning theory, pages 311–318, 1992

  5. [5]

    Polynomial threshold functions, hyperplane arrangements, and random tensors.SIAM Journal on Mathematics of Data Science, 1(4):699–729, 2019

    Pierre Baldi and Roman Vershynin. Polynomial threshold functions, hyperplane arrangements, and random tensors.SIAM Journal on Mathematics of Data Science, 1(4):699–729, 2019

  6. [6]

    Bshouty, Richard Cleve, Ricard Gavaldà, Sampath Kannan, and Christino Tamon

    Nader H. Bshouty, Richard Cleve, Ricard Gavaldà, Sampath Kannan, and Christino Tamon. Oracles and queries that are sufficient for exact learning.Journal of Computer and System Sciences, 52(3):421–433, 1996

  7. [7]

    Bounds in query learning

    Hunter Chase and James Freitag. Bounds in query learning. InProceedings of Thirty Third Conference on Learning Theory, volume 125 ofProceedings of Machine Learning Research, pages 1142–1160. PMLR, 2020

  8. [8]

    On the recursive teaching dimension of VC classes

    Xi Chen, Yu Cheng, and Bo Tang. On the recursive teaching dimension of VC classes. InAdvances in Neural Information Processing Systems, volume 29, 2016

  9. [9]

    Transformers in uniform TC0.Transactions on Machine Learning Research, 2025

    David Chiang. Transformers in uniform TC0.Transactions on Machine Learning Research, 2025. ISSN 2835–8856

  10. [10]

    Learning to add, multiply, and execute algorithmic instructions exactly with neural networks

    Artur Back de Luca, George Giapitzakis, and Kimon Fountoulakis. Learning to add, multiply, and execute algorithmic instructions exactly with neural networks. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  11. [11]

    Recursive teaching dimension, VC-dimension and sample compression.Journal of Machine Learning Research, 15(89):3107–3131, 2014

    Thorsten Doliwa, Gaojian Fan, Hans Ulrich Simon, and Sandra Zilles. Recursive teaching dimension, VC-dimension and sample compression.Journal of Machine Learning Research, 15(89):3107–3131, 2014

  12. [12]

    Preference-based teaching

    Ziyuan Gao, Christoph Ries, Hans Simon, and Sandra Zilles. Preference-based teaching. In29th Annual Conference on Learning Theory, volume 49 ofProceedings of Machine Learning Research, pages 971–997. PMLR, 2016

  13. [13]

    Goldman and M.J

    S.A. Goldman and M.J. Kearns. On the complexity of teaching.Journal of Computer and System Sciences, 50(1):20–31, 1995. ISSN 0022-0000

  14. [14]

    Beyond statistical learning: Exact learning is essential for general intelligence.arXiv preprint arXiv:2506.23908, 2025

    András György, Tor Lattimore, Nevena Lazić, and Csaba Szepesvári. Beyond statistical learning: Exact learning is essential for general intelligence.arXiv preprint arXiv:2506.23908, 2025

  15. [15]

    Formal language recognition by hard attention transform- ers: Perspectives from circuit complexity.Transactions of the Association for Computational Linguistics, 10:800–810, 2022

    Yiding Hao, Dana Angluin, and Robert Frank. Formal language recognition by hard attention transform- ers: Perspectives from circuit complexity.Transactions of the Association for Computational Linguistics, 10:800–810, 2022

  16. [16]

    Generalized teaching dimensions and the query complexity of learning

    Tibor Hegedüs. Generalized teaching dimensions and the query complexity of learning. InProceedings of the Eighth Annual Conference on Computational Learning Theory, pages 108––117, 1995. doi: 10.1145/225298.225311

  17. [17]

    Exact learning of DNF formulas using DNF hypotheses.Journal of Computer and System Sciences, 70(4):435–470, 2005

    Lisa Hellerstein and Vijay Raghavan. Exact learning of DNF formulas using DNF hypotheses.Journal of Computer and System Sciences, 70(4):435–470, 2005. 11

  18. [18]

    How many queries are needed to learn?Journal of the ACM (JACM), 43(5):840–862, 1996

    Lisa Hellerstein, Krishnan Pillaipakkamnatt, Vijay Raghavan, and Dawn Wilkins. How many queries are needed to learn?Journal of the ACM (JACM), 43(5):840–862, 1996

  19. [19]

    Quadratic upper bound for recursive teaching dimension of finite vc classes

    Lunjia Hu, Ruihan Wu, Tianhong Li, and Liwei Wang. Quadratic upper bound for recursive teaching dimension of finite vc classes. InProceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 1147–1156. PMLR, 2017

  20. [20]

    Frontier LLMs still struggle with simple reasoning tasks.arXiv preprint arXiv:2507.07313, 2025

    Alan Malek, Jiawei Ge, Nevena Lazic, Chi Jin, András György, and Csaba Szepesvári. Frontier LLMs still struggle with simple reasoning tasks.arXiv preprint arXiv:2507.07313, 2025

  21. [21]

    A logic for expressing log-precision transformers

    William Merrill and Ashish Sabharwal. A logic for expressing log-precision transformers. InThirty-seventh Conference on Neural Information Processing Systems, 2023

  22. [22]

    The parallelism tradeoff: Limitations of log-precision transformers

    William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531–545, 2023

  23. [23]

    The expressive power of transformers with chain of thought

    William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. In The Twelfth International Conference on Learning Representations, 2024

  24. [24]

    Saturated transformers are constant-depth threshold circuits.Transactions of the Association for Computational Linguistics, 10:843–856, 2022

    William Merrill, Ashish Sabharwal, and Noah A Smith. Saturated transformers are constant-depth threshold circuits.Transactions of the Association for Computational Linguistics, 10:843–856, 2022

  25. [25]

    Alice in wonderland: Simple tasks reveal severe generalization and basic reasoning deficits in state-of-the-art large language models

    Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti, and Jenia Jitsev. Alice in wonderland: Simple tasks reveal severe generalization and basic reasoning deficits in state-of-the-art large language models. InNeurIPS 2024 Workshop on Scientific Methods for Understanding Deep Learning, 2024

  26. [26]

    Exact learning of arithmetic with differentiable agents

    Hristo Papazov, Francesco D’Angelo, and Nicolas Flammarion. Exact learning of arithmetic with differentiable agents. InThe 5th Workshop on Mathematical Reasoning and AI at NeurIPS 2025, 2025

  27. [27]

    On the limits of proper learnability of subclasses of DNF formulas

    Krishnan Pillaipakkamnatt and Vijay Raghavan. On the limits of proper learnability of subclasses of DNF formulas. InProceedings of the seventh annual conference on Computational learning theory, pages 118–129, 1994

  28. [28]

    On the limits of efficient teachability.Information Processing Letters, 79(6):267–272, 2001

    Rocco A Servedio. On the limits of efficient teachability.Information Processing Letters, 79(6):267–272, 2001

  29. [29]

    Teachabilityincomputationallearning.New Generation Computing, 8:337–347, 1991

    AyumiShinoharaandSatoruMiyano. Teachabilityincomputationallearning.New Generation Computing, 8:337–347, 1991. doi: 10.1007/BF03037091

  30. [30]

    Models of cooperative teaching and learning.Journal of Machine Learning Research, 12:349–384, 2011

    Sandra Zilles, Steffen Lange, Robert Holte, and Martin Zinkevich. Models of cooperative teaching and learning.Journal of Machine Learning Research, 12:349–384, 2011. 12 A Generic certification tools A.1 Disjoint-block principle Proposition A.1(Restatement of Proposition 4.1).Let X be a finite domain, letH⊆{h:X→{0, 1}}, and letf⋆∈H. LetT be a finite index ...

  31. [31]

    scan inputs in increasing binary order until finding the first disagreement point among the surviving hypotheses

  32. [32]

    keep the smaller label side

  33. [33]

    36 The process stops when one hypothesis remains

    if the split is tied, keep label0. 36 The process stops when one hypothesis remains. Thus, for eachn, the procedure returns a target in the base class and a certificate for that target inside the base class. In the worst case, halving uses logarithmically many rounds in the size of the semantic class. Table 1 reports the exact semantic class sizes and the...