Neural Networks With Dense Weights Are Not Universal Approximators

Levi Rauchwerger; Ron Levie; Stefanie Jegelka

arxiv: 2602.07618 · v6 · pith:OWPK4BFYnew · submitted 2026-02-07 · 💻 cs.LG · stat.ML

Neural Networks With Dense Weights Are Not Universal Approximators

Levi Rauchwerger , Stefanie Jegelka , Ron Levie This is my paper

Pith reviewed 2026-05-21 13:28 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords neural networksuniversal approximationdense connectivityReLU networksLipschitz functionsgraph neural networksregularity lemmamodel compression

0 comments

The pith

ReLU networks with dense weight constraints cannot approximate all Lipschitz continuous functions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that ReLU neural networks under natural constraints on weights and input-output dimensions, which capture dense connectivity, lack the universal approximation property for continuous functions. Classic results show unrestricted networks can approximate any continuous function, but here the restrictions prevent approximation of certain Lipschitz functions. The proof interprets the network as a message-passing graph neural network and applies the weak regularity lemma from graph theory to demonstrate that dense structures force a form of model compression that erases necessary information. If correct, this means achieving full expressive power requires sparsity rather than relying on dense layers alone.

Core claim

We demonstrate the existence of Lipschitz continuous functions that cannot be approximated by ReLU neural networks subject to natural constraints on weights and input and output dimensions modeling dense connectivity. This follows from a model compression argument that combines the weak regularity lemma with an interpretation of feedforward networks as message passing graph neural networks.

What carries the argument

Interpretation of feedforward neural networks as message-passing graph neural networks, combined with the weak regularity lemma to derive a compression that reveals approximation limits.

If this is right

Dense layers impose intrinsic limits on the class of functions a neural network can approximate.
Sparse connectivity becomes necessary to recover the full universal approximation capability.
Model compression via graph regularity lemmas can expose hidden limitations in standard dense architectures.
Network design must treat sparsity as a core ingredient rather than an optional efficiency trick.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid architectures mixing dense and sparse layers could be tested to see if they restore universality while retaining some dense benefits.
The result may help explain why attention-based or graph-structured models succeed on tasks where plain dense feedforward nets struggle.
Empirical checks could involve trying to approximate functions with sharp local variations using only dense constrained ReLU nets of increasing size.

Load-bearing premise

The chosen natural constraints on weights and dimensions correctly model dense connectivity and allow the feedforward network to be treated as a message-passing graph neural network for the weak regularity lemma.

What would settle it

Constructing or exhibiting one specific Lipschitz continuous function that a sequence of increasingly large dense ReLU networks under the stated weight and dimension constraints can approximate to arbitrary accuracy would falsify the main claim.

Figures

Figures reproduced from arXiv: 2602.07618 by Levi Rauchwerger, Ron Levie, Stefanie Jegelka.

**Figure 1.** Figure 1: A computational kernel-signal with hidden dimension 𝑑 = 8. intervals. The partition R (0) 𝑑 is the refinement of the coarser interval equipartition C in 𝑑0 of the interval 𝑈 (0) into 𝑑0 intervals of length 1/(𝐿 + 2)𝑑0 called the input partition. The partition R (𝐿) 𝑑 is a refinement of the coarser interval equipartition C out 𝑑𝐿 of the interval 𝑈 (𝐿) into 𝑑𝐿 intervals of length 1/(𝐿 + 2)𝑑𝐿 called the outpu… view at source ↗

read the original abstract

We investigate the approximation capabilities of dense neural networks. While universal approximation theorems establish that sufficiently large architectures can approximate arbitrary continuous functions if there are no restrictions on the weight values, we show that dense neural networks do not possess this universality. Our argument is based on a model compression approach, combining the weak regularity lemma with an interpretation of feedforward networks as message passing graph neural networks. We consider ReLU neural networks subject to natural constraints on weights and input and output dimensions, which model a notion of dense connectivity. Within this setting, we demonstrate the existence of Lipschitz continuous functions that cannot be approximated by such networks. This highlights intrinsic limitations of neural networks with dense layers and motivates the use of sparse connectivity as a necessary ingredient for achieving true universality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that ReLU neural networks subject to natural constraints on weights and input/output dimensions (modeling dense connectivity) are not universal approximators for Lipschitz continuous functions. The argument proceeds via a model-compression approach that interprets such feedforward networks as message-passing graph neural networks and invokes the weak regularity lemma to obtain a regularity obstruction.

Significance. If the central modeling step is valid, the result is significant: it supplies a concrete theoretical limitation on dense layers and supplies a motivation for sparsity as a necessary ingredient for universality. The combination of the weak regularity lemma with a GNN recasting of dense layers is a distinctive technical device; when the equivalence is made fully explicit it would constitute a reusable technique for proving non-universality results.

major comments (1)

[Section defining the GNN interpretation and constraints] The load-bearing step is the claim that the chosen natural constraints on weights plus input/output dimensions induce a complete-graph message-passing structure whose updates exactly reproduce the matrix multiplications of a standard dense ReLU layer. Without an explicit verification (e.g., an equation or diagram in the section defining the GNN interpretation) showing that node features, message aggregation, and weight reuse match the dense-layer computation under the stated dimension bounds, the weak regularity lemma cannot be applied to ordinary dense ReLU networks. This equivalence must be established before the compression obstruction can be transferred to the target Lipschitz functions.

minor comments (1)

[Abstract] The abstract refers to “natural constraints” without a forward pointer to their precise statement; a one-sentence definition or equation reference should appear in the abstract or immediately after the first use of the term.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on our manuscript. We appreciate the recognition of the potential significance of the result and the distinctive technical approach. We address the major comment below and will incorporate the requested clarification in the revision.

read point-by-point responses

Referee: [Section defining the GNN interpretation and constraints] The load-bearing step is the claim that the chosen natural constraints on weights plus input/output dimensions induce a complete-graph message-passing structure whose updates exactly reproduce the matrix multiplications of a standard dense ReLU layer. Without an explicit verification (e.g., an equation or diagram in the section defining the GNN interpretation) showing that node features, message aggregation, and weight reuse match the dense-layer computation under the stated dimension bounds, the weak regularity lemma cannot be applied to ordinary dense ReLU networks. This equivalence must be established before the compression obstruction can be transferred to the target Lipschitz functions.

Authors: We agree that an explicit verification of the equivalence is necessary to rigorously transfer the obstruction to standard dense ReLU networks. In the revised manuscript we will expand the section on the GNN interpretation to include detailed equations showing the correspondence between node features and layer activations, the precise form of message aggregation that reproduces the matrix-vector product, and the reuse of weights under the stated input/output dimension bounds. A diagram illustrating the complete-graph message-passing structure for a single dense layer will also be added. These additions will make the application of the weak regularity lemma fully transparent without changing the main theorems or proofs. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external lemma and modeling interpretation

full rationale

The paper derives non-universality for Lipschitz functions by applying the weak regularity lemma to an interpretation of constrained dense ReLU feedforward networks as message-passing GNNs under natural weight and dimension constraints that model dense connectivity. This chain does not reduce any prediction or central claim to a self-definitional equation, a fitted parameter renamed as a prediction, or a load-bearing self-citation whose content is unverified. The weak regularity lemma is an external tool, and the GNN recasting is presented as a modeling choice rather than an ansatz smuggled from prior author work or a renaming of a known result. The argument remains self-contained against external benchmarks and does not force the target non-universality result by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard mathematical background without introducing new free parameters or postulated entities; the constraints are presented as natural modeling choices.

axioms (2)

standard math ReLU activation functions are continuous and piecewise linear; target functions are Lipschitz continuous.
These are background facts from real analysis and neural network theory invoked to set up the approximation question.
domain assumption The weak regularity lemma applies to the graph representation of the network.
Invoked as part of the model compression argument in the abstract.

pith-pipeline@v0.9.0 · 5652 in / 1284 out tokens · 43763 ms · 2026-05-21T13:28:53.607707+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our argument is based on a model compression approach, combining the weak regularity lemma with an interpretation of feedforward networks as message passing graph neural networks... N N dense(B, L, d0, dL) is not a universal approximator of Lip(d0, dL) (Theorem 9).
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We consider ReLU neural networks subject to natural constraints on weights and input and output dimensions, which model a notion of dense connectivity.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages

[1]

Nearly-tight VC-dimension bounds for piecewise linear neural networks

Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension bounds for piecewise linear neural networks. In Satyen Kale and Ohad Shamir, editors,Proceedings of the 2017 Conference on Learning Theory, volume 65 ofProceedings of Machine Learning Research, pages 1064–1068. PMLR, 07–10 Jul

work page 2017
[2]

Daniel Herbst and Stefanie Jegelka

doi: 10.48550/arXiv.2206.08684. Daniel Herbst and Stefanie Jegelka. Higher-Order Graphon Neural Networks: Approximation and Cut Distance.Int. Conf. on Learning Representations (ICLR),

work page doi:10.48550/arxiv.2206.08684
[3]

Higher-order graphon neural networks: Approximation and cut distance.arXiv preprint arXiv:2503.14338, 2025

Spotlight Paper. arXiv:2503.14338. Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. InNIPS Deep Learning and Representation Learning Workshop,

work page arXiv
[4]

Deep learning,

doi: 10.1038/nature14539. Namhoon Lee, Thalaiyasingam Ajanthan, and Philip H. S. Torr. Snip: Single-shot Network Pruning based on Connection Sensitivity. InInt. Conf. on Learning Representations (ICLR). OpenReview.net,

work page doi:10.1038/nature14539
[5]

doi: https://doi.org/10.1016/S0893-6080(05)80131-5

ISSN 0893-6080. doi: https://doi.org/10.1016/S0893-6080(05)80131-5. Ron Levie. A Graphon-Signal Analysis of Graph Neural Networks.Advances in Neural Information Processing Systems (NIPS),

work page doi:10.1016/s0893-6080(05)80131-5
[6]

Cl´audio M

doi: 10.1007/s00039-007-0599-6. Cl´audio M. S. Medeiros and Guilherme A. Barreto. An efficient method for pruning the multilayer perceptron based on the correlation of errors. In Joaquim Marques de S ´a, Lu´ıs A. Alexandre, W lodzis law Duch, and Danilo Mandic, editors,Artificial Neural Networks – ICANN 2007, pages 219–228, Berlin, Heidelberg,

work page doi:10.1007/s00039-007-0599-6 2007
[7]

doi: 10.1145/3578938

ISSN 0360-0300. doi: 10.1145/3578938. C. Merkwirth and T. Lengauer. Automatic generation of complementary descriptors with molecular graph networks. Journal of Chemical Information and Modeling, 45(5):1159–1168,

work page doi:10.1145/3578938
[8]

C.; Mocanu, E.; Stone, P.; Nguyen, P

doi: 10.1038/s41467-018-04316-3. Elena-Alexandra Peste.Efficiency and Generalization of Sparse Neural Networks. Phd thesis, Institute of Science and Technology Austria,

work page doi:10.1038/s41467-018-04316-3
[9]

Approximation theory of the MLP model in neural net works

doi: 10.1017/S0962492900002919. Levi Rauchwerger and Ron Levie. A Note on Graphon-Signal Analysis of Graph Neural Networks.arXiv preprint,

work page doi:10.1017/s0962492900002919
[10]

Ugo Tanielian, Maxime Sangnier, and G´erard Biau

doi: 10.48550/arXiv.2305.10498. Ugo Tanielian, Maxime Sangnier, and G´erard Biau. Approximating lipschitz continuous functions with groupsort neural networks.International Conference on Artificial Intelligence and Statistics, abs/2006.05254,

work page doi:10.48550/arxiv.2305.10498 2006
[11]

and Manso, G.F., 2020

Neil C. Thompson, Kristjan H. Greenewald, Keeheon Lee, and Gabriel F. Manso. The Computational Limits of Deep Learning.Ninth Workshop on Computing within Limits, abs/2007.05558,

work page arXiv 2007
[12]

Then, for any input x∈R 𝑑0 and output channel𝑖∈ [𝑑 𝐿], we have Θ(W,b) (x)𝑖 = Φ 𝐵,𝐿 𝐾 (W,b) , 𝑓 x 𝑣 , whenever𝑣is in the𝑖th interval ofC out 𝑑𝐿

Lemma 4.Let (W,b) be the dense parameters of Θ(W,b) ∈ N N dense(𝐵, 𝐿, 𝑑 0, 𝑑𝐿). Then, for any input x∈R 𝑑0 and output channel𝑖∈ [𝑑 𝐿], we have Θ(W,b) (x)𝑖 = Φ 𝐵,𝐿 𝐾 (W,b) , 𝑓 x 𝑣 , whenever𝑣is in the𝑖th interval ofC out 𝑑𝐿 . Proof. We now show that forℓ=0, . . . , 𝐿 the activation of the ℓth hidden layer h(ℓ) 𝑖 in channel 𝑖∈ [𝑑 ℓ] of the network Θ(W,b) is...

work page 2023
[13]

We first recall some useful lemmas

to our setting in order to obtain an explicit bound on the Lipschitz constant. We first recall some useful lemmas. Lemma 11(Levie [2023], Equation (9)).Let𝑓:[0,1] →Rbe measurable. Then 1 2 ∥𝑓∥ 1 ≤ ∥𝑓∥ □ ≤ ∥𝑓∥

work page 2023
[14]

Lemma 12(Levie [2023], Lemma F.5/Lov ´asz [2012], Lemma 8.10).Let𝑄:[0,1] 2 →Rbe measurable

Denote by𝐿 + [0,1]the space of measurable functions𝑓:[0,1] → [0,1]. Lemma 12(Levie [2023], Lemma F.5/Lov ´asz [2012], Lemma 8.10).Let𝑄:[0,1] 2 →Rbe measurable. Then ∥𝑄∥ □ =sup 𝑓 ,𝑔∈𝐿 + [0,1] ∫ [0,1] 2 𝑓(𝑥)𝑄(𝑥, 𝑦)𝑔(𝑦)d𝑥d𝑦 , where the supremum is attained for some𝑓 , 𝑔∈𝐿 + [0,1]with values in{0,1}. Lemma 13(Cut Norm Lipschitz Continuity).Let 𝐵∈R and Φ𝐵,𝐿 be...

work page 2023
[15]

analyst’s version

Theorem 5(Computational Cut Distance Lipschitz Continuity).Let 𝐵 >0 and let Φ𝐵,𝐿 be the 𝐿-layer 𝐵-IR-MPNN. Let(𝐾, 𝑓),(𝐽, 𝑔) ∈ CK (𝐵, 𝐿, 𝑑 0, 𝑑𝐿) × CS (𝐿, 𝑑 0, 𝑑𝐿)be two computational kernel-signals. Let Out𝐾 , 𝑓 :=1 𝑈 (𝐿) Φ𝐵,𝐿 (𝐾, 𝑓),Out 𝐽 ,𝑔 :=1 𝑈 (𝐿) Φ𝐵,𝐿 (𝐽, 𝑔). Then ∥Out𝐾 , 𝑓 −Out 𝐽 ,𝑔 ∥□ ≤2 𝐿 𝐵𝐿 𝛿comp □ (𝐾, 𝑓),(𝐽, 𝑔) . Proof. Consider an arbitrary me...

work page 2025
[16]

Lemma 17.LetP 𝑛 ={𝑃 1,

as Lemma B.10. Lemma 17.LetP 𝑛 ={𝑃 1, . . . , 𝑃𝑛}be a partition of[0,1], and Let𝑉, 𝑅∈ S 2 P𝑛 ∩ W1. Then, the supremum of sup 𝑆,𝑇⊂ [0,1] ∫ 𝑆 ∫ 𝑇 𝑉(𝑥, 𝑦) −𝑅(𝑥, 𝑦) 𝑑𝑥𝑑𝑦 is attained for𝑆, 𝑇of the form 𝑆= Ø 𝑖∈𝑠 𝑃𝑖 , 𝑇= Ø 𝑗∈𝑡 𝑃 𝑗 , where𝑡, 𝑠⊂ [𝑛]. As part of the proof of Theorem 6, we first show that any computational kernel can be approximated by a coarse step...

work page 2023
[17]

Hence, Θ(x) 𝑗 −Θ ′ (x) 𝑗 < 𝜖

(𝐿+2)𝑑 𝐿 ∥Out𝐾 , 𝑓x −Out 𝐾 ′ , 𝑓x ∥□ ≤ (𝐿+2)𝑑 𝐿 (2𝐵) 𝐿 𝛿comp □ (𝐾, 𝐾 ′)< 𝜖 . Hence, Θ(x) 𝑗 −Θ ′ (x) 𝑗 < 𝜖 . 22 C Implications for Approximation In this section, we prove our main results, providing explicit conditions under which dense deep ReLU networks fail to be universal. In other words, for sufficiently large input dimension𝑑0,any dense ReLU network ...

work page 2017
[18]

Then, we use these two results to show that for a large enough input dimension, no ReLU architecture can approximate all functions with a small𝜖 error

provides lower bounds on the number of computational units a network architecture must have in order to be able to approximate any function in Lip(𝑑 0, 𝑑𝐿). Then, we use these two results to show that for a large enough input dimension, no ReLU architecture can approximate all functions with a small𝜖 error. This highlights an inherent limitation of dense ...

work page 2017
[19]

Theorem 8 gives a result close to the first part of Yarotsky [2017], Theorem 4, but with explicit constants restricted to Lipschitz continuous functions. Recall that for a hypothesis class H of Boolean functions on [0,1] 𝑑, theVC-dimensionis the size of the largest subset 𝑆⊂ [0,1] 𝑑 that can be shattered byH, i.e., on whichHcan realize all dichotomies (se...

work page 2017
[20]

27 D Experiments We empirically test the theoretical prediction that increasing width in dense networks leads to early saturation in performance

This contradiction completes the proof. 27 D Experiments We empirically test the theoretical prediction that increasing width in dense networks leads to early saturation in performance. We use fully connected one-hidden-layer ReLU networks in PyTorch, trained on MNIST with the Adam optimizer (learning rate 10−3, batch size 128, 20 epochs). We consider two...

work page 2048
[21]

the𝑗th cell of the partition

and Rauchwerger and Levie [2025], connecting graph-based neural computations to kernel-based neural computations. Graph Neural Networksand specificallymessage passing networks (MPNNs), form a class of neural networks designed to process graph-structured data [Xu et al., 2019, Rossi et al., 2023] by iteratively updating node embeddings through the exchange...

work page 2025
[22]

Levie [2023], Rauchwerger and Levie

work page 2023
[23]

Appendix E in Rauchwerger and Levie [2025])

show that applying an MPNN to an attributed graph and then inducing an attributed kernel yields the same representation as first inducing the attributed kernel and then applying the MPNN (see e.g. Appendix E in Rauchwerger and Levie [2025]). A direct result of this theorem in our setting is stated as follows. Lemma 22.Let 𝐿∈N and 𝐵∈R . For any [−𝐵, 𝐵] -we...

work page 2025

[1] [1]

Nearly-tight VC-dimension bounds for piecewise linear neural networks

Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension bounds for piecewise linear neural networks. In Satyen Kale and Ohad Shamir, editors,Proceedings of the 2017 Conference on Learning Theory, volume 65 ofProceedings of Machine Learning Research, pages 1064–1068. PMLR, 07–10 Jul

work page 2017

[2] [2]

Daniel Herbst and Stefanie Jegelka

doi: 10.48550/arXiv.2206.08684. Daniel Herbst and Stefanie Jegelka. Higher-Order Graphon Neural Networks: Approximation and Cut Distance.Int. Conf. on Learning Representations (ICLR),

work page doi:10.48550/arxiv.2206.08684

[3] [3]

Higher-order graphon neural networks: Approximation and cut distance.arXiv preprint arXiv:2503.14338, 2025

Spotlight Paper. arXiv:2503.14338. Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. InNIPS Deep Learning and Representation Learning Workshop,

work page arXiv

[4] [4]

Deep learning,

doi: 10.1038/nature14539. Namhoon Lee, Thalaiyasingam Ajanthan, and Philip H. S. Torr. Snip: Single-shot Network Pruning based on Connection Sensitivity. InInt. Conf. on Learning Representations (ICLR). OpenReview.net,

work page doi:10.1038/nature14539

[5] [5]

doi: https://doi.org/10.1016/S0893-6080(05)80131-5

ISSN 0893-6080. doi: https://doi.org/10.1016/S0893-6080(05)80131-5. Ron Levie. A Graphon-Signal Analysis of Graph Neural Networks.Advances in Neural Information Processing Systems (NIPS),

work page doi:10.1016/s0893-6080(05)80131-5

[6] [6]

Cl´audio M

doi: 10.1007/s00039-007-0599-6. Cl´audio M. S. Medeiros and Guilherme A. Barreto. An efficient method for pruning the multilayer perceptron based on the correlation of errors. In Joaquim Marques de S ´a, Lu´ıs A. Alexandre, W lodzis law Duch, and Danilo Mandic, editors,Artificial Neural Networks – ICANN 2007, pages 219–228, Berlin, Heidelberg,

work page doi:10.1007/s00039-007-0599-6 2007

[7] [7]

doi: 10.1145/3578938

ISSN 0360-0300. doi: 10.1145/3578938. C. Merkwirth and T. Lengauer. Automatic generation of complementary descriptors with molecular graph networks. Journal of Chemical Information and Modeling, 45(5):1159–1168,

work page doi:10.1145/3578938

[8] [8]

C.; Mocanu, E.; Stone, P.; Nguyen, P

doi: 10.1038/s41467-018-04316-3. Elena-Alexandra Peste.Efficiency and Generalization of Sparse Neural Networks. Phd thesis, Institute of Science and Technology Austria,

work page doi:10.1038/s41467-018-04316-3

[9] [9]

Approximation theory of the MLP model in neural net works

doi: 10.1017/S0962492900002919. Levi Rauchwerger and Ron Levie. A Note on Graphon-Signal Analysis of Graph Neural Networks.arXiv preprint,

work page doi:10.1017/s0962492900002919

[10] [10]

Ugo Tanielian, Maxime Sangnier, and G´erard Biau

doi: 10.48550/arXiv.2305.10498. Ugo Tanielian, Maxime Sangnier, and G´erard Biau. Approximating lipschitz continuous functions with groupsort neural networks.International Conference on Artificial Intelligence and Statistics, abs/2006.05254,

work page doi:10.48550/arxiv.2305.10498 2006

[11] [11]

and Manso, G.F., 2020

Neil C. Thompson, Kristjan H. Greenewald, Keeheon Lee, and Gabriel F. Manso. The Computational Limits of Deep Learning.Ninth Workshop on Computing within Limits, abs/2007.05558,

work page arXiv 2007

[12] [12]

Then, for any input x∈R 𝑑0 and output channel𝑖∈ [𝑑 𝐿], we have Θ(W,b) (x)𝑖 = Φ 𝐵,𝐿 𝐾 (W,b) , 𝑓 x 𝑣 , whenever𝑣is in the𝑖th interval ofC out 𝑑𝐿

Lemma 4.Let (W,b) be the dense parameters of Θ(W,b) ∈ N N dense(𝐵, 𝐿, 𝑑 0, 𝑑𝐿). Then, for any input x∈R 𝑑0 and output channel𝑖∈ [𝑑 𝐿], we have Θ(W,b) (x)𝑖 = Φ 𝐵,𝐿 𝐾 (W,b) , 𝑓 x 𝑣 , whenever𝑣is in the𝑖th interval ofC out 𝑑𝐿 . Proof. We now show that forℓ=0, . . . , 𝐿 the activation of the ℓth hidden layer h(ℓ) 𝑖 in channel 𝑖∈ [𝑑 ℓ] of the network Θ(W,b) is...

work page 2023

[13] [13]

We first recall some useful lemmas

to our setting in order to obtain an explicit bound on the Lipschitz constant. We first recall some useful lemmas. Lemma 11(Levie [2023], Equation (9)).Let𝑓:[0,1] →Rbe measurable. Then 1 2 ∥𝑓∥ 1 ≤ ∥𝑓∥ □ ≤ ∥𝑓∥

work page 2023

[14] [14]

Lemma 12(Levie [2023], Lemma F.5/Lov ´asz [2012], Lemma 8.10).Let𝑄:[0,1] 2 →Rbe measurable

Denote by𝐿 + [0,1]the space of measurable functions𝑓:[0,1] → [0,1]. Lemma 12(Levie [2023], Lemma F.5/Lov ´asz [2012], Lemma 8.10).Let𝑄:[0,1] 2 →Rbe measurable. Then ∥𝑄∥ □ =sup 𝑓 ,𝑔∈𝐿 + [0,1] ∫ [0,1] 2 𝑓(𝑥)𝑄(𝑥, 𝑦)𝑔(𝑦)d𝑥d𝑦 , where the supremum is attained for some𝑓 , 𝑔∈𝐿 + [0,1]with values in{0,1}. Lemma 13(Cut Norm Lipschitz Continuity).Let 𝐵∈R and Φ𝐵,𝐿 be...

work page 2023

[15] [15]

analyst’s version

Theorem 5(Computational Cut Distance Lipschitz Continuity).Let 𝐵 >0 and let Φ𝐵,𝐿 be the 𝐿-layer 𝐵-IR-MPNN. Let(𝐾, 𝑓),(𝐽, 𝑔) ∈ CK (𝐵, 𝐿, 𝑑 0, 𝑑𝐿) × CS (𝐿, 𝑑 0, 𝑑𝐿)be two computational kernel-signals. Let Out𝐾 , 𝑓 :=1 𝑈 (𝐿) Φ𝐵,𝐿 (𝐾, 𝑓),Out 𝐽 ,𝑔 :=1 𝑈 (𝐿) Φ𝐵,𝐿 (𝐽, 𝑔). Then ∥Out𝐾 , 𝑓 −Out 𝐽 ,𝑔 ∥□ ≤2 𝐿 𝐵𝐿 𝛿comp □ (𝐾, 𝑓),(𝐽, 𝑔) . Proof. Consider an arbitrary me...

work page 2025

[16] [16]

Lemma 17.LetP 𝑛 ={𝑃 1,

as Lemma B.10. Lemma 17.LetP 𝑛 ={𝑃 1, . . . , 𝑃𝑛}be a partition of[0,1], and Let𝑉, 𝑅∈ S 2 P𝑛 ∩ W1. Then, the supremum of sup 𝑆,𝑇⊂ [0,1] ∫ 𝑆 ∫ 𝑇 𝑉(𝑥, 𝑦) −𝑅(𝑥, 𝑦) 𝑑𝑥𝑑𝑦 is attained for𝑆, 𝑇of the form 𝑆= Ø 𝑖∈𝑠 𝑃𝑖 , 𝑇= Ø 𝑗∈𝑡 𝑃 𝑗 , where𝑡, 𝑠⊂ [𝑛]. As part of the proof of Theorem 6, we first show that any computational kernel can be approximated by a coarse step...

work page 2023

[17] [17]

Hence, Θ(x) 𝑗 −Θ ′ (x) 𝑗 < 𝜖

(𝐿+2)𝑑 𝐿 ∥Out𝐾 , 𝑓x −Out 𝐾 ′ , 𝑓x ∥□ ≤ (𝐿+2)𝑑 𝐿 (2𝐵) 𝐿 𝛿comp □ (𝐾, 𝐾 ′)< 𝜖 . Hence, Θ(x) 𝑗 −Θ ′ (x) 𝑗 < 𝜖 . 22 C Implications for Approximation In this section, we prove our main results, providing explicit conditions under which dense deep ReLU networks fail to be universal. In other words, for sufficiently large input dimension𝑑0,any dense ReLU network ...

work page 2017

[18] [18]

Then, we use these two results to show that for a large enough input dimension, no ReLU architecture can approximate all functions with a small𝜖 error

provides lower bounds on the number of computational units a network architecture must have in order to be able to approximate any function in Lip(𝑑 0, 𝑑𝐿). Then, we use these two results to show that for a large enough input dimension, no ReLU architecture can approximate all functions with a small𝜖 error. This highlights an inherent limitation of dense ...

work page 2017

[19] [19]

Theorem 8 gives a result close to the first part of Yarotsky [2017], Theorem 4, but with explicit constants restricted to Lipschitz continuous functions. Recall that for a hypothesis class H of Boolean functions on [0,1] 𝑑, theVC-dimensionis the size of the largest subset 𝑆⊂ [0,1] 𝑑 that can be shattered byH, i.e., on whichHcan realize all dichotomies (se...

work page 2017

[20] [20]

27 D Experiments We empirically test the theoretical prediction that increasing width in dense networks leads to early saturation in performance

This contradiction completes the proof. 27 D Experiments We empirically test the theoretical prediction that increasing width in dense networks leads to early saturation in performance. We use fully connected one-hidden-layer ReLU networks in PyTorch, trained on MNIST with the Adam optimizer (learning rate 10−3, batch size 128, 20 epochs). We consider two...

work page 2048

[21] [21]

the𝑗th cell of the partition

and Rauchwerger and Levie [2025], connecting graph-based neural computations to kernel-based neural computations. Graph Neural Networksand specificallymessage passing networks (MPNNs), form a class of neural networks designed to process graph-structured data [Xu et al., 2019, Rossi et al., 2023] by iteratively updating node embeddings through the exchange...

work page 2025

[22] [22]

Levie [2023], Rauchwerger and Levie

work page 2023

[23] [23]

Appendix E in Rauchwerger and Levie [2025])

show that applying an MPNN to an attributed graph and then inducing an attributed kernel yields the same representation as first inducing the attributed kernel and then applying the MPNN (see e.g. Appendix E in Rauchwerger and Levie [2025]). A direct result of this theorem in our setting is stated as follows. Lemma 22.Let 𝐿∈N and 𝐵∈R . For any [−𝐵, 𝐵] -we...

work page 2025