pith. sign in

arxiv: 2602.07618 · v6 · pith:OWPK4BFYnew · submitted 2026-02-07 · 💻 cs.LG · stat.ML

Neural Networks With Dense Weights Are Not Universal Approximators

Pith reviewed 2026-05-21 13:28 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords neural networksuniversal approximationdense connectivityReLU networksLipschitz functionsgraph neural networksregularity lemmamodel compression
0
0 comments X

The pith

ReLU networks with dense weight constraints cannot approximate all Lipschitz continuous functions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that ReLU neural networks under natural constraints on weights and input-output dimensions, which capture dense connectivity, lack the universal approximation property for continuous functions. Classic results show unrestricted networks can approximate any continuous function, but here the restrictions prevent approximation of certain Lipschitz functions. The proof interprets the network as a message-passing graph neural network and applies the weak regularity lemma from graph theory to demonstrate that dense structures force a form of model compression that erases necessary information. If correct, this means achieving full expressive power requires sparsity rather than relying on dense layers alone.

Core claim

We demonstrate the existence of Lipschitz continuous functions that cannot be approximated by ReLU neural networks subject to natural constraints on weights and input and output dimensions modeling dense connectivity. This follows from a model compression argument that combines the weak regularity lemma with an interpretation of feedforward networks as message passing graph neural networks.

What carries the argument

Interpretation of feedforward neural networks as message-passing graph neural networks, combined with the weak regularity lemma to derive a compression that reveals approximation limits.

If this is right

  • Dense layers impose intrinsic limits on the class of functions a neural network can approximate.
  • Sparse connectivity becomes necessary to recover the full universal approximation capability.
  • Model compression via graph regularity lemmas can expose hidden limitations in standard dense architectures.
  • Network design must treat sparsity as a core ingredient rather than an optional efficiency trick.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid architectures mixing dense and sparse layers could be tested to see if they restore universality while retaining some dense benefits.
  • The result may help explain why attention-based or graph-structured models succeed on tasks where plain dense feedforward nets struggle.
  • Empirical checks could involve trying to approximate functions with sharp local variations using only dense constrained ReLU nets of increasing size.

Load-bearing premise

The chosen natural constraints on weights and dimensions correctly model dense connectivity and allow the feedforward network to be treated as a message-passing graph neural network for the weak regularity lemma.

What would settle it

Constructing or exhibiting one specific Lipschitz continuous function that a sequence of increasingly large dense ReLU networks under the stated weight and dimension constraints can approximate to arbitrary accuracy would falsify the main claim.

Figures

Figures reproduced from arXiv: 2602.07618 by Levi Rauchwerger, Ron Levie, Stefanie Jegelka.

Figure 1
Figure 1. Figure 1: A computational kernel-signal with hidden dimension 𝑑 = 8. intervals. The partition R (0) 𝑑 is the refinement of the coarser interval equipartition C in 𝑑0 of the interval 𝑈 (0) into 𝑑0 intervals of length 1/(𝐿 + 2)𝑑0 called the input partition. The partition R (𝐿) 𝑑 is a refinement of the coarser interval equipartition C out 𝑑𝐿 of the interval 𝑈 (𝐿) into 𝑑𝐿 intervals of length 1/(𝐿 + 2)𝑑𝐿 called the outpu… view at source ↗
read the original abstract

We investigate the approximation capabilities of dense neural networks. While universal approximation theorems establish that sufficiently large architectures can approximate arbitrary continuous functions if there are no restrictions on the weight values, we show that dense neural networks do not possess this universality. Our argument is based on a model compression approach, combining the weak regularity lemma with an interpretation of feedforward networks as message passing graph neural networks. We consider ReLU neural networks subject to natural constraints on weights and input and output dimensions, which model a notion of dense connectivity. Within this setting, we demonstrate the existence of Lipschitz continuous functions that cannot be approximated by such networks. This highlights intrinsic limitations of neural networks with dense layers and motivates the use of sparse connectivity as a necessary ingredient for achieving true universality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that ReLU neural networks subject to natural constraints on weights and input/output dimensions (modeling dense connectivity) are not universal approximators for Lipschitz continuous functions. The argument proceeds via a model-compression approach that interprets such feedforward networks as message-passing graph neural networks and invokes the weak regularity lemma to obtain a regularity obstruction.

Significance. If the central modeling step is valid, the result is significant: it supplies a concrete theoretical limitation on dense layers and supplies a motivation for sparsity as a necessary ingredient for universality. The combination of the weak regularity lemma with a GNN recasting of dense layers is a distinctive technical device; when the equivalence is made fully explicit it would constitute a reusable technique for proving non-universality results.

major comments (1)
  1. [Section defining the GNN interpretation and constraints] The load-bearing step is the claim that the chosen natural constraints on weights plus input/output dimensions induce a complete-graph message-passing structure whose updates exactly reproduce the matrix multiplications of a standard dense ReLU layer. Without an explicit verification (e.g., an equation or diagram in the section defining the GNN interpretation) showing that node features, message aggregation, and weight reuse match the dense-layer computation under the stated dimension bounds, the weak regularity lemma cannot be applied to ordinary dense ReLU networks. This equivalence must be established before the compression obstruction can be transferred to the target Lipschitz functions.
minor comments (1)
  1. [Abstract] The abstract refers to “natural constraints” without a forward pointer to their precise statement; a one-sentence definition or equation reference should appear in the abstract or immediately after the first use of the term.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on our manuscript. We appreciate the recognition of the potential significance of the result and the distinctive technical approach. We address the major comment below and will incorporate the requested clarification in the revision.

read point-by-point responses
  1. Referee: [Section defining the GNN interpretation and constraints] The load-bearing step is the claim that the chosen natural constraints on weights plus input/output dimensions induce a complete-graph message-passing structure whose updates exactly reproduce the matrix multiplications of a standard dense ReLU layer. Without an explicit verification (e.g., an equation or diagram in the section defining the GNN interpretation) showing that node features, message aggregation, and weight reuse match the dense-layer computation under the stated dimension bounds, the weak regularity lemma cannot be applied to ordinary dense ReLU networks. This equivalence must be established before the compression obstruction can be transferred to the target Lipschitz functions.

    Authors: We agree that an explicit verification of the equivalence is necessary to rigorously transfer the obstruction to standard dense ReLU networks. In the revised manuscript we will expand the section on the GNN interpretation to include detailed equations showing the correspondence between node features and layer activations, the precise form of message aggregation that reproduces the matrix-vector product, and the reuse of weights under the stated input/output dimension bounds. A diagram illustrating the complete-graph message-passing structure for a single dense layer will also be added. These additions will make the application of the weak regularity lemma fully transparent without changing the main theorems or proofs. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external lemma and modeling interpretation

full rationale

The paper derives non-universality for Lipschitz functions by applying the weak regularity lemma to an interpretation of constrained dense ReLU feedforward networks as message-passing GNNs under natural weight and dimension constraints that model dense connectivity. This chain does not reduce any prediction or central claim to a self-definitional equation, a fitted parameter renamed as a prediction, or a load-bearing self-citation whose content is unverified. The weak regularity lemma is an external tool, and the GNN recasting is presented as a modeling choice rather than an ansatz smuggled from prior author work or a renaming of a known result. The argument remains self-contained against external benchmarks and does not force the target non-universality result by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard mathematical background without introducing new free parameters or postulated entities; the constraints are presented as natural modeling choices.

axioms (2)
  • standard math ReLU activation functions are continuous and piecewise linear; target functions are Lipschitz continuous.
    These are background facts from real analysis and neural network theory invoked to set up the approximation question.
  • domain assumption The weak regularity lemma applies to the graph representation of the network.
    Invoked as part of the model compression argument in the abstract.

pith-pipeline@v0.9.0 · 5652 in / 1284 out tokens · 43763 ms · 2026-05-21T13:28:53.607707+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages

  1. [1]

    Nearly-tight VC-dimension bounds for piecewise linear neural networks

    Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension bounds for piecewise linear neural networks. In Satyen Kale and Ohad Shamir, editors,Proceedings of the 2017 Conference on Learning Theory, volume 65 ofProceedings of Machine Learning Research, pages 1064–1068. PMLR, 07–10 Jul

  2. [2]

    Daniel Herbst and Stefanie Jegelka

    doi: 10.48550/arXiv.2206.08684. Daniel Herbst and Stefanie Jegelka. Higher-Order Graphon Neural Networks: Approximation and Cut Distance.Int. Conf. on Learning Representations (ICLR),

  3. [3]

    Higher-order graphon neural networks: Approximation and cut distance.arXiv preprint arXiv:2503.14338, 2025

    Spotlight Paper. arXiv:2503.14338. Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. InNIPS Deep Learning and Representation Learning Workshop,

  4. [4]

    Deep learning,

    doi: 10.1038/nature14539. Namhoon Lee, Thalaiyasingam Ajanthan, and Philip H. S. Torr. Snip: Single-shot Network Pruning based on Connection Sensitivity. InInt. Conf. on Learning Representations (ICLR). OpenReview.net,

  5. [5]

    doi: https://doi.org/10.1016/S0893-6080(05)80131-5

    ISSN 0893-6080. doi: https://doi.org/10.1016/S0893-6080(05)80131-5. Ron Levie. A Graphon-Signal Analysis of Graph Neural Networks.Advances in Neural Information Processing Systems (NIPS),

  6. [6]

    Cl´audio M

    doi: 10.1007/s00039-007-0599-6. Cl´audio M. S. Medeiros and Guilherme A. Barreto. An efficient method for pruning the multilayer perceptron based on the correlation of errors. In Joaquim Marques de S ´a, Lu´ıs A. Alexandre, W lodzis law Duch, and Danilo Mandic, editors,Artificial Neural Networks – ICANN 2007, pages 219–228, Berlin, Heidelberg,

  7. [7]

    doi: 10.1145/3578938

    ISSN 0360-0300. doi: 10.1145/3578938. C. Merkwirth and T. Lengauer. Automatic generation of complementary descriptors with molecular graph networks. Journal of Chemical Information and Modeling, 45(5):1159–1168,

  8. [8]

    C.; Mocanu, E.; Stone, P.; Nguyen, P

    doi: 10.1038/s41467-018-04316-3. Elena-Alexandra Peste.Efficiency and Generalization of Sparse Neural Networks. Phd thesis, Institute of Science and Technology Austria,

  9. [9]

    Approximation theory of the MLP model in neural net works

    doi: 10.1017/S0962492900002919. Levi Rauchwerger and Ron Levie. A Note on Graphon-Signal Analysis of Graph Neural Networks.arXiv preprint,

  10. [10]

    Ugo Tanielian, Maxime Sangnier, and G´erard Biau

    doi: 10.48550/arXiv.2305.10498. Ugo Tanielian, Maxime Sangnier, and G´erard Biau. Approximating lipschitz continuous functions with groupsort neural networks.International Conference on Artificial Intelligence and Statistics, abs/2006.05254,

  11. [11]

    and Manso, G.F., 2020

    Neil C. Thompson, Kristjan H. Greenewald, Keeheon Lee, and Gabriel F. Manso. The Computational Limits of Deep Learning.Ninth Workshop on Computing within Limits, abs/2007.05558,

  12. [12]

    Then, for any input x∈R 𝑑0 and output channel𝑖∈ [𝑑 𝐿], we have Θ(W,b) (x)𝑖 = Φ 𝐵,𝐿 𝐾 (W,b) , 𝑓 x 𝑣 , whenever𝑣is in the𝑖th interval ofC out 𝑑𝐿

    Lemma 4.Let (W,b) be the dense parameters of Θ(W,b) ∈ N N dense(𝐵, 𝐿, 𝑑 0, 𝑑𝐿). Then, for any input x∈R 𝑑0 and output channel𝑖∈ [𝑑 𝐿], we have Θ(W,b) (x)𝑖 = Φ 𝐵,𝐿 𝐾 (W,b) , 𝑓 x 𝑣 , whenever𝑣is in the𝑖th interval ofC out 𝑑𝐿 . Proof. We now show that forℓ=0, . . . , 𝐿 the activation of the ℓth hidden layer h(ℓ) 𝑖 in channel 𝑖∈ [𝑑 ℓ] of the network Θ(W,b) is...

  13. [13]

    We first recall some useful lemmas

    to our setting in order to obtain an explicit bound on the Lipschitz constant. We first recall some useful lemmas. Lemma 11(Levie [2023], Equation (9)).Let𝑓:[0,1] →Rbe measurable. Then 1 2 ∥𝑓∥ 1 ≤ ∥𝑓∥ □ ≤ ∥𝑓∥

  14. [14]

    Lemma 12(Levie [2023], Lemma F.5/Lov ´asz [2012], Lemma 8.10).Let𝑄:[0,1] 2 →Rbe measurable

    Denote by𝐿 + [0,1]the space of measurable functions𝑓:[0,1] → [0,1]. Lemma 12(Levie [2023], Lemma F.5/Lov ´asz [2012], Lemma 8.10).Let𝑄:[0,1] 2 →Rbe measurable. Then ∥𝑄∥ □ =sup 𝑓 ,𝑔∈𝐿 + [0,1] ∫ [0,1] 2 𝑓(𝑥)𝑄(𝑥, 𝑦)𝑔(𝑦)d𝑥d𝑦 , where the supremum is attained for some𝑓 , 𝑔∈𝐿 + [0,1]with values in{0,1}. Lemma 13(Cut Norm Lipschitz Continuity).Let 𝐵∈R and Φ𝐵,𝐿 be...

  15. [15]

    analyst’s version

    Theorem 5(Computational Cut Distance Lipschitz Continuity).Let 𝐵 >0 and let Φ𝐵,𝐿 be the 𝐿-layer 𝐵-IR-MPNN. Let(𝐾, 𝑓),(𝐽, 𝑔) ∈ CK (𝐵, 𝐿, 𝑑 0, 𝑑𝐿) × CS (𝐿, 𝑑 0, 𝑑𝐿)be two computational kernel-signals. Let Out𝐾 , 𝑓 :=1 𝑈 (𝐿) Φ𝐵,𝐿 (𝐾, 𝑓),Out 𝐽 ,𝑔 :=1 𝑈 (𝐿) Φ𝐵,𝐿 (𝐽, 𝑔). Then ∥Out𝐾 , 𝑓 −Out 𝐽 ,𝑔 ∥□ ≤2 𝐿 𝐵𝐿 𝛿comp □ (𝐾, 𝑓),(𝐽, 𝑔) . Proof. Consider an arbitrary me...

  16. [16]

    Lemma 17.LetP 𝑛 ={𝑃 1,

    as Lemma B.10. Lemma 17.LetP 𝑛 ={𝑃 1, . . . , 𝑃𝑛}be a partition of[0,1], and Let𝑉, 𝑅∈ S 2 P𝑛 ∩ W1. Then, the supremum of sup 𝑆,𝑇⊂ [0,1] ∫ 𝑆 ∫ 𝑇 𝑉(𝑥, 𝑦) −𝑅(𝑥, 𝑦) 𝑑𝑥𝑑𝑦 is attained for𝑆, 𝑇of the form 𝑆= Ø 𝑖∈𝑠 𝑃𝑖 , 𝑇= Ø 𝑗∈𝑡 𝑃 𝑗 , where𝑡, 𝑠⊂ [𝑛]. As part of the proof of Theorem 6, we first show that any computational kernel can be approximated by a coarse step...

  17. [17]

    Hence, Θ(x) 𝑗 −Θ ′ (x) 𝑗 < 𝜖

    (𝐿+2)𝑑 𝐿 ∥Out𝐾 , 𝑓x −Out 𝐾 ′ , 𝑓x ∥□ ≤ (𝐿+2)𝑑 𝐿 (2𝐵) 𝐿 𝛿comp □ (𝐾, 𝐾 ′)< 𝜖 . Hence, Θ(x) 𝑗 −Θ ′ (x) 𝑗 < 𝜖 . 22 C Implications for Approximation In this section, we prove our main results, providing explicit conditions under which dense deep ReLU networks fail to be universal. In other words, for sufficiently large input dimension𝑑0,any dense ReLU network ...

  18. [18]

    Then, we use these two results to show that for a large enough input dimension, no ReLU architecture can approximate all functions with a small𝜖 error

    provides lower bounds on the number of computational units a network architecture must have in order to be able to approximate any function in Lip(𝑑 0, 𝑑𝐿). Then, we use these two results to show that for a large enough input dimension, no ReLU architecture can approximate all functions with a small𝜖 error. This highlights an inherent limitation of dense ...

  19. [19]

    Theorem 8 gives a result close to the first part of Yarotsky [2017], Theorem 4, but with explicit constants restricted to Lipschitz continuous functions. Recall that for a hypothesis class H of Boolean functions on [0,1] 𝑑, theVC-dimensionis the size of the largest subset 𝑆⊂ [0,1] 𝑑 that can be shattered byH, i.e., on whichHcan realize all dichotomies (se...

  20. [20]

    27 D Experiments We empirically test the theoretical prediction that increasing width in dense networks leads to early saturation in performance

    This contradiction completes the proof. 27 D Experiments We empirically test the theoretical prediction that increasing width in dense networks leads to early saturation in performance. We use fully connected one-hidden-layer ReLU networks in PyTorch, trained on MNIST with the Adam optimizer (learning rate 10−3, batch size 128, 20 epochs). We consider two...

  21. [21]

    the𝑗th cell of the partition

    and Rauchwerger and Levie [2025], connecting graph-based neural computations to kernel-based neural computations. Graph Neural Networksand specificallymessage passing networks (MPNNs), form a class of neural networks designed to process graph-structured data [Xu et al., 2019, Rossi et al., 2023] by iteratively updating node embeddings through the exchange...

  22. [22]

    Levie [2023], Rauchwerger and Levie

  23. [23]

    Appendix E in Rauchwerger and Levie [2025])

    show that applying an MPNN to an attributed graph and then inducing an attributed kernel yields the same representation as first inducing the attributed kernel and then applying the MPNN (see e.g. Appendix E in Rauchwerger and Levie [2025]). A direct result of this theorem in our setting is stated as follows. Lemma 22.Let 𝐿∈N and 𝐵∈R . For any [−𝐵, 𝐵] -we...