A Derandomization Framework for Structure Discovery: Applications in Neural Networks and Beyond

Christos Tzamos; Ioannis Mitliagkas; Nikos Tsikouras; Yorgos Pantis

arxiv: 2510.19382 · v2 · pith:4V4ONZ5Znew · submitted 2025-10-22 · 📊 stat.ML · cs.LG

A Derandomization Framework for Structure Discovery: Applications in Neural Networks and Beyond

Nikos Tsikouras , Yorgos Pantis , Ioannis Mitliagkas , Christos Tzamos This is my paper

Pith reviewed 2026-05-21 21:02 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords derandomizationstructure discoveryneural networkssecond-order stationary pointslow-rank weightsMAXCUT approximationJohnson-Lindenstrauss embeddingsfeature learning

0 comments

The pith

A derandomization lemma shows that optimizing an expectation over inputs forces the weight matrix to zero under mild conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper isolates the structure-discovery step in neural network training and proves it holds for networks of any depth and width, with all parameters free, under any smooth loss, tiny regularization, and any optimizer that reaches a second-order stationary point. The argument rests on a single lemma: minimizing the expected value of a function applied to a linear layer output over random inputs drives the weight matrix exactly to zero. This mechanism accounts for the low-rank weights observed in practice and directly supplies new algorithmic uses outside neural networks. The result weakens prior restrictions to two-layer students or strong regularizers while preserving the sample-complexity benefit that follows from the discovered structure.

Core claim

The central claim is the derandomization lemma asserting that, for a smooth function g_θ and random input x, optimization of the scalar E_x[g_θ(Wx + b)] converges to a point at which the matrix W equals the zero matrix. The lemma operates once a second-order stationary point of the overall training loss has been reached and supplies the precise reason low-rank structure appears in the first-layer weights of trained networks. The same identity is then used to construct an end-to-end approximation to MAXCUT and to produce Johnson-Lindenstrauss embeddings.

What carries the argument

The derandomization lemma that proves optimization of E_x[g_θ(Wx + b)] reaches W = 0.

If this is right

Neural networks of arbitrary depth and size exhibit low-rank structure in their first-layer weights when trained to a second-order stationary point under any smooth loss and minimal regularization.
The same lemma directly produces an end-to-end approximation algorithm for the MAXCUT problem.
Johnson-Lindenstrauss embeddings can be generated by applying the identical optimization procedure.
The discovered low-rank structure reduces the sample complexity needed for generalization bounds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The lemma may supply a template for proving other emergent structures such as sparsity or block-diagonal patterns that arise from expectations over random linear maps.
Similar derandomization steps could be inserted into analyses of non-neural optimization problems that involve random projections or linear transformations.
The result suggests that deliberately driving optimizers to second-order stationary points could be used as a controllable regularizer in large-scale training.

Load-bearing premise

The training procedure must reach a second-order stationary point of a smooth loss.

What would settle it

Train a network with perturbed gradient descent to a verified second-order stationary point of a smooth loss with tiny regularization and observe whether the first-layer weight matrix remains bounded away from zero.

read the original abstract

Understanding the dynamics of feature learning in neural networks (NNs) remains a significant challenge. The work of (Mousavi-Hosseini et al., 2023) analyzes a multiple index teacher-student setting and shows that a two-layer student attains a low-rank structure in its first-layer weights when trained with stochastic gradient descent (SGD) and a strong regularizer. This structural property is known to reduce sample complexity of generalization. Indeed, in a second step, the same authors establish algorithm-specific learning guarantees under additional assumptions. In this paper, we focus exclusively on the structure discovery aspect and study it under weaker assumptions, more specifically: we allow (a) NNs of arbitrary size and depth, (b) with all parameters trainable, (c) under any smooth loss function, (d) tiny regularization, and (e) trained by any method that attains a second-order stationary point (SOSP), e.g.\ perturbed gradient descent (PGD). At the core of our approach is a key $\textit{derandomization}$ lemma, which states that optimizing the function $\mathbb{E}_{\mathbf{x}} \left[g_{\theta}(\mathbf{W}\mathbf{x} + \mathbf{b})\right]$ converges to a point where $\mathbf{W} = \mathbf{0}$, under mild conditions. The fundamental nature of this lemma directly explains structure discovery and has immediate applications in other domains including an end-to-end approximation for MAXCUT, and computing Johnson-Lindenstrauss embeddings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The derandomization lemma is the main new piece, but without a proof it is hard to tell if it survives joint optimization over deep g_theta.

read the letter

The one thing to know is that this paper claims a derandomization lemma: optimizing the expectation of g_theta(Wx + b) reaches a point with W exactly zero under mild conditions. If the lemma is correct, it removes the two-layer and strong-regularizer limits from the Mousavi-Hosseini et al. result and gives a single explanation for low-rank structure in networks of any depth trained to a second-order stationary point with tiny regularization and any smooth loss. They also list quick applications to MAXCUT approximation and Johnson-Lindenstrauss embeddings. That is the actual novelty on offer. The rest of the abstract is mostly setting up the broader claim that this lemma explains structure discovery in general neural nets. The paper does a clean job of stating the relaxed assumptions and isolating the structure question from full generalization bounds. That separation is useful for readers who want to test the lemma on other problems. The soft spot is the complete absence of a proof sketch or error analysis. The central step is that the partial gradient with respect to W vanishes only at zero even when theta is also optimized. For deep compositions this is not automatic; later layers could in principle cancel the effect of a non-zero W at a critical point. The stress-test note flags exactly this issue, and the abstract gives no indication how the argument rules it out. The SOSP assumption via perturbed gradient descent is standard, but it does not by itself close the gap if the lemma proof treats g_theta as fixed or shallow. The citation pattern looks fine; the prior work is properly referenced and the new applications are stated as direct consequences rather than fitted claims. This paper is for theorists who work on optimization landscapes and feature learning. A reader who needs a reusable derandomization tool or wants to see whether the MAXCUT and embedding claims hold up will get value if the proof is there. It deserves a serious referee because the claim is broad and the setting is general; the math either works or it does not, and that is worth checking. I would send it to review rather than desk reject, with the expectation that the first round will focus on verifying the lemma for arbitrary-depth g_theta.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a derandomization lemma asserting that optimization of the expectation E_x[g_θ(Wx + b)] converges to W = 0 under mild conditions. This is invoked to explain low-rank structure discovery in the first-layer weights of neural networks of arbitrary depth and size (with all parameters trainable) when trained to a second-order stationary point under any smooth loss and tiny regularization. The same lemma is applied to obtain an end-to-end approximation for MAXCUT and to compute Johnson-Lindenstrauss embeddings.

Significance. If the derandomization lemma is established rigorously for arbitrary-depth compositions g_θ, the work would provide a unified, parameter-free account of structure discovery that extends the two-layer analysis of Mousavi-Hosseini et al. (2023) to deeper networks and any SOSP-attaining optimizer. The claimed generality across loss functions and the concrete applications to combinatorial optimization and dimensionality reduction would constitute a notable contribution to the theory of feature learning.

major comments (2)

[Derandomization lemma (proof and statement)] The central derandomization lemma is stated to hold when g_θ is an arbitrary-depth composition with all parameters jointly optimized. The manuscript must supply an explicit argument showing that the partial gradient with respect to W vanishes only at W = 0 and that no compensating critical points exist in which a nonzero W is offset by adjustments in deeper layers of g_θ. Without this, the extension from the two-layer teacher-student setting to the general case remains unverified.
[Section on neural-network applications] The claim that any SOSP attained by perturbed gradient descent (or any other method) satisfies the lemma's conclusion relies on the loss being smooth and the regularization being tiny. A concrete verification or counter-example for a three-layer g_θ under a non-convex smooth loss would confirm that the 'mild conditions' indeed preclude additional stationary points.

minor comments (2)

[Abstract] The abstract refers to 'mild conditions' without enumerating them; a one-sentence clarification would improve readability for readers who do not reach the main text.
[Notation and equations] Notation for vectors and matrices should be checked for consistent boldface usage across all equations and figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which help clarify the presentation of our derandomization lemma and its implications. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Derandomization lemma (proof and statement)] The central derandomization lemma is stated to hold when g_θ is an arbitrary-depth composition with all parameters jointly optimized. The manuscript must supply an explicit argument showing that the partial gradient with respect to W vanishes only at W = 0 and that no compensating critical points exist in which a nonzero W is offset by adjustments in deeper layers of g_θ. Without this, the extension from the two-layer teacher-student setting to the general case remains unverified.

Authors: We agree that an explicit argument is required for full rigor in extending beyond the two-layer case. The current manuscript presents the lemma with a proof that relies on the chain rule through the composition and the fact that the expectation is taken over random x, but we acknowledge the need for a more detailed expansion. In the revised version, we will add a dedicated subsection in the appendix that derives the partial gradient ∇_W E_x[g_θ(Wx + b)] explicitly via backpropagation through arbitrary layers. We will show that this gradient vanishes at stationarity only when W = 0, because any nonzero W produces a nonzero contribution from the input distribution that cannot be fully canceled by adjustments to deeper parameters without violating joint optimization or the mild conditions on g_θ. This will directly address the concern about compensating critical points. revision: yes
Referee: [Section on neural-network applications] The claim that any SOSP attained by perturbed gradient descent (or any other method) satisfies the lemma's conclusion relies on the loss being smooth and the regularization being tiny. A concrete verification or counter-example for a three-layer g_θ under a non-convex smooth loss would confirm that the 'mild conditions' indeed preclude additional stationary points.

Authors: We thank the referee for this helpful suggestion to bolster the applications section. While the theoretical argument in the manuscript relies on smoothness of the loss and sufficiently small regularization to ensure the lemma applies at any SOSP, we agree that a concrete check is valuable. In the revision, we will add a short numerical experiment subsection (or appendix) for a three-layer g_θ under a non-convex smooth loss such as squared loss with ReLU activations. Using perturbed gradient descent to reach an approximate SOSP with tiny regularization, we will report that the first-layer weights converge to near-zero structure, consistent with the lemma and without evidence of compensating stationary points. This empirical verification will confirm the mild conditions suffice in practice. revision: yes

Circularity Check

0 steps flagged

Derivation self-contained; lemma presented as independent core result with no reduction to inputs

full rationale

The paper introduces the derandomization lemma as a fundamental result stating that optimization of E_x[g_θ(Wx + b)] converges to W=0 under mild conditions, then applies it to explain structure discovery for arbitrary-depth networks with all parameters trainable, any smooth loss, tiny regularization, and any SOSP-attaining method. The abstract and setup explicitly contrast this with prior work (Mousavi-Hosseini et al., 2023) by relaxing assumptions, without any equations or citations that define the lemma in terms of the target structure or reduce the convergence claim to a fitted parameter or self-referential ansatz. The SOSP premise is invoked as a standard external guarantee (e.g., via PGD) rather than derived from the lemma itself. No self-citation chains, uniqueness imports, or renamings of known results appear load-bearing in the provided text, leaving the chain independent of its conclusions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that any optimizer reaching a second-order stationary point combined with a smooth loss is sufficient for the lemma to apply; no free parameters or invented entities are described in the abstract.

axioms (2)

domain assumption Training reaches a second-order stationary point (SOSP).
Invoked to guarantee convergence to the point where W = 0.
domain assumption The loss function is smooth.
Required for the optimization setting under which the lemma holds.

pith-pipeline@v0.9.0 · 5818 in / 1312 out tokens · 36930 ms · 2026-05-21T21:02:53.404781+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Lemma 3.1 (Key Derandomization Lemma). ... any ρ-SOSP satisfies ∥W∥_F ≤ ρ/(2λ−√Kρ). Proof sketch: applying Stein’s Lemma relates the second derivative of b with the first derivative of W.
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 4.1 ... all ρ-SOSPs satisfy ∥W⊥∥_F ≤ ρ/(2λ−√Kρ).
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We allow (a) NNs of arbitrary size and depth, (b) with all parameters trainable, (c) under any smooth loss function, (d) tiny regularization, and (e) trained by any method that attains a second-order stationary point (SOSP)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.