Gradient Descent as a Perceptron Algorithm: Understanding Dynamics and Implicit Acceleration

Alexander Tyurin

arxiv: 2512.11587 · v2 · pith:AC4FRKG4new · submitted 2025-12-12 · 💻 cs.LG · cs.NA· math.NA· math.OC

Gradient Descent as a Perceptron Algorithm: Understanding Dynamics and Implicit Acceleration

Alexander Tyurin This is my paper

Pith reviewed 2026-05-22 12:06 UTC · model grok-4.3

classification 💻 cs.LG cs.NAmath.NAmath.OC

keywords gradient descentperceptron algorithmneural network trainingimplicit accelerationlogistic losstwo-layer modeliteration complexityoptimization dynamics

0 comments

The pith

Gradient descent on nonlinear models with logistic loss reduces to generalized perceptron updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to clarify why gradient descent exhibits specific behaviors like implicit acceleration when training nonlinear models under the logistic loss. It establishes that each step of gradient descent corresponds directly to an update from a generalized perceptron algorithm. This mapping turns complex neural network updates into simpler linear operations that can be studied with standard linear algebra. On a basic two-layer example the nonlinearity then produces a provably better iteration complexity of order square root of dimension rather than linear in dimension. The result supplies a concrete explanation for observed training dynamics in neural networks.

Core claim

We analyze nonlinear models with the logistic loss and show that the steps of GD reduce to those of generalized perceptron algorithms, providing a new perspective on the dynamics. This reduction yields significantly simpler algorithmic steps, which we analyze using classical linear algebra tools. Using these tools, we demonstrate on a minimalistic example that the nonlinearity in a two-layer model can provably yield a faster iteration complexity Õ(√d) compared to Ω(d) achieved by linear models, where d is the number of features. This helps explain the optimization dynamics and the implicit acceleration phenomenon observed in neural networks.

What carries the argument

The exact reduction that rewrites each gradient descent step as a generalized perceptron update on the two-layer nonlinear model with logistic loss.

If this is right

Dynamics of convergence, trajectories, and oscillations become analyzable with elementary linear algebra once the perceptron equivalence is used.
Nonlinearity in the two-layer case produces iteration complexity scaling as Õ(√d) while linear models require Ω(d) steps.
Implicit acceleration observed during neural network training can be traced to the perceptron-like behavior induced by the nonlinear layers.
Oscillations in function values during training follow directly from known properties of generalized perceptron algorithms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reduction technique could be attempted on deeper networks or with different nonlinear activations to see whether similar acceleration appears.
New optimization procedures might be constructed by directly simulating the perceptron updates instead of computing gradients.
Empirical checks on high-dimensional datasets could reveal whether the square-root scaling persists beyond the minimal example.
The perspective links modern neural optimization to classical linear classifiers and may illuminate other implicit-bias effects.

Load-bearing premise

The claimed reduction of every gradient descent step to a perceptron update and the resulting complexity comparison are derived only for a minimal two-layer nonlinear model.

What would settle it

Running gradient descent on the two-layer logistic model and checking whether the exact sequence of parameter changes matches the perceptron update rule would directly test the reduction.

Figures

Figures reproduced from arXiv: 2512.11587 by Alexander Tyurin.

**Figure 2.** Figure 2: Perceptron Algorithm and Two-Sample Quadratic Perceptron Algorithm on dataset (5). Discussion. Comparing Theorem 5.1 and Theorem 5.2, we see that the quadratic perceptron approach achieves a √ d-times better complexity. These theorems help to explain the gap between mlin and mcv observed in Section 1.1 and [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Accuracies for nonlinear model mcv trained on CIFAR-10 with two classes 0 and 1, # of samples n = 5000, and different kernel sizes. Ai ∈ R (d+k)×(d+k) for all i ∈ [n] and wt := c ⊤ t v ⊤ t ⊤ ∈ R k+d . For this matrix, we can prove the following result. Proposition 6.1. Consider matrix Ai from (8). Then, the norm of Ai can be bounded as √ 1 k [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Linear model mlin with classes 0 and 1 and n = 5000 samples on CIFAR-10 0 50000 100000 150000 200000 250000 300000 350000 400000 Iteration 0.70 0.75 0.80 0.85 0.90 0.95 1.00 Accuracy Conv(2)+Linear model; step: 1.0 Conv(2)+Linear model; step: 0.5 Conv(2)+Linear model; step: 0.25 Conv(2)+Linear model; step: 0.125 Conv(2)+Linear model; step: 0.0625 Conv(2)+Linear model; step: 0.03125 Conv(2)+Linear model; st… view at source ↗

**Figure 5.** Figure 5: Non-linear model mcv with kernel size k = 2, classes 0 and 1, and n = 5000 samples on CIFAR-10 57 [PITH_FULL_IMAGE:figures/full_fig_p057_5.png] view at source ↗

**Figure 6.** Figure 6: Linear model mlin with classes 3 and 7 and n = 5000 samples on CIFAR-10 0 50000 100000 150000 200000 250000 300000 350000 400000 Iteration 0.70 0.75 0.80 0.85 0.90 0.95 1.00 Accuracy Conv(2)+Linear model; step: 1.0 Conv(2)+Linear model; step: 0.5 Conv(2)+Linear model; step: 0.25 Conv(2)+Linear model; step: 0.125 Conv(2)+Linear model; step: 0.0625 Conv(2)+Linear model; step: 0.03125 Conv(2)+Linear model; st… view at source ↗

**Figure 7.** Figure 7: Non-linear model mcv with kernel size k = 2, classes 3 and 7, and n = 5000 samples on CIFAR-10 58 [PITH_FULL_IMAGE:figures/full_fig_p058_7.png] view at source ↗

**Figure 8.** Figure 8: Linear model mlin with classes 0 and 1 and all samples on CIFAR-10 0 50000 100000 150000 200000 250000 300000 350000 400000 Iteration 0.70 0.75 0.80 0.85 0.90 0.95 1.00 Accuracy Conv(2)+Linear model; step: 1.0 Conv(2)+Linear model; step: 0.5 Conv(2)+Linear model; step: 0.25 Conv(2)+Linear model; step: 0.125 Conv(2)+Linear model; step: 0.0625 Conv(2)+Linear model; step: 0.03125 Conv(2)+Linear model; step: 0… view at source ↗

**Figure 9.** Figure 9: Non-linear model mcv with kernel size k = 2, classes 0 and 1, and all samples on CIFAR-10 59 [PITH_FULL_IMAGE:figures/full_fig_p059_9.png] view at source ↗

**Figure 10.** Figure 10: Linear model mlin with classes 3 and 7 and n = 5000 samples on EuroSAT 0 20000 40000 60000 80000 100000 Iteration 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00 Accuracy Conv(2)+Linear model; step: 1.0 Conv(2)+Linear model; step: 0.5 Conv(2)+Linear model; step: 0.25 Conv(2)+Linear model; step: 0.125 Conv(2)+Linear model; step: 0.0625 Conv(2)+Linear model; step: 0.03125 0 20000 40000 60000 80000 100000 Itera… view at source ↗

**Figure 11.** Figure 11: Non-linear model mcv with kernel size k = 2, classes 3 and 7, and n = 5000 samples on EuroSAT 60 [PITH_FULL_IMAGE:figures/full_fig_p060_11.png] view at source ↗

**Figure 12.** Figure 12: Linear model mlin with classes 0 and 1 and n = 5000 samples on EuroSAT 0 20000 40000 60000 80000 100000 Iteration 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00 Accuracy Conv(2)+Linear model; step: 1.0 Conv(2)+Linear model; step: 0.5 Conv(2)+Linear model; step: 0.25 Conv(2)+Linear model; step: 0.125 Conv(2)+Linear model; step: 0.0625 Conv(2)+Linear model; step: 0.03125 0 20000 40000 60000 80000 100000 Itera… view at source ↗

**Figure 13.** Figure 13: Non-linear model mcv with kernel size k = 2, classes 0 and 1, and n = 5000 samples on EuroSAT 61 [PITH_FULL_IMAGE:figures/full_fig_p061_13.png] view at source ↗

**Figure 14.** Figure 14: Linear model mlin with classes 5 and 7 and n = 5000 samples on FashionMNIST 0 100000 200000 300000 400000 500000 Iteration 0.800 0.825 0.850 0.875 0.900 0.925 0.950 0.975 1.000 Accuracy Conv(2)+Linear model; step: 1.0 Conv(2)+Linear model; step: 0.5 Conv(2)+Linear model; step: 0.25 Conv(2)+Linear model; step: 0.125 Conv(2)+Linear model; step: 0.0625 0 100000 200000 300000 400000 500000 Iteration 10 4 10 3… view at source ↗

**Figure 15.** Figure 15: Non-linear model mcv with kernel size k = 2, classes 5 and 7, and n = 5000 samples on FashionMNIST 62 [PITH_FULL_IMAGE:figures/full_fig_p062_15.png] view at source ↗

**Figure 16.** Figure 16: Linear model mlin with classes 7 and 9 and n = 5000 samples on FashionMNIST 0 100000 200000 300000 400000 500000 Iteration 0.800 0.825 0.850 0.875 0.900 0.925 0.950 0.975 1.000 Accuracy Conv(2)+Linear model; step: 1.0 Conv(2)+Linear model; step: 0.5 Conv(2)+Linear model; step: 0.25 Conv(2)+Linear model; step: 0.125 Conv(2)+Linear model; step: 0.0625 0 100000 200000 300000 400000 500000 Iteration 10 6 10 5… view at source ↗

**Figure 17.** Figure 17: Non-linear model mcv with kernel size k = 2, classes 7 and 9, and n = 5000 samples on FashionMNIST 63 [PITH_FULL_IMAGE:figures/full_fig_p063_17.png] view at source ↗

**Figure 18.** Figure 18: Linear model mlin with classes 0 and 1 and n = 10000 samples on Food101 0 5000 10000 15000 20000 25000 30000 35000 40000 Iteration 0.70 0.75 0.80 0.85 0.90 0.95 1.00 Accuracy Conv(2)+Linear model; step: 8.0 Conv(2)+Linear model; step: 4.0 Conv(2)+Linear model; step: 2.0 Conv(2)+Linear model; step: 1.0 Conv(2)+Linear model; step: 0.5 Conv(2)+Linear model; step: 0.25 Conv(2)+Linear model; step: 0.125 Conv(2… view at source ↗

**Figure 19.** Figure 19: Non-linear model mcv with kernel size k = 2, classes 0 and 1, and n = 10000 samples on Food101 64 [PITH_FULL_IMAGE:figures/full_fig_p064_19.png] view at source ↗

**Figure 20.** Figure 20: Linear model mlin with classes 3 and 7 and n = 10000 samples on Food101 0 5000 10000 15000 20000 25000 30000 35000 40000 Iteration 0.70 0.75 0.80 0.85 0.90 0.95 1.00 Accuracy Conv(2)+Linear model; step: 8.0 Conv(2)+Linear model; step: 4.0 Conv(2)+Linear model; step: 2.0 Conv(2)+Linear model; step: 1.0 Conv(2)+Linear model; step: 0.5 Conv(2)+Linear model; step: 0.25 Conv(2)+Linear model; step: 0.125 Conv(2… view at source ↗

**Figure 21.** Figure 21: Non-linear model mcv with kernel size k = 2, classes 3 and 7, and n = 10000 samples on Food101 65 [PITH_FULL_IMAGE:figures/full_fig_p065_21.png] view at source ↗

**Figure 22.** Figure 22: Linear model mlin with classes 0 and 1 and all samples on MNIST 0 10000 20000 30000 40000 50000 Iteration 0.90 0.92 0.94 0.96 0.98 1.00 Accuracy Conv(2)+Linear model; step: 8.0 Conv(2)+Linear model; step: 4.0 Conv(2)+Linear model; step: 2.0 Conv(2)+Linear model; step: 1.0 Conv(2)+Linear model; step: 0.5 Conv(2)+Linear model; step: 0.25 Conv(2)+Linear model; step: 0.125 Conv(2)+Linear model; step: 0.0625 C… view at source ↗

**Figure 23.** Figure 23: Non-linear model mcv with kernel size k = 2, classes 0 and 1, and all samples on MNIST 66 [PITH_FULL_IMAGE:figures/full_fig_p066_23.png] view at source ↗

**Figure 24.** Figure 24: Linear model mlin with classes 7 and 8 and all samples on MNIST 0 10000 20000 30000 40000 50000 Iteration 0.90 0.92 0.94 0.96 0.98 1.00 Accuracy Conv(2)+Linear model; step: 8.0 Conv(2)+Linear model; step: 4.0 Conv(2)+Linear model; step: 2.0 Conv(2)+Linear model; step: 1.0 Conv(2)+Linear model; step: 0.5 Conv(2)+Linear model; step: 0.25 Conv(2)+Linear model; step: 0.125 Conv(2)+Linear model; step: 0.0625 C… view at source ↗

**Figure 25.** Figure 25: Non-linear model mcv with kernel size k = 2, classes 7 and 8, and all samples on MNIST 67 [PITH_FULL_IMAGE:figures/full_fig_p067_25.png] view at source ↗

**Figure 26.** Figure 26: Linear model mlin with all samples on Gisette 0 2000 4000 6000 8000 10000 Iteration 0.95 0.96 0.97 0.98 0.99 1.00 1.01 Accuracy Conv(2)+Linear model; step: 1.0 Conv(2)+Linear model; step: 0.5 Conv(2)+Linear model; step: 0.25 Conv(2)+Linear model; step: 0.125 Conv(2)+Linear model; step: 0.0625 Conv(2)+Linear model; step: 0.03125 Conv(2)+Linear model; step: 0.015625 0 2000 4000 6000 8000 10000 Iteration 10 … view at source ↗

**Figure 27.** Figure 27: Non-linear model mcv with kernel size k = 2, and all samples on Gisette 68 [PITH_FULL_IMAGE:figures/full_fig_p068_27.png] view at source ↗

**Figure 28.** Figure 28: Non-linear model mcv with kernel size k = 2, classes 0 and 1, and n = 5000 samples on CIFAR-10 In this section, we present the same experiment as in Figure 1b and [PITH_FULL_IMAGE:figures/full_fig_p069_28.png] view at source ↗

**Figure 29.** Figure 29: Training ResNet18 on CIFAR-10 with the logistic loss and the vanilla GD method. [PITH_FULL_IMAGE:figures/full_fig_p069_29.png] view at source ↗

read the original abstract

Even for the gradient descent (GD) method applied to neural network training, understanding its optimization dynamics, including convergence rate, iterate trajectories, function value oscillations, and especially its implicit acceleration, remains a challenging problem. We analyze nonlinear models with the logistic loss and show that the steps of GD reduce to those of generalized perceptron algorithms (Rosenblatt, 1958), providing a new perspective on the dynamics. This reduction yields significantly simpler algorithmic steps, which we analyze using classical linear algebra tools. Using these tools, we demonstrate on a minimalistic example that the nonlinearity in a two-layer model can provably yield a faster iteration complexity $\tilde{O}(\sqrt{d})$ compared to $\Omega(d)$ achieved by linear models, where $d$ is the number of features. This helps explain the optimization dynamics and the implicit acceleration phenomenon observed in neural networks. The theoretical results are supported by extensive numerical experiments. We believe that this alternative view will further advance research on the optimization of neural networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper rewrites GD steps on a minimal two-layer logistic model as generalized perceptron updates and proves an Õ(√d) vs Ω(d) gap, but stops short of showing the same reduction works for deeper or more general networks.

read the letter

The paper's main contribution is showing that gradient descent on a simple two-layer nonlinear model with logistic loss can be rewritten exactly as a generalized perceptron update. Once that algebraic step is done, classical linear algebra tools give a clean Õ(√d) iteration bound on their toy example, while the linear version needs Ω(d) steps. That separation is the concrete new piece they deliver, and the numerical runs back it up on the small case they study.

Referee Report

3 major / 3 minor

Summary. The paper argues that gradient descent applied to nonlinear models with the logistic loss can be exactly reduced to generalized perceptron algorithm steps. This reduction simplifies the analysis of optimization dynamics using classical linear algebra. On a minimal two-layer nonlinear model, they show a provable iteration complexity of ~O(sqrt(d)) versus Omega(d) for linear models, which they suggest explains implicit acceleration in neural networks. The theoretical findings are corroborated by numerical experiments.

Significance. Should the reduction hold and extend to more general settings, this work would provide a fresh perspective on why gradient descent exhibits certain behaviors in neural network training, particularly the role of nonlinearity in achieving faster convergence. The complexity separation in the toy model is a concrete demonstration of this advantage, though its relevance to practical deep networks hinges on broader applicability.

major comments (3)

[Abstract and §3] Abstract and §3: The assertion that GD steps reduce to generalized perceptron updates for 'nonlinear models' is supported by derivation only in the minimalistic two-layer logistic loss setting; the algebraic cancellation shown does not immediately generalize to deeper architectures or other activations, which is critical for the explanation of implicit acceleration in general neural networks.
[§4] §4, complexity bound derivation: The ~O(sqrt(d)) complexity is proven for the two-layer example, but the manuscript does not provide a corresponding analysis or bound for linear models within the same perceptron reduction framework to rigorously establish the separation; the Omega(d) is cited from classical results without integration.
[§5] §5, Experiments: The numerical experiments validate the theory on the minimal example, but lack ablations on slightly deeper models to test if the reduction and acceleration persist, leaving the extrapolation to general NNs untested.

minor comments (3)

[Introduction] The citation to Rosenblatt (1958) should be expanded with full bibliographic details for completeness.
[Figures] The plot labels and legends in the figures could be made larger for better readability in the final version.
[Notation] Some symbols like the generalized perceptron update rule are introduced without sufficient preliminary definition before use in the main theorems.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive comments, which help clarify the scope of our results. We address each major comment point by point below, with revisions where appropriate to improve precision without altering the core contributions.

read point-by-point responses

Referee: [Abstract and §3] The assertion that GD steps reduce to generalized perceptron updates for 'nonlinear models' is supported by derivation only in the minimalistic two-layer logistic loss setting; the algebraic cancellation shown does not immediately generalize to deeper architectures or other activations.

Authors: We agree that the exact reduction via algebraic cancellation is derived for the two-layer model. The manuscript presents this as a minimalistic example to illustrate the perspective on implicit acceleration. We will revise the abstract and Section 3 to explicitly qualify the claims as applying to this setting and to state that extensions to deeper architectures or other activations are left as future work. revision: yes
Referee: [§4] The ~O(sqrt(d)) complexity is proven for the two-layer example, but the manuscript does not provide a corresponding analysis or bound for linear models within the same perceptron reduction framework to rigorously establish the separation; the Omega(d) is cited from classical results without integration.

Authors: The linear case reduces precisely to the standard perceptron algorithm under the same framework, so the classical Ω(d) bound applies directly. To integrate the comparison rigorously, we will add a short subsection in Section 4 that recalls the classical perceptron bound using the linear algebra tools developed for the nonlinear analysis, thereby establishing the separation within our unified view. revision: yes
Referee: [§5] The numerical experiments validate the theory on the minimal example, but lack ablations on slightly deeper models to test if the reduction and acceleration persist, leaving the extrapolation to general NNs untested.

Authors: Our experiments are tailored to validate the theory on the minimal two-layer model. Adding ablations on deeper models would require new theoretical approximations for the reduction, which exceeds the current scope. We will expand the discussion in Section 5 to explicitly note this limitation and identify deeper-model testing as future work; this constitutes a partial revision. revision: partial

standing simulated objections not resolved

Generalization of the perceptron reduction and complexity separation to arbitrary deep architectures or activations.

Circularity Check

0 steps flagged

No significant circularity detected; derivation relies on algebraic reduction and external linear-algebra analysis

full rationale

The paper derives an equivalence between GD steps and generalized perceptron updates directly from the logistic loss on a two-layer nonlinear model, then applies classical linear-algebra tools to obtain the O(sqrt(d)) vs Omega(d) complexity separation. This chain is presented as explicit algebraic manipulation rather than a parameter fit, self-referential definition, or load-bearing self-citation. The minimalistic example is used to demonstrate the nonlinearity benefit without assuming the target implicit-acceleration phenomenon as an input. No steps reduce by construction to the paper's own fitted values or prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the domain assumption that the logistic-loss gradient steps on the chosen nonlinear models exactly match generalized perceptron updates; no free parameters, new entities, or additional axioms are introduced in the abstract.

axioms (1)

domain assumption Gradient-descent iterates on the logistic loss for the nonlinear models under study are exactly equivalent to updates of a generalized perceptron algorithm.
The abstract states that the steps of GD reduce to those of generalized perceptron algorithms for the nonlinear models considered.

pith-pipeline@v0.9.0 · 5701 in / 1341 out tokens · 63737 ms · 2026-05-22T12:06:59.559228+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 1 internal anchor

[1]

Risk and parameter convergence of logistic regression

PMLR. Ji, Z. and Telgarsky, M. (2018). Risk and parameter convergence of logistic regression.arXiv preprint arXiv:1803.07300. Kreisler, I., Nacson, M. S., Soudry, D., and Carmon, Y . (2023). Gradient descent monotonically decreases the sharpness of gradient flow solutions in scalar networks and beyond. InInternational Conference on Machine Learning, pages...

work page internal anchor Pith review Pith/arXiv arXiv 2018
[2]

Ifa= 0,thenAis the zero matrix withd+ 2zero eigenvalues

work page
[3]

If a̸= 0 and a=Pa, then A has two non-zero eigenvalues √ 2∥a∥ and − √ 2∥a∥ with the corresponding eigenvectorsv 1 = ∥a∥ ∥a∥ √ 2a⊤ ⊤ andv 2 = − ∥a∥ − ∥a∥ √ 2a⊤ ⊤

work page
[4]

If a̸= 0 and a=−Pa, then A has two non-zero eigenvalues √ 2∥a∥ and − √ 2∥a∥ with the corresponding eigenvectorsv 1 = ∥a∥ − ∥a∥ √ 2a⊤ ⊤ andv 2 = − ∥a∥ ∥a∥ √ 2a⊤ ⊤

work page
[5]

5.∥a∥ ≤ ∥A∥ ≤ √ 2∥a∥

If a̸= 0, a̸=Pa, and a̸=−Pa, then A has four non-zero eigenvalues q ∥a∥2 +a ⊤Pa, − q ∥a∥2 +a ⊤Pa, q ∥a∥2 −a ⊤Pa,− q ∥a∥2 −a ⊤Pawith the corresponding eigenvectors v1 = ¯a+,P ¯a+,P (a+Pa) ⊤ ⊤ , v2 = −¯a+,P −¯a+,P (a+Pa) ⊤ ⊤ , v3 = ¯a−,P −¯a−,P (a−Pa) ⊤ ⊤ ,andv 4 = −¯a−,P ¯a−,P (a−Pa) ⊤ ⊤ , where¯a+,P := q ∥a∥2 +a ⊤Paand¯a−,P := q ∥a∥2 −a ⊤Pa. 5.∥a∥ ≤ ∥A∥ ≤...

work page
[6]

A1 has two non-zero eigenvalues λ1 and −λ1 with the corresponding eigenvectors v1,+ and v1,−, whereλ 1 := √ 2d

work page
[7]

ping-pong

A2 has four non-zero eigenvalues λ2,−λ 2, µ,−µ with the corresponding eigenvectors v2,+, v2,−, vµ,+,andv µ,−,whereλ 2 := p (2 +µ) 2 + 2(d−2). 3.v µ,+, vµ,− ∈ker(A 1). Proof. The first and second result are simple corollaries of Proposition 4.1. The third result follows from thatv µ,+ andv µ,− are propositional to [−µ µ µ−µ0. . .0 ]⊤ , [µ−µ µ−µ0. . .0 ]⊤ ....

work page 2027
[9]

Substituting to (42), ∥θt+1∥2 ≤(1 +s) " (1 +γ 2λ2 2)(⟨θt, v2,+⟩2 +⟨θ t, v2,−⟩2) + 2γλ2 − 2γλ2 1 +γ 2λ2 2 (⟨θt, v2,+⟩2 +⟨θ t, v2,−⟩2)− µ(1 +γµ) 2 λ2(1 +γ 2λ2

⟨θt, vµ,−⟩2 + 4∥ξ t∥ ∥θt∥+∥ξ t∥2 1 +γ 2λ2 2 , 30 where we have arranged terms. Substituting to (42), ∥θt+1∥2 ≤(1 +s) " (1 +γ 2λ2 2)(⟨θt, v2,+⟩2 +⟨θ t, v2,−⟩2) + 2γλ2 − 2γλ2 1 +γ 2λ2 2 (⟨θt, v2,+⟩2 +⟨θ t, v2,−⟩2)− µ(1 +γµ) 2 λ2(1 +γ 2λ2

work page
[10]

⟨θt, vµ,+⟩2 + µ(1−γµ) 2 λ2(1 +γ 2λ2

work page
[11]

⟨θt, v2,+⟩2 +⟨θ t, v2,−⟩2 + (1 +γµ) 2 (1−γµ)⟨θ t, vµ,+⟩2 + (1−γµ) 2 (1 + 2γµ)⟨θ t, vµ,−⟩2 + d−2X j=1 ⟨θt, v2,j⟩2 # + (1 +s −1)∥ξ t∥2 + (1 +s)2γλ 2 4∥ξ t∥ ∥θt∥+∥ξ t∥2 ≤(1 +s)

⟨θt, vµ,−⟩2 + 2γλ2 4∥ξ t∥ ∥θt∥+∥ξ t∥2 1 +γ 2λ2 2 ! + (1 +γµ) 2 ⟨θt, vµ,+⟩2 + (1−γµ) 2 ⟨θt, vµ,−⟩2 + d−2X j=1 ⟨θt, v2,j⟩2 # + (1 +s −1)∥ξ t∥2 . Since0≤γ≤ 1 λ2 and0≤γ≤ 1 µ , ∥θt+1∥2 ≤(1 +s) " ⟨θt, v2,+⟩2 +⟨θ t, v2,−⟩2 + (1 +γµ) 2 (1−γµ)⟨θ t, vµ,+⟩2 + (1−γµ) 2 (1 + 2γµ)⟨θ t, vµ,−⟩2 + d−2X j=1 ⟨θt, v2,j⟩2 # + (1 +s −1)∥ξ t∥2 + (1 +s)2γλ 2 4∥ξ t∥ ∥θt∥+∥ξ t∥2 ≤...

work page 2017

[1] [1]

Risk and parameter convergence of logistic regression

PMLR. Ji, Z. and Telgarsky, M. (2018). Risk and parameter convergence of logistic regression.arXiv preprint arXiv:1803.07300. Kreisler, I., Nacson, M. S., Soudry, D., and Carmon, Y . (2023). Gradient descent monotonically decreases the sharpness of gradient flow solutions in scalar networks and beyond. InInternational Conference on Machine Learning, pages...

work page internal anchor Pith review Pith/arXiv arXiv 2018

[2] [2]

Ifa= 0,thenAis the zero matrix withd+ 2zero eigenvalues

work page

[3] [3]

If a̸= 0 and a=Pa, then A has two non-zero eigenvalues √ 2∥a∥ and − √ 2∥a∥ with the corresponding eigenvectorsv 1 = ∥a∥ ∥a∥ √ 2a⊤ ⊤ andv 2 = − ∥a∥ − ∥a∥ √ 2a⊤ ⊤

work page

[4] [4]

If a̸= 0 and a=−Pa, then A has two non-zero eigenvalues √ 2∥a∥ and − √ 2∥a∥ with the corresponding eigenvectorsv 1 = ∥a∥ − ∥a∥ √ 2a⊤ ⊤ andv 2 = − ∥a∥ ∥a∥ √ 2a⊤ ⊤

work page

[5] [5]

5.∥a∥ ≤ ∥A∥ ≤ √ 2∥a∥

If a̸= 0, a̸=Pa, and a̸=−Pa, then A has four non-zero eigenvalues q ∥a∥2 +a ⊤Pa, − q ∥a∥2 +a ⊤Pa, q ∥a∥2 −a ⊤Pa,− q ∥a∥2 −a ⊤Pawith the corresponding eigenvectors v1 = ¯a+,P ¯a+,P (a+Pa) ⊤ ⊤ , v2 = −¯a+,P −¯a+,P (a+Pa) ⊤ ⊤ , v3 = ¯a−,P −¯a−,P (a−Pa) ⊤ ⊤ ,andv 4 = −¯a−,P ¯a−,P (a−Pa) ⊤ ⊤ , where¯a+,P := q ∥a∥2 +a ⊤Paand¯a−,P := q ∥a∥2 −a ⊤Pa. 5.∥a∥ ≤ ∥A∥ ≤...

work page

[6] [6]

A1 has two non-zero eigenvalues λ1 and −λ1 with the corresponding eigenvectors v1,+ and v1,−, whereλ 1 := √ 2d

work page

[7] [7]

ping-pong

A2 has four non-zero eigenvalues λ2,−λ 2, µ,−µ with the corresponding eigenvectors v2,+, v2,−, vµ,+,andv µ,−,whereλ 2 := p (2 +µ) 2 + 2(d−2). 3.v µ,+, vµ,− ∈ker(A 1). Proof. The first and second result are simple corollaries of Proposition 4.1. The third result follows from thatv µ,+ andv µ,− are propositional to [−µ µ µ−µ0. . .0 ]⊤ , [µ−µ µ−µ0. . .0 ]⊤ ....

work page 2027

[8] [9]

Substituting to (42), ∥θt+1∥2 ≤(1 +s) " (1 +γ 2λ2 2)(⟨θt, v2,+⟩2 +⟨θ t, v2,−⟩2) + 2γλ2 − 2γλ2 1 +γ 2λ2 2 (⟨θt, v2,+⟩2 +⟨θ t, v2,−⟩2)− µ(1 +γµ) 2 λ2(1 +γ 2λ2

⟨θt, vµ,−⟩2 + 4∥ξ t∥ ∥θt∥+∥ξ t∥2 1 +γ 2λ2 2 , 30 where we have arranged terms. Substituting to (42), ∥θt+1∥2 ≤(1 +s) " (1 +γ 2λ2 2)(⟨θt, v2,+⟩2 +⟨θ t, v2,−⟩2) + 2γλ2 − 2γλ2 1 +γ 2λ2 2 (⟨θt, v2,+⟩2 +⟨θ t, v2,−⟩2)− µ(1 +γµ) 2 λ2(1 +γ 2λ2

work page

[9] [10]

⟨θt, vµ,+⟩2 + µ(1−γµ) 2 λ2(1 +γ 2λ2

work page

[10] [11]

⟨θt, v2,+⟩2 +⟨θ t, v2,−⟩2 + (1 +γµ) 2 (1−γµ)⟨θ t, vµ,+⟩2 + (1−γµ) 2 (1 + 2γµ)⟨θ t, vµ,−⟩2 + d−2X j=1 ⟨θt, v2,j⟩2 # + (1 +s −1)∥ξ t∥2 + (1 +s)2γλ 2 4∥ξ t∥ ∥θt∥+∥ξ t∥2 ≤(1 +s)

⟨θt, vµ,−⟩2 + 2γλ2 4∥ξ t∥ ∥θt∥+∥ξ t∥2 1 +γ 2λ2 2 ! + (1 +γµ) 2 ⟨θt, vµ,+⟩2 + (1−γµ) 2 ⟨θt, vµ,−⟩2 + d−2X j=1 ⟨θt, v2,j⟩2 # + (1 +s −1)∥ξ t∥2 . Since0≤γ≤ 1 λ2 and0≤γ≤ 1 µ , ∥θt+1∥2 ≤(1 +s) " ⟨θt, v2,+⟩2 +⟨θ t, v2,−⟩2 + (1 +γµ) 2 (1−γµ)⟨θ t, vµ,+⟩2 + (1−γµ) 2 (1 + 2γµ)⟨θ t, vµ,−⟩2 + d−2X j=1 ⟨θt, v2,j⟩2 # + (1 +s −1)∥ξ t∥2 + (1 +s)2γλ 2 4∥ξ t∥ ∥θt∥+∥ξ t∥2 ≤...

work page 2017