Learning Rate Scheduling with Matrix Factorization for Private Training
Pith reviewed 2026-05-17 06:08 UTC · model grok-4.3
The pith
Learning-rate-aware matrix factorizations reduce error in private SGD training with scheduled learning rates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For a broad class of learning rate schedules in single- and multi-epoch settings, general upper and lower bounds on the error of matrix factorizations can be derived; a schedule-aware factorization then improves upon prefix-sum constructions in both MaxSE and MeanSE metrics while remaining memory-efficient.
What carries the argument
Learning-rate-aware matrix factorization, which incorporates the sequence of learning rates into the workload matrix to produce correlated noise whose variance matches the schedule.
If this is right
- The new factorizations yield strictly lower MaxSE and MeanSE than prefix-sum methods for any schedule in the analyzed class.
- Memory-efficient constructions allow the approach to be deployed in practical single- and multi-epoch private training.
- Empirical accuracy gains appear on both image (CIFAR-10) and text (IMDB) tasks without increasing the privacy budget.
- The bounds hold for a wide family of schedules, not just constant rates.
Where Pith is reading between the lines
- The same schedule-aware construction could be applied to other adaptive optimizers whose effective step sizes vary over time.
- If the schedule changes dynamically, an online update rule for the factorization would be needed to retain the gains.
- Lower error under MaxSE may be especially useful in settings where the worst-case gradient step dominates final model quality.
Load-bearing premise
The learning rate schedule must be known ahead of time so the factorization and noise correlations can be precomputed without extra privacy loss or runtime cost.
What would settle it
On CIFAR-10 or IMDB with a standard decay schedule, run private training using the proposed factorization versus the prefix-sum baseline and measure whether final test accuracy fails to improve under identical privacy parameters and noise scale.
Figures
read the original abstract
We study differentially private model training with stochastic gradient descent under learning rate scheduling and correlated noise. Although correlated noise, in particular via matrix factorizations, has been shown to improve accuracy, prior theoretical work focused primarily on the prefix-sum workload. That workload assumes a constant learning rate, whereas in practice learning rate schedules are widely used to accelerate training and improve convergence. We close this gap by deriving general upper and lower bounds for a broad class of learning rate schedules in both single- and multi-epoch settings. Building on these results, we propose a learning-rate-aware factorization that achieves improvements over prefix-sum factorizations under both MaxSE and MeanSE error metrics. Our theoretical analysis yields memory-efficient constructions suitable for practical deployment, and experiments on CIFAR-10 and IMDB datasets confirm that schedule-aware factorizations improve accuracy in private training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to close the gap between theoretical matrix-factorization noise mechanisms (previously limited to constant-learning-rate prefix-sum workloads) and practical DP-SGD by deriving general upper and lower bounds on MaxSE and MeanSE for arbitrary learning-rate schedules in both single- and multi-epoch regimes. It constructs schedule-aware, memory-efficient factorizations that provably improve on prefix-sum baselines, performs privacy accounting directly on the public schedule and resulting workload matrix, and reports accuracy gains on CIFAR-10 and IMDB under fixed privacy budgets.
Significance. If the bounds and empirical improvements hold, the work meaningfully extends the applicability of correlated-noise techniques to the learning-rate schedules that are standard in modern training, without introducing hidden privacy costs or prohibitive memory overhead. The explicit memory-efficient constructions and the use of standard datasets with reproducible accuracy numbers are strengths that support practical adoption.
minor comments (3)
- [§3] §3 (Workload Matrix Definition): clarify whether the sensitivity scaling induced by a time-varying learning rate is folded into the workload matrix W or handled separately in the privacy accountant; a short remark or equation would remove ambiguity for readers implementing the method.
- [Experiments] Figure 4 (CIFAR-10 accuracy curves): the legend should explicitly state the privacy budget ε and the number of epochs for each curve so that the reported gains can be directly compared to the theoretical error metrics.
- [Table 1] Table 1 (MaxSE/MeanSE comparisons): add a column or footnote indicating the rank or structural constraint used for the proposed factorization to allow fair reproduction of the memory-efficiency claim.
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our manuscript and for recommending minor revision. The referee correctly identifies the core contribution: extending matrix-factorization noise mechanisms from constant-learning-rate prefix-sum workloads to general learning-rate schedules while preserving memory efficiency and providing explicit privacy accounting.
Circularity Check
No significant circularity; derivation self-contained from workload matrix
full rationale
The paper derives general upper and lower bounds for learning-rate-aware factorizations directly from the workload matrix under known schedules, extending prefix-sum analysis without reducing any prediction or bound to a fitted parameter or self-defined quantity by construction. No self-citation is load-bearing for the central claims, no ansatz is smuggled, and the memory-efficient constructions plus MaxSE/MeanSE improvements are obtained from explicit matrix factorizations rather than renaming or reparameterizing prior results. The argument remains independent of the target accuracy gains, which are validated separately on CIFAR-10 and IMDB.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The learning rate schedule is fixed and known before training begins.
- standard math Correlated noise can be realized via matrix factorization without extra privacy leakage.
Reference graph
Works this paper leans on
-
[1]
Antti Koskela, Joonas J ¨alk¨o, Lukas Prediger, and Antti Honkela
arXiv preprint arXiv:2505.12128. Antti Koskela, Joonas J ¨alk¨o, Lukas Prediger, and Antti Honkela. Tight differential privacy for discrete-valued mechanisms and for the subsampled Gaussian mechanism using FFT. InConfer- ence on Uncertainty in Artificial Intelligence (AISTATS), 2021. Alexey Kurakin, Shuang Song, Steve Chien, Roxana Geambasu, Andreas Terzi...
-
[2]
m−jY k=1 1−α 2k−1 1−α 2k # · α−1 αl+1(1−α 2(j−l)−1)
We evaluate DP-SGD, BISR (w/o LRS), and BISR (w/ LRS) under four learning rate decay strategies: exponential, polynomial, linear, and cosine. All experiments use clipping normζ= 1and batch size128, for BISR we use bandwidthp= 64. Learning ratesηare tuned on a validation set for each decay setting. Dataset Method ζBSpLearning rateηby scheduler Exponential ...
work page 2024
-
[3]
=O vuut k n logp+ p b nX m=1 " χ2m log (min{m, p}) + 1 p m−1X t=p χ2 t # .(23) Proof.As shown in the proof of Theorem 1, the condition onχ t is sufficient to enforceQ= o(logn), and so 1√n ∥Bp χ∥F = Θ vuut 1 n nX m=1 " χ2m log (min{m, p}) + 1 p m−1X t=p χ2 t # from invoking Lemma 21. For the sensitivity sens(C p 1), we use the following bound...
work page 2025
-
[4]
Inserting the two bounds intoE(B p χ, Cp
=O r klogp+ kp b ! . Inserting the two bounds intoE(B p χ, Cp
-
[5]
Corollary 3.Letχ t =β t−1 n−1 withβ∈(0,1/e)
= 1√n ∥Bp χ∥F ·sens(C p 1)gives the statement. Corollary 3.Letχ t =β t−1 n−1 withβ∈(0,1/e). Then, in multi-participation withb-min-separation and at mostk=⌈ n b ⌉participations, we have forp ∗ ∼blogbthe following optimized upper bound: E(B p χ, Cp
-
[6]
=O √ klogn+k√ log(1/β) .(24) Proof.Asχ t satisfies the condition of Theorem 3, we have that E(B p χ, Cp
-
[7]
We will evaluate each of the two terms in the outer sum
=O vuut k n logp+ p b nX m=1 " α2(m−1) log (min{m, p}) + 1 p m−1X t=p α2(t−1) # , whereα=β 1 n−1 . We will evaluate each of the two terms in the outer sum. First off, nX m=1 α2(m−1) log (min{m, p})≤logp nX m=1 α2(m−1) = Θ nlogp log(1/β) , 35 where the last step follows from the proof of Lemma 12. Proceeding with the second term: 1 p nX m=1 m−1X t=p...
-
[8]
=O s k log(1/β) logp+ p b logp+ n p ! . As this exactly matches the error given in (Kalinin et al., 2025, Theorem 2), up to the1/ p log(1/β) factor, the upper bound is minimized for the choice ofp ∗ ∼blogbachieving error E(B p χ, Cp
work page 2025
-
[9]
=O √ klogn+kp log(1/β) ! , completing the proof. H MULTI-PARTICIPATION: LOWER BOUNDS Theorem 4(Lower bound for multi-participation).LetA χ =A 1Dχ, whereD χ = diag(χ1, . . . , χn)with positiveχ t >0. Assume any factorizationA χ =B×C. Then, in multi- participation withb-min-separation and at mostk=⌈ n b ⌉participations, we have E(B, C)≥max max t≤n √ k ...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.