Learning Rate Scheduling with Matrix Factorization for Private Training

Joel Daniel Andersson; Nikita P. Kalinin

arXiv: 2511.17994 · v2 · submitted 2025-11-22 · cs.LG · stat.ML

Pith reviewed 2026-05-17 06:08 UTC · model grok-4.3

keywords: differential privacy, matrix factorization, learning rate scheduling, stochastic gradient descent, correlated noise, private training, error bounds

The pith

Learning-rate-aware matrix factorizations reduce error in private SGD training with scheduled learning rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to adapt matrix factorizations for correlated noise to account for changing learning rates during differentially private training. Prior approaches assumed a fixed learning rate and used prefix-sum factorizations, but real training often employs schedules like decay or warmup to reach better convergence. By deriving bounds for a range of schedules and constructing factorizations that match them, the method lowers both maximum and average squared error in the noise. Experiments confirm higher final accuracy on standard image and text classification tasks under the same privacy budget. This matters because noise addition is the main accuracy cost in private training, and better correlation can preserve more signal without weakening the privacy guarantee.

Core claim

For a broad class of learning rate schedules in single- and multi-epoch settings, general upper and lower bounds on the error of matrix factorizations can be derived; a schedule-aware factorization then improves upon prefix-sum constructions in both MaxSE and MeanSE metrics while remaining memory-efficient.

What carries the argument

Learning-rate-aware matrix factorization, which incorporates the sequence of learning rates into the workload matrix to produce correlated noise whose variance matches the schedule.
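
As a rough illustration of what "incorporating the learning rates into the workload matrix" can look like, the sketch below builds the matrix A_χ = A_1 D_χ that appears in the lower-bound theorem quoted among the reference extracts on this page, where A_1 is the lower-triangular all-ones (prefix-sum) matrix and D_χ = diag(χ_1, ..., χ_n). The helper names (`schedule_workload`, `apply`) and the toy values are hypothetical and chosen only for this example.

```python
def schedule_workload(chi):
    """Return A_chi = A_1 D_chi as a list of rows.

    A_1 is the n x n lower-triangular all-ones (prefix-sum) matrix and
    D_chi = diag(chi), so entry (i, j) is chi[j] for j <= i and 0 otherwise.
    """
    n = len(chi)
    return [[chi[j] if j <= i else 0.0 for j in range(n)] for i in range(n)]


def apply(matrix, vec):
    """Matrix-vector product for small dense matrices stored as row lists."""
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]


# Toy schedule: the relative learning rate halves at every step (made-up values).
chi = [1.0, 0.5, 0.25, 0.125]
grads = [4.0, 4.0, 4.0, 4.0]  # stand-in scalar gradients

A_chi = schedule_workload(chi)
weighted_prefix_sums = apply(A_chi, grads)

# Row t holds sum_{s <= t} chi_s * g_s, the accumulated schedule-weighted update.
print(weighted_prefix_sums)  # [4.0, 6.0, 7.0, 7.5]
```

With a constant schedule (all χ_t = 1) the same function returns the plain prefix-sum matrix, which is the workload the earlier analyses assumed.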

If this is right

The new factorizations improve on prefix-sum factorizations under both MaxSE and MeanSE for schedules in the analyzed class.
Memory-efficient constructions allow the approach to be deployed in practical single- and multi-epoch private training.
Empirical accuracy gains appear on both image (CIFAR-10) and text (IMDB) tasks without increasing the privacy budget.
The bounds hold for a wide family of schedules, not just constant rates.
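
To make the two error metrics concrete, here is a minimal sketch that evaluates them for two simple factorizations of a schedule-weighted workload. It assumes the single-participation setting, where sens(C) is the largest column norm of C, takes MeanSE as (1/√n)·‖B‖_F·sens(C), and takes MaxSE as the largest row norm of B times sens(C); the MaxSE form is the usual definition in this literature and is an assumption here, not something stated on this page. Neither factorization below is the paper's construction: one is the trivial B = A_χ, C = I, the other folds the schedule into the square root of the prefix-sum matrix.

```python
import math


def sqrt_prefix_coeffs(n):
    """First n Taylor coefficients of (1 - x)^(-1/2): 1, 1/2, 3/8, ..."""
    coeffs = [1.0]
    for j in range(1, n):
        coeffs.append(coeffs[-1] * (2 * j - 1) / (2 * j))
    return coeffs


def lower_toeplitz(coeffs):
    """Lower-triangular Toeplitz matrix whose first column is coeffs."""
    n = len(coeffs)
    return [[coeffs[i - j] if j <= i else 0.0 for j in range(n)] for i in range(n)]


def matmul(X, Y):
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]


def l2(values):
    return math.sqrt(sum(v * v for v in values))


def sens_single(C):
    """Single-participation sensitivity: the largest column l2 norm of C."""
    return max(l2(col) for col in zip(*C))


def mean_se(B, C):
    """(1 / sqrt(n)) * ||B||_F * sens(C)."""
    return l2(v for row in B for v in row) / math.sqrt(len(B)) * sens_single(C)


def max_se(B, C):
    """(largest row l2 norm of B) * sens(C)."""
    return max(l2(row) for row in B) * sens_single(C)


n, beta = 64, 0.25
chi = [beta ** (t / (n - 1)) for t in range(n)]  # exponential decay, chi_1 = 1
A_chi = [[chi[j] if j <= i else 0.0 for j in range(n)] for i in range(n)]
identity = [[float(i == j) for j in range(n)] for i in range(n)]

# Factorization 1 (trivial): B = A_chi, C = I.
# Factorization 2 (illustrative): B = S, C = S D_chi with S = A_1^{1/2}, so S S = A_1.
S = lower_toeplitz(sqrt_prefix_coeffs(n))
C_sqrt = [[S[i][j] * chi[j] for j in range(n)] for i in range(n)]

# Sanity check that factorization 2 really reproduces A_chi.
product = matmul(S, C_sqrt)
assert all(abs(product[i][j] - A_chi[i][j]) < 1e-9 for i in range(n) for j in range(n))

for name, (B, C) in {"trivial": (A_chi, identity), "sqrt": (S, C_sqrt)}.items():
    print(f"{name:8s} MeanSE={mean_se(B, C):.3f}  MaxSE={max_se(B, C):.3f}")
```

The point of the sketch is only that both metrics are cheap to evaluate once B and C are fixed, so competing factorizations can be compared before any training run.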

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same schedule-aware construction could be applied to other adaptive optimizers whose effective step sizes vary over time.
If the schedule changes dynamically, an online update rule for the factorization would be needed to retain the gains.
Lower error under MaxSE may be especially useful in settings where the worst-case gradient step dominates final model quality.

Load-bearing premise

The learning rate schedule must be known ahead of time so the factorization and noise correlations can be precomputed without extra privacy loss or runtime cost.

What would settle it

On CIFAR-10 or IMDB with a standard decay schedule, run private training using the proposed factorization versus the prefix-sum baseline and measure whether final test accuracy fails to improve under identical privacy parameters and noise scale.

Figures

Figures reproduced from arXiv: 2511.17994 by Joel Daniel Andersson, Nikita P. Kalinin.

**Figure 2.** Multi-participation MeanSE error with matrix size …

**Figure 3.** CIFAR-10 results under $(9, 10^{-5})$-differential privacy. (a) Validation accuracy with exponential learning rate scheduling for different learning rates in DP-SGD. We report the points corresponding to the lowest learning rate; for example, a learning rate of 1/2 for β = 1/4 indicates that training starts with a learning rate of 2 and decays to 1/2. (b) Test accuracy across different matrix factorizations w…

**Figure 4.** Test accuracy of different learning rate schedulers for (a) BERT-base on IMDB and (b) …

**Figure 5.** Comparison of different LR schedulers (n = 2048) in single participation.

**Figure 6.** Multi-participation MeanSE error under different learning-rate schedulers (Polynomial …
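
The Figure 3 caption describes the exponential schedule by its endpoints: with β = 1/4, a reported learning rate of 1/2 means training starts at 2 and decays to 1/2. A small sketch of one schedule with that behaviour follows. It assumes the relative rate χ_t = β^((t−1)/(n−1)) from the Corollary 3 extract on this page, scaled by a starting rate; the function name and the scaling by `eta_start` are assumptions made for illustration.

```python
def exponential_schedule(eta_start, beta, n):
    """Learning rates eta_t = eta_start * beta**((t - 1) / (n - 1)) for t = 1..n."""
    if n == 1:
        return [eta_start]
    return [eta_start * beta ** (t / (n - 1)) for t in range(n)]


lrs = exponential_schedule(eta_start=2.0, beta=0.25, n=5)
print(lrs[0], lrs[-1])  # 2.0 0.5
```

The first step uses the starting rate unchanged and the last step uses `eta_start * beta`, which matches the "starts at 2, decays to 1/2" reading of the caption.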

Original abstract

We study differentially private model training with stochastic gradient descent under learning rate scheduling and correlated noise. Although correlated noise, in particular via matrix factorizations, has been shown to improve accuracy, prior theoretical work focused primarily on the prefix-sum workload. That workload assumes a constant learning rate, whereas in practice learning rate schedules are widely used to accelerate training and improve convergence. We close this gap by deriving general upper and lower bounds for a broad class of learning rate schedules in both single- and multi-epoch settings. Building on these results, we propose a learning-rate-aware factorization that achieves improvements over prefix-sum factorizations under both MaxSE and MeanSE error metrics. Our theoretical analysis yields memory-efficient constructions suitable for practical deployment, and experiments on CIFAR-10 and IMDB datasets confirm that schedule-aware factorizations improve accuracy in private training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper extends matrix factorization noise to non-constant learning rates in private SGD, with general bounds and memory-efficient constructions that beat prefix-sum baselines on error metrics and show accuracy gains in experiments.

The main point is that they close the gap between theory, which often assumes constant learning rates, and practice, which uses schedules. By making the factorization aware of the varying learning rate, they get lower MaxSE and MeanSE than the standard prefix-sum approach for both single- and multi-epoch training. The constructions stay memory-efficient, which is necessary for long runs, and the experiments on CIFAR-10 and IMDB report accuracy improvements at fixed privacy budgets. That combination of general bounds and concrete gains is the useful part.

Referee Report

0 major / 3 minor

Summary. The manuscript claims to close the gap between theoretical matrix-factorization noise mechanisms (previously limited to constant-learning-rate prefix-sum workloads) and practical DP-SGD by deriving general upper and lower bounds on MaxSE and MeanSE for a broad class of learning-rate schedules in both single- and multi-epoch regimes. It constructs schedule-aware, memory-efficient factorizations that provably improve on prefix-sum baselines, performs privacy accounting directly on the public schedule and resulting workload matrix, and reports accuracy gains on CIFAR-10 and IMDB under fixed privacy budgets.

Significance. If the bounds and empirical improvements hold, the work meaningfully extends the applicability of correlated-noise techniques to the learning-rate schedules that are standard in modern training, without introducing hidden privacy costs or prohibitive memory overhead. The explicit memory-efficient constructions and the use of standard datasets with reproducible accuracy numbers are strengths that support practical adoption.

minor comments (3)

[§3] §3 (Workload Matrix Definition): clarify whether the sensitivity scaling induced by a time-varying learning rate is folded into the workload matrix W or handled separately in the privacy accountant; a short remark or equation would remove ambiguity for readers implementing the method.
[Experiments] Figure 4 (CIFAR-10 accuracy curves): the legend should explicitly state the privacy budget ε and the number of epochs for each curve so that the reported gains can be directly compared to the theoretical error metrics.
[Table 1] Table 1 (MaxSE/MeanSE comparisons): add a column or footnote indicating the rank or structural constraint used for the proposed factorization to allow fair reproduction of the memory-efficiency claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our manuscript and for recommending minor revision. The referee correctly identifies the core contribution: extending matrix-factorization noise mechanisms from constant-learning-rate prefix-sum workloads to general learning-rate schedules while preserving memory efficiency and providing explicit privacy accounting.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained from workload matrix

The paper derives general upper and lower bounds for learning-rate-aware factorizations directly from the workload matrix under known schedules, extending prefix-sum analysis without reducing any prediction or bound to a fitted parameter or self-defined quantity by construction. No self-citation is load-bearing for the central claims, no ansatz is smuggled, and the memory-efficient constructions plus MaxSE/MeanSE improvements are obtained from explicit matrix factorizations rather than renaming or reparameterizing prior results. The argument remains independent of the target accuracy gains, which are validated separately on CIFAR-10 and IMDB.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions about differential privacy composition and noise addition mechanisms, plus the existence of matrix factorizations that can be precomputed for any fixed schedule.

axioms (2)

domain assumption: The learning rate schedule is fixed and known before training begins.
Invoked to allow precomputation of the factorization matrix.
standard math: Correlated noise can be realized via matrix factorization without extra privacy leakage.
Standard in the DP-SGD literature on correlated noise.
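
For readers unfamiliar with the second axiom, the sketch below shows the usual matrix-mechanism identity behind it: given a factorization A = BC, one adds independent Gaussian noise z to Cg and then applies B, so the output is Ag + Bz. The toy factorization (B = A_1, C = D_χ), the function names, and the noise scale are all hypothetical; in a real mechanism the noise scale would have to be calibrated to the sensitivity of C and the target privacy budget, which this example does not attempt.

```python
import random


def matvec(M, v):
    """Matrix-vector product for small dense matrices stored as row lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in M]


def correlated_release(B, C, grads, sigma, rng):
    """Return (B (C g + z), z) with z drawn i.i.d. from N(0, sigma^2)."""
    z = [rng.gauss(0.0, sigma) for _ in range(len(C))]
    encoded = [c + zi for c, zi in zip(matvec(C, grads), z)]
    return matvec(B, encoded), z


chi = [1.0, 0.5, 0.25]  # made-up relative learning rates
B = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]  # A_1 (prefix sums)
C = [[chi[j] if i == j else 0.0 for j in range(3)] for i in range(3)]  # D_chi
grads = [2.0, -1.0, 4.0]  # stand-in scalar gradients

rng = random.Random(0)
noisy, z = correlated_release(B, C, grads, sigma=1.0, rng=rng)

exact = matvec(B, matvec(C, grads))  # A_chi g without noise
noise_seen = [a - b for a, b in zip(noisy, exact)]
# noise_seen equals B z up to rounding: the noise reaching the output is
# correlated across steps through B, even though z itself is independent.
```

Different choices of B and C for the same product A leave the noise-free part unchanged and only reshape the term Bz, which is what the error metrics on this page measure.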

Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages

[1]

Antti Koskela, Joonas Jälkö, Lukas Prediger, and Antti Honkela

arXiv preprint arXiv:2505.12128. Antti Koskela, Joonas Jälkö, Lukas Prediger, and Antti Honkela. Tight differential privacy for discrete-valued mechanisms and for the subsampled Gaussian mechanism using FFT. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2021. Alexey Kurakin, Shuang Song, Steve Chien, Roxana Geambasu, Andreas Terzi...

work page arXiv 2021
[2]

$\prod_{k=1}^{m-j} \frac{1-\alpha^{2k-1}}{1-\alpha^{2k}} \Big] \cdot \frac{\alpha-1}{\alpha^{l+1}\,(1-\alpha^{2(j-l)-1})}$

We evaluate DP-SGD, BISR (w/o LRS), and BISR (w/ LRS) under four learning rate decay strategies: exponential, polynomial, linear, and cosine. All experiments use clipping norm $\zeta = 1$ and batch size 128; for BISR we use bandwidth $p = 64$. Learning rates $\eta$ are tuned on a validation set for each decay setting. Table columns: Dataset | Method | $\zeta$ | BS | $p$ | Learning rate $\eta$ by scheduler: Exponential ...

work page 2024
[3]

For the sensitivity $\mathrm{sens}(C_1^p)$, we use the following bound from (Kalinin et al., 2025, Theorem 2 proof) $\mathrm{sens}(C^p$ ...

$= O\!\left(\sqrt{\frac{k}{n}\left(\log p + \frac{p}{b}\right)\sum_{m=1}^{n}\left[\chi_m^2 \log(\min\{m,p\}) + \frac{1}{p}\sum_{t=p}^{m-1}\chi_t^2\right]}\right). \quad (23)$ Proof. As shown in the proof of Theorem 1, the condition on $\chi_t$ is sufficient to enforce $Q = o(\log n)$, and so $\frac{1}{\sqrt{n}}\|B_\chi^p\|_F = \Theta\!\left(\sqrt{\frac{1}{n}\sum_{m=1}^{n}\left[\chi_m^2 \log(\min\{m,p\}) + \frac{1}{p}\sum_{t=p}^{m-1}\chi_t^2\right]}\right)$ from invoking Lemma 21. For the sensitivity $\mathrm{sens}(C_1^p)$, we use the following bound...

work page 2025
[4]

Inserting the two bounds into $\mathcal{E}(B_\chi^p, C^p$ ...

$= O\!\left(\sqrt{k\log p + \frac{kp}{b}}\right)$. Inserting the two bounds into $\mathcal{E}(B_\chi^p, C^p$ ...

work page
[5]

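Corollary 3. Let $\chi_t = \beta^{\frac{t-1}{n-1}}$ with $\beta \in (0, 1/e)$

$= \frac{1}{\sqrt{n}}\|B_\chi^p\|_F \cdot \mathrm{sens}(C_1^p)$ gives the statement. Corollary 3. Let $\chi_t = \beta^{\frac{t-1}{n-1}}$ with $\beta \in (0, 1/e)$. Then, in multi-participation with $b$-min-separation and at most $k = \lceil n/b \rceil$ participations, we have for $p^* \sim b\log b$ the following optimized upper bound: $\mathcal{E}(B_\chi^p, C^p$ ...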

work page
[6]

$= O\!\left(\frac{\sqrt{k}\log n + k}{\sqrt{\log(1/\beta)}}\right). \quad (24)$ Proof. As $\chi_t$ satisfies the condition of Theorem 3, we have that $\mathcal{E}(B_\chi^p, C^p$ ...

work page
[7]

We will evaluate each of the two terms in the outer sum

$= O\!\left(\sqrt{\frac{k}{n}\left(\log p + \frac{p}{b}\right)\sum_{m=1}^{n}\left[\alpha^{2(m-1)}\log(\min\{m,p\}) + \frac{1}{p}\sum_{t=p}^{m-1}\alpha^{2(t-1)}\right]}\right)$, where $\alpha = \beta^{\frac{1}{n-1}}$. We will evaluate each of the two terms in the outer sum. First off, $\sum_{m=1}^{n}\alpha^{2(m-1)}\log(\min\{m,p\}) \le \log p \sum_{m=1}^{n}\alpha^{2(m-1)} = \Theta\!\left(\frac{n\log p}{\log(1/\beta)}\right)$, where the last step follows from the proof of Lemma 12. Proceeding with the second term: $\frac{1}{p}\sum_{m=1}^{n}\sum_{t=p}^{m-1}$ ...

work page
[8]

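$= O\!\left(\sqrt{\frac{k}{\log(1/\beta)}\left(\log p + \frac{p}{b}\right)\left(\log p + \frac{n}{p}\right)}\right)$. As this exactly matches the error given in (Kalinin et al., 2025, Theorem 2), up to the $1/\sqrt{\log(1/\beta)}$ factor, the upper bound is minimized for the choice of $p^* \sim b\log b$ achieving error $\mathcal{E}(B_\chi^p, C^p$ ...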
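
The step from this bound to the optimized rate can be checked by direct substitution. The worked sketch below assumes the bound groups as $\sqrt{\frac{k}{\log(1/\beta)}(\log p + \frac{p}{b})(\log p + \frac{n}{p})}$, which is a reading of the extracted text rather than a verified transcription, and it uses $k = \lceil n/b \rceil$ and $b \le n$.

```latex
% Substituting p^* = b \log b into the (assumed) grouping of the upper bound.
\begin{align*}
\log p^* &= \log b + \log\log b = \Theta(\log b), \qquad
\frac{p^*}{b} = \log b, \qquad
\frac{n}{p^*} = \frac{n}{b\log b} \le \frac{k}{\log b}, \\
\Bigl(\log p^* + \frac{p^*}{b}\Bigr)\Bigl(\log p^* + \frac{n}{p^*}\Bigr)
  &= O\Bigl(\log b \cdot \Bigl(\log b + \frac{k}{\log b}\Bigr)\Bigr)
   = O\bigl(\log^2 b + k\bigr), \\
\sqrt{\frac{k}{\log(1/\beta)}\bigl(\log^2 b + k\bigr)}
  &\le \frac{\sqrt{k}\,\log b + k}{\sqrt{\log(1/\beta)}}
   \le \frac{\sqrt{k}\,\log n + k}{\sqrt{\log(1/\beta)}}.
\end{align*}
```

The last line uses $\sqrt{x + y} \le \sqrt{x} + \sqrt{y}$ and agrees with the rate numbered (24) in the extracts on this page.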

work page 2025
[9]

H MULTI-PARTICIPATION: LOWER BOUNDS. Theorem 4 (Lower bound for multi-participation). Let $A_\chi = A_1 D_\chi$, where $D_\chi = \mathrm{diag}(\chi_1,$ ...

$= O\!\left(\frac{\sqrt{k}\log n + k}{\sqrt{\log(1/\beta)}}\right)$, completing the proof. H MULTI-PARTICIPATION: LOWER BOUNDS. Theorem 4 (Lower bound for multi-participation). Let $A_\chi = A_1 D_\chi$, where $D_\chi = \mathrm{diag}(\chi_1, \dots, \chi_n)$ with positive $\chi_t > 0$. Assume any factorization $A_\chi = B \times C$. Then, in multi-participation with $b$-min-separation and at most $k = \lceil n/b \rceil$ participations, we have $\mathcal{E}(B, C) \ge \max\{\max_{t \le n}$ √k ...

work page 2025
