Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States

Eli Chien; Pan Li; Wei-Ning Chen

arxiv: 2506.00158 · v3 · submitted 2025-05-30 · 💻 cs.LG

Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States

Eli Chien , Wei-Ning Chen , Pan Li This is my paper

Pith reviewed 2026-05-19 11:53 UTC · model grok-4.3

classification 💻 cs.LG

keywords differential privacyzeroth-order optimizationprivacy amplificationhidden statesmachine learningcoupling analysis

0 comments

The pith

Hybrid noise and coupling analysis deliver the first convergent privacy bounds for zeroth-order optimization with hidden states.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to close the gap between first-order and zeroth-order methods in differentially private optimization by deriving privacy bounds that tighten over iterations. Zeroth-order approaches appeal for memory-constrained fine-tuning of large models because they avoid storing gradients, yet their directional noise creates anisotropic updates that break the standard tools for tracking privacy amplification. The authors introduce a hybrid noise mechanism and construct a coupled auxiliary process that sidesteps the failed global Lipschitz condition, producing the first convergent hidden-state DP bounds for these methods. If correct, this directly enables tighter privacy accounting and new algorithm designs that were previously unavailable.

Core claim

We provide the first convergent hidden-state DP bound for zeroth-order optimization by proposing a hybrid noise mechanism and a novel coupling analysis. We bypass the purely shifted-divergence approach by constructing a coupled auxiliary process, which circumvents the global Lipschitz barrier and yields a convergent privacy bound. Furthermore, our results induce better DP zeroth-order algorithmic designs that are previously unknown to the literature.

What carries the argument

Coupled auxiliary process built from a hybrid noise mechanism that tracks privacy amplification for anisotropic directional updates without requiring the global Lipschitz property.

If this is right

Convergent differential privacy bounds are now available for hidden-state zeroth-order optimization.
The analysis directly supports new algorithmic designs for DP zeroth-order methods that improve on prior approaches.
Privacy amplification by iteration extends to memory-efficient optimization settings that inject scalar noise along random directions.
Tighter noise calibration becomes possible while still guaranteeing formal privacy that improves over training iterations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These bounds could let practitioners run private zeroth-order fine-tuning of large models with less noise than conservative analyses require.
The coupling construction may generalize to other directional or anisotropic noise settings in private optimization.
Empirical checks on actual model training runs would test how loose or tight the new privacy accounting proves in practice.

Load-bearing premise

A coupled auxiliary process can be maintained that accurately follows the privacy loss even though anisotropic updates destroy the global Lipschitz condition required by earlier frameworks.

What would settle it

A numerical simulation or proof that the privacy loss remains bounded away from zero or diverges with more iterations under the hybrid mechanism and coupling would refute the convergent bound.

Figures

Figures reproduced from arXiv: 2506.00158 by Eli Chien, Pan Li, Wei-Ning Chen.

read the original abstract

Zeroth-order optimization has emerged as a promising approach for fine-tuning large language models under differential privacy (DP) and memory constraints. While privacy amplification by iteration (PABI) provides convergent DP bounds for first-order methods, establishing similar guarantees for zeroth-order methods remains an open problem. First-order PABI analysis relies on the fact that gradients are perturbed with isotropic noise, allowing privacy bounds to be iteratively tracked via shifted R\'enyi divergence. In contrast, DP zeroth-order methods inject scalar noise along random update directions to maintain utility. This anisotropic update fails standard shifted divergence frameworks, as the global Lipschitz property no longer holds almost surely. We provide the first convergent hidden-state DP bound for zeroth-order optimization by proposing a hybrid noise mechanism and a novel coupling analysis. We bypass the purely shifted-divergence approach by constructing a coupled auxiliary process, which circumvents the global Lipschitz barrier and yields a convergent privacy bound. Furthermore, our results induce better DP zeroth-order algorithmic designs that are previously unknown to the literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper delivers a hybrid noise mechanism and coupling analysis for the first convergent hidden-state DP bound in zeroth-order optimization, but the contraction under anisotropy still needs explicit verification.

read the letter

The main point is that this work gives a convergent privacy bound for zeroth-order methods by combining a hybrid noise mechanism with a coupling analysis that sidesteps the global Lipschitz failure in standard shifted Rényi divergence tracking. This directly targets the open gap between first-order PABI results and the anisotropic, directional noise used in memory-efficient DP fine-tuning for large models. They show how the auxiliary process can be constructed to match the required behavior for iterative privacy tracking, and this construction apparently yields previously unknown algorithmic tweaks that improve the setup. That part is a clear technical step forward and earns credit for engaging the specific obstacle in the literature. The soft spot sits in the contraction step itself. The hybrid noise must still produce a divergence contraction factor strictly less than one at each iteration even though updates remain random and directional. The derivations attempt to control this through the coupled process, but the resulting constants appear somewhat loose and depend on the mixture weights in ways that could limit practical gains. If the conditional properties do not hold as assumed, the iterated bound may not improve much over basic mechanisms. This paper is aimed at specialists working on private optimization and memory-constrained training. Readers who already know PABI and want to extend it to zeroth-order settings will extract useful ideas from the construction. It deserves a serious referee because the problem is concrete, the approach is new, and the details are checkable even if revisions are likely needed to tighten the bounds.

Referee Report

2 major / 2 minor

Summary. The paper claims to resolve an open problem in differentially private zeroth-order optimization by deriving the first convergent hidden-state DP bound. It introduces a hybrid noise mechanism (combining isotropic and directional components) and a novel coupling analysis that constructs an auxiliary process to bypass the failure of global Lipschitz continuity under anisotropic updates, thereby enabling iterative tracking of privacy via a modified Rényi divergence that contracts over iterations.

Significance. If the coupling analysis is correct and produces a strictly contractive bound under the hybrid mechanism, the result would be significant for DP optimization of large models under memory constraints. It provides the first convergent PABI-style guarantee for zeroth-order methods and suggests improved algorithmic designs. The work is technically novel in its handling of anisotropy but its impact depends on whether the auxiliary process rigorously controls the divergence contraction without hidden assumptions.

major comments (2)

[§4.3, Theorem 3] §4.3, Theorem 3 (Convergence of the coupled process): The proof sketch shows that the auxiliary process matches the marginal distribution of the original mechanism, but it is unclear whether the conditional Lipschitz constant remains bounded almost surely under the random directional noise component. Without an explicit bound on the contraction factor (e.g., showing it is strictly less than 1 uniformly over directions), the iterated Rényi divergence may fail to converge, undermining the central claim of a convergent hidden-state bound.
[§3.2] §3.2, Definition of the hybrid noise mechanism: The mixture parameter α between isotropic and directional noise is introduced without a clear derivation showing that it preserves the necessary martingale properties for the coupling argument. If α is chosen post-hoc to force contraction, this would weaken the parameter-free nature of the resulting DP bound.

minor comments (2)

[§2.1] The notation for the shifted Rényi divergence in §2.1 is introduced without explicitly contrasting it to the standard definition used in prior PABI works; adding a short comparison table would improve clarity.
[Figure 2] Figure 2 (empirical privacy-utility curves) lacks error bars or multiple random seeds; reporting variance across runs would strengthen the experimental validation of the theoretical bounds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable comments, which help strengthen the presentation of our coupling analysis and hybrid mechanism. We address each major comment below with clarifications and will incorporate explicit derivations and bounds into the revised manuscript.

read point-by-point responses

Referee: [§4.3, Theorem 3] §4.3, Theorem 3 (Convergence of the coupled process): The proof sketch shows that the auxiliary process matches the marginal distribution of the original mechanism, but it is unclear whether the conditional Lipschitz constant remains bounded almost surely under the random directional noise component. Without an explicit bound on the contraction factor (e.g., showing it is strictly less than 1 uniformly over directions), the iterated Rényi divergence may fail to converge, undermining the central claim of a convergent hidden-state bound.

Authors: We appreciate the referee's focus on the almost-sure boundedness of the conditional Lipschitz constant. The hybrid noise construction in our analysis ensures that the isotropic component dominates the directional perturbation sufficiently to yield a uniform bound on the effective Lipschitz constant of the coupled updates, independent of the realized direction. In the revised proof of Theorem 3, we will explicitly derive the contraction factor of the modified Rényi divergence as at most 1 - β, where β > 0 depends on the noise variances and the mixture weight but is strictly positive and uniform over all directions. This establishes iterative contraction without additional assumptions. revision: yes
Referee: [§3.2] §3.2, Definition of the hybrid noise mechanism: The mixture parameter α between isotropic and directional noise is introduced without a clear derivation showing that it preserves the necessary martingale properties for the coupling argument. If α is chosen post-hoc to force contraction, this would weaken the parameter-free nature of the resulting DP bound.

Authors: The mixture parameter α is derived directly from the requirement that the hybrid noise process remains a martingale with respect to the natural filtration of the optimization trajectory, ensuring unbiased updates and controlled variance that enable the auxiliary process to match marginals while contracting the divergence. We will expand §3.2 with the explicit derivation of α from these martingale conditions, confirming that the choice is not post-hoc and that the final DP bound remains parameter-free in the sense that α can be expressed in closed form from the problem parameters. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on novel coupling construction.

full rationale

The paper's central contribution is a hybrid noise mechanism paired with a new coupled auxiliary process that circumvents the global Lipschitz failure of anisotropic zeroth-order updates under shifted Rényi divergence. This is presented as an independent technical development that yields the first convergent hidden-state DP bound, without any indication in the abstract or description that the bound reduces to a fitted parameter, self-citation chain, or re-derivation of prior results by construction. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the validity of the proposed hybrid noise mechanism and coupled auxiliary process, which are introduced to bypass the Lipschitz barrier; no free parameters or explicit axioms are detailed in the abstract.

axioms (1)

domain assumption Standard differential privacy assumptions including bounded sensitivity of the loss and appropriate noise distributions for privacy.
Invoked implicitly to set up the DP zeroth-order framework before introducing the new mechanism.

invented entities (2)

Hybrid noise mechanism no independent evidence
purpose: Inject scalar noise along random directions while enabling privacy amplification tracking
New construction proposed to handle the anisotropic nature of zeroth-order updates.
Coupled auxiliary process no independent evidence
purpose: Track privacy evolution via coupling instead of shifted Rényi divergence
Invented technical device to circumvent the global Lipschitz property failure.

pith-pipeline@v0.9.0 · 5709 in / 1240 out tokens · 32698 ms · 2026-05-19T11:53:15.917877+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 3 internal anchors

[1]

In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security

Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security. 308–318. Jason Altschuler and Kunal Talwar

work page 2016
[2]

Advances in Neural Information Processing Systems 35 (2022), 3788–3800

Privacy of noisy stochastic gradient descent: More iterations without more privacy loss. Advances in Neural Information Processing Systems 35 (2022), 3788–3800. Jason Altschuler and Kunal Talwar

work page 2022
[3]

On the Privacy of Noisy Stochastic Gradient Descent for Convex Optimization. SIAM J. Comput. 53, 4 (2024), 969–1001. Meenatchi Sundaram Muthu Selva Annamalai

work page 2024
[4]

arXiv preprint arXiv:2407.06496 (2024)

It’s Our Loss: No Privacy Amplification for Hidden State DP-SGD With Non-Convex Loss. arXiv preprint arXiv:2407.06496 (2024). Shahab Asoodeh and Mario Diaz

work page arXiv 2024
[5]

arXiv preprint arXiv:2305.09903 (2023)

Privacy Loss of Noisy Stochastic Gradient Descent Might Converge Even for Non-Convex Losses. arXiv preprint arXiv:2305.09903 (2023). Borja Balle, Gilles Barthe, and Marco Gaboardi

work page arXiv 2023
[6]

Advances in neural information processing systems 31 (2018)

Privacy amplification by subsampling: Tight analyses via couplings and divergences. Advances in neural information processing systems 31 (2018). Tudor Cebere, Aurélien Bellet, and Nicolas Papernot

work page 2018
[7]

arXiv preprint arXiv:2405.14457 (2024)

Tighter privacy auditing of dp-sgd in the hidden state threat model. arXiv preprint arXiv:2405.14457 (2024). Eli Chien and Pan Li

work page arXiv 2024
[8]

Advances in Neural Information Processing Systems 34 (2021), 14771–14781

Differential privacy dynamics of langevin diffusion and noisy gradient descent. Advances in Neural Information Processing Systems 34 (2021), 14771–14781. John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono

work page 2021
[9]

IEEE Transactions on Information Theory 61, 5 (2015), 2788–2806

Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory 61, 5 (2015), 2788–2806. 10 Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith

work page 2015
[10]

In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7,

Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7,

work page 2006
[11]

In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS)

Privacy amplification by iteration. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 521–532. Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan

work page 2018
[12]

Probability and Mathematical Statistics 30, 2 (2010), 339–351

Inequalities for quantiles of the chi-square distribution. Probability and Mathematical Statistics 30, 2 (2010), 339–351. Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan

work page 2010
[13]

A Short Note on Concentration Inequalities for Random Vectors with SubGaussian Norm

A short note on concentration inequalities for random vectors with subgaussian norm. arXiv preprint arXiv:1902.03736 (2019). Peter Kairouz, Brendan McMahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, and Zheng Xu

work page internal anchor Pith review Pith/arXiv arXiv 1902
[14]

Scaling Laws for Neural Language Models

Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020). Weiwei Kong and Mónica Ribero

work page internal anchor Pith review Pith/arXiv arXiv 2001
[15]

arXiv preprint arXiv:2407.05237 (2024)

Privacy of the last iterate in cyclically-sampled DP-SGD on nonconvex composite losses. arXiv preprint arXiv:2407.05237 (2024). Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto

work page arXiv 2024
[16]

In International Conference on Learning Represen- tations

Large Language Models Can Be Strong Differentially Private Learners. In International Conference on Learning Represen- tations. https://openreview.net/forum?id=bVuP3ltATMz Xinyue Liu, Hualin Zhang, Bin Gu, and Hong Chen. 2024b. General Stability Analysis for Zeroth- Order Optimization Algorithms. In The Twelfth International Conference on Learning Represe...

work page arXiv 2024
[17]

On the sub-Gaussianity of the Beta and Dirichlet distributions. (2017). Ilya Mironov

work page 2017
[18]

In 2017 IEEE 30th computer security foundations symposium (CSF)

Rényi differential privacy. In 2017 IEEE 30th computer security foundations symposium (CSF). IEEE, 263–275. Ilya Mironov, Kunal Talwar, and Li Zhang

work page 2017
[19]

arXiv preprint arXiv:1908.10530 (2019)

R\’enyi differential privacy of the sampled gaussian mechanism. arXiv preprint arXiv:1908.10530 (2019). Yurii Nesterov and Vladimir Spokoiny

work page arXiv 1908
[20]

Foundations of Computational Mathematics 17, 2 (2017), 527–566

Random gradient-free minimization of convex functions. Foundations of Computational Mathematics 17, 2 (2017), 527–566. Maciej Skorski

work page 2017
[21]

Modern Stochastics: Theory and Applications 10, 2 (2023), 211–228

Bernstein-type bounds for beta distribution. Modern Stochastics: Theory and Applications 10, 2 (2023), 211–228. Thomas Steinke, Milad Nasr, Arun Ganesh, Borja Balle, Christopher A Choquette-Choo, Matthew Jagielski, Jamie Hayes, Abhradeep Guha Thakurta, Adam Smith, and Andreas Terzis

work page 2023
[22]

arXiv preprint arXiv:2410.06186 (2024)

The last iterate advantage: Empirical auditing and principled heuristic analysis of differentially private sgd. arXiv preprint arXiv:2410.06186 (2024). 11 Xinyu Tang, Ashwinee Panda, Milad Nasr, Saeed Mahloujifar, and Prateek Mittal

work page arXiv 2024
[23]

arXiv preprint arXiv:2401.04343 (2024)

Private fine- tuning of large language models with zeroth-order optimization. arXiv preprint arXiv:2401.04343 (2024). Roman Vershynin

work page arXiv 2024
[24]

Introduction to the non-asymptotic analysis of random matrices

Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 (2010). Rongzhe Wei, Eli Chien, and Pan Li

work page internal anchor Pith review Pith/arXiv arXiv 2010
[25]

Advances in Neural Information Processing Systems 35 (2022), 703–715

Differentially private learning needs hidden state (or much faster convergence). Advances in Neural Information Processing Systems 35 (2022), 703–715. Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, and Huishuai Zhang

work page 2022
[26]

In International Workshop on Federated Learning in the Age of Foundation Models in Conjunction with NeurIPS 2023

DPZero: Dimension- Independent and Differentially Private Zeroth-Order Optimization. In International Workshop on Federated Learning in the Age of Foundation Models in Conjunction with NeurIPS 2023 . https://openreview.net/forum?id=s7hquGszME Qinzi Zhang, Hoang Tran, and Ashok Cutkosky

work page 2023
[27]

In The Twelfth International Conference on Learning Representations

Private Zeroth-Order Nonsmooth Nonconvex Optimization. In The Twelfth International Conference on Learning Representations . https: //openreview.net/forum?id=IzqZbNMZ0M A Appendix A.1 Limitations While we have established the first convergent DP guarantee for the ZOGD method, it is still far from ready to be directly applied to LLM (or any other general n...

work page 2022
[28]

We set the radius of the projected set to be R = 1, smooth constant M = 1, strongly convex constant m = 0.9, and the clipped norm ∆ = 1

We report the privacy loss of Noisy-ZOGD (3) under different privacy analysis. We set the radius of the projected set to be R = 1, smooth constant M = 1, strongly convex constant m = 0.9, and the clipped norm ∆ = 1 . For a fair comparison, we first compute the RDP budget for each analysis and then convert it to DP by the standard conversion (Mironov, 2017...

work page 2017
[29]

12 A.4 Proof of Theorem 3.1 The proof is quite standard in the DP literature (i.e., based on the analysis of Mironov (2017))

For our approach, we adopted the same simplification as in the proof of Corollary 3.3 that applies to Theorem 3.2 for strongly convex problems and kept all constants to be explicit. 12 A.4 Proof of Theorem 3.1 The proof is quite standard in the DP literature (i.e., based on the analysis of Mironov (2017)). We state it here for completeness. Proof. Case βt...

work page 2017
[30]

In the meanwhile, the Lipschitz constant c of the first order gradient update map ϕ is as follows: If ℓi are M-smooth and m-strongly convex, then if η K ≤ 1 M we have c = 1 − ηm K

This is important since it will directly impact the error probability 1 − P(G) in the exponent. In the meanwhile, the Lipschitz constant c of the first order gradient update map ϕ is as follows: If ℓi are M-smooth and m-strongly convex, then if η K ≤ 1 M we have c = 1 − ηm K . With these choices and setup, the bound in Theorem 3.2 for any τ, βt, at become...

work page 2025
[31]

Additionally, for simplicity we choose βt = 1/2

We can choose such a solution of a′ t and plug them in our bound. Additionally, for simplicity we choose βt = 1/2. Together we have Dα(WT |TT −1 t=τ Gt ||W ′ T |TT −1 t=τ Gt ) ≤ (T − τ) α σ2 (2∆ n )2 + 2αc2 2d η2σ2 + 4α min(2R, 2Cτ√ K ) 2 d (T − τ)η2σ2 . (25) Recall that c2 = ηM ξ. When we choose ξ ≤ 2∆ nηM √ 2d, we have α σ2 ( 2∆ n )2 + 2αc2 2d η2σ2 ≤ 2α...

work page 2017
[32]

Together we complete the proof. A.9 Minibatch extensions Now, we discuss the minibatch generalizations of Noisy-ZOGD (Theorem 3.2), which we called the corresponding algorithm Noisy-ZOSGD (under subsampling without replacement). 18 Subsampling without replacement. For this strategy, the Noisy-ZOSGD update is defined as follows. wt+1 = ΠBR " wt − η K KX k=...

work page 2019
[33]

(65) Note that Sα can be computed in practice with a numerically stable procedure for precise computa- tion Mironov (2017)

and noise parameter σ > 0, define Sα(q, σ) = Dα(N(0, σ2)||(1 − q)N(0, σ2) + qN (1, σ2)). (65) Note that Sα can be computed in practice with a numerically stable procedure for precise computa- tion Mironov (2017). Now we are ready to state the result for subsampling without replacement. Theorem A.3 (RDP guarantee of Noisy-ZOSGD). Assume ℓ(·, x) are M-smoot...

work page 2017
[34]

If ℓ is also m-strongly convex and choose η ≤ K/M, we have c = 1 − ηm K . The proof mainly combines the proof of shifted Rényi divergence analysis of the first-order counter- part Chien and Li (2025); Altschuler and Talwar (2022) and the proof of our Theorem 3.2. Before we state our proof, let us first introduce a technical lemma before we introduce our p...

work page 2025
[35]

In practice, Sα(q, σ) is computed via numerical integral for the tightest possible privacy accounting

and the noise parameter σ > 0, dimension d ∈ N and radius R > 0, sup µ∈P(BR) Dα(N(0, σ2Id)||(1 − q)N(0, σ2Id) + q(N(0, σ2Id) ∗ µ)) = Sα(q, σ/R), (66) where P(BR) denotes set of all Borel probability distributions over the ℓ2 ball of radius R in Rd. In practice, Sα(q, σ) is computed via numerical integral for the tightest possible privacy accounting. Never...

work page 2022
[36]

The averaged loss function is twice differentiable with −H ⪯ ∇2L(w; D) ⪯ H for any w ∈ Rd, and its minimum is finite

The loss function is L-Lipschitz and M-smooth. The averaged loss function is twice differentiable with −H ⪯ ∇2L(w; D) ⪯ H for any w ∈ Rd, and its minimum is finite. Here, 0 ⪯ H is a real-valued d by d matrix such that ∥H∥2 ≤ M and T r(H) ≤ r∥H∥2 for for some r as the effective rank or the 20 intrinsic dimension of the problem. These assumptions are exactl...

work page 2023
[37]

(88) where (a) is due to M-smoothness and the elementary inequality (a + b)2 ≤ 2a2 + 2b2

E≤t[∥∇¯ℓ(wt)∥2] + 1 2d M3ξ2r (87) = 2(2 + r)M d(d + 2)P(Qt) E≤t[∥∇¯ℓ(wt)∥2] + M3ξ2r 2dP(Qt) . (88) where (a) is due to M-smoothness and the elementary inequality (a + b)2 ≤ 2a2 + 2b2. (b) and (c) are due to Lemma C.1 in Zhang et al. (2023). Similarly we have KX k=1 E≤t[uT t,kHu t,k|Qt] ≤ rM K dP(Qt) . (89) For the inner-product term, let us denote u′ = √ ...

work page 2023
[38]

(101) If we further assume that |¯ℓ(w)| ≤ B for any w

Then we have E≤t[∥∇¯ℓ(wt)∥2] ≤ 4dP(Qt) η E≤t[¯ℓ(wt) − ¯ℓ(wt+1)|Qt] + M2ξ2d3 + 4L2dP( ¯Qt) (100) + 2drM η σ2 1,t K d + σ2 2,tP(Qt) . (101) If we further assume that |¯ℓ(w)| ≤ B for any w. Then following the same analysis of Zhang et al . (2023) we have E≤t[¯ℓ(wt) − ¯ℓ(wt+1)|Qt]P(Qt) ≤ E≤t[¯ℓ(wt) − ¯ℓ(wt+1)|Q]P(Q) + 2BP( ¯Q). (102) As a result we have E≤t[∥...

work page 2023
[39]

Lemma B.3

by following the analysis of Lemma 5.5 of Vershynin (2010) in order to facilitate our non-asymptotic bounds. Lemma B.3. Let Bk i.i.d. ∼ Beta 1 2 , d−1 2 . Then it holds that P ( KX k=1 Bk ≥ 0.15 − log(1/δ) K ) + · K 20 d + 1 + √ d − 2 ) ≥ 1 − δ, where (·)+ denotes max(·, 0). Proof. Let B ∼ Beta 1 2 , d−1 2 . Then B (d) = W1 W1+W2 where W1 ∼ χ2 (1) and W2 ...

work page 2010
[40]

Since W1 (d) = Z2 for Z ∼ N (0, 1), by the lower bound of the Q-function, it holds that P {W1 ≥ κ1} = P {|Z| ≥ √κ1} = 2 · Q (√κ1) ≥ √κ1 1 + κ1 e−κ1/2 √ 2π

are independent. Since W1 (d) = Z2 for Z ∼ N (0, 1), by the lower bound of the Q-function, it holds that P {W1 ≥ κ1} = P {|Z| ≥ √κ1} = 2 · Q (√κ1) ≥ √κ1 1 + κ1 e−κ1/2 √ 2π . On the other hand, according to the quantile upper bound on the χ2 random variables (see, for instance, (Inglot, 2010, Theorem A)), we have P n W2 ≤ (d −

work page 2010
[41]

Then w⊥ 2 2 = R2 z2 1 Pd j=2 z2 j Pd j=1 z2 j 2 ≤ R2 z2 1 Pd j=1 z2 j Pd j=1 z2 j 2 = r2 z2 1Pd j=1 z2 j ≜ B, where B ∼ Beta 1 2 , d−1 2

for j ∈ [d]. Then w⊥ 2 2 = R2 z2 1 Pd j=2 z2 j Pd j=1 z2 j 2 ≤ R2 z2 1 Pd j=1 z2 j Pd j=1 z2 j 2 = r2 z2 1Pd j=1 z2 j ≜ B, where B ∼ Beta 1 2 , d−1 2 . As a result, by Marchal and Arbel (2017, Theorem 1), w⊥ is 1 4(1/2+d/2+1) = 1 2d+6 norm-sub-Gaussian, so applying Lemma B.2 yields KX k=1 w⊥ k 2 2 ≤ ∥v∥2 K 2d + 6 log 2d δb , with probability at least 1 − ...

work page 2017

[1] [1]

In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security

Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security. 308–318. Jason Altschuler and Kunal Talwar

work page 2016

[2] [2]

Advances in Neural Information Processing Systems 35 (2022), 3788–3800

Privacy of noisy stochastic gradient descent: More iterations without more privacy loss. Advances in Neural Information Processing Systems 35 (2022), 3788–3800. Jason Altschuler and Kunal Talwar

work page 2022

[3] [3]

On the Privacy of Noisy Stochastic Gradient Descent for Convex Optimization. SIAM J. Comput. 53, 4 (2024), 969–1001. Meenatchi Sundaram Muthu Selva Annamalai

work page 2024

[4] [4]

arXiv preprint arXiv:2407.06496 (2024)

It’s Our Loss: No Privacy Amplification for Hidden State DP-SGD With Non-Convex Loss. arXiv preprint arXiv:2407.06496 (2024). Shahab Asoodeh and Mario Diaz

work page arXiv 2024

[5] [5]

arXiv preprint arXiv:2305.09903 (2023)

Privacy Loss of Noisy Stochastic Gradient Descent Might Converge Even for Non-Convex Losses. arXiv preprint arXiv:2305.09903 (2023). Borja Balle, Gilles Barthe, and Marco Gaboardi

work page arXiv 2023

[6] [6]

Advances in neural information processing systems 31 (2018)

Privacy amplification by subsampling: Tight analyses via couplings and divergences. Advances in neural information processing systems 31 (2018). Tudor Cebere, Aurélien Bellet, and Nicolas Papernot

work page 2018

[7] [7]

arXiv preprint arXiv:2405.14457 (2024)

Tighter privacy auditing of dp-sgd in the hidden state threat model. arXiv preprint arXiv:2405.14457 (2024). Eli Chien and Pan Li

work page arXiv 2024

[8] [8]

Advances in Neural Information Processing Systems 34 (2021), 14771–14781

Differential privacy dynamics of langevin diffusion and noisy gradient descent. Advances in Neural Information Processing Systems 34 (2021), 14771–14781. John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono

work page 2021

[9] [9]

IEEE Transactions on Information Theory 61, 5 (2015), 2788–2806

Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory 61, 5 (2015), 2788–2806. 10 Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith

work page 2015

[10] [10]

In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7,

Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7,

work page 2006

[11] [11]

In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS)

Privacy amplification by iteration. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 521–532. Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan

work page 2018

[12] [12]

Probability and Mathematical Statistics 30, 2 (2010), 339–351

Inequalities for quantiles of the chi-square distribution. Probability and Mathematical Statistics 30, 2 (2010), 339–351. Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan

work page 2010

[13] [13]

A Short Note on Concentration Inequalities for Random Vectors with SubGaussian Norm

A short note on concentration inequalities for random vectors with subgaussian norm. arXiv preprint arXiv:1902.03736 (2019). Peter Kairouz, Brendan McMahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, and Zheng Xu

work page internal anchor Pith review Pith/arXiv arXiv 1902

[14] [14]

Scaling Laws for Neural Language Models

Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020). Weiwei Kong and Mónica Ribero

work page internal anchor Pith review Pith/arXiv arXiv 2001

[15] [15]

arXiv preprint arXiv:2407.05237 (2024)

Privacy of the last iterate in cyclically-sampled DP-SGD on nonconvex composite losses. arXiv preprint arXiv:2407.05237 (2024). Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto

work page arXiv 2024

[16] [16]

In International Conference on Learning Represen- tations

Large Language Models Can Be Strong Differentially Private Learners. In International Conference on Learning Represen- tations. https://openreview.net/forum?id=bVuP3ltATMz Xinyue Liu, Hualin Zhang, Bin Gu, and Hong Chen. 2024b. General Stability Analysis for Zeroth- Order Optimization Algorithms. In The Twelfth International Conference on Learning Represe...

work page arXiv 2024

[17] [17]

On the sub-Gaussianity of the Beta and Dirichlet distributions. (2017). Ilya Mironov

work page 2017

[18] [18]

In 2017 IEEE 30th computer security foundations symposium (CSF)

Rényi differential privacy. In 2017 IEEE 30th computer security foundations symposium (CSF). IEEE, 263–275. Ilya Mironov, Kunal Talwar, and Li Zhang

work page 2017

[19] [19]

arXiv preprint arXiv:1908.10530 (2019)

R\’enyi differential privacy of the sampled gaussian mechanism. arXiv preprint arXiv:1908.10530 (2019). Yurii Nesterov and Vladimir Spokoiny

work page arXiv 1908

[20] [20]

Foundations of Computational Mathematics 17, 2 (2017), 527–566

Random gradient-free minimization of convex functions. Foundations of Computational Mathematics 17, 2 (2017), 527–566. Maciej Skorski

work page 2017

[21] [21]

Modern Stochastics: Theory and Applications 10, 2 (2023), 211–228

Bernstein-type bounds for beta distribution. Modern Stochastics: Theory and Applications 10, 2 (2023), 211–228. Thomas Steinke, Milad Nasr, Arun Ganesh, Borja Balle, Christopher A Choquette-Choo, Matthew Jagielski, Jamie Hayes, Abhradeep Guha Thakurta, Adam Smith, and Andreas Terzis

work page 2023

[22] [22]

arXiv preprint arXiv:2410.06186 (2024)

The last iterate advantage: Empirical auditing and principled heuristic analysis of differentially private sgd. arXiv preprint arXiv:2410.06186 (2024). 11 Xinyu Tang, Ashwinee Panda, Milad Nasr, Saeed Mahloujifar, and Prateek Mittal

work page arXiv 2024

[23] [23]

arXiv preprint arXiv:2401.04343 (2024)

Private fine- tuning of large language models with zeroth-order optimization. arXiv preprint arXiv:2401.04343 (2024). Roman Vershynin

work page arXiv 2024

[24] [24]

Introduction to the non-asymptotic analysis of random matrices

Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 (2010). Rongzhe Wei, Eli Chien, and Pan Li

work page internal anchor Pith review Pith/arXiv arXiv 2010

[25] [25]

Advances in Neural Information Processing Systems 35 (2022), 703–715

Differentially private learning needs hidden state (or much faster convergence). Advances in Neural Information Processing Systems 35 (2022), 703–715. Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, and Huishuai Zhang

work page 2022

[26] [26]

In International Workshop on Federated Learning in the Age of Foundation Models in Conjunction with NeurIPS 2023

DPZero: Dimension- Independent and Differentially Private Zeroth-Order Optimization. In International Workshop on Federated Learning in the Age of Foundation Models in Conjunction with NeurIPS 2023 . https://openreview.net/forum?id=s7hquGszME Qinzi Zhang, Hoang Tran, and Ashok Cutkosky

work page 2023

[27] [27]

In The Twelfth International Conference on Learning Representations

Private Zeroth-Order Nonsmooth Nonconvex Optimization. In The Twelfth International Conference on Learning Representations . https: //openreview.net/forum?id=IzqZbNMZ0M A Appendix A.1 Limitations While we have established the first convergent DP guarantee for the ZOGD method, it is still far from ready to be directly applied to LLM (or any other general n...

work page 2022

[28] [28]

We set the radius of the projected set to be R = 1, smooth constant M = 1, strongly convex constant m = 0.9, and the clipped norm ∆ = 1

We report the privacy loss of Noisy-ZOGD (3) under different privacy analysis. We set the radius of the projected set to be R = 1, smooth constant M = 1, strongly convex constant m = 0.9, and the clipped norm ∆ = 1 . For a fair comparison, we first compute the RDP budget for each analysis and then convert it to DP by the standard conversion (Mironov, 2017...

work page 2017

[29] [29]

12 A.4 Proof of Theorem 3.1 The proof is quite standard in the DP literature (i.e., based on the analysis of Mironov (2017))

For our approach, we adopted the same simplification as in the proof of Corollary 3.3 that applies to Theorem 3.2 for strongly convex problems and kept all constants to be explicit. 12 A.4 Proof of Theorem 3.1 The proof is quite standard in the DP literature (i.e., based on the analysis of Mironov (2017)). We state it here for completeness. Proof. Case βt...

work page 2017

[30] [30]

In the meanwhile, the Lipschitz constant c of the first order gradient update map ϕ is as follows: If ℓi are M-smooth and m-strongly convex, then if η K ≤ 1 M we have c = 1 − ηm K

This is important since it will directly impact the error probability 1 − P(G) in the exponent. In the meanwhile, the Lipschitz constant c of the first order gradient update map ϕ is as follows: If ℓi are M-smooth and m-strongly convex, then if η K ≤ 1 M we have c = 1 − ηm K . With these choices and setup, the bound in Theorem 3.2 for any τ, βt, at become...

work page 2025

[31] [31]

Additionally, for simplicity we choose βt = 1/2

We can choose such a solution of a′ t and plug them in our bound. Additionally, for simplicity we choose βt = 1/2. Together we have Dα(WT |TT −1 t=τ Gt ||W ′ T |TT −1 t=τ Gt ) ≤ (T − τ) α σ2 (2∆ n )2 + 2αc2 2d η2σ2 + 4α min(2R, 2Cτ√ K ) 2 d (T − τ)η2σ2 . (25) Recall that c2 = ηM ξ. When we choose ξ ≤ 2∆ nηM √ 2d, we have α σ2 ( 2∆ n )2 + 2αc2 2d η2σ2 ≤ 2α...

work page 2017

[32] [32]

Together we complete the proof. A.9 Minibatch extensions Now, we discuss the minibatch generalizations of Noisy-ZOGD (Theorem 3.2), which we called the corresponding algorithm Noisy-ZOSGD (under subsampling without replacement). 18 Subsampling without replacement. For this strategy, the Noisy-ZOSGD update is defined as follows. wt+1 = ΠBR " wt − η K KX k=...

work page 2019

[33] [33]

(65) Note that Sα can be computed in practice with a numerically stable procedure for precise computa- tion Mironov (2017)

and noise parameter σ > 0, define Sα(q, σ) = Dα(N(0, σ2)||(1 − q)N(0, σ2) + qN (1, σ2)). (65) Note that Sα can be computed in practice with a numerically stable procedure for precise computa- tion Mironov (2017). Now we are ready to state the result for subsampling without replacement. Theorem A.3 (RDP guarantee of Noisy-ZOSGD). Assume ℓ(·, x) are M-smoot...

work page 2017

[34] [34]

If ℓ is also m-strongly convex and choose η ≤ K/M, we have c = 1 − ηm K . The proof mainly combines the proof of shifted Rényi divergence analysis of the first-order counter- part Chien and Li (2025); Altschuler and Talwar (2022) and the proof of our Theorem 3.2. Before we state our proof, let us first introduce a technical lemma before we introduce our p...

work page 2025

[35] [35]

In practice, Sα(q, σ) is computed via numerical integral for the tightest possible privacy accounting

and the noise parameter σ > 0, dimension d ∈ N and radius R > 0, sup µ∈P(BR) Dα(N(0, σ2Id)||(1 − q)N(0, σ2Id) + q(N(0, σ2Id) ∗ µ)) = Sα(q, σ/R), (66) where P(BR) denotes set of all Borel probability distributions over the ℓ2 ball of radius R in Rd. In practice, Sα(q, σ) is computed via numerical integral for the tightest possible privacy accounting. Never...

work page 2022

[36] [36]

The averaged loss function is twice differentiable with −H ⪯ ∇2L(w; D) ⪯ H for any w ∈ Rd, and its minimum is finite

The loss function is L-Lipschitz and M-smooth. The averaged loss function is twice differentiable with −H ⪯ ∇2L(w; D) ⪯ H for any w ∈ Rd, and its minimum is finite. Here, 0 ⪯ H is a real-valued d by d matrix such that ∥H∥2 ≤ M and T r(H) ≤ r∥H∥2 for for some r as the effective rank or the 20 intrinsic dimension of the problem. These assumptions are exactl...

work page 2023

[37] [37]

(88) where (a) is due to M-smoothness and the elementary inequality (a + b)2 ≤ 2a2 + 2b2

E≤t[∥∇¯ℓ(wt)∥2] + 1 2d M3ξ2r (87) = 2(2 + r)M d(d + 2)P(Qt) E≤t[∥∇¯ℓ(wt)∥2] + M3ξ2r 2dP(Qt) . (88) where (a) is due to M-smoothness and the elementary inequality (a + b)2 ≤ 2a2 + 2b2. (b) and (c) are due to Lemma C.1 in Zhang et al. (2023). Similarly we have KX k=1 E≤t[uT t,kHu t,k|Qt] ≤ rM K dP(Qt) . (89) For the inner-product term, let us denote u′ = √ ...

work page 2023

[38] [38]

(101) If we further assume that |¯ℓ(w)| ≤ B for any w

Then we have E≤t[∥∇¯ℓ(wt)∥2] ≤ 4dP(Qt) η E≤t[¯ℓ(wt) − ¯ℓ(wt+1)|Qt] + M2ξ2d3 + 4L2dP( ¯Qt) (100) + 2drM η σ2 1,t K d + σ2 2,tP(Qt) . (101) If we further assume that |¯ℓ(w)| ≤ B for any w. Then following the same analysis of Zhang et al . (2023) we have E≤t[¯ℓ(wt) − ¯ℓ(wt+1)|Qt]P(Qt) ≤ E≤t[¯ℓ(wt) − ¯ℓ(wt+1)|Q]P(Q) + 2BP( ¯Q). (102) As a result we have E≤t[∥...

work page 2023

[39] [39]

Lemma B.3

by following the analysis of Lemma 5.5 of Vershynin (2010) in order to facilitate our non-asymptotic bounds. Lemma B.3. Let Bk i.i.d. ∼ Beta 1 2 , d−1 2 . Then it holds that P ( KX k=1 Bk ≥ 0.15 − log(1/δ) K ) + · K 20 d + 1 + √ d − 2 ) ≥ 1 − δ, where (·)+ denotes max(·, 0). Proof. Let B ∼ Beta 1 2 , d−1 2 . Then B (d) = W1 W1+W2 where W1 ∼ χ2 (1) and W2 ...

work page 2010

[40] [40]

Since W1 (d) = Z2 for Z ∼ N (0, 1), by the lower bound of the Q-function, it holds that P {W1 ≥ κ1} = P {|Z| ≥ √κ1} = 2 · Q (√κ1) ≥ √κ1 1 + κ1 e−κ1/2 √ 2π

are independent. Since W1 (d) = Z2 for Z ∼ N (0, 1), by the lower bound of the Q-function, it holds that P {W1 ≥ κ1} = P {|Z| ≥ √κ1} = 2 · Q (√κ1) ≥ √κ1 1 + κ1 e−κ1/2 √ 2π . On the other hand, according to the quantile upper bound on the χ2 random variables (see, for instance, (Inglot, 2010, Theorem A)), we have P n W2 ≤ (d −

work page 2010

[41] [41]

Then w⊥ 2 2 = R2 z2 1 Pd j=2 z2 j Pd j=1 z2 j 2 ≤ R2 z2 1 Pd j=1 z2 j Pd j=1 z2 j 2 = r2 z2 1Pd j=1 z2 j ≜ B, where B ∼ Beta 1 2 , d−1 2

for j ∈ [d]. Then w⊥ 2 2 = R2 z2 1 Pd j=2 z2 j Pd j=1 z2 j 2 ≤ R2 z2 1 Pd j=1 z2 j Pd j=1 z2 j 2 = r2 z2 1Pd j=1 z2 j ≜ B, where B ∼ Beta 1 2 , d−1 2 . As a result, by Marchal and Arbel (2017, Theorem 1), w⊥ is 1 4(1/2+d/2+1) = 1 2d+6 norm-sub-Gaussian, so applying Lemma B.2 yields KX k=1 w⊥ k 2 2 ≤ ∥v∥2 K 2d + 6 log 2d δb , with probability at least 1 − ...

work page 2017