Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States
Pith reviewed 2026-05-19 11:53 UTC · model grok-4.3
The pith
Hybrid noise and coupling analysis deliver the first convergent privacy bounds for zeroth-order optimization with hidden states.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We provide the first convergent hidden-state DP bound for zeroth-order optimization by proposing a hybrid noise mechanism and a novel coupling analysis. We bypass the purely shifted-divergence approach by constructing a coupled auxiliary process, which circumvents the global Lipschitz barrier and yields a convergent privacy bound. Furthermore, our results induce better DP zeroth-order algorithmic designs that are previously unknown to the literature.
What carries the argument
Coupled auxiliary process built from a hybrid noise mechanism that tracks privacy amplification for anisotropic directional updates without requiring the global Lipschitz property.
If this is right
- Convergent differential privacy bounds are now available for hidden-state zeroth-order optimization.
- The analysis directly supports new algorithmic designs for DP zeroth-order methods that improve on prior approaches.
- Privacy amplification by iteration extends to memory-efficient optimization settings that inject scalar noise along random directions.
- Tighter noise calibration becomes possible while still guaranteeing formal privacy that improves over training iterations.
Where Pith is reading between the lines
- These bounds could let practitioners run private zeroth-order fine-tuning of large models with less noise than conservative analyses require.
- The coupling construction may generalize to other directional or anisotropic noise settings in private optimization.
- Empirical checks on actual model training runs would test how loose or tight the new privacy accounting proves in practice.
Load-bearing premise
A coupled auxiliary process can be maintained that accurately follows the privacy loss even though anisotropic updates destroy the global Lipschitz condition required by earlier frameworks.
What would settle it
A numerical simulation or proof that the privacy loss remains bounded away from zero or diverges with more iterations under the hybrid mechanism and coupling would refute the convergent bound.
Figures
read the original abstract
Zeroth-order optimization has emerged as a promising approach for fine-tuning large language models under differential privacy (DP) and memory constraints. While privacy amplification by iteration (PABI) provides convergent DP bounds for first-order methods, establishing similar guarantees for zeroth-order methods remains an open problem. First-order PABI analysis relies on the fact that gradients are perturbed with isotropic noise, allowing privacy bounds to be iteratively tracked via shifted R\'enyi divergence. In contrast, DP zeroth-order methods inject scalar noise along random update directions to maintain utility. This anisotropic update fails standard shifted divergence frameworks, as the global Lipschitz property no longer holds almost surely. We provide the first convergent hidden-state DP bound for zeroth-order optimization by proposing a hybrid noise mechanism and a novel coupling analysis. We bypass the purely shifted-divergence approach by constructing a coupled auxiliary process, which circumvents the global Lipschitz barrier and yields a convergent privacy bound. Furthermore, our results induce better DP zeroth-order algorithmic designs that are previously unknown to the literature.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to resolve an open problem in differentially private zeroth-order optimization by deriving the first convergent hidden-state DP bound. It introduces a hybrid noise mechanism (combining isotropic and directional components) and a novel coupling analysis that constructs an auxiliary process to bypass the failure of global Lipschitz continuity under anisotropic updates, thereby enabling iterative tracking of privacy via a modified Rényi divergence that contracts over iterations.
Significance. If the coupling analysis is correct and produces a strictly contractive bound under the hybrid mechanism, the result would be significant for DP optimization of large models under memory constraints. It provides the first convergent PABI-style guarantee for zeroth-order methods and suggests improved algorithmic designs. The work is technically novel in its handling of anisotropy but its impact depends on whether the auxiliary process rigorously controls the divergence contraction without hidden assumptions.
major comments (2)
- [§4.3, Theorem 3] §4.3, Theorem 3 (Convergence of the coupled process): The proof sketch shows that the auxiliary process matches the marginal distribution of the original mechanism, but it is unclear whether the conditional Lipschitz constant remains bounded almost surely under the random directional noise component. Without an explicit bound on the contraction factor (e.g., showing it is strictly less than 1 uniformly over directions), the iterated Rényi divergence may fail to converge, undermining the central claim of a convergent hidden-state bound.
- [§3.2] §3.2, Definition of the hybrid noise mechanism: The mixture parameter α between isotropic and directional noise is introduced without a clear derivation showing that it preserves the necessary martingale properties for the coupling argument. If α is chosen post-hoc to force contraction, this would weaken the parameter-free nature of the resulting DP bound.
minor comments (2)
- [§2.1] The notation for the shifted Rényi divergence in §2.1 is introduced without explicitly contrasting it to the standard definition used in prior PABI works; adding a short comparison table would improve clarity.
- [Figure 2] Figure 2 (empirical privacy-utility curves) lacks error bars or multiple random seeds; reporting variance across runs would strengthen the experimental validation of the theoretical bounds.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable comments, which help strengthen the presentation of our coupling analysis and hybrid mechanism. We address each major comment below with clarifications and will incorporate explicit derivations and bounds into the revised manuscript.
read point-by-point responses
-
Referee: [§4.3, Theorem 3] §4.3, Theorem 3 (Convergence of the coupled process): The proof sketch shows that the auxiliary process matches the marginal distribution of the original mechanism, but it is unclear whether the conditional Lipschitz constant remains bounded almost surely under the random directional noise component. Without an explicit bound on the contraction factor (e.g., showing it is strictly less than 1 uniformly over directions), the iterated Rényi divergence may fail to converge, undermining the central claim of a convergent hidden-state bound.
Authors: We appreciate the referee's focus on the almost-sure boundedness of the conditional Lipschitz constant. The hybrid noise construction in our analysis ensures that the isotropic component dominates the directional perturbation sufficiently to yield a uniform bound on the effective Lipschitz constant of the coupled updates, independent of the realized direction. In the revised proof of Theorem 3, we will explicitly derive the contraction factor of the modified Rényi divergence as at most 1 - β, where β > 0 depends on the noise variances and the mixture weight but is strictly positive and uniform over all directions. This establishes iterative contraction without additional assumptions. revision: yes
-
Referee: [§3.2] §3.2, Definition of the hybrid noise mechanism: The mixture parameter α between isotropic and directional noise is introduced without a clear derivation showing that it preserves the necessary martingale properties for the coupling argument. If α is chosen post-hoc to force contraction, this would weaken the parameter-free nature of the resulting DP bound.
Authors: The mixture parameter α is derived directly from the requirement that the hybrid noise process remains a martingale with respect to the natural filtration of the optimization trajectory, ensuring unbiased updates and controlled variance that enable the auxiliary process to match marginals while contracting the divergence. We will expand §3.2 with the explicit derivation of α from these martingale conditions, confirming that the choice is not post-hoc and that the final DP bound remains parameter-free in the sense that α can be expressed in closed form from the problem parameters. revision: yes
Circularity Check
No significant circularity; derivation relies on novel coupling construction.
full rationale
The paper's central contribution is a hybrid noise mechanism paired with a new coupled auxiliary process that circumvents the global Lipschitz failure of anisotropic zeroth-order updates under shifted Rényi divergence. This is presented as an independent technical development that yields the first convergent hidden-state DP bound, without any indication in the abstract or description that the bound reduces to a fitted parameter, self-citation chain, or re-derivation of prior results by construction. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard differential privacy assumptions including bounded sensitivity of the loss and appropriate noise distributions for privacy.
invented entities (2)
-
Hybrid noise mechanism
no independent evidence
-
Coupled auxiliary process
no independent evidence
Reference graph
Works this paper leans on
-
[1]
In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security
Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security. 308–318. Jason Altschuler and Kunal Talwar
work page 2016
-
[2]
Advances in Neural Information Processing Systems 35 (2022), 3788–3800
Privacy of noisy stochastic gradient descent: More iterations without more privacy loss. Advances in Neural Information Processing Systems 35 (2022), 3788–3800. Jason Altschuler and Kunal Talwar
work page 2022
-
[3]
On the Privacy of Noisy Stochastic Gradient Descent for Convex Optimization. SIAM J. Comput. 53, 4 (2024), 969–1001. Meenatchi Sundaram Muthu Selva Annamalai
work page 2024
-
[4]
arXiv preprint arXiv:2407.06496 (2024)
It’s Our Loss: No Privacy Amplification for Hidden State DP-SGD With Non-Convex Loss. arXiv preprint arXiv:2407.06496 (2024). Shahab Asoodeh and Mario Diaz
-
[5]
arXiv preprint arXiv:2305.09903 (2023)
Privacy Loss of Noisy Stochastic Gradient Descent Might Converge Even for Non-Convex Losses. arXiv preprint arXiv:2305.09903 (2023). Borja Balle, Gilles Barthe, and Marco Gaboardi
-
[6]
Advances in neural information processing systems 31 (2018)
Privacy amplification by subsampling: Tight analyses via couplings and divergences. Advances in neural information processing systems 31 (2018). Tudor Cebere, Aurélien Bellet, and Nicolas Papernot
work page 2018
-
[7]
arXiv preprint arXiv:2405.14457 (2024)
Tighter privacy auditing of dp-sgd in the hidden state threat model. arXiv preprint arXiv:2405.14457 (2024). Eli Chien and Pan Li
-
[8]
Advances in Neural Information Processing Systems 34 (2021), 14771–14781
Differential privacy dynamics of langevin diffusion and noisy gradient descent. Advances in Neural Information Processing Systems 34 (2021), 14771–14781. John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono
work page 2021
-
[9]
IEEE Transactions on Information Theory 61, 5 (2015), 2788–2806
Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory 61, 5 (2015), 2788–2806. 10 Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith
work page 2015
-
[10]
Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7,
work page 2006
-
[11]
In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS)
Privacy amplification by iteration. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 521–532. Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan
work page 2018
-
[12]
Probability and Mathematical Statistics 30, 2 (2010), 339–351
Inequalities for quantiles of the chi-square distribution. Probability and Mathematical Statistics 30, 2 (2010), 339–351. Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan
work page 2010
-
[13]
A Short Note on Concentration Inequalities for Random Vectors with SubGaussian Norm
A short note on concentration inequalities for random vectors with subgaussian norm. arXiv preprint arXiv:1902.03736 (2019). Peter Kairouz, Brendan McMahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, and Zheng Xu
work page internal anchor Pith review Pith/arXiv arXiv 1902
-
[14]
Scaling Laws for Neural Language Models
Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020). Weiwei Kong and Mónica Ribero
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[15]
arXiv preprint arXiv:2407.05237 (2024)
Privacy of the last iterate in cyclically-sampled DP-SGD on nonconvex composite losses. arXiv preprint arXiv:2407.05237 (2024). Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto
-
[16]
In International Conference on Learning Represen- tations
Large Language Models Can Be Strong Differentially Private Learners. In International Conference on Learning Represen- tations. https://openreview.net/forum?id=bVuP3ltATMz Xinyue Liu, Hualin Zhang, Bin Gu, and Hong Chen. 2024b. General Stability Analysis for Zeroth- Order Optimization Algorithms. In The Twelfth International Conference on Learning Represe...
-
[17]
On the sub-Gaussianity of the Beta and Dirichlet distributions. (2017). Ilya Mironov
work page 2017
-
[18]
In 2017 IEEE 30th computer security foundations symposium (CSF)
Rényi differential privacy. In 2017 IEEE 30th computer security foundations symposium (CSF). IEEE, 263–275. Ilya Mironov, Kunal Talwar, and Li Zhang
work page 2017
-
[19]
arXiv preprint arXiv:1908.10530 (2019)
R\’enyi differential privacy of the sampled gaussian mechanism. arXiv preprint arXiv:1908.10530 (2019). Yurii Nesterov and Vladimir Spokoiny
-
[20]
Foundations of Computational Mathematics 17, 2 (2017), 527–566
Random gradient-free minimization of convex functions. Foundations of Computational Mathematics 17, 2 (2017), 527–566. Maciej Skorski
work page 2017
-
[21]
Modern Stochastics: Theory and Applications 10, 2 (2023), 211–228
Bernstein-type bounds for beta distribution. Modern Stochastics: Theory and Applications 10, 2 (2023), 211–228. Thomas Steinke, Milad Nasr, Arun Ganesh, Borja Balle, Christopher A Choquette-Choo, Matthew Jagielski, Jamie Hayes, Abhradeep Guha Thakurta, Adam Smith, and Andreas Terzis
work page 2023
-
[22]
arXiv preprint arXiv:2410.06186 (2024)
The last iterate advantage: Empirical auditing and principled heuristic analysis of differentially private sgd. arXiv preprint arXiv:2410.06186 (2024). 11 Xinyu Tang, Ashwinee Panda, Milad Nasr, Saeed Mahloujifar, and Prateek Mittal
-
[23]
arXiv preprint arXiv:2401.04343 (2024)
Private fine- tuning of large language models with zeroth-order optimization. arXiv preprint arXiv:2401.04343 (2024). Roman Vershynin
-
[24]
Introduction to the non-asymptotic analysis of random matrices
Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 (2010). Rongzhe Wei, Eli Chien, and Pan Li
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[25]
Advances in Neural Information Processing Systems 35 (2022), 703–715
Differentially private learning needs hidden state (or much faster convergence). Advances in Neural Information Processing Systems 35 (2022), 703–715. Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, and Huishuai Zhang
work page 2022
-
[26]
DPZero: Dimension- Independent and Differentially Private Zeroth-Order Optimization. In International Workshop on Federated Learning in the Age of Foundation Models in Conjunction with NeurIPS 2023 . https://openreview.net/forum?id=s7hquGszME Qinzi Zhang, Hoang Tran, and Ashok Cutkosky
work page 2023
-
[27]
In The Twelfth International Conference on Learning Representations
Private Zeroth-Order Nonsmooth Nonconvex Optimization. In The Twelfth International Conference on Learning Representations . https: //openreview.net/forum?id=IzqZbNMZ0M A Appendix A.1 Limitations While we have established the first convergent DP guarantee for the ZOGD method, it is still far from ready to be directly applied to LLM (or any other general n...
work page 2022
-
[28]
We report the privacy loss of Noisy-ZOGD (3) under different privacy analysis. We set the radius of the projected set to be R = 1, smooth constant M = 1, strongly convex constant m = 0.9, and the clipped norm ∆ = 1 . For a fair comparison, we first compute the RDP budget for each analysis and then convert it to DP by the standard conversion (Mironov, 2017...
work page 2017
-
[29]
For our approach, we adopted the same simplification as in the proof of Corollary 3.3 that applies to Theorem 3.2 for strongly convex problems and kept all constants to be explicit. 12 A.4 Proof of Theorem 3.1 The proof is quite standard in the DP literature (i.e., based on the analysis of Mironov (2017)). We state it here for completeness. Proof. Case βt...
work page 2017
-
[30]
This is important since it will directly impact the error probability 1 − P(G) in the exponent. In the meanwhile, the Lipschitz constant c of the first order gradient update map ϕ is as follows: If ℓi are M-smooth and m-strongly convex, then if η K ≤ 1 M we have c = 1 − ηm K . With these choices and setup, the bound in Theorem 3.2 for any τ, βt, at become...
work page 2025
-
[31]
Additionally, for simplicity we choose βt = 1/2
We can choose such a solution of a′ t and plug them in our bound. Additionally, for simplicity we choose βt = 1/2. Together we have Dα(WT |TT −1 t=τ Gt ||W ′ T |TT −1 t=τ Gt ) ≤ (T − τ) α σ2 (2∆ n )2 + 2αc2 2d η2σ2 + 4α min(2R, 2Cτ√ K ) 2 d (T − τ)η2σ2 . (25) Recall that c2 = ηM ξ. When we choose ξ ≤ 2∆ nηM √ 2d, we have α σ2 ( 2∆ n )2 + 2αc2 2d η2σ2 ≤ 2α...
work page 2017
-
[32]
Together we complete the proof. A.9 Minibatch extensions Now, we discuss the minibatch generalizations of Noisy-ZOGD (Theorem 3.2), which we called the corresponding algorithm Noisy-ZOSGD (under subsampling without replacement). 18 Subsampling without replacement. For this strategy, the Noisy-ZOSGD update is defined as follows. wt+1 = ΠBR " wt − η K KX k=...
work page 2019
-
[33]
and noise parameter σ > 0, define Sα(q, σ) = Dα(N(0, σ2)||(1 − q)N(0, σ2) + qN (1, σ2)). (65) Note that Sα can be computed in practice with a numerically stable procedure for precise computa- tion Mironov (2017). Now we are ready to state the result for subsampling without replacement. Theorem A.3 (RDP guarantee of Noisy-ZOSGD). Assume ℓ(·, x) are M-smoot...
work page 2017
-
[34]
If ℓ is also m-strongly convex and choose η ≤ K/M, we have c = 1 − ηm K . The proof mainly combines the proof of shifted Rényi divergence analysis of the first-order counter- part Chien and Li (2025); Altschuler and Talwar (2022) and the proof of our Theorem 3.2. Before we state our proof, let us first introduce a technical lemma before we introduce our p...
work page 2025
-
[35]
and the noise parameter σ > 0, dimension d ∈ N and radius R > 0, sup µ∈P(BR) Dα(N(0, σ2Id)||(1 − q)N(0, σ2Id) + q(N(0, σ2Id) ∗ µ)) = Sα(q, σ/R), (66) where P(BR) denotes set of all Borel probability distributions over the ℓ2 ball of radius R in Rd. In practice, Sα(q, σ) is computed via numerical integral for the tightest possible privacy accounting. Never...
work page 2022
-
[36]
The loss function is L-Lipschitz and M-smooth. The averaged loss function is twice differentiable with −H ⪯ ∇2L(w; D) ⪯ H for any w ∈ Rd, and its minimum is finite. Here, 0 ⪯ H is a real-valued d by d matrix such that ∥H∥2 ≤ M and T r(H) ≤ r∥H∥2 for for some r as the effective rank or the 20 intrinsic dimension of the problem. These assumptions are exactl...
work page 2023
-
[37]
(88) where (a) is due to M-smoothness and the elementary inequality (a + b)2 ≤ 2a2 + 2b2
E≤t[∥∇¯ℓ(wt)∥2] + 1 2d M3ξ2r (87) = 2(2 + r)M d(d + 2)P(Qt) E≤t[∥∇¯ℓ(wt)∥2] + M3ξ2r 2dP(Qt) . (88) where (a) is due to M-smoothness and the elementary inequality (a + b)2 ≤ 2a2 + 2b2. (b) and (c) are due to Lemma C.1 in Zhang et al. (2023). Similarly we have KX k=1 E≤t[uT t,kHu t,k|Qt] ≤ rM K dP(Qt) . (89) For the inner-product term, let us denote u′ = √ ...
work page 2023
-
[38]
(101) If we further assume that |¯ℓ(w)| ≤ B for any w
Then we have E≤t[∥∇¯ℓ(wt)∥2] ≤ 4dP(Qt) η E≤t[¯ℓ(wt) − ¯ℓ(wt+1)|Qt] + M2ξ2d3 + 4L2dP( ¯Qt) (100) + 2drM η σ2 1,t K d + σ2 2,tP(Qt) . (101) If we further assume that |¯ℓ(w)| ≤ B for any w. Then following the same analysis of Zhang et al . (2023) we have E≤t[¯ℓ(wt) − ¯ℓ(wt+1)|Qt]P(Qt) ≤ E≤t[¯ℓ(wt) − ¯ℓ(wt+1)|Q]P(Q) + 2BP( ¯Q). (102) As a result we have E≤t[∥...
work page 2023
-
[39]
by following the analysis of Lemma 5.5 of Vershynin (2010) in order to facilitate our non-asymptotic bounds. Lemma B.3. Let Bk i.i.d. ∼ Beta 1 2 , d−1 2 . Then it holds that P ( KX k=1 Bk ≥ 0.15 − log(1/δ) K ) + · K 20 d + 1 + √ d − 2 ) ≥ 1 − δ, where (·)+ denotes max(·, 0). Proof. Let B ∼ Beta 1 2 , d−1 2 . Then B (d) = W1 W1+W2 where W1 ∼ χ2 (1) and W2 ...
work page 2010
-
[40]
are independent. Since W1 (d) = Z2 for Z ∼ N (0, 1), by the lower bound of the Q-function, it holds that P {W1 ≥ κ1} = P {|Z| ≥ √κ1} = 2 · Q (√κ1) ≥ √κ1 1 + κ1 e−κ1/2 √ 2π . On the other hand, according to the quantile upper bound on the χ2 random variables (see, for instance, (Inglot, 2010, Theorem A)), we have P n W2 ≤ (d −
work page 2010
-
[41]
for j ∈ [d]. Then w⊥ 2 2 = R2 z2 1 Pd j=2 z2 j Pd j=1 z2 j 2 ≤ R2 z2 1 Pd j=1 z2 j Pd j=1 z2 j 2 = r2 z2 1Pd j=1 z2 j ≜ B, where B ∼ Beta 1 2 , d−1 2 . As a result, by Marchal and Arbel (2017, Theorem 1), w⊥ is 1 4(1/2+d/2+1) = 1 2d+6 norm-sub-Gaussian, so applying Lemma B.2 yields KX k=1 w⊥ k 2 2 ≤ ∥v∥2 K 2d + 6 log 2d δb , with probability at least 1 − ...
work page 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.