Efficient DP-SGD for LLMs with Randomized Clipping
Pith reviewed 2026-06-30 12:22 UTC · model grok-4.3
The pith
DP-SGD with randomized clipping via trace estimation reduces memory for private LLM training while matching utility.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DP-SGD-RC is a variant of DP-SGD that uses Hutchinson's estimator and Hutch++ for randomized clipping, reducing the memory complexity of per-sample gradient norm estimation while providing tight privacy bounds and preserving model utility on downstream tasks.
What carries the argument
Stochastic trace estimation with Hutchinson's estimator and Hutch++ to approximate per-sample gradient norms for randomized clipping in DP-SGD.
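As a rough illustration of the idea, the sketch below estimates a squared per-sample gradient norm with Hutchinson's estimator, using the identity ||G||_F^2 = tr(G^T G) = E[||G z||^2] for Rademacher probes z. It assumes, for illustration only, that the per-sample gradient of a linear layer factors as G = A^T B (activations A, backpropagated output gradients B), so G z can be formed without materializing G. The function name, shapes, and probe count are hypothetical and are not taken from the paper.

```python
import numpy as np

def hutchinson_sq_norm(matvec, dim, num_probes, rng):
    """Estimate ||G||_F^2 = tr(G^T G) as the mean of ||G z||^2 over Rademacher probes z."""
    total = 0.0
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=dim)   # random sign vector
        total += float(np.sum(matvec(z) ** 2))  # ||G z||^2
    return total / num_probes

rng = np.random.default_rng(0)
T, d_in, d_out = 64, 32, 16                     # hypothetical sizes
A = rng.standard_normal((T, d_in))              # assumed layer activations
B = rng.standard_normal((T, d_out))             # assumed output gradients

# G = A^T B has shape (d_in, d_out); G z = A^T (B z) avoids building G.
matvec = lambda z: A.T @ (B @ z)

exact = float(np.sum((A.T @ B) ** 2))           # reference value, materializes G
estimate = hutchinson_sq_norm(matvec, d_out, 2000, rng)
```

Each probe only needs vectors of length T, d_in, and d_out, which is the kind of saving the summary attributes to the method; the number of probes trades compute for estimator variance.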
If this is right
- DP-SGD-RC achieves noise multipliers competitive with deterministic clipping.
- Experiments show it matches baseline utility on classification, question answering, and summarization tasks.
- It significantly reduces memory and compute requirements compared to standard DP-SGD implementations.
- The approach supports fine-tuning on long-context benchmarks without proportional resource increase.
Where Pith is reading between the lines
- Similar randomized estimation techniques might apply to other DP mechanisms that require per-sample statistics.
- The memory savings could permit larger batch sizes or longer contexts in private training setups.
- If the estimation variance is controlled, it may extend to full pre-training of LLMs rather than just fine-tuning.
Load-bearing premise
The stochastic trace estimation produces sufficiently accurate per-sample gradient norm estimates that the subsequent clipping and noise addition preserve both the stated privacy bound and downstream model utility.
What would settle it
A direct comparison experiment showing that DP-SGD-RC either violates the claimed privacy guarantees or yields measurably lower task performance than deterministic clipping on the same Llama model and datasets.
Original abstract
Large language models (LLMs) are trained on vast datasets that may contain sensitive information. Differential privacy (DP), the de facto standard for formal privacy guarantees, provides a principled framework for training LLMs with provable privacy protection. However, state-of-the-art DP training implementations rely on fast gradient clipping techniques with memory overhead $O(B \min\{T^2, d^2\})$, where $B$ is the batch size, $T$ is the sequence length, and $d$ is the model width. This becomes prohibitive as both model size and context length grow. We propose DP-SGD-RC, a novel variant of DP-SGD with randomized clipping that reduces memory and compute complexity. DP-SGD-RC leverages stochastic trace estimation methods, specifically Hutchinson's estimator [Hutchinson, 1989] and its improved variant, Hutch++ [Meyer et al., 2021], to reduce the memory footprint of per-sample gradient norm estimation. We provide a tight privacy analysis showing that DP-SGD-RC achieves noise multipliers competitive with deterministic clipping. Experiments fine-tuning Llama 3.2-1B on long-context benchmarks spanning classification, question answering, and summarization tasks demonstrate that DP-SGD-RC matches baseline utility while significantly reducing memory and compute requirements.
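To get a feel for the overhead term quoted in the abstract, the arithmetic below evaluates B · min{T^2, d^2} for one hypothetical configuration. The values B = 8, T = 8192, d = 2048, the 4-byte element size, and a hidden constant of 1 are illustrative assumptions, not figures from the paper.

```python
def clipping_overhead_elements(batch_size, seq_len, width):
    """Evaluate B * min(T^2, d^2), the overhead term from the abstract (constant taken as 1)."""
    return batch_size * min(seq_len ** 2, width ** 2)

# Hypothetical configuration, chosen only for illustration.
elements = clipping_overhead_elements(batch_size=8, seq_len=8192, width=2048)
mib_fp32 = elements * 4 / 2**20   # assumes 4 bytes per element
```

Under these assumptions the term comes to 33,554,432 elements, or 128 MiB at 4 bytes each. Here d^2 is the smaller branch, so growing T further leaves the bound unchanged while growing d or B does not.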
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DP-SGD-RC, a DP-SGD variant that replaces exact per-sample gradient clipping with randomized clipping based on stochastic trace estimation (Hutchinson's estimator and Hutch++) of ||g_i||_2^2. It claims this yields memory/compute savings over standard O(B min{T^2,d^2}) clipping while providing a tight privacy analysis whose noise multipliers remain competitive with deterministic clipping, and that fine-tuning Llama 3.2-1B on long-context tasks preserves utility.
Significance. If the privacy analysis correctly accounts for norm-estimation error, the method would materially improve the practicality of DP training for billion-parameter LLMs by lowering the memory barrier that currently limits batch size and context length.
major comments (2)
- [Privacy Analysis] The central privacy claim (that noise multipliers remain competitive) rests on the assumption that the stochastic estimator produces per-sample norms sufficiently accurate that the effective L2 sensitivity stays bounded by the clipping threshold. The abstract asserts a 'tight privacy analysis' but provides no high-probability tail bound on the underestimation probability of Hutchinson/Hutch++ for high-rank LLM gradients (d~10^9). Without an explicit adjustment to the sensitivity or a failure-probability term in the privacy budget, the competitive multiplier does not follow.
- [Experiments] § on experimental setup: the utility-matching claim for Llama 3.2-1B is reported without error bars on the number of Hutchinson samples used per gradient or on the observed norm-estimation error distribution; this leaves open whether the reported utility holds only for estimator configurations that already violate the sensitivity assumption used in the proof.
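The norm-estimation error distribution requested in the second comment could be tabulated along the following lines. This is a minimal sketch on synthetic Gaussian vectors, not the paper's gradients; the estimator shown is plain Hutchinson with Rademacher probes, and the dimension, probe count, and 10% underestimation cutoff are hypothetical choices.

```python
import numpy as np

def estimated_norm(g, num_probes, rng):
    """Hutchinson-style estimate of ||g||_2 from Rademacher projections z^T g."""
    Z = rng.choice([-1.0, 1.0], size=(num_probes, g.shape[0]))
    return float(np.sqrt(np.mean((Z @ g) ** 2)))  # E[(z^T g)^2] = ||g||^2

rng = np.random.default_rng(0)
dim, num_vectors, num_probes = 512, 500, 64       # hypothetical sizes

# Ratio of estimated to exact norm for each synthetic "gradient".
ratios = np.array([
    estimated_norm(g, num_probes, rng) / np.linalg.norm(g)
    for g in rng.standard_normal((num_vectors, dim))
])

quantiles = np.quantile(ratios, [0.01, 0.25, 0.5, 0.75, 0.99])
underestimate_rate = float(np.mean(ratios < 0.9))  # share underestimated by more than 10%
```

Reporting the lower quantiles and the underestimation rate for the probe count actually used in training would show directly whether the sensitivity assumption is plausible for that configuration.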
minor comments (1)
- [Abstract] Notation for the randomized clipping threshold and the subsequent noise scale should be defined once and used consistently; the abstract introduces 'randomized clipping' without an equation relating the estimator output to the final clipped gradient.
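One plausible form of the missing relation, written as a sketch: each per-sample gradient g_i is scaled by min(1, C / n_i), where n_i is the estimated norm and C the clipping threshold, and Gaussian noise with standard deviation σC is added to the sum. The source does not state the paper's exact rule, so the function names and the rule itself are assumptions made here for illustration.

```python
import numpy as np

def clip_with_estimated_norms(grads, est_norms, clip_c):
    """Rescale each row g_i by min(1, C / n_i), where n_i is an estimated norm."""
    scale = np.minimum(1.0, clip_c / np.maximum(est_norms, 1e-12))
    return grads * scale[:, None]

def noisy_mean(clipped, clip_c, noise_multiplier, rng):
    """Sum clipped gradients, add N(0, (sigma * C)^2 I) noise, and average."""
    noise = rng.normal(0.0, noise_multiplier * clip_c, size=clipped.shape[1])
    return (clipped.sum(axis=0) + noise) / clipped.shape[0]

rng = np.random.default_rng(0)
grads = 3.0 * np.eye(4)                     # four toy gradients, each with exact norm 3
exact = np.linalg.norm(grads, axis=1)

ok = clip_with_estimated_norms(grads, exact, clip_c=1.0)
bad = clip_with_estimated_norms(grads, 0.5 * exact, clip_c=1.0)  # norms underestimated 2x
update = noisy_mean(ok, clip_c=1.0, noise_multiplier=1.0, rng=rng)
```

With exact norms every clipped row has norm C = 1. With norms underestimated by a factor of two, the rows come out with norm 2, which exceeds the threshold the noise is calibrated to; this is the sensitivity concern raised in the first major comment.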
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our work. We address each major comment below, indicating the revisions we will make to strengthen the manuscript.
Point-by-point responses
-
Referee: [Privacy Analysis] The central privacy claim (that noise multipliers remain competitive) rests on the assumption that the stochastic estimator produces per-sample norms sufficiently accurate that the effective L2 sensitivity stays bounded by the clipping threshold. The abstract asserts a 'tight privacy analysis' but provides no high-probability tail bound on the underestimation probability of Hutchinson/Hutch++ for high-rank LLM gradients (d~10^9). Without an explicit adjustment to the sensitivity or a failure-probability term in the privacy budget, the competitive multiplier does not follow.
Authors: We agree that the presentation of the privacy analysis can be strengthened by making the dependence on estimator concentration explicit. The manuscript invokes the known high-probability error bounds for Hutch++ (Meyer et al., 2021), but does not yet fold the failure probability into the overall (ε,δ) budget or discuss how these bounds scale for d ≈ 10^9. We will revise the privacy section to (i) state the tail bounds used, (ii) allocate a small failure probability η to the estimator and compose it with the DP guarantee via a standard union bound, and (iii) recompute the effective noise multiplier under this adjusted sensitivity. This will make the claim of competitive multipliers fully rigorous.
Revision: yes
-
Referee: [Experiments] § on experimental setup: the utility-matching claim for Llama 3.2-1B is reported without error bars on the number of Hutchinson samples used per gradient or on the observed norm-estimation error distribution; this leaves open whether the reported utility holds only for estimator configurations that already violate the sensitivity assumption used in the proof.
Authors: We acknowledge the value of reporting variability. The current experiments fix the number of Hutchinson samples but do not display error bars across runs or the empirical distribution of relative norm error. In the revised version we will (i) report mean and standard deviation of downstream metrics over at least three independent runs, (ii) include histograms or quantiles of the observed ||ĝ_i|| / ||g_i|| ratio for the chosen sample count, and (iii) verify that the chosen configuration keeps the underestimation probability below the η used in the updated privacy analysis.
Revision: yes
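The failure-probability accounting promised in the first response can be sketched with a generic coupling argument. This is not the paper's analysis. It assumes a hypothetical idealized mechanism M that is (ε, δ)-DP with sensitivity bounded by the clipping threshold, and that the mechanism with estimated norms, written \tilde{M}, agrees with M outside an event of probability at most η on every dataset. For neighboring datasets D, D' and any output set S:

```latex
% Assumptions (illustrative): M is (\varepsilon,\delta)-DP, and for every dataset
% \tilde{M} can be coupled with M so that they differ with probability at most \eta.
\begin{align*}
\Pr[\tilde{M}(D) \in S]
  &\le \Pr[M(D) \in S] + \eta \\
  &\le e^{\varepsilon} \Pr[M(D') \in S] + \delta + \eta \\
  &\le e^{\varepsilon} \bigl( \Pr[\tilde{M}(D') \in S] + \eta \bigr) + \delta + \eta \\
  &= e^{\varepsilon} \Pr[\tilde{M}(D') \in S] + \delta + (1 + e^{\varepsilon})\,\eta .
\end{align*}
```

Under these assumptions \tilde{M} is (ε, δ + (1 + e^ε)η)-DP. If the estimator is invoked N times over training and each call fails with probability at most η_0, a union bound gives η ≤ N η_0, so η_0 has to shrink with the number of steps and samples.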
Circularity Check
No circularity; privacy analysis and estimator citations are independent of target claims
The provided abstract and reader summary contain no equations, fitted parameters, or self-citations that reduce the claimed privacy multipliers, noise analysis, or utility results to a definition or input by construction. Hutchinson (1989) and Hutch++ (Meyer et al. 2021) are external citations; the privacy analysis is presented as a separate contribution without evidence of self-definitional loops or renaming. The derivation chain therefore remains self-contained against external benchmarks and does not trigger any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- [standard math] Hutchinson's estimator yields an unbiased estimate of the trace of a matrix.
- [standard math] Hutch++ reduces the variance of the trace estimate relative to plain Hutchinson.
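Both ledger entries can be checked numerically on a small example. The sketch below implements Hutchinson's estimator and the Hutch++ estimator as described by Meyer et al. (2021), splitting a budget of m matrix-vector products into three equal parts; the function names, test matrix, and budget are hypothetical choices for illustration, and the paper's own implementation may differ.

```python
import numpy as np

def hutchinson(matvec, n, m, rng):
    """Plain Hutchinson: average of z^T A z over m Rademacher probes."""
    Z = rng.choice([-1.0, 1.0], size=(n, m))
    return float(np.trace(Z.T @ matvec(Z))) / m

def hutchpp(matvec, n, m, rng):
    """Hutch++: exact trace on a sketched subspace plus Hutchinson on the remainder."""
    k = m // 3                                   # m matvecs split three ways
    S = rng.choice([-1.0, 1.0], size=(n, k))
    G = rng.choice([-1.0, 1.0], size=(n, k))
    Q, _ = np.linalg.qr(matvec(S))               # orthonormal basis for the sketch A S
    G_perp = G - Q @ (Q.T @ G)                   # probes projected off span(Q)
    low_rank_part = np.trace(Q.T @ matvec(Q))
    residual_part = np.trace(G_perp.T @ matvec(G_perp)) / k
    return float(low_rank_part + residual_part)

rng = np.random.default_rng(0)
n, m = 50, 12                                    # hypothetical size and matvec budget
U = rng.standard_normal((n, 2))
A = U @ U.T                                      # rank-2 PSD test matrix
matvec = lambda X: A @ X

true_trace = float(np.trace(A))
hutch_mean = float(np.mean([hutchinson(matvec, n, m, rng) for _ in range(400)]))
hutchpp_est = hutchpp(matvec, n, m, rng)
```

On this low-rank matrix the sketch captures the whole range of A, so Hutch++ recovers the trace up to floating-point error from a single run, while plain Hutchinson matches it only on average over many runs. For matrices with slowly decaying spectra the gap between the two is smaller.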
Reference graph
Works this paper leans on
- [1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016.
- [2] John M Abowd. The US Census Bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2867–2867, 2018.
- [3] Sai Aparna Aketi, Will Bullock, Iden Kalemaj, Enayat Ullah, and Huanyu Zhang. Scaling private deep learning with Opacus: Advances for large language models. In Championing Open-source DEvelopment in ML Workshop @ ICML25, 2025.
- [4] Galen Andrew, Om Thakkar, Brendan McMahan, and Swaroop Ramaswamy. Differentially private learning with adaptive clipping. Advances in Neural Information Processing Systems, 34:17455–17466, 2021.
- [5] Apple Differential Privacy Team. Learning with privacy at scale. Apple Machine Learning Journal, 1(8), December 2017. URL https://machinelearning.apple.com/research/learning-with-privacy-at-scale
- [6] Raman Arora, Raef Bassily, Tomás González, Cristóbal A Guzmán, Michael Menart, and Enayat Ullah. Faster rates of convergence to stationary points in differentially private optimization. In International Conference on Machine Learning, pages 1060–1092. PMLR, 2023.
- [7] Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in $\ell_1$ geometry. In International Conference on Machine Learning, pages 393–403. PMLR, 2021.
- [8] Haim Avron and Sivan Toledo. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. Journal of the ACM (JACM), 58(2):1–34, 2011.
- [9] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In IEEE Annual Symposium on Foundations of Computer Science, pages 464–473. IEEE, 2014.
- [10] José M Bernardo, Adrian FM Smith, and Mark Berliner. Bayesian theory, volume 586. Wiley Online Library, 1994.
- [11] Garrett Birkhoff. Tres observaciones sobre el algebra lineal. Univ. Nac. Tucuman, Ser. A, 5:147–154, 1946.
- [12] Zhiqi Bu, Sivakanth Gopi, Janardhan Kulkarni, Yin Tat Lee, Hanwen Shen, and Uthaipon Tantipongpipat. Fast and memory efficient differentially private-SGD via JL projections. Advances in Neural Information Processing Systems, 34:19680–19691, 2021.
- [13] Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. Differentially private optimization on large model at small cost. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023.
- [14] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of cryptography conference, pages 635–658. Springer, 2016.
- [15] Mark Bun, Cynthia Dwork, Guy N Rothblum, and Thomas Steinke. Composable and versatile privacy via truncated CDP. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 74–86, 2018.
- [16] Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(3), 2011.
- [17] Christopher A Choquette-Choo, H Brendan McMahan, Keith Rush, and Abhradeep Thakurta. Multi-epoch matrix factorization mechanisms for private machine learning. arXiv preprint arXiv:2211.06530, 2022.
- [18] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.
- [19] Jinshuo Dong, Aaron Roth, and Weijie J Su. Gaussian differential privacy. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):3–37, 2022.
- [20] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.
- [21] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
- [22] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, pages 1054–1067, 2014.
- [23] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, page 439–449, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450369794. doi:10.1145/3357713.3384335. URL https://doi.org/10.11...
- [24] Ian Goodfellow. Efficient per-example gradient computations, 2015. URL https://arxiv.org/abs/1510.01799
- [25] Google Differential Privacy Team. dp-accounting: Tools for tracking differential privacy budgets. https://github.com, 2020.
- [26] Sivakanth Gopi, Yin Tat Lee, and Lukas Wutschitz. Numerical composition of differential privacy. Advances in Neural Information Processing Systems, 34:11631–11642, 2021.
- [27] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, and ... The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783
- [28] Derek Greene and Pádraig Cunningham. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, page 377–384, New York, NY, USA, 2006. Association for Computing Machinery. ISBN 1595933832. doi:10.1145/1143844.1143892. URL https://doi.org/10....
- [29] Michael F Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059–1076, 1989.
- [30] William B Johnson, Joram Lindenstrauss, et al. Extensions of Lipschitz mappings into a Hilbert space. Contemporary mathematics, 26(189-206):1, 1984.
- [31] Gautam Kamath. CS 860 lecture 5: Approximate differential privacy. Course notes for CS 860: Algorithms for Private Data Analysis, 2020. URL http://www.gautamkamath.com/CS860notes/lec5.pdf
- [32] Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Shie Mannor, Nathan Srebro, and Robert C. Williamson, editors, Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, pages 25.1–25.40, Edinburgh, Scotland, 25–27 J... 2012.
- [33] Anastassia Kornilova and Vladimir Eidelman. BillSum: A corpus for automatic summarization of US legislation. In Lu Wang, Jackie Chi Kit Cheung, Giuseppe Carenini, and Fei Liu, editors, Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 48–56, Hong Kong, China, November 2019. Association for Computational Linguistics. doi:10.18653...
- [34] Antti Koskela, Joonas Jälkö, and Antti Honkela. Computing tight differential privacy guarantees using FFT. In International Conference on Artificial Intelligence and Statistics, pages 2560–2569. PMLR, 2020.
- [35] Jaewoo Lee and Daniel Kifer. Scaling up differentially private deep learning with fast per-example gradient clipping. Proceedings on Privacy Enhancing Technologies, 2021.
- [36] Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. Large language models can be strong differentially private learners. arXiv preprint arXiv:2110.05679, 2021.
- [37] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [38] Albert W Marshall, Ingram Olkin, and Barry C Arnold. Inequalities: theory of majorization and its applications. 1979.
- [39] Michael Menart, Enayat Ullah, Raman Arora, Raef Bassily, and Cristóbal Guzmán. Differentially private non-convex optimization under the KL condition with optimal rates. In International Conference on Algorithmic Learning Theory, pages 868–906. PMLR, 2024.
- [40] Raphael A Meyer and Haim Avron. Hutchinson's estimator is bad at Kronecker-trace-estimation. arXiv preprint arXiv:2309.04952, 2023.
- [41] Raphael A Meyer, Cameron Musco, Christopher Musco, and David P Woodruff. Hutch++: Optimal stochastic trace estimation. In Symposium on Simplicity in Algorithms (SOSA), pages 142–155. SIAM, 2021.
- [42] Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th computer security foundations symposium (CSF), pages 263–275. IEEE, 2017.
- [43] Moshe Shaked and J George Shanthikumar. Stochastic orders. Springer, 2007.
- [44] V. Strassen. The Existence of Probability Measures with Given Marginals. The Annals of Mathematical Statistics, 36(2):423–439, 1965. doi:10.1214/aoms/1177700153. URL https://doi.org/10.1214/aoms/1177700153
- [45] Gábor J Székely and Nail K Bakirov. Extremal probabilities for Gaussian quadratic forms. Probability theory and related fields, 126(2):184–202, 2003.
- [46] Xinyu Tang, Ashwinee Panda, Milad Nasr, Saeed Mahloujifar, and Prateek Mittal. Private fine-tuning of large language models with zeroth-order optimization. arXiv preprint arXiv:2401.04343, 2024.
- [47] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Pro...
- [48] Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, et al. Opacus: User-friendly differential privacy library in PyTorch. arXiv preprint arXiv:2109.12298, 2021.
- [49] Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, and Huishuai Zhang. Differentially private fine-tuning of language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Q42f0dfjECO
- [50] Xinwei Zhang, Zhiqi Bu, Steven Wu, and Mingyi Hong. Differentially private SGD without clipping bias: An error-feedback approach. In International Conference on Learning Representations, 2024.