Efficient DP-SGD for LLMs with Randomized Clipping

Devansh Gupta; Enayat Ullah; Huanyu Zhang; Meisam Razaviyayn; Sai Aparna Aketi

arxiv: 2605.24879 · v1 · pith:ZLQBVEFRnew · submitted 2026-05-24 · 💻 cs.LG · math.OC

Pith reviewed 2026-06-30 12:22 UTC · model grok-4.3

classification 💻 cs.LG math.OC

keywords differential privacy · DP-SGD · randomized clipping · stochastic trace estimation · large language models · Hutchinson estimator · memory efficiency · gradient clipping

The pith

DP-SGD with randomized clipping via trace estimation reduces memory for private LLM training while matching utility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DP-SGD-RC, which applies randomized clipping using stochastic trace estimation to approximate per-sample gradient norms in differential privacy training. This addresses the high memory overhead of standard fast gradient clipping techniques that scale with batch size, sequence length, and model dimension. The method comes with a privacy analysis that shows noise multipliers competitive with deterministic clipping. Experiments demonstrate that fine-tuning the Llama 3.2-1B model on long-context tasks achieves similar utility to baselines but with lower memory and compute needs. Readers would care if this makes private training feasible for larger models and longer sequences.
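The idea in the paragraph above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: it assumes a single linear layer whose per-sample gradient is `A^T G` (activations `A` of shape T x d_in, output gradients `G` of shape T x d_out), uses Gaussian probes, and applies the usual `min(1, C / norm)` clipping rule to the estimated norm. The helper names (`hutchinson_sq_norm`, `clip_factor`) and the threshold `C` are hypothetical.

```python
import math
import random

def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    """Transpose a list-of-rows matrix."""
    return [list(col) for col in zip(*X)]

def exact_sq_norm(A, G):
    """||A^T G||_F^2 computed by forming the full d_in x d_out gradient."""
    Q = matmul(transpose(A), G)
    return sum(v * v for row in Q for v in row)

def hutchinson_sq_norm(A, G, k, rng):
    """Estimate ||A^T G||_F^2 with k Gaussian probes without forming A^T G.

    Only a T x k and a d_in x k intermediate are built, instead of d_in x d_out.
    """
    d_out = len(G[0])
    Z = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d_out)]
    GZ = matmul(G, Z)               # T x k
    P = matmul(transpose(A), GZ)    # d_in x k, equal to (A^T G) Z
    return sum(v * v for row in P for v in row) / k

def clip_factor(sq_norm_estimate, C):
    """Assumed clipping rule: scale the gradient by min(1, C / estimated norm)."""
    return min(1.0, C / math.sqrt(sq_norm_estimate))
```

Because the scaling factor depends on a random estimate rather than the true norm, the clipped gradient can exceed `C` on some draws, which is why the clipping is called randomized and why it needs its own privacy analysis.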

Core claim

DP-SGD-RC is a variant of DP-SGD that uses Hutchinson's estimator and Hutch++ for randomized clipping, reducing the memory complexity of per-sample gradient norm estimation while providing tight privacy bounds and preserving model utility on downstream tasks.

What carries the argument

Stochastic trace estimation with Hutchinson's estimator and Hutch++ to approximate per-sample gradient norms for randomized clipping in DP-SGD.
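For readers unfamiliar with Hutch++, the following is a minimal sketch of the estimator as described by Meyer et al. (2021), written for a small explicit symmetric PSD matrix `M` (think `M = Q^T Q` for a gradient matrix `Q`). It uses Gaussian probes and splits the budget `k` into thirds; the function names and the relative tolerance in the Gram-Schmidt step are choices made here for illustration, and a practical implementation would use matrix-vector products rather than an explicit `M`.

```python
import random

def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def project_out(v, basis):
    """Remove from v its components along an orthonormal basis."""
    for b in basis:
        dot = sum(vi * bi for vi, bi in zip(v, b))
        v = [vi - dot * bi for vi, bi in zip(v, b)]
    return v

def orthonormal_columns(X, rel_tol=1e-8):
    """Modified Gram-Schmidt on the columns of X, dropping dependent columns."""
    basis = []
    for col in zip(*X):
        original = sum(c * c for c in col) ** 0.5
        v = project_out(list(col), basis)
        norm = sum(vi * vi for vi in v) ** 0.5
        if norm > rel_tol * original:
            basis.append([vi / norm for vi in v])
    return basis

def quad_form(M, v):
    """v^T M v for a list-of-rows matrix M."""
    Mv = [sum(mij * vj for mij, vj in zip(row, v)) for row in M]
    return sum(vi * mvi for vi, mvi in zip(v, Mv))

def hutchpp_trace(M, k, rng):
    """Hutch++ estimate of tr(M) for symmetric PSD M with a budget of about k probes."""
    d = len(M)
    m = max(1, k // 3)
    S = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(d)]
    probes = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    # Orthonormal basis for the range of M S, which captures the top of the spectrum.
    basis = orthonormal_columns(matmul(M, S))
    # Exact trace of M restricted to the captured subspace.
    captured = sum(quad_form(M, b) for b in basis)
    # Plain Hutchinson on the deflated remainder (I - BB^T) M (I - BB^T).
    remainder = sum(quad_form(M, project_out(g, basis)) for g in probes) / m
    return captured + remainder
```

When `M` has rank at most `k // 3`, the captured subspace contains the whole range of `M` and the estimate is exact up to floating-point error; otherwise the Hutchinson term only has to handle the tail of the spectrum.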

If this is right

DP-SGD-RC achieves noise multipliers competitive with deterministic clipping.
Experiments show it matches baseline utility on classification, question answering, and summarization tasks.
It significantly reduces memory and compute requirements compared to standard DP-SGD implementations.
The approach supports fine-tuning on long-context benchmarks without proportional resource increase.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar randomized estimation techniques might apply to other DP mechanisms that require per-sample statistics.
The memory savings could permit larger batch sizes or longer contexts in private training setups.
If the estimation variance is controlled, it may extend to full pre-training of LLMs rather than just fine-tuning.

Load-bearing premise

The stochastic trace estimation produces sufficiently accurate per-sample gradient norm estimates that the subsequent clipping and noise addition preserve both the stated privacy bound and downstream model utility.

What would settle it

A direct comparison experiment showing that DP-SGD-RC either violates the claimed privacy guarantees or yields measurably lower task performance than deterministic clipping on the same Llama model and datasets.

Figures

Figures reproduced from arXiv: 2605.24879 by Devansh Gupta, Enayat Ullah, Huanyu Zhang, Meisam Razaviyayn, Sai Aparna Aketi.

**Figure 1.** Envelope functions for Hutch and Hutch++ for k = 32 and d = 2 (left) and d = 2048 (right). In the left, we see a separation between Hutch and Hutch++, but in the right (practical setting), they are essentially overlapping.

**Figure 2.** Peak memory savings for full fine-tuning settings vs. projection dimension for different linear layers

**Figure 3.** Compute savings for full fine-tuning settings vs. projection dimension for different linear layers of …

**Figure 4.** Latency savings for full fine-tuning as a function of projection dimension for different linear layers of …

**Figure 5.** Behavior of the optimal $\{\lambda_i\}_i$ for $d = 3$ as a function of $x$, where $s(x) = \sup_{\lambda \in [0,1]} P(\lambda X_1 + (1 - \lambda) X_2 \le x)$.

**Figure 6.** Simulations showing x− is essentially 1. Left panel: derivative at t = 0 vs. ϵ; shows $\partial F / \partial t \,|_{x = 1 + \epsilon,\, t = 0}$, the rate of change of the CDF as we move away from the vertex. All lines have slope 1 on the log-log plot, meaning the derivative ∝ ϵ. The CDF therefore starts increasing immediately when moving away from the vertex, and this effect gets stronger as dk increases. Middle panel: maximum excess vs. ϵ; shows max_t[F(…

**Figure 7.** Simulations on x+ across settings of d and k together with the 1 + 2 dk fit

**Figure 8.** Simulations of the envelope CDF for k = 2, d = 2

**Figure 9.** Simulations of the envelope CDF for k = 32, d = 2

**Figure 10.** A high-level structure proof of Proposition …

**Figure 11.** (a) Relative error in norm estimation with Hutch and Hutch…

**Figure 12.** Noise multiplier (σ) as a function of the hidden (non-projected) dimension (d) of the linear layer for a fixed projection dimension (k) of 32, for Hutch, Hutch++ and DP-SGD (Baseline). We used the BBC dataset experimental setup for both plots.

**Figure 13.** Peak memory savings for full fine-tuning settings without considering inputs (activations and …

Original abstract

Large language models (LLMs) are trained on vast datasets that may contain sensitive information. Differential privacy (DP), the de facto standard for formal privacy guarantees, provides a principled framework for training LLMs with provable privacy protection. However, state-of-the-art DP training implementations rely on fast gradient clipping techniques with memory overhead $O(B \min\{T^2, d^2\})$, where $B$ is the batch size, $T$ is the sequence length, and $d$ is the model width. This becomes prohibitive as both model size and context length grow. We propose DP-SGD-RC, a novel variant of DP-SGD with randomized clipping that reduces memory and compute complexity. DP-SGD-RC leverages stochastic trace estimation methods, specifically Hutchinson's estimator [Hutchinson, 1989] and its improved variant, Hutch++ [Meyer et al., 2021], to reduce the memory footprint of per-sample gradient norm estimation. We provide a tight privacy analysis showing that DP-SGD-RC achieves noise multipliers competitive with deterministic clipping. Experiments fine-tuning Llama 3.2-1B on long-context benchmarks spanning classification, question answering, and summarization tasks demonstrate that DP-SGD-RC matches baseline utility while significantly reducing memory and compute requirements.
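The $\min\{T^2, d^2\}$ term in the abstract reflects two exact ways to get a per-sample gradient norm for a linear layer. A minimal sketch, assuming the per-sample gradient is `A^T G` with activations `A` (T x d_in) and output gradients `G` (T x d_out): one route touches every entry of the d_in x d_out gradient, the other uses the identity `||A^T G||_F^2 = <A A^T, G G^T>` and needs only two T x T Gram matrices. The function names and the toy matrices are illustrative, not taken from the paper.

```python
def gram(X):
    """X X^T for a list-of-rows matrix X (a T x T matrix)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in X] for r1 in X]

def sq_norm_via_gradient(A, G):
    """d^2 route: visit each entry of the d_in x d_out gradient A^T G."""
    total = 0.0
    for col_a in zip(*A):
        for col_g in zip(*G):
            entry = sum(a * g for a, g in zip(col_a, col_g))
            total += entry * entry
    return total

def sq_norm_via_grams(A, G):
    """T^2 route: elementwise inner product of the Gram matrices A A^T and G G^T."""
    AA, GG = gram(A), gram(G)
    return sum(x * y for ra, rg in zip(AA, GG) for x, y in zip(ra, rg))

# Toy example with T = 3, d_in = 2, d_out = 3.
A = [[1, 2], [3, 4], [0, 1]]
G = [[1, 0, 2], [0, 1, 1], [1, 1, 0]]
norm_d2 = sq_norm_via_gradient(A, G)
norm_t2 = sq_norm_via_grams(A, G)
```

Both routes return the same number; which is cheaper depends on whether the sequence length or the layer width is larger, and both grow quadratically in one of them, which is the cost randomized clipping aims to avoid.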

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DP-SGD-RC cuts memory via Hutch++ clipping for LLM fine-tuning but the privacy bound looks incomplete because the estimator can underestimate norms.

The paper's core move is replacing exact per-sample gradient norm computation in DP-SGD with Hutchinson and Hutch++ trace estimators so that clipping no longer requires O(B min{T^2, d^2}) memory. That change is the actual novelty; prior DP-SGD work either used full per-sample gradients or cruder approximations that did not target long-context LLMs.

The experiments are the strongest part. Fine-tuning Llama 3.2-1B on classification, QA, and summarization benchmarks shows utility that matches the deterministic clipping baseline while memory and compute drop noticeably. The authors report the method scales to longer contexts without the usual blow-up, which is the practical payoff people care about.

The soft spot is the privacy claim. The abstract states a tight analysis that yields competitive noise multipliers, yet the stress-test concern holds: Hutchinson's estimator is unbiased but its variance scales with effective dimension. For LLM gradients the probability of under-estimation is non-negligible, and under-clipping breaks the L2 sensitivity the proof relies on. Nothing in the provided abstract or stress-test note shows an explicit high-probability error bound or an adjusted sensitivity that absorbs the tail. If the full paper does not supply that step, the competitive noise multiplier does not follow.
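The under-estimation concern can be made concrete with a small simulation. This is a sketch under explicit assumptions, not an experiment from the paper: with `k` Gaussian probes, the ratio of the Hutchinson estimate of the squared norm to its true value is distributed as a weighted average of independent chi-square(`k`)/`k` variables, with weights given by the normalized eigenvalues of `Q^T Q`. The function names and the two example spectra are hypothetical.

```python
import random

def hutch_ratio_samples(weights, k, trials, rng):
    """Draw samples of (Hutchinson estimate) / (true squared norm).

    weights are the eigenvalues of Q^T Q; each contributes an independent
    chi-square(k) / k factor when the probes are standard Gaussian.
    """
    total = sum(weights)
    samples = []
    for _ in range(trials):
        estimate = 0.0
        for w in weights:
            chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))
            estimate += w * chi2 / k
        samples.append(estimate / total)
    return samples

def underestimation_rate(samples, threshold=1.0):
    """Fraction of draws whose ratio falls below the threshold."""
    return sum(s < threshold for s in samples) / len(samples)

def variance(samples):
    """Population variance of a list of numbers."""
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)

rng = random.Random(0)
rank_one = hutch_ratio_samples([1.0], k=32, trials=4000, rng=rng)
flat = hutch_ratio_samples([1.0] * 16, k=32, trials=500, rng=rng)
```

In this model the estimate falls below the true value on roughly half of the draws for any spectrum, so the relevant question for a privacy proof is how far below it can fall, and the spread is widest for a rank-one gradient and narrows as the spectrum flattens.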

The work is aimed at people who actually run DP fine-tuning on models larger than a few hundred million parameters. A reader who needs lower memory footprints will find the empirical numbers useful even if they later have to patch the analysis. The paper is coherent on its own terms and engages the right prior art on trace estimation and DP-SGD, so it clears the bar for serious refereeing. I would send it out for review with the expectation that the privacy section receives the most scrutiny.

Referee Report

2 major / 1 minor

Summary. The paper introduces DP-SGD-RC, a DP-SGD variant that replaces exact per-sample gradient clipping with randomized clipping based on stochastic trace estimation (Hutchinson's estimator and Hutch++) of ||g_i||_2^2. It claims this yields memory/compute savings over standard O(B min{T^2,d^2}) clipping while providing a tight privacy analysis whose noise multipliers remain competitive with deterministic clipping, and that fine-tuning Llama 3.2-1B on long-context tasks preserves utility.

Significance. If the privacy analysis correctly accounts for norm-estimation error, the method would materially improve the practicality of DP training for billion-parameter LLMs by lowering the memory barrier that currently limits batch size and context length.

major comments (2)

[Privacy Analysis] The central privacy claim (that noise multipliers remain competitive) rests on the assumption that the stochastic estimator produces per-sample norms sufficiently accurate that the effective L2 sensitivity stays bounded by the clipping threshold. The abstract asserts a 'tight privacy analysis' but provides no high-probability tail bound on the underestimation probability of Hutchinson/Hutch++ for high-rank LLM gradients (d~10^9). Without an explicit adjustment to the sensitivity or a failure-probability term in the privacy budget, the competitive multiplier does not follow.
[Experiments] § on experimental setup: the utility-matching claim for Llama 3.2-1B is reported without error bars on the number of Hutchinson samples used per gradient or on the observed norm-estimation error distribution; this leaves open whether the reported utility holds only for estimator configurations that already violate the sensitivity assumption used in the proof.

minor comments (1)

[Abstract] Notation for the randomized clipping threshold and the subsequent noise scale should be defined once and used consistently; the abstract introduces 'randomized clipping' without an equation relating the estimator output to the final clipped gradient.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our work. We address each major comment below, indicating the revisions we will make to strengthen the manuscript.

Referee: [Privacy Analysis] The central privacy claim (that noise multipliers remain competitive) rests on the assumption that the stochastic estimator produces per-sample norms sufficiently accurate that the effective L2 sensitivity stays bounded by the clipping threshold. The abstract asserts a 'tight privacy analysis' but provides no high-probability tail bound on the underestimation probability of Hutchinson/Hutch++ for high-rank LLM gradients (d~10^9). Without an explicit adjustment to the sensitivity or a failure-probability term in the privacy budget, the competitive multiplier does not follow.

Authors: We agree that the presentation of the privacy analysis can be strengthened by making the dependence on estimator concentration explicit. The manuscript invokes the known high-probability error bounds for Hutch++ (Meyer et al., 2021), but does not yet fold the failure probability into the overall (ε,δ) budget or discuss the scaling of these bounds for d ≈ 10^9. We will revise the privacy section to (i) state the tail bounds used, (ii) allocate a small failure probability η to the estimator and compose it with the DP guarantee via a standard union bound, and (iii) recompute the effective noise multiplier under this adjusted sensitivity. This will make the claim of competitive multipliers fully rigorous. revision: yes
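The composition in step (ii) follows a standard pattern, sketched here for orientation rather than as the paper's proof. The notation is assumed: $E$ is a hypothetical "good" event on which the norm estimate is accurate enough for the sensitivity bound to hold, $\Pr[E^{c}] \le \eta$, and the mechanism $\mathcal{M}$ is assumed to satisfy the $(\varepsilon, \delta)$ inequality when restricted to $E$. For neighboring datasets $D, D'$ and any measurable set $S$:

```latex
\begin{align*}
\Pr[\mathcal{M}(D) \in S]
  &\le \Pr[\mathcal{M}(D) \in S,\; E] + \Pr[E^{c}] \\
  &\le e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta + \eta .
\end{align*}
```

Under these assumptions a single step is $(\varepsilon, \delta + \eta)$-DP. If the failure event is charged once per step, a union bound over $N$ steps (with $N$ a symbol introduced here) adds at most $N\eta$ to the overall $\delta$, so $\eta$ has to be small relative to the target $\delta$.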
Referee: [Experiments] § on experimental setup: the utility-matching claim for Llama 3.2-1B is reported without error bars on the number of Hutchinson samples used per gradient or on the observed norm-estimation error distribution; this leaves open whether the reported utility holds only for estimator configurations that already violate the sensitivity assumption used in the proof.

Authors: We acknowledge the value of reporting variability. The current experiments fix the number of Hutchinson samples but do not display error bars across runs or the empirical distribution of relative norm error. In the revised version we will (i) report mean and standard deviation of downstream metrics over at least three independent runs, (ii) include histograms or quantiles of the observed ||ĝ_i|| / ||g_i|| ratio for the chosen sample count, and (iii) verify that the chosen configuration keeps the underestimation probability below the η used in the updated privacy analysis. revision: yes

Circularity Check

0 steps flagged

No circularity; privacy analysis and estimator citations are independent of target claims

The provided abstract and reader summary contain no equations, fitted parameters, or self-citations that reduce the claimed privacy multipliers, noise analysis, or utility results to a definition or input by construction. Hutchinson (1989) and Hutch++ (Meyer et al. 2021) are external citations; the privacy analysis is presented as a separate contribution without evidence of self-definitional loops or renaming. The derivation chain therefore remains self-contained against external benchmarks and does not trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review is based solely on the abstract; no free parameters, ad-hoc axioms, or invented entities are visible beyond reliance on two cited prior estimators.

axioms (2)

standard math Hutchinson's estimator yields an unbiased estimate of the trace of a matrix
Cited directly from Hutchinson 1989 in the abstract
standard math Hutch++ reduces variance of the trace estimate relative to plain Hutchinson
Cited from Meyer et al. 2021 in the abstract
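The first axiom can be checked in one line. A sketch, assuming a random probe vector $z$ with $\mathbb{E}[z z^{\top}] = I$ (standard Gaussian or Rademacher entries both qualify); the symbols $z$, $M$, and $Q$ are notation chosen here, with $Q$ standing for a per-sample gradient matrix:

```latex
\mathbb{E}\!\left[z^{\top} M z\right]
  = \mathbb{E}\!\left[\operatorname{tr}\!\left(M z z^{\top}\right)\right]
  = \operatorname{tr}\!\left(M\,\mathbb{E}\!\left[z z^{\top}\right]\right)
  = \operatorname{tr}(M),
\qquad
M = Q^{\top} Q \;\Rightarrow\; \operatorname{tr}(M) = \lVert Q \rVert_F^{2}.
```

Unbiasedness concerns the squared norm on average; it says nothing by itself about how far a single draw can fall below the true value, which is the quantity a privacy analysis has to control.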

Reference graph

Works this paper leans on

50 extracted references · 13 canonical work pages · 3 internal anchors

[1]

Deep learning with differential privacy

Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, pages 308--318, 2016

2016
[2]

The us census bureau adopts differential privacy

John M Abowd. The us census bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2867--2867, 2018

2018
[3]

Scaling private deep learning with opacus: Advances for large language models

Sai Aparna Aketi, Will Bullock, Iden Kalemaj, Enayat Ullah, and Huanyu Zhang. Scaling private deep learning with opacus: Advances for large language models. In Championing Open-source DEvelopment in ML Workshop@ ICML25, 2025

2025
[4]

Differentially private learning with adaptive clipping

Galen Andrew, Om Thakkar, Brendan McMahan, and Swaroop Ramaswamy. Differentially private learning with adaptive clipping. Advances in Neural Information Processing Systems, 34:17455--17466, 2021

2021
[5]

Learning with privacy at scale

Apple Differential Privacy Team. Learning with privacy at scale. Apple Machine Learning Journal, 1(8), December 2017. URL https://machinelearning.apple.com/research/learning-with-privacy-at-scale

2017
[6]

Faster rates of convergence to stationary points in differentially private optimization

Raman Arora, Raef Bassily, Tomás González, Cristóbal A Guzmán, Michael Menart, and Enayat Ullah. Faster rates of convergence to stationary points in differentially private optimization. In International Conference on Machine Learning, pages 1060--1092. PMLR, 2023

2023
[7]

Private stochastic convex optimization: Optimal rates in $\ell_1$ geometry

Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in $\ell_1$ geometry. In International Conference on Machine Learning, pages 393--403. PMLR, 2021

2021
[8]

Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix

Haim Avron and Sivan Toledo. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. Journal of the ACM (JACM), 58(2):1--34, 2011

2011
[9]

Private empirical risk minimization: Efficient algorithms and tight error bounds

Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In IEEE Annual Symposium on Foundations of Computer Science, pages 464--473. IEEE, 2014

2014
[10]

Bayesian theory, volume 586

José M Bernardo, Adrian FM Smith, and Mark Berliner. Bayesian theory, volume 586. Wiley Online Library, 1994

1994
[11]

Tres observaciones sobre el algebra lineal

Garrett Birkhoff. Tres observaciones sobre el algebra lineal. Univ. Nac. Tucuman, Ser. A, 5: 0 147--154, 1946

1946
[12]

Fast and memory efficient differentially private-sgd via jl projections

Zhiqi Bu, Sivakanth Gopi, Janardhan Kulkarni, Yin Tat Lee, Hanwen Shen, and Uthaipon Tantipongpipat. Fast and memory efficient differentially private-sgd via jl projections. Advances in Neural Information Processing Systems, 34:19680--19691, 2021

2021
[13]

Differentially private optimization on large model at small cost

Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. Differentially private optimization on large model at small cost. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023

2023
[14]

Concentrated differential privacy: Simplifications, extensions, and lower bounds

Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of cryptography conference, pages 635--658. Springer, 2016

2016
[15]

Composable and versatile privacy via truncated cdp

Mark Bun, Cynthia Dwork, Guy N Rothblum, and Thomas Steinke. Composable and versatile privacy via truncated cdp. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 74--86, 2018

2018
[16]

Differentially private empirical risk minimization

Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(3), 2011

2011
[17]

Multi-epoch matrix factorization mechanisms for private machine learning

Christopher A Choquette-Choo, H Brendan McMahan, Keith Rush, and Abhradeep Thakurta. Multi-epoch matrix factorization mechanisms for private machine learning. arXiv preprint arXiv:2211.06530, 2022

2022
[18]

An elementary proof of a theorem of johnson and lindenstrauss

Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of johnson and lindenstrauss. Random Structures & Algorithms, 22(1):60--65, 2003

2003
[19]

Gaussian differential privacy

Jinshuo Dong, Aaron Roth, and Weijie J Su. Gaussian differential privacy. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):3--37, 2022

2022
[20]

Calibrating noise to sensitivity in private data analysis

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265--284. Springer, 2006

2006
[21]

The algorithmic foundations of differential privacy

Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3--4):211--407, 2014

2014
[22]

Rappor: Randomized aggregatable privacy-preserving ordinal response

Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, pages 1054--1067, 2014

2014
[23]

Private stochastic convex optimization: optimal rates in linear time

Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, page 439–449, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450369794. doi:10.1145/3357713.3384335. URL https://doi.org/10.11...

2020
[24]

Efficient Per-Example Gradient Computations

Ian Goodfellow. Efficient per-example gradient computations, 2015. URL https://arxiv.org/abs/1510.01799

2015
[25]

dp-accounting: Tools for tracking differential privacy budgets

Google Differential Privacy Team. dp-accounting: Tools for tracking differential privacy budgets. https://github.com, 2020

2020
[26]

Numerical composition of differential privacy

Sivakanth Gopi, Yin Tat Lee, and Lukas Wutschitz. Numerical composition of differential privacy. Advances in Neural Information Processing Systems, 34:11631--11642, 2021

2021
[27]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, and ... The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

2024
[28]

Practical solutions to the problem of diagonal dominance in kernel document clustering

Derek Greene and Pádraig Cunningham. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, page 377–384, New York, NY, USA, 2006. Association for Computing Machinery. ISBN 1595933832. doi:10.1145/1143844.1143892. URL https://doi.org/10....

2006
[29]

A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines

Michael F Hutchinson. A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059--1076, 1989

1989
[30]

Extensions of lipschitz mappings into a hilbert space

William B Johnson, Joram Lindenstrauss, et al. Extensions of lipschitz mappings into a hilbert space. Contemporary mathematics, 26(189--206):1, 1984

1984
[31]

Cs 860 lecture 5: Approximate differential privacy

Gautam Kamath. Cs 860 lecture 5: Approximate differential privacy. Course notes for CS 860: Algorithms for Private Data Analysis, 2020. URL http://www.gautamkamath.com/CS860notes/lec5.pdf

2020
[32]

Private convex empirical risk minimization and high-dimensional regression

Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Shie Mannor, Nathan Srebro, and Robert C. Williamson, editors, Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, pages 25.1--25.40, Edinburgh, Scotland, 25--27 J...

2012
[33]

BillSum: A corpus for automatic summarization of US legislation

Anastassia Kornilova and Vladimir Eidelman. BillSum: A corpus for automatic summarization of US legislation. In Lu Wang, Jackie Chi Kit Cheung, Giuseppe Carenini, and Fei Liu, editors, Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 48--56, Hong Kong, China, November 2019. Association for Computational Linguistics. doi:10.18653...

2019
[34]

Computing tight differential privacy guarantees using fft

Antti Koskela, Joonas Jälkö, and Antti Honkela. Computing tight differential privacy guarantees using fft. In International Conference on Artificial Intelligence and Statistics, pages 2560--2569. PMLR, 2020

2020
[35]

Scaling up differentially private deep learning with fast per-example gradient clipping

Jaewoo Lee and Daniel Kifer. Scaling up differentially private deep learning with fast per-example gradient clipping. Proceedings on Privacy Enhancing Technologies, 2021

2021
[36]

Large language models can be strong differentially private learners

Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. Large language models can be strong differentially private learners. arXiv preprint arXiv:2110.05679, 2021

2021
[37]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

2017
[38]

Inequalities: theory of majorization and its applications

Albert W Marshall, Ingram Olkin, and Barry C Arnold. Inequalities: theory of majorization and its applications. 1979

1979
[39]

Differentially private non-convex optimization under the kl condition with optimal rates

Michael Menart, Enayat Ullah, Raman Arora, Raef Bassily, and Cristóbal Guzmán. Differentially private non-convex optimization under the kl condition with optimal rates. In International Conference on Algorithmic Learning Theory, pages 868--906. PMLR, 2024

2024
[40]

Hutchinson's estimator is bad at kronecker-trace-estimation

Raphael A Meyer and Haim Avron. Hutchinson's estimator is bad at kronecker-trace-estimation. arXiv preprint arXiv:2309.04952, 2023

2023
[41]

Hutch++: Optimal stochastic trace estimation

Raphael A Meyer, Cameron Musco, Christopher Musco, and David P Woodruff. Hutch++: Optimal stochastic trace estimation. In Symposium on Simplicity in Algorithms (SOSA), pages 142--155. SIAM, 2021

2021
[42]

Rényi differential privacy

Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th computer security foundations symposium (CSF), pages 263--275. IEEE, 2017

2017
[43]

Stochastic orders

Moshe Shaked and J George Shanthikumar. Stochastic orders. Springer, 2007

2007
[44]

The Existence of Probability Measures with Given Marginals

V. Strassen. The Existence of Probability Measures with Given Marginals. The Annals of Mathematical Statistics, 36(2):423--439, 1965. doi:10.1214/aoms/1177700153. URL https://doi.org/10.1214/aoms/1177700153

1965
[45]

Extremal probabilities for gaussian quadratic forms

Gábor J Székely and Nail K Bakirov. Extremal probabilities for gaussian quadratic forms. Probability theory and related fields, 126(2):184--202, 2003

2003
[46]

Private fine-tuning of large language models with zeroth-order optimization

Xinyu Tang, Ashwinee Panda, Milad Nasr, Saeed Mahloujifar, and Prateek Mittal. Private fine-tuning of large language models with zeroth-order optimization. arXiv preprint arXiv:2401.04343, 2024

2024
[47]

HotpotQA: A dataset for diverse, explainable multi-hop question answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Pro...

2018
[48]

Opacus: User-friendly differential privacy library in pytorch

Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, et al. Opacus: User-friendly differential privacy library in pytorch. arXiv preprint arXiv:2109.12298, 2021

2021
[49]

Differentially private fine-tuning of language models

Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, and Huishuai Zhang. Differentially private fine-tuning of language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Q42f0dfjECO

2022
[50]

Differentially private SGD without clipping bias: An error-feedback approach

Xinwei Zhang, Zhiqi Bu, Steven Wu, and Mingyi Hong. Differentially private SGD without clipping bias: An error-feedback approach. In International Conference on Learning Representations, 2024

2024


Garrett Birkhoff. Tres observaciones sobre el algebra lineal. Univ. Nac. Tucuman, Ser. A, 5: 0 147--154, 1946

1946

[12] [12]

Fast and memory efficient differentially private-sgd via jl projections

Zhiqi Bu, Sivakanth Gopi, Janardhan Kulkarni, Yin Tat Lee, Hanwen Shen, and Uthaipon Tantipongpipat. Fast and memory efficient differentially private-sgd via jl projections. Advances in Neural Information Processing Systems, 34: 0 19680--19691, 2021

2021

[13] [13]

Differentially private optimization on large model at small cost

Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. Differentially private optimization on large model at small cost. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023

2023

[14] [14]

Concentrated differential privacy: Simplifications, extensions, and lower bounds

Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of cryptography conference, pages 635--658. Springer, 2016

2016

[15] [15]

Composable and versatile privacy via truncated cdp

Mark Bun, Cynthia Dwork, Guy N Rothblum, and Thomas Steinke. Composable and versatile privacy via truncated cdp. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 74--86, 2018

2018

[16] [16]

Differentially private empirical risk minimization

Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12 0 (3), 2011

2011

[17] [17]

Multi-epoch matrix factorization mechanisms for private machine learning

Christopher A Choquette-Choo, H Brendan McMahan, Keith Rush, and Abhradeep Thakurta. Multi-epoch matrix factorization mechanisms for private machine learning. arXiv preprint arXiv:2211.06530, 2022

work page arXiv 2022

[18] [18]

An elementary proof of a theorem of johnson and lindenstrauss

Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of johnson and lindenstrauss. Random Structures & Algorithms, 22 0 (1): 0 60--65, 2003

2003

[19] [19]

Gaussian differential privacy

Jinshuo Dong, Aaron Roth, and Weijie J Su. Gaussian differential privacy. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84 0 (1): 0 3--37, 2022

2022

[20] [20]

Calibrating noise to sensitivity in private data analysis

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265--284. Springer, 2006

2006

[21] [21]

The algorithmic foundations of differential privacy

Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science , 9 0 (3--4): 0 211--407, 2014

2014

[22] [22]

Rappor: Randomized aggregatable privacy-preserving ordinal response

\'U lfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, pages 1054--1067, 2014

2014

[23] [23]

Private stochastic convex optimization: optimal rates in linear time

Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, page 439–449, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450369794. doi:10.1145/3357713.3384335. URL https://doi.org/10.11...

work page doi:10.1145/3357713.3384335 2020

[24] [24]

Efficient Per-Example Gradient Computations

Ian Goodfellow. Efficient per-example gradient computations, 2015. URL https://arxiv.org/abs/1510.01799

work page internal anchor Pith review Pith/arXiv arXiv 2015

[25] [25]

dp-accounting: Tools for tracking differential privacy budgets

Google Differential Privacy Team . dp-accounting: Tools for tracking differential privacy budgets. https://github.com, 2020

2020

[26] [26]

Numerical composition of differential privacy

Sivakanth Gopi, Yin Tat Lee, and Lukas Wutschitz. Numerical composition of differential privacy. Advances in Neural Information Processing Systems, 34: 0 11631--11642, 2021

2021

[27] [27]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, and ... The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [28]

Proceedings of the 23rd International Conference on Machine Learning , series =

Derek Greene and P\' a draig Cunningham. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, page 377–384, New York, NY, USA, 2006. Association for Computing Machinery. ISBN 1595933832. doi:10.1145/1143844.1143892. URL https://doi.org/10....

work page doi:10.1145/1143844.1143892 2006

[29] [29]

A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines

Michael F Hutchinson. A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18 0 (3): 0 1059--1076, 1989

1989

[30] [30]

Extensions of lipschitz mappings into a hilbert space

William B Johnson, Joram Lindenstrauss, et al. Extensions of lipschitz mappings into a hilbert space. Contemporary mathematics, 26 0 (189-206): 0 1, 1984

1984

[31] [31]

Cs 860 lecture 5: Approximate differential privacy

Gautam Kamath. Cs 860 lecture 5: Approximate differential privacy. Course notes for CS 860: Algorithms for Private Data Analysis, 2020. URL http://www.gautamkamath.com/CS860notes/lec5.pdf

2020

[32] [32]

Private convex empirical risk minimization and high-dimensional regression

Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Shie Mannor, Nathan Srebro, and Robert C. Williamson, editors, Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, pages 25.1--25.40, Edinburgh, Scotland, 25--27 J...

2012

[33] [33]

B ill S um: A corpus for automatic summarization of US legislation

Anastassia Kornilova and Vladimir Eidelman. B ill S um: A corpus for automatic summarization of US legislation. In Lu Wang, Jackie Chi Kit Cheung, Giuseppe Carenini, and Fei Liu, editors, Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 48--56, Hong Kong, China, November 2019. Association for Computational Linguistics. doi:10.18653...

work page doi:10.18653/v1/d19-5406 2019

[34] [34]

Computing tight differential privacy guarantees using fft

Antti Koskela, Joonas J \"a lk \"o , and Antti Honkela. Computing tight differential privacy guarantees using fft. In International Conference on Artificial Intelligence and Statistics, pages 2560--2569. PMLR, 2020

2020

[35] [35]

Scaling up differentially private deep learning with fast per-example gradient clipping

Jaewoo Lee and Daniel Kifer. Scaling up differentially private deep learning with fast per-example gradient clipping. Proceedings on Privacy Enhancing Technologies, 2021

2021

[36] [36]

arXiv preprint arXiv:2110.05679 , year=

Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. Large language models can be strong differentially private learners. arXiv preprint arXiv:2110.05679, 2021

work page arXiv 2021

[37] [37]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[38] [38]

Inequalities: theory of majorization and its applications

Albert W Marshall, Ingram Olkin, and Barry C Arnold. Inequalities: theory of majorization and its applications. 1979

1979

[39] [39]

Differentially private non-convex optimization under the kl condition with optimal rates

Michael Menart, Enayat Ullah, Raman Arora, Raef Bassily, and Crist \'o bal Guzm \'a n. Differentially private non-convex optimization under the kl condition with optimal rates. In International Conference on Algorithmic Learning Theory, pages 868--906. PMLR, 2024

2024

[40] [40]

Hutchinson's estimator is bad at kronecker-trace-estimation

Raphael A Meyer and Haim Avron. Hutchinson's estimator is bad at kronecker-trace-estimation. arXiv preprint arXiv:2309.04952, 2023

work page arXiv 2023

[41] [41]

Hutch++: Optimal stochastic trace estimation

Raphael A Meyer, Cameron Musco, Christopher Musco, and David P Woodruff. Hutch++: Optimal stochastic trace estimation. In Symposium on Simplicity in Algorithms (SOSA), pages 142--155. SIAM, 2021

2021

[42] [42]

R \'e nyi differential privacy

Ilya Mironov. R \'e nyi differential privacy. In 2017 IEEE 30th computer security foundations symposium (CSF), pages 263--275. IEEE, 2017

2017

[43] [43]

Stochastic orders

Moshe Shaked and J George Shanthikumar. Stochastic orders. Springer, 2007

2007

[44] [44]

Loftsgaarden and Charles P

V. Strassen. The Existence of Probability Measures with Given Marginals . The Annals of Mathematical Statistics, 36 0 (2): 0 423 -- 439, 1965. doi:10.1214/aoms/1177700153. URL https://doi.org/10.1214/aoms/1177700153

work page doi:10.1214/aoms/1177700153 1965

[45] [45]

Extremal probabilities for gaussian quadratic forms

G \'a bor J Sz \'e kely and Nail K Bakirov. Extremal probabilities for gaussian quadratic forms. Probability theory and related fields, 126 0 (2): 0 184--202, 2003

2003

[46] [46]

arXiv preprint arXiv:2401.04343 , year=

Xinyu Tang, Ashwinee Panda, Milad Nasr, Saeed Mahloujifar, and Prateek Mittal. Private fine-tuning of large language models with zeroth-order optimization. arXiv preprint arXiv:2401.04343, 2024

work page arXiv 2024

[47] [47]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. H otpot QA : A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun ' ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Pro...

work page doi:10.18653/v1/d18-1259 2018

[48] [48]

Opacus: User-friendly differential privacy library in pytorch.arXiv preprint arXiv:2109.12298, 2021

Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, et al. Opacus: User-friendly differential privacy library in pytorch. arXiv preprint arXiv:2109.12298, 2021

work page arXiv 2021

[49] [49]

Differentially private fine-tuning of language models

Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, and Huishuai Zhang. Differentially private fine-tuning of language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Q42f0dfjECO

2022

[50] [50]

Differentially private SGD without clipping bias: An error-feedback approach

Xinwei Zhang, Zhiqi Bu, Steven Wu, and Mingyi Hong. Differentially private SGD without clipping bias: An error-feedback approach. In International Conference on Learning Representations, 2024

2024