arxiv: 2605.24879 · v1 · pith:ZLQBVEFRnew · submitted 2026-05-24 · 💻 cs.LG · math.OC

Efficient DP-SGD for LLMs with Randomized Clipping

Pith reviewed 2026-06-30 12:22 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords differential privacy · DP-SGD · randomized clipping · stochastic trace estimation · large language models · Hutchinson estimator · memory efficiency · gradient clipping

The pith

DP-SGD with randomized clipping via trace estimation reduces memory for private LLM training while matching utility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DP-SGD-RC, which applies randomized clipping: it uses stochastic trace estimation to approximate per-sample gradient norms during differentially private training. This targets the memory overhead of standard fast gradient clipping, which scales as $O(B \min\{T^2, d^2\})$ in batch size $B$, sequence length $T$, and model width $d$. The accompanying privacy analysis shows noise multipliers competitive with deterministic clipping. In the experiments, fine-tuning Llama 3.2-1B on long-context tasks matches baseline utility with lower memory and compute needs. This matters if it makes private training feasible for larger models and longer sequences.

Core claim

DP-SGD-RC is a variant of DP-SGD that uses Hutchinson's estimator and Hutch++ for randomized clipping, reducing the memory complexity of per-sample gradient norm estimation while providing tight privacy bounds and preserving model utility on downstream tasks.

What carries the argument

Stochastic trace estimation with Hutchinson's estimator and Hutch++ to approximate per-sample gradient norms for randomized clipping in DP-SGD.
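To make the mechanism concrete, here is a minimal NumPy sketch of Hutchinson's estimator applied to a per-sample gradient norm. It assumes a single linear layer whose per-sample gradient is `A.T @ G`, with `A` the `T x d_in` activations and `G` the `T x d_out` output gradients; the function name, the shapes, and the choice of Gaussian probes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def hutchinson_sq_norm(A, G, k, rng):
    """Estimate ||A.T @ G||_F^2 with k random probes.

    Uses tr(Q^T Q) = E[||Q z||^2] for probes z with E[z z^T] = I, and
    evaluates Q z as A.T @ (G @ z) so the d_in x d_out matrix Q is never formed.
    """
    d_out = G.shape[1]
    Z = rng.standard_normal((d_out, k))   # k Gaussian probe vectors (assumed choice)
    S = A.T @ (G @ Z)                     # d_in x k sketch of Q
    return float(np.sum(S * S) / k)       # average of ||Q z_j||^2 over probes

# Hypothetical shapes, for illustration only.
rng = np.random.default_rng(0)
A = rng.standard_normal((16, 8))    # T x d_in
G = rng.standard_normal((16, 12))   # T x d_out
print(hutchinson_sq_norm(A, G, k=32, rng=rng), np.linalg.norm(A.T @ G) ** 2)
```

The intermediate arrays are `T x k` and `d_in x k`, so for a small projection dimension `k` the working memory does not grow with `d_in * d_out` or `T * T`.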

If this is right

  • DP-SGD-RC achieves noise multipliers competitive with deterministic clipping.
  • Experiments show it matches baseline utility on classification, question answering, and summarization tasks.
  • It significantly reduces memory and compute requirements compared to standard DP-SGD implementations.
  • The approach supports fine-tuning on long-context benchmarks without proportional resource increase.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar randomized estimation techniques might apply to other DP mechanisms that require per-sample statistics.
  • The memory savings could permit larger batch sizes or longer contexts in private training setups.
  • If the estimation variance is controlled, it may extend to full pre-training of LLMs rather than just fine-tuning.

Load-bearing premise

The stochastic trace estimation produces sufficiently accurate per-sample gradient norm estimates that the subsequent clipping and noise addition preserve both the stated privacy bound and downstream model utility.

What would settle it

A direct comparison experiment showing that DP-SGD-RC either violates the claimed privacy guarantees or yields measurably lower task performance than deterministic clipping on the same Llama model and datasets.

Figures

Figures reproduced from arXiv: 2605.24879 by Devansh Gupta, Enayat Ullah, Huanyu Zhang, Meisam Razaviyayn, Sai Aparna Aketi.

Figure 1
Figure 1: Envelope functions for Hutch and Hutch++ for k = 32 and d = 2 (left) and d = 2048 (right). On the left, we see a separation between Hutch and Hutch++, but on the right (the practical setting), they essentially overlap.
5.1 Privacy Analysis. The following is the main result, which gives a description of the trade-off curve for the single-step of Algorithm 1 which use 2 as the norm estimator. Theorem 1. A…
Figure 2
Figure 2: Peak memory savings for full fine-tuning settings vs. projection dimension for different linear layers
Figure 3
Figure 3: Compute savings for full fine-tuning settings vs. projection dimension for different linear layers of
Figure 4
Figure 4: Latency savings for full fine-tuning as a function of projection dimension for different linear layers of
Figure 5
Figure 5: Behavior of the optimal $\{\lambda_i\}_i$ for d = 3 as a function of x, where $s(x) = \sup_{\lambda \in [0,1]} P(\lambda X_1 + (1 - \lambda) X_2 \le x)$.
The authors invalidly assume that the set of candidate quadratic forms for the Supremum is the same as the set derived for the Infimum in Theorem 1. In Theorem 1 (Infimum): The authors used a perturbation argument (Page 193, Equation 4) to show that if a cluster of eigenvalues has multiplicity k > 1, …
Figure 6
Figure 6: Simulations showing $x_-$ is essentially 1.
  • Left Panel: Derivative at t = 0 vs ϵ. Shows $\left.\frac{\partial F}{\partial t}\right|_{x=1+\epsilon,\,t=0}$, the rate of change of the CDF as we move away from the vertex. All lines have slope 1 on the log-log plot, meaning the derivative ∝ ϵ. This means that the CDF immediately starts increasing when you move away from the vertex, and this effect gets stronger as dk increases.
  • Middle Panel: Maximum Excess vs ϵ. Shows max_t[F(…
Figure 7
Figure 7: Simulations on $x_+$ across settings of d and k together with the 1 + 2 dk fit
Figure 8
Figure 8: Simulations of the envelope CDF for k = 2, d = 2
[Plot text: panel title "EXACT CDFs: k=32, d=2"; horizontal axis x from 0.0 to 4.0; vertical axis CDF]
Figure 9
Figure 9: Simulations of the envelope CDF for k = 32, d = 2
B.7 Proof of Main Theorem 1. Let $D$ and $D'$ be neighboring datasets with differing element $(A_0, G_0)$ and let $Q_0 := A^\top G$. We start with Lemma 7, which shows that for any norm estimation routine $x \mapsto R(x)$, the trade-off function is bounded as $T(\mathcal{A}(D) \,\|\, \mathcal{A}(D')) \succeq T\big((Z, \mathcal{N}(Z, 1)) \,\|\, (Z, \mathcal{N}(0, 1))\big)$, where $Z = \frac{\|Q_0\|}{R(Q_0)}$. We now instantiate R with Hutch and Hutch++ and specia…
Figure 10
Figure 10: A high-level structure proof of Proposition
Figure 11
Figure 11: (a) Relative error in norm estimation with Hutch and Hutch
Figure 12
Figure 12: Noise multiplier (σ) as a function of the hidden (non-projected) dimension (d) of the linear layer for a fixed projection dimension (k) of 32 for Hutch, Hutch++ and DP-SGD (Baseline). We used the BBC dataset experimental setup for both plots.
D Hyper-parameters. Hyperparameter tuning was performed consistently across all experiments. We used a batch size of 64 for the BBC experiments (for both full finetuning …
Figure 13
Figure 13: Peak memory savings for full fine-tuning settings without considering inputs (activations and

Large language models (LLMs) are trained on vast datasets that may contain sensitive information. Differential privacy (DP), the de facto standard for formal privacy guarantees, provides a principled framework for training LLMs with provable privacy protection. However, state-of-the-art DP training implementations rely on fast gradient clipping techniques with memory overhead $O(B \min\{T^2, d^2\})$, where $B$ is the batch size, $T$ is the sequence length, and $d$ is the model width. This becomes prohibitive as both model size and context length grow. We propose DP-SGD-RC, a novel variant of DP-SGD with randomized clipping that reduces memory and compute complexity. DP-SGD-RC leverages stochastic trace estimation methods, specifically Hutchinson's estimator [Hutchinson, 1989] and its improved variant, Hutch++ [Meyer et al., 2021], to reduce the memory footprint of per-sample gradient norm estimation. We provide a tight privacy analysis showing that DP-SGD-RC achieves noise multipliers competitive with deterministic clipping. Experiments fine-tuning Llama 3.2-1B on long-context benchmarks spanning classification, question answering, and summarization tasks demonstrate that DP-SGD-RC matches baseline utility while significantly reducing memory and compute requirements.
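The $O(B \min\{T^2, d^2\})$ figure can be illustrated with two exact ways of computing one per-sample gradient norm for a linear layer. The sketch below assumes the "fast gradient clipping techniques" in question include the well-known ghost-clipping identity; the abstract does not name the exact variants, and the function names are hypothetical.

```python
import numpy as np

def sq_norm_direct(A, G):
    """Materialize the d_in x d_out per-sample gradient: about d^2 memory per sample."""
    Q = A.T @ G
    return float(np.sum(Q * Q))

def sq_norm_ghost(A, G):
    """Ghost-clipping identity: ||A.T @ G||_F^2 = sum((A A^T) * (G G^T)).

    Materializes two T x T Gram matrices instead: about T^2 memory per sample.
    """
    return float(np.sum((A @ A.T) * (G @ G.T)))

# Hypothetical shapes: T = 16 tokens, d_in = 8, d_out = 12.
rng = np.random.default_rng(0)
A = rng.standard_normal((16, 8))
G = rng.standard_normal((16, 12))
print(sq_norm_direct(A, G), sq_norm_ghost(A, G))
```

An implementation can pick whichever of the two is cheaper per layer, which is where the `min` comes from; both costs are multiplied by the batch size $B$.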

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces DP-SGD-RC, a DP-SGD variant that replaces exact per-sample gradient clipping with randomized clipping based on stochastic trace estimation (Hutchinson's estimator and Hutch++) of ||g_i||_2^2. It claims this yields memory/compute savings over standard O(B min{T^2,d^2}) clipping while providing a tight privacy analysis whose noise multipliers remain competitive with deterministic clipping, and that fine-tuning Llama 3.2-1B on long-context tasks preserves utility.

Significance. If the privacy analysis correctly accounts for norm-estimation error, the method would materially improve the practicality of DP training for billion-parameter LLMs by lowering the memory barrier that currently limits batch size and context length.

major comments (2)
  1. [Privacy Analysis] The central privacy claim (that noise multipliers remain competitive) rests on the assumption that the stochastic estimator produces per-sample norms sufficiently accurate that the effective L2 sensitivity stays bounded by the clipping threshold. The abstract asserts a 'tight privacy analysis' but provides no high-probability tail bound on the underestimation probability of Hutchinson/Hutch++ for high-rank LLM gradients (d~10^9). Without an explicit adjustment to the sensitivity or a failure-probability term in the privacy budget, the competitive multiplier does not follow.
  2. [Experiments] § on experimental setup: the utility-matching claim for Llama 3.2-1B is reported without error bars on the number of Hutchinson samples used per gradient or on the observed norm-estimation error distribution; this leaves open whether the reported utility holds only for estimator configurations that already violate the sensitivity assumption used in the proof.
minor comments (1)
  1. [Abstract] Notation for the randomized clipping threshold and the subsequent noise scale should be defined once and used consistently; the abstract introduces 'randomized clipping' without an equation relating the estimator output to the final clipped gradient.
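One way to write down the relation the minor comment asks for is sketched below. It assumes the clipping factor is `min(1, C / r_i)` with `r_i` the estimated norm, followed by Gaussian noise of standard deviation `sigma * C`; the paper's Algorithm 1 is not reproduced on this page, so this form, the function name, and the use of materialized per-sample gradients are all assumptions made for readability.

```python
import numpy as np

def randomized_clip_step(grads, est_norms, clip_c, sigma, rng):
    """Clip each row of grads using an *estimated* norm, then add Gaussian noise.

    grads: (B, p) per-sample gradients (materialized here only for clarity).
    est_norms: (B,) norm estimates, e.g. from a stochastic trace estimator.
    Returns the noisy gradient sum and the realized norms after clipping.
    """
    scale = np.minimum(1.0, clip_c / np.maximum(est_norms, 1e-12))
    clipped = grads * scale[:, None]
    noise = rng.normal(0.0, sigma * clip_c, size=grads.shape[1])
    return clipped.sum(axis=0) + noise, np.linalg.norm(clipped, axis=1)

rng = np.random.default_rng(0)
grads = np.full((2, 4), 3.0)                 # each row has true norm 6
exact = np.linalg.norm(grads, axis=1)
_, norms_exact = randomized_clip_step(grads, exact, 1.0, 1.0, rng)
_, norms_under = randomized_clip_step(grads, exact / 2, 1.0, 1.0, rng)
print(norms_exact, norms_under)
```

With exact norms the realized norm never exceeds `clip_c`; when the estimate is too small the realized norm can exceed it, which is the case the first major comment is concerned with.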

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our work. We address each major comment below, indicating the revisions we will make to strengthen the manuscript.

  1. Referee: [Privacy Analysis] The central privacy claim (that noise multipliers remain competitive) rests on the assumption that the stochastic estimator produces per-sample norms sufficiently accurate that the effective L2 sensitivity stays bounded by the clipping threshold. The abstract asserts a 'tight privacy analysis' but provides no high-probability tail bound on the underestimation probability of Hutchinson/Hutch++ for high-rank LLM gradients (d~10^9). Without an explicit adjustment to the sensitivity or a failure-probability term in the privacy budget, the competitive multiplier does not follow.

    Authors: We agree that the presentation of the privacy analysis can be strengthened by making the dependence on estimator concentration explicit. The manuscript invokes the known high-probability error bounds for Hutch++ (Meyer et al., 2021), but does not yet fold the failure probability into the overall (ε,δ) budget or discuss the scaling of these bounds for d ≈ 10^9. We will revise the privacy section to (i) state the tail bounds used, (ii) allocate a small failure probability η to the estimator and compose it with the DP guarantee via a standard union bound, and (iii) recompute the effective noise multiplier under this adjusted sensitivity. This will make the claim of competitive multipliers fully rigorous. revision: yes

  2. Referee: [Experiments] § on experimental setup: the utility-matching claim for Llama 3.2-1B is reported without error bars on the number of Hutchinson samples used per gradient or on the observed norm-estimation error distribution; this leaves open whether the reported utility holds only for estimator configurations that already violate the sensitivity assumption used in the proof.

    Authors: We acknowledge the value of reporting variability. The current experiments fix the number of Hutchinson samples but do not display error bars across runs or the empirical distribution of relative norm error. In the revised version we will (i) report the mean and standard deviation of downstream metrics over at least three independent runs, (ii) include histograms or quantiles of the observed ||ĝ_i|| / ||g_i|| ratio for the chosen sample count, and (iii) verify that the chosen configuration keeps the underestimation probability below the η used in the updated privacy analysis. revision: yes
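The diagnostics promised in the second response could be gathered along the following lines. This is a sketch under assumptions: a Gaussian-probe Hutchinson estimator on a single linear layer, with hypothetical function names and a `slack` parameter that is not taken from the paper.

```python
import numpy as np

def norm_ratio_samples(A, G, k, trials, rng):
    """Draw `trials` independent estimates and return est_norm / true_norm."""
    true_norm = np.linalg.norm(A.T @ G)
    ratios = np.empty(trials)
    for t in range(trials):
        Z = rng.standard_normal((G.shape[1], k))          # fresh probes per trial
        est_norm = np.linalg.norm(A.T @ (G @ Z)) / np.sqrt(k)
        ratios[t] = est_norm / true_norm
    return ratios

def underestimation_rate(ratios, slack):
    """Fraction of trials whose estimate falls below (1 - slack) * true norm."""
    return float(np.mean(ratios < 1.0 - slack))

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
G = rng.standard_normal((8, 10))
ratios = norm_ratio_samples(A, G, k=32, trials=200, rng=rng)
print(np.quantile(ratios, [0.05, 0.5, 0.95]), underestimation_rate(ratios, 0.2))
```

The empirical rate is only a sanity check against a chosen η; it does not replace the tail bound that the privacy analysis would need.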

Circularity Check

0 steps flagged

No circularity; privacy analysis and estimator citations are independent of target claims

The provided abstract and reader summary contain no equations, fitted parameters, or self-citations that reduce the claimed privacy multipliers, noise analysis, or utility results to a definition or input by construction. Hutchinson (1989) and Hutch++ (Meyer et al. 2021) are external citations; the privacy analysis is presented as a separate contribution without evidence of self-definitional loops or renaming. The derivation chain therefore remains self-contained against external benchmarks and does not trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review is based solely on the abstract; no free parameters, ad-hoc axioms, or invented entities are visible beyond reliance on two cited prior estimators.

axioms (2)
  • standard math Hutchinson's estimator yields an unbiased estimate of the trace of a matrix
    Cited directly from Hutchinson 1989 in the abstract
  • standard math Hutch++ reduces variance of the trace estimate relative to plain Hutchinson
    Cited from Meyer et al. 2021 in the abstract
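For reference, the Hutch++ estimator named in the second axiom can be sketched as follows, following the published algorithm of Meyer et al. (2021) for a symmetric positive semi-definite matrix accessed through matrix products. Applying it to $M = Q^\top Q$ with $Q = A^\top G$ is an assumption about how the paper uses it, and all names here are illustrative.

```python
import numpy as np

def hutchpp_trace(matmul, d, m, rng):
    """Hutch++ estimate of tr(M) for a PSD d x d matrix given X -> M @ X.

    Spends roughly m matrix-vector products: one third to find a low-rank
    range, one third to evaluate the trace on it exactly, and one third for
    Hutchinson on the remainder.
    """
    k = m // 3
    S = rng.choice([-1.0, 1.0], size=(d, k))       # sketching probes
    W = rng.choice([-1.0, 1.0], size=(d, k))       # Hutchinson probes
    Qb, _ = np.linalg.qr(matmul(S))                # orthonormal basis for range(M S)
    low_rank = np.trace(Qb.T @ matmul(Qb))         # exact trace on the captured subspace
    W_perp = W - Qb @ (Qb.T @ W)                   # project probes off that subspace
    residual = np.trace(W_perp.T @ matmul(W_perp)) / k
    return float(low_rank + residual)

# M = (A^T G)^T (A^T G), applied implicitly without forming A^T G.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 20))
G = rng.standard_normal((4, 30))
apply_M = lambda X: G.T @ (A @ (A.T @ (G @ X)))
print(hutchpp_trace(apply_M, d=30, m=24, rng=rng), np.linalg.norm(A.T @ G) ** 2)
```

When the rank of M is at most `m // 3`, as in this toy example, the low-rank part captures the whole trace and the estimate is exact up to rounding; the variance reduction over plain Hutchinson comes from that captured part.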


Reference graph

Works this paper leans on

50 extracted references · 13 canonical work pages · 3 internal anchors

  1. [1]

    Deep learning with differential privacy

    Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, pages 308--318, 2016

  2. [2]

    The us census bureau adopts differential privacy

    John M Abowd. The us census bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2867--2867, 2018

  3. [3]

    Scaling private deep learning with opacus: Advances for large language models

    Sai Aparna Aketi, Will Bullock, Iden Kalemaj, Enayat Ullah, and Huanyu Zhang. Scaling private deep learning with opacus: Advances for large language models. In Championing Open-source DEvelopment in ML Workshop@ ICML25, 2025

  4. [4]

    Differentially private learning with adaptive clipping

    Galen Andrew, Om Thakkar, Brendan McMahan, and Swaroop Ramaswamy. Differentially private learning with adaptive clipping. Advances in Neural Information Processing Systems, 34:17455--17466, 2021

  5. [5]

    Learning with privacy at scale

    Apple Differential Privacy Team. Learning with privacy at scale. Apple Machine Learning Journal, 1(8), December 2017. URL https://machinelearning.apple.com/research/learning-with-privacy-at-scale

  6. [6]

    Faster rates of convergence to stationary points in differentially private optimization

    Raman Arora, Raef Bassily, Tomás González, Cristóbal A Guzmán, Michael Menart, and Enayat Ullah. Faster rates of convergence to stationary points in differentially private optimization. In International Conference on Machine Learning, pages 1060--1092. PMLR, 2023

  7. [7]

    Private stochastic convex optimization: Optimal rates in $\ell_1$ geometry

    Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in $\ell_1$ geometry. In International Conference on Machine Learning, pages 393--403. PMLR, 2021

  8. [8]

    Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix

    Haim Avron and Sivan Toledo. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. Journal of the ACM (JACM), 58(2):1--34, 2011

  9. [9]

    Private empirical risk minimization: Efficient algorithms and tight error bounds

    Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In IEEE Annual Symposium on Foundations of Computer Science, pages 464--473. IEEE, 2014

  10. [10]

    Bayesian theory, volume 586

    José M Bernardo, Adrian FM Smith, and Mark Berliner. Bayesian theory, volume 586. Wiley Online Library, 1994

  11. [11]

    Tres observaciones sobre el algebra lineal

    Garrett Birkhoff. Tres observaciones sobre el algebra lineal. Univ. Nac. Tucuman, Ser. A, 5:147--154, 1946

  12. [12]

    Fast and memory efficient differentially private-sgd via jl projections

    Zhiqi Bu, Sivakanth Gopi, Janardhan Kulkarni, Yin Tat Lee, Hanwen Shen, and Uthaipon Tantipongpipat. Fast and memory efficient differentially private-sgd via jl projections. Advances in Neural Information Processing Systems, 34:19680--19691, 2021

  13. [13]

    Differentially private optimization on large model at small cost

    Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. Differentially private optimization on large model at small cost. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023

  14. [14]

    Concentrated differential privacy: Simplifications, extensions, and lower bounds

    Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of cryptography conference, pages 635--658. Springer, 2016

  15. [15]

    Composable and versatile privacy via truncated cdp

    Mark Bun, Cynthia Dwork, Guy N Rothblum, and Thomas Steinke. Composable and versatile privacy via truncated cdp. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 74--86, 2018

  16. [16]

    Differentially private empirical risk minimization

    Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(3), 2011

  17. [17]

    Multi-epoch matrix factorization mechanisms for private machine learning

    Christopher A Choquette-Choo, H Brendan McMahan, Keith Rush, and Abhradeep Thakurta. Multi-epoch matrix factorization mechanisms for private machine learning. arXiv preprint arXiv:2211.06530, 2022

  18. [18]

    An elementary proof of a theorem of johnson and lindenstrauss

    Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of johnson and lindenstrauss. Random Structures & Algorithms, 22(1):60--65, 2003

  19. [19]

    Gaussian differential privacy

    Jinshuo Dong, Aaron Roth, and Weijie J Su. Gaussian differential privacy. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):3--37, 2022

  20. [20]

    Calibrating noise to sensitivity in private data analysis

    Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265--284. Springer, 2006

  21. [21]

    The algorithmic foundations of differential privacy

    Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3--4):211--407, 2014

  22. [22]

    Rappor: Randomized aggregatable privacy-preserving ordinal response

    Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, pages 1054--1067, 2014

  23. [23]

    Private stochastic convex optimization: optimal rates in linear time

    Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, page 439–449, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450369794. doi:10.1145/3357713.3384335. URL https://doi.org/10.11...

  24. [24]

    Efficient Per-Example Gradient Computations

    Ian Goodfellow. Efficient per-example gradient computations, 2015. URL https://arxiv.org/abs/1510.01799

  25. [25]

    dp-accounting: Tools for tracking differential privacy budgets

    Google Differential Privacy Team. dp-accounting: Tools for tracking differential privacy budgets. https://github.com, 2020

  26. [26]

    Numerical composition of differential privacy

    Sivakanth Gopi, Yin Tat Lee, and Lukas Wutschitz. Numerical composition of differential privacy. Advances in Neural Information Processing Systems, 34:11631--11642, 2021

  27. [27]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, and ... The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

  28. [28]

    Practical solutions to the problem of diagonal dominance in kernel document clustering

    Derek Greene and Pádraig Cunningham. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, page 377–384, New York, NY, USA, 2006. Association for Computing Machinery. ISBN 1595933832. doi:10.1145/1143844.1143892. URL https://doi.org/10....

  29. [29]

    A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines

    Michael F Hutchinson. A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059--1076, 1989

  30. [30]

    Extensions of lipschitz mappings into a hilbert space

    William B Johnson, Joram Lindenstrauss, et al. Extensions of lipschitz mappings into a hilbert space. Contemporary mathematics, 26(189-206):1, 1984

  31. [31]

    Cs 860 lecture 5: Approximate differential privacy

    Gautam Kamath. Cs 860 lecture 5: Approximate differential privacy. Course notes for CS 860: Algorithms for Private Data Analysis, 2020. URL http://www.gautamkamath.com/CS860notes/lec5.pdf

  32. [32]

    Private convex empirical risk minimization and high-dimensional regression

    Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Shie Mannor, Nathan Srebro, and Robert C. Williamson, editors, Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, pages 25.1--25.40, Edinburgh, Scotland, 25--27 J...

  33. [33]

    BillSum: A corpus for automatic summarization of US legislation

    Anastassia Kornilova and Vladimir Eidelman. BillSum: A corpus for automatic summarization of US legislation. In Lu Wang, Jackie Chi Kit Cheung, Giuseppe Carenini, and Fei Liu, editors, Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 48--56, Hong Kong, China, November 2019. Association for Computational Linguistics. doi:10.18653...

  34. [34]

    Computing tight differential privacy guarantees using fft

    Antti Koskela, Joonas Jälkö, and Antti Honkela. Computing tight differential privacy guarantees using fft. In International Conference on Artificial Intelligence and Statistics, pages 2560--2569. PMLR, 2020

  35. [35]

    Scaling up differentially private deep learning with fast per-example gradient clipping

    Jaewoo Lee and Daniel Kifer. Scaling up differentially private deep learning with fast per-example gradient clipping. Proceedings on Privacy Enhancing Technologies, 2021

  36. [36]

    Large language models can be strong differentially private learners

    Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. Large language models can be strong differentially private learners. arXiv preprint arXiv:2110.05679, 2021

  37. [37]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  38. [38]

    Inequalities: theory of majorization and its applications

    Albert W Marshall, Ingram Olkin, and Barry C Arnold. Inequalities: theory of majorization and its applications. 1979

  39. [39]

    Differentially private non-convex optimization under the kl condition with optimal rates

    Michael Menart, Enayat Ullah, Raman Arora, Raef Bassily, and Cristóbal Guzmán. Differentially private non-convex optimization under the kl condition with optimal rates. In International Conference on Algorithmic Learning Theory, pages 868--906. PMLR, 2024

  40. [40]

    Hutchinson's estimator is bad at kronecker-trace-estimation

    Raphael A Meyer and Haim Avron. Hutchinson's estimator is bad at kronecker-trace-estimation. arXiv preprint arXiv:2309.04952, 2023

  41. [41]

    Hutch++: Optimal stochastic trace estimation

    Raphael A Meyer, Cameron Musco, Christopher Musco, and David P Woodruff. Hutch++: Optimal stochastic trace estimation. In Symposium on Simplicity in Algorithms (SOSA), pages 142--155. SIAM, 2021

  42. [42]

    Rényi differential privacy

    Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th computer security foundations symposium (CSF), pages 263--275. IEEE, 2017

  43. [43]

    Stochastic orders

    Moshe Shaked and J George Shanthikumar. Stochastic orders. Springer, 2007

  44. [44]

    The Existence of Probability Measures with Given Marginals

    V. Strassen. The Existence of Probability Measures with Given Marginals. The Annals of Mathematical Statistics, 36(2):423--439, 1965. doi:10.1214/aoms/1177700153. URL https://doi.org/10.1214/aoms/1177700153

  45. [45]

    Extremal probabilities for gaussian quadratic forms

    Gábor J Székely and Nail K Bakirov. Extremal probabilities for gaussian quadratic forms. Probability theory and related fields, 126(2):184--202, 2003

  46. [46]

    Private fine-tuning of large language models with zeroth-order optimization

    Xinyu Tang, Ashwinee Panda, Milad Nasr, Saeed Mahloujifar, and Prateek Mittal. Private fine-tuning of large language models with zeroth-order optimization. arXiv preprint arXiv:2401.04343, 2024

  47. [47]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Pro...

  48. [48]

    Opacus: User-friendly differential privacy library in pytorch

    Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, et al. Opacus: User-friendly differential privacy library in pytorch. arXiv preprint arXiv:2109.12298, 2021

  49. [49]

    Differentially private fine-tuning of language models

    Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, and Huishuai Zhang. Differentially private fine-tuning of language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Q42f0dfjECO

  50. [50]

    Differentially private SGD without clipping bias: An error-feedback approach

    Xinwei Zhang, Zhiqi Bu, Steven Wu, and Mingyi Hong. Differentially private SGD without clipping bias: An error-feedback approach. In International Conference on Learning Representations, 2024