Memory-Efficient Differentially Private Training with Gradient Random Projection

Alex Mulrooney; Devansh Gupta; Huanyu Zhang; James Flemings; Meisam Razaviyayn; Murali Annavaram; Xinwei Zhang

arxiv: 2506.15588 · v2 · pith:H4STVMCDnew · submitted 2025-06-18 · 💻 cs.LG

Pith reviewed 2026-05-21 23:42 UTC · model grok-4.3

classification 💻 cs.LG

keywords differential privacy · gradient projection · memory efficiency · neural network training · DP-SGD · random projection · large language models · vision transformers

The pith

DP-GRAPE projects gradients with random Gaussian matrices to cut memory in differentially private training while matching DP-SGD utility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that random low-dimensional projections can replace SVD-based subspace methods for differentially private gradient updates, because the noise added for privacy already flattens the gradient singular value spectrum enough to make careful subspace selection unnecessary. If this holds, practitioners could train large private models on hardware that currently cannot fit per-sample clipped gradients or full SVD steps. The method first projects gradients using random Gaussian matrices, then adds privacy noise in the reduced space, and folds the projection into the backward pass to avoid storing full gradients. Experiments report memory cuts above 63 percent on Vision Transformer pre-training and above 70 percent on RoBERTa-Large fine-tuning relative to DP-Adam, plus the ability to fine-tune OPT models with up to 6.7 billion parameters, where standard DP-Adam runs out of memory.

Core claim

DP-GRAPE performs differentially private training by replacing SVD subspace selection with random Gaussian projection matrices, privatizing the projected gradients, and applying the projection during backpropagation. The central finding is that adding differential privacy noise flattens the singular value spectrum of gradients, so random projections suffice. Theoretical analysis shows the resulting privacy-utility tradeoff stays comparable to DP-SGD despite the lower-dimensional operation. Empirically the approach yields large memory reductions on vision and language models while preserving accuracy and training speed.

What carries the argument

Random Gaussian projection matrices applied to gradients before privatization, which reduce dimensionality during backpropagation and eliminate SVD costs.
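
As a rough illustration of the mechanism described above, the sketch below projects the per-sample gradients of one layer with a random Gaussian matrix, then clips and adds noise in the projected space. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `private_projected_gradient`, the 1/sqrt(r) scaling of the projection, the array shapes, and the final step of mapping the noisy average back to full size with the same matrix are all illustrative choices. The paper's fusion of the projection into the backward pass is not modeled here.

```python
import numpy as np

def private_projected_gradient(per_sample_grads, r, clip_norm, noise_multiplier, rng):
    """Project, clip, and noise per-sample gradients for one layer.

    per_sample_grads: array of shape (B, m, n), one (m, n) gradient per example.
    Returns the noisy averaged gradient mapped back to shape (m, n).
    """
    B, m, n = per_sample_grads.shape
    # Random Gaussian projection (assumed scaling: entries ~ N(0, 1/r)).
    P = rng.standard_normal((m, r)) / np.sqrt(r)
    # Project each per-sample gradient to shape (r, n), i.e. P^T G_i.
    projected = np.einsum("mr,bmn->brn", P, per_sample_grads)
    # Clip each projected gradient to Frobenius norm at most clip_norm.
    norms = np.linalg.norm(projected.reshape(B, -1), axis=1)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = projected * scale[:, None, None]
    # Gaussian mechanism applied in the projected space.
    noise = rng.standard_normal((r, n)) * noise_multiplier * clip_norm
    noisy_mean = (clipped.sum(axis=0) + noise) / B
    # Map back to the full parameter shape (an assumption of this sketch).
    return P @ noisy_mean

rng = np.random.default_rng(0)
grads = rng.standard_normal((8, 64, 32))  # synthetic per-sample gradients
update = private_projected_gradient(
    grads, r=16, clip_norm=1.0, noise_multiplier=1.0, rng=rng
)
print(update.shape)  # (64, 32)
```

In this sketch the per-example tensors that must be held at once have shape (B, r, n) rather than (B, m, n), which is where the memory saving would come from when r is much smaller than m.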

If this is right

Memory footprint drops by more than 63 percent during Vision Transformer pre-training relative to DP-Adam.
Memory footprint drops by more than 70 percent during RoBERTa-Large fine-tuning relative to DP-Adam.
Fine-tuning becomes feasible for OPT-scale models up to 6.7 billion parameters that exceed DP-Adam memory limits.
Privacy-utility curves remain comparable to full-dimensional DP-SGD even though training occurs in lower-dimensional subspaces.
Training needs no SVD computation, and reported accuracy stays on par with standard first-order DP methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same random-projection idea might combine with other memory tricks such as gradient checkpointing to push private training to still larger scales.
If the spectrum-flattening effect is general, similar low-rank random methods could be tested on non-Adam optimizers used in private settings.
Further reduction of the projection dimension below the levels tested here could be explored to see the exact point where utility begins to degrade.
The approach might generalize to federated or distributed DP training where communication of full gradients is also costly.

Load-bearing premise

Privatization flattens the gradient singular value spectrum enough that random Gaussian matrices work as well as SVD-selected subspaces.
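
One way to see what "flattening" means is a toy experiment: take a low-rank matrix as a stand-in for a gradient, add isotropic Gaussian noise, and compare how much of the spectral energy the largest singular values hold before and after. The sketch below uses synthetic data and hypothetical names and values (`top_k_energy_fraction`, the matrix sizes, the noise scale `sigma`). It illustrates the qualitative effect only and does not reproduce the paper's measurements.

```python
import numpy as np

def top_k_energy_fraction(matrix, k):
    """Fraction of the squared singular values held by the k largest ones."""
    s = np.linalg.svd(matrix, compute_uv=False)  # sorted in descending order
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

rng = np.random.default_rng(0)
m, n, true_rank = 128, 128, 4

# Synthetic exactly low-rank "gradient" (real gradients are at best approximately low-rank).
clean = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))
clean /= np.linalg.norm(clean)  # unit Frobenius norm, mimicking clipping to C = 1

# Isotropic Gaussian noise; sigma is an illustrative value, not one from the paper.
sigma = 0.05
noisy = clean + sigma * rng.standard_normal((m, n))

frac_clean = top_k_energy_fraction(clean, true_rank)
frac_noisy = top_k_energy_fraction(noisy, true_rank)
print(f"top-{true_rank} energy, clean: {frac_clean:.3f}, noisy: {frac_noisy:.3f}")
```

In the clean matrix the top few singular values carry essentially all of the energy; once noise is added, the energy is spread across many directions, so no small subspace stands out as strongly.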

What would settle it

A controlled comparison where DP-GRAPE at the same projected dimension shows clearly lower final accuracy or higher privacy loss than an SVD-based baseline on the same task and model size.

Figures

Figures reproduced from arXiv: 2506.15588 by Alex Mulrooney, Devansh Gupta, Huanyu Zhang, James Flemings, Meisam Razaviyayn, Murali Annavaram, Xinwei Zhang.

**Figure 1.** Left: singular values s_i of layer gradient matrices with different clipping parameters C and noise levels σ, averaged across all layers of ViT-Base during training on CIFAR-10. Right: singular values of gradient matrices for OPT-1.3B during fine-tuning on SST-2. C = ∞ indicates no clipping. See Appendix C.1 for details.

**Figure 2.** Vision Transformer pre-training results for MNIST, CIFAR-10, and CIFAR-100 at different ε privacy levels, and memory usage for different methods during training with varying batch size, with non-private Adam for comparison. See Appendix C.1 for detailed results in table form and experiment setup.

**Figure 3.** Maximum memory usage for fine-tuning RoBERTa-Large on SST-2 and OPT models on SQuAD using Adam, DP-Adam, DPZero, and DP-GRAPE with varying batch size. See Appendix C.2 and Appendix C.3 for details.

**Figure 4.** Maximum memory usage for fine-tuning RoBERTa-Large with DP-GRAPE using different subspace dimensions r, with comparisons to Adam, DP-Adam, and DPZero.

**Figure 5.** Convergence (as measured by development set accuracy) when fine-tuning RoBERTa-Large on SST-2 for DP-GRAPE and DPZero, with runs shown for three different random seeds used to generate the few-shot datasets.

Original abstract

Differential privacy (DP) protects sensitive data during neural network training, but standard methods like DP-Adam suffer from high memory overhead due to per-sample gradient clipping, limiting scalability. We introduce DP-GRAPE (Gradient RAndom ProjEction), a DP training method that significantly reduces memory usage while maintaining utility on par with first-order DP approaches. DP-GRAPE is motivated by our finding that privatization flattens the gradient singular value spectrum, making SVD-based projections (as in GaLore (Zhao et al., 2024)) unnecessary. Consequently, DP-GRAPE employs three key components: (1) random Gaussian matrices replace SVD-based subspaces, (2) gradients are privatized after projection, and (3) projection is applied during backpropagation. These contributions eliminate the need for costly SVD computations, enable substantial memory savings, and lead to improved utility. Despite operating in lower-dimensional subspaces, our theoretical analysis shows that DP-GRAPE achieves a privacy-utility tradeoff comparable to DP-SGD. Our extensive empirical experiments show that DP-GRAPE can significantly reduce the memory footprint of DP training without sacrificing accuracy or training time. In particular, DP-GRAPE reduces memory usage by over 63% when pre-training Vision Transformers and over 70% when fine-tuning RoBERTa-Large as compared to DP-Adam, while achieving similar performance. We further demonstrate that DP-GRAPE scales to fine-tuning large models such as OPT with up to 6.7 billion parameters, a scale at which DP-Adam fails due to memory constraints. Our code is available at https://github.com/alexmul1114/DP_GRAPE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes DP-GRAPE, a differentially private training method that projects per-sample gradients to a lower-dimensional space using random Gaussian matrices, performs clipping and noise addition in the projected space, and integrates the projection into the backward pass. The approach is motivated by the empirical observation that DP noise addition flattens the singular-value spectrum of gradients, making SVD-based subspace selection (as in GaLore) unnecessary. The authors claim that, despite the dimension reduction, a theoretical analysis establishes a privacy-utility tradeoff comparable to standard DP-SGD, while experiments report memory reductions exceeding 63% for Vision Transformer pre-training and 70% for RoBERTa-Large fine-tuning, with successful scaling to OPT models of 6.7 billion parameters where DP-Adam runs out of memory.

Significance. If the privacy-utility equivalence and the sufficiency of random projections hold, the work would meaningfully expand the feasible scale of DP training for large models by addressing the memory overhead of per-sample operations. The demonstration on 6.7B-parameter models and the release of code constitute concrete strengths that support reproducibility and practical impact.

major comments (2)

[Motivation and method] Motivation section (and abstract): the central design choice of random Gaussian matrices over SVD-based projections rests on the observation that 'privatization flattens the gradient singular value spectrum.' No quantitative characterization is provided—e.g., effective rank, decay exponent, dependence on ε, layer depth, or training phase—nor is there a demonstration that the residual spectrum is sufficiently isotropic for random projections to preserve signal at the same rate as structured methods. This assumption is load-bearing for both the utility claim and the decision to forgo GaLore-style subspaces.
[Theoretical analysis] Theoretical analysis section: the claim that DP-GRAPE achieves a privacy-utility tradeoff comparable to DP-SGD 'despite operating in lower-dimensional subspaces' is stated without the explicit derivation or bound relating projection dimension, noise scale, and utility loss. The analysis must clarify how the reduced dimension affects the sensitivity calculation and the resulting (ε,δ) guarantee relative to full-dimensional DP-SGD.

minor comments (2)

[Experiments] The experimental section would benefit from explicit ablation tables showing utility as a function of projection dimension for fixed privacy budgets, to quantify the sensitivity of the reported accuracy to the free parameter.
[Figures] Figure captions and axis labels should explicitly state the privacy budget (ε) and model size for each memory-accuracy curve to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our contributions. We respond to each major comment below and indicate the revisions we will make.

Point-by-point responses

Referee: [Motivation and method] Motivation section (and abstract): the central design choice of random Gaussian matrices over SVD-based projections rests on the observation that 'privatization flattens the gradient singular value spectrum.' No quantitative characterization is provided—e.g., effective rank, decay exponent, dependence on ε, layer depth, or training phase—nor is there a demonstration that the residual spectrum is sufficiently isotropic for random projections to preserve signal at the same rate as structured methods. This assumption is load-bearing for both the utility claim and the decision to forgo GaLore-style subspaces.

Authors: We agree that the current manuscript presents the flattening observation primarily through qualitative description and selected visualizations. To strengthen the motivation, the revised version will add quantitative characterizations, including plots and tables of singular-value spectra, effective rank, and decay exponents computed across layers, training phases, and multiple values of ε. We will also include a comparison of isotropy measures (e.g., condition numbers of the residual spectrum) to show that random projections preserve signal at rates comparable to structured methods under the observed flattening.

revision: yes

Referee: [Theoretical analysis] Theoretical analysis section: the claim that DP-GRAPE achieves a privacy-utility tradeoff comparable to DP-SGD 'despite operating in lower-dimensional subspaces' is stated without the explicit derivation or bound relating projection dimension, noise scale, and utility loss. The analysis must clarify how the reduced dimension affects the sensitivity calculation and the resulting (ε,δ) guarantee relative to full-dimensional DP-SGD.

Authors: The manuscript's theoretical section invokes norm-preservation properties of random Gaussian projections to argue that sensitivity scales appropriately and that the privacy-utility tradeoff remains comparable. We acknowledge that an explicit step-by-step derivation relating projection dimension, adjusted noise scale, and utility bounds is not fully expanded. In the revision we will add this derivation: we first bound the projected gradient sensitivity using the Johnson-Lindenstrauss lemma, then show the corresponding noise multiplier needed to achieve the target (ε,δ), and finally relate the excess risk or convergence rate to the reduced dimension via standard DP-SGD analysis, demonstrating that the degradation is controlled by a logarithmic factor in the projection dimension.

revision: yes
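
The norm-preservation property invoked in this response can be checked numerically. The sketch below draws Gaussian projection matrices with entries of variance 1/r (an assumed scaling) and measures the ratio between the squared norm of a projected vector and that of the original. In expectation the ratio is 1, and its spread shrinks as r grows. The function name `squared_norm_ratios` and all sizes are hypothetical.

```python
import numpy as np

def squared_norm_ratios(d, r, trials, rng):
    """Return ||P^T x||^2 / ||x||^2 for `trials` independent Gaussian projections."""
    x = rng.standard_normal(d)  # one fixed vector, reused across trials
    ratios = np.empty(trials)
    for t in range(trials):
        P = rng.standard_normal((d, r)) / np.sqrt(r)  # entries ~ N(0, 1/r)
        ratios[t] = np.sum((P.T @ x) ** 2) / np.sum(x ** 2)
    return ratios

rng = np.random.default_rng(0)
for r in (8, 64, 256):
    ratios = squared_norm_ratios(d=512, r=r, trials=100, rng=rng)
    print(f"r={r:3d}  mean ratio={ratios.mean():.3f}  std={ratios.std():.3f}")
```

Because any single projection can inflate a norm, preservation holds only approximately; clipping after projection, as in the method summary above, is what gives a hard bound on the norm of each projected gradient.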

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard random projection and external DP mechanisms

full rationale

The paper motivates DP-GRAPE by an empirical observation that DP noise flattens gradient singular-value spectra, thereby justifying random Gaussian projections over SVD. This observation is presented as a supporting finding rather than a load-bearing derivation step that reduces to itself. The method itself applies standard random projection followed by DP noise addition after projection, with a claimed theoretical privacy-utility equivalence to DP-SGD that does not invoke self-citation chains or fitted inputs renamed as predictions. Empirical memory savings are demonstrated directly on models up to 6.7B parameters. No quoted equation or claim reduces by construction to the paper's own inputs; the approach remains self-contained against external benchmarks and code.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method depends on the domain assumption of spectrum flattening from privatization and treats projection dimension as a tunable hyperparameter without independent derivation.

free parameters (1)

projection dimension
Rank of the random projection matrix selected to balance memory reduction against utility; value is experiment-dependent rather than derived.
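
To make the memory side of this tradeoff concrete, the back-of-the-envelope sketch below counts the bytes needed to hold per-sample gradients for a single weight matrix with and without projection. The layer shape, batch size, and 4-byte values are illustrative assumptions and are not taken from the paper. The count also ignores activations, optimizer state, and the projection matrix itself.

```python
def per_sample_grad_bytes(batch_size, rows, cols, bytes_per_value=4):
    """Bytes needed to store one (rows x cols) gradient per example in the batch."""
    return batch_size * rows * cols * bytes_per_value

# Hypothetical layer: a 4096 x 4096 weight matrix, batch of 32, float32 values.
batch_size, m, n = 32, 4096, 4096
full = per_sample_grad_bytes(batch_size, m, n)

for r in (16, 64, 256):
    # After projecting the m rows down to r, each per-sample gradient is r x n.
    projected = per_sample_grad_bytes(batch_size, r, n)
    print(f"r={r:3d}: {projected / 2**20:8.1f} MiB vs {full / 2**20:8.1f} MiB "
          f"(ratio {projected / full:.4f})")
```

Under these assumptions the per-sample gradient storage scales with r/m, which is why the choice of projection dimension directly trades memory against whatever utility is lost at small r.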

axioms (1)

domain assumption: Privatization flattens the gradient singular value spectrum, making SVD unnecessary
Stated as the key motivation for replacing SVD with random Gaussian matrices.

pith-pipeline@v0.9.0 · 5857 in / 1278 out tokens · 74488 ms · 2026-05-21T23:42:41.205624+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Forget What's Sensitive, Remember What Matters: Token-Level Differential Privacy in Memory Sculpting for Continual Learning
cs.AI 2025-09 unverdicted novelty 6.0

PeCL applies token-level dynamic differential privacy and privacy-guided memory sculpting to achieve superior privacy-utility balance in continual learning.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016.

[2] R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, FOCS '14, pages 464–473, USA, 2014.

[3] IEEE Computer Society. ISBN 9781479965175. doi: 10.1109/FOCS.2014.56. URL https://doi.org/10.1109/FOCS.2014

[4] URL https://proceedings.neurips.cc/paper_files/paper/2019/file/3bd8fdb090f1f5eb66a00c84dbc5ad51-Paper.pdf. Z. Bu, S. Gopi, J. Kulkarni, Y. T. Lee, H. Shen, and U. Tantipongpipat. Fast and memory efficient differentially private-SGD via JL projections. Advances in Neural Information Processing Systems, 34:19680–19691,

[5] Z. Bu, X. Zhang, M. Hong, S. Zha, and G. Karypis. Pre-training differentially private models with limited public data. arXiv preprint arXiv:2402.18752, 2024.

[6] S. De, L. Berrada, J. Hayes, S. L. Smith, and B. Balle. Unlocking high-accuracy differentially private image classification through scale. arXiv preprint arXiv:2204.13650, 2022.

[7] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In Advances in Cryptology-EUROCRYPT 2006: 24th Annual International Conference on the Theory and Applications of Cryptographic Techniques, St. Petersburg, Russia, May 28-June 1, 2006.

[8] Proceedings 25, pages 486–503. Springer, 2006a. C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006.

[9] URL https://arxiv.org/abs/2407.21783. W. H. Greene. Econometric analysis. Prentice Hall,

[10] X. Gu, G. Kamath, and Z. S. Wu. Choosing public datasets for private machine learning via gradient subspace distance. arXiv preprint arXiv:2303.01256, 2023.

[11] Y. Hao, Y. Cao, and L. Mou. Flora: Low-rank adapters are secretly gradient compressors. arXiv preprint arXiv:2402.03293, 2024.

[12] Y. He, P. Li, Y. Hu, C. Chen, and K. Yuan. Subspace optimization for large language models with convergence guarantees. arXiv preprint arXiv:2410.11289, 2024.

[13] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

[14] X. Li, F. Tramer, P. Liang, and T. Hashimoto. Large language models can be strong differentially private learners. arXiv preprint arXiv:2110.05679, 2021.

[15] Y. Liu. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 364, 2019.

[16] F. Mireshghallah, A. Uniyal, T. Wang, D. Evans, and T. Berg-Kirkpatrick. Memorization in NLP fine-tuning methods. arXiv preprint arXiv:2205.12506, 2022.

[17] X. Tang, A. Panda, M. Nasr, S. Mahloujifar, and P. Mittal. Private fine-tuning of large language models with zeroth-order optimization. arXiv preprint arXiv:2401.04343, 2024.

[18] D. Yu, S. Naik, A. Backurs, S. Gopi, H. A. Inan, G. Kamath, J. Kulkarni, Y. T. Lee, A. Manoel, L. Wutschitz, et al. Differentially private fine-tuning of language models. arXiv preprint arXiv:2110.06500, 2021a. D. Yu, H. Zhang, W. Chen, and T.-Y. Liu. Do not let privacy overbill utility: Gradient embedding perturbation for private learning. arXiv p...

[19] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

[20] J. Zhao, Z. Zhang, B. Chen, Z. Wang, A. Anandkumar, and Y. Tian. GaLore: Memory-efficient LLM training by gradient low-rank projection. arXiv preprint arXiv:2403.03507, 2024.

[21] Y. Zhou, Z. S. Wu, and A. Banerjee. Bypassing the ambient dimension: Private SGD with gradient subspace identification. arXiv preprint arXiv:2007.03813, 2020.

[22] A DP-Adam. Here, we detail the standard DP-Adam algorithm (with flat clipping) using our notation.

Algorithm 3 DP-Adam
Require: Dataset $X = \{\xi_1, \ldots, \xi_n\}$, model parameters $\{W_\ell^0\}_{\ell=1}^{L}$, learning rate $\eta$, decay rates $\beta_1, \beta_2$, batch size $B$, total iterations $T$
1: for $t = 1, 2, \ldots, T$ do
2:   for $\ell = L, L-1, \ldots, 1$ do
3:     $\{G_{\ell,i}^t\}_{i=1}^{B} \leftarrow \nabla_{W_\ell^t} f(\{W$ ...

[23] C.2 RoBERTa Fine-Tuning. We follow the same experimental setup and build off of the same codebase as used by Zhang et al. (2024) and Malladi et al. (2023) to fine-tune RoBERTa-Large (Liu,

[24] privacy for each, and record the average final test accuracy over the 3 seeds for each dataset and privacy level. We train with a batch size of 64 for all experiments, which may be achieved using gradient accumulation. The results for AdamW (non-private), DP-Adam, LoRA (non-private), DP-LoRA, MeZO (non-private), and DPZero come from Zhang et al. (2024). Fo...

[25] For DP-Adam, a batch size of 64 does not fit, so we use a physical batch size of 32 and gradient accumulation. For all methods, we fine-tune for 50 steps. The total train time for each method is inferred from the time for 50 steps, assuming 1000 total steps for DP-Adam and DP-GRAPE and 10000 total steps for DPZero. The memory and timing experiments were c...

[26] and 10000 total steps for DPZero (the number of steps reported to generate the final results in Zhang et al. (2024)).

| Method | Throughput (Samples/s) | Total Train Time (hours) |
| --- | --- | --- |
| DP-Adam | 71.7 | 0.6 |
| DPZero | 268.1 | 1.7 |
| DP-GRAPE | 75.9 | 0.6 |

C.3 OPT Fine-Tuning. For the OPT experiments, we also follow the same experimental setup and build off the same codebase as used by Z...

[27] using the torch.cuda.max_memory_reserved() function. For all sizes, we use a gradient accumulation step for the first-order methods so that accumulated gradients are included in the memory accounting. The memory and timing experiments were conducted on a single H100 GPU. For the timing experiment, we use the same experimental setup for each method as we u...

[28] 76.7 ± 0.7 25.0 ± 1.1 79.2 ± 0.7 24.1 ± 0.3 77.6 ± 0.4 27.8 ± 0.6
Zero-Shot 26.8 11.1 29.8 9.7 36.5 17.8

Table 13: Throughput and total training time for fine-tuning on SQuAD with OPT models with a total batch size of 8 on an H100 GPU. Total training time is based on 2000 total steps for DP-GRAPE, 3000 total steps for DP-Adam (both are the same t...

[29] and 20000 total steps for DPZero (the number of steps reported to generate the final results in Zhang et al. (2024)). OOM indicates out of memory on an 80GB GPU with a batch size of 1 and gradient accumulation. Model OPT-1.3B OPT-2.7B OPT-6.7B Throughput (Samples/s) Total Train Time (hours) Throughput (Samples/s) Total Train Time (hours) Throughput (Sampl...

[30] It is important to note that these assumptions are standard in the analysis of private non-convex optimization (Lowy et al., 2024; Zhang et al., 2024). Before presenting the complete proof of our theorem, we would like to define the following notations for the ease of stating our proofs. Notations and Lemmas: For a set $A \subseteq X$, we define $f(w; A) = \frac{1}{|A|} \sum_{\xi}$ ...

[31] $\|\cdot\|$ represents the $\ell_2$ norm while $\|\cdot\|_F$ represents the Frobenius norm. We now state some lemmas which would be useful throughout the proof.

Lemma D.4. Consider any random variable $X \ge 0$ and an event $Q$, then we have that $\mathbb{E}[X \mid Q] \le \frac{\mathbb{E}[X]}{P(Q)}$.

Proof. This directly follows from the law of total probability and the non-negativity of $X$: $\mathbb{E}[X] = \mathbb{E}[X \mid Q]\,P(Q) + \mathbb{E}[X \mid Q^c]\,P($ ...

[32] Further, let $S$ be a uniformly random subset of $[n]$ of size $m$. Then,

$$\mathbb{E}\left\| \frac{1}{m} \sum_{l \in S} a_l \right\|^2 = \frac{n-m}{(n-1)m} \cdot \frac{1}{n} \sum_{l=1}^{n} \|a_l\|^2 \le \frac{\mathbf{1}_{\{m<n\}}}{m\,n} \sum_{l=1}^{n} \|a_l\|^2.$$

Lemma D.8 (Zhang et al. (2024)). Let $u, v$ be uniformly sampled from the standard $d$-dimensional Gaussian, let $a \in \mathbb{R}^d$ be some fixed vector independent of $u$, and $H \in \mathbb{R}^{d \times d}$ be some fixed matrix independent of $u$. ...
