Memory-Efficient Differentially Private Training with Gradient Random Projection
Pith reviewed 2026-05-21 23:42 UTC · model grok-4.3
The pith
DP-GRAPE projects gradients with random Gaussian matrices to cut memory in differentially private training while matching DP-SGD utility.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DP-GRAPE performs differentially private training by replacing SVD subspace selection with random Gaussian projection matrices, privatizing the projected gradients, and applying the projection during backpropagation. The central finding is that adding differential privacy noise flattens the singular value spectrum of gradients, so random projections suffice. Theoretical analysis shows the resulting privacy-utility tradeoff stays comparable to DP-SGD despite the lower-dimensional operation. Empirically the approach yields large memory reductions on vision and language models while preserving accuracy and training speed.
What carries the argument
Random Gaussian projection matrices applied to gradients before privatization, which reduce dimensionality during backpropagation and eliminate SVD costs.
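The pipeline described above (project, then privatize) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `dp_project_step`, the left-side projection, the 1/sqrt(r) scaling of the Gaussian matrix, and the noise standard deviation `noise_multiplier * clip_norm` are all choices made here for the example. The summary also does not say where optimizer state lives, so the sketch simply maps the privatized average back to the original shape.

```python
import numpy as np

def dp_project_step(per_sample_grads, r, clip_norm, noise_multiplier, rng):
    """Project, clip, and noise per-sample gradients for one weight matrix.

    per_sample_grads: array of shape (B, m, n), one gradient per example.
    Returns a noisy averaged gradient of shape (m, n).
    """
    B, m, n = per_sample_grads.shape
    # Random Gaussian projection. The 1/sqrt(r) scaling is an assumption that
    # makes P @ P.T equal the identity in expectation.
    P = rng.normal(0.0, 1.0 / np.sqrt(r), size=(m, r))
    # Apply P.T to every per-sample gradient: (B, m, n) -> (B, r, n).
    projected = np.einsum("mr,bmn->brn", P, per_sample_grads)
    # Clip each projected gradient to Frobenius norm <= clip_norm.
    norms = np.linalg.norm(projected.reshape(B, -1), axis=1)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = projected * scale[:, None, None]
    # Gaussian mechanism applied in the projected space.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=(r, n))
    noisy_mean = (clipped.sum(axis=0) + noise) / B
    # Map the privatized average back to the original shape.
    return P @ noisy_mean

rng = np.random.default_rng(0)
grads = rng.normal(size=(8, 16, 10))  # hypothetical batch of 8 gradients
update = dp_project_step(grads, r=4, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
print(update.shape)  # (16, 10)
```

Clipping after projection means the noise only has to cover an r x n array rather than the full m x n gradient, which is where the dimensionality reduction enters the privacy step.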
If this is right
- Memory footprint drops by more than 63 percent during Vision Transformer pre-training relative to DP-Adam.
- Memory footprint drops by more than 70 percent during RoBERTa-Large fine-tuning relative to DP-Adam.
- Fine-tuning becomes feasible for OPT-scale models up to 6.7 billion parameters that exceed DP-Adam memory limits.
- Privacy-utility curves remain comparable to full-dimensional DP-SGD even though training occurs in lower-dimensional subspaces.
- No extra SVD computation or accuracy loss occurs relative to standard first-order DP methods.
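A back-of-the-envelope count shows why projecting per-sample gradients helps. The layer shape, batch size, and projection dimension below are hypothetical values picked for illustration, and the helper `per_sample_grad_bytes` is not from the paper. The count covers only per-sample gradient storage for a single layer, so it does not reproduce the end-to-end percentages listed above, which also include weights, activations, and optimizer state.

```python
def per_sample_grad_bytes(batch_size, rows, cols, bytes_per_value=4):
    """Bytes needed to hold one gradient per example for a rows x cols layer."""
    return batch_size * rows * cols * bytes_per_value

# Hypothetical sizes, chosen only for illustration.
B, m, n, r = 64, 4096, 4096, 64
full = per_sample_grad_bytes(B, m, n)       # per-sample gradients at full size
projected = per_sample_grad_bytes(B, r, n)  # after projecting m rows down to r
print(f"full: {full / 2**30:.2f} GiB, projected: {projected / 2**20:.2f} MiB")
# full: 4.00 GiB, projected: 64.00 MiB
```

For this one layer the storage shrinks by a factor of m / r.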
Where Pith is reading between the lines
- The same random-projection idea might combine with other memory tricks such as gradient checkpointing to push private training to still larger scales.
- If the spectrum-flattening effect is general, similar low-rank random methods could be tested on non-Adam optimizers used in private settings.
- Further reduction of the projection dimension below the levels tested here could be explored to see the exact point where utility begins to degrade.
- The approach might generalize to federated or distributed DP training where communication of full gradients is also costly.
Load-bearing premise
Privatization flattens the gradient singular value spectrum enough that random Gaussian matrices work as well as SVD-selected subspaces.
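One way to probe this premise is to add Gaussian noise to a synthetic low-rank matrix and measure how concentrated its spectrum stays. The sketch below is a toy experiment, not a reproduction of the paper's measurement: the matrix sizes and noise levels are arbitrary, the helper `spectrum_stats` is hypothetical, and the entropy-based effective rank is one common definition among several.

```python
import numpy as np

def spectrum_stats(matrix, k):
    """Return (top-k energy fraction, entropy-based effective rank)."""
    s = np.linalg.svd(matrix, compute_uv=False)
    energy = s**2
    top_k_fraction = energy[:k].sum() / energy.sum()
    # Effective rank: exponential of the entropy of the normalized spectrum.
    p = s / s.sum()
    p = p[p > 0]
    effective_rank = float(np.exp(-(p * np.log(p)).sum()))
    return float(top_k_fraction), effective_rank

rng = np.random.default_rng(0)
m, n, true_rank = 128, 128, 4
# Synthetic stand-in for a summed clipped gradient with low-rank structure.
signal = rng.normal(size=(m, true_rank)) @ rng.normal(size=(true_rank, n))
signal /= np.linalg.norm(signal)  # unit Frobenius norm
for sigma in (0.0, 0.05, 0.5):
    noisy = signal + rng.normal(0.0, sigma, size=(m, n))
    frac, erank = spectrum_stats(noisy, k=true_rank)
    print(f"sigma={sigma}: top-{true_rank} energy={frac:.3f}, effective rank={erank:.1f}")
```

Compared with the noiseless case, the noisy matrices put far less energy in the leading singular directions and have a much higher effective rank. Whether real privatized gradients behave the same way at practical privacy budgets is exactly what the premise asserts and what the referee asks to see quantified.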
What would settle it
A controlled comparison where DP-GRAPE at the same projected dimension shows clearly lower final accuracy or higher privacy loss than an SVD-based baseline on the same task and model size.
Original abstract
Differential privacy (DP) protects sensitive data during neural network training, but standard methods like DP-Adam suffer from high memory overhead due to per-sample gradient clipping, limiting scalability. We introduce DP-GRAPE (Gradient RAndom ProjEction), a DP training method that significantly reduces memory usage while maintaining utility on par with first-order DP approaches. DP-GRAPE is motivated by our finding that privatization flattens the gradient singular value spectrum, making SVD-based projections (as in GaLore (Zhao et al., 2024)) unnecessary. Consequently, DP-GRAPE employs three key components: (1) random Gaussian matrices replace SVD-based subspaces, (2) gradients are privatized after projection, and (3) projection is applied during backpropagation. These contributions eliminate the need for costly SVD computations, enable substantial memory savings, and lead to improved utility. Despite operating in lower-dimensional subspaces, our theoretical analysis shows that DP-GRAPE achieves a privacy-utility tradeoff comparable to DP-SGD. Our extensive empirical experiments show that DP-GRAPE can significantly reduce the memory footprint of DP training without sacrificing accuracy or training time. In particular, DP-GRAPE reduces memory usage by over 63% when pre-training Vision Transformers and over 70% when fine-tuning RoBERTa-Large as compared to DP-Adam, while achieving similar performance. We further demonstrate that DP-GRAPE scales to fine-tuning large models such as OPT with up to 6.7 billion parameters, a scale at which DP-Adam fails due to memory constraints. Our code is available at https://github.com/alexmul1114/DP_GRAPE.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DP-GRAPE, a differentially private training method that projects per-sample gradients to a lower-dimensional space using random Gaussian matrices, performs clipping and noise addition in the projected space, and integrates the projection into the backward pass. The approach is motivated by the empirical observation that DP noise addition flattens the singular-value spectrum of gradients, making SVD-based subspace selection (as in GaLore) unnecessary. The authors claim that, despite the dimension reduction, a theoretical analysis establishes a privacy-utility tradeoff comparable to standard DP-SGD, while experiments report memory reductions exceeding 63% for Vision Transformer pre-training and 70% for RoBERTa-Large fine-tuning, with successful scaling to OPT models of 6.7 billion parameters where DP-Adam runs out of memory.
Significance. If the privacy-utility equivalence and the sufficiency of random projections hold, the work would meaningfully expand the feasible scale of DP training for large models by addressing the memory overhead of per-sample operations. The demonstration on 6.7B-parameter models and the release of code constitute concrete strengths that support reproducibility and practical impact.
major comments (2)
- [Motivation and method] Motivation section (and abstract): the central design choice of random Gaussian matrices over SVD-based projections rests on the observation that 'privatization flattens the gradient singular value spectrum.' No quantitative characterization is provided—e.g., effective rank, decay exponent, dependence on ε, layer depth, or training phase—nor is there a demonstration that the residual spectrum is sufficiently isotropic for random projections to preserve signal at the same rate as structured methods. This assumption is load-bearing for both the utility claim and the decision to forgo GaLore-style subspaces.
- [Theoretical analysis] Theoretical analysis section: the claim that DP-GRAPE achieves a privacy-utility tradeoff comparable to DP-SGD 'despite operating in lower-dimensional subspaces' is stated without the explicit derivation or bound relating projection dimension, noise scale, and utility loss. The analysis must clarify how the reduced dimension affects the sensitivity calculation and the resulting (ε,δ) guarantee relative to full-dimensional DP-SGD.
minor comments (2)
- [Experiments] The experimental section would benefit from explicit ablation tables showing utility as a function of projection dimension for fixed privacy budgets, to quantify the sensitivity of the reported accuracy to the free parameter.
- [Figures] Figure captions and axis labels should explicitly state the privacy budget (ε) and model size for each memory-accuracy curve to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our contributions. We respond to each major comment below and indicate the revisions we will make.
- Referee: [Motivation and method] Motivation section (and abstract): the central design choice of random Gaussian matrices over SVD-based projections rests on the observation that 'privatization flattens the gradient singular value spectrum.' No quantitative characterization is provided (e.g., effective rank, decay exponent, dependence on ε, layer depth, or training phase), nor is there a demonstration that the residual spectrum is sufficiently isotropic for random projections to preserve signal at the same rate as structured methods. This assumption is load-bearing for both the utility claim and the decision to forgo GaLore-style subspaces.
Authors: We agree that the current manuscript presents the flattening observation primarily through qualitative description and selected visualizations. To strengthen the motivation, the revised version will add quantitative characterizations, including plots and tables of singular-value spectra, effective rank, and decay exponents computed across layers, training phases, and multiple values of ε. We will also include a comparison of isotropy measures (e.g., condition numbers of the residual spectrum) to show that random projections preserve signal at rates comparable to structured methods under the observed flattening. revision: yes
- Referee: [Theoretical analysis] Theoretical analysis section: the claim that DP-GRAPE achieves a privacy-utility tradeoff comparable to DP-SGD 'despite operating in lower-dimensional subspaces' is stated without the explicit derivation or bound relating projection dimension, noise scale, and utility loss. The analysis must clarify how the reduced dimension affects the sensitivity calculation and the resulting (ε,δ) guarantee relative to full-dimensional DP-SGD.
Authors: The manuscript's theoretical section invokes norm-preservation properties of random Gaussian projections to argue that sensitivity scales appropriately and that the privacy-utility tradeoff remains comparable. We acknowledge that an explicit step-by-step derivation relating projection dimension, adjusted noise scale, and utility bounds is not fully expanded. In the revision we will add this derivation: we first bound the projected gradient sensitivity using the Johnson-Lindenstrauss lemma, then show the corresponding noise multiplier needed to achieve the target (ε,δ), and finally relate the excess risk or convergence rate to the reduced dimension via standard DP-SGD analysis, demonstrating that the degradation is controlled by a logarithmic factor in the projection dimension. revision: yes
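The norm-preservation property invoked in the second response can be made concrete with a short calculation. The scaling below, i.i.d. entries with variance 1/r, is an assumption made for this illustration; the summary does not state the scaling the paper uses.

```latex
% Assumption: P \in \mathbb{R}^{d \times r} has i.i.d. entries
% P_{jk} \sim \mathcal{N}(0, 1/r), and g \in \mathbb{R}^{d} is fixed.
(P^\top g)_k = \sum_{j=1}^{d} P_{jk}\, g_j
  \sim \mathcal{N}\!\left(0, \frac{\|g\|^2}{r}\right),
  \qquad k = 1, \dots, r,
\\
\mathbb{E}\,\|P^\top g\|^2
  = \sum_{k=1}^{r} \mathbb{E}\,(P^\top g)_k^2
  = r \cdot \frac{\|g\|^2}{r}
  = \|g\|^2 .
```

The projection therefore preserves squared norms in expectation for any r, and the Johnson-Lindenstrauss lemma controls how tightly the norm concentrates around that value as r grows. Separately, when clipping is applied after projection, each clipped projected gradient has norm at most the clipping threshold C, so adding or removing one example changes the clipped sum by at most C regardless of r.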
Circularity Check
No significant circularity; derivation relies on standard random projection and external DP mechanisms
The paper motivates DP-GRAPE by an empirical observation that DP noise flattens gradient singular-value spectra, thereby justifying random Gaussian projections over SVD. This observation is presented as a supporting finding rather than a load-bearing derivation step that reduces to itself. The method itself applies standard random projection followed by DP noise addition after projection, with a claimed theoretical privacy-utility equivalence to DP-SGD that does not invoke self-citation chains or fitted inputs renamed as predictions. Empirical memory savings are demonstrated directly on models up to 6.7B parameters. No quoted equation or claim reduces by construction to the paper's own inputs; the approach remains self-contained against external benchmarks and code.
Axiom & Free-Parameter Ledger
free parameters (1)
- projection dimension
axioms (1)
- domain assumption: Privatization flattens the gradient singular value spectrum, making SVD unnecessary
Forward citations
Cited by 1 Pith paper
- Forget What's Sensitive, Remember What Matters: Token-Level Differential Privacy in Memory Sculpting for Continual Learning
PeCL applies token-level dynamic differential privacy and privacy-guided memory sculpting to achieve superior privacy-utility balance in continual learning.
Reference graph
Works this paper leans on
- [2] R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, FOCS ’14, page 464–473, USA, 2014. IEEE Computer Society. ISBN 9781479965175. doi: 10.1109/FOCS.2014.56. URL https://doi.org/10.1109/FOCS.2014
- [4] URL https://proceedings.neurips.cc/paper_files/paper/2019/file/3bd8fdb090f1f5eb66a00c84dbc5ad51-Paper.pdf. Z. Bu, S. Gopi, J. Kulkarni, Y. T. Lee, H. Shen, and U. Tantipongpipat. Fast and memory efficient differentially private-sgd via jl projections. Advances in Neural Information Processing Systems, 34:19680–19691, 2021.
- [7] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In Advances in Cryptology-EUROCRYPT 2006: 24th Annual International Conference on the Theory and Applications of Cryptographic Techniques, St. Petersburg, Russia, May 28-June 1, 2006. Proceedings 25, pages 486–503. Springer, 2006a.
- [8] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7,
- [9] URL https://arxiv.org/abs/2407.21783. W. H. Greene. Econometric analysis. Prentice Hall,
- [13] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- [15] Y. Liu. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 364, 2019.
- [16] F. Mireshghallah, A. Uniyal, T. Wang, D. Evans, and T. Berg-Kirkpatrick. Memorization in nlp fine-tuning methods. arXiv preprint arXiv:2205.12506, 2022.
- [18] D. Yu, S. Naik, A. Backurs, S. Gopi, H. A. Inan, G. Kamath, J. Kulkarni, Y. T. Lee, A. Manoel, L. Wutschitz, et al. Differentially private fine-tuning of language models. arXiv preprint arXiv:2110.06500, 2021a. D. Yu, H. Zhang, W. Chen, and T.-Y. Liu. Do not let privacy overbill utility: Gradient embedding perturbation for private learning. arXiv p...
- [19] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
- [20] J. Zhao, Z. Zhang, B. Chen, Z. Wang, A. Anandkumar, and Y. Tian. Galore: Memory-efficient llm training by gradient low-rank projection. arXiv preprint arXiv:2403.03507, 2024.
- [22] A DP-Adam. Here, we detail the standard DP-Adam algorithm (with flat clipping) using our notation.
Algorithm 3 DP-Adam
Require: Dataset $X = \{\xi_1, \dots, \xi_n\}$, model parameters $\{W_\ell^{0}\}_{\ell=1}^{L}$, learning rate $\eta$, decay rates $\beta_1, \beta_2$, batch size $B$, total iterations $T$
1: for $t = 1, 2, \dots, T$ do
2: for $\ell = L, L-1, \dots, 1$ do
3: $\{G_{\ell,i}^{t}\}_{i=1}^{B} \leftarrow \nabla_{W_\ell^{t}} f(\{W$ ...
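The excerpt of Algorithm 3 is cut off after its third line, so the following is a sketch of the textbook DP-Adam update with flat clipping rather than a transcription of the paper's algorithm. The function name `dp_adam_step`, the flattened parameter vector, the stabilizer `eps`, and the noise standard deviation `sigma * clip_norm` are assumptions made for this example.

```python
import numpy as np

def dp_adam_step(w, per_sample_grads, m, v, t, lr, beta1, beta2,
                 clip_norm, sigma, rng, eps=1e-8):
    """One DP-Adam update with flat clipping on a flattened parameter vector.

    per_sample_grads has shape (B, d); m and v are Adam's moment estimates;
    t is the 1-based step counter.
    """
    B = per_sample_grads.shape[0]
    # Flat clipping: rescale each example's full gradient to norm <= clip_norm.
    norms = np.linalg.norm(per_sample_grads, axis=1)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped_sum = (per_sample_grads * scale[:, None]).sum(axis=0)
    # Gaussian mechanism on the clipped sum, then average over the batch.
    noise = rng.normal(0.0, sigma * clip_norm, size=w.shape)
    noisy_grad = (clipped_sum + noise) / B
    # Standard Adam moment updates with bias correction.
    m = beta1 * m + (1 - beta1) * noisy_grad
    v = beta2 * v + (1 - beta2) * noisy_grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

The memory cost that DP-GRAPE targets comes from `per_sample_grads`: the clipping step needs one gradient per example, so the array is B times larger than an ordinary batch gradient.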
- [23] C.2 RoBERTa Fine-Tuning. We follow the same experimental setup and build off of the same codebase as used by Zhang et al. (2024) and Malladi et al. (2023) to fine-tune RoBERTa-Large (Liu,
- [24] privacy for each, and record the average final test accuracy over the 3 seeds for each dataset and privacy level. We train with a batch size of 64 for all experiments, which may be achieved using gradient accumulation. The results for AdamW (non-private), DP-Adam, LoRA (non-private), DP-LoRA, MeZO (non-private), and DPZero come from Zhang et al. (2024). Fo...
- [25] For DP-Adam, a batch size of 64 does not fit, so we use a physical batch size of 32 and gradient accumulation. For all methods, we fine-tune for 50 steps. The total train time for each method is inferred from the time for 50 steps, assuming 1000 total steps for DP-Adam and DP-GRAPE and 10000 total steps for DPZero. The memory and timing experiments were c...
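Gradient accumulation in the private setting, as used for the batch of 64 split into physical batches of 32, can be sketched as follows. This is an illustrative reading, not the paper's code: the helper names are hypothetical, and adding the noise once per logical batch after all micro-batches are summed is the assumption made here.

```python
import numpy as np

def clipped_sum(per_sample_grads, clip_norm):
    """Clip each row to norm <= clip_norm and sum over the micro-batch."""
    norms = np.linalg.norm(per_sample_grads, axis=1)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return (per_sample_grads * scale[:, None]).sum(axis=0)

def accumulate_private_gradient(micro_batches, clip_norm, sigma, rng):
    """Sum clipped gradients over micro-batches, then add noise once."""
    total = None
    count = 0
    for grads in micro_batches:
        part = clipped_sum(grads, clip_norm)
        total = part if total is None else total + part
        count += grads.shape[0]
    # One noise draw per logical batch, not one per micro-batch.
    noise = rng.normal(0.0, sigma * clip_norm, size=total.shape)
    return (total + noise) / count

rng = np.random.default_rng(0)
grads = rng.normal(size=(64, 5))  # hypothetical per-sample gradients
one_pass = accumulate_private_gradient([grads], 1.0, 0.0, rng)
two_pass = accumulate_private_gradient([grads[:32], grads[32:]], 1.0, 0.0, rng)
print(np.allclose(one_pass, two_pass))  # True
```

Because clipping acts on each example independently, splitting the batch changes peak memory but not the clipped sum, which is why the two calls agree when the noise is switched off.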
- [26] and 10000 total steps for DPZero (the number of steps reported to generate the final results in Zhang et al. (2024)).
Method | Throughput (Samples/s) | Total Train Time (hours)
DP-Adam | 71.7 | 0.6
DPZero | 268.1 | 1.7
DP-GRAPE | 75.9 | 0.6
C.3 OPT Fine-Tuning. For the OPT experiments, we also follow the same experimental setup and build off the same codebase as used by Z...
- [27] using the torch.cuda.max_memory_reserved() function. For all sizes, we use a gradient accumulation step for the first-order methods so that accumulated gradients are included in the memory accounting. The memory and timing experiments were conducted on a single H100 GPU. For the timing experiment, we use the same experimental setup for each method as we u...
- [28] 76.7 ± 0.7 25.0 ± 1.1 79.2 ± 0.7 24.1 ± 0.3 77.6 ± 0.4 27.8 ± 0.6
Zero-Shot 26.8 11.1 29.8 9.7 36.5 17.8
Table 13: Throughput and total training time for fine-tuning on SQuAD with OPT models with a total batch size of 8 on an H100 GPU. Total training time is based on 2000 total steps for DP-GRAPE, 3000 total steps for DP-Adam (both are the same t...
- [29] and 20000 total steps for DPZero (the number of steps reported to generate the final results in Zhang et al. (2024)). OOM indicates out of memory on an 80GB GPU with a batch size of 1 and gradient accumulation.
Model: OPT-1.3B, OPT-2.7B, OPT-6.7B. Columns per model: Throughput (Samples/s), Total Train Time (hours) ...
- [30] It is important to note that these assumptions are standard in the analysis of private non-convex optimization (Lowy et al., 2024; Zhang et al., 2024). Before presenting the complete proof of our theorem, we would like to define the following notations for the ease of stating our proofs. Notations and Lemmas: For a set $A \subseteq X$, we define $f(w; A) = \frac{1}{|A|} \sum_{\xi}$ ...
- [31] $\|\cdot\|$ represents the $\ell_2$ norm while $\|\cdot\|_F$ represents the Frobenius norm. We now state some lemmas which would be useful throughout the proof.
Lemma D.4. Consider any random variable $X \ge 0$ and an event $Q$, then we have that $\mathbb{E}[X \mid Q] \le \frac{\mathbb{E}[X]}{\mathbb{P}(Q)}$.
Proof. This directly follows from the law of total probability and the non-negativity of $X$: $\mathbb{E}[X] = \mathbb{E}[X \mid Q]\,\mathbb{P}(Q) + \mathbb{E}[X \mid Q^{c}]\,\mathbb{P}($ ...
- [32] Further, let $S$ be a uniformly random subset of $[n]$ of size $m$. Then,
$\mathbb{E}\left\| \frac{1}{m} \sum_{l \in S} a_l \right\|^2 = \frac{n-m}{(n-1)m} \cdot \frac{1}{n} \sum_{l=1}^{n} \|a_l\|^2 \le \frac{\mathbb{1}_{\{m<n\}}}{m\,n} \sum_{l=1}^{n} \|a_l\|^2.$
Lemma D.8 (Zhang et al. (2024)). Let $u, v$ be uniformly sampled from the standard $d$-dimensional Gaussian, let $a \in \mathbb{R}^{d}$ be some fixed vector independent of $u$, and $H \in \mathbb{R}^{d \times d}$ be some fixed matrix independent of $u$. ...
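The subset-averaging identity in the excerpt above can be checked exactly on a small example by enumerating every subset. The start of the lemma is truncated in this excerpt, so the hypothesis used below, that the vectors sum to zero, is an assumption (it is the usual condition for this identity), and the helper name is hypothetical.

```python
from itertools import combinations

import numpy as np

def mean_sq_norm_of_subset_average(vectors, m):
    """Exact E||(1/m) sum_{l in S} a_l||^2 over all size-m subsets S."""
    n = len(vectors)
    values = [
        np.sum(vectors[list(S)].mean(axis=0) ** 2)
        for S in combinations(range(n), m)
    ]
    return float(np.mean(values))

rng = np.random.default_rng(0)
n, d, m = 6, 3, 2
a = rng.normal(size=(n, d))
a -= a.mean(axis=0)  # assumed hypothesis: the vectors sum to zero

lhs = mean_sq_norm_of_subset_average(a, m)
rhs = (n - m) / ((n - 1) * m) * np.mean(np.sum(a**2, axis=1))
upper = np.sum(a**2) / (m * n)  # right-hand bound when m < n
print(np.isclose(lhs, rhs), rhs <= upper)  # True True
```

The factor (n - m) / (n - 1) is the finite-population correction for sampling without replacement. It equals zero when m = n, which is what the indicator in the upper bound captures.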