When Do Fewer Coordinates Suffice in DP-SGD?

Fang Xie; Huiqi Zhang

arxiv: 2606.04375 · v1 · pith:VEFYNJ2Dnew · submitted 2026-06-03 · 💻 cs.LG · stat.ML

When Do Fewer Coordinates Suffice in DP-SGD?

Huiqi Zhang , Fang Xie This is my paper

Pith reviewed 2026-06-28 07:01 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords DP-SGDcoordinate restrictiondifferential privacyTopK selectionnonconvex stationaritywarm-up phasegradient energy

0 comments

The pith

When a criterion holds, DP-SGD can restrict updates to k coordinates so noise scales with k rather than full dimension d.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DP-SGD adds noise scaled to the full parameter dimension d on every step. The paper identifies a condition under which restricting updates to a smaller active set of k coordinates remains useful for optimization. It introduces TP-TopK, a two-phase procedure that runs a private warm-up phase to rank and select the support used in the main training phase. Under the condition a nonconvex stationarity bound shows the noise term depends on k. Experiments on MNIST, FMNIST and CIFAR-10 confirm that supports chosen from warm-up scores retain more gradient energy than random subsets of equal size, with larger gains when k is small.

Core claim

We give a criterion characterizing when coordinate restriction can be beneficial, show via a nonconvex stationarity bound that under this condition the relevant noise term scales with the active dimension k rather than the full parameter dimension d, and provide a lower bound on the reliability of warm-up-based coordinate ranking.

What carries the argument

TP-TopK, the two-phase method that runs a private warm-up phase to identify a coordinate support for the main training phase.

If this is right

Under the stated criterion the noise term in the stationarity bound scales with k rather than d.
Learned coordinate supports from the warm-up retain more gradient energy than size-matched random supports.
The largest gains appear when the active dimension is small and the warm-up scores are informative.
A lower bound is given on the reliability of warm-up-based coordinate ranking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the criterion can be verified cheaply it would let private training automatically drop uninformative coordinates to lower total noise.
The same warm-up idea could be tested in other per-coordinate noise mechanisms beyond DP-SGD.
Checking whether the criterion holds on large language models would show whether the k-scaling benefit survives beyond the image-classification regimes examined.

Load-bearing premise

The private warm-up phase must produce coordinate scores informative enough that the selected support retains sufficient gradient energy for the main optimization phase to succeed.

What would settle it

An experiment in which the nonconvex stationarity bound fails to improve when restricting to the warm-up-selected k coordinates, or in which the warm-up ranking reliability falls below the stated lower bound on real models.

Figures

Figures reproduced from arXiv: 2606.04375 by Fang Xie, Huiqi Zhang.

**Figure 2.** Figure 2: Two-phase training pipeline of TP-TopK and its RDP accounting structure. Phase 1 runs full-parameter DPSGD with noise multiplier 𝜎1, accumulating coordinate scores 𝑠𝑝 ; the denoised score 𝑎𝑝 = 𝑠𝑝 /𝑇1 − (𝜎1𝐶1/𝐵) 2 is post-processed to select 𝐴 = TopK(𝑎, 𝑘) at no additional privacy cost. Phase 2 restricts DP-SGD to the 𝑘 active coordinates in 𝐴 with noise multiplier 𝜎2, leaving frozen coordinates 𝐴¯ unchang… view at source ↗

read the original abstract

Differentially private stochastic gradient descent (DP-SGD) injects noise into every updated coordinate, making the injected noise energy scale with the ambient parameter dimension \(d\). We ask when private training can update fewer coordinates without losing the signal needed for optimization. We propose \textsc{TP-TopK} (Two-Phase TopK DP-SGD), a two-phase method for coordinate-sparse private training without public data, in which a private warm-up phase identifies a coordinate support used to guide the main training phase. We give a criterion characterizing when coordinate restriction can be beneficial, show via a nonconvex stationarity bound that under this condition the relevant noise term scales with the active dimension \(k\) rather than the full parameter dimension \(d\), and provide a lower bound on the reliability of warm-up-based coordinate ranking. Experiments on MNIST, FMNIST, and CIFAR-10 show that learned coordinate supports can retain more gradient energy than size-matched random supports, with the largest gains when the active dimension is small and warm-up scores are informative.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TP-TopK gives a concrete two-phase warm-up for picking k coordinates in DP-SGD plus a stationarity bound that isolates the k-scaling noise term, but the reliability lower bound looks too loose to guarantee the claimed benefit holds in the regimes where it would matter most.

read the letter

The paper introduces TP-TopK, which runs a short private warm-up to rank coordinates and then restricts the main DP-SGD phase to the top k. It states a criterion for when this restriction helps, derives a non-convex stationarity bound under that criterion so the noise scales with k rather than d, and supplies a lower bound on how reliably the warm-up ranks the coordinates.

The new pieces are the explicit criterion, the k-vs-d noise isolation in the bound, and the two-phase algorithm itself. The experiments on MNIST, FMNIST, and CIFAR-10 are straightforward and show that the learned supports keep more gradient energy than random supports of the same size, with bigger gaps when k is small. That is useful to see.

The soft spot is exactly the one the stress-test flags. The central claim needs the selected support to retain enough gradient energy for the k-scaling improvement to be real. The paper offers only a lower bound on ranking reliability as support for that premise. If the bound is loose (constant factor or vanishing fraction with non-negligible probability), the restricted phase can lose too much signal and the criterion becomes inapplicable precisely when coordinate restriction would be most attractive. The abstract gives no numbers on how tight the bound is or on the actual energy retention, so the practical payoff stays unclear.

This is for people working on utility improvements inside the existing DP-SGD framework rather than new privacy definitions. It has enough of a new technique and a clear question to deserve referee time, even though the current evidence is preliminary and the key assumption needs tighter quantification.

Referee Report

2 major / 1 minor

Summary. The paper introduces TP-TopK, a two-phase DP-SGD algorithm consisting of a private warm-up phase that ranks and selects a coordinate support of size k, followed by a main training phase restricted to those coordinates. It states a criterion under which coordinate restriction is beneficial, derives a non-convex stationarity bound showing that the noise term then scales with k rather than the full dimension d, supplies a lower bound on the reliability of the warm-up ranking, and reports experiments on MNIST, FMNIST, and CIFAR-10 in which learned supports retain more gradient energy than size-matched random supports, with larger gains at small k when warm-up scores are informative.

Significance. If the stationarity bound holds under the stated criterion and the warm-up lower bound is shown to guarantee sufficient retained gradient energy, the work would offer a concrete, public-data-free route to reducing the noise penalty in high-dimensional DP-SGD. The explicit criterion, the non-convex analysis, and the direct comparison of learned versus random supports are positive features; the absence of public data makes the approach relevant to settings where auxiliary data are unavailable.

major comments (2)

[stationarity bound and reliability lower bound] The non-convex stationarity bound (abstract and the derivation referenced in the main text): the bound is stated to deliver a k-versus-d noise improvement only under the paper's criterion, yet the criterion itself presupposes that the selected support retains enough gradient energy. The supplied lower bound on warm-up ranking reliability is described only as 'partial support' for this premise; it is not shown whether the bound guarantees a (1-ε) fraction of total gradient energy with high probability, which is required for the k-scaling claim to remain load-bearing when selection error is present.
[experiments on MNIST, FMNIST, CIFAR-10] Experimental section: results are reported without error bars, without the exact values of k tested, and without an explicit statement of the exclusion rule or energy metric used to declare that 'learned supports retain more gradient energy.' These omissions prevent assessment of whether the observed gains are statistically reliable or whether they occur in the regime where the k-versus-d benefit is theoretically predicted.

minor comments (1)

Notation for the active dimension k and the reliability probability should be introduced once and used consistently; the abstract uses both without an explicit definition of the energy-retention threshold.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback and positive assessment of the paper's potential contribution. We address each major comment below, indicating planned revisions where appropriate.

read point-by-point responses

Referee: [stationarity bound and reliability lower bound] The non-convex stationarity bound (abstract and the derivation referenced in the main text): the bound is stated to deliver a k-versus-d noise improvement only under the paper's criterion, yet the criterion itself presupposes that the selected support retains enough gradient energy. The supplied lower bound on warm-up ranking reliability is described only as 'partial support' for this premise; it is not shown whether the bound guarantees a (1-ε) fraction of total gradient energy with high probability, which is required for the k-scaling claim to remain load-bearing when selection error is present.

Authors: The stationarity bound is derived under an explicit criterion that the selected support must retain sufficient gradient energy for the k-scaling to hold; this is presented as a sufficient condition rather than an unconditional claim. The lower bound on warm-up reliability is offered as partial analytical support for the criterion being plausible, showing that the ranking procedure has non-trivial reliability properties, but it does not claim or prove a high-probability (1-ε) energy retention guarantee. We will revise the manuscript to more explicitly separate the criterion (under which the bound applies) from the reliability analysis and to discuss the effect of possible selection error on the noise scaling. Strengthening the reliability result to a full (1-ε) guarantee would require additional assumptions on gradient distributions that are outside the current scope. revision: partial
Referee: [experiments on MNIST, FMNIST, CIFAR-10] Experimental section: results are reported without error bars, without the exact values of k tested, and without an explicit statement of the exclusion rule or energy metric used to declare that 'learned supports retain more gradient energy.' These omissions prevent assessment of whether the observed gains are statistically reliable or whether they occur in the regime where the k-versus-d benefit is theoretically predicted.

Authors: We agree that the experimental reporting lacks these details. In the revised version we will add error bars computed over multiple independent runs, list the precise k values used for each dataset, and provide an explicit definition of the gradient energy metric together with any coordinate exclusion rules applied when computing retained energy. These additions will make it possible to evaluate statistical reliability and alignment with the theoretical regime. revision: yes

standing simulated objections not resolved

The lower bound on warm-up ranking reliability does not currently guarantee retention of a (1-ε) fraction of total gradient energy with high probability; providing such a guarantee would require further assumptions or analysis not present in the manuscript.

Circularity Check

0 steps flagged

No significant circularity; derivations follow from stated criterion and standard analysis

full rationale

The paper states a criterion for when coordinate restriction benefits DP-SGD, then derives a nonconvex stationarity bound under that criterion (showing noise scaling with active dimension k) and a lower bound on warm-up ranking reliability. These follow from the criterion plus standard non-convex analysis without any reduction by the paper's own equations to fitted quantities, self-citations, or definitional tautologies. No load-bearing step renames a fit as a prediction or imports uniqueness via author self-citation; the central claims remain independent of the paper's inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is inferred at a high level from the described method and theory; no explicit free parameters, axioms, or invented entities are stated.

free parameters (1)

active dimension k
The number of coordinates retained after warm-up; its value is chosen according to the criterion and appears tuned per dataset and experiment.

axioms (1)

domain assumption There exists a small coordinate support that captures sufficient gradient energy for optimization progress
This premise is required for the warm-up phase to be useful and for the stationarity bound to show benefit from restricting to k coordinates.

pith-pipeline@v0.9.1-grok · 5706 in / 1474 out tokens · 31014 ms · 2026-06-28T07:01:55.393556+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

51 extracted references · 1 canonical work pages

[1]

Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang

Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 308–318

2016
[2]

Kamil Adamczewski, Yingchen He, and Mijung Park. 2023. Pre-Pruning and Gradient-Dropping Improve Differentially Private Image Classification.arXiv preprint arXiv:2306.11754(2023)

arXiv 2023
[3]

Brendan McMahan, and Swaroop Ramaswamy

Galen Andrew, Om Thakkar, H. Brendan McMahan, and Swaroop Ramaswamy
[4]

InAdvances in Neural Information Processing Systems, Vol

Differentially Private Learning with Adaptive Clipping. InAdvances in Neural Information Processing Systems, Vol. 34. 17455–17466
[5]

Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. 2019. Private Stochastic Convex Optimization with Optimal Rates. InAdvances in Neural Information Processing Systems, Vol. 32. 11282–11291

2019
[6]

Raef Bassily, Adam Smith, and Abhradeep Thakurta. 2014. Private Empirical Risk Minimization: Efficient Algorithms and Tight Error Bounds. InProceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science. IEEE Computer Society, 464–473

2014
[7]

Philipp Benz et al. 2023. Equivariant Differentially Private Deep Learning: Why DP-SGD Needs Sparser Models.arXiv preprint arXiv:2301.13104(2023)

arXiv 2023
[8]

Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language Models are Few-Shot Learners.Advances in Neural Information Processing Systems33 (2020), 1877–1901

2020
[9]

Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. 2023. Au- tomatic Clipping: Differentially Private Deep Learning Made Easier and Stronger. InAdvances in Neural Information Processing Systems, Vol. 36. 41727–41764. https://proceedings.neurips.cc/paper_files/paper/2023/hash/ 8249b30d877c91611fd8c7aa6ac2b5fe-Abstract-Conference.html

2023
[10]

Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramèr. 2022. Membership Inference Attacks From First Principles. In 2022 IEEE Symposium on Security and Privacy (SP). 1897–1914

2022
[11]

Lin Chen, Xiaofeng Ding, Mengqi Li, and Hai Jin. 2023. Differentially private fed- erated learning with importance client sampling.IEEE Transactions on Consumer Electronics70, 1 (2023), 3635–3649. When Do Fewer Coordinates Suffice in DP-SGD?

2023
[12]

Mia Xu Chen, Benjamin N Lee, Gagan Bansal, Yuan Cao, Shuyuan Zhang, Justin Lu, Jackie Tsay, Yinan Wang, Andrew M Dai, Zhifeng Chen, et al. 2019. Gmail smart compose: Real-time assisted writing. InProceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2287– 2295

2019
[13]

Christopher Choquette-Choo, Arun Ganesh, Saminul Haque, Thomas Steinke, and Abhradeep Guha Thakurta. 2025. Near-Exact Privacy Amplification for Matrix Mechanisms. InInternational Conference on Learning Representations. 98772–98802

2025
[14]

Choquette-Choo, Arun Ganesh, Ryan McKenna, H

Christopher A. Choquette-Choo, Arun Ganesh, Ryan McKenna, H. Brendan McMahan, John Rush, Abhradeep Guha Thakurta, and Zheng Xu. 2023. (Ampli- fied) Banded Matrix Factorization: A Unified Approach to Private Training. In Advances in Neural Information Processing Systems, Vol. 36. 74856–74889

2023
[15]

Smith, and Borja Balle

Soham De, Leonard Berrada, Jamie Hayes, Samuel L. Smith, and Borja Balle. 2022. Unlocking High-Accuracy Differentially Private Image Classification through Scale. InProceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research), Vol. 162. PMLR, 4815–4827

2022
[16]

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xi- aohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.International Conference on Learning Representations (ICLR)(2021)

2021
[17]

Cynthia Dwork. 2006. Differential Privacy. InAutomata, Languages and Pro- gramming (Lecture Notes in Computer Science), Vol. 4052. Springer, 1–12

2006
[18]

Cynthia Dwork and Aaron Roth. 2014. The Algorithmic Foundations of Differ- ential Privacy.Foundations and Trends in Theoretical Computer Science9, 3–4 (2014), 211–407

2014
[19]

Jie Fu, Zhili Chen, and XinPeng Ling. 2022. SA-DPSGD: Differentially Private Stochastic Gradient Descent based on Simulated Annealing.arXiv preprint arXiv:2211.07218(2022). arXiv:2211.07218 https://arxiv.org/abs/2211.07218

arXiv 2022
[20]

Saeed Ghadimi and Guanghui Lan. 2013. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming.SIAM Journal on Optimization 23, 4 (2013), 2341–2368

2013
[21]

Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Amer Sinha, and Chiyuan Zhang. 2023. Sparsity-Preserving Differentially Pri- vate Training of Large Embedding Models. InAdvances in Neural Information Processing Systems, Vol. 36. 10951–10971

2023
[22]

Sivakanth Gopi, Yin Tat Lee, and Lukas Wutschitz. 2021. Numerical Composition of Differential Privacy. InAdvances in Neural Information Processing Systems, Vol. 34. 11631–11642

2021
[23]

Shlomi Hod, Lucas Rosenblatt, and Julia Stoyanovich. 2025. Do You Really Need Public Data? Surrogate Public Data for Differential Privacy on Tabular Data. arXiv preprint arXiv:2504.14368(2025)

arXiv 2025
[24]

Yu, and Xuyun Zhang

Hongsheng Hu, Zoran Salcic, Lichao Sun, Gillian Dobbie, Philip S. Yu, and Xuyun Zhang. 2022. Membership Inference Attacks on Machine Learning: A Survey. Comput. Surveys54, 11s (2022), 1–37

2022
[25]

Kaggle / EyePACS. 2015. EyePACS Diabetic Retinopathy Detection Dataset. https://www.kaggle.com/c/diabetic-retinopathy-detection

2015
[26]

Peter Kairouz, Brendan McMahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, and Zheng Xu. 2021. Practical and Private (Deep) Learning with- out Sampling or Shuffling. InInternational Conference on Machine Learning (Proceedings of Machine Learning Research), Vol. 139. PMLR, 5213–5225

2021
[27]

Anastasia Koloskova, Hadrien Hendrikx, and Sebastian U. Stich. 2023. Revisit- ing Gradient Clipping: Stochastic Bias and Tight Convergence Guarantees. In International Conference on Machine Learning (Proceedings of Machine Learning Research), Vol. 202. PMLR, 17343–17363

2023
[28]

Yixuan Liu, Li Xiong, Yuhan Liu, Yujie Gu, Ruixuan Liu, and Hong Chen. 2024. DPDR: Gradient Decomposition and Reconstruction for Differentially Private Deep Learning.arXiv preprint arXiv:2406.02744(2024). arXiv:2406.02744 https: //arxiv.org/abs/2406.02744

arXiv 2024
[29]

Linyuan Lü, Matúš Medo, Chi Ho Yeung, Yi-Cheng Zhang, Zi-Ke Zhang, and Tao Zhou. 2012. Recommender systems.Physics reports519, 1 (2012), 1–49

2012
[30]

Alexander Selvikvåg Lundervold and Arvid Lundervold. 2019. An overview of deep learning in medical imaging focusing on MRI.Zeitschrift fuer medizinische Physik29, 2 (2019), 102–127

2019
[31]

Mehdi Makni, Kayhan Behdin, Gabriel Afriat, Zheng Xu, Sergei Vassilvitskii, Natalia Ponomareva, Rahul Mazumder, and Hussein Hazimeh. 2025. SPARTA: An Optimization Framework for Differentially Private Sparse Fine-Tuning. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, 2090–2101

2025
[32]

Ilya Mironov. 2017. Rényi Differential Privacy. InProceedings of the 2017 IEEE 30th Computer Security Foundations Symposium. IEEE Computer Society, 263–275

2017
[33]

Marziyeh Mohammadi, Mohsen Vejdanihemmat, Mahshad Lotfinia, Mirabela Rusu, Daniel Truhn, Andreas Maier, and Soroosh Tayebi Arasteh. 2026. Differ- ential privacy for medical deep learning: methods, tradeoffs, and deployment implications.npj Digital Medicine9 (2026), 93

2026
[34]

Sabrina Mokhtari, Sara Kodeiri, Shubhankar Mohapatra, Florian Tramèr, and Gautam Kamath. 2026. Rethinking Benchmarks for Differentially Private Image Classification.arXiv preprint arXiv:2601.17189(2026)

arXiv 2026
[35]

Fahad Shamshad, Salman Khan, Syed Waqas Zamir, Muhammad Haris Khan, Munawar Hayat, Fahad Shahbaz Khan, and Huazhu Fu. 2023. Transformers in medical imaging: A survey.Medical Image Analysis88 (2023), 102802

2023
[36]

Reza Shokri and Vitaly Shmatikov. 2015. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security. 1310–1321

2015
[37]

Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference attacks against machine learning models. In2017 IEEE symposium on security and privacy. 3–18

2017
[38]

Shuang Song, Kamalika Chaudhuri, and Anand D. Sarwate. 2013. Stochastic Gra- dient Descent with Differentially Private Updates. In2013 IEEE Global Conference on Signal and Information Processing. IEEE, 245–248

2013
[39]

Florian Tramèr, Gautam Kamath, and Nicholas Carlini. 2024. Position: Consider- ations for Differentially Private Learning with Large-Scale Public Pretraining. InProceedings of the 41st International Conference on Machine Learning. PMLR, 48453–48467

2024
[40]

Yu-Xiang Wang, Borja Balle, and Shiva Prasad Kasiviswanathan. 2019. Sub- sampled Rényi Differential Privacy and Analytical Moments Accountant. In Proceedings of the Twenty-Second International Conference on Artificial Intelli- gence and Statistics (Proceedings of Machine Learning Research), Vol. 89. PMLR, 1226–1235. https://proceedings.mlr.press/v89/wang19b.html

2019
[41]

Likang Wu, Zhi Zheng, Zhaopeng Qiu, Hao Wang, Hongchao Gu, Tingjia Shen, Chuan Qin, Chen Zhu, Hengshu Zhu, Qi Liu, Hui Xiong, and Enhong Chen. 2024. A Survey on Large Language Models for Recommendation.World Wide Web (2024)

2024
[42]

Alvarez, Jan Kautz, and Pavlo Molchanov

Hongxu Yin, Arun Mallya, Arash Vahdat, Jose M. Alvarez, Jan Kautz, and Pavlo Molchanov. 2021. See Through Gradients: Image Batch Recovery via GradInver- sion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 16337–16346

2021
[43]

Da Yu, Huishuai Zhang, Wei Chen, and Tie-Yan Liu. 2021. Do Not Let Privacy Overbill Utility: Gradient Embedding Perturbation for Private Learning. InInter- national Conference on Learning Representations. https://openreview.net/forum? id=7aogOj_VYO0

2021
[44]

Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, and Tie-Yan Liu. 2021. Large Scale Private Learning via Low-Rank Reparametrization. InProceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research), Vol. 139. PMLR, 12208–12218. https://proceedings.mlr.press/v139/ yu21f.html

2021
[45]

Xinwei Zhang, Zhiqi Bu, Borja Balle, Mingyi Hong, Meisam Razaviyayn, and Vahab Mirrokni. 2025. DiSK: Differentially Private Optimizer with Simplified Kalman Filter for Noise Reduction. InInternational Conference on Learning Rep- resentations. https://openreview.net/forum?id=Lfy9q7Icp9

2025
[46]

Xinwei Zhang, Zhiqi Bu, Mingyi Hong, and Meisam Razaviyayn. 2024. DOPPLER: Differentially Private Optimizers with Low-Pass Filter for Privacy Noise Reduction. InAdvances in Neural Information Processing Systems, Vol. 37. 41826–41851. https://proceedings.neurips.cc/paper_files/paper/2024/hash/ 49c466ccc038f39b08b1980a2b06673c-Abstract-Conference.html

2024
[47]

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2026. A Survey of Large Language Models.Frontiers of Computer Science20, 12 (2026), 2012627. https://doi.org/10.1007/s11704-026-60308-3

work page doi:10.1007/s11704-026-60308-3 2026
[48]

S Kevin Zhou, Hayit Greenspan, Christos Davatzikos, James S Duncan, Bram Van Ginneken, Anant Madabhushi, Jerry L Prince, Daniel Rueckert, and Ronald M Summers. 2021. A review of deep learning in medical imaging: Imaging traits, technology trends, case studies with progress highlights, and future promises. Proc. IEEE109, 5 (2021), 820–838

2021
[49]

Yingxue Zhou, Zhiwei Steven Wu, and Arindam Banerjee. 2021. Bypassing the Ambient Dimension: Private SGD with Gradient Subspace Identification. In International Conference on Learning Representations. https://openreview.net/ forum?id=7dpmlkBuJFC

2021
[50]

Blaschko

Junyi Zhu and Matthew B. Blaschko. 2023. Improving Differentially Private SGD via Randomly Sparsified Gradients.Transactions on Machine Learning Research (2023). https://openreview.net/forum?id=sY35BAiIf4

2023
[51]

Ligeng Zhu, Zhijian Liu, and Song Han. 2019. Deep leakage from gradients. Advances in neural information processing systems32 (2019)

2019

[1] [1]

Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang

Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 308–318

2016

[2] [2]

Kamil Adamczewski, Yingchen He, and Mijung Park. 2023. Pre-Pruning and Gradient-Dropping Improve Differentially Private Image Classification.arXiv preprint arXiv:2306.11754(2023)

arXiv 2023

[3] [3]

Brendan McMahan, and Swaroop Ramaswamy

Galen Andrew, Om Thakkar, H. Brendan McMahan, and Swaroop Ramaswamy

[4] [4]

InAdvances in Neural Information Processing Systems, Vol

Differentially Private Learning with Adaptive Clipping. InAdvances in Neural Information Processing Systems, Vol. 34. 17455–17466

[5] [5]

Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. 2019. Private Stochastic Convex Optimization with Optimal Rates. InAdvances in Neural Information Processing Systems, Vol. 32. 11282–11291

2019

[6] [6]

Raef Bassily, Adam Smith, and Abhradeep Thakurta. 2014. Private Empirical Risk Minimization: Efficient Algorithms and Tight Error Bounds. InProceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science. IEEE Computer Society, 464–473

2014

[7] [7]

Philipp Benz et al. 2023. Equivariant Differentially Private Deep Learning: Why DP-SGD Needs Sparser Models.arXiv preprint arXiv:2301.13104(2023)

arXiv 2023

[8] [8]

Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language Models are Few-Shot Learners.Advances in Neural Information Processing Systems33 (2020), 1877–1901

2020

[9] [9]

Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. 2023. Au- tomatic Clipping: Differentially Private Deep Learning Made Easier and Stronger. InAdvances in Neural Information Processing Systems, Vol. 36. 41727–41764. https://proceedings.neurips.cc/paper_files/paper/2023/hash/ 8249b30d877c91611fd8c7aa6ac2b5fe-Abstract-Conference.html

2023

[10] [10]

Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramèr. 2022. Membership Inference Attacks From First Principles. In 2022 IEEE Symposium on Security and Privacy (SP). 1897–1914

2022

[11] [11]

Lin Chen, Xiaofeng Ding, Mengqi Li, and Hai Jin. 2023. Differentially private fed- erated learning with importance client sampling.IEEE Transactions on Consumer Electronics70, 1 (2023), 3635–3649. When Do Fewer Coordinates Suffice in DP-SGD?

2023

[12] [12]

Mia Xu Chen, Benjamin N Lee, Gagan Bansal, Yuan Cao, Shuyuan Zhang, Justin Lu, Jackie Tsay, Yinan Wang, Andrew M Dai, Zhifeng Chen, et al. 2019. Gmail smart compose: Real-time assisted writing. InProceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2287– 2295

2019

[13] [13]

Christopher Choquette-Choo, Arun Ganesh, Saminul Haque, Thomas Steinke, and Abhradeep Guha Thakurta. 2025. Near-Exact Privacy Amplification for Matrix Mechanisms. InInternational Conference on Learning Representations. 98772–98802

2025

[14] [14]

Choquette-Choo, Arun Ganesh, Ryan McKenna, H

Christopher A. Choquette-Choo, Arun Ganesh, Ryan McKenna, H. Brendan McMahan, John Rush, Abhradeep Guha Thakurta, and Zheng Xu. 2023. (Ampli- fied) Banded Matrix Factorization: A Unified Approach to Private Training. In Advances in Neural Information Processing Systems, Vol. 36. 74856–74889

2023

[15] [15]

Smith, and Borja Balle

Soham De, Leonard Berrada, Jamie Hayes, Samuel L. Smith, and Borja Balle. 2022. Unlocking High-Accuracy Differentially Private Image Classification through Scale. InProceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research), Vol. 162. PMLR, 4815–4827

2022

[16] [16]

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xi- aohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.International Conference on Learning Representations (ICLR)(2021)

2021

[17] [17]

Cynthia Dwork. 2006. Differential Privacy. InAutomata, Languages and Pro- gramming (Lecture Notes in Computer Science), Vol. 4052. Springer, 1–12

2006

[18] [18]

Cynthia Dwork and Aaron Roth. 2014. The Algorithmic Foundations of Differ- ential Privacy.Foundations and Trends in Theoretical Computer Science9, 3–4 (2014), 211–407

2014

[19] [19]

Jie Fu, Zhili Chen, and XinPeng Ling. 2022. SA-DPSGD: Differentially Private Stochastic Gradient Descent based on Simulated Annealing.arXiv preprint arXiv:2211.07218(2022). arXiv:2211.07218 https://arxiv.org/abs/2211.07218

arXiv 2022

[20] [20]

Saeed Ghadimi and Guanghui Lan. 2013. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming.SIAM Journal on Optimization 23, 4 (2013), 2341–2368

2013

[21] [21]

Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Amer Sinha, and Chiyuan Zhang. 2023. Sparsity-Preserving Differentially Pri- vate Training of Large Embedding Models. InAdvances in Neural Information Processing Systems, Vol. 36. 10951–10971

2023

[22] [22]

Sivakanth Gopi, Yin Tat Lee, and Lukas Wutschitz. 2021. Numerical Composition of Differential Privacy. InAdvances in Neural Information Processing Systems, Vol. 34. 11631–11642

2021

[23] [23]

Shlomi Hod, Lucas Rosenblatt, and Julia Stoyanovich. 2025. Do You Really Need Public Data? Surrogate Public Data for Differential Privacy on Tabular Data. arXiv preprint arXiv:2504.14368(2025)

arXiv 2025

[24] [24]

Yu, and Xuyun Zhang

Hongsheng Hu, Zoran Salcic, Lichao Sun, Gillian Dobbie, Philip S. Yu, and Xuyun Zhang. 2022. Membership Inference Attacks on Machine Learning: A Survey. Comput. Surveys54, 11s (2022), 1–37

2022

[25] [25]

Kaggle / EyePACS. 2015. EyePACS Diabetic Retinopathy Detection Dataset. https://www.kaggle.com/c/diabetic-retinopathy-detection

2015

[26] [26]

Peter Kairouz, Brendan McMahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, and Zheng Xu. 2021. Practical and Private (Deep) Learning with- out Sampling or Shuffling. InInternational Conference on Machine Learning (Proceedings of Machine Learning Research), Vol. 139. PMLR, 5213–5225

2021

[27] [27]

Anastasia Koloskova, Hadrien Hendrikx, and Sebastian U. Stich. 2023. Revisit- ing Gradient Clipping: Stochastic Bias and Tight Convergence Guarantees. In International Conference on Machine Learning (Proceedings of Machine Learning Research), Vol. 202. PMLR, 17343–17363

2023

[28] [28]

Yixuan Liu, Li Xiong, Yuhan Liu, Yujie Gu, Ruixuan Liu, and Hong Chen. 2024. DPDR: Gradient Decomposition and Reconstruction for Differentially Private Deep Learning.arXiv preprint arXiv:2406.02744(2024). arXiv:2406.02744 https: //arxiv.org/abs/2406.02744

arXiv 2024

[29] [29]

Linyuan Lü, Matúš Medo, Chi Ho Yeung, Yi-Cheng Zhang, Zi-Ke Zhang, and Tao Zhou. 2012. Recommender systems.Physics reports519, 1 (2012), 1–49

2012

[30] [30]

Alexander Selvikvåg Lundervold and Arvid Lundervold. 2019. An overview of deep learning in medical imaging focusing on MRI.Zeitschrift fuer medizinische Physik29, 2 (2019), 102–127

2019

[31] [31]

Mehdi Makni, Kayhan Behdin, Gabriel Afriat, Zheng Xu, Sergei Vassilvitskii, Natalia Ponomareva, Rahul Mazumder, and Hussein Hazimeh. 2025. SPARTA: An Optimization Framework for Differentially Private Sparse Fine-Tuning. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, 2090–2101

2025

[32] [32]

Ilya Mironov. 2017. Rényi Differential Privacy. InProceedings of the 2017 IEEE 30th Computer Security Foundations Symposium. IEEE Computer Society, 263–275

2017

[33] [33]

Marziyeh Mohammadi, Mohsen Vejdanihemmat, Mahshad Lotfinia, Mirabela Rusu, Daniel Truhn, Andreas Maier, and Soroosh Tayebi Arasteh. 2026. Differ- ential privacy for medical deep learning: methods, tradeoffs, and deployment implications.npj Digital Medicine9 (2026), 93

2026

[34] [34]

Sabrina Mokhtari, Sara Kodeiri, Shubhankar Mohapatra, Florian Tramèr, and Gautam Kamath. 2026. Rethinking Benchmarks for Differentially Private Image Classification.arXiv preprint arXiv:2601.17189(2026)

arXiv 2026

[35] [35]

Fahad Shamshad, Salman Khan, Syed Waqas Zamir, Muhammad Haris Khan, Munawar Hayat, Fahad Shahbaz Khan, and Huazhu Fu. 2023. Transformers in medical imaging: A survey.Medical Image Analysis88 (2023), 102802

2023

[36] [36]

Reza Shokri and Vitaly Shmatikov. 2015. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security. 1310–1321

2015

[37] [37]

Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference attacks against machine learning models. In2017 IEEE symposium on security and privacy. 3–18

2017

[38] [38]

Shuang Song, Kamalika Chaudhuri, and Anand D. Sarwate. 2013. Stochastic Gra- dient Descent with Differentially Private Updates. In2013 IEEE Global Conference on Signal and Information Processing. IEEE, 245–248

2013

[39] [39]

Florian Tramèr, Gautam Kamath, and Nicholas Carlini. 2024. Position: Consider- ations for Differentially Private Learning with Large-Scale Public Pretraining. InProceedings of the 41st International Conference on Machine Learning. PMLR, 48453–48467

2024

[40] [40]

Yu-Xiang Wang, Borja Balle, and Shiva Prasad Kasiviswanathan. 2019. Sub- sampled Rényi Differential Privacy and Analytical Moments Accountant. In Proceedings of the Twenty-Second International Conference on Artificial Intelli- gence and Statistics (Proceedings of Machine Learning Research), Vol. 89. PMLR, 1226–1235. https://proceedings.mlr.press/v89/wang19b.html

2019

[41] [41]

Likang Wu, Zhi Zheng, Zhaopeng Qiu, Hao Wang, Hongchao Gu, Tingjia Shen, Chuan Qin, Chen Zhu, Hengshu Zhu, Qi Liu, Hui Xiong, and Enhong Chen. 2024. A Survey on Large Language Models for Recommendation.World Wide Web (2024)

2024

[42] [42]

Alvarez, Jan Kautz, and Pavlo Molchanov

Hongxu Yin, Arun Mallya, Arash Vahdat, Jose M. Alvarez, Jan Kautz, and Pavlo Molchanov. 2021. See Through Gradients: Image Batch Recovery via GradInver- sion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 16337–16346

2021

[43] [43]

Da Yu, Huishuai Zhang, Wei Chen, and Tie-Yan Liu. 2021. Do Not Let Privacy Overbill Utility: Gradient Embedding Perturbation for Private Learning. InInter- national Conference on Learning Representations. https://openreview.net/forum? id=7aogOj_VYO0

2021

[44] [44]

Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, and Tie-Yan Liu. 2021. Large Scale Private Learning via Low-Rank Reparametrization. InProceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research), Vol. 139. PMLR, 12208–12218. https://proceedings.mlr.press/v139/ yu21f.html

2021

[45] [45]

Xinwei Zhang, Zhiqi Bu, Borja Balle, Mingyi Hong, Meisam Razaviyayn, and Vahab Mirrokni. 2025. DiSK: Differentially Private Optimizer with Simplified Kalman Filter for Noise Reduction. InInternational Conference on Learning Rep- resentations. https://openreview.net/forum?id=Lfy9q7Icp9

2025

[46] [46]

Xinwei Zhang, Zhiqi Bu, Mingyi Hong, and Meisam Razaviyayn. 2024. DOPPLER: Differentially Private Optimizers with Low-Pass Filter for Privacy Noise Reduction. InAdvances in Neural Information Processing Systems, Vol. 37. 41826–41851. https://proceedings.neurips.cc/paper_files/paper/2024/hash/ 49c466ccc038f39b08b1980a2b06673c-Abstract-Conference.html

2024

[47] [47]

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2026. A Survey of Large Language Models.Frontiers of Computer Science20, 12 (2026), 2012627. https://doi.org/10.1007/s11704-026-60308-3

work page doi:10.1007/s11704-026-60308-3 2026

[48] [48]

S Kevin Zhou, Hayit Greenspan, Christos Davatzikos, James S Duncan, Bram Van Ginneken, Anant Madabhushi, Jerry L Prince, Daniel Rueckert, and Ronald M Summers. 2021. A review of deep learning in medical imaging: Imaging traits, technology trends, case studies with progress highlights, and future promises. Proc. IEEE109, 5 (2021), 820–838

2021

[49] [49]

Yingxue Zhou, Zhiwei Steven Wu, and Arindam Banerjee. 2021. Bypassing the Ambient Dimension: Private SGD with Gradient Subspace Identification. In International Conference on Learning Representations. https://openreview.net/ forum?id=7dpmlkBuJFC

2021

[50] [50]

Blaschko

Junyi Zhu and Matthew B. Blaschko. 2023. Improving Differentially Private SGD via Randomly Sparsified Gradients.Transactions on Machine Learning Research (2023). https://openreview.net/forum?id=sY35BAiIf4

2023

[51] [51]

Ligeng Zhu, Zhijian Liu, and Song Han. 2019. Deep leakage from gradients. Advances in neural information processing systems32 (2019)

2019