
arxiv: 2606.04375 · v1 · pith:VEFYNJ2Dnew · submitted 2026-06-03 · 💻 cs.LG · stat.ML

When Do Fewer Coordinates Suffice in DP-SGD?

Pith reviewed 2026-06-28 07:01 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords DP-SGD · coordinate restriction · differential privacy · TopK selection · nonconvex stationarity · warm-up phase · gradient energy

The pith

When a criterion holds, DP-SGD can restrict updates to k coordinates so noise scales with k rather than full dimension d.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DP-SGD adds noise to every coordinate on every step, so the injected noise energy scales with the full parameter dimension d. The paper identifies a condition under which restricting updates to a smaller active set of k coordinates remains useful for optimization. It introduces TP-TopK, a two-phase procedure in which a private warm-up phase ranks coordinates and selects the support used in the main training phase. Under that condition, a nonconvex stationarity bound shows that the noise term depends on k rather than d. Experiments on MNIST, FMNIST, and CIFAR-10 show that supports chosen from warm-up scores can retain more gradient energy than random subsets of equal size, with the largest gains when k is small and the warm-up scores are informative.

Core claim

We give a criterion characterizing when coordinate restriction can be beneficial, show via a nonconvex stationarity bound that under this condition the relevant noise term scales with the active dimension k rather than the full parameter dimension d, and provide a lower bound on the reliability of warm-up-based coordinate ranking.
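The k-versus-d scaling can be illustrated with the standard Gaussian-mechanism calculation below. This is a sketch under assumed notation, not the paper's stationarity bound: the clipping norm \(C\), batch size \(B\), and noise multiplier \(\sigma\) are placeholders, and the noise is assumed to be added to the sum of clipped per-sample gradients before averaging.

```latex
% Assumptions (not taken from the paper): clipping norm C, batch size B,
% noise multiplier \sigma, isotropic Gaussian noise on the averaged gradient.
\[
  z \sim \mathcal{N}\!\left(0,\ \tfrac{\sigma^2 C^2}{B^2} I_d\right)
  \quad\Longrightarrow\quad
  \mathbb{E}\,\|z\|_2^2 = d\,\frac{\sigma^2 C^2}{B^2},
\]
\[
  z_A \sim \mathcal{N}\!\left(0,\ \tfrac{\sigma^2 C^2}{B^2} I_k\right)
  \quad\Longrightarrow\quad
  \mathbb{E}\,\|z_A\|_2^2 = k\,\frac{\sigma^2 C^2}{B^2},
  \qquad |A| = k \le d .
\]
```

With the same \(\sigma\), \(C\), and \(B\) in both cases, the expected noise energy shrinks by a factor of \(k/d\); whether that helps optimization depends on how much gradient signal the support \(A\) keeps.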

What carries the argument

TP-TopK, the two-phase method that runs a private warm-up phase to identify a coordinate support for the main training phase.

If this is right

  • Under the stated criterion the noise term in the stationarity bound scales with k rather than d.
  • Learned coordinate supports from the warm-up retain more gradient energy than size-matched random supports.
  • The largest gains appear when the active dimension is small and the warm-up scores are informative.
  • A lower bound is given on the reliability of warm-up-based coordinate ranking.
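The comparison between learned and size-matched random supports can be sketched as follows. The energy metric used here, the fraction of squared gradient norm that falls on the support, is an assumption (the review notes that the paper's exact metric is not stated), and the power-law gradient and the additive score noise are synthetic stand-ins rather than the paper's data.

```python
import numpy as np

def retained_energy(g, support):
    """Assumed metric: ||g_A||^2 / ||g||^2 for a coordinate support A."""
    return float(np.sum(g[support] ** 2) / np.sum(g ** 2))

rng = np.random.default_rng(0)
d, k = 10_000, 100

# Synthetic gradient with power-law coordinate magnitudes (illustrative only).
g = rng.permutation(1.0 / np.arange(1, d + 1))

# Hypothetical warm-up scores: true squared magnitudes plus Gaussian noise.
scores = g ** 2 + rng.normal(0.0, 1e-4, size=d)
learned = np.argpartition(scores, -k)[-k:]  # indices of the k largest scores

e_learned = retained_energy(g, learned)

# Average over many random supports of the same size k.
e_random = float(np.mean([
    retained_energy(g, rng.choice(d, size=k, replace=False))
    for _ in range(200)
]))

print(f"learned support: {e_learned:.3f}, random support (mean): {e_random:.3f}")
```

In this toy setting the gap is large because the gradient is concentrated on few coordinates; a flatter gradient profile or noisier scores would narrow it.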

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the criterion can be verified cheaply, private training could automatically drop uninformative coordinates to lower total noise.
  • The same warm-up idea could be tested in other per-coordinate noise mechanisms beyond DP-SGD.
  • Checking whether the criterion holds on large language models would show whether the k-scaling benefit survives beyond the image-classification regimes examined.

Load-bearing premise

The private warm-up phase must produce coordinate scores informative enough that the selected support retains sufficient gradient energy for the main optimization phase to succeed.

What would settle it

An experiment in which the nonconvex stationarity bound fails to improve when restricting to the warm-up-selected k coordinates, or in which the warm-up ranking reliability falls below the stated lower bound on real models.

Figures

Figures reproduced from arXiv: 2606.04375 by Fang Xie, Huiqi Zhang.

Figure 2: Two-phase training pipeline of TP-TopK and its RDP accounting structure. Phase 1 runs full-parameter DP-SGD with noise multiplier \(\sigma_1\), accumulating coordinate scores \(s_p\); the denoised score \(a_p = s_p/T_1 - (\sigma_1 C_1/B)^2\) is post-processed to select \(A = \mathrm{TopK}(a, k)\) at no additional privacy cost. Phase 2 restricts DP-SGD to the \(k\) active coordinates in \(A\) with noise multiplier \(\sigma_2\), leaving frozen coordinates \(\bar{A}\) unchang…
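The pipeline in the Figure 2 caption can be sketched roughly as below. Only the denoising formula and the TopK selection come from the caption. The way scores are accumulated (sum of squared noisy averaged gradients), the helper names `accumulate_scores`, `denoised_scores`, and `top_k`, the Phase 2 constants `sigma2` and `C2`, and the synthetic gradients are all assumptions made for illustration. No privacy accounting is performed here.

```python
import numpy as np

def accumulate_scores(mean_grads, sigma1, C1, B, rng):
    """Phase 1 stand-in: s_p = sum over T1 steps of (noisy mean gradient)_p ** 2.

    mean_grads has shape (T1, d) and is assumed to hold clipped, batch-averaged
    gradients; Gaussian noise with std sigma1 * C1 / B is added per coordinate.
    """
    noise = rng.normal(0.0, sigma1 * C1 / B, size=mean_grads.shape)
    return np.sum((mean_grads + noise) ** 2, axis=0)

def denoised_scores(s, T1, sigma1, C1, B):
    """a_p = s_p / T1 - (sigma1 * C1 / B) ** 2, as written in the caption."""
    return s / T1 - (sigma1 * C1 / B) ** 2

def top_k(a, k):
    """Sorted indices of the k largest denoised scores (post-processing only)."""
    return np.sort(np.argpartition(a, -k)[-k:])

# Synthetic demo (assumed setup): 20 informative coordinates out of d = 1000.
rng = np.random.default_rng(0)
T1, d, k = 200, 1000, 20
sigma1, C1, B = 1.0, 1.0, 10
true_grad = np.zeros(d)
true_grad[:k] = 0.2                      # norm below C1, consistent with clipping
mean_grads = np.tile(true_grad, (T1, 1))

s = accumulate_scores(mean_grads, sigma1, C1, B, rng)
a = denoised_scores(s, T1, sigma1, C1, B)
A = top_k(a, k)

# Phase 2 stand-in: one step that updates and noises only the coordinates in A.
sigma2, C2, lr = 1.0, 1.0, 0.1           # hypothetical Phase 2 settings
theta = np.zeros(d)
step = np.zeros(d)
step[A] = true_grad[A] + rng.normal(0.0, sigma2 * C2 / B, size=k)
theta -= lr * step                       # frozen coordinates stay unchanged
```

Under the accumulation assumption above, subtracting \((\sigma_1 C_1/B)^2\) removes the expected contribution of the injected noise from each averaged squared score, so uninformative coordinates have denoised scores centered near zero.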
Original abstract

Differentially private stochastic gradient descent (DP-SGD) injects noise into every updated coordinate, making the injected noise energy scale with the ambient parameter dimension \(d\). We ask when private training can update fewer coordinates without losing the signal needed for optimization. We propose \textsc{TP-TopK} (Two-Phase TopK DP-SGD), a two-phase method for coordinate-sparse private training without public data, in which a private warm-up phase identifies a coordinate support used to guide the main training phase. We give a criterion characterizing when coordinate restriction can be beneficial, show via a nonconvex stationarity bound that under this condition the relevant noise term scales with the active dimension \(k\) rather than the full parameter dimension \(d\), and provide a lower bound on the reliability of warm-up-based coordinate ranking. Experiments on MNIST, FMNIST, and CIFAR-10 show that learned coordinate supports can retain more gradient energy than size-matched random supports, with the largest gains when the active dimension is small and warm-up scores are informative.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces TP-TopK, a two-phase DP-SGD algorithm consisting of a private warm-up phase that ranks and selects a coordinate support of size k, followed by a main training phase restricted to those coordinates. It states a criterion under which coordinate restriction is beneficial, derives a non-convex stationarity bound showing that the noise term then scales with k rather than the full dimension d, supplies a lower bound on the reliability of the warm-up ranking, and reports experiments on MNIST, FMNIST, and CIFAR-10 in which learned supports retain more gradient energy than size-matched random supports, with larger gains at small k when warm-up scores are informative.

Significance. If the stationarity bound holds under the stated criterion and the warm-up lower bound is shown to guarantee sufficient retained gradient energy, the work would offer a concrete, public-data-free route to reducing the noise penalty in high-dimensional DP-SGD. The explicit criterion, the non-convex analysis, and the direct comparison of learned versus random supports are positive features; the absence of public data makes the approach relevant to settings where auxiliary data are unavailable.

major comments (2)
  1. [stationarity bound and reliability lower bound] The non-convex stationarity bound (abstract and the derivation referenced in the main text): the bound is stated to deliver a k-versus-d noise improvement only under the paper's criterion, yet the criterion itself presupposes that the selected support retains enough gradient energy. The supplied lower bound on warm-up ranking reliability is described only as 'partial support' for this premise; it is not shown whether the bound guarantees a (1-ε) fraction of total gradient energy with high probability, which is required for the k-scaling claim to remain load-bearing when selection error is present.
  2. [experiments on MNIST, FMNIST, CIFAR-10] Experimental section: results are reported without error bars, without the exact values of k tested, and without an explicit statement of the exclusion rule or energy metric used to declare that 'learned supports retain more gradient energy.' These omissions prevent assessment of whether the observed gains are statistically reliable or whether they occur in the regime where the k-versus-d benefit is theoretically predicted.
minor comments (1)
  1. Notation for the active dimension k and the reliability probability should be introduced once and used consistently; the abstract uses both without an explicit definition of the energy-retention threshold.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback and positive assessment of the paper's potential contribution. We address each major comment below, indicating planned revisions where appropriate.

point-by-point responses
  1. Referee: [stationarity bound and reliability lower bound] The non-convex stationarity bound (abstract and the derivation referenced in the main text): the bound is stated to deliver a k-versus-d noise improvement only under the paper's criterion, yet the criterion itself presupposes that the selected support retains enough gradient energy. The supplied lower bound on warm-up ranking reliability is described only as 'partial support' for this premise; it is not shown whether the bound guarantees a (1-ε) fraction of total gradient energy with high probability, which is required for the k-scaling claim to remain load-bearing when selection error is present.

    Authors: The stationarity bound is derived under an explicit criterion that the selected support must retain sufficient gradient energy for the k-scaling to hold; this is presented as a sufficient condition rather than an unconditional claim. The lower bound on warm-up reliability is offered as partial analytical support for the criterion being plausible, showing that the ranking procedure has non-trivial reliability properties, but it does not claim or prove a high-probability (1-ε) energy retention guarantee. We will revise the manuscript to more explicitly separate the criterion (under which the bound applies) from the reliability analysis and to discuss the effect of possible selection error on the noise scaling. Strengthening the reliability result to a full (1-ε) guarantee would require additional assumptions on gradient distributions that are outside the current scope. revision: partial

  2. Referee: [experiments on MNIST, FMNIST, CIFAR-10] Experimental section: results are reported without error bars, without the exact values of k tested, and without an explicit statement of the exclusion rule or energy metric used to declare that 'learned supports retain more gradient energy.' These omissions prevent assessment of whether the observed gains are statistically reliable or whether they occur in the regime where the k-versus-d benefit is theoretically predicted.

    Authors: We agree that the experimental reporting lacks these details. In the revised version we will add error bars computed over multiple independent runs, list the precise k values used for each dataset, and provide an explicit definition of the gradient energy metric together with any coordinate exclusion rules applied when computing retained energy. These additions will make it possible to evaluate statistical reliability and alignment with the theoretical regime. revision: yes

standing simulated objections not resolved
  • The lower bound on warm-up ranking reliability does not currently guarantee retention of a (1-ε) fraction of total gradient energy with high probability; providing such a guarantee would require further assumptions or analysis not present in the manuscript.
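The unresolved question, whether a selected support keeps a (1-ε) fraction of gradient energy with high probability, can at least be probed empirically. The sketch below is a Monte Carlo estimate under a synthetic model, not a proof and not the paper's analysis; the function name `retention_frequency`, the additive Gaussian score noise, and the toy gradient are assumptions.

```python
import numpy as np

def retention_frequency(g, k, noise_std, eps, trials, rng):
    """Fraction of trials in which the top-k support chosen from noisy scores
    retains at least (1 - eps) of the total squared gradient norm."""
    total = np.sum(g ** 2)
    hits = 0
    for _ in range(trials):
        # Hypothetical warm-up scores: squared magnitudes plus Gaussian noise.
        scores = g ** 2 + rng.normal(0.0, noise_std, size=g.shape)
        support = np.argpartition(scores, -k)[-k:]
        if np.sum(g[support] ** 2) >= (1.0 - eps) * total:
            hits += 1
    return hits / trials

rng = np.random.default_rng(0)
d, k = 500, 10
g = np.zeros(d)
g[:k] = 1.0  # toy gradient: all energy sits on k coordinates

freq_low_noise = retention_frequency(g, k, noise_std=0.05, eps=0.1, trials=200, rng=rng)
freq_high_noise = retention_frequency(g, k, noise_std=5.0, eps=0.1, trials=200, rng=rng)
print(freq_low_noise, freq_high_noise)
```

The two noise levels bracket the regimes the referee distinguishes: informative scores, where selection error is negligible, and uninformative scores, where the selected support behaves like a random one.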

Circularity Check

0 steps flagged

No significant circularity; derivations follow from stated criterion and standard analysis

full rationale

The paper states a criterion for when coordinate restriction benefits DP-SGD, then derives a nonconvex stationarity bound under that criterion (showing noise scaling with active dimension k) and a lower bound on warm-up ranking reliability. These follow from the criterion plus standard non-convex analysis without any reduction by the paper's own equations to fitted quantities, self-citations, or definitional tautologies. No load-bearing step renames a fit as a prediction or imports uniqueness via author self-citation; the central claims remain independent of the paper's inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is inferred at a high level from the described method and theory; no explicit free parameters, axioms, or invented entities are stated.

free parameters (1)
  • active dimension k
    The number of coordinates retained after warm-up; its value is chosen according to the criterion and appears tuned per dataset and experiment.
axioms (1)
  • domain assumption: There exists a small coordinate support that captures sufficient gradient energy for optimization progress
    This premise is required for the warm-up phase to be useful and for the stationarity bound to show benefit from restricting to k coordinates.
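One way to check the domain assumption on a given gradient is to ask how many coordinates are needed to capture a target fraction of its energy. The helper below is a hypothetical diagnostic, not something described in the paper, and it operates on a non-private gradient vector, so it is only suitable for offline analysis.

```python
import numpy as np

def min_support_size(g, fraction):
    """Smallest k such that the k largest-magnitude coordinates of g hold at
    least `fraction` of the squared norm ||g||^2."""
    energy = np.sort(g ** 2)[::-1]                  # descending coordinate energy
    cumulative = np.cumsum(energy) / energy.sum()   # running share of the energy
    k = int(np.searchsorted(cumulative, fraction)) + 1
    return min(k, g.size)                           # guard against float round-off

g_example = np.array([3.0, 0.0, 4.0, 0.0])          # energies 9, 0, 16, 0
k_half = min_support_size(g_example, 0.5)
k_most = min_support_size(g_example, 0.9)
```

If the returned k is close to d for the fractions of interest, the premise fails for that gradient and coordinate restriction is unlikely to help.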

pith-pipeline@v0.9.1-grok · 5706 in / 1474 out tokens · 31014 ms · 2026-06-28T07:01:55.393556+00:00 · methodology

