On Optimal Hyperparameters for Differentially Private Deep Transfer Learning

Aki Rehn; Antti Honkela; Linzh Zhao; Mikko A. Heikkilä

arxiv: 2510.20616 · v2 · submitted 2025-10-23 · 💻 cs.LG

Pith reviewed 2026-05-18 04:37 UTC · model grok-4.3

classification 💻 cs.LG

keywords: differentially private learning · transfer learning · clipping bound · batch size · gradient distribution · differential privacy · hyperparameter tuning

The pith

Larger clipping bounds outperform theory under strong privacy in DP transfer learning due to gradient distribution shifts

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines choices of clipping bound C and batch size B when fine-tuning pretrained models under differential privacy. It identifies a mismatch where theory recommends smaller C for stronger privacy, yet experiments show larger C yields better results because per-sample gradient distributions change with added noise. Under a fixed-epoch compute budget, common heuristics for selecting B fail to predict performance while total accumulated DP noise provides a better explanation. A single fixed pair of C and B across tasks or privacy regimes produces suboptimal accuracy, especially when moving between loose and tight privacy or between high and low compute limits.

Core claim

The central claim is that clipping bound C should be chosen larger under stronger privacy constraints in practice, contrary to standard theory, because DP noise alters the distribution of gradient norms; under a fixed number of epochs, cumulative DP noise rather than per-step heuristics governs whether small or large batches are preferable; and clipping acts as a gradient re-weighting whose interaction with noise accumulation makes a universal (C, B) setting ineffective when privacy level or compute budget changes.

What carries the argument

Analysis of how DP noise changes per-sample gradient magnitude distributions, combined with cumulative DP noise over fixed epochs and the interpretation of clipping as gradient re-weighting
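
The re-weighting reading of clipping can be made concrete with a small sketch. It assumes standard flat per-sample clipping as used in DP-SGD, where each per-sample gradient g_i is scaled by w_i = min(1, C / ||g_i||). The function name `clip_weights` and the gradient norms are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def clip_weights(grad_norms, C):
    """Return w_i = min(1, C / ||g_i||), the factor flat clipping applies to sample i."""
    return [min(1.0, C / max(n, 1e-12)) for n in grad_norms]


# Hypothetical per-sample gradient norms, for illustration only.
norms = [0.5, 1.0, 2.0, 8.0]

for C in (0.5, 2.0):
    weights = clip_weights(norms, C)
    clipped = [w * n for w, n in zip(weights, norms)]
    print(f"C={C}: weights={weights}, clipped norms={clipped}")
```

With C = 0.5 every sample ends up with the same clipped norm, so samples with large gradients are down-weighted the most; with C = 2.0 only the largest gradient is rescaled.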

If this is right

A single (C, B) pair across tasks leads to clear performance loss when privacy budgets or epoch limits change.
Batch-size selection should be guided by total accumulated noise rather than existing per-step rules under fixed compute.
Larger C can be preferable to smaller C once noise alters gradient distributions under strong privacy.
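
To illustrate why a per-step view and a cumulative view can disagree under a fixed epoch budget, the sketch below does the bookkeeping for one run. It rests on assumptions that are not from the paper: noise of standard deviation sigma * C is added to the summed clipped gradients and divided by B, and "cumulative" noise is taken as the standard deviation of the sum of independent per-step noises, ignoring the learning rate. The paper's exact definition may differ. In practice sigma must come from a privacy accountant for each (B, epochs, epsilon, delta) and generally changes with B; here it is a plain input.

```python
import math


def fixed_epoch_noise(n_samples, epochs, batch_size, sigma, C):
    """Noise bookkeeping for a fixed-epoch DP-SGD run (illustrative definitions only)."""
    steps = math.ceil(n_samples / batch_size) * epochs
    # Per-coordinate noise std on the averaged gradient in a single step.
    per_step_std = sigma * C / batch_size
    # Std of the sum of `steps` independent per-step noises (assumed definition).
    cumulative_std = per_step_std * math.sqrt(steps)
    return steps, per_step_std, cumulative_std


# Hypothetical dataset size and settings; sigma is held fixed only to keep the example simple.
for B in (100, 1000):
    print(B, fixed_epoch_noise(n_samples=10_000, epochs=8, batch_size=B, sigma=1.0, C=1.0))
```

Because smaller batches mean more steps at a fixed number of epochs, the number of noise additions grows as B shrinks, which a per-step quantity alone does not capture.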

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Tuning C and B separately for each privacy level and compute regime is likely necessary instead of reusing values from prior runs.
Making the clipping bound adaptive to observed gradient statistics during training could reduce the observed mismatch without extra privacy cost.

Load-bearing premise

The analysis assumes training is limited to a fixed number of epochs rather than continued until convergence.

What would settle it

Demonstrating that the reported mismatch between theory and empirical results on C disappears when models are trained for many more epochs until validation performance plateaus would falsify the role of fixed-epoch cumulative noise.

Figures

Figures reproduced from arXiv: 2510.20616 by Aki Rehn, Antti Honkela, Linzh Zhao, Mikko A. Heikkilä.

**Figure 2.** Accuracy difference to the best, mean over 5 repeats with min–max bands; SUN397 dataset, …

**Figure 3.** Gradient norm distributions after training, ViT-Tiny on SUN397, 8 epochs; …

**Figure 4.** Per-class effects of gradient clipping under increasing task difficulty (left to right), SUN397, …

**Figure 5.** Comparison of AUTO-S (Bu et al., 2023) and properly tuned flat clipping, mean accuracy with min–max bands over 3 repeats, ViT-Tiny model, 8 epochs training for SUN397 and CIFAR-100, 32 epochs for Cassava; $\delta = 10^{-5}$. Batch size and learning rate were tuned over a fixed grid for both methods. AUTO-S performs notably worse on harder datasets (SUN397, Cassava), especially under tight privacy, as predicted by o…

**Figure 6.** Neither per-step averaged gradient noise standard deviation suggested by Ponomareva et al. …

**Figure 7.** Cumulative noise, batch size and the number …

Original abstract

Differentially private (DP) transfer learning, i.e., fine-tuning a pretrained model on private data, is the current state-of-the-art approach for training large models under privacy constraints. We focus on two key hyperparameters in this setting: the clipping bound $C$ and batch size $B$. We show a clear mismatch between the current theoretical understanding of how to choose an optimal $C$ (stronger privacy requires smaller $C$) and empirical outcomes (larger $C$ performs better under strong privacy), caused by changes in the gradient distributions. Assuming a limited compute budget (fixed epochs), we demonstrate that the existing heuristics for tuning $B$ do not work, while cumulative DP noise better explains whether smaller or larger batches perform better. We also highlight how the common practice of using a single $(C,B)$ setting across tasks can lead to suboptimal performance. We find that performance drops especially when moving between loose and tight privacy and between plentiful and limited compute, which we explain by analyzing clipping as a form of gradient re-weighting and examining cumulative DP noise.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Paper documents a reversal in optimal clipping bound for strong privacy in DP transfer learning and favors cumulative noise for batch size under fixed epochs, but the mechanism for the reversal needs tighter controls.

The main thing to know is that this paper documents a reversal in the optimal clipping bound C for differentially private transfer learning: theory predicts smaller C with stronger privacy, but experiments show larger C works better, which the authors link to changes in gradient distributions. They also find that under a fixed number of epochs, batch size B performance is better explained by cumulative DP noise than by standard heuristics.

The work does a good job highlighting practical issues in the dominant approach for private large-model training. It shows that using one (C, B) setting across different tasks, privacy levels, or compute budgets leads to suboptimal results, and it provides some analysis framing clipping as a form of gradient re-weighting. The cumulative noise account for batch size is a reasonable alternative explanation when compute is limited.

That said, the central explanation for the C finding could use more isolation. As the stress test notes, increasing C scales the noise magnitude directly because the added noise has standard deviation sigma times C, and sigma grows as epsilon gets smaller. Without pre- and post-noise gradient statistics or controls that hold the effective noise level constant, it's difficult to separate distribution shift from signal-to-noise changes. The paper would benefit from those checks to make the mechanism claim more convincing.

The empirical observations seem grounded in the abstract's description, and there is no obvious circularity since the claims come from observed behaviors rather than fitted models.

This is useful for practitioners who fine-tune large models under differential privacy constraints, particularly those dealing with sensitive data and needing to choose hyperparameters carefully. A reader interested in DP-SGD implementation details and hyperparameter sensitivity will get value from the targeted experiments and explanations.

It deserves a serious referee. The topic is relevant, the mismatch is worth documenting, and the proposed mechanisms are worth testing further even if some aspects need tightening. I would recommend sending it to peer review, with requests for additional ablations on the noise scaling and more quantitative results in the main text.

Referee Report

1 major / 2 minor

Summary. The manuscript examines optimal choices of clipping bound C and batch size B for differentially private transfer learning of pretrained models. It reports an empirical reversal where larger C outperforms under strong privacy (contrary to theory recommending smaller C for tighter epsilon), attributing this to shifts in gradient distributions. Under a fixed-epoch compute budget, it finds that standard B-tuning heuristics fail while cumulative DP noise accumulation better explains performance; it further shows that a single (C, B) pair is suboptimal when moving between privacy regimes or compute budgets, with supporting analysis of clipping as gradient re-weighting.

Significance. If the reported empirical patterns and mechanistic explanations hold after addressing potential confounds, the work would be useful for practitioners selecting hyperparameters in DP fine-tuning. The emphasis on cumulative noise and cross-regime sensitivity provides concrete guidance beyond generic DP-SGD rules, and the gradient-distribution account could inform future theoretical refinements.

major comments (1)

[§4] §4 (clipping-bound experiments): the central claim that gradient-distribution shifts explain why larger C is preferable under tight privacy is not isolated from the direct scaling of added noise (noise std = sigma * C, with sigma rising as epsilon falls). Without controls that hold effective noise magnitude fixed while varying C, or pre-/post-noise gradient-norm histograms, the experiments cannot distinguish distribution shift from signal-to-noise ratio changes as the operative mechanism.
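
The confound raised in this comment can be illustrated with a rough signal-to-noise proxy. The sketch assumes flat clipping with Gaussian noise of per-coordinate standard deviation sigma * C, and uses the sum of clipped norms as a best-case bound on the signal (reached only if all clipped gradients point the same way). The function `snr_proxy`, the gradient norms, and the dimension are hypothetical and not taken from the paper.

```python
import math


def snr_proxy(grad_norms, C, sigma, dim):
    """Best-case ratio of clipped-gradient signal to DP noise for one batch."""
    # Upper bound on the norm of the sum of clipped per-sample gradients.
    signal = sum(min(n, C) for n in grad_norms)
    # Root-mean-square norm of N(0, (sigma * C)^2 I) noise in `dim` dimensions.
    noise = sigma * C * math.sqrt(dim)
    return signal / noise


# Hypothetical per-sample gradient norms, for illustration only.
norms = [0.5, 1.0, 2.0, 8.0]

for C in (0.25, 0.5, 8.0, 16.0):
    print(f"C={C}: SNR proxy = {snr_proxy(norms, C, sigma=1.0, dim=4):.4f}")
```

While C stays below every gradient norm, signal and noise both scale with C and the proxy does not move; once C exceeds all norms, the signal stops growing and the proxy falls as 1/C. Any effect of C on accuracy therefore mixes this ratio with whatever the gradient distribution is doing, which is why a control at fixed sigma * C is informative.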

minor comments (2)

[Abstract] Abstract and §5: key quantitative results (accuracy deltas, standard deviations, dataset sizes, number of runs) are not summarized; adding a short table or effect-size statements would improve readability and allow readers to assess practical importance without consulting the full figures.
Figure captions and legends: several plots comparing privacy levels lack explicit indication of whether error bars represent standard deviation across seeds or across tasks; clarifying this would aid interpretation of the cross-regime claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The major comment identifies a valid potential confound in the clipping-bound experiments of §4. We address this point directly below and agree that additional controls will strengthen the mechanistic claims.

Referee: [§4] §4 (clipping-bound experiments): the central claim that gradient-distribution shifts explain why larger C is preferable under tight privacy is not isolated from the direct scaling of added noise (noise std = sigma * C, with sigma rising as epsilon falls). Without controls that hold effective noise magnitude fixed while varying C, or pre-/post-noise gradient-norm histograms, the experiments cannot distinguish distribution shift from signal-to-noise ratio changes as the operative mechanism.

Authors: We agree that the experiments as presented do not fully isolate gradient-distribution shifts from changes in effective noise magnitude, since sigma is scaled to achieve the target epsilon and noise std therefore grows with C. Our current results show larger C outperforming under tight privacy alongside observed changes in gradient-norm distributions, but we acknowledge this leaves open the possibility that SNR effects contribute. To address the concern, we will add (i) experiments that hold sigma * C approximately fixed while varying C (by appropriate adjustment of the privacy accountant where feasible) and (ii) pre- and post-noise gradient-norm histograms across privacy regimes. These additions will be included in the revised manuscript and should clarify the relative contributions of distribution shift versus noise scaling without altering the main empirical finding that larger C is preferable under tight privacy.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical analysis of DP hyperparameters

The paper presents empirical observations on mismatches between theoretical guidance for clipping bound C and batch size B versus observed performance in differentially private transfer learning. Explanations rely on measured changes in gradient distributions and cumulative DP noise under fixed-epoch budgets, without any mathematical derivation chain, fitted parameters renamed as predictions, or load-bearing self-citations. Claims are supported by direct experimental results rather than reducing to inputs by construction, rendering the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central observations rest on the assumption that gradient distributions shift with privacy strength and that total compute is fixed in epochs; no new entities or free parameters are introduced in the abstract.

axioms (2)

domain assumption Gradient distributions change meaningfully with privacy strength (epsilon).
Invoked to explain why larger C outperforms under strong privacy.
domain assumption Total compute budget is fixed in number of epochs.
Stated explicitly when discussing batch-size heuristics.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Theorem 5.1. The optimal clipping constant $C^*$ that minimizes the mean squared error between the per-sample clipped DP gradient $\tilde{g}$ and the true gradient $g$ for a fixed minibatch satisfies $C^* = N_C^\top G_C / (N_C^\top N_C + \sigma^2 d)$.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

We show a clear mismatch between the current theoretical understanding of how to choose an optimal C (stronger privacy requires smaller C) and empirical outcomes (larger C performs better under strong privacy), caused by changes in the gradient distributions.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DPrivBench: Benchmarking LLMs' Reasoning for Differential Privacy
cs.LG 2026-04 unverdicted novelty 7.0

DPrivBench shows that top LLMs handle basic differential privacy mechanisms but fail on advanced algorithms, exposing gaps in automated DP reasoning.
DPrivBench: Benchmarking LLMs' Reasoning for Differential Privacy
cs.LG 2026-04 accept novelty 7.0

DPrivBench is a new benchmark for evaluating LLMs on differential privacy reasoning, with results showing good performance on textbook mechanisms but substantial failures on advanced algorithms.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] Martín Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, ... doi: 10.1145/2976749.2978318.

[2] Kareem Amin, Alex Kulesza, Andres Munoz, and Sergei Vassilvitskii. Bounding User Contributions: A Bias-Variance Trade-off in Differential Privacy. In Proceedings of the 36th International Conference on Machine Learning, ICML. PMLR, ...

[3] Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. Automatic Clipping: Differentially Private Deep Learning Made Easier and Stronger. In The Thirty-seventh Conference on Neural Information Processing Systems, NeurIPS, ...

[4] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In The 9th International Conference...

[5] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating Noise to Sensitivity in Private Data Analysis. In Advances in Cryptology - 25th International Conference on the Theory and Applications of Cryptographic Techniques, EUROCRYPT 2006, volume ...

[6] Maria S. Esipova, Atiyeh Ashari Ghomi, Yaqiao Luo, and Jesse C. Cresswell. Disparate Impact in Differential Privacy from Gradient Misalignment. In The 11th International Conference on Learning Representations, ICLR, ...

[7] Ruixuan Liu and Zhiqi Bu. Towards hyperparameter-free optimization with differential privacy. In The 13th International Conference on Learning Representations, ICLR, ...

[8] Harsh Mehta, Abhradeep Thakurta, Alexey Kurakin, and Ashok Cutkosky. Towards Large Scale Transfer Learning for Differentially Private Image Classification. Transactions on Machine Learning Research, 2023, ...

[9] Ernest Mwebaze, Timnit Gebru, Andrea Frome, Solomon Nsumba, and Jeremy Tusubira. iCassava 2019 Fine-Grained Visual Categorization Challenge, ... URL http://arxiv.org/abs/1908.02900.

[10] Ashwinee Panda, Xinyu Tang, Saeed Mahloujifar, Vikash Sehwag, and Prateek Mittal. A New Linear Scaling Rule for Private Adaptive Hyperparameter Optimization. In The Forty-first International Conference on Machine Learning, ICML, ...

[11] Natalia Ponomareva, Hussein Hazimeh, Alex Kurakin, Zheng Xu, Carson Denison, H. Brendan McMahan, Sergei Vassilvitskii, Steve Chien, and Abhradeep Thakurta. How to DP-fy ML: A Practical Guide to Machine Learning with Differential Privacy. Journal of Artificial Intelligence Research, 77, ... doi: 10.1613/jair.1.14649.

[12] Ossi Räisä, Joonas Jälkö, and Antti Honkela. Subsampling is not Magic: Why Large Batch Sizes Work for Differentially Private Stochastic Optimisation. In The Forty-first International Conference on Machine Learning, ICML, ...

[13] Andreas Peter Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers. Transactions on Machine Learning Research, ...

[14] Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. SUN Database: Large-scale scene recognition from Abbey to Zoo. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR, ...

[15] Yang You, Jing Li, Sashank J. Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. In The 8th International Conference on Learning Representations, ICLR. OpenReview.net, ...

[16] Xinwei Zhang, Zhiqi Bu, Zhiwei Steven Wu, and Mingyi Hong. Differentially private SGD without clipping bias: An error-feedback approach. In Proceedings of the International Conference on Learning Representations, ICLR, ...

[17] A LLM Usage. We used large language models (LLMs), including OpenAI's ChatGPT and GitHub Copilot, at various points during the development of this paper. These tools assisted with grammar and phrasing, clarification of technical concepts, code generation and refactoring code for data processing, filtering, and visualization, as well as interpretation of ...
