
arxiv: 2510.20616 · v2 · submitted 2025-10-23 · 💻 cs.LG

On Optimal Hyperparameters for Differentially Private Deep Transfer Learning

Pith reviewed 2026-05-18 04:37 UTC · model grok-4.3

classification 💻 cs.LG
keywords differentially private learning · transfer learning · clipping bound · batch size · gradient distribution · differential privacy · hyperparameter tuning

The pith

Larger clipping bounds outperform theory under strong privacy in DP transfer learning due to gradient distribution shifts

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines choices of clipping bound C and batch size B when fine-tuning pretrained models under differential privacy. It identifies a mismatch where theory recommends smaller C for stronger privacy, yet experiments show larger C yields better results because per-sample gradient distributions change with added noise. Under a fixed-epoch compute budget, common heuristics for selecting B fail to predict performance while total accumulated DP noise provides a better explanation. A single fixed pair of C and B across tasks or privacy regimes produces suboptimal accuracy, especially when moving between loose and tight privacy or between high and low compute limits.

Core claim

The paper makes three linked claims. First, in practice the clipping bound C should be chosen larger under stronger privacy constraints, contrary to standard theory, because DP noise alters the distribution of gradient norms. Second, under a fixed number of epochs, cumulative DP noise rather than per-step heuristics governs whether small or large batches are preferable. Third, clipping acts as a gradient re-weighting, and its interaction with noise accumulation makes a universal (C, B) setting ineffective when the privacy level or compute budget changes.

What carries the argument

Analysis of how DP noise changes per-sample gradient magnitude distributions, combined with cumulative DP noise over fixed epochs and the interpretation of clipping as gradient re-weighting
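
To make the "clipping as gradient re-weighting" reading concrete, here is a minimal sketch of flat per-sample clipping in the standard DP-SGD form (clip each per-sample gradient to norm at most C, sum, add Gaussian noise with standard deviation sigma * C, average). The function names and the toy gradient norms are illustrative assumptions, not taken from the paper or its code.

```python
import numpy as np

def clip_weights(grad_norms, C):
    """Per-sample factors min(1, C / ||g_i||) that flat clipping applies."""
    grad_norms = np.asarray(grad_norms, dtype=float)
    # Guard against division by zero for all-zero gradients.
    return np.minimum(1.0, C / np.maximum(grad_norms, 1e-12))

def dp_sgd_direction(per_sample_grads, C, sigma, rng):
    """Clipped, noised, averaged gradient for one batch (standard DP-SGD form)."""
    g = np.asarray(per_sample_grads, dtype=float)
    w = clip_weights(np.linalg.norm(g, axis=1), C)
    clipped_sum = (w[:, None] * g).sum(axis=0)
    # Noise standard deviation scales with the clipping bound C.
    noise = rng.normal(0.0, sigma * C, size=g.shape[1])
    return (clipped_sum + noise) / g.shape[0]

# Toy gradient norms (hypothetical values) to show the re-weighting effect.
norms = np.array([0.1, 0.5, 2.0, 10.0])
for C in (0.1, 1.0, 10.0):
    w = clip_weights(norms, C)
    contribution = w * norms  # magnitude each sample contributes after clipping
    print(f"C={C}: weights={w}, share={contribution / contribution.sum()}")
```

In this toy example, a bound at or below the smallest norm (C = 0.1) makes every sample contribute the same magnitude, while a bound at or above the largest norm (C = 10) leaves the gradients unweighted, so samples with large gradients dominate the update.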

If this is right

  • A single (C, B) pair across tasks leads to clear performance loss when privacy budgets or epoch limits change.
  • Batch-size selection should be guided by total accumulated noise rather than existing per-step rules under fixed compute.
  • Larger C can be preferable to smaller C once noise alters gradient distributions under strong privacy.
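
The contrast between per-step rules and total accumulated noise can be sketched as follows. This summary does not give the paper's exact definition of cumulative noise, so the proxy below is an assumption: it treats the noise in each averaged gradient as independent Gaussian with standard deviation sigma * C / B and accumulates it over all steps of a fixed-epoch run. All function names are hypothetical, and sigma is passed in by hand; in a real run it would come from a privacy accountant for the target (epsilon, delta), sampling rate, and step count.

```python
import math

def steps_for_fixed_epochs(n_samples, batch_size, epochs):
    """Number of optimizer steps when the compute budget is a fixed epoch count."""
    return epochs * math.ceil(n_samples / batch_size)

def per_step_noise_std(sigma, C, batch_size):
    """Noise std on the averaged gradient of one step: sigma * C / B."""
    return sigma * C / batch_size

def cumulative_noise_std(sigma, C, batch_size, n_steps, lr=1.0):
    """Assumed proxy: std of the sum of n_steps independent noise terms, scaled by lr."""
    return lr * per_step_noise_std(sigma, C, batch_size) * math.sqrt(n_steps)

# Illustrative comparison; sigma is held fixed here only to isolate the B and T terms.
n_samples, epochs, sigma, C = 50_000, 8, 1.0, 1.0
for batch_size in (256, 1024, 4096):
    T = steps_for_fixed_epochs(n_samples, batch_size, epochs)
    print(batch_size, T,
          per_step_noise_std(sigma, C, batch_size),
          cumulative_noise_std(sigma, C, batch_size, T))
```

Because the step count shrinks as B grows under a fixed-epoch budget, and the calibrated sigma itself changes with B and the step count, the per-step and cumulative quantities need not rank batch sizes the same way.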

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Tuning C and B separately for each privacy level and compute regime is likely necessary instead of reusing values from prior runs.
  • Making the clipping bound adaptive to observed gradient statistics during training could reduce the observed mismatch without extra privacy cost.

Load-bearing premise

The analysis assumes training is limited to a fixed number of epochs rather than continued until convergence.

What would settle it

Demonstrating that the reported mismatch between theory and empirical results on C disappears when models are trained for many more epochs until validation performance plateaus would falsify the role of fixed-epoch cumulative noise.

Figures

Figures reproduced from arXiv: 2510.20616 by Aki Rehn, Antti Honkela, Linzh Zhao, Mikko A. Heikkilä.

Figure 1: Macro accuracy heatmaps under increasing learning-problem difficulty (left to right): strong …
Figure 2: Accuracy difference to the best, mean over 5 repeats with min–max bands; SUN397 dataset, …
Figure 3: Gradient norm distributions after training, ViT-Tiny on SUN397, 8 epochs; …
Figure 4: Per-class effects of gradient clipping under increasing task difficulty (left to right), SUN397, …
Figure 5: Comparison of AUTO-S (Bu et al., 2023) and properly tuned flat clipping, mean accuracy with min–max bands over 3 repeats, ViT-Tiny model, 8 epochs training for SUN397 and CIFAR-100, 32 epochs for Cassava; $\delta = 10^{-5}$. Batch size and learning rate were tuned over a fixed grid for both methods. AUTO-S performs notably worse on harder datasets (SUN397, Cassava), especially under tight privacy, as predicted by o…
Figure 6: Neither per-step averaged gradient noise standard deviation suggested by Ponomareva et al. …
Figure 7: Cumulative noise, batch size and the number …
Original abstract

Differentially private (DP) transfer learning, i.e., fine-tuning a pretrained model on private data, is the current state-of-the-art approach for training large models under privacy constraints. We focus on two key hyperparameters in this setting: the clipping bound $C$ and batch size $B$. We show a clear mismatch between the current theoretical understanding of how to choose an optimal $C$ (stronger privacy requires smaller $C$) and empirical outcomes (larger $C$ performs better under strong privacy), caused by changes in the gradient distributions. Assuming a limited compute budget (fixed epochs), we demonstrate that the existing heuristics for tuning $B$ do not work, while cumulative DP noise better explains whether smaller or larger batches perform better. We also highlight how the common practice of using a single $(C,B)$ setting across tasks can lead to suboptimal performance. We find that performance drops especially when moving between loose and tight privacy and between plentiful and limited compute, which we explain by analyzing clipping as a form of gradient re-weighting and examining cumulative DP noise.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript examines optimal choices of clipping bound C and batch size B for differentially private transfer learning of pretrained models. It reports an empirical reversal where larger C outperforms under strong privacy (contrary to theory recommending smaller C for tighter epsilon), attributing this to shifts in gradient distributions. Under a fixed-epoch compute budget, it finds that standard B-tuning heuristics fail while cumulative DP noise accumulation better explains performance; it further shows that a single (C, B) pair is suboptimal when moving between privacy regimes or compute budgets, with supporting analysis of clipping as gradient re-weighting.

Significance. If the reported empirical patterns and mechanistic explanations hold after addressing potential confounds, the work would be useful for practitioners selecting hyperparameters in DP fine-tuning. The emphasis on cumulative noise and cross-regime sensitivity provides concrete guidance beyond generic DP-SGD rules, and the gradient-distribution account could inform future theoretical refinements.

major comments (1)
  1. [§4] §4 (clipping-bound experiments): the central claim that gradient-distribution shifts explain why larger C is preferable under tight privacy is not isolated from the direct scaling of added noise (noise std = sigma * C, with sigma rising as epsilon falls). Without controls that hold effective noise magnitude fixed while varying C, or pre-/post-noise gradient-norm histograms, the experiments cannot distinguish distribution shift from signal-to-noise ratio changes as the operative mechanism.
minor comments (2)
  1. [Abstract] Abstract and §5: key quantitative results (accuracy deltas, standard deviations, dataset sizes, number of runs) are not summarized; adding a short table or effect-size statements would improve readability and allow readers to assess practical importance without consulting the full figures.
  2. Figure captions and legends: several plots comparing privacy levels lack explicit indication of whether error bars represent standard deviation across seeds or across tasks; clarifying this would aid interpretation of the cross-regime claims.
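
The confound raised in the major comment can be written out explicitly. The following is a sketch assuming the standard flat-clipping DP-SGD update (per-sample clipping to norm C, Gaussian noise with standard deviation sigma * C added to the clipped sum); the notation $g_i$, $z$, $z'$ is introduced here for illustration and is not taken from the paper.

```latex
% Noised gradient sum for a batch of size B under flat clipping:
\tilde{g} = \sum_{i=1}^{B} \min\!\left(1, \frac{C}{\lVert g_i \rVert}\right) g_i + z,
\qquad z \sim \mathcal{N}\!\left(0, \sigma^2 C^2 I\right)

% If every per-sample gradient is clipped, i.e. \lVert g_i \rVert \ge C for all i:
\tilde{g} = C \left( \sum_{i=1}^{B} \frac{g_i}{\lVert g_i \rVert} + z' \right),
\qquad z' \sim \mathcal{N}\!\left(0, \sigma^2 I\right)
```

In the fully clipped regime C is a pure scale factor that can be absorbed into the learning rate, and the signal-to-noise ratio depends on sigma alone. Once C exceeds some gradient norms, the signal term grows more slowly than linearly in C while the noise term still grows linearly, so at fixed sigma the signal-to-noise ratio changes with C. This is why a control that holds sigma * C fixed, or that reports the fraction of clipped samples, would help separate distribution shift from plain noise scaling.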

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The major comment identifies a valid potential confound in the clipping-bound experiments of §4. We address this point directly below and agree that additional controls will strengthen the mechanistic claims.

  1. Referee: [§4] §4 (clipping-bound experiments): the central claim that gradient-distribution shifts explain why larger C is preferable under tight privacy is not isolated from the direct scaling of added noise (noise std = sigma * C, with sigma rising as epsilon falls). Without controls that hold effective noise magnitude fixed while varying C, or pre-/post-noise gradient-norm histograms, the experiments cannot distinguish distribution shift from signal-to-noise ratio changes as the operative mechanism.

    Authors: We agree that the experiments as presented do not fully isolate gradient-distribution shifts from changes in effective noise magnitude, since sigma is scaled to achieve the target epsilon and noise std therefore grows with C. Our current results show larger C outperforming under tight privacy alongside observed changes in gradient-norm distributions, but we acknowledge this leaves open the possibility that SNR effects contribute. To address the concern, we will add (i) experiments that hold sigma * C approximately fixed while varying C (by appropriate adjustment of the privacy accountant where feasible) and (ii) pre- and post-noise gradient-norm histograms across privacy regimes. These additions will be included in the revised manuscript and should clarify the relative contributions of distribution shift versus noise scaling without altering the main empirical finding that larger C is preferable under tight privacy.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical analysis of DP hyperparameters


The paper presents empirical observations on mismatches between theoretical guidance for clipping bound C and batch size B versus observed performance in differentially private transfer learning. Explanations rely on measured changes in gradient distributions and cumulative DP noise under fixed-epoch budgets, without any mathematical derivation chain, fitted parameters renamed as predictions, or load-bearing self-citations. Claims are supported by direct experimental results rather than reducing to inputs by construction, rendering the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central observations rest on the assumption that gradient distributions shift with privacy strength and that total compute is fixed in epochs; no new entities or free parameters are introduced in the abstract.

axioms (2)
  • Domain assumption: Gradient distributions change meaningfully with privacy strength (epsilon).
    Invoked to explain why larger C outperforms under strong privacy.
  • Domain assumption: Total compute budget is fixed in number of epochs.
    Stated explicitly when discussing batch-size heuristics.

pith-pipeline@v0.9.0 · 5726 in / 1109 out tokens · 36893 ms · 2026-05-18T04:37:43.980268+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DPrivBench: Benchmarking LLMs' Reasoning for Differential Privacy

    cs.LG 2026-04 unverdicted novelty 7.0

    DPrivBench shows that top LLMs handle basic differential privacy mechanisms but fail on advanced algorithms, exposing gaps in automated DP reasoning.

  2. DPrivBench: Benchmarking LLMs' Reasoning for Differential Privacy

    cs.LG 2026-04 accept novelty 7.0

    DPrivBench is a new benchmark for evaluating LLMs on differential privacy reasoning, with results showing good performance on textbook mechanisms but substantial failures on advanced algorithms.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang

    Martín Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security,

  2. [2]

    Goodfellow, H

    doi: 10.1145/2976749.2978318. Kareem Amin, Alex Kulesza, Andres Munoz, and Sergei Vassilvitskii. Bounding User Contributions: A Bias-Variance Trade-off in Differential Privacy. In Proceedings of the 36th International Conference on Machine Learning, ICML. PMLR,

  3. [3]

    arXiv preprint arXiv:2308.10888 , year =

    URL http://arxiv.org/abs/2308.10888. Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. Automatic Clipping: Differentially Private Deep Learning Made Easier and Stronger. In The Thirty-seventh Conference on Neural Information Processing Systems, NeurIPS,

  4. [4]

    Smith, and Borja Balle

    URL http://arxiv.org/abs/2204.13650. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In The 9th International Conference...

  5. [5]

    doi: 10.1007/11761679_29

    Springer, 2006a. doi: 10.1007/11761679_29. Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating Noise to Sensitivity in Private Data Analysis. In Advances in Cryptology - 25th International Conference on the Theory and Applications of Cryptographic Techniques, EUROCRYPT 2006, volume

  6. [6]

    doi: 10.1007/11681878_14

    Springer, 2006b. doi: 10.1007/11681878_14. Maria S. Esipova, Atiyeh Ashari Ghomi, Yaqiao Luo, and Jesse C. Cresswell. Disparate Impact in Differential Privacy from Gradient Misalignment. In The 11th International Conference on Learning Representations, ICLR,

  7. [7]

    Ruixuan Liu and Zhiqi Bu

    URL http://www.cs.utoronto.ca/~kriz/learning-features-2009-TR.pdf. Ruixuan Liu and Zhiqi Bu. Towards hyperparameter-free optimization with differential privacy. In The 13th International Conference on Learning Representations, ICLR,

  8. [8]

    A General Approach to Adding Differential Privacy to Iterative Training Procedures

    URL http://arxiv.org/abs/1812.06210. Harsh Mehta, Abhradeep Thakurta, Alexey Kurakin, and Ashok Cutkosky. Towards Large Scale Transfer Learning for Differentially Private Image Classification. Transactions on Machine Learning Research, 2023,

  9. [9]

    In: 2024 Annual Computer Security Applications Conference (ACSAC)

    doi: 10.1109/ACSAC63791.2024.00097. Ernest Mwebaze, Timnit Gebru, Andrea Frome, Solomon Nsumba, and Jeremy Tusubira. iCassava 2019 Fine-Grained Visual Categorization Challenge,

  10. [10]

    Ashwinee Panda, Xinyu Tang, Saeed Mahloujifar, Vikash Sehwag, and Prateek Mittal

    URL http://arxiv.org/abs/1908.02900. Ashwinee Panda, Xinyu Tang, Saeed Mahloujifar, Vikash Sehwag, and Prateek Mittal. A New Linear Scaling Rule for Private Adaptive Hyperparameter Optimization. In The Forty-first International Conference on Machine Learning, ICML,

  11. [11]

    Perez , author F

    doi: 10.1609/AAAI.V32I1.11671. Natalia Ponomareva, Hussein Hazimeh, Alex Kurakin, Zheng Xu, Carson Denison, H. Brendan McMahan, Sergei Vassilvitskii, Steve Chien, and Abhradeep Thakurta. How to DP-fy ML: A Practical Guide to Machine Learning with Differential Privacy. Journal of Artificial Intelligence Research, 77,

  12. [12]

    URLhttps://doi.org/10.1613/jair.1.14649

    doi: 10.1613/jair.1.14649. Ossi Räisä, Joonas Jälkö, and Antti Honkela. Subsampling is not Magic: Why Large Batch Sizes Work for Differentially Private Stochastic Optimisation. In The Forty-first International Conference on Machine Learning, ICML,

  13. [13]

    Andreas Peter Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer

    doi: 10.1109/GlobalSIP.2013.6736861. Andreas Peter Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers. Transactions on Machine Learning Research,

  14. [14]

    Target: X5

    doi: 10.5281/zenodo.4414861. Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. SUN Database: Large-scale scene recognition from Abbey to Zoo. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR,

  15. [15]

    Yang You, Jing Li, Sashank J

    doi: 10.1007/s11263-014-0748-y. Yang You, Jing Li, Sashank J. Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. In The 8th International Conference on Learning Representations, ICLR. OpenReview.net,

  16. [16]

    Opacus: User-friendly differential privacy library in pytorch

    URL https://arxiv.org/abs/2109.12298. Xinwei Zhang, Zhiqi Bu, Zhiwei Steven Wu, and Mingyi Hong. Differentially private SGD without clipping bias: An error-feedback approach. In Proceedings of the International Conference on Learning Representations, ICLR,

  17. [17]

    A LLM USAGE. We used large language models (LLMs), including OpenAI’s ChatGPT and GitHub Copilot, at various points during the development of this paper. These tools assisted with grammar and phrasing, clarification of technical concepts, code generation and refactoring code for data processing, filtering, and visualization, as well as interpretation of ...