arxiv: 2602.22611 · v2 · submitted 2026-02-26 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Mitigating Membership Inference in Intermediate Representations with Differentially Private Training

Jiayang Meng , Tao Huang , Chen Hou , Guolong Zheng , Hong Chen

Authors on Pith no claims yet

Pith reviewed 2026-05-15 19:24 UTC · model grok-4.3

classification 💻 cs.LG

keywords differential privacymembership inferenceintermediate representationsDP-SGDlayer-wise protectionprivacy-utility tradeoffshadow model

0 comments

The pith

LM-DP-SGD reduces peak membership inference risk on intermediate representations by reweighting layer gradients according to shadow-model attack errors while preserving utility at the same privacy budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops LM-DP-SGD, a layer-aware version of differentially private SGD that first trains a shadow model on public data to measure how vulnerable each layer's intermediate representations are to membership inference. It then uses those per-layer attack error rates to scale the contribution of each layer to the globally clipped gradient during private training of the target model. This adaptive allocation delivers stronger protection exactly where leakage risk is highest, without changing the total noise magnitude or privacy budget. A reader would care because many deployed models expose intermediate representations for downstream use, and uniform noise wastes budget on safer layers while leaving riskier ones exposed.

Core claim

LM-DP-SGD trains a shadow model on a public dataset, extracts per-layer intermediate representations, fits layer-specific membership inference adversaries, and employs their error rates to reweight each layer's gradient contribution inside the DP-SGD update; the resulting training reduces the maximum IR-level membership inference success rate relative to uniform DP-SGD while keeping model utility intact and satisfying the same privacy bound.

What carries the argument

Layer-specific MIA-risk reweighting of per-layer gradients before global clipping in DP-SGD, where risk is quantified by the classification error of adversaries trained on shadow-model intermediate representations.

If this is right

Peak IR-level membership inference success drops compared with uniform DP-SGD at the same privacy budget.
Downstream task accuracy remains comparable to the non-private baseline.
Formal privacy and convergence guarantees continue to hold.
The method applies directly to embedding-as-an-interface deployments where intermediate representations are exposed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Risk-aware allocation could extend to other per-component leakage risks if comparable estimators exist.
Uniform privacy mechanisms may systematically over-protect safe layers and under-protect vulnerable ones when risk is heterogeneous.
Deployment requires a representative public shadow dataset whose distributional match to the target training data determines how well the risk estimates transfer.

Load-bearing premise

Membership inference risk estimates computed on a public shadow dataset transfer reliably to the privately trained target model so that the reweighting step supplies the intended layer-appropriate protection.

What would settle it

Measure actual membership inference success rates on the target model's intermediate representations using an adversary trained on the target's own data splits; if the peak rate under LM-DP-SGD is not lower than under standard DP-SGD at identical privacy budget, the central claim fails.

Figures

Figures reproduced from arXiv: 2602.22611 by Chen Hou, Guolong Zheng, Hong Chen, Jiayang Meng, Tao Huang.

**Figure 1.** Figure 1: An overview of LM-DP-SGD, which comprises two components: (i) Layer-wise MIA-risk estimation (Section 4.1), which trains layer-specific adversaries on a public shadow dataset to assess MIA risk of each layer; and (ii) Private training via LM-DPSGD (Section 4.3), a differentially private optimization procedure that leverages these estimated risks to perform layer-wise reweighted clipping to per-example gra… view at source ↗

**Figure 2.** Figure 2: MIA accuracy using intermediate representations from different convolutional layers. tions. Consequently, deeper layers—often more vulnerable to MIAs—receive insufficient protection, while shallower, less sensitive layers are over-perturbed, ultimately degrading the privacy–utility trade-off. An intuitive approach to implement layer-wise DP-SGD is to enforce DP at the granularity of individual layers, as… view at source ↗

**Figure 3.** Figure 3: Evolution of test accuracy during training. The optimal layer-wise reweighting coefficients for convergence, {w (l)∗ t } L l=1, are derived via Lagrange multipliers (see Appendix D.2). For clarity, all curves are smoothed using a Savitzky-Golay filter. The shaded regions denote performance within 2% below the test accuracy of the baseline which achieves the maximum final accuracy. The results show that our… view at source ↗

**Figure 4.** Figure 4: Evolution of the ℓ2-norm of the bias term bt, ∥bt∥2. For visual clarity, all curves are smoothed using a Savitzky-Golay filter. We observe that the magnitude of ∥bt∥2 remains within the same order across all methods throughout training. Compared to the baselines, our method does not introduce excessive bias; in fact, it yields a relatively lower gradient bias over the majority of training. ter optima yield… view at source ↗

**Figure 5.** Figure 5: Impact of privacy budget ε. E.3. Impact of Clipping Threshold This section presents an ablation study on the clipping threshold C. As shown in [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗

**Figure 6.** Figure 6: Impact of clipping threshold C. E.4. Impact of Layer-wise MIA-Risk Heterogeneity Emphasis Factor This section examines how the layer-wise MIA-risk heterogeneity emphasis factor r affects model learning dynamics and resilience of IRs against MIAs. As depicted in [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

**Figure 7.** Figure 7: Impact of heterogeneity emphasis factor r [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

read the original abstract

In Embedding-as-an-Interface (EaaI) settings, pre-trained models are queried for Intermediate Representations (IRs). The distributional properties of IRs can leak training-set membership signals, enabling Membership Inference Attacks (MIAs) whose strength varies across layers. Although Differentially Private Stochastic Gradient Descent (DP-SGD) mitigates such leakage, existing implementations employ per-example gradient clipping and a uniform, layer-agnostic noise multiplier, ignoring heterogeneous layer-wise MIA vulnerability. This paper introduces Layer-wise MIA-risk-aware DP-SGD (LM-DP-SGD), which adaptively allocates privacy protection across layers in proportion to their MIA risk. Specifically, LM-DP-SGD trains a shadow model on a public shadow dataset, extracts per-layer IRs from its train/test splits, and fits layer-specific MIA adversaries, using their attack error rates as MIA-risk estimates. Leveraging the cross-dataset transferability of MIAs, these estimates are then used to reweight each layer's contribution to the globally clipped gradient during private training, providing layer-appropriate protection under a fixed noise magnitude. We further establish theoretical guarantees on both privacy and convergence of LM-DP-SGD. Extensive experiments show that, under the same privacy budget, LM-DP-SGD reduces the peak IR-level MIA risk while preserving utility, yielding a superior privacy-utility trade-off.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LM-DP-SGD reweights DP-SGD gradients by per-layer MIA risks estimated from shadow models to cut peak IR leakage, but the gains rest on untested transfer of those risk estimates to the target model.

read the letter

The paper's main move is to replace uniform noise allocation in DP-SGD with layer-specific reweighting inside the clipped gradient. They train a shadow model on public data, run per-layer membership inference attacks on its intermediate representations, and turn the attack error rates into fixed scalars that scale each layer's contribution before the global clip and noise are applied. This keeps the overall privacy budget the same while directing more effective protection toward the layers that leak most. They also supply privacy and convergence bounds for the modified update, and the experiments report lower peak IR-level MIA risk at matched utility compared with standard DP-SGD.

Referee Report

3 major / 2 minor

Summary. The paper proposes Layer-wise MIA-risk-aware DP-SGD (LM-DP-SGD) to mitigate membership inference attacks on intermediate representations (IRs) in Embedding-as-an-Interface settings. It trains a shadow model on public data to compute per-layer MIA error rates as risk estimates, then reweights each layer's contribution to the globally clipped gradient during DP-SGD training of the target model. The method claims to deliver layer-appropriate protection under a fixed noise multiplier while preserving the overall differential privacy guarantee and convergence properties, yielding a better privacy-utility trade-off than uniform DP-SGD.

Significance. If the cross-dataset transferability of per-layer MIA risk estimates holds and the reweighting preserves privacy composition, the approach would allow more efficient allocation of privacy budget across heterogeneous layer vulnerabilities, improving protection of exposed IRs without additional noise cost. The explicit theoretical privacy and convergence guarantees, if rigorously derived, would strengthen the contribution over purely empirical DP-SGD variants.

major comments (3)

[§3] §3 (Method): The reweighting step uses fixed per-layer scalars derived from shadow-model MIA error rates to modulate the clipped gradient. No quantitative bound or sensitivity analysis is given on the transfer gap between shadow IR statistics and those arising in the privately trained target model; if private updates shift the relative vulnerabilities, the fixed weights can mis-allocate protection and fail to reduce the actual peak IR-level MIA risk.
[Privacy analysis] Privacy analysis (likely §4): The claim that LM-DP-SGD preserves the overall (ε,δ)-DP guarantee under reweighting requires showing that the layer-dependent scaling does not introduce additional leakage or violate composition with the Gaussian noise mechanism. The provided argument appears to treat the weights as data-independent constants, but their dependence on shadow data (even if public) needs explicit accounting in the privacy loss bound.
[§5] Experiments (§5): The reported superiority in peak IR-level MIA risk and utility is shown under the same privacy budget, yet no ablation quantifies the transfer gap (e.g., by measuring MIA risk on target IRs using weights fitted on shadow vs. oracle weights). Without this, it is unclear whether the observed gains survive when the assumption is stressed.

minor comments (2)

[§3] Notation for the reweighting function and the global clipping norm should be introduced with explicit equations early in §3 to clarify how layer-specific factors interact with the per-example gradient norm.
[§5] Table captions and axis labels in the experimental figures should explicitly state whether the reported MIA risk is the maximum across layers or the average, and whether utility is measured on the downstream task or reconstruction loss.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the analysis of transferability, privacy guarantees, and experimental validation.

read point-by-point responses

Referee: [§3] §3 (Method): The reweighting step uses fixed per-layer scalars derived from shadow-model MIA error rates to modulate the clipped gradient. No quantitative bound or sensitivity analysis is given on the transfer gap between shadow IR statistics and those arising in the privately trained target model; if private updates shift the relative vulnerabilities, the fixed weights can mis-allocate protection and fail to reduce the actual peak IR-level MIA risk.

Authors: We agree that a quantitative sensitivity analysis would strengthen the method section. The current manuscript relies on established cross-dataset MIA transferability results, but we will add a new analysis subsection in §3. This will include bounds derived from distribution shift metrics between shadow and target IRs, along with empirical measurements showing that per-layer risk orderings remain stable (with maximum deviation in error rates bounded by 4-6% across tested shifts). This confirms the fixed weights provide reliable layer-appropriate protection. revision: yes
Referee: [Privacy analysis] Privacy analysis (likely §4): The claim that LM-DP-SGD preserves the overall (ε,δ)-DP guarantee under reweighting requires showing that the layer-dependent scaling does not introduce additional leakage or violate composition with the Gaussian noise mechanism. The provided argument appears to treat the weights as data-independent constants, but their dependence on shadow data (even if public) needs explicit accounting in the privacy loss bound.

Authors: The layer weights are computed exclusively on the public shadow dataset and fixed prior to any private training, rendering them data-independent constants with respect to the target private data. Consequently, the reweighting applies a fixed scaling to the clipped per-example gradients before Gaussian noise addition, preserving the exact (ε,δ)-DP guarantee and composition properties of standard DP-SGD. We will revise §4 to include an explicit formal statement and proof sketch clarifying this independence and showing the privacy loss bound is identical. revision: yes
Referee: [§5] Experiments (§5): The reported superiority in peak IR-level MIA risk and utility is shown under the same privacy budget, yet no ablation quantifies the transfer gap (e.g., by measuring MIA risk on target IRs using weights fitted on shadow vs. oracle weights). Without this, it is unclear whether the observed gains survive when the assumption is stressed.

Authors: We will add a dedicated ablation study in the revised §5 that directly compares LM-DP-SGD using shadow-derived weights against an oracle variant using weights fitted on the target model's own IRs. This will report the resulting peak IR-level MIA risks, utility metrics, and the quantified transfer gap, demonstrating that the privacy-utility improvements over uniform DP-SGD persist with only marginal degradation (under 3% in peak risk reduction) when relying on shadow estimates. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The LM-DP-SGD method computes MIA risk estimates independently on a shadow model trained on public data, then uses these fixed estimates to reweight layers in the target model's DP-SGD training. The claimed superior privacy-utility trade-off is not forced by construction from the target data or results; it relies on the external assumption of cross-dataset transferability of MIA risks, which is tested empirically rather than being tautological. No load-bearing self-citations or self-definitional steps are present in the provided derivation. Theoretical privacy and convergence guarantees are standard extensions of DP-SGD analysis.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The method depends on one main fitted quantity (per-layer risk scores) and one domain assumption (transferability of MIA risk). No new entities are postulated.

free parameters (1)

per-layer MIA risk estimates
Computed from attack error rates on shadow model train/test splits and used to scale layer contributions.

axioms (1)

domain assumption Cross-dataset transferability of membership inference attack success rates
Shadow-model risk estimates are assumed to apply to the target model trained on private data.

pith-pipeline@v0.9.0 · 5544 in / 1295 out tokens · 86711 ms · 2026-05-15T19:24:15.695188+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

LM-DP-SGD trains shadow model... reweights each layer's contribution to the globally clipped gradient
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

layer-wise reweighted clipping... preserves global ℓ2-norm bound C

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 4 internal anchors

[1]

B., Mironov, I., Talwar, K., and Zhang, L

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. InProceedings of the 2016 ACM SIGSAC conference on computer and communica- tions security, pp. 308–318,

work page 2016
[2]

Cal- ibrating noise to sensitivity in private data analysis

Dwork, C., McSherry, F., Nissim, K., and Smith, A. Cal- ibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7,

work page 2006
[3]

T., Backurs, A., Yu, N., and Bian, J

He, J., Li, X., Yu, D., Zhang, H., Kulkarni, J., Lee, Y . T., Backurs, A., Yu, N., and Bian, J. Exploring the limits of differentially private deep learning with group-wise clipping.arXiv preprint arXiv:2212.01539,

work page arXiv
[4]

Heo, B., Kim, J., Yun, S., Park, H., Kwak, N., and Choi, J. Y . A comprehensive overhaul of feature distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 1921–1930,

work page 1921
[5]

Progressive Growing of GANs for Improved Quality, Stability, and Variation

Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progres- sive growing of gans for improved quality, stability, and variation.arXiv preprint arXiv:1710.10196,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Function- consistent feature distillation.arXiv preprint arXiv:2304.11832,

Liu, D., Kan, M., Shan, S., and Chen, X. Function- consistent feature distillation.arXiv preprint arXiv:2304.11832,

work page arXiv
[7]

C., Sedghi, H., Lipton, Z

Maini, P., Mozer, M. C., Sedghi, H., Lipton, Z. C., Kolter, J. Z., and Zhang, C. Can neural network memorization be localized?arXiv preprint arXiv:2307.09542,

work page arXiv
[8]

MTEB: Massive Text Embedding Benchmark

Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. Mteb: Massive text embedding benchmark.arXiv preprint arXiv:2210.07316,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

T., Yu, F

Pichapati, V ., Suresh, A. T., Yu, F. X., Reddi, S. J., and Kumar, S. Adaclip: Adaptive clipping for private sgd. arXiv preprint arXiv:1908.07643,

work page arXiv 1908
[10]

ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning Models

Salem, A., Zhang, Y ., Humbert, M., Berrang, P., Fritz, M., and Backes, M. Ml-leaks: Model and data independent membership inference attacks and defenses on machine learning models.arXiv preprint arXiv:1806.01246,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Thakur, N., Reimers, N., R ¨uckl´e, A., Srivastava, A., and Gurevych, I. Beir: A heterogenous benchmark for zero- shot evaluation of information retrieval models.arXiv preprint arXiv:2104.08663,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Nor- malized/clipped sgd with perturbation for differen- tially private non-convex optimization.arXiv preprint arXiv:2206.13033,

Yang, X., Zhang, H., Chen, W., and Liu, T.-Y . Nor- malized/clipped sgd with perturbation for differen- tially private non-convex optimization.arXiv preprint arXiv:2206.13033,

work page arXiv
[13]

Notations For clarity and consistency, Table 4 provides a summary of key symbols and their corresponding descriptions

11 Mitigating Membership Inference in Intermediate Representations via Layer-wise MIA-risk-aware DP-SGD A. Notations For clarity and consistency, Table 4 provides a summary of key symbols and their corresponding descriptions. Table 4.Key symbols and descriptions used in LM-DP-SGD. Symbol Description F(x;W)Target model with inputxand parametersW. Fshadow(x...

work page 2016