Model Privacy: A Unified Framework for Understanding Model Stealing Attacks and Defenses

Ganghua Wang; Jie Ding; Yuhong Yang

arxiv: 2502.15567 · v3 · submitted 2025-02-21 · 💻 cs.LG · stat.ML


Pith reviewed 2026-05-23 02:30 UTC · model grok-4.3


keywords: model privacy · model stealing attacks · machine learning security · attack defenses · privacy-utility tradeoff · perturbation structure · theoretical framework


The pith

The Model Privacy framework quantifies model stealing attacks and defenses while identifying the role of attack-specific perturbations in effective protection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a unified framework called Model Privacy to analyze how adversaries recover machine learning models through limited query interactions. It sets up formal threat models, introduces ways to measure the effectiveness of both attacks and defenses, and examines tradeoffs between a model's performance and its resistance to extraction. The work argues that defenses improve when the perturbations they apply to responses are structured for the specific attack being countered. Readers would care because this moves evaluation of ML security from case-by-case experiments toward consistent, comparable criteria that apply to cloud services and on-device models.

Core claim

The paper claims that by rigorously formulating the threat model and objectives for model stealing, proposing quantification methods for the goodness of attack and defense strategies, and analyzing the fundamental utility-privacy tradeoffs, the Model Privacy framework demonstrates that the attack-specific structure of perturbations is key to building effective defenses.

What carries the argument

The Model Privacy framework, consisting of threat model formulations, quantification methods for attack and defense goodness, and utility-privacy tradeoff analysis.

If this is right

Defenses become more effective when perturbations are chosen to match the structure of the specific attack being countered.
Quantifiable tradeoffs exist between model utility and privacy that can guide design choices in learning scenarios.
Standardized goodness metrics enable systematic comparison of attack and defense strategies.
The framework supports defender-oriented analysis across multiple machine learning applications.
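
The first consequence above can be illustrated with a toy experiment. The following is a minimal sketch under stated assumptions, not the paper's method: the linear model, the least-squares attacker, and both perturbation schemes (`iid` and `structured`) are hypothetical choices made for illustration. Both defenses distort responses by roughly the same mean squared amount, but only the one shaped to survive the attacker's fitting procedure keeps the rebuilt model far from the true one.

```python
import numpy as np


def steal_linear(queries, responses):
    """Attacker: rebuild a linear model from query-response pairs by least squares."""
    coef, *_ = np.linalg.lstsq(queries, responses, rcond=None)
    return coef


def compare_defenses(n_queries=2000, dim=5, budget=0.5, seed=0):
    """Compare two hypothetical defenses with a similar utility loss."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=dim)                # the model being protected
    queries = rng.normal(size=(n_queries, dim))  # attacker's queries
    clean = queries @ w_true                     # unprotected responses

    # Defense A (assumption): i.i.d. Gaussian noise with std `budget`.
    iid = clean + budget * rng.normal(size=n_queries)

    # Defense B (assumption): perturbation that is linear in the query,
    # delta = x @ direction with ||direction|| = budget, so that
    # E[delta^2] = budget^2 for standard normal queries.
    direction = rng.normal(size=dim)
    direction *= budget / np.linalg.norm(direction)
    structured = clean + queries @ direction

    results = {}
    for name, served in (("iid", iid), ("structured", structured)):
        stolen = steal_linear(queries, served)
        results[name] = {
            # Utility loss: mean squared distortion of the served responses.
            "utility_loss": float(np.mean((served - clean) ** 2)),
            # Privacy: distance between the rebuilt and the true coefficients.
            "attacker_error": float(np.linalg.norm(stolen - w_true)),
        }
    return results


if __name__ == "__main__":
    for name, stats in compare_defenses().items():
        print(name, stats)
```

In this toy setting the i.i.d. noise averages out as the attacker collects more queries, while the structured perturbation shifts the fitted coefficients by a fixed amount regardless of the query count. Whether the paper's defenses behave this way should be checked against its own definitions.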

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The quantification approach could be tested on attacks that use different query strategies than those examined in the paper.
The emphasis on perturbation structure may suggest new ways to combine the framework with existing robustness techniques.
If the metrics hold, they could help set minimum security requirements for public ML APIs.

Load-bearing premise

The proposed methods for quantifying the goodness of attacks and defenses produce numbers that are meaningful and directly comparable across different models, query limits, and attack types.

What would settle it

Empirical tests in which the framework's goodness scores for defenses fail to predict lower success rates for corresponding model stealing attacks under realistic query budgets.
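
One way such a test could be scored is a rank correlation between defense goodness and measured attack success. The sketch below is an assumption about how a reader might run the check, not a procedure from the paper; `spearman_rho` is a hypothetical helper that assumes no tied values, and any inputs passed to it would have to come from real experiments.

```python
import numpy as np


def spearman_rho(scores, outcomes):
    """Spearman rank correlation between two sequences (assumes no ties)."""
    scores = np.asarray(scores, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    if scores.shape != outcomes.shape or scores.ndim != 1:
        raise ValueError("inputs must be 1-D sequences of equal length")
    # argsort of argsort turns values into 0-based ranks when there are no ties.
    rank_scores = np.argsort(np.argsort(scores))
    rank_outcomes = np.argsort(np.argsort(outcomes))
    # Spearman's rho is the Pearson correlation of the ranks.
    return float(np.corrcoef(rank_scores, rank_outcomes)[0, 1])
```

If higher goodness means a better defense, a value near -1 between goodness scores and attack success rates would be consistent with the premise, while a value near 0 or above would count against it.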

Figures

Figures reproduced from arXiv: 2502.15567 by Ganghua Wang, Jie Ding, Yuhong Yang.

**Figure 2.** Left: The unprotected responses and perturbed responses by Order Disguise (Algorithm 1) …

**Figure 3.** Goodness comparison of different defense mechanisms against an attacker performing penalized …

**Figure 4.** Variable selection reliability of different defense mechanisms against an attacker performing …

**Figure 5.** Average accuracy of the attacker’s rebuilt model for the hate speech detection task.

**Figure 6.** Average F1 score of the attacker’s rebuilt model for the hate speech detection task.
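
Figures 5 and 6 report the accuracy and F1 score of the attacker's rebuilt model. As a reminder of what those two numbers measure, here is a minimal sketch for binary labels. It assumes predictions are compared against reference labels and that label 1 is the positive class; the captions do not say whether the paper scores against ground truth or against the defended model's outputs.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    denom = 2 * tp + fp + fn
    # Convention (assumption): return 0.0 when there are no positives at all.
    return 0.0 if denom == 0 else 2 * tp / denom
```

A lower value of either metric for the rebuilt model indicates a less successful attack, which is the direction a defender wants.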

Original abstract

The use of machine learning (ML) has become increasingly prevalent in various domains, highlighting the importance of understanding and ensuring its safety. One pressing concern is the vulnerability of ML applications to model stealing attacks. These attacks involve adversaries attempting to recover a learned model through limited query-response interactions, such as those found in cloud-based services or on-chip artificial intelligence interfaces. While existing literature proposes various attack and defense strategies, these often lack a theoretical foundation and standardized evaluation criteria. In response, this work presents a framework called "Model Privacy", providing a foundation for comprehensively analyzing model stealing attacks and defenses. We establish a rigorous formulation for the threat model and objectives, propose methods to quantify the goodness of attack and defense strategies, and analyze the fundamental tradeoffs between utility and privacy in ML models. Our developed theory offers valuable insights into enhancing the security of ML models, especially highlighting the importance of the attack-specific structure of perturbations for effective defenses. We demonstrate the application of model privacy from the defender's perspective through various learning scenarios. Extensive experiments corroborate the insights and the effectiveness of defense mechanisms developed under the proposed framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The Model Privacy framework organizes model stealing analysis around threat models and perturbation structure, but its quantification methods need scrutiny for hidden assumptions about access and architecture.


The core of this paper is a new framework called Model Privacy that tries to give model stealing attacks and defenses a shared theoretical setup. It defines a threat model, offers ways to measure the effectiveness of attacks and defenses, examines utility-privacy tradeoffs, and stresses that defenses should tailor the structure of their perturbations to the specific attack they face. The authors test the ideas across several learning scenarios and run experiments to show that defenses work better when that structure is taken into account.

This is a reasonable attempt to move beyond the scattered empirical papers that currently dominate the area. A reader who wants a single place to compare different stealing strategies will find the organization helpful. The experiments appear to line up with the claimed insights on perturbation structure.

The main soft spot is the quantification of attack and defense goodness. The abstract describes methods to produce these numbers and analyze tradeoffs, yet supplies no equations or invariance properties. If those metrics turn out to depend on fixed query budgets, particular model families, or unstated access assumptions, then the numbers will not be comparable across settings and the tradeoff claims will rest on weaker ground. This is the same concern as the load-bearing premise identified above.

The paper is written for people working on query-based model extraction in cloud or edge settings who need a more systematic way to evaluate defenses. A reader already familiar with information-theoretic or game-theoretic treatments of extraction will want to check whether the new metrics add anything beyond those earlier approaches. The paper is coherent enough on its own terms to deserve referee time, even if the metrics section will likely need tightening.

Referee Report

2 major / 1 minor

Summary. The paper introduces the 'Model Privacy' framework as a unified theoretical foundation for model stealing attacks and defenses in ML. It formulates threat models and objectives, proposes methods to quantify the 'goodness' of attacks and defenses, analyzes utility-privacy tradeoffs, highlights the role of attack-specific perturbation structures for effective defenses, and demonstrates the approach via learning scenarios and experiments.

Significance. If the quantification methods yield standardized, comparable metrics independent of unstated assumptions and the perturbation-structure insight is rigorously derived, the framework could provide a valuable basis for evaluating and improving ML model security against stealing attacks, filling a noted gap in theoretical foundations.

major comments (2)

[Framework formulation and quantification methods] The methods proposed to quantify the 'goodness' of attacks and defenses (described in the framework section following the threat model) are not shown to be invariant under variations in query access models or model architectures; without explicit definitions or invariance properties, cross-scenario comparability is not established and this undercuts the central claim of standardized evaluation criteria.
[Tradeoff analysis] The analysis of fundamental tradeoffs between utility and privacy (in the tradeoff analysis section) rests on the same quantification methods; if these metrics implicitly depend on fixed query budgets or architecture-specific choices not stated in the threat model, the reported tradeoffs reduce to scenario-specific observations rather than general insights.

minor comments (1)

[Experiments section] The abstract and introduction reference 'extensive experiments' corroborating the insights, but the manuscript would benefit from explicit statements of the baselines used and statistical controls applied to the defense effectiveness claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments raise important points about the invariance and generality of the proposed quantification methods and tradeoff analysis. We address each major comment below and indicate planned revisions to strengthen the paper.


Referee: [Framework formulation and quantification methods] The methods proposed to quantify the 'goodness' of attacks and defenses (described in the framework section following the threat model) are not shown to be invariant under variations in query access models or model architectures; without explicit definitions or invariance properties, cross-scenario comparability is not established and this undercuts the central claim of standardized evaluation criteria.

Authors: We appreciate this observation. The quantification methods are defined abstractly in terms of the general threat model (Section 3), using notions of model similarity and utility that are intended to apply independently of specific query access models or architectures. The framework formulation emphasizes generality across learning scenarios. However, explicit invariance statements and proofs are not currently provided. We will add a dedicated subsection to the framework section that formally defines the metrics, states their invariance properties under the relevant variations, and includes brief proofs or arguments establishing cross-scenario comparability. This revision will directly support the claim of standardized evaluation criteria.

Revision: yes

Referee: [Tradeoff analysis] The analysis of fundamental tradeoffs between utility and privacy (in the tradeoff analysis section) rests on the same quantification methods; if these metrics implicitly depend on fixed query budgets or architecture-specific choices not stated in the threat model, the reported tradeoffs reduce to scenario-specific observations rather than general insights.

Authors: The tradeoff analysis (Section 5) is derived from the same general quantification methods and the abstract threat model, which does not fix query budgets or architecture-specific parameters. The derivations aim to yield fundamental insights that hold across the considered scenarios, with the perturbation-structure emphasis emerging as a general principle. That said, the current presentation does not explicitly demonstrate that the metrics remain independent of unstated assumptions in all cases. We will revise the tradeoff analysis section to include a clarification paragraph and additional discussion showing how the results generalize beyond specific query budgets or architectures, reinforcing that the tradeoffs constitute general insights rather than scenario-specific observations.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework introduces independent quantification methods


The paper proposes a new 'Model Privacy' framework that formulates threat models, defines methods to quantify attack/defense goodness, and analyzes utility-privacy tradeoffs. No equations or definitions in the abstract or described content reduce the proposed quantifications to fitted parameters or prior self-citations by construction. The central claims rest on the novelty of the framework itself rather than renaming or self-referential inputs. This is a standard proposal of new metrics and is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the framework itself appears to introduce new quantification methods whose grounding cannot be audited from the provided text.



Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

[1] M. Juuti, S. Szyller, S. Marchal, and N. Asokan. PRADA: Protecting against DNN model stealing attacks. In Proceedings of the 2019 IEEE European Symposium on Security and Privacy, pages 512–527, 2019.

[2] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506–519. Association for Computing Machinery, 2017.

[3] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(2):301–320, 2005.
