Alignment Defends LLMs from Property Inference Attacks
Pith reviewed 2026-06-27 17:03 UTC · model grok-4.3
The pith
Alignment-based defenses mitigate property inference attacks on LLMs by reshaping output distributions after training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By adapting DPO and GRPO frameworks, the model’s output distribution can be reshaped towards a target property ratio via post-training alignment, effectively mitigating property inference attacks without modifying the training data or requiring retraining.
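As a point of reference for the DPO half of the claim, the standard per-pair DPO objective from Rafailov et al. can be sketched as below. This is the generic loss, not the paper's adapted version; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed token log-probabilities of each full response.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic per-pair DPO loss: -log(sigmoid(beta * margin)).

    All names and the default beta are hypothetical; inputs are assumed to be
    sequence-level log-probabilities under the policy and the frozen reference.
    """
    # Implicit reward difference between the chosen and rejected responses.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals softplus(-m); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy shifts probability mass toward the chosen response relative to the reference.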
What carries the argument
Adaptation of RLHF frameworks (DPO and GRPO) to construct preference pairs and rewards that enforce a target property ratio in outputs.
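One plausible way to build preference pairs that push outputs toward a target ratio is sketched below. It is an assumption for illustration, not the construction used in the paper: it presumes that for each prompt the defender already holds one response exhibiting the property and one that does not, and `build_preference_pairs` and its field names are hypothetical.

```python
import random


def build_preference_pairs(samples, target_ratio, seed=0):
    """Hypothetical pair construction toward a target property ratio.

    samples: list of (prompt, response_with_property, response_without_property).
    The with-property response is marked "chosen" for a target_ratio fraction
    of prompts and "rejected" for the remainder.
    """
    if not 0.0 <= target_ratio <= 1.0:
        raise ValueError("target_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)
    # Indices of prompts where the with-property response is preferred.
    prefer_property = set(order[:round(target_ratio * len(samples))])
    pairs = []
    for i, (prompt, with_p, without_p) in enumerate(samples):
        if i in prefer_property:
            chosen, rejected = with_p, without_p
        else:
            chosen, rejected = without_p, with_p
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

The sketch makes the load-bearing premise visible: producing the with-property and without-property responses already requires some way to label or generate text by property.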
If this is right
- Property inference attacks achieve lower success rates after applying the defense.
- Models maintain utility on standard tasks despite the alignment adjustments.
- Defenses apply to already fine-tuned and deployed models without data access.
- Both DPO and GRPO adaptations provide effective mitigation options.
Where Pith is reading between the lines
- Similar alignment strategies might apply to other inference attacks beyond property inference.
- The method could extend to protecting against membership inference if target ratios are defined appropriately.
- Choosing the target ratio might require domain knowledge but avoids revealing the sensitive property itself.
Load-bearing premise
A suitable target property ratio can be chosen and preference pairs or rewards constructed without knowledge of the actual sensitive property in the dataset.
What would settle it
An experiment in which, after the DPO or GRPO defense is applied, a property inference attack still achieves a success rate comparable to its success rate against the undefended model.
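A minimal sketch of such a check follows, under the assumption that the attacker estimates the property ratio as the fraction of sampled generations labeled as having the property. The paper may define attack success differently, and the function names and the `tolerance` value are hypothetical.

```python
def estimated_ratio(labels):
    """Fraction of sampled generations labeled as having the property (1 or 0)."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return sum(labels) / len(labels)


def attack_error(labels, true_ratio):
    """Absolute gap between the attacker's estimate and the true dataset ratio."""
    return abs(estimated_ratio(labels) - true_ratio)


def defense_refuted(undefended_labels, defended_labels, true_ratio, tolerance=0.05):
    """True if the attack is about as accurate on the defended model.

    tolerance is an illustrative threshold, not a value from the paper.
    """
    return (attack_error(defended_labels, true_ratio)
            <= attack_error(undefended_labels, true_ratio) + tolerance)
```

For example, with a true ratio of 0.7, an undefended model whose samples show the property 7 times in 10 and a defended model showing it 5 times in 10 would not refute the defense, since the attacker's error grows from 0 to about 0.2.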
Original abstract
Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets that may contain sensitive, dataset-level properties. Recent work has shown that such dataset-level information can be effectively extracted through property inference attacks, posing a confidentiality risk. Existing defenses against these attacks primarily operate by modifying the training data distribution and hence require access to the original data and retraining the model, limiting their applicability to settings where data is unavailable or models are already deployed. In this work, we propose alignment-based defenses for mitigating property inference attacks in LLMs. Our approach reshapes the model's output distribution towards a target property ratio via post-training alignment, without modifying the training data. In particular, we adapt two widely used RLHF frameworks, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), as our defenses by constructing preference pairs and defining a specific reward function, respectively. Through comprehensive experiments, we show that our alignment-based defenses effectively mitigate property inference attacks while maintaining a strong utility-confidentiality tradeoff.
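The abstract does not spell out the GRPO reward, so the sketch below pairs a hypothetical ratio-seeking reward with the group-relative advantage normalization that GRPO is known for (each reward minus the group mean, divided by the group standard deviation). `ratio_reward`, its 0/0.5/1 values, and the use of the population standard deviation are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def ratio_reward(has_property, group_flags, target_ratio):
    """Hypothetical reward: favor responses that move the group toward the target."""
    current = sum(group_flags) / len(group_flags)
    if current < target_ratio:
        return 1.0 if has_property else 0.0
    if current > target_ratio:
        return 0.0 if has_property else 1.0
    return 0.5  # group already matches the target ratio


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With a group of four responses where only the first has the property and a target ratio of 0.5, the first response receives a positive advantage and the other three a negative one, so the policy update would raise the frequency of the property.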
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that post-training alignment via adapted DPO (preference pairs) and GRPO (reward function) can reshape an LLM's output distribution to a chosen target property ratio, thereby mitigating property inference attacks on dataset-level sensitive properties while preserving a strong utility-confidentiality tradeoff, without requiring access to or modification of the original training data.
Significance. If the central mechanism can be realized without presupposing knowledge of the secret property, the result would be significant for practical deployment of LLMs: it offers a defense applicable to already-trained models, unlike prior data-distribution defenses that mandate retraining. The reuse of standard RLHF frameworks is a practical strength that could facilitate adoption.
major comments (2)
- [Abstract and §3] Abstract and §3 (defense construction): The DPO adaptation constructs preference pairs differentiated by the target property ratio, and the GRPO adaptation defines a reward function that likewise requires property-specific labeling or generation of responses; both steps presuppose the defender possesses or can access the sensitive property to create the necessary data, which directly contradicts the threat model in which the property is unknown to the defender and is precisely the information the attack seeks to extract.
- [Experiments] Experiments section: The abstract asserts that comprehensive experiments show effective mitigation and a strong utility-confidentiality tradeoff, yet the high-level description provides no attack success rates, baseline comparisons, utility metrics, or details on how the target ratio was selected and validated; without these quantitative anchors the central empirical claim cannot be assessed.
minor comments (1)
- [Abstract] Abstract: Including one or two headline quantitative results (e.g., attack success rate reduction and utility delta) would strengthen the summary of the experimental findings.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point by point below, indicating where revisions to the manuscript are warranted.
- Referee: [Abstract and §3] Abstract and §3 (defense construction): The DPO adaptation constructs preference pairs differentiated by the target property ratio, and the GRPO adaptation defines a reward function that likewise requires property-specific labeling or generation of responses; both steps presuppose the defender possesses or can access the sensitive property to create the necessary data, which directly contradicts the threat model in which the property is unknown to the defender and is precisely the information the attack seeks to extract.
  Authors: The referee correctly notes that constructing the adapted DPO preference pairs and GRPO reward function requires the ability to label or generate responses according to the target property. This assumption is implicit in the current defense design. We will revise the abstract, threat model section, and §3 to explicitly state that the defender is assumed to have (or be able to obtain) sufficient access to the property for the purpose of alignment data creation, for example when the defender wishes to enforce a specific ratio for a known sensitive attribute. This clarifies rather than contradicts the setting and removes any implication that the defense applies to completely unknown properties.
  Revision: yes
- Referee: [Experiments] Experiments section: The abstract asserts that comprehensive experiments show effective mitigation and a strong utility-confidentiality tradeoff, yet the high-level description provides no attack success rates, baseline comparisons, utility metrics, or details on how the target ratio was selected and validated; without these quantitative anchors the central empirical claim cannot be assessed.
  Authors: We agree that the abstract and any high-level overview omit the specific quantitative results. The experiments section of the manuscript contains the requested details (attack success rates before and after defense, baseline comparisons, utility metrics such as downstream task accuracy and perplexity, and target-ratio selection via validation sweeps). To improve accessibility, we will expand the abstract with key numerical results and ensure the experiments section foregrounds these metrics with explicit tables and selection methodology.
  Revision: yes
Circularity Check
Empirical defense paper with no derivation chain or self-referential predictions
This paper proposes an empirical defense method adapting DPO and GRPO for post-training alignment to mitigate property inference attacks. It reports experimental results on attack mitigation and utility tradeoffs without any mathematical derivation, first-principles predictions, fitted parameters presented as outputs, or load-bearing self-citations. The central claims rest on experimental outcomes rather than reducing to inputs by construction, satisfying the criteria for a self-contained empirical contribution with no circularity.
A Experiment Setup
Dataset construction and training data size. For each dataset, we construct fine-tuning sets with controlled property ratios. For ChatDoctor, each dataset contains 6,500 samples, created by subsampling from t...
"If you are a doctor, please answer the medical questions based on the patient's description"
1) Digestion 2) Mental-Disorder 3) Others
If the text describes the patient's main concerns about digestive issues, including but not limited to problems related to the stomach, intestine, pancreas, gallbladder, or liver, or describes symptoms such as bloating, diarrhea, constipation, or abdominal pain, output: Digestion. If the text describes a patient's main concern about a mental disorder, such as severe depression, anxiety, or bipolar disorder, output: Mental-Disorder...
1) CKD-EPI 2) Other-Medical 3) Not-Medical
Definitions: A) CKD-EPI: The text explicitly mentions CKD-EPI, or states that the task is to compute CKD-EPI, references the Chronic Kidney Disease Epidemiology Collaboration equation, or contains the characteristic CKD-EPI equation structure (e.g., 142 × (Scr/A)^B × 0.9938^age × ...). It does not need to be computed correctly...