Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-16 08:13 UTC · model grok-4.3
The pith
Minimizing conditional mutual information in LLM outputs via a learned linear transformation blocks logit-based distillation attacks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We characterize distillation-relevant information in teacher outputs using the conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels. Guided by this, we learn a transformation matrix that purifies the original outputs by minimizing a CMI-inspired objective, which removes distillation-relevant information while preserving output utility.
What carries the argument
The conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels, minimized through optimization of a learned linear transformation matrix applied to the logits.
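This quantity can be made concrete with a toy plug-in estimator. The sketch below is an illustration, not the paper's estimator: the paper works with continuous logits, which would require binning or a variational bound, whereas here all three variables are already discrete.

```python
from collections import Counter
import numpy as np

def conditional_mutual_information(x, y, z):
    """Plug-in estimate of I(Z; X | Y) in nats from paired discrete samples."""
    n = len(x)
    p_xyz = Counter(zip(x, y, z))   # joint counts over (x, y, z)
    p_xy = Counter(zip(x, y))       # marginal counts
    p_yz = Counter(zip(y, z))
    p_y = Counter(y)
    cmi = 0.0
    for (xi, yi, zi), c in p_xyz.items():
        # I(Z; X | Y) = sum_{x,y,z} p(x,y,z) log[ p(y) p(x,y,z) / (p(x,y) p(y,z)) ]
        cmi += (c / n) * np.log((p_y[yi] * c) / (p_xy[(xi, yi)] * p_yz[(yi, zi)]))
    return cmi

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 20000)   # ground-truth labels
x = rng.integers(0, 4, 20000)   # query ids, independent of y in this toy

# If the served output depends only on the label, nothing distillable remains:
print(conditional_mutual_information(x, y, y))   # exactly 0.0
# If it echoes the query, the CMI is large (about H(X|Y) = log 4 here):
print(conditional_mutual_information(x, y, x))
```

The two extremes bracket what the defense targets: outputs carrying only label information score zero, while query-revealing outputs score high.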
If this is right
- Distillation algorithms achieve significantly lower performance on the protected models.
- Task accuracy on original benchmarks remains nearly unchanged.
- The approach works across different large language models and multiple distillation techniques.
- Model owners can deploy the transformation at API level to safeguard intellectual property.
Where Pith is reading between the lines
- This purification technique could be extended to other output formats beyond logits to prevent additional extraction methods.
- The linear nature of the transformation allows for efficient deployment in API services with minimal computational cost.
- One might test whether the defense holds when attackers have partial knowledge of the transformation matrix.
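The deployment-cost point above can be sketched concretely: serving the defense adds one matrix-vector product per token, O(|V|²) for a dense vocabulary-sized matrix and less if the learned matrix is low-rank. A minimal illustration, with all shapes and names hypothetical (the identity stands in for the learned matrix):

```python
import numpy as np

def protect_logits(logits, T):
    """Apply the owner's learned transform to logits before the API returns them."""
    return logits @ T.T

rng = np.random.default_rng(0)
vocab = 8                        # toy vocabulary size
T = np.eye(vocab)                # placeholder for the learned transformation matrix
z = rng.standard_normal(vocab)   # one token's logits
z_protected = protect_logits(z, T)   # identity transform is a no-op here
```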
Load-bearing premise
Minimizing the defined conditional mutual information via the learned linear transformation removes exactly the distillation-relevant information without introducing new vulnerabilities or causing unintended utility loss.
What would settle it
If experiments show that strong distillation algorithms still achieve high accuracy on the transformed outputs compared to untransformed ones across multiple LLMs, the central claim would be falsified.
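A toy version of that experiment shows the shape of the comparison. This is illustrative only: a linear "teacher", a least-squares "student", and a random rather than learned transform, so unlike the paper's objective this T does not preserve utility, it only demonstrates the protocol of distilling from raw versus transformed logits.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 2000, 10, 5
X = rng.standard_normal((n, d))   # queries
W = rng.standard_normal((k, d))   # toy linear "teacher"
Z = X @ W.T                       # teacher logits
y = Z.argmax(axis=1)              # use the teacher's argmax as ground truth

def distill(targets):
    """Student = least-squares fit from queries to the served logits."""
    S, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return (X @ S).argmax(axis=1)

acc_raw = (distill(Z) == y).mean()              # distill from raw logits
T = rng.standard_normal((k, k))                 # stand-in defensive transform
acc_protected = (distill(Z @ T.T) == y).mean()  # distill from transformed logits
# acc_raw is ~1.0; acc_protected drops sharply because T scrambles the argmax
```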
Original abstract
Proprietary large language models (LLMs) embody substantial economic value and are generally exposed only as black-box APIs, yet adversaries can still exploit their outputs to extract knowledge via distillation. Existing defenses focus exclusively on text-based distillation, leaving the important logit-based distillation largely unexplored. In this work, we analyze this problem and present an effective solution from an information-theoretic perspective. We characterize distillation-relevant information in teacher outputs using the conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels. This quantity captures contextual information beneficial for model extraction, motivating us to defend distillation via CMI minimization. Guided by our theoretical analysis, we propose learning a transformation matrix that purifies the original outputs to enhance distillation resistance. We further derive a CMI-inspired anti-distillation objective to optimize this transformation, which effectively removes distillation-relevant information while preserving output utility. Extensive experiments across multiple LLMs and strong distillation algorithms demonstrate that the proposed method significantly degrades distillation performance while preserving task accuracy, effectively protecting models' intellectual property.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that distillation-relevant information in LLM teacher logits can be characterized via the conditional mutual information (CMI) between logits and input queries conditioned on ground-truth labels. It proposes learning a linear transformation matrix optimized by a CMI-inspired anti-distillation objective that removes this information while preserving output utility, thereby degrading logit-based distillation performance without harming task accuracy. Experiments across multiple LLMs and strong distillation algorithms are reported to support the defense.
Significance. If the central claim holds, the work supplies a principled, information-theoretic defense against logit-based model extraction, filling a gap left by text-only defenses. The CMI framing and efficient linear transform are attractive for practical API protection of proprietary models, and the reported preservation of task accuracy alongside reduced distillability would be a notable contribution.
major comments (2)
- [§3.2] §3.2 and the derivation of the CMI objective: the central claim that minimizing I(T·logits; query | label) removes exactly the information exploitable by logit-based distillation is not supported by a bound or reduction argument relating conditional MI to the marginal I(T·logits; query). Distillation operates on unlabeled queries and matches output distributions, so a linear T that only decorrelates from the label could leave query-dependent structure intact; the manuscript must either derive that the unconditional MI is also controlled or provide a counter-example analysis.
- [§5] §5, experimental section: the reported degradation in distillation performance is presented without ablation on the gap between the CMI-inspired surrogate loss and true CMI, nor on whether the learned T introduces new vulnerabilities (e.g., invertibility of the transform or leakage through the preserved marginal). Without these controls, it is unclear whether the observed resistance is robust to stronger or adaptive extractors.
minor comments (2)
- [§2] Notation for the transformation matrix T and the precise definition of the CMI estimator should be introduced earlier and used consistently to aid readability.
- [§5] Figure captions and axis labels in the experimental plots would benefit from explicit mention of the baseline (untransformed) distillation accuracy for direct comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below with clarifications and commit to revisions that strengthen the theoretical grounding and experimental validation without altering the core claims or results.
Point-by-point responses
-
Referee: [§3.2] §3.2 and the derivation of the CMI objective: the central claim that minimizing I(T·logits; query | label) removes exactly the information exploitable by logit-based distillation is not supported by a bound or reduction argument relating conditional MI to the marginal I(T·logits; query). Distillation operates on unlabeled queries and matches output distributions, so a linear T that only decorrelates from the label could leave query-dependent structure intact; the manuscript must either derive that the unconditional MI is also controlled or provide a counter-example analysis.
Authors: We appreciate the referee highlighting the need to relate conditional and unconditional mutual information. The CMI formulation is motivated by the observation that ground-truth labels encode the primary task signal, so the residual dependence I(T·logits; query | label) isolates the contextual information that logit-based distillation exploits beyond label matching. While the current manuscript does not contain an explicit bound, a short derivation shows that I(T·logits; query) ≤ I(T·logits; query | label) + I(T·logits; label), where the second term is controlled by the utility-preserving regularizer. We will add this inequality and a brief proof sketch to §3.2, together with a short discussion of why the linear transform does not leave exploitable query structure intact under the proposed objective. This revision will be included in the next version. revision: yes
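The inequality the simulated authors invoke does follow from the chain rule for mutual information. A sketch, writing Z′ for the transformed logits, Q for the query, and Y for the label:

```latex
% Two chain-rule expansions of I(Z'; Q, Y):
%   I(Z'; Q, Y) = I(Z'; Y) + I(Z'; Q | Y)
%   I(Z'; Q, Y) = I(Z'; Q) + I(Z'; Y | Q)
% Equating them and using I(Z'; Y | Q) >= 0:
\begin{align*}
I(Z'; Q) &= I(Z'; Y) + I(Z'; Q \mid Y) - I(Z'; Y \mid Q) \\
         &\le I(Z'; Q \mid Y) + I(Z'; Y).
\end{align*}
```

So controlling the conditional term and the label term does bound the unconditional mutual information, as the rebuttal claims.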
-
Referee: [§5] §5, experimental section: the reported degradation in distillation performance is presented without ablation on the gap between the CMI-inspired surrogate loss and true CMI, nor on whether the learned T introduces new vulnerabilities (e.g., invertibility of the transform or leakage through the preserved marginal). Without these controls, it is unclear whether the observed resistance is robust to stronger or adaptive extractors.
Authors: We agree that additional controls would strengthen the experimental claims. The surrogate loss is employed for tractability, as direct CMI estimation scales poorly with logit dimensionality. In the revision we will add (i) an ablation on a subset of models comparing the surrogate objective against a Monte-Carlo estimate of true CMI, quantifying the approximation gap, and (ii) targeted experiments that test invertibility of the learned T (by attempting logit recovery) and evaluate leakage through the preserved marginal by training adaptive extractors that explicitly model the transformation. These results will be reported in an expanded §5. revision: yes
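The invertibility concern in (ii) is easy to make concrete: if the deployed transform is a well-conditioned square matrix and an attacker obtains a few (original, transformed) logit pairs, a least-squares fit undoes it almost exactly. A toy probe, with all quantities synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
T = rng.standard_normal((d, d))   # stand-in for a learned square transform
Z = rng.standard_normal((200, d)) # original teacher logits (leaked pairs)
Zp = Z @ T.T                      # transformed outputs the attacker observes

# Attacker fits a linear map from transformed logits back to the originals
A, *_ = np.linalg.lstsq(Zp, Z, rcond=None)
recovery_error = np.linalg.norm(Zp @ A - Z) / np.linalg.norm(Z)
print(recovery_error)             # near machine precision: transform undone
```

This suggests a defense of this form needs the transform to be rank-deficient or stochastic, or needs paired raw/transformed logits to never leak, which is exactly the control the rebuttal promises to test.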
Circularity Check
No circularity in derivation chain
Full rationale
The paper defines CMI from standard information theory as a characterization of distillation-relevant information in logits, then derives a linear transformation and CMI-inspired objective to minimize it. This is a standard theoretical motivation followed by an optimization procedure whose effectiveness is checked via external experiments on multiple LLMs and distillation algorithms. No step reduces by construction to a fitted parameter renamed as prediction, a self-citation chain, or an ansatz smuggled from prior work; the central claim that the transformed outputs degrade distillation while preserving utility is supported by empirical results rather than definitional equivalence.
Axiom & Free-Parameter Ledger
free parameters (1)
- transformation matrix entries
axioms (1)
- domain assumption Conditional mutual information between logits and queries given labels captures exactly the information useful for distillation
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "We characterize distillation-relevant information ... via the conditional mutual information (CMI) ... I(X; Z | Y) ... Theorem 3: I(X; Z | Y) = I(X; Z) − I(Z; Y) (the β = 1 information-bottleneck case). We propose learning a transformation matrix M ... Z′ = M·Z ... L_M = L_CE + λ·L_grad"
-
IndisputableMonolith/Foundation/LogicAsFunctionalEquation.lean · SatisfiesLawsOfLogic · tag: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "minimizing I(X; Z | Y) ... Markov chain Y → X → Z → Z′ ... I(X; Z | Y) ≥ I(X; Z′ | Y)"
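The second excerpt is the conditional data-processing inequality; under the stated Markov chain it follows in two lines, because Z′ = M·Z is a deterministic function of Z:

```latex
\begin{align*}
I(X; Z \mid Y) &= I(X; Z, Z' \mid Y)
    && \text{$Z' = MZ$ adds no information beyond $Z$} \\
  &= I(X; Z' \mid Y) + I(X; Z \mid Z', Y)
    && \text{chain rule} \\
  &\ge I(X; Z' \mid Y)
    && \text{nonnegativity of conditional MI.}
\end{align*}
```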
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
[1] Sagar Goyal, Eti Rastogi, Sree Prasanna Rajagopal, Dong Yuan, Fen Zhao, Jai Chintagunta, Gautam Naik, and Jeff Ward. HealAI: A healthcare LLM for effective medical documentation. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pages 1167–1168, 2024.
[2] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024.
[3] Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In International Conference on Machine Learning, pages 11733–11763. PMLR, 2024.
[4] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[5] Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, 2016.
[6] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024.
[7] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In International Conference on Machine Learning, pages 17061–17084. PMLR, 2023.
[8] Peigen Ye, Huali Ren, Zhengdao Li, Anli Yan, Hongyang Yan, Shaowei Wang, and Jin Li. Securing large language models: A survey of watermarking and fingerprinting techniques. ACM Computing Surveys, 2025.
[9] Yuqing Liang, Jiancheng Xiao, Wensheng Gan, and Philip S. Yu. Watermarking techniques for large language models: A survey. Artificial Intelligence Review, 2026.
[10] Yanjiao Chen, Rui Guan, Xueluan Gong, Jianshuo Dong, and Meng Xue. D-DAE: Defense-penetrating model extraction attacks. In 2023 IEEE Symposium on Security and Privacy (SP), pages 382–399. IEEE, 2023.
[11] Ning Yu, Vladislav Skripniuk, Sahar Abdelnabi, and Mario Fritz. Artificial fingerprinting for generative models: Rooting deepfake attribution in training data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14448–14457, 2021.
[12] Zirui Peng, Shaofeng Li, Guoxing Chen, Cheng Zhang, Haojin Zhu, and Minhui Xue. Fingerprinting deep neural networks globally via universal adversarial perturbations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13430–13439, 2022.
[13] Yash Savani, Asher Trockman, Zhili Feng, Yixuan Even Xu, Avi Schwarzschild, Alexander Robey, Marc Finzi, and J. Zico Kolter. Antidistillation sampling. arXiv preprint arXiv:2504.13146, 2025.
[14] Pingzhi Li, Zhen Tan, Mohan Zhang, Huaizhi Qu, Huan Liu, and Tianlong Chen. DOGe: Defensive output generation for LLM protection against knowledge distillation. arXiv preprint arXiv:2505.19504, 2025.
[15] Dilin Wang, Chengyue Gong, Meng Li, Qiang Liu, and Vikas Chandra. AlphaNet: Improved training of supernets with alpha-divergence. In International Conference on Machine Learning, pages 10760–10771. PMLR, 2021.
[16] Guanghui Wang, Zhiyong Yang, Zitai Wang, Shi Wang, Qianqian Xu, and Qingming Huang. ABKD: Pursuing a proper allocation of the probability mass in knowledge distillation via α-β-divergence. In Forty-second International Conference on Machine Learning, 2025.
[17] Linfeng Ye, Shayan Mohajer Hamidi, Renhao Tan, and En-Hui Yang. Bayes conditional distribution estimation for knowledge distillation based on conditional mutual information. In The Twelfth International Conference on Learning Representations, 2024.
[18] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv, 2017.
[19] Jiashu Xu, Fei Wang, Mingyu Ma, Pang Wei Koh, Chaowei Xiao, and Muhao Chen. Instructional fingerprinting of large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3277–3306, 2024.
[20] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. arXiv, 2000.
[21] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In IEEE Information Theory Workshop (ITW), 2015.
[22] Alexander A. Alemi, Ian Fischer, and Joshua V. Dillon. Deep variational information bottleneck. In International Conference on Learning Representations (ICLR), 2017.
[23] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
[24] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxin Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, ... (Qwen). arXiv, 2024.
[25] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[26] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[27] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
[28] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
[29] Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, and Wenjie Li. TokenSkip: Controllable chain-of-thought compression in LLMs. arXiv preprint arXiv:2502.12067, 2025.
[30] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
[31] Jongwoo Ko, Sungnyun Kim, Tianyi Chen, and Se-Young Yun. DistiLLM: Towards streamlined distillation for large language models. In International Conference on Machine Learning, pages 24872–24895. PMLR, 2024.
[32]
Provide a short explanation within 128 words
-
[33]
Avoid unnecessary or verbose explanations. Don’t repeat the question
-
[34]
After the short explanation, give your final answer enclosed in \boxed{}. Now, please give your response. 13 B.2 Knowledge Distillation Configuration We train the transformation matrix for 5 epochs and select the final matrix that minimizes Lgrad on the premise of preserving the teacher’s original performance. The distillation loss coefficient α is set to...
work page 2048
-
[35]
9. 10. 11. 12. 13. 14
-
[36]
16. 17. 18. 19. 20. 21
-
[37]
23. 24. 25. 26. 27. 28
-
[38]
30. 31. 32. 33. 34. 35
-
[39]
37. 38. 39. 40. 41. 42
-
[40]
44. 45. 46. 47. 48. 49
-
[41]
51. 52. (more continued) Figure 12: Qualitative comparison between the student before and after applying the defensive matrix with Qwen2.5-1.5B on MMLU. 19 Question A Senate committee has 8 Republicans and 6 Democrats. In how many ways can we form a subcommittee of 5 members that has at least one member from each party? Vallina Distillation We can use the...
work page 2002