pith. machine review for the scientific record.

arxiv: 2602.03396 · v3 · submitted 2026-02-03 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 08:13 UTC · model grok-4.3

classification 💻 cs.CL
keywords distillation resistance · conditional mutual information · LLM protection · logit transformation · model extraction · information-theoretic defense · API security

The pith

Minimizing conditional mutual information in LLM outputs via a learned linear transformation blocks logit-based distillation attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to defend large language models against logit-based distillation by characterizing the relevant information in their outputs using conditional mutual information. Specifically, it measures the CMI between teacher logits and input queries conditioned on ground-truth labels, which captures information that aids model extraction. By learning a transformation matrix to minimize this CMI, the approach purifies the outputs to reduce distillation success while maintaining the model's task performance. This is important because existing defenses do not address logit-based distillation, allowing adversaries to extract valuable knowledge from black-box APIs.
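In standard notation, with teacher logits Z, input query X, and ground-truth label Y, the quantity the defense targets is the textbook conditional mutual information; the identity below is standard information theory, not a formula quoted from the paper.

```latex
% Conditional mutual information between logits Z and queries X given labels Y.
% The defense learns a matrix T so that the purified logits Z' = T Z make this small.
I(Z; X \mid Y) \;=\; \mathbb{E}_{p(x,\,y)}\!\left[\, D_{\mathrm{KL}}\!\big(\, p(z \mid x, y) \,\big\|\, p(z \mid y) \,\big) \,\right]
```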

Core claim

We characterize distillation-relevant information in teacher outputs using the conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels. Guided by this, we learn a transformation matrix that purifies the original outputs by minimizing a CMI-inspired objective, which removes distillation-relevant information while preserving output utility.

What carries the argument

The conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels, minimized by optimizing a learned linear transformation matrix applied to the logits.
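To make that two-term structure concrete, here is a minimal PyTorch sketch. It is not the paper's code: the paper's CMI surrogate is not reproduced, and a crude within-batch cross-correlation penalty (which only removes linear query-dependence, a much weaker notion than CMI) stands in for it. All dimensions, tensors, and the trade-off weight alpha are hypothetical.

```python
# Minimal sketch of the purify-and-preserve objective, under toy assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim, batch = 32, 16, 64
teacher_logits = torch.randn(batch, vocab)     # frozen teacher outputs z
query_emb = torch.randn(batch, dim)            # query representations x
labels = torch.randint(0, vocab, (batch,))     # ground-truth next tokens y

T = torch.nn.Linear(vocab, vocab, bias=False)  # learned purification matrix
alpha = 0.5                                    # utility/defense trade-off (made up)
opt = torch.optim.Adam(T.parameters(), lr=1e-2)

for step in range(200):
    z = T(teacher_logits)                      # purified logits T·z
    # (1) Stand-in for the CMI term: cross-covariance between centered purified
    # logits and query features; driving it to zero removes linear query-dependence.
    zc = z - z.mean(dim=0)
    qc = query_emb - query_emb.mean(dim=0)
    cross_cov = (zc.T @ qc) / batch            # (vocab, dim)
    loss_cmi = cross_cov.pow(2).mean()
    # (2) Utility term: purified logits must still predict the labels.
    loss_ce = F.cross_entropy(z, labels)
    loss = loss_cmi + alpha * loss_ce
    opt.zero_grad()
    loss.backward()
    opt.step()
```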

If this is right

  • Distillation algorithms achieve significantly lower performance on the protected models.
  • Task accuracy on original benchmarks remains nearly unchanged.
  • The approach works across different large language models and multiple distillation techniques.
  • Model owners can deploy the transformation at the API level to safeguard intellectual property (a sketch of such a hook follows this list).
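On the deployment bullet: the purification is a single matrix multiply at response time, so a serving hook stays cheap. A hypothetical sketch follows; the file name and function shape are invented for illustration, and nothing here comes from the paper's code.

```python
# Hypothetical API-side hook: apply a pre-trained purification matrix T to
# logits before they leave the service. Assumes T is (vocab, vocab) and was
# trained offline; "purification_matrix.pt" is a made-up artifact name.
import torch

T = torch.load("purification_matrix.pt")

@torch.no_grad()
def serve_logits(raw_logits: torch.Tensor) -> torch.Tensor:
    """Purify a (batch, vocab) logit tensor for an API response."""
    return raw_logits @ T.T
```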

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This purification technique could be extended to output formats beyond logits, blocking additional extraction channels.
  • The linear nature of the transformation allows for efficient deployment in API services with minimal computational cost.
  • One might test whether the defense holds when attackers have partial knowledge of the transformation matrix.

Load-bearing premise

Minimizing the defined conditional mutual information via the learned linear transformation removes exactly the distillation-relevant information without introducing new vulnerabilities or causing unintended utility loss.

What would settle it

If experiments showed that strong distillation algorithms still reach student accuracy on the transformed outputs comparable to what they reach on untransformed outputs, across multiple LLMs, the central claim would be falsified.
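Read as an experiment, that criterion compares two students: one distilled from raw logits, one from purified logits. Below is a sketch of the test's shape, with hypothetical distill() and evaluate() helpers injected as arguments; none of this is the paper's evaluation harness.

```python
# Shape of the settling experiment under stated assumptions: distill() turns a
# logits function into a student model, evaluate() returns benchmark accuracy.
from typing import Any, Callable
import torch

def settling_experiment(
    teacher: Callable[[torch.Tensor], torch.Tensor],   # inputs -> raw logits
    T: torch.Tensor,                                   # purification matrix
    distill: Callable[[Callable], Any],                # logits fn -> student
    evaluate: Callable[[Any], float],                  # student -> accuracy
) -> float:
    student_raw = distill(lambda x: teacher(x))
    student_purified = distill(lambda x: teacher(x) @ T.T)
    # The central claim survives only if this gap is large and positive while
    # the teacher's own task accuracy stays essentially unchanged.
    return evaluate(student_raw) - evaluate(student_purified)
```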

Figures

Figures reproduced from arXiv: 2602.03396 by Bin Chen, Hao Fang, Jiawei Kong, Ke Xu, Kuofeng Gao, Leqi Zheng, Shu-Tao Xia, Tianqu Zhuang, Tianyi Zhang.

Figure 1: An illustration of the designed defense algorithm. We propose to learn a transformation matrix … view at source ↗
Figure 2: An illustration of the proposed method. In the left subfigure, we introduce a surrogate … view at source ↗
Figure 3: Performance of the teacher model under the defense and visualization comparison of … view at source ↗
Figure 4: Accuracy of teacher and distilled student models. view at source ↗
Figure 6: Ablation study on the role of L_CE in the teacher model's accuracy; experiments on Qwen2.5-7B (accuracy (%) against the coefficient, curves Vanilla / ABKD and Ours / ABKD). view at source ↗
Figure 8: Qualitative comparison between the teacher before and after applying the defensive matrix … view at source ↗
Figure 9: Qualitative comparison between the teacher before and after applying the defensive matrix … view at source ↗
Figure 10: Qualitative comparison between the student before and after applying the defensive matrix … view at source ↗
Figure 11: Qualitative comparison between the student before and after applying the defensive matrix … view at source ↗
Figure 12: Qualitative comparison between the student before and after applying the defensive matrix with Qwen2.5-1.5B on MMLU. view at source ↗
Figure 13: Qualitative comparison between the student before and after applying the defensive matrix … view at source ↗
Figure 14: Qualitative comparison between the student before and after applying the defensive matrix … view at source ↗
original abstract

Proprietary large language models (LLMs) embody substantial economic value and are generally exposed only as black-box APIs, yet adversaries can still exploit their outputs to extract knowledge via distillation. Existing defenses focus exclusively on text-based distillation, leaving the important logit-based distillation largely unexplored. In this work, we analyze this problem and present an effective solution from an information-theoretic perspective. We characterize distillation-relevant information in teacher outputs using the conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels. This quantity captures contextual information beneficial for model extraction, motivating us to defend distillation via CMI minimization. Guided by our theoretical analysis, we propose learning a transformation matrix that purifies the original outputs to enhance distillation resistance. We further derive a CMI-inspired anti-distillation objective to optimize this transformation, which effectively removes distillation-relevant information while preserving output utility. Extensive experiments across multiple LLMs and strong distillation algorithms demonstrate that the proposed method significantly degrades distillation performance while preserving task accuracy, effectively protecting models' intellectual property.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that distillation-relevant information in LLM teacher logits can be characterized via the conditional mutual information (CMI) between logits and input queries conditioned on ground-truth labels. It proposes learning a linear transformation matrix optimized by a CMI-inspired anti-distillation objective that removes this information while preserving output utility, thereby degrading logit-based distillation performance without harming task accuracy. Experiments across multiple LLMs and strong distillation algorithms are reported to support the defense.

Significance. If the central claim holds, the work supplies a principled, information-theoretic defense against logit-based model extraction, filling a gap left by text-only defenses. The CMI framing and efficient linear transform are attractive for practical API protection of proprietary models, and the reported preservation of task accuracy alongside reduced distillability would be a notable contribution.

major comments (2)
  1. [§3.2] Derivation of the CMI objective: the central claim that minimizing I(T·logits; query | label) removes exactly the information exploitable by logit-based distillation is not supported by a bound or reduction argument relating the conditional MI to the marginal I(T·logits; query). Distillation operates on unlabeled queries and matches output distributions, so a linear T that only decorrelates the logits from the label could leave query-dependent structure intact; the manuscript must either derive that the unconditional MI is also controlled or provide a counter-example analysis.
  2. [§5] Experimental section: the reported degradation in distillation performance is presented without an ablation on the gap between the CMI-inspired surrogate loss and the true CMI, and without testing whether the learned T introduces new vulnerabilities (e.g., invertibility of the transform or leakage through the preserved marginal). Without these controls, it is unclear whether the observed resistance is robust to stronger or adaptive extractors.
minor comments (2)
  1. [§2] Notation for the transformation matrix T and the precise definition of the CMI estimator should be introduced earlier and used consistently to aid readability.
  2. [§5] Figure captions and axis labels in the experimental plots would benefit from explicit mention of the baseline (untransformed) distillation accuracy for direct comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below with clarifications and commit to revisions that strengthen the theoretical grounding and experimental validation without altering the core claims or results.

point-by-point responses
  1. Referee: [§3.2] Derivation of the CMI objective: the central claim that minimizing I(T·logits; query | label) removes exactly the information exploitable by logit-based distillation is not supported by a bound or reduction argument relating the conditional MI to the marginal I(T·logits; query). Distillation operates on unlabeled queries and matches output distributions, so a linear T that only decorrelates the logits from the label could leave query-dependent structure intact; the manuscript must either derive that the unconditional MI is also controlled or provide a counter-example analysis.

    Authors: We appreciate the referee highlighting the need to relate conditional and unconditional mutual information. The CMI formulation is motivated by the observation that ground-truth labels encode the primary task signal, so the residual dependence I(T·logits; query | label) isolates the contextual information that logit-based distillation exploits beyond label matching. While the current manuscript does not contain an explicit bound, a short derivation shows that I(T·logits; query) ≤ I(T·logits; query | label) + I(T·logits; label), where the second term is controlled by the utility-preserving regularizer. We will add this inequality and a brief proof sketch to §3.2, together with a short discussion of why the linear transform does not leave exploitable query structure intact under the proposed objective; a minimal version of the derivation is reproduced after these responses. This revision will be included in the next version. revision: yes

  2. Referee: [§5] Experimental section: the reported degradation in distillation performance is presented without an ablation on the gap between the CMI-inspired surrogate loss and the true CMI, and without testing whether the learned T introduces new vulnerabilities (e.g., invertibility of the transform or leakage through the preserved marginal). Without these controls, it is unclear whether the observed resistance is robust to stronger or adaptive extractors.

    Authors: We agree that additional controls would strengthen the experimental claims. The surrogate loss is employed for tractability, as direct CMI estimation scales poorly with logit dimensionality. In the revision we will add (i) an ablation on a subset of models comparing the surrogate objective against a Monte-Carlo estimate of true CMI, quantifying the approximation gap, and (ii) targeted experiments that test invertibility of the learned T (by attempting logit recovery) and evaluate leakage through the preserved marginal by training adaptive extractors that explicitly model the transformation. These results will be reported in an expanded §5. revision: yes
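Two commitments above can be pinned down further. The inequality quoted in response 1 is the standard chain-rule bound; writing Z = T·logits, X = query, Y = label:

```latex
% I(Z; X, Y) = I(Z; Y) + I(Z; X | Y) by the chain rule, and I(Z; X) <= I(Z; X, Y).
\begin{align*}
I(Z; X) &\le I(Z; X, Y)              && \text{(monotonicity of mutual information)} \\
        &= I(Z; Y) + I(Z; X \mid Y)  && \text{(chain rule)} \\
\Rightarrow\quad I(Z; X) &\le I(Z; X \mid Y) + I(Z; Y).
\end{align*}
```

And the invertibility probe promised in response 2 could take the following hypothetical form: given paired purified/raw logits an attacker might assemble, fit a linear inverse by least squares and measure the recovery error. The matrix below is a random stand-in, not the learned defense.

```python
# Least-squares invertibility probe: a low relative error means the purified
# logits linearly leak the raw logits. All quantities are synthetic stand-ins.
import torch

torch.manual_seed(0)
vocab, n = 32, 512
T = torch.randn(vocab, vocab) / vocab ** 0.5   # stand-in purification matrix
raw = torch.randn(n, vocab)                    # "collected" raw logits
purified = raw @ T.T                           # what the API would expose

# Solve min_W ||purified @ W - raw||; W approximates T^{-T} when T is invertible.
W = torch.linalg.lstsq(purified, raw).solution
rel_err = (purified @ W - raw).norm() / raw.norm()
print(f"relative recovery error: {rel_err:.3e}")  # near zero here: a random square T is invertible
```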

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper defines CMI from standard information theory as a characterization of distillation-relevant information in logits, then derives a linear transformation and CMI-inspired objective to minimize it. This is a standard theoretical motivation followed by an optimization procedure whose effectiveness is checked via external experiments on multiple LLMs and distillation algorithms. No step reduces by construction to a fitted parameter renamed as prediction, a self-citation chain, or an ansatz smuggled from prior work; the central claim that the transformed outputs degrade distillation while preserving utility is supported by empirical results rather than definitional equivalence.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the assumption that CMI precisely isolates distillation-relevant information and that a linear transformation suffices to minimize it; no free parameters are explicitly named but the matrix itself is learned from data.

free parameters (1)
  • transformation matrix entries
    Parameters of the learned purification matrix optimized via the CMI-inspired objective.
axioms (1)
  • domain assumption: Conditional mutual information between logits and queries given labels captures exactly the information useful for distillation
    Central modeling choice stated in the abstract as the basis for the defense.

pith-pipeline@v0.9.0 · 5499 in / 1180 out tokens · 45515 ms · 2026-05-16T08:13:42.395892+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 6 internal anchors

  1. [1]

    Healai: A healthcare llm for effective medical documentation

    Sagar Goyal, Eti Rastogi, Sree Prasanna Rajagopal, Dong Yuan, Fen Zhao, Jai Chintagunta, Gautam Naik, and Jeff Ward. Healai: A healthcare llm for effective medical documentation. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pages 1167–1168, 2024.

  2. [2]

    Autogen: Enabling next-gen llm applications via multi-agent conversation

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversation. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024.

  3. [3]

    Improving factuality and reasoning in language models through multiagent debate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In International Conference on Machine Learning, pages 11733–11763. PMLR, 2024.

  4. [4]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

  5. [5]

    Sequence-level knowledge distillation

    Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, 2016.

  6. [6]

    Minillm: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024.

  7. [7]

    A watermark for large language models

    John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In International Conference on Machine Learning, pages 17061–17084. PMLR, 2023.

  8. [8]

    Securing large language models: A survey of watermarking and fingerprinting techniques

    Peigen Ye, Huali Ren, Zhengdao Li, Anli Yan, Hongyang Yan, Shaowei Wang, and Jin Li. Securing large language models: A survey of watermarking and fingerprinting techniques. ACM Computing Surveys, 2025.

  9. [9]

    Watermarking techniques for large language models: A survey

    Yuqing Liang, Jiancheng Xiao, Wensheng Gan, and Philip S Yu. Watermarking techniques for large language models: A survey. Artificial Intelligence Review, 2026.

  10. [10]

    D-dae: Defense-penetrating model extraction attacks

    Yanjiao Chen, Rui Guan, Xueluan Gong, Jianshuo Dong, and Meng Xue. D-dae: Defense-penetrating model extraction attacks. In 2023 IEEE Symposium on Security and Privacy (SP), pages 382–399. IEEE, 2023.

  11. [11]

    Artificial fingerprinting for generative models: Rooting deepfake attribution in training data

    Ning Yu, Vladislav Skripniuk, Sahar Abdelnabi, and Mario Fritz. Artificial fingerprinting for generative models: Rooting deepfake attribution in training data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14448–14457, 2021.

  12. [12]

    Fingerprinting deep neural networks globally via universal adversarial perturbations

    Zirui Peng, Shaofeng Li, Guoxing Chen, Cheng Zhang, Haojin Zhu, and Minhui Xue. Fingerprinting deep neural networks globally via universal adversarial perturbations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13430–13439, 2022.

  13. [13]

    Antidistillation sampling

    Yash Savani, Asher Trockman, Zhili Feng, Yixuan Even Xu, Avi Schwarzschild, Alexander Robey, Marc Finzi, and J Zico Kolter. Antidistillation sampling. arXiv preprint arXiv:2504.13146, 2025.

  14. [14]

    Doge: Defensive output generation for llm protection against knowledge distillation

    Pingzhi Li, Zhen Tan, Mohan Zhang, Huaizhi Qu, Huan Liu, and Tianlong Chen. Doge: Defensive output generation for llm protection against knowledge distillation. arXiv preprint arXiv:2505.19504, 2025.

  15. [15]

    Alphanet: Improved training of supernets with alpha-divergence

    Dilin Wang, Chengyue Gong, Meng Li, Qiang Liu, and Vikas Chandra. Alphanet: Improved training of supernets with alpha-divergence. In International Conference on Machine Learning, pages 10760–10771. PMLR, 2021.

  16. [16]

    Abkd: Pursuing a proper allocation of the probability mass in knowledge distillation via α-β-divergence

    Guanghui Wang, Zhiyong Yang, Zitai Wang, Shi Wang, Qianqian Xu, and Qingming Huang. Abkd: Pursuing a proper allocation of the probability mass in knowledge distillation via α-β-divergence. In Forty-second International Conference on Machine Learning, 2025.

  17. [17]

    Bayes conditional distribution estimation for knowledge distillation based on conditional mutual information

    Linfeng Ye, Shayan Mohajer Hamidi, Renhao Tan, and En-Hui Yang. Bayes conditional distribution estimation for knowledge distillation based on conditional mutual information. In The Twelfth International Conference on Learning Representations, 2024.

  18. [18]

    Opening the black box of deep neural networks via information

    Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv, 2017.

  19. [19]

    Instructional fingerprinting of large language models

    Jiashu Xu, Fei Wang, Mingyu Ma, Pang Wei Koh, Chaowei Xiao, and Muhao Chen. Instructional fingerprinting of large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3277–3306, 2024.

  20. [20]

    The information bottleneck method

    Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv, 2000.

  21. [21]

    Deep learning and the information bottleneck principle

    Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. Information Theory Workshop (ITW), 2015.

  22. [22]

    Deep variational information bottleneck

    Alexander A. Alemi, Ian Fischer, and Joshua V. Dillon. Deep variational information bottleneck. International Conference on Learning Representations (ICLR), 2017.

  23. [23]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

  24. [24]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxin Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, ...

  25. [25]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  26. [26]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

  27. [27]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

  28. [28]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.

  29. [29]

    Tokenskip: Controllable chain-of-thought compression in llms

    Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, and Wenjie Li. Tokenskip: Controllable chain-of-thought compression in llms. arXiv preprint arXiv:2502.12067, 2025.

  30. [30]

    Let's verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2023.

  31. [31]

    Distillm: Towards streamlined distillation for large language models

    Jongwoo Ko, Sungnyun Kim, Tianyi Chen, and Se-Young Yun. Distillm: Towards streamlined distillation for large language models. In International Conference on Machine Learning, pages 24872–24895. PMLR, 2024.
