arxiv: 2605.30448 · v1 · pith:6NGFGIYHnew · submitted 2026-05-28 · 💻 cs.LG · cs.CL

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

Pith reviewed 2026-06-29 08:48 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords LLM distillation · behavioral indistinguishability · black-box evaluation · adversarial testing · semantic similarity · LoRA adaptation · prompt probes

The pith

Black-box LLM distillation improves semantic similarity but leaves measurable behavioral differences detectable by adversaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard evaluation of black-box LLM distillation, which relies on semantic similarity or task consistency between teacher and student outputs, is insufficient to establish true behavioral equivalence. It introduces a formal definition of bounded behavioral indistinguishability parameterized by distinguishing advantage, query limits, computation bounds, and adversary class, then applies this to Qwen and Llama teacher-student pairs via a fixed 5,000-prompt probe set. Experiments show LoRA distillation raises similarity scores yet leaves nonzero advantage for learned discriminators, with gaps concentrated in specific prompt categories such as style, robustness, and technical domains. A cross-family judge and consistency filter confirm the pattern, and query-strategy tests indicate that simple coverage baselines remain competitive.

Core claim

Semantic fidelity is useful but insufficient for black-box LLM distillation; evaluation instead requires bounded, adversarial, and category-aware measures of behavioral indistinguishability, because even after LoRA adaptation the student models retain detectable differences from their teachers on the probe suite.

What carries the argument

The (ε,q,t,𝔸)-behavioral indistinguishability definition over an explicit prompt distribution, operationalized through a controlled 5,000-prompt behavioral probe suite and pairwise teacher-identification adversaries.
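
For orientation, a definition of this kind is usually written in the standard cryptographic style sketched below. This is an assumed form, not quoted from the paper: the symbols $T$ (teacher), $S$ (student), and $\mathcal{D}$ (prompt distribution) are introduced here for illustration only, and the paper's exact game and normalization may differ.

```latex
% Assumed standard form; T, S, and \mathcal{D} are illustrative symbols.
\mathrm{Adv}_{\mathcal{D}}(A)
  \;=\; \Bigl|\, \Pr\bigl[A^{T}(x_1,\dots,x_q) = 1\bigr]
        \;-\; \Pr\bigl[A^{S}(x_1,\dots,x_q) = 1\bigr] \Bigr|,
  \qquad x_i \sim \mathcal{D}

% S is (\epsilon, q, t, \mathbb{A})-behaviorally indistinguishable from T if
\mathrm{Adv}_{\mathcal{D}}(A) \;\le\; \epsilon
  \quad \text{for every } A \in \mathbb{A}
  \text{ making at most } q \text{ queries and running in time at most } t.
```

Under this form, an adversary that guesses teacher versus student with accuracy $p$ in a balanced game has advantage $2p - 1$, so chance-level guessing corresponds to zero advantage.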

Load-bearing premise

The controlled 5,000-prompt behavioral probe suite and chosen adversary class are representative enough to detect meaningful behavioral differences that matter in practice.

What would settle it

An experiment in which a distilled student achieves distinguishing advantage below a chosen ε threshold across all tested categories and adversary classes on the same probe distribution would falsify the claim that semantic measures alone are insufficient.
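
The settling condition above can be sketched as a per-category check. This is a minimal illustration, not the paper's evaluation code: the record format, the function names, and the normalization `advantage = 2 * accuracy - 1` are assumptions, since the abstract does not state how advantage is normalized.

```python
from collections import defaultdict


def category_advantage(records):
    """Compute an empirical distinguishing advantage per prompt category.

    records: iterable of (category, correct) pairs, where `correct` is True
    when the adversary identified the teacher on that trial.
    Assumption: advantage = 2 * accuracy - 1 (0 means chance-level guessing).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, correct in records:
        totals[category] += 1
        hits[category] += int(correct)
    return {c: 2 * hits[c] / totals[c] - 1 for c in totals}


def within_epsilon(advantages, epsilon):
    """True only if every category's advantage magnitude is at most epsilon."""
    return all(abs(a) <= epsilon for a in advantages.values())
```

A student would pass this check only if `within_epsilon` holds for every adversary class tested, which is why a single aggregate score can hide a category that remains easy to distinguish.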

Figures

Figures reproduced from arXiv: 2605.30448 by Munawar Hasan.

Figure 1: Overview of the bounded behavioral indistinguishability framework. The controlled prompt suite is split into training …
Figure 2: Embedding similarity between teacher outputs and …
Figure 3: Empirical distinguishing advantage for learned …
Figure 5: Category-wise pairwise distinguishing advantage for Qwen base and Qwen LoRA under the consistency-filtered Llama …
Original abstract

Black-box LLM distillation is usually evaluated as an output-matching problem: a student is considered successful when its responses are semantically similar to, or task-consistent with, those of a teacher. However, output similarity does not imply that the student is behaviorally indistinguishable from the model it imitates. We introduce bounded behavioral indistinguishability, formalized as $(\epsilon,q,t,\mathbb{A})$-behavioral indistinguishability over an explicit prompt distribution, where $\epsilon$ bounds distinguishing advantage, $q$ bounds oracle queries, $t$ bounds computation, and $\mathbb{A}$ denotes the adversary class. We instantiate this notion on Qwen and Llama teacher-student pairs using a controlled $5,000$-prompt behavioral probe suite. For each family, we compare the teacher with both the base student and the LoRA-distilled student, measuring whether distillation reduces distinguishability rather than merely improving similarity. LoRA raises semantic similarity from $0.788$ to $0.862$ for Qwen and from $0.814$ to $0.874$ for Llama. Yet adversarial evaluation reveals remaining behavioral differences: learned discriminators retain nonzero advantage, and pairwise category analysis shows artifacts concentrated in style/format, robustness, and domain-technical prompts. A pairwise teacher-identification adversary confirms this trend. With a different-family Llama judge and A/B-swap consistency filtering, Qwen distinguishing advantage drops from $0.158$ for the base student to $0.081$ after LoRA distillation. Query-budget experiments show that disagreement-guided acquisition does not consistently outperform stratified random sampling, indicating that coverage and diversity remain strong baselines. Our results show that semantic fidelity is useful but insufficient: black-box LLM distillation requires bounded, adversarial, and category-aware evaluation.
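
The A/B-swap consistency filtering mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the judge prompt or interface, so everything here is hypothetical: `judge(prompt, first, second)` is an assumed callable that returns "A" or "B" for the response it believes came from the teacher, and `length_judge` is a toy stand-in for an LLM judge.

```python
def swap_consistent_accuracy(items, judge):
    """Teacher-identification accuracy on swap-consistent trials only.

    items: list of (prompt, teacher_output, student_output).
    judge: hypothetical callable returning "A" if it believes the first
           response is the teacher's, "B" otherwise.
    Returns (accuracy over kept trials, number of kept trials).
    """
    kept = 0
    correct = 0
    for prompt, teacher_out, student_out in items:
        pick_1 = judge(prompt, teacher_out, student_out)  # teacher shown as A
        pick_2 = judge(prompt, student_out, teacher_out)  # teacher shown as B
        chose_teacher_1 = pick_1 == "A"
        chose_teacher_2 = pick_2 == "B"
        if chose_teacher_1 != chose_teacher_2:
            continue  # verdict flipped with position: treat as position bias
        kept += 1
        correct += int(chose_teacher_1)
    accuracy = correct / kept if kept else float("nan")
    return accuracy, kept


def length_judge(prompt, first, second):
    """Toy judge (illustration only): picks the longer response, ties go to A."""
    return "A" if len(first) >= len(second) else "B"
```

Trials where the verdict depends on presentation order are discarded, so the remaining accuracy reflects a preference for a response rather than for a slot. The cost is a smaller effective sample, which is worth reporting alongside the advantage.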

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that semantic similarity metrics are insufficient for evaluating black-box LLM distillation success and introduces a parameterized notion of bounded behavioral indistinguishability, formalized as (ε, q, t, A)-behavioral indistinguishability over an explicit prompt distribution. Using a controlled 5,000-prompt behavioral probe suite on Qwen and Llama teacher-student pairs, it shows that LoRA distillation improves semantic similarity (0.788→0.862 for Qwen; 0.814→0.874 for Llama) but leaves nonzero distinguishing advantage under learned discriminators, pairwise category analysis, and a teacher-identification adversary (e.g., 0.158→0.081 for Qwen with Llama judge). The conclusion is that distillation evaluation requires bounded, adversarial, and category-aware methods rather than relying on output similarity alone.

Significance. If the probe suite and adversary class are representative, the work supplies a clean, query- and compute-bounded formalization that could shift evaluation practices in LLM distillation away from purely semantic metrics toward adversarial testing. The explicit parameterization (with no free parameters in the definition itself) and the empirical demonstration that similarity gains do not imply indistinguishability on two model families are concrete strengths that could support more falsifiable claims about distillation quality.

major comments (2)
  1. [Abstract] Abstract, instantiation paragraph: the central claim that 'semantic fidelity is useful but insufficient' and that black-box distillation 'requires bounded, adversarial, and category-aware evaluation' rests on the 5,000-prompt suite and adversary class A being adequate to detect practically relevant behavioral differences; the manuscript provides no justification, coverage analysis, or validation that this suite densely samples prompt distributions on which downstream differences would matter or that A is the strongest feasible distinguisher within the stated (q, t) bounds.
  2. [Abstract] Abstract, results on distinguishing advantage: the reported drop from 0.158 to 0.081 (and the category artifacts in style/format, robustness, domain-technical prompts) is tied to the specific (5,000-prompt, learned-discriminator, Llama-judge) instantiation; without evidence that the probe does not systematically under-sample categories where the distilled model already matches the teacher, the observed gap between similarity and indistinguishability may be an artifact of the chosen suite rather than generic evidence against semantic evaluation.
minor comments (2)
  1. [Abstract] Abstract: no error bars, confidence intervals, or statistical details are reported for the similarity scores or distinguishing advantages, making it difficult to assess the reliability of the reported deltas.
  2. [Abstract] Abstract: the query-budget experiments are mentioned but lack any table or quantitative comparison showing how disagreement-guided acquisition compares to stratified random sampling across the two model families.
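
On the first minor comment, one common way to attach uncertainty to a distinguishing advantage is a percentile bootstrap over per-prompt adversary outcomes. The sketch below is an assumption about how such intervals could be produced, not something the paper reports; the function name, the 0/1 outcome encoding, and the normalization `2 * accuracy - 1` are all illustrative.

```python
import random


def bootstrap_advantage_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for advantage = 2 * accuracy - 1.

    correct: list of 0/1 outcomes, 1 when the adversary identified the teacher.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(correct)
    stats = []
    for _ in range(n_resamples):
        # Resample prompts with replacement and recompute the advantage.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(2 * sum(sample) / n - 1)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    point = 2 * sum(correct) / n - 1
    return point, lower, upper
```

If the intervals for the base and LoRA students overlap substantially, a drop such as the one reported for Qwen would be harder to read as a real reduction in distinguishability.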

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and commit to revisions that strengthen the justification for our evaluation setup.

  1. Referee: Abstract, instantiation paragraph: the central claim that 'semantic fidelity is useful but insufficient' and that black-box distillation 'requires bounded, adversarial, and category-aware evaluation' rests on the 5,000-prompt suite and adversary class A being adequate to detect practically relevant behavioral differences; the manuscript provides no justification, coverage analysis, or validation that this suite densely samples prompt distributions on which downstream differences would matter or that A is the strongest feasible distinguisher within the stated (q, t) bounds.

    Authors: We agree that explicit justification and coverage details would strengthen the claims. The 5,000-prompt suite was stratified across categories drawn from prior LLM evaluation literature (style/format, robustness, domain-technical) to promote diversity, and the results were replicated across the Qwen and Llama families. We did not claim that A is maximal, nor did we provide quantitative coverage metrics. In revision we will expand the methods section with prompt curation details, category distribution statistics, and an explicit limitations paragraph on the scope of the distribution and adversary class. This will better ground the parameterized claim that semantic similarity alone does not imply indistinguishability under the tested (q, t, A).
    revision: yes

  2. Referee: Abstract, results on distinguishing advantage: the reported drop from 0.158 to 0.081 (and the category artifacts in style/format, robustness, domain-technical prompts) is tied to the specific (5,000-prompt, learned-discriminator, Llama-judge) instantiation; without evidence that the probe does not systematically under-sample categories where the distilled model already matches the teacher, the observed gap between similarity and indistinguishability may be an artifact of the chosen suite rather than generic evidence against semantic evaluation.

    Authors: The nonzero distinguishing advantage is corroborated by three independent methods (learned discriminators, category-wise pairwise analysis, and teacher-identification adversary) and is consistent across two model families. The category analysis already localizes remaining artifacts rather than claiming uniform gaps. While a full sensitivity study on every possible category is absent, the convergent evidence across methods reduces the likelihood of a pure sampling artifact. In revision we will add a short discussion of prompt diversity and potential under-sampling risks, while clarifying that the results demonstrate insufficiency of semantic metrics in this controlled, bounded setting.
    revision: partial
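
The stratified sampling discussed in this exchange, which the abstract also uses as the baseline in its query-budget experiments, can be sketched as follows. This is a minimal illustration under assumptions: the `(prompt_id, category)` input format and the round-robin quota rule are hypothetical, and the paper's actual stratification procedure is not described in the abstract.

```python
import random
from collections import defaultdict


def stratified_sample(prompts, budget, seed=0):
    """Pick up to `budget` prompt ids with near-equal counts per category.

    prompts: list of (prompt_id, category) pairs.
    Strategy (assumed): shuffle each category, then draw round-robin across
    categories until the budget is spent or every category is exhausted.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for prompt_id, category in prompts:
        strata[category].append(prompt_id)
    for ids in strata.values():
        rng.shuffle(ids)

    chosen = []
    categories = sorted(strata)
    while len(chosen) < budget and any(strata[c] for c in categories):
        for c in categories:
            if strata[c] and len(chosen) < budget:
                chosen.append(strata[c].pop())
    return chosen
```

Equal quotas guarantee that small categories are probed even at low query budgets, which is one plausible reason a coverage-oriented baseline stays competitive with disagreement-guided acquisition.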

Circularity Check

0 steps flagged

No circularity: definition introduced independently and results are direct empirical comparisons

The paper defines bounded behavioral indistinguishability as a new parameterized notion (ε,q,t,A) over an explicit prompt distribution without deriving it from any prior fitted quantities or self-referential equations. Experiments consist of direct measurements on a fixed 5,000-prompt suite comparing base and LoRA students against teachers, reporting similarity scores and distinguishing advantages without any step that renames a fit as a prediction or reduces a claimed result to its own inputs by construction. No load-bearing self-citations or uniqueness theorems appear in the provided text. The derivation chain is therefore self-contained as an empirical instantiation of an independently stated definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based solely on abstract; no free parameters, invented entities, or detailed axioms beyond the domain assumption that the probe suite captures relevant behavior.

axioms (1)
  • domain assumption: The 5,000-prompt suite and adversary class A suffice to measure whether distillation reduces behavioral distinguishability.
    Abstract states the instantiation and reports results on this specific suite without further justification.

pith-pipeline@v0.9.1-grok · 5846 in / 1253 out tokens · 26605 ms · 2026-06-29T08:48:25.108246+00:00 · methodology

