Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

Munawar Hasan

arxiv: 2605.30448 · v1 · pith:6NGFGIYHnew · submitted 2026-05-28 · 💻 cs.LG · cs.CL


Pith reviewed 2026-06-29 08:48 UTC · model grok-4.3

classification: cs.LG · cs.CL

keywords: LLM distillation · behavioral indistinguishability · black-box evaluation · adversarial testing · semantic similarity · LoRA adaptation · prompt probes


The pith

Black-box LLM distillation improves semantic similarity but leaves measurable behavioral differences detectable by adversaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard evaluation of black-box LLM distillation, which relies on semantic similarity or task consistency between teacher and student outputs, is insufficient to establish true behavioral equivalence. It introduces a formal definition of bounded behavioral indistinguishability parameterized by distinguishing advantage, query limits, computation bounds, and adversary class, then applies this to Qwen and Llama teacher-student pairs via a fixed 5,000-prompt probe set. Experiments show LoRA distillation raises similarity scores yet leaves nonzero advantage for learned discriminators, with gaps concentrated in specific prompt categories such as style, robustness, and technical domains. A cross-family judge and consistency filter confirm the pattern, and query-strategy tests indicate that simple coverage baselines remain competitive.
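The similarity scores referred to above are embedding-based. A minimal sketch of how such a score could be computed, assuming each teacher and student response has already been mapped to a vector by some sentence encoder. The function names and the simple mean aggregation are hypothetical; the paper's encoder and aggregation are not specified on this page.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(teacher_vecs, student_vecs):
    """Average cosine similarity over prompt-aligned response embeddings.

    Assumption: teacher_vecs[i] and student_vecs[i] answer the same prompt.
    """
    if len(teacher_vecs) != len(student_vecs) or not teacher_vecs:
        raise ValueError("need two non-empty, equally long lists")
    scores = [cosine_similarity(t, s) for t, s in zip(teacher_vecs, student_vecs)]
    return sum(scores) / len(scores)
```

A score of this kind summarizes how close the outputs are on average; it says nothing about whether a classifier could still separate the two sets of outputs, which is the gap the paper targets.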

Core claim

Semantic fidelity is useful but insufficient for black-box LLM distillation; evaluation instead requires bounded, adversarial, and category-aware measures of behavioral indistinguishability, because even after LoRA adaptation the student models retain detectable differences from their teachers on the probe suite.

What carries the argument

The (ε,q,t,𝔸)-behavioral indistinguishability definition over an explicit prompt distribution, operationalized through a controlled 5,000-prompt behavioral probe suite and pairwise teacher-identification adversaries.
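One way to write such a definition, following the standard cryptographic distinguishing game. This is a sketch consistent with the parameter roles named on this page, not necessarily the paper's exact statement. The notation is assumed: $T$ and $S$ are the teacher and student, $\mathcal{D}$ is the prompt distribution, and $\mathcal{A}^{M}$ is an adversary with oracle access to model $M$ on prompts drawn from $\mathcal{D}$.

$$
\mathrm{Adv}_{\mathcal{D}}(\mathcal{A}) \;=\; \Bigl|\, \Pr\bigl[\mathcal{A}^{T} = 1\bigr] \;-\; \Pr\bigl[\mathcal{A}^{S} = 1\bigr] \Bigr|
$$

$$
S \text{ is } (\epsilon, q, t, \mathbb{A})\text{-indistinguishable from } T \text{ over } \mathcal{D}
\quad\text{if}\quad
\mathrm{Adv}_{\mathcal{D}}(\mathcal{A}) \le \epsilon
\;\text{ for every } \mathcal{A} \in \mathbb{A} \text{ making at most } q \text{ queries and running in time at most } t.
$$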

Load-bearing premise

The controlled 5,000-prompt behavioral probe suite and chosen adversary class are representative enough to detect meaningful behavioral differences that matter in practice.

What would settle it

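An experiment in which a distilled student achieves distinguishing advantage below a chosen ε threshold across all tested categories and adversary classes on the same probe distribution would falsify the empirical claim that distilled students retain detectable differences from their teachers. Falsifying the broader claim that semantic measures alone are insufficient would further require showing that similarity scores reliably predict when that threshold is met.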

Figures

Figures reproduced from arXiv: 2605.30448 by Munawar Hasan.

**Figure 1.** Overview of the bounded behavioral indistinguishability framework. The controlled prompt suite is split into training …

**Figure 2.** Embedding similarity between teacher outputs and …

**Figure 3.** Empirical distinguishing advantage for learned …

**Figure 5.** Category-wise pairwise distinguishing advantage for Qwen base and Qwen LoRA under the consistency-filtered Llama …

Original abstract

Black-box LLM distillation is usually evaluated as an output-matching problem: a student is considered successful when its responses are semantically similar to, or task-consistent with, those of a teacher. However, output similarity does not imply that the student is behaviorally indistinguishable from the model it imitates. We introduce bounded behavioral indistinguishability, formalized as $(\epsilon,q,t,\mathbb{A})$-behavioral indistinguishability over an explicit prompt distribution, where $\epsilon$ bounds distinguishing advantage, $q$ bounds oracle queries, $t$ bounds computation, and $\mathbb{A}$ denotes the adversary class. We instantiate this notion on Qwen and Llama teacher-student pairs using a controlled $5,000$-prompt behavioral probe suite. For each family, we compare the teacher with both the base student and the LoRA-distilled student, measuring whether distillation reduces distinguishability rather than merely improving similarity. LoRA raises semantic similarity from $0.788$ to $0.862$ for Qwen and from $0.814$ to $0.874$ for Llama. Yet adversarial evaluation reveals remaining behavioral differences: learned discriminators retain nonzero advantage, and pairwise category analysis shows artifacts concentrated in style/format, robustness, and domain-technical prompts. A pairwise teacher-identification adversary confirms this trend. With a different-family Llama judge and A/B-swap consistency filtering, Qwen distinguishing advantage drops from $0.158$ for the base student to $0.081$ after LoRA distillation. Query-budget experiments show that disagreement-guided acquisition does not consistently outperform stratified random sampling, indicating that coverage and diversity remain strong baselines. Our results show that semantic fidelity is useful but insufficient: black-box LLM distillation requires bounded, adversarial, and category-aware evaluation.
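The A/B-swap consistency filtering mentioned in the abstract can be read as a guard against position bias in the judge. A minimal sketch under that reading: the judge is asked which of two responses came from the teacher, once with the teacher shown as option A and once with it shown as option B, and only prompts where both presentations point to the same underlying model are kept. The record layout and function name are hypothetical, and the paper's exact filtering rule may differ.

```python
def filter_swap_consistent(records):
    """Keep judgments that survive an A/B position swap.

    Each record is a pair (first_pick, swapped_pick) with values "A" or "B".
    Assumption: in the first presentation the teacher is shown as A, and in
    the swapped presentation the teacher is shown as B.

    Returns a list of booleans, one per kept prompt: True if the judge
    consistently picked the teacher, False if it consistently picked the
    student. Position-dependent judgments are dropped.
    """
    kept = []
    for first_pick, swapped_pick in records:
        if first_pick not in ("A", "B") or swapped_pick not in ("A", "B"):
            raise ValueError("picks must be 'A' or 'B'")
        chose_teacher_first = first_pick == "A"
        chose_teacher_swapped = swapped_pick == "B"
        if chose_teacher_first == chose_teacher_swapped:
            kept.append(chose_teacher_first)
    return kept
```

A judge that always answers "A" regardless of content produces no kept records under this rule, so its position bias cannot inflate or deflate the measured advantage.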

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper formalizes bounded behavioral indistinguishability and shows that LoRA distillation improves similarity but leaves nonzero distinguishing advantage on the paper's probe suite.


The core takeaway is that output similarity metrics fall short for black-box distillation because they miss cases where an adversary can still tell the student model apart from the teacher. The paper defines this as (ε, q, t, A)-behavioral indistinguishability over a prompt distribution and tests it on Qwen and Llama teacher-student pairs.

It does two useful things. First, the parameterization cleanly separates the distinguishing advantage bound, query limit, compute limit, and adversary class without reducing to fitted parameters. Second, the experiments compare base students against LoRA-distilled ones on the same 5,000-prompt suite, showing similarity rising (0.788 to 0.862 for Qwen) while learned discriminators retain advantage that drops but stays positive (0.158 to 0.081). The category breakdown and teacher-identification test add concrete detail on where the gaps appear.
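To make "advantage that drops but stays positive" concrete, here is a minimal sketch of one common way to turn a discriminator's held-out accuracy into an empirical distinguishing advantage. It assumes a balanced binary task (teacher output versus student output) and the convention advantage = 2 · accuracy − 1. The paper's exact estimator is not given on this page, so the convention is an assumption.

```python
def distinguishing_advantage(labels, predictions):
    """Empirical advantage of a binary discriminator on a balanced test set.

    labels[i] is 1 if sample i came from the teacher and 0 if it came from
    the student; predictions[i] is the discriminator's guess.
    Returns 2 * accuracy - 1, so random guessing gives 0 and a perfect
    discriminator gives 1.
    """
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equally long")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    accuracy = correct / len(labels)
    return 2 * accuracy - 1
```

Under this convention, an advantage of 0.158 would correspond to an accuracy of 0.579 and an advantage of 0.081 to an accuracy of about 0.54. Both sit only modestly above chance, which is why the missing error bars matter.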

The main limitation is the probe suite itself. Nothing demonstrates that 5,000 prompts densely cover the distributions where downstream differences would actually matter, or that the chosen adversary class A is close to the strongest feasible one within the bounds. If the suite over-weights style artifacts or under-samples domains where the models already match, the reported gap between similarity and indistinguishability could be narrower or wider in practice. The abstract also omits error bars and full statistical methods, so the numerical claims need the full paper's details to assess stability.

This is for people working on distillation evaluation, safety verification, or deployment standards who want a more adversarial lens than pure semantic matching. It deserves peer review because the formal definition stands independently of the experimental results and the empirical direction is worth tightening, even if the current instantiation is preliminary.

Referee Report

2 major / 2 minor

Summary. The paper claims that semantic similarity metrics are insufficient for evaluating black-box LLM distillation success and introduces a parameterized notion of bounded behavioral indistinguishability, formalized as (ε, q, t, A)-behavioral indistinguishability over an explicit prompt distribution. Using a controlled 5,000-prompt behavioral probe suite on Qwen and Llama teacher-student pairs, it shows that LoRA distillation improves semantic similarity (0.788→0.862 for Qwen; 0.814→0.874 for Llama) but leaves nonzero distinguishing advantage under learned discriminators, pairwise category analysis, and a teacher-identification adversary (e.g., 0.158→0.081 for Qwen with Llama judge). The conclusion is that distillation evaluation requires bounded, adversarial, and category-aware methods rather than relying on output similarity alone.

Significance. If the probe suite and adversary class are representative, the work supplies a clean, query- and compute-bounded formalization that could shift evaluation practices in LLM distillation away from purely semantic metrics toward adversarial testing. The explicit parameterization (with no free parameters in the definition itself) and the empirical demonstration that similarity gains do not imply indistinguishability on two model families are concrete strengths that could support more falsifiable claims about distillation quality.

major comments (2)

[Abstract] Abstract, instantiation paragraph: the central claim that 'semantic fidelity is useful but insufficient' and that black-box distillation 'requires bounded, adversarial, and category-aware evaluation' rests on the 5,000-prompt suite and adversary class A being adequate to detect practically relevant behavioral differences; the manuscript provides no justification, coverage analysis, or validation that this suite densely samples prompt distributions on which downstream differences would matter or that A is the strongest feasible distinguisher within the stated (q, t) bounds.
[Abstract] Abstract, results on distinguishing advantage: the reported drop from 0.158 to 0.081 (and the category artifacts in style/format, robustness, domain-technical prompts) is tied to the specific (5,000-prompt, learned-discriminator, Llama-judge) instantiation; without evidence that the probe does not systematically under-sample categories where the distilled model already matches the teacher, the observed gap between similarity and indistinguishability may be an artifact of the chosen suite rather than generic evidence against semantic evaluation.
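Both major comments turn on how the probe suite is distributed across categories. A minimal sketch of a per-category breakdown that reports the sample count next to each advantage, so that thinly sampled categories are visible. The row layout is hypothetical and the 2 · accuracy − 1 convention is an assumption, not the paper's stated estimator.

```python
from collections import defaultdict


def advantage_by_category(rows):
    """Per-category advantage with sample counts.

    rows is an iterable of (category, correct) pairs, where correct is True
    when the adversary identified the teacher on that prompt.
    Returns {category: (advantage, n)} with advantage = 2 * accuracy - 1.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, correct in rows:
        totals[category] += 1
        hits[category] += int(bool(correct))
    return {c: (2 * hits[c] / totals[c] - 1, totals[c]) for c in totals}
```

Reporting `n` alongside each value lets a reader judge whether a large category-level gap rests on a few prompts.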

minor comments (2)

[Abstract] Abstract: no error bars, confidence intervals, or statistical details are reported for the similarity scores or distinguishing advantages, making it difficult to assess the reliability of the reported deltas.
[Abstract] Abstract: the query-budget experiments are mentioned but lack any table or quantitative comparison showing how disagreement-guided acquisition compares to stratified random sampling across the two model families.
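The first minor comment asks for error bars. A minimal sketch of a percentile bootstrap interval for the distinguishing advantage, assuming per-prompt correctness flags are available and using the convention advantage = 2 · accuracy − 1. The function name, resample count, and seed are illustrative choices, not taken from the paper.

```python
import random


def bootstrap_advantage_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for advantage = 2 * accuracy - 1.

    correct is a list of booleans, one per test prompt.
    Returns (lower, upper) bounds at confidence level 1 - alpha.
    """
    n = len(correct)
    if n == 0:
        raise ValueError("need at least one observation")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    stats = []
    for _ in range(n_resamples):
        # Resample prompts with replacement and recompute the advantage.
        hits = sum(correct[rng.randrange(n)] for _ in range(n))
        stats.append(2 * hits / n - 1)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[max(int((1 - alpha / 2) * n_resamples) - 1, 0)]
    return lower, upper
```

If the lower bound of such an interval for the LoRA student sat at or below zero, the "nonzero advantage" claim would be much weaker than the point estimate suggests.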

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and commit to revisions that strengthen the justification for our evaluation setup.


Referee: Abstract, instantiation paragraph: the central claim that 'semantic fidelity is useful but insufficient' and that black-box distillation 'requires bounded, adversarial, and category-aware evaluation' rests on the 5,000-prompt suite and adversary class A being adequate to detect practically relevant behavioral differences; the manuscript provides no justification, coverage analysis, or validation that this suite densely samples prompt distributions on which downstream differences would matter or that A is the strongest feasible distinguisher within the stated (q, t) bounds.

Authors: We agree that explicit justification and coverage details would strengthen the claims. The 5,000-prompt suite was stratified across categories drawn from prior LLM evaluation literature (style/format, robustness, domain-technical) to promote diversity, with results replicated across Qwen and Llama families. We did not claim A is maximal or provide quantitative coverage metrics. In revision we will expand the methods section with prompt curation details, category distribution statistics, and an explicit limitations paragraph on the scope of the distribution and adversary class. This will better ground the parameterized claim that semantic similarity alone does not imply indistinguishability under the tested (q, t, A). revision: yes
Referee: Abstract, results on distinguishing advantage: the reported drop from 0.158 to 0.081 (and the category artifacts in style/format, robustness, domain-technical prompts) is tied to the specific (5,000-prompt, learned-discriminator, Llama-judge) instantiation; without evidence that the probe does not systematically under-sample categories where the distilled model already matches the teacher, the observed gap between similarity and indistinguishability may be an artifact of the chosen suite rather than generic evidence against semantic evaluation.

Authors: The nonzero distinguishing advantage is corroborated by three independent methods (learned discriminators, category-wise pairwise analysis, and teacher-identification adversary) and is consistent across two model families. The category analysis already localizes remaining artifacts rather than claiming uniform gaps. While a full sensitivity study on every possible category is absent, the convergent evidence across methods reduces the likelihood of a pure sampling artifact. In revision we will add a short discussion of prompt diversity and potential under-sampling risks, while clarifying that the results demonstrate insufficiency of semantic metrics in this controlled, bounded setting. revision: partial
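The rebuttal describes the suite as stratified across categories, and the abstract reports that stratified random sampling remains a strong baseline under a query budget. A minimal sketch of such a baseline, assuming equal allocation across categories via round-robin draws. The function name and the equal-allocation rule are assumptions; the paper may allocate proportionally or otherwise.

```python
import random
from collections import defaultdict


def stratified_sample(prompts, budget, seed=0):
    """Pick up to `budget` prompt ids, spread as evenly as possible over categories.

    prompts is a list of (prompt_id, category) pairs.
    Categories that run out of prompts simply stop contributing.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for prompt_id, category in prompts:
        by_category[category].append(prompt_id)
    categories = sorted(by_category)
    for category in categories:
        rng.shuffle(by_category[category])  # random order within each stratum

    chosen = []
    depth = 0
    while len(chosen) < budget:
        progressed = False
        # One round-robin pass: take the next unused prompt from each category.
        for category in categories:
            if len(chosen) >= budget:
                break
            pool = by_category[category]
            if depth < len(pool):
                chosen.append(pool[depth])
                progressed = True
        if not progressed:
            break  # every category is exhausted
        depth += 1
    return chosen
```

A disagreement-guided strategy would replace the within-category shuffle with a ranking by teacher-student disagreement. The abstract's finding is that doing so does not consistently beat this kind of coverage-first baseline.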

Circularity Check

0 steps flagged

No circularity: definition introduced independently and results are direct empirical comparisons


The paper defines bounded behavioral indistinguishability as a new parameterized notion (ε,q,t,A) over an explicit prompt distribution without deriving it from any prior fitted quantities or self-referential equations. Experiments consist of direct measurements on a fixed 5,000-prompt suite comparing base and LoRA students against teachers, reporting similarity scores and distinguishing advantages without any step that renames a fit as a prediction or reduces a claimed result to its own inputs by construction. No load-bearing self-citations or uniqueness theorems appear in the provided text. The derivation chain is therefore self-contained as an empirical instantiation of an independently stated definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review based solely on abstract; no free parameters, invented entities, or detailed axioms beyond the domain assumption that the probe suite captures relevant behavior.

axioms (1)

domain assumption: The 5,000-prompt suite and adversary class A suffice to measure whether distillation reduces behavioral distinguishability.
Abstract states the instantiation and reports results on this specific suite without further justification.



Reference graph

Works this paper leans on

29 extracted references · 17 canonical work pages · 11 internal anchors

[1] Xiaoyuan Zhu, Yaowen Ye, Tianyi Qiu, Hanlin Zhu, Sijun Tan, Ajraf Mannan, Jonathan Michala, Raluca Ada Popa, and Willie Neiswanger. Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test, 2025.

[2] Jie Zhang, Meng Ding, Yang Liu, Jue Hong, and Florian Tramèr. Black-box Optimization of LLM Outputs by Asking for Directions, 2025.

[3] Ruixuan Liu, David Evans, and Li Xiong. Beyond Indistinguishability: Measuring Extraction Risk in LLM APIs,

[4] IEEE Symposium on Security and Privacy (S&P) 2026

[5] Qwen. https://github.com/QwenLM. Accessed: May 2026.

[6] Llama. https://www.llama.com/. Accessed: May 2026.

[7] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR, 1(2):3, 2022. https://doi.org/10.48550/arXiv.2106.09685

[8] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531, 2015. https://doi.org/10.48550/arXiv.1503.02531

[10] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for Natural Language Understanding. In Findings of the association for computational linguistics: EMNLP 2020, pages 4163–4174, 2020. https://doi.org/10.48550/arXiv.1909.10351

[11] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge Distillation of Large Language Models. In Proceedings of ICLR, 2024.

[12] Jongwoo Ko, Sungnyun Kim, Tianyi Chen, and Se-Young Yun. DISTILLM: Towards Streamlined Distillation for Large Language Models. arXiv preprint arXiv:2402.03898, 2024. https://doi.org/10.48550/arXiv.2402.03898

[13] Florian Tramèr, Fan Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction APIs. In 25th USENIX security symposium (USENIX Security 16), pages 601–618, 2016.

[14] Matthew Jagielski, Nicholas Carlini, David Berthelot, Alex Kurakin, and Nicolas Papernot. High Accuracy and High Fidelity Extraction of Neural Networks. In 29th USENIX security symposium (USENIX Security 20), pages 1345–1362, 2020. https://doi.org/10.48550/arXiv.1909.01838

[15] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 3982–3992, 2019. https://doi.org/10.48550/arXiv.1908.10084

[16] Shafi Goldwasser and Silvio Micali. Probabilistic encryption & how to play mental poker keeping secret all partial information. In Providing sound foundations for cryptography: on the work of Shafi Goldwasser and Silvio Micali, pages 173–201. 2019.

[17] Phillip Rogaway and Yusi Zhang. Simplifying Game-Based Definitions: Indistinguishability up to Correctness and Its Application to Stateful AE. In Annual International Cryptology Conference, pages 3–32. Springer, 2018.

[18] Victor Shoup. Sequences of games: A Tool for Taming Complexity in Security Proofs. Cryptology ePrint Archive, 2004.

[19] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in neural information processing systems, 36:46595–46623, 2023. https://doi.org/10.48550/arXiv.2306.05685

[20] Tian Lan, Wenwei Zhang, Chen Xu, Heyan Huang, Dahua Lin, Kai Chen, and Xian-ling Mao. CriticEval: Evaluating Large Language Model as Critic. arXiv preprint arXiv:2402.13764, 2024. https://doi.org/10.48550/arXiv.2402.13764

[21] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. arXiv preprint arXiv:2404.04475, 2024. https://doi.org/10.48550/arXiv.2404.04475

[22] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073, 2022. https://doi.org/10.48550/arXiv.2212.08073

[23] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022. https://doi.org/10.48550/arXiv.2203.02155

[24] Munawar Hasan. Incompleteness of AI Safety Verification via Kolmogorov Complexity. arXiv preprint arXiv:2604.04876, 2026. https://doi.org/10.48550/arXiv.2604.04876

[25] Burr Settles. Active Learning Literature Survey. 2009.

[26] Ozan Sener and Silvio Savarese. Active Learning for Convolutional Neural Networks: A Core-Set Approach. arXiv preprint arXiv:1708.00489, 2017. https://doi.org/10.48550/arXiv.1708.00489

[27] Yi Liu, Luting Wang, Zongheng Tang, Yue Liao, Yifan Sun, Lijun Zhang, and Si Liu. Knowledge Distillation via Query Selection for Detection Transformer. arXiv preprint arXiv:2409.06443, 2024. https://doi.org/10.48550/arXiv.2409.06443

[28] Minghan Li and Guodong Zhou. Retrieval-Feedback-Driven Distillation and Preference Alignment for Efficient LLM-based Query Expansion. arXiv preprint arXiv:2603.13776, 2026. https://doi.org/10.48550/arXiv.2603.13776

[29] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692, 2019.

[30] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
