pith. machine review for the scientific record.

arxiv: 2604.25794 · v1 · submitted 2026-04-28 · 💻 cs.LG · cs.CV

Recognition: unknown

Diverse Image Priors for Black-box Data-free Knowledge Distillation

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 16:38 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords knowledge distillation · data-free learning · black-box distillation · synthetic image generation · contrastive learning · model compression · privacy-preserving AI

The pith

DIP-KD enables effective knowledge distillation from black-box teachers by synthesizing diverse image priors when no data or full predictions are available.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces DIP-KD to solve knowledge distillation in cases where the teacher model can only provide top-1 class predictions and the original training data is unavailable due to privacy or proprietary reasons. The approach generates synthetic images that represent a wide range of visual patterns, applies contrastive learning to make these images more distinguishable from one another, and then uses a primer student to convert the limited predictions into soft labels for training the final student model. A sympathetic reader would care because this setup is common in decentralized or secure AI systems, and successful distillation here means smaller, faster models can be created without risking data leaks. The method is shown to outperform prior techniques on twelve different evaluation benchmarks.
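To make the constraint concrete, the whole teacher interface in this setting reduces to hard class indices. A minimal sketch of that interface in PyTorch (the wrapper name and framework are illustrative assumptions, not the paper's code):

```python
import torch

class BlackBoxTeacher:
    """Wraps a deployed classifier so downstream code sees only top-1
    labels: no logits, probabilities, gradients, or features."""

    def __init__(self, model):
        self._model = model.eval()

    @torch.no_grad()
    def top1(self, x):
        # The only signal the distillation pipeline is assumed to receive.
        return self._model(x).argmax(dim=1)
```

Everything else in the pipeline (the image priors, the Contrast phase, the primer student) exists to recover soft supervision from this single class index per query.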

Core claim

The central claim is that, in the black-box data-free knowledge distillation scenario, a three-phase pipeline of (1) synthesizing diverse image priors to capture varied semantics, (2) sharpening the distinctions between synthetic samples with contrastive learning, and (3) distilling through a novel primer student that enables soft-probability knowledge transfer allows the student model to acquire the teacher's expertise effectively. The evidence offered is superior performance on twelve benchmarks, with ablations confirming the importance of data diversity.

What carries the argument

A three-phase collaborative pipeline that synthesizes diverse image priors, uses contrastive learning to enhance sample distinction, and employs a primer student for soft-probability distillation from top-1 predictions.

If this is right

  • State-of-the-art performance is achieved across 12 benchmarks in the black-box data-free KD setting.
  • Data diversity is shown to be critical for knowledge acquisition in environments with restricted access.
  • The framework operates successfully using only top-1 predictions without requiring original training data.
  • Ablation studies validate that each component of the pipeline contributes to the overall results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the image prior synthesis to other data types like text or tabular data could broaden the applicability to non-vision tasks.
  • Improving the diversity of generated samples might yield larger gains than changes to the distillation objective itself.
  • This method could facilitate model deployment in regulated industries by avoiding the need to share sensitive datasets.
  • Testing the robustness of the synthetic priors against distribution shifts would provide further validation of the approach.

Load-bearing premise

That the generated synthetic images with enhanced diversity can adequately proxy the original data distribution to allow the teacher’s knowledge to be transferred via top-1 predictions alone.

What would settle it

If experiments on the reported benchmarks show that DIP-KD does not achieve higher accuracy than the strongest existing black-box data-free KD baseline, or if ablations indicate that data diversity does not affect performance, the central claims would be challenged.

Figures

Figures reproduced from arXiv: 2604.25794 by Dang Nguyen, Kien Do, Sunil Gupta, Tri-Nhan Vo, Trung Le.

Figure 1
Figure 1: Illustration of our method DIP-KD of three phases. (1) Synthesis: We craft image priors XP with three sub-components: hierarchical noise, nonlinear transformation, and semantic cutmixing. Synthetic images x ∼ XP have diverse and distinct patterns. (2) Contrast: We construct the dataset D = {x | x ∼ XP} and train a primer student S0 as a feature extractor. Then, we enforce an instance-discriminator R to disti…
Figure 2
Figure 2: In the Synthesis phase, we generate image priors XP with 3 sub-components. (1) Hierarchical noise: We sample xhn from multi-scale noise images to capture hierarchical structures. (2) Nonlinear transform: We transform xhn to xnt with nonlinear filters to capture nonlinearity. (3) Semantic cutmixing: We apply cutmixing on xnt with a semantic mask m̂ to construct the final images x, simulating diverse sem… (a minimal code sketch of this phase follows the figure list)
Figure 3
Figure 3: Feature embeddings of complex datasets, noise (…
Figure 4
Figure 4: Feature embeddings of image priors XP before and after the Contrast phase
Figure 5
Figure 5: Distillation accuracy (%) by number of image priors
Figure 6
Figure 6: Distillation accuracy (%) by compression ratio (CR).
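Figure 2's caption outlines the Synthesis phase as three sub-components. The following is a minimal sketch of one way they could compose; the specific scales, the tanh filter, and the Beta-sampled cut size are illustrative assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def synthesize_prior(size=32, channels=3, scales=(4, 8, 16, 32), beta=1.0):
    """One synthetic image prior from (1) hierarchical noise,
    (2) a nonlinear transform, and (3) cutmixing with a second sample."""
    def hierarchical_noise():
        # Sum noise drawn at several resolutions and upsampled, so the
        # image carries both coarse and fine structure.
        x = torch.zeros(1, channels, size, size)
        for s in scales:
            x += F.interpolate(torch.randn(1, channels, s, s),
                               size=(size, size), mode="bilinear",
                               align_corners=False)
        return x.squeeze(0)

    # (1)+(2): hierarchical noise passed through a nonlinear filter
    # (tanh stands in for the paper's unspecified nonlinear transforms).
    x = torch.tanh(hierarchical_noise())
    other = torch.tanh(hierarchical_noise())

    # (3): paste a random patch from a second sample (CutMix-style),
    # a crude stand-in for the paper's semantic mask m̂.
    lam = torch.distributions.Beta(beta, beta).sample()
    cut = int(size * (1.0 - lam).sqrt())
    cy, cx = torch.randint(0, size, (2,)).tolist()
    y1, y2 = max(cy - cut // 2, 0), min(cy + cut // 2, size)
    x1, x2 = max(cx - cut // 2, 0), min(cx + cut // 2, size)
    x[:, y1:y2, x1:x2] = other[:, y1:y2, x1:x2]
    return x  # tensor of shape [channels, size, size]
```

Stacking many such samples gives the dataset D = {x | x ∼ XP} that the Contrast phase of Figure 1 then operates on.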
read the original abstract

Knowledge distillation (KD) represents a vital mechanism to transfer expertise from complex teacher networks to efficient student models. However, in decentralized or secure AI ecosystems, privacy regulations and proprietary interests often restrict access to the teacher's interface and original datasets. These constraints define a challenging black-box data-free KD scenario where only top-1 predictions and no training data are available. While recent approaches utilize synthetic data, they still face limitations in data diversity and distillation signals. We propose Diverse Image Priors Knowledge Distillation (DIP-KD), a framework that addresses these challenges through a three-phase collaborative pipeline: (1) Synthesis of image priors to capture diverse visual patterns and semantics; (2) Contrast to enhance the collective distinction between synthetic samples via contrastive learning; and (3) Distillation via a novel primer student that enables soft-probability KD. Our evaluation across 12 benchmarks shows that DIP-KD achieves state-of-the-art performance, with ablations confirming data diversity as critical for knowledge acquisition in restricted AI environments.
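The Contrast phase is described only as enhancing the collective distinction between synthetic samples via contrastive learning. One generic reading is an instance-discrimination (NT-Xent) objective over two augmented views of each prior, embedded by the primer student; the sketch below implements that generic loss, which may differ from the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def instance_discrimination_loss(z1, z2, tau=0.5):
    """NT-Xent over two views of the same batch of synthetic priors.
    z1, z2: [N, D] embeddings of the two views (e.g. from the primer
    student's feature extractor)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                       # [2N, D]
    sim = z @ z.t() / tau                                 # scaled cosine sims
    mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))            # drop self-pairs
    n = z1.size(0)
    # The positive for row i is its other view at index (i + n) mod 2N.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```

In the pipeline of Figure 1, a loss of this kind would be what the instance discriminator R enforces, with the primer student S0 serving as the feature extractor.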

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes DIP-KD for black-box data-free knowledge distillation, where only top-1 predictions from the teacher are available and no original training data can be used. It introduces a three-phase pipeline consisting of (1) synthesis of diverse image priors to capture varied visual patterns and semantics, (2) contrastive enhancement to improve distinctions among the synthetic samples, and (3) distillation through a novel primer student that purportedly enables soft-probability KD. The authors claim state-of-the-art results across 12 benchmarks, supported by ablations that identify data diversity as critical.

Significance. If the central mechanism can be shown to reliably produce useful soft targets and superior distillation signals from hard labels alone, the work would meaningfully advance privacy-preserving model compression. The emphasis on diversity via image priors and the use of ablations to isolate its contribution are constructive elements that could inform future restricted-access KD research.

major comments (1)
  1. [Abstract] Abstract (three-phase pipeline description): The assertion that the primer-student stage 'enables soft-probability KD' is load-bearing for the central claim yet provides no mechanism for generating calibrated soft targets from top-1 hard predictions without further teacher queries, temperature scaling, or internal model access. If this bootstrap reduces to hard-label cross-entropy, the reported gains over prior black-box data-free baselines would not hold.
minor comments (2)
  1. [Abstract] The abstract states results on '12 benchmarks' but does not name them or indicate whether they include standard CIFAR/ImageNet splits or more specialized tasks; this should be clarified for reproducibility.
  2. [Abstract] No mention of error bars, number of runs, or statistical significance tests appears in the abstract's performance claims; these should be added to the evaluation section to support the SOTA assertion.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on our work. The feedback on the abstract's description of the primer student is particularly helpful for improving clarity. We respond to the major comment below and will incorporate revisions to address the concern.

read point-by-point responses
  1. Referee: [Abstract] Abstract (three-phase pipeline description): The assertion that the primer-student stage 'enables soft-probability KD' is load-bearing for the central claim yet provides no mechanism for generating calibrated soft targets from top-1 hard predictions without further teacher queries, temperature scaling, or internal model access. If this bootstrap reduces to hard-label cross-entropy, the reported gains over prior black-box data-free baselines would not hold.

    Authors: We appreciate the referee raising this point, as it directly concerns the core contribution. In the manuscript, the primer student is first trained on the synthesized diverse image priors using cross-entropy loss against the teacher's top-1 hard labels. After this initial training phase, the primer student produces its own output probability distributions over the same synthetic samples; these distributions serve as the soft targets for the subsequent distillation to the final student model (with temperature scaling applied in the KD loss). No additional queries to the black-box teacher are required, since the soft probabilities are generated internally by the primer student. The full paper includes ablations comparing this approach against direct hard-label training of the final student, showing consistent gains attributable to the soft targets. That said, we agree the abstract is overly concise and does not explicitly outline this two-stage process, which could reasonably lead to the interpretation that the method reduces to hard-label cross-entropy. We will revise the abstract (and add a clarifying sentence in Section 3.3) to describe the primer student's role in generating soft targets from the hard-label bootstrap. revision: yes
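Read literally, the rebuttal describes a two-stage transfer: the primer is fit with cross-entropy on the teacher's top-1 labels over the synthetic priors, then its temperature-scaled outputs serve as soft targets for the final student. A minimal sketch of that second stage (the temperature, loss weighting, and optimizer are illustrative assumptions, not the paper's settings):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_from_primer(primer: nn.Module, student: nn.Module,
                        loader, epochs: int = 10, T: float = 4.0,
                        lr: float = 1e-3):
    """Distill the final student from the primer's soft probabilities.
    `loader` yields (x, hard_y): synthetic priors and the teacher's
    cached top-1 labels, so no further teacher queries are needed."""
    primer.eval()
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x, hard_y in loader:
            with torch.no_grad():
                soft = F.softmax(primer(x) / T, dim=1)   # internal soft targets
            logits = student(x)
            kd = F.kl_div(F.log_softmax(logits / T, dim=1), soft,
                          reduction="batchmean") * T * T
            ce = F.cross_entropy(logits, hard_y)
            loss = kd + ce            # equal weighting is an assumption
            opt.zero_grad()
            loss.backward()
            opt.step()
```

If the primer's outputs are poorly calibrated, this reduces toward hard-label training, which is precisely the failure mode the referee's major comment is probing.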

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes an empirical three-phase framework (image prior synthesis, contrastive enhancement, primer-student distillation) for black-box data-free KD and supports its claims via benchmark evaluations and ablations rather than any mathematical derivation, equations, or fitted quantities. No load-bearing steps reduce by construction to the method's own inputs, self-citations, or renamed patterns; the abstract and described pipeline contain no equations or parameter-fitting loops that could create self-definition or prediction-by-fit. The central assertion that the primer student enables soft-probability KD is presented as a methodological contribution evaluated externally, not as a tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the method implicitly assumes that synthetic image priors can be generated to capture teacher knowledge without access to real data.

pith-pipeline@v0.9.0 · 5480 in / 1142 out tokens · 59788 ms · 2026-05-07T16:38:16.445627+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

37 extracted references · 2 canonical work pages · 2 internal anchors

  1. [1]

    Distilling the knowledge in a neural network. NeurIPS Workshop, 2015

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. NeurIPS Workshop, 2015

  2. [2]

    Health insurance portability and accountability act of 1996, public law no. 104-191

    United States Congress. Health insurance portability and accountability act of 1996, public law no. 104-191. Public Law 104-191, 110 Stat. 1936, 1996. Enacted August 21, 1996

  3. [3]

    Regulation (EU) 2016/679 of the European Parliament (general data protection regulation)

    European Parliament and Council. Regulation (EU) 2016/679 of the European Parliament (general data protection regulation). Official Journal of the European Union, L 119, 1–88, 2016. Official Journal L 119, 04/05/2016, p. 1-88

  4. [4]

    Defend trade secrets act of 2016, public law no. 114-153

    United States Congress. Defend trade secrets act of 2016, public law no. 114-153. Public Law 114–153, 130 Stat. 376, 2016. Enacted May 11, 2016

  5. [5]

    Zero-shot knowledge distillation from a decision-based black-box model

    Zi Wang. Zero-shot knowledge distillation from a decision-based black-box model. In ICML. PMLR, 2021

  6. [6]

    Ideal: Query-efficient data-free learning from black-box models

    Jie Zhang, Chen Chen, and Lingjuan Lyu. Ideal: Query-efficient data-free learning from black-box models. In ICLR, 2023

  7. [7]

    Data-free hard-label robustness stealing attack

    Xiaojian Yuan, Kejiang Chen, Wen Huang, Jie Zhang, Weiming Zhang, and Nenghai Yu. Data-free hard-label robustness stealing attack. In AAAI, 2024

  8. [8]

    Model compression

    Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In KDD, 2006

  9. [9]

    Fitnets: Hints for thin deep nets

    Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015

  10. [10]

    Knowledge distillation with distribution mismatch

    Dang Nguyen, Sunil Gupta, Trong Nguyen, Santu Rana, Phuoc Nguyen, Truyen Tran, Ky Le, Shannon Ryan, and Svetha Venkatesh. Knowledge distillation with distribution mismatch. In ECML-PKDD. Springer, 2021

  11. [11]

    Neural networks are more productive teachers than human raters: Active mixup for data-efficient knowledge distillation from a blackbox model

    Dongdong Wang, Yandong Li, Liqiang Wang, and Boqing Gong. Neural networks are more productive teachers than human raters: Active mixup for data-efficient knowledge distillation from a blackbox model. In CVPR, 2020

  12. [12]

    Black-box few-shot knowledge distillation

    Dang Nguyen, Sunil Gupta, Kien Do, and Svetha Venkatesh. Black-box few-shot knowledge distillation. In ECCV. Springer, 2022

  13. [13]

    Improving diversity in black-box few-shot knowledge distillation

    Tri-Nhan Vo, Dang Nguyen, Kien Do, and Sunil Gupta. Improving diversity in black-box few-shot knowledge distillation. In ECML-PKDD. Springer, 2024

  14. [14]

    Learning student networks in the wild

    Hanting Chen, Tianyu Guo, Chang Xu, Wenshuo Li, Chunjing Xu, Chao Xu, and Yunhe Wang. Learning student networks in the wild. In CVPR, 2021

  15. [15]

    Distribution shift matters for knowledge distillation with webly collected images

    Jialiang Tang, Shuo Chen, Gang Niu, Masashi Sugiyama, and Chen Gong. Distribution shift matters for knowledge distillation with webly collected images. In ICCV, 2023

  16. [16]

    Unbiased look at dataset bias

    Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In CVPR 2011. IEEE, 2011

  17. [17]

    Data-free learning of student networks

    Hanting Chen, Yunhe Wang, Chang Xu, Zhaohui Yang, Chuanjian Liu, Boxin Shi, Chunjing Xu, Chao Xu, and Qi Tian. Data-free learning of student networks. In ICCV, 2019

  18. [18]

    Contrastive model inversion for data-free knowledge distillation

    Gongfan Fang, Jie Song, Xinchao Wang, Chengchao Shen, Xingen Wang, and Mingli Song. Contrastive model inversion for data-free knowledge distillation. In IJCAI, 2021

  19. [19]

    Momentum adversarial distillation: Handling large distribution shifts in data-free knowledge distillation. NeurIPS, 2022

    Kien Do, Thai Hung Le, Dung Nguyen, Dang Nguyen, Haripriya Harikumar, Truyen Tran, Santu Rana, and Svetha Venkatesh. Momentum adversarial distillation: Handling large distribution shifts in data-free knowledge distillation. NeurIPS, 2022

  20. [20]

    Nayer: Noisy layer data generation for efficient and effective data-free knowledge distillation

    Minh-Tuan Tran, Trung Le, Xuan-May Le, Mehrtash Harandi, Quan Hung Tran, and Dinh Phung. Nayer: Noisy layer data generation for efficient and effective data-free knowledge distillation. In CVPR, 2024

  21. [21]

    Gradient-based learning applied to document recognition. Proceedings of the IEEE, 2002

    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 2002

  22. [22]

    Best practices for convolutional neural networks applied to visual document analysis

    Patrice Y Simard, David Steinkraus, John C Platt, et al. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR. Edinburgh, 2003

  23. [23]

    Cutmix: Regularization strategy to train strong classifiers with localizable features

    Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019

  24. [24]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML. PMLR, 2020

  25. [25]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020

  26. [26]

    Barlow twins: Self-supervised learning via redundancy reduction

    Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In ICML. PMLR, 2021

  27. [27]

    A database for handwritten text recognition research

    Jonathan J. Hull. A database for handwritten text recognition research. PAMI, 1994

  28. [28]

    Reading digits in natural images with unsupervised feature learning

    Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Baolin Wu, Andrew Y Ng, et al. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop. Granada, 2011

  29. [29]

    Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

    Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017

  30. [30]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, MIT, NYU, 2009. CIFAR10 and CIFAR100 were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton

  31. [31]

    Tiny imagenet visual recognition challenge. CS 231N, 2015

    Yann Le, Xuan Yang, et al. Tiny imagenet visual recognition challenge. CS 231N, 2015

  32. [32]

    Imagenette: A subset of 10 easily classified classes from imagenet

    Jeremy Howard. Imagenette: A subset of 10 easily classified classes from imagenet. https://github.com/fastai/imagenette, 2020

  33. [33]

    Medmnist classification decathlon: A lightweight automl benchmark for medical image analysis

    Jiancheng Yang, Rui Shi, and Bingbing Ni. Medmnist classification decathlon: A lightweight automl benchmark for medical image analysis. In International Symposium on Biomedical Imaging (ISBI). IEEE, 2021

  34. [34]

    Imagenet classification with deep convolutional neural networks. NeurIPS, 2012

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. NeurIPS, 2012

  35. [35]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016

  36. [36]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426, 2018

  37. [37]

    Reliable fidelity and diversity metrics for generative models

    Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. In ICML. PMLR, 2020