Dataset Distillation

Alexei A. Efros; Antonio Torralba; Jun-Yan Zhu; Tongzhou Wang

arxiv: 1811.10959 · v3 · pith:ORMB5QL2new · submitted 2018-11-27 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-23 23:50 UTC · model grok-4.3

keywords: dataset distillation · synthetic data · data compression · neural network training · MNIST · gradient descent · model optimization

The pith

Ten synthetic images can train a neural network on MNIST to near full-dataset performance in a few steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces dataset distillation, which keeps the model architecture fixed and instead compresses the training data into a much smaller synthetic set. The central demonstration is that 60,000 MNIST images can be replaced by 10 synthetic images, one per class, such that a network with fixed random initialization reaches close to original accuracy after only a few gradient descent steps. The synthetic points are optimized so their effect on the learning trajectory approximates that of the full dataset, without needing to match the original data distribution. Experiments across initialization settings and learning objectives, plus results on additional datasets, support the approach over prior alternatives.

Core claim

Dataset distillation synthesizes a small collection of data points that, when supplied to a learning algorithm with a fixed network initialization, produce a model whose performance approximates that obtained by training on the entire original dataset.

What carries the argument

The synthetic distilled dataset, optimized to replicate the learning dynamics of the full dataset under a fixed initialization.
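
A sketch of how this can be written for a single gradient step, using notation assumed here rather than taken from this page: \theta_0 is the fixed initialization, \tilde{\mathbf{x}} the synthetic set, \mathbf{x} the full training set, \ell the training loss, and \tilde{\eta} the step size.

```latex
\theta_1 = \theta_0 - \tilde{\eta}\, \nabla_{\theta_0} \ell(\tilde{\mathbf{x}}, \theta_0),
\qquad
\tilde{\mathbf{x}}^{*} = \arg\min_{\tilde{\mathbf{x}}} \; \ell(\mathbf{x}, \theta_1)
```

The outer minimization sees the real data only through the loss of the updated weights \theta_1, which is why the synthetic points are free to look nothing like real samples.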

If this is right

A network can be trained to high accuracy using orders of magnitude fewer examples and far fewer gradient steps.
The same distilled set works for multiple random initializations within the tested range.
The method extends to other datasets and learning objectives beyond the MNIST case.
Training becomes feasible under severe data or compute constraints while keeping final model quality close to that of full-dataset training.
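
The point about multiple initializations can be read as optimizing the synthetic set in expectation over starting weights rather than for one fixed \theta_0. A hedged sketch, where p(\theta_0) is an assumed distribution over initializations, \tilde{\mathbf{x}} the synthetic set, \mathbf{x} the full training set, \ell the loss, and \tilde{\eta} the step size (all notation assumed here):

```latex
\tilde{\mathbf{x}}^{*} = \arg\min_{\tilde{\mathbf{x}}} \;
\mathbb{E}_{\theta_0 \sim p(\theta_0)} \,
\ell\bigl(\mathbf{x},\; \theta_0 - \tilde{\eta}\, \nabla_{\theta_0} \ell(\tilde{\mathbf{x}}, \theta_0)\bigr)
```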

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the distilled set generalizes across architectures, it could serve as a portable training resource independent of any single model.
The approach might combine with existing data-augmentation pipelines to further reduce the required number of synthetic points.
Success on classification tasks raises the question of whether similar distillation applies directly to regression or reinforcement-learning environments.

Load-bearing premise

The learning dynamics produced by the full dataset on a fixed random initialization can be closely matched by training on a small optimized set of synthetic points.

What would settle it

The claim would be undermined if a network trained on the 10 reported synthetic MNIST images, using the paper's fixed initialization and a few gradient steps, failed to reach within a few percent of the accuracy obtained from the full 60,000-image set.

Original abstract

Model distillation aims to distill the knowledge of a complex model into a simpler one. In this paper, we consider an alternative formulation called dataset distillation: we keep the model fixed and instead attempt to distill the knowledge from a large training dataset into a small one. The idea is to synthesize a small number of data points that do not need to come from the correct data distribution, but will, when given to the learning algorithm as training data, approximate the model trained on the original data. For example, we show that it is possible to compress 60,000 MNIST training images into just 10 synthetic distilled images (one per class) and achieve close to original performance with only a few gradient descent steps, given a fixed network initialization. We evaluate our method in various initialization settings and with different learning objectives. Experiments on multiple datasets show the advantage of our approach compared to alternative methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper shows you can optimize 10 synthetic MNIST images to nearly match full-dataset training performance in a few steps on a fixed init, but the fixed-init constraint and scaling questions are the real limits.

The main takeaway is that dataset distillation lets you synthesize a handful of images that, when used to train a fixed network for a few gradient steps, get close to the test accuracy you'd get from the full 60k MNIST set. That's the concrete result they lead with, and the experiments back it up on MNIST plus some other small datasets and init regimes.

The formulation is straightforward: treat the synthetic points as optimizable variables, unroll the training trajectory, and minimize the difference in final parameters or loss relative to training on the real data. This is distinct from model distillation and from just picking real examples; the points don't have to look like real data or come from the same distribution. They also test different matching objectives and show the approach beats some baselines. That part is solid and worth the attention it got.

The soft spots are mostly around practicality. The headline result is locked to one fixed random initialization, so that distilled set is tied to a specific starting point and may not transfer if you change the seed or the architecture. The unrolling makes the meta-optimization expensive, and it's not obvious how this extends to larger images, deeper nets, or datasets where a few steps don't get you anywhere near convergence. The abstract claims results across settings, but without seeing the full tables it's hard to tell how big the gaps are on harder cases.

For someone working on data efficiency or bilevel optimization this is worth reading and citing if you're building on the idea. It deserves peer review because the core claim is testable, the method is clearly defined, and the MNIST numbers are strong enough to spark useful discussion even if later work has to relax the fixed-init assumption.
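
The formulation in the letter (synthetic points as optimizable variables, an unrolled training step, an outer loss measured on real data) can be illustrated with a toy. The following is a minimal sketch, not the paper's implementation: it assumes a one-dimensional linear model, a single synthetic point, one inner gradient step, and a hand-derived meta-gradient in place of automatic differentiation. All function names, data, and hyperparameters are hypothetical.

```python
# Toy sketch of single-step dataset distillation on a 1-D linear model.
# All names, data, and hyperparameters here are illustrative assumptions.

def inner_step(w0, xs, ys, lr):
    """One gradient step from w0 on the synthetic point, loss 0.5 * (w*xs - ys)**2."""
    return w0 - lr * (w0 * xs - ys) * xs


def real_loss(w, data):
    """Mean of 0.5 * (w*x - y)**2 over the real data."""
    return sum(0.5 * (w * x - y) ** 2 for x, y in data) / len(data)


def distill(data, w0=0.0, lr=0.5, outer_lr=0.05, outer_steps=500):
    """Optimize one synthetic point (xs, ys) by differentiating through the inner step."""
    xs, ys = 1.0, 0.0  # arbitrary starting values for the synthetic point
    for _ in range(outer_steps):
        w1 = inner_step(w0, xs, ys, lr)
        # Gradient of the real-data loss with respect to the updated weight.
        dloss_dw1 = sum((w1 * x - y) * x for x, y in data) / len(data)
        # Derivatives of w1 = w0 - lr * (w0*xs - ys) * xs with respect to xs and ys.
        dw1_dxs = -lr * (2.0 * w0 * xs - ys)
        dw1_dys = lr * xs
        # Chain rule, then a plain gradient-descent update on the synthetic point.
        xs, ys = (xs - outer_lr * dloss_dw1 * dw1_dxs,
                  ys - outer_lr * dloss_dw1 * dw1_dys)
    return xs, ys


# "Real" dataset: five noise-free samples of y = 3x (an assumption for the demo).
real_data = [(x, 3.0 * x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w0, lr = 0.0, 0.5  # fixed initialization and inner step size

xs, ys = distill(real_data, w0=w0, lr=lr)
w_distilled = inner_step(w0, xs, ys, lr)

# Baseline: one step on each single real point, keeping the best one.
w_best_real = min((inner_step(w0, x, y, lr) for x, y in real_data),
                  key=lambda w: real_loss(w, real_data))

print("loss at init:              ", real_loss(w0, real_data))
print("loss after distilled point:", real_loss(w_distilled, real_data))
print("loss after best real point:", real_loss(w_best_real, real_data))
```

In this toy, the learned synthetic point lies off the line y = 3x, yet one step on it moves the weight closer to the least-squares solution than one step on any single real sample. That mirrors, at a very small scale, the letter's remark that distilled points need not resemble real data. The distilled point is also specific to the chosen w0 and lr, which is the fixed-initialization caveat raised above.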

Referee Report

0 major / 3 minor

Summary. The manuscript introduces dataset distillation, an approach that synthesizes a small number of data points (not required to follow the original data distribution) such that training a fixed-architecture network on these points, starting from a fixed random initialization, produces parameter trajectories and test performance that closely approximate those obtained by training on the full original dataset. The central empirical claim is that the 60,000 MNIST training images can be compressed to 10 synthetic images (one per class) that achieve near-original accuracy after only a few gradient-descent steps; the method is evaluated across initialization regimes, learning objectives, and additional datasets.

Significance. If the reported approximation holds under the stated conditions, the work provides a concrete mechanism for dataset compression that directly targets learning dynamics rather than data statistics, with potential utility for meta-learning, continual learning, and resource-constrained training. The explicit experimental protocol (unrolled optimization matching parameter trajectories or losses) and results on multiple datasets constitute reproducible empirical support for the core premise.

minor comments (3)

[Method] The abstract states that the synthetic images 'do not need to come from the correct data distribution'; the method section should state explicitly whether any distributional regularizer is applied or whether the images are left unconstrained.
[Experiments] Figure captions and axis labels in the experimental section should include the exact number of gradient steps used for the distilled-set evaluation so that the 'few gradient descent steps' claim can be directly compared to the full-dataset baseline.
[Related Work] The paper should add a short paragraph in the related-work section contrasting the fixed-initialization setting with standard dataset distillation variants that allow the network weights to vary during the distillation process.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the detailed and positive summary of our manuscript on dataset distillation. We appreciate the recognition of the empirical protocol and results across datasets, as well as the recommendation for minor revision. Since no specific major comments or requested changes were provided in the report, we have no points requiring direct rebuttal or clarification at this time. We are happy to make any minor editorial adjustments suggested by the editor or in a subsequent round.

Circularity Check

0 steps flagged

No significant circularity

The paper presents an empirical optimization procedure for synthesizing a small set of training images whose gradient-descent trajectories on a fixed initialization approximate those obtained from the full dataset. The method is defined by an explicit matching objective (matching parameter trajectories or losses) that is minimized over the synthetic images; the resulting performance is measured on held-out test data and compared against baselines. No derivation reduces a claimed result to a fitted parameter by construction, no load-bearing premise rests solely on self-citation, and the central claim is supported by direct experimental outcomes rather than by renaming or re-deriving its own inputs. The approach is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract provides no explicit free parameters, axioms, or invented entities beyond the high-level concept of synthetic distilled images; assessment is limited by lack of full text.

invented entities (1)

synthetic distilled images · no independent evidence
purpose: small set of artificial points that approximate the effect of the full training dataset when used for gradient descent
Introduced in the abstract as the core output of the method; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5679 in / 1132 out tokens · 16325 ms · 2026-05-23T23:50:42.972243+00:00 · methodology

Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Compression to Accountability: Harmless Copyright Protection for Dataset Distillation
cs.CR 2026-05 unverdicted novelty 7.0

SubPopMark embeds verifiable subpopulation biases into distilled datasets via CVM and USTM optimization stages, allowing provenance inference through comparison of model output signatures against a reference behavior bank.
From Compression to Accountability: Harmless Copyright Protection for Dataset Distillation
cs.CR 2026-05 unverdicted novelty 7.0

SubPopMark protects distilled datasets by injecting verifiable subpopulation biases that create distinguishable model behaviors for copyright tracing without using backdoors.
Spectral Gradient Surgery for Domain-Generalizable Dataset Distillation
cs.LG 2026-05 unverdicted novelty 7.0

Spectral Gradient Surgery disentangles class-discriminative and domain-specific signals in distribution-matching distilled datasets by analyzing gradient agreement in the spectral domain, yielding better out-of-distri...
Closed-Form Linear-Probe Dataset Distillation for Pre-trained Vision Models
cs.CV 2026-05 conditional novelty 7.0

CLP-DD distills small synthetic datasets for linear probing on pre-trained models via closed-form inner solver and discriminative outer loss, matching or exceeding LGM+DSA performance at much lower cost on ImageNet-10...
Direct Discrepancy Replay: Distribution-Discrepancy Condensation and Manifold-Consistent Replay for Continual Face Forgery Detection
cs.CV 2026-04 unverdicted novelty 7.0

A replay method for continual face forgery detection condenses real-fake distribution discrepancies into compact maps and synthesizes compatible samples from current real faces to reduce forgetting under tight memory ...
Synthetic Designed Experiments for Diagnosing Vision Model Failure
cs.CV 2026-03 unverdicted novelty 7.0

SDRS uses designed experiments and ANOVA decomposition on synthetic data to identify Type I coverage gaps and Type II spurious dependencies in vision models, then generates targeted data to improve performance.
OD3: Optimization-free Dataset Distillation for Object Detection
cs.CV 2025-06 unverdicted novelty 7.0

OD3 presents an optimization-free dataset distillation framework for object detection that reports new state-of-the-art accuracy on COCO and VOC at compression ratios from 0.25% to 5%.
DIVER:Diving Deeper into Distilled Data via Expressive Semantic Recovery
cs.CV 2026-05 unverdicted novelty 6.0

DIVER is a dual-stage distillation method using diffusion models to enhance semantic preservation and cross-architecture generalization in dataset distillation.
Fair Dataset Distillation via Cross-Group Barycenter Alignment
cs.LG 2026-04 unverdicted novelty 6.0

Dataset distillation introduces fairness gaps from subgroup pattern mismatches rather than just imbalance; distilling to a group-agnostic barycenter of predictive information reduces these gaps.
Soft Label Pruning and Quantization for Large-Scale Dataset Distillation
cs.CV 2026-04 unverdicted novelty 6.0

LPQLD reduces soft label storage in dataset distillation by 78-500x on ImageNet datasets via pruning with dynamic reuse and quantization with student-teacher alignment, while improving accuracy.
Omnimodal Dataset Distillation via High-order Proxy Alignment
cs.CV 2026-04 unverdicted novelty 6.0

HoPA captures high-order cross-modal alignments via a shared proxy to enable scalable omnimodal dataset distillation with better performance-compression trade-offs.
ROAST: Risk-aware Outlier-exposure for Adversarial Selective Training of Anomaly Detectors Against Evasion Attacks
cs.CR 2026-03 unverdicted novelty 6.0

ROAST selectively trains anomaly detectors on less vulnerable patient data with targeted outlier exposure, boosting recall by 16.2% in black-box settings and reducing training time by 88.3%.
EPS: Efficient Patch Sampling for Video Overfitting in Deep Super-Resolution Model Training
cs.CV 2024-11 unverdicted novelty 6.0

EPS uses DCT features to cluster patches by spatial-temporal complexity and adaptively samples from the highest cluster, cutting training patches by 75-91.69% and speeding sampling up to 82.1x versus EMT while claimin...
Policy Gradient with Kernel Quadrature
cs.LG 2023-10 unverdicted novelty 6.0

Episodic kernel quadrature compresses batches of episodes via GP-modeled returns to enable efficient policy gradient updates without evaluating rewards on every sample.
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
cs.CL 2023-09 conditional novelty 6.0

Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.
SAS: Semantic-aware Sampling for Generative Dataset Distillation
cs.CV 2026-05 unverdicted novelty 5.0

SAS adds semantic scoring with CLIP and a two-stage filter-then-diversity selection process to make generative dataset distillation produce more class-discriminative and diverse compact datasets.
Robust Server Defense Against Unreliable Clients in One-Shot Fair Collaborative Machine Learning
cs.LG 2026-05 unverdicted novelty 5.0

Bilevel optimization learns client weights to defend fairness in one-shot collaborative ML by anchoring to a small trusted root dataset at the server.
Lightning Unified Video Editing via In-Context Sparse Attention
cs.CV 2026-05 unverdicted novelty 5.0

ISA prunes low-saliency context tokens and routes queries by sharpness to either full or 0-th order Taylor sparse attention, enabling LIVEditor to cut attention latency ~60% while beating prior video editing methods o...
Federated Distillation for Whole Slide Image via Gaussian-Mixture Feature Alignment and Curriculum Integration
cs.CV 2026-05 unverdicted novelty 5.0

FedHD performs federated distillation for whole slide images by generating one synthetic feature set per real slide via Gaussian-mixture alignment and adding them via curriculum integration, outperforming prior federa...
Federated Distillation for Whole Slide Image via Gaussian-Mixture Feature Alignment and Curriculum Integration
cs.CV 2026-05 unverdicted novelty 5.0

FedHD is a federated learning framework for whole slide images that distills one-to-one synthetic features aligned via Gaussian mixtures and progressively integrates cross-site features through curriculum learning to ...
A Systematic Framework for Tabular Data Disentanglement
cs.LG 2026-04 unverdicted novelty 5.0

A systematic framework modularizes tabular data disentanglement into data extraction, modeling, analysis, and latent extrapolation, with a case study on synthetic data generation.
Diffusion Models as Dataset Distillation Priors
cs.LG 2025-10 unverdicted novelty 5.0

DAP formalizes a representativeness prior via Mercer kernel similarity in feature space and uses it to guide diffusion reverse process for higher-quality distilled datasets on ImageNet without retraining.
Position: The Time for Sampling Is Now! Charting a New Course for Bayesian Deep Learning
cs.LG 2026-05 unverdicted novelty 4.0

Sampling-based inference for Bayesian neural networks has achieved computational parity with optimization-based methods and should be prioritized to deliver better uncertainty quantification and model insights.
Position: Graph Condensation Needs a Reset -- Move Beyond Full-dataset Training and Model-Dependence
cs.LG 2026-05 conditional novelty 4.0

The paper claims current graph condensation approaches are flawed due to full-dataset training requirements, high overhead, poor generalization, and misleading evaluation metrics, calling for a reset toward lightweigh...
Position: Graph Condensation Needs a Reset -- Move Beyond Full-dataset Training and Model-Dependence
cs.LG 2026-05 unverdicted novelty 3.0

Graph condensation methods must move beyond full-dataset training and model dependence toward lightweight, architecture-agnostic designs to achieve practical efficiency in GNNs.
Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models
cs.LG 2026-04 unverdicted novelty 2.0

The paper compiles hardware-software co-design techniques including mixed-precision quantization, structural pruning, speculative decoding, and transformer accelerators to speed up multimodal foundation models, with e...
Knowledge Distillation in Federated Learning: a Survey on Long Lasting Challenges and New Solutions
cs.LG 2024-06 unverdicted novelty 2.0

A survey organizing knowledge distillation techniques for addressing privacy, heterogeneity, communication, and personalization challenges in federated learning.

Reference graph

Works this paper leans on

124 extracted references · 124 canonical work pages · cited by 24 Pith papers · 9 internal anchors

[1]

University of Montreal , volume=

Visualizing higher-layer features of a deep network , author=. University of Montreal , volume=

work page
[2]

Proceedings of the fourteenth international conference on artificial intelligence and statistics , pages=

An analysis of single-layer networks in unsupervised feature learning , author=. Proceedings of the fourteenth international conference on artificial intelligence and statistics , pages=

work page
[3]

2014 , organization=

Analyzing the performance of multilayer neural networks for object recognition , author=. 2014 , organization=

work page 2014
[4]

Advances in neural information processing systems , pages=

How transferable are features in deep neural networks? , author=. Advances in neural information processing systems , pages=

work page
[5]

Object detectors emerge in deep scene cnns , author=

work page
[6]

Aravindh Mahendran and Andrea Vedaldi , booktitle =CVPR, title =

work page
[7]

Advances in Neural Information Processing Systems , pages=

Synthesizing the preferred inputs for neurons in neural networks via deep generator networks , author=. Advances in Neural Information Processing Systems , pages=

work page
[8]

Imagenet classification with deep convolutional neural networks , author=

work page
[9]

Understanding Black-box Predictions via Influence Functions , author =

work page
[10]

Technometrics , volume=

Characterizations of an empirical influence function for detecting influential cases in regression , author=. Technometrics , volume=. 1980 , publisher=

work page 1980
[11]

2011 , organization=

Unbiased look at dataset bias , author=. 2011 , organization=

work page 2011
[12]

Toward category-level object recognition , pages=

Dataset issues in object recognition , author=. Toward category-level object recognition , pages=

work page
[13]

Network dissection: Quantifying interpretability of deep visual representations , author=

work page
[14]

Visualizing and understanding convolutional networks , author=

work page
[15]

ICLR Workshop , year=

Deep inside convolutional networks: Visualising image classification models and saliency maps , author=. ICLR Workshop , year=

work page
[18]

Data poisoning attacks on factorization-based collaborative filtering , author=

work page
[19]

Pruning training sets for learning of object categories , author=

work page
[20]

2010 , publisher=

Object detection with discriminatively trained part-based models , author=. 2010 , publisher=

work page 2010
[21]

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

Training region-based object detectors with online hard example mining , author=. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

work page
[22]

Support vector machine active learning with applications to text classification , author=

work page
[23]

Journal of artificial intelligence research , volume=

Active learning with statistical models , author=. Journal of artificial intelligence research , volume=

work page
[24]

Proceedings of the IEEE , volume=

Gradient-based learning applied to document recognition , author=. Proceedings of the IEEE , volume=. 1998 , publisher=

work page 1998
[25]

International conference on artificial intelligence and statistics , year=

Understanding the difficulty of training deep feedforward neural networks , author=. International conference on artificial intelligence and statistics , year=

work page
[26]

Data Distillation: Towards Omni-Supervised Learning , author=

work page
[27]

2015 , booktitle =

Distilling the Knowledge in a Neural Network , author =. 2015 , booktitle =

work page 2015
[28]

, author=

Using Machine Teaching to Identify Optimal Training-Set Attacks on Machine Learners. , author=

work page
[29]

Poisoning attacks against support vector machines , author=

work page
[30]

Explaining and Harnessing Adversarial Examples

Explaining and harnessing adversarial examples , author=. arXiv preprint arXiv:1412.6572 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[31]

Intriguing properties of neural networks

Intriguing properties of neural networks , author=. arXiv preprint arXiv:1312.6199 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security , pages=

Towards poisoning of deep learning algorithms with back-gradient optimization , author=. Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security , pages=

work page
[33]

Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning

Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning , author=. arXiv preprint arXiv:1712.05526 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Striving for Simplicity: The All Convolutional Net

Striving for simplicity: The all convolutional net , author=. arXiv preprint arXiv:1412.6806 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[35]

Do deep nets really need to be deep? , author=

work page
[36]

Fitnets: Hints for thin deep nets , author=

work page
[37]

Adapting visual category models to new domains , author=

work page
[38]

Daume III, Hal , booktitle = ACL, title =

work page
[39]

Mobilenets: Efficient convolutional neural networks for mobile vision applications , author=

work page
[40]

Gradient-based hyperparameter optimization through reversible learning , author=

work page
[41]

Neural computation , volume=

Gradient-based optimization of hyperparameters , author=. Neural computation , volume=. 2000 , publisher=

work page 2000
[42]

Artificial Intelligence and Statistics , pages=

Generic methods for optimization-based modeling , author=. Artificial Intelligence and Statistics , pages=

work page
[43]

Hyperparameter optimization with approximate gradient , author=

work page
[44]

and Zisserman, A

Simonyan, K. and Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR. 2014

work page 2014
[45]

Automatic differentiation in PyTorch , author=

work page
[46]

Neural computation , volume=

Fast exact multiplication by the Hessian , author=. Neural computation , volume=. 1994 , publisher=

work page 1994
[47]

Delving deep into rectifiers: Surpassing human-level performance on imagenet classification , author=

work page
[48]

2009 , pages=

Covariate shift and local learning by distribution matching , author=. 2009 , pages=

work page 2009
[49]

Data-dependent Initializations of Convolutional Neural Networks

Data-dependent initializations of convolutional neural networks , author=. arXiv preprint arXiv:1511.06856 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[50]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Deep residual learning for image recognition , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[51]

http://yann

The MNIST database of handwritten digits , author=. http://yann. lecun. com/exdb/mnist/ , year=

work page
[52]

2009 , institution=

Learning multiple layers of features from tiny images , author=. 2009 , institution=

work page 2009
[53]

https://github.com/akrizhevsky/cuda-convnet2 , year=

cuda-convnet: High-performance c++/cuda implementation of convolutional neural networks , author=. https://github.com/akrizhevsky/cuda-convnet2 , year=

work page
[54]

Imagenet: A large-scale hierarchical image database , author=

work page
[55]

Few-shot adversarial domain adaptation , author=

work page
[56]

2010 , publisher=

The pascal visual object classes (voc) challenge , author=. 2010 , publisher=

work page 2010
[57]

and Branson, S

Wah, C. and Branson, S. and Welinder, P. and Perona, P. and Belongie, S. , Year =

work page
[58]

One weird trick for parallelizing convolutional neural networks

One weird trick for parallelizing convolutional neural networks , author=. arXiv preprint arXiv:1404.5997 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[59]

Core vector machines: Fast SVM training on very large data sets , author=

work page
[60]

Small Coresets to Represent Large Training Data for Support Vector Machines , author=

work page
[61]

Discrete & Computational Geometry , volume=

Smaller coresets for k-median and k-means clustering , author=. Discrete & Computational Geometry , volume=. 2007 , publisher=

work page 2007
[63]

Artificial Intelligence Review , volume=

A review of instance selection methods , author=. Artificial Intelligence Review , volume=. 2010 , publisher=

work page 2010
[64]

Active Learning for Convolutional Neural Networks: A Core-Set Approach , author=

work page
[65]

NIPS workshop , year=

Reading digits in natural images with unsupervised feature learning , author=. NIPS workshop , year=

work page
[66]

A database for handwritten text recognition research , author=

work page
[67]

1957 , publisher=

The perceptron, a perceiving and recognizing automaton Project Para , author=. 1957 , publisher=

work page 1957
[68]

IEEE Intelligent Systems and their applications , volume=

Support vector machines , author=. IEEE Intelligent Systems and their applications , volume=. 1998 , publisher=

work page 1998
[69]

Journal of Computer and System Sciences , volume=

On the complexity of teaching , author=. Journal of Computer and System Sciences , volume=. 1995 , publisher=

work page 1995
[70]

New Generation Computing , volume=

Teachability in computational learning , author=. New Generation Computing , volume=. 1991 , publisher=

work page 1991
[71]

Machine teaching for bayesian learners in the exponential family , author=

work page
[72]

, author=

Machine Teaching: An Inverse Problem to Machine Learning and an Approach Toward Optimal Education. , author=

work page
[73]

Adam: A method for stochastic optimization , author=

work page
[75]

Pruning training sets for learning of object categories

Anelia Angelova, Yaser Abu-Mostafam, and Pietro Perona. Pruning training sets for learning of object categories. In CVPR, 2005

work page 2005
[76]

Do deep nets really need to be deep? In NIPS, 2014

Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In NIPS, 2014

work page 2014
[77]

Practical Coreset Constructions for Machine Learning

Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[78]

Network dissection: Quantifying interpretability of deep visual representations

David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In CVPR, 2017

work page 2017
[79]

Gradient-based optimization of hyperparameters

Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural computation, 12 0 (8): 0 1889--1900, 2000

work page 1900
[80]

Poisoning attacks against support vector machines

Battista Biggio, Blaine Nelson, and Pavel Laskov. Poisoning attacks against support vector machines. In ICML, 2012

work page 2012
[81]

Active learning with statistical models

David A Cohn, Zoubin Ghahramani, and Michael I Jordan. Active learning with statistical models. Journal of artificial intelligence research, 4: 0 129--145, 1996

work page 1996
[82]

Frustratingly easy domain adaptation

Hal Daume III. Frustratingly easy domain adaptation. In ACL, 2007

work page 2007
[83]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009

work page 2009
[84]

Generic methods for optimization-based modeling

Justin Domke. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pp.\ 318--326, 2012

work page 2012

Showing first 80 references.
