Mixed citations

Towards understanding ensemble, knowledge distillation and self-distillation in deep learning

Towards understanding ensemble, knowledge distillation and self-distillation in deep learning · 2020 · arXiv 2012.09816

Mixed citation behavior. Most common role is background (60%).

12 Pith papers citing it

citation-role summary

background 4 · baseline 1

citation-polarity summary

background 3 · baseline 1 · unclear 1

representative citing papers

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

cs.CL · 2023-05-12 · conditional · novelty 8.0

Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.

Convergence of Gradient Descent for General Neural Network Architectures Beyond the NTK Regime

cs.LG · 2026-06-22 · unverdicted · novelty 7.0

Proves GD convergence to stationary point neighborhoods for general NN architectures beyond NTK via block-level analysis, analyticity, and local smoothness conditions.

Quantifying and Defending against the Privacy Risk in Logit-based Federated Learning

cs.CR · 2026-06-06 · unverdicted · novelty 7.0

Logit-based federated learning leaks private model information to a semi-honest server via shared logits even with unrelated public data, enabling an adaptive stealing attack with theoretical bounds and a logit-perturbation defense.

Generative Diffusion Prior Distillation for Long-Context Knowledge Transfer

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

GDPD treats partial student features as degraded observations and uses a learned diffusion prior over teacher features to sample restorative long-context targets for improved partial time-series classification.

Benign Overfitting in Adversarial Training for Vision Transformers

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Adversarial training on simplified Vision Transformers achieves benign overfitting with near-zero robust loss and generalization error when signal-to-noise ratio and perturbation budget meet specific conditions.

Hierarchical Mixture-of-Experts with Two-Stage Optimization

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and vision tasks.

Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.

FLAME: Condensing Ensemble Diversity into a Single Network for Efficient Sequential Recommendation

cs.IR · 2026-04-05 · conditional · novelty 6.0

FLAME condenses ensemble diversity into a single network via modular ensemble simulation and guided mutual learning during training, delivering ensemble-level performance with single-network inference speed on sequential recommendation tasks.

Provable Knowledge Acquisition and Extraction in One-Layer Transformers

cs.LG · 2025-07-28 · unverdicted · novelty 6.0

In a stylized one-layer transformer, pre-training encodes factual knowledge via relation-specific feature directions and attention patterns; fine-tuning extracts it through a relation-covering mechanism that succeeds when enough latent templates are triggered, with a failure regime explaining inauds

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

cs.LG · 2024-01-02 · unverdicted · novelty 6.0

SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on benchmarks.

Muon Learns More Robust and Transferable Features than Adam

cs.LG · 2026-06-08 · unverdicted · novelty 5.0

Muon learns more robust and transferable features than Adam and SGD, shown via corruption robustness tests, transfer experiments, layer-wise probes, effective rank measurements, and a theoretical proof on margins in a multi-component classification problem.

Easy Ensemble: Simple Deep Ensemble Learning for Sensor-Based Human Activity Recognition

cs.CV · 2022-03-08 · unverdicted · novelty 5.0

Easy Ensemble enables deep ensemble learning for HAR in a single model, with experiments showing effectiveness on benchmark datasets compared to conventional methods.

citing papers explorer

Showing 12 of 12 citing papers.

TinyStories: How Small Can Language Models Be and Still Speak Coherent English? · cs.CL · 2023-05-12 · conditional · none · ref 2
Convergence of Gradient Descent for General Neural Network Architectures Beyond the NTK Regime · cs.LG · 2026-06-22 · unverdicted · none · ref 70
Quantifying and Defending against the Privacy Risk in Logit-based Federated Learning · cs.CR · 2026-06-06 · unverdicted · none · ref 2
Generative Diffusion Prior Distillation for Long-Context Knowledge Transfer · cs.LG · 2026-05-12 · unverdicted · none · ref 1
Benign Overfitting in Adversarial Training for Vision Transformers · cs.LG · 2026-04-21 · unverdicted · none · ref 43
Hierarchical Mixture-of-Experts with Two-Stage Optimization · cs.LG · 2026-05-08 · unverdicted · none · ref 1
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models · cs.LG · 2026-05-08 · unverdicted · none · ref 3
FLAME: Condensing Ensemble Diversity into a Single Network for Efficient Sequential Recommendation · cs.IR · 2026-04-05 · conditional · none · ref 1
Provable Knowledge Acquisition and Extraction in One-Layer Transformers · cs.LG · 2025-07-28 · unverdicted · none · ref 2
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models · cs.LG · 2024-01-02 · unverdicted · none · ref 13
Muon Learns More Robust and Transferable Features than Adam · cs.LG · 2026-06-08 · unverdicted · none · ref 127
Easy Ensemble: Simple Deep Ensemble Learning for Sensor-Based Human Activity Recognition · cs.CV · 2022-03-08 · unverdicted · none · ref 20
