Towards understanding ensemble, knowledge distillation and self-distillation in deep learning
Mixed citation behavior. Most common role is background (60%).
citing papers explorer
- TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
  Tiny language models under 10M parameters, trained on a synthetic children's story dataset, generate fluent, consistent, multi-paragraph English text with near-perfect grammar and some reasoning capability.
- Generative Diffusion Prior Distillation for Long-Context Knowledge Transfer
  GDPD treats partial student features as degraded observations and uses a learned diffusion prior over teacher features to sample restorative long-context targets for improved partial time-series classification.
- Benign Overfitting in Adversarial Training for Vision Transformers
  Adversarial training on simplified Vision Transformers achieves benign overfitting with near-zero robust loss and generalization error when signal-to-noise ratio and perturbation budget meet specific conditions.
- Hierarchical Mixture-of-Experts with Two-Stage Optimization
  Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and vision tasks.
- Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
  Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.
- FLAME: Condensing Ensemble Diversity into a Single Network for Efficient Sequential Recommendation
  FLAME condenses ensemble diversity into a single network via modular ensemble simulation and guided mutual learning during training, delivering ensemble-level performance with single-network inference speed on sequential recommendation tasks.
- Provable Knowledge Acquisition and Extraction in One-Layer Transformers
  In a stylized one-layer transformer, pre-training encodes factual knowledge via relation-specific feature directions and attention patterns; fine-tuning extracts it through a relation-covering mechanism that succeeds when enough latent templates are triggered, with a failure regime explaining inauds
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
  SPIN lets a weak LLM become strong by generating its own training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming models trained with DPO supplemented with extra GPT-4 preference data on benchmarks.
- Easy Ensemble: Simple Deep Ensemble Learning for Sensor-Based Human Activity Recognition
  Easy Ensemble enables deep ensemble learning for human activity recognition (HAR) in a single model; experiments on benchmark datasets show it is effective compared with conventional methods.