The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
Canonical reference. 70% of citing Pith papers cite this work as background.
abstract
Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance. We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the "lottery ticket hypothesis:" dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective. We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.
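The abstract mentions an algorithm for identifying winning tickets but does not spell it out here. As a rough sketch only, assuming the commonly described recipe of pruning the smallest-magnitude trained weights and resetting the survivors to their initial values, a single pruning round could look like the following. The function names `magnitude_mask` and `winning_ticket`, the flat weight lists, and the toy numbers are all hypothetical and chosen for illustration. Training itself, per-layer pruning rates, and repeated rounds are not shown.

```python
def magnitude_mask(trained, prune_fraction):
    """Return a 0/1 mask that drops the smallest-magnitude trained weights."""
    n_prune = int(len(trained) * prune_fraction)
    # Indices sorted by ascending absolute value; the first n_prune get pruned.
    order = sorted(range(len(trained)), key=lambda i: abs(trained[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(trained))]


def winning_ticket(initial, trained, prune_fraction):
    """Keep surviving connections at their ORIGINAL initial values (the reset step)."""
    mask = magnitude_mask(trained, prune_fraction)
    ticket = [w0 if m else 0.0 for w0, m in zip(initial, mask)]
    return ticket, mask


# Toy numbers, invented for illustration only (not from the paper).
initial = [0.10, -0.20, 0.05, 0.30, -0.15]   # weights at initialization
trained = [0.90, -0.01, 0.02, -0.70, 0.40]   # same weights after training
ticket, mask = winning_ticket(initial, trained, prune_fraction=0.4)
print(mask)    # [1, 0, 0, 1, 1]
print(ticket)  # [0.1, 0.0, 0.0, 0.3, -0.15]
```

The mask is computed from the trained weights but applied to the initial ones. This reflects the abstract's claim that the surviving connections "have initial weights that make training particularly effective". The sparse network would then be retrained from `ticket` with the pruned positions held at zero.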
citing papers explorer
-
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.
-
Progress measures for grokking via mechanistic interpretability
Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.
-
Testing Equality of Conditional Distributions via Generative Models
A generative-model-based test for equality of conditional distributions that uses cross-generation, an RKHS-indexed supremum statistic, and multiplier bootstrap, with claimed double robustness to generator errors.
-
Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution
Neurons exhibit concept-conditioned activation ranges forming Gaussian-like distributions with minimal overlap, and range-based interventions via NeuronLens outperform neuron-level masking in targeted manipulation with reduced collateral effects.
-
Universal Differential Equations for Scientific Machine Learning
Universal Differential Equations unify scientific models with machine learning by embedding flexible approximators into differential equations, enabling applications from biological mechanism discovery to high-dimensional optimization.
-
Importance Estimation for Neural Network Pruning
Taylor-expansion importance scoring enables layer-agnostic pruning of neural networks that outperforms prior methods on ImageNet accuracy-FLOPs trade-offs.
-
TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals
TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.
-
Partitioning Unstructured Sparse Tensor Algebra for Load-Balanced Parallel Execution
A new partitioning algorithm that provably load-balances arbitrary sparse tensor algebra expressions by generalizing parallel merging to multi-operand, multi-dimensional hierarchical structures, implemented in a compiler framework.
-
Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment
Lesioning a shared core in multilingual LLMs drops whole-brain fMRI encoding correlation by 60.32%, while language-specific lesions selectively weaken predictions only for the matched native language.
-
Minimal Information Control Invariance via Vector Quantization
A vector-quantized autoencoder learns minimal control codebooks for forward invariance in sampled-data control, achieving 157x reduction over grid baselines on a 12D quadrotor model.
-
When AI Reviews Its Own Code: Recursive Self-Training Collapse in Code LLMs
Experiments across code LLMs show no-review collapses fastest, human-gated filters slow collapse, and AI self-gates lose effect over time, degenerating to ungated self-training under self-confirming acceptance as proven via gated distributional reweighting and spectral analysis.
-
Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation
On-policy distillation produces coordinate-sparse, FFN-heavy updates that are full-rank but spectrally concentrated away from principal singular subspaces and near-zero source weights.
-
A Geometric Analysis of Sign-Magnitude Asymmetry in a ReLU + RMSNorm Block under Ternary Quantization
Sign-flip perturbations produce π/(π-2) ≈ 2.75 times more transverse output energy than equal-norm sign-preserving perturbations in a ReLU + RMSNorm block because ReLU creates directional asymmetry that RMSNorm's transverse projection exposes.
-
Prediction horizon shapes representations in predictive learning
Longer prediction horizons in predictive learning interact with model biases to recover the latent geometry of the task.
-
MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs
MaskPro learns categorical distributions over groups of M weights to generate exact (N:M) sparsity via N-way sampling without replacement and stabilizes training with a moving average tracker of loss residuals.
-
Poisoning with A Pill: Circumventing Detection in Federated Learning
A three-stage pill-based augmentation makes existing FL poisoning attacks evade popular defenses while raising error rates up to 7x on both IID and non-IID data.
-
SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation
SalUn uses gradient-based weight saliency to achieve effective machine unlearning of data, classes, or concepts in image classification and generation, narrowing the gap to exact retraining.
-
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Sparse autoencoders applied to language model activations yield more interpretable and monosemantic features than alternative approaches, enabling finer causal analysis on the indirect object identification task.
-
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
AWQ quantizes LLM weights to low bits by scaling salient channels based on activation statistics, outperforming prior methods on language, coding, math, and multi-modal benchmarks.
-
Compact SO(3) Equivariant Atomistic Foundation Models via Structural Pruning
Structural pruning of SO(3) equivariant atomistic models from large checkpoints yields 1.5-4x fewer parameters and 2.5-4x less pre-training compute than small models trained from scratch, while outperforming them on most Matbench Discovery metrics and downstream tasks.
-
XPERT: Expert Knowledge Transfer for Effective Training of Language Models
XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.
-
Gaussians on a Diet: High-Quality Memory-Bounded 3D Gaussian Splatting Training
A dynamic training framework for 3D Gaussian Splatting alternates incremental pruning and adaptive growing of primitives to maintain high rendering quality at up to 80% lower peak memory than standard 3DGS.
-
Training Deep Visual Networks Beyond Loss and Accuracy Through a Dynamical Systems Approach
Introduces integration, metastability, and dynamical stability index measures from layer activations and reports patterns distinguishing CIFAR-10 from CIFAR-100 difficulty plus early convergence signals across ResNet variants, DenseNet, MobileNetV2, VGG-16, and a Vision Transformer.
-
SubFLOT: Submodel Extraction for Efficient and Personalized Federated Learning via Optimal Transport
SubFLOT uses optimal transport to generate data-aware personalized submodels via server-side pruning and scaling-based adaptive regularization to mitigate parametric divergence in heterogeneous federated learning.
-
SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models
SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.
-
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.
-
PASTA: A Paraphrasing And Self-Training Approach for Knowledge Updating in LLMs
PASTA combines data augmentation and a self-learning DPO process to integrate new factual knowledge from news articles into LLMs, raising accuracy from 0.02 to 0.82 on post-cutoff questions while preserving general capabilities.
-
The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning
Full fine-tuning causes negative transfer and performance collapse in sub-300M SLMs on math tasks, establishing PEFT as a stability requirement.
-
STARFISH: faST Accuracy Recovery in pruned networks From Internal State Healing
STARFISH recovers accuracy in pruned neural networks by optimizing internal state alignment to the original model with a minimal unlabeled calibration set, outperforming prior recovery methods especially at high pruning ratios.
-
BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices
BitTP applies weight-only 1.58-bit quantization to LLM trajectory predictors, claiming improved ADE/FDE over BF16 baseline with reduced resource demands on edge devices.
-
Comparing Classical Simulation and Sample-Based Learning of Quantum Systems: Learning the Hardness of Quantum Systems from Samples
Empirical study finds neural-network learning difficulty (via Hessian eigenvalue and random subspace optimization) correlates with classical simulation hardness parameterized by MPS bond dimension and T-gate count.
-
Law of Neural Interaction: Depth-Width Shape, Interaction Efficiency, and Generalization
Tuning the depth-width ratio positions models in an efficient neural interaction interval that correlates with better generalization under fixed budgets and remains stable with scale.
-
Surrogate Neural Architecture Codesign Package (SNAC-Pack)
SNAC-Pack is a new framework for hardware-aware neural architecture codesign that uses surrogate models, NSGA-II search, quantization-aware training, and hls4ml synthesis to produce compact FPGA-deployable models.
-
Strategic Over-Parameterization for Generalizable Low-Rank Adaptation
LoRA-Over injects auxiliary parameters into low-rank adapters during training and decomposes them back into standard LoRA at inference, with static or dynamic scheduling to allocate extra capacity where needed, yielding better generalization than vanilla LoRA on GLUE, MT-Bench, GSM8K and HumanEval.
-
On the Stability of Growth in Structural Plasticity
Growth during training inserts new units into a specialized trajectory, making them forward-active but backward-starved with weaker gradients than existing units.
-
Features have life history. And we should care
Language model features form an early stable carrier scaffold of about 50 sparse features that is load-bearing, predictable from onset firing, and recruits most later features.
-
Spectral methods: crucial for machine learning, natural for quantum computers?
Quantum computers may enable more natural manipulation of Fourier spectra in ML models via the Quantum Fourier Transform, potentially leading to resource-efficient spectral methods.
-
Convolutional Dictionary Learning in Hierarchical Networks
A hierarchical convolutional dictionary learning model for piecewise smooth signals using recursive scale-detail filtering and sparse coding, learned by alternating minimization and demonstrated on MNIST.
-
Deep network as memory space: complexity, generalization, disentangled representation and interpretability
Deep networks are framed as memory spaces whose complexity is defined by a Fisher metric, with the least action principle linking this complexity to generalization and disentanglement for better interpretability.
-
Empty SPACE: Cross-Attention Sparsity for Concept Erasure in Diffusion Models
SPACE induces sparsity in cross-attention parameters via closed-form iterative updates to erase target concepts more effectively than dense baselines in large diffusion models.
-
Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency
Widthwise pruning of LVLM language backbones combined with supervised finetuning and hidden-state distillation recovers over 95% performance using just 5% of data across 3B-7B models.
-
Representation-Aligned Multi-Scale Personalization for Federated Learning
FRAMP generates client-specific models from compact descriptors in federated learning, trains tailored submodels, and aligns representations to balance personalization with global consistency.
-
Modality-Aware Zero-Shot Pruning and Sparse Attention for Efficient Multimodal Edge Inference
SentryFuse delivers modality-aware zero-shot pruning and sparse attention that improves accuracy by 12.7% on average and up to 18% under sensor dropout while cutting memory 28.2% and latency up to 1.63x across multimodal edge models.
-
Beyond Dense Connectivity: Explicit Sparsity for Scalable Recommendation
SSR uses static random filters and iterative competitive sparse mechanisms to explicitly enforce sparsity in recommendation models, outperforming dense baselines on public and billion-scale industrial datasets.
-
Sparse Orthogonal Parameters Tuning for Continual Learning
SoTU merges sparse orthogonal delta parameters learned across streaming tasks to fuse knowledge and mitigate forgetting in pre-trained model continual learning.
-
Prune-Quantize-Distill: An Ordered Pipeline for Efficient Neural Network Compression
The prune-quantize-distill ordering produces a better accuracy-size-latency frontier on CIFAR-10/100 than any single technique or other orderings, with INT8 QAT providing the main runtime gain.
-
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.
-
The Mathematics of AI Winters: The mathematical Taxonomy of Paradigm Fragility in AI Winter
Established mathematical bottlenecks in representation, optimization, complexity, and high-dimensional learning aligned with the central disappointments of early AI research periods.
-
Resource-Constrained Affect Modelling via Variance Regularisation Pruning
Variance-Regularised Pruning maintains competitive CCC performance at 80% sparsity on the AGAIN dataset by incorporating cross-participant variance into the pruning process.
-
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.