International Conference on Learning Representations
8 Pith papers cite this work.
citing papers explorer
- Toy Combinatorial Interpretability Models Reveal Lottery Tickets in Early Feature Space
  In a combinatorial toy setting, winning lottery tickets preserve families of compatible feature locations in early feature space that balance proximity to final codes with low interference, rather than specific weight subnetworks.
- HORST: Composing Optimizer Geometries for Sparse Transformer Training
  HORST uses non-commutative operator composition and a hyperbolic mirror map to combine stability from adaptive optimizers with L1 sparsity bias, outperforming AdamW across sparsity levels on vision and language tasks.
- Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems
  Nexa learns a response-conditioned policy that starts with parallel agent execution and adds at most one round of sequential message passing via a predicted sparse DAG, strictly subsuming pure parallel mode.
- Deciphering Shortcut Learning from an Evolutionary Game Theory Perspective
  Evolutionary game theory shows gradient descent and stochastic gradient descent drive neural networks to distinct stable states favoring shortcut or core subnetworks, with data and optimization noise shaping shortcut bias formation.
- Universally Empowering Zeroth-Order Optimization via Adaptive Layer-wise Sampling
  AdaLeZO uses a non-stationary multi-armed bandit to adaptively allocate perturbation budget across layers in zeroth-order optimization and applies inverse probability weighting to reduce variance while preserving unbiased gradients, delivering 1.7x-3.0x wall-clock speedup on LLaMA and OPT models.
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
  DistilBERT compresses BERT by 40% via pre-training distillation with a triple loss, retaining 97% performance and running 60% faster.
- Are Candidate Models Really Needed for Active Learning?
  Active learning with randomly initialized models achieves comparable results to traditional candidate-model methods, with low-confidence sampling proving most effective.
- Position: Ideas Should be the Center of Machine Learning Research
  Machine learning research should prioritize ideas by testing their predicted behavioral signatures in modern models through custom experiments instead of leaderboard chasing or abstract theorems.