International Conference on Learning Representations , year=

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks , author=

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

browse 8 citing papers

representative citing papers

Toy Combinatorial Interpretability Models Reveal Lottery Tickets in Early Feature Space

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

In a combinatorial toy setting, winning lottery tickets preserve families of compatible feature locations in early feature space that balance proximity to final codes with low interference, rather than specific weight subnetworks.

HORST: Composing Optimizer Geometries for Sparse Transformer Training

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

HORST uses non-commutative operator composition and a hyperbolic mirror map to combine stability from adaptive optimizers with L1 sparsity bias, outperforming AdamW across sparsity levels on vision and language tasks.

Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems

cs.CL · 2026-05-15 · unverdicted · novelty 6.0

Nexa learns a response-conditioned policy that starts with parallel agent execution and adds at most one round of sequential message passing via a predicted sparse DAG, strictly subsuming pure parallel mode.

Deciphering Shortcut Learning from an Evolutionary Game Theory Perspective

cs.AI · 2026-05-04 · unverdicted · novelty 6.0

Evolutionary game theory shows gradient descent and stochastic gradient descent drive neural networks to distinct stable states favoring shortcut or core subnetworks, with data and optimization noise shaping shortcut bias formation.

Universally Empowering Zeroth-Order Optimization via Adaptive Layer-wise Sampling

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

AdaLeZO uses a non-stationary multi-armed bandit to adaptively allocate perturbation budget across layers in zeroth-order optimization and applies inverse probability weighting to reduce variance while preserving unbiased gradients, delivering 1.7x-3.0x wall-clock speedup on LLaMA and OPT models.

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

cs.CL · 2019-10-02 · unverdicted · novelty 6.0

DistilBERT compresses BERT by 40% via pre-training distillation with a triple loss, retaining 97% performance and running 60% faster.

Are Candidate Models Really Needed for Active Learning?

cs.CV · 2026-05-14 · unverdicted · novelty 5.0

Active learning with randomly initialized models achieves comparable results to traditional candidate-model methods, with low-confidence sampling proving most effective.

Position: Ideas Should be the Center of Machine Learning Research

cs.LG · 2026-05-14 · conditional · novelty 4.0

Machine learning research should prioritize ideas by testing their predicted behavioral signatures in modern models through custom experiments instead of leaderboard chasing or abstract theorems.

citing papers explorer

Showing 8 of 8 citing papers.

Toy Combinatorial Interpretability Models Reveal Lottery Tickets in Early Feature Space cs.LG · 2026-05-18 · unverdicted · none · ref 1
In a combinatorial toy setting, winning lottery tickets preserve families of compatible feature locations in early feature space that balance proximity to final codes with low interference, rather than specific weight subnetworks.
HORST: Composing Optimizer Geometries for Sparse Transformer Training cs.LG · 2026-05-20 · unverdicted · none · ref 117
HORST uses non-commutative operator composition and a hyperbolic mirror map to combine stability from adaptive optimizers with L1 sparsity bias, outperforming AdamW across sparsity levels on vision and language tasks.
Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems cs.CL · 2026-05-15 · unverdicted · none · ref 94
Nexa learns a response-conditioned policy that starts with parallel agent execution and adds at most one round of sequential message passing via a predicted sparse DAG, strictly subsuming pure parallel mode.
Deciphering Shortcut Learning from an Evolutionary Game Theory Perspective cs.AI · 2026-05-04 · unverdicted · none · ref 28
Evolutionary game theory shows gradient descent and stochastic gradient descent drive neural networks to distinct stable states favoring shortcut or core subnetworks, with data and optimization noise shaping shortcut bias formation.
Universally Empowering Zeroth-Order Optimization via Adaptive Layer-wise Sampling cs.LG · 2026-04-20 · unverdicted · none · ref 22
AdaLeZO uses a non-stationary multi-armed bandit to adaptively allocate perturbation budget across layers in zeroth-order optimization and applies inverse probability weighting to reduce variance while preserving unbiased gradients, delivering 1.7x-3.0x wall-clock speedup on LLaMA and OPT models.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter cs.CL · 2019-10-02 · unverdicted · none · ref 8
DistilBERT compresses BERT by 40% via pre-training distillation with a triple loss, retaining 97% performance and running 60% faster.
Are Candidate Models Really Needed for Active Learning? cs.CV · 2026-05-14 · unverdicted · none · ref 143
Active learning with randomly initialized models achieves comparable results to traditional candidate-model methods, with low-confidence sampling proving most effective.
Position: Ideas Should be the Center of Machine Learning Research cs.LG · 2026-05-14 · conditional · none · ref 27
Machine learning research should prioritize ideas by testing their predicted behavioral signatures in modern models through custom experiments instead of leaderboard chasing or abstract theorems.

International Conference on Learning Representations , year=

fields

years

verdicts

representative citing papers

citing papers explorer