Optimizers qualitatively alter solutions and we should leverage this

Razvan Pascanu, Clare Lyle, Ionut-Vlad Modoranu, Naima Elosegui Borras, Dan Alistarh, Petar Velickovic, Sarath Chandar, Soham De, James Martens · 2025 · arXiv 2507.12224

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

How does the optimizer implicitly bias the model merging loss landscape?

cs.LG · 2025-10-06 · unverdicted · novelty 7.0

Effective noise scale non-monotonically governs model merging success with an optimum, unifying effects of learning rate, weight decay, batch size, and augmentation on the loss landscape.

Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

The same Transformer architecture follows different spectral scaling laws under different optimizers, with Muon achieving linear hard-rank scaling on tail representations while AdamW shows weak scaling, even when perplexity is matched.

Benefits of Low-Cost Bio-Inspiration in the Age of Overparametrization

cs.RO · 2026-04-22 · unverdicted · novelty 3.0

Shallow MLPs and dense CPGs outperform deeper MLPs and Actor-Critic RL in bounded robot control tasks with limited proprioception, with a Parameter Impact metric indicating extra RL parameters yield no performance gain over evolutionary strategies.

citing papers explorer

Showing 3 of 3 citing papers.

How does the optimizer implicitly bias the model merging loss landscape? cs.LG · 2025-10-06 · unverdicted · none · ref 8
Effective noise scale non-monotonically governs model merging success with an optimum, unifying effects of learning rate, weight decay, batch size, and augmentation on the loss landscape.
Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws cs.LG · 2026-05-20 · unverdicted · none · ref 3
The same Transformer architecture follows different spectral scaling laws under different optimizers, with Muon achieving linear hard-rank scaling on tail representations while AdamW shows weak scaling, even when perplexity is matched.
Benefits of Low-Cost Bio-Inspiration in the Age of Overparametrization cs.RO · 2026-04-22 · unverdicted · none · ref 30
Shallow MLPs and dense CPGs outperform deeper MLPs and Actor-Critic RL in bounded robot control tasks with limited proprioception, with a Parameter Impact metric indicating extra RL parameters yield no performance gain over evolutionary strategies.

Optimizers qualitatively alter solutions and we should leverage this

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer