Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194
3 Pith papers cite this work.
citing papers explorer
- IGT-OMD: Implicit Gradient Transport for Decision-Focused Learning under Delayed Feedback
  IGT-OMD reduces gradient transport error from quadratic to linear in delay length for delayed bilevel optimization and achieves sublinear regret with adaptive steps.
- Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
  Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
- An Efficient Algorithm for Minimizing Ordered Norms in Fractional Load Balancing
  A randomized (1+ε)-approximation algorithm for ordered-norm load balancing uses O((n+d)(ε^{-2} + log log d) log(n+d)) linear-oracle calls via follow-the-regularized-leader prices and martingale progress analysis.