arXiv preprint arXiv:2408.13359

Power Scheduler: A Batch Size, Token Number Agnostic Learning Rate Scheduler · arXiv 2408.13359

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

LoopMoE is a looped MoE language model that outperforms matched vanilla MoE on 8 of 9 downstream benchmarks at 3B scale and continues to outperform at 9B scale under strictly controlled budgets.

Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

Optimal hyperparameters for LLM continued pre-training follow predictable scaling laws derived from proxy models, enabling a two-stage framework that predicts settings from compute budget and checkpoint state to reduce search overhead by 90%.

citing papers explorer


Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training

cs.CL · 2026-06-04 · unverdicted · none · ref 24

Optimal hyperparameters for LLM continued pre-training follow predictable scaling laws derived from proxy models, enabling a two-stage framework that predicts settings from compute budget and checkpoint state to reduce search overhead by 90%.
