Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
Back to basics: Revisiting exploration in reinforcement learning for llm reasoning via generative probabilities.arXiv preprint arXiv:2602.05281
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.LG 2years
2026 2representative citing papers
Entrocraft uses rejection sampling to enforce precise entropy schedules in LLM RL by biasing advantages, enabling longer training, better generalization, and higher performance than baselines.
citing papers explorer
-
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
-
Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control
Entrocraft uses rejection sampling to enforce precise entropy schedules in LLM RL by biasing advantages, enabling longer training, better generalization, and higher performance than baselines.