Goh, Why momentum really works, Distill 10.23915/distill.00006 (2017)

2017 · DOI 10.23915/distill.00006

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Denoise First, Orthogonalize Later: Understanding Momentum in Muon via Spectral Filtering

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

Momentum in Muon functions as a spectral filter on signal-plus-perturbation gradients, enlarging the gap to stabilize singular subspaces before orthogonalization and outperforming the reverse order.

Solving Classical and Quantum Spin Glasses with Deep Boltzmann Quantum States

cond-mat.dis-nn · 2026-05-15 · unverdicted · novelty 6.0

Deep Boltzmann Quantum States with natural-gradient optimization and annealing-like training match exact or best-known solutions for large infinite-range Ising spin glasses and solve job shop scheduling instances.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Scaling Laws for Transfer

cs.LG · 2021-02-02 · unverdicted · novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

citing papers explorer

Denoise First, Orthogonalize Later: Understanding Momentum in Muon via Spectral Filtering
cs.LG · 2026-06-02 · unverdicted · none · ref 20

Solving Classical and Quantum Spin Glasses with Deep Boltzmann Quantum States
cond-mat.dis-nn · 2026-05-15 · unverdicted · none · ref 110

Language Models (Mostly) Know What They Know
cs.CL · 2022-07-11 · unverdicted · none · ref 227

A General Language Assistant as a Laboratory for Alignment
cs.CL · 2021-12-01 · conditional · none · ref 150

Scaling Laws for Transfer
cs.LG · 2021-02-02 · unverdicted · none · ref 108