hub

and Novak, Roman and Liu, Peter J

· 2024 · arXiv 2407.05872

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

read on arXiv browse 10 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

GQA-{\mu}P: The maximal parameterization update for grouped query attention

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.

How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.

Unlearning with Asymmetric Sources: Improved Unlearning-Utility Trade-off with Public Data

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Asymmetric Langevin Unlearning uses public data to suppress unlearning noise costs by O(1/n_pub²), enabling practical mass unlearning with preserved utility under distribution mismatch.

Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model

cond-mat.dis-nn · 2026-02-04 · unverdicted · novelty 7.0

In a random feature model, optimal SGD learning-rate schedules are polynomial decay in the easy phase and warmup-stable-decay in the hard phase, outperforming constant or simple power-law schedules and transferring differently across training horizons.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

cs.LG · 2025-02-07 · unverdicted · novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

Learning Rate Transfer in Normalized Transformers

cs.LG · 2026-04-29 · unverdicted · novelty 6.0

νGPT is a modified parameterization of normalized transformers that enables learning rate transfer across width, depth, and token horizon.

C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

C-voting improves recurrent reasoning models by selecting among multiple latent trajectories the one with highest average top-1 probability, achieving 4.9% better Sudoku-hard accuracy than energy-based voting and outperforming HRM on Sudoku-extreme and Maze when paired with the new ItrSA++ model.

Parcae: Scaling Laws For Stable Looped Language Models

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.

Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models

cond-mat.dis-nn · 2025-02-07 · unverdicted · novelty 6.0

Derives a novel two-point deterministic equivalence for random matrix resolvents to obtain unified asymptotics for SGD-trained linear regression, kernel regression, and random feature models.

GWT: Scalable Optimizer State Compression for Large Language Model Training

cs.LG · 2025-01-13 · unverdicted · novelty 6.0

GWT projects gradients into wavelet subspaces to compress optimizer states for memory-efficient LLM training while claiming performance parity with full-rank updates.

citing papers explorer

Showing 10 of 10 citing papers.

GQA-{\mu}P: The maximal parameterization update for grouped query attention cs.LG · 2026-05-14 · unverdicted · none · ref 6
Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.
How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization cs.LG · 2026-05-13 · unverdicted · none · ref 79
The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.
Unlearning with Asymmetric Sources: Improved Unlearning-Utility Trade-off with Public Data cs.LG · 2026-05-11 · unverdicted · none · ref 141
Asymmetric Langevin Unlearning uses public data to suppress unlearning noise costs by O(1/n_pub²), enabling practical mass unlearning with preserved utility under distribution mismatch.
Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model cond-mat.dis-nn · 2026-02-04 · unverdicted · none · ref 7
In a random feature model, optimal SGD learning-rate schedules are polynomial decay in the easy phase and warmup-stable-decay in the hard phase, outperforming constant or simple power-law schedules and transferring differently across training horizons.
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach cs.LG · 2025-02-07 · unverdicted · none · ref 51
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
Learning Rate Transfer in Normalized Transformers cs.LG · 2026-04-29 · unverdicted · none · ref 4
νGPT is a modified parameterization of normalized transformers that enables learning rate transfer across width, depth, and token horizon.
C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions cs.LG · 2026-04-15 · unverdicted · none · ref 5
C-voting improves recurrent reasoning models by selecting among multiple latent trajectories the one with highest average top-1 probability, achieving 4.9% better Sudoku-hard accuracy than energy-based voting and outperforming HRM on Sudoku-extreme and Maze when paired with the new ItrSA++ model.
Parcae: Scaling Laws For Stable Looped Language Models cs.LG · 2026-04-14 · unverdicted · none · ref 26
Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.
Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models cond-mat.dis-nn · 2025-02-07 · unverdicted · none · ref 21
Derives a novel two-point deterministic equivalence for random matrix resolvents to obtain unified asymptotics for SGD-trained linear regression, kernel regression, and random feature models.
GWT: Scalable Optimizer State Compression for Large Language Model Training cs.LG · 2025-01-13 · unverdicted · none · ref 12
GWT projects gradients into wavelet subspaces to compress optimizer states for memory-efficient LLM training while claiming performance parity with full-rank updates.

and Novak, Roman and Liu, Peter J

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer