Decoupled weight decay regularization
4 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (4)
verdicts: UNVERDICTED (4)
roles: background (2)
citing papers explorer
- Martingale-Consistent Self-Supervised Learning
  The paper develops a martingale-consistent SSL framework enforcing expected coherence between coarse and refined predictions via new objectives and a Monte Carlo estimator, improving robustness under partial observations.
- Spectral Tail Auxiliary Learning for AI-Generated Image Detection
  STAL transfers spectral tail uplift cues via a frequency teacher to train a spatial detector for AI-generated images, discarding frequency modules at inference for strong cross-generator generalization.
- Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
  A framework quantifies hyperparameter transfer via scaling-law fit quality, extrapolation robustness, and loss penalty, with ablations showing that μP's advantage over standard parameterization stems from maximizing the embedding layer learning rate to avoid bottlenecks and instabilities in AdamW.
- Learning Large-Scale Modular Addition with an Auxiliary Modulus
  An auxiliary modulus during training reduces wrap-around issues and preserves train-test input distributions, enabling better accuracy and sample efficiency for large N and q in modular addition learning.