pith. machine review for the scientific record.

arxiv: 1412.6980 · v9 · submitted 2014-12-22 · 💻 cs.LG

Recognition: 4 theorem links


Adam: A Method for Stochastic Optimization

Authors on Pith no claims yet

Pith reviewed 2026-05-09 01:36 UTC · model claude-opus-4-7

classification 💻 cs.LG · MSC 90C15, 68T05, 90C25
keywords stochastic optimization · adaptive learning rate · first-order methods · moment estimation · bias correction · online convex optimization · deep learning · AdaMax

The pith

Adam sets per-parameter step sizes from bias-corrected running averages of the gradient and its square, giving a robust default optimizer for noisy, high-dimensional problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks: can a first-order optimizer with almost no tuning reliably train large, noisy, sparse-gradient problems? Its answer is Adam, which keeps two running averages per parameter — one of the gradient, one of the squared gradient — and divides the first by the square root of the second to set a per-coordinate step. A small but important detail is the explicit bias correction that undoes the zero-initialization of those averages, which matters most when the second-moment decay rate β₂ is set close to 1 to handle sparse gradients. The construction makes the effective step size invariant to gradient rescaling and approximately bounded by the user-chosen α, so α functions like a trust-region radius rather than a raw learning rate. Empirically the authors show that one default setting tracks or beats AdaGrad, RMSProp, SGD with Nesterov momentum, AdaDelta, and a quasi-Newton baseline on logistic regression, MLPs with dropout, and convnets, and they derive an infinity-norm variant (AdaMax) with an even cleaner update bound.

Core claim

The paper proposes Adam, a first-order stochastic optimizer that maintains two exponential moving averages per parameter — one of the gradient (first moment) and one of the squared gradient (second raw moment) — and uses their ratio, with an explicit bias correction for the zero-initialization of those averages, to set a per-parameter step size. The authors argue this combines the sparse-gradient handling of AdaGrad with the non-stationarity handling of RMSProp, while the effective per-step move in parameter space stays approximately bounded by the user-chosen stepsize α, giving the method a built-in trust-region feel. They claim a single set of defaults (α=0.001, β₁=0.9, β₂=0.999, ε=1e-8) works well across a wide range of problems with little or no tuning.

What carries the argument

The bias-corrected ratio m̂_t / √v̂_t, where m_t and v_t are exponential moving averages of g_t and g_t², and the corrections m̂_t = m_t/(1−β₁ᵗ), v̂_t = v_t/(1−β₂ᵗ) undo the zero-initialization bias. This ratio is gradient-scale invariant, behaves like a per-coordinate signal-to-noise ratio that automatically anneals near optima, and bounds the per-step parameter move by roughly α — turning the stepsize hyperparameter into something close to a trust-region radius.
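The update is compact enough to sketch directly; the following is a minimal NumPy paraphrase of Algorithm 1 (not the paper's reference code), with ε placed outside the square root as in the paper's pseudocode:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (paraphrase of Algorithm 1; t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad**2     # second-moment EMA
    m_hat = m / (1 - beta1**t)                # undo zero-init bias
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta
theta = np.array([1.0, -2.0])
m = v = np.zeros_like(theta)
for t in range(1, 501):
    theta, m, v = adam_step(theta, theta, m, v, t)
```

Note how each coordinate moves by roughly α per step early on (the SNR m̂/√v̂ is near ±1 for a slowly varying gradient), which is the trust-region behavior described above.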

If this is right

  • A practitioner can train a wide range of deep models with the same optimizer and the same defaults, removing learning-rate tuning as a first-order concern.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Editorial: the regret proof's telescoping step requires √v̂_t/α_t to be non-decreasing along each coordinate, which is not generally true; later work has constructed simple convex counterexamples on which Adam diverges, so the O(√T) bound as stated should be read as suggestive rather than airtight, even though the empirical recipe survives unchanged.

Load-bearing premise

The regret proof leans on a quantity that grows monotonically along every coordinate as training proceeds; the paper asserts this without justification, and the bound only holds where that monotonicity actually holds.
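The failure is easy to exhibit numerically. A hypothetical two-step gradient sequence (large then small) makes √(t·v̂_t) — the quantity that must be non-decreasing for the telescoping to hold with α_t = α/√t — go down:

```python
import math

beta2 = 0.999
grads = [1.0, 0.01]          # a large gradient followed by a small one
v, q = 0.0, []
for t, g in enumerate(grads, start=1):
    v = beta2 * v + (1 - beta2) * g**2   # EMA of squared gradients
    v_hat = v / (1 - beta2**t)           # bias-corrected estimate
    q.append(math.sqrt(t * v_hat))       # proportional to sqrt(v_hat)/alpha_t

# q[1] < q[0]: the quantity the proof needs to be monotone just decreased
```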

What would settle it

Run Adam with the recommended defaults against well-tuned SGD-with-momentum, AdaGrad, and RMSProp on the same suite of problems (MNIST logistic regression and MLP, IMDB bag-of-words logistic regression, CIFAR-10 convnet, and a variational autoencoder). If Adam fails to match or beat them on training loss within the same wall-clock budget, or if removing the bias-correction terms does not visibly destabilize training when β₂ is close to 1, the central practical claim fails. For the regret claim, a convex online sequence on which Adam's iterates do not satisfy R(T)=O(√T) would falsify the theorem as stated.

read the original abstract

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 8 minor

Summary. The paper introduces Adam, a first-order stochastic optimizer that maintains exponential moving averages of the gradient (m_t) and the squared gradient (v_t), applies bias-correction for the zero-initialization, and updates parameters by θ_t ← θ_{t-1} − α · m̂_t / (√v̂_t + ε). The authors motivate the update via a signal-to-noise interpretation, derive the bias-correction from the EMA recurrence, prove an O(√T) regret bound in the online convex setting (Theorem 4.1), present an L_∞-norm variant (AdaMax), and report experiments on logistic regression (MNIST, IMDB-BoW), MLPs (MNIST, with and without dropout), CNNs (CIFAR-10), and a VAE. Default hyperparameters (α=10⁻³, β₁=0.9, β₂=0.999, ε=10⁻⁸) are recommended and shown to be competitive with or better than SGD+Nesterov, AdaGrad, RMSProp, AdaDelta, and SFO.
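The bias-correction divisor can be read off by unrolling the EMA recursion; paraphrasing §3:

```latex
v_t = (1-\beta_2)\sum_{i=1}^{t}\beta_2^{\,t-i}\,g_i^2
\quad\Longrightarrow\quad
\mathbb{E}[v_t] = \mathbb{E}[g_t^2]\,(1-\beta_2)\sum_{i=1}^{t}\beta_2^{\,t-i} + \zeta
              = \mathbb{E}[g_t^2]\,(1-\beta_2^{\,t}) + \zeta ,
```

so dividing v_t by (1−β₂^t) yields v̂_t with E[v̂_t] ≈ E[g_t²]; the residual ζ collects the drift of E[g_i²] over the averaging window and vanishes for stationary gradients. The m̂_t correction is identical with β₁ in place of β₂.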

Significance. If the algorithmic and empirical claims hold, Adam offers a practically important contribution: a simple, memory-light, scale-invariant adaptive optimizer with intuitive hyperparameters that performs robustly across convex and non-convex deep learning workloads. The bias-correction derivation in §3 is clean and useful in its own right (it cleanly explains an effect that earlier RMSProp-with-momentum variants get wrong for β₂ near 1), and the SNR/effective-step discussion in §2.1 gives a usable mental model for setting α. The AdaMax derivation (§7.1) is elegant and yields a particularly simple update with a tighter step bound |Δ_t| ≤ α. The empirical comparisons span enough model classes (logistic regression, fully-connected nets with/without dropout, CNNs, VAE) to support the robustness claim, and the bias-correction ablation in §6.4 is a genuinely informative experiment. The theoretical contribution (Theorem 4.1) is partial — see major comments — but the algorithmic and empirical case is strong.
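The AdaMax update (§7.1, Eq. 12) replaces the L₂ second moment with an exponentially weighted infinity norm and needs no bias correction on u_t; a minimal sketch with stand-in random gradients, checking the |Δ_t| ≤ α step bound numerically (the 1e-12 guard against division by zero is ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta1, beta2 = 0.002, 0.9, 0.999
theta = np.zeros(5)
m, u = np.zeros(5), np.zeros(5)
max_step = 0.0
for t in range(1, 1001):
    g = rng.normal(size=5)                  # stand-in stochastic gradients
    m = beta1 * m + (1 - beta1) * g         # first-moment EMA, as in Adam
    u = np.maximum(beta2 * u, np.abs(g))    # exponentially weighted infinity norm
    step = (alpha / (1 - beta1**t)) * m / (u + 1e-12)
    max_step = max(max_step, np.abs(step).max())
    theta -= step
```

On this run the largest per-coordinate move stays at or below α, consistent with the cleaner bound the review mentions.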

major comments (4)
  1. [§4 / §10.1, Theorem 4.1 / 10.5] The regret proof contains a load-bearing step that is not justified. In the displayed bound at the top of p. 14, the sum ∑_{t=2}^T (θ_{t,i}−θ*_i)² (√v̂_{t,i}/α_t − √v̂_{t−1,i}/α_{t−1}) is replaced by (D²/(2α(1−β₁))) ∑_i √(T v̂_{T,i}). This telescoping is valid only if √v̂_{t,i}/α_t is non-decreasing in t for every coordinate i. With α_t = α/√t, the quantity is √(t·v̂_{t,i})/α, and since v̂_t is a bias-corrected EMA of g_t² it can strictly decrease whenever a coordinate sees a small gradient following a large one. The authors should either (i) state and justify a monotonicity assumption on v̂_t, (ii) carry through the proof with the absolute value of the increment (which changes the bound), or (iii) restrict the theorem to a class of sequences for which the monotonicity holds. As written, the bound is not established for general bounded convex sequences, and a one-dimensional counterexample (one large gradient followed by small ones) suffices to exhibit the failure.
  2. [§4, Theorem 4.1 statement] The hypothesis β₁²/√β₂ < 1 is stated but its role should be made explicit in the main text — it is used in Lemma 10.4 to bound an arithmetic-geometric series. With the recommended defaults β₁=0.9, β₂=0.999 one has β₁²/√β₂ ≈ 0.811, so the assumption is satisfied at defaults; however readers tuning β₁ upward (a common practice with momentum) can violate it. Please flag this in §4 alongside the theorem so that the regime of validity is clear.
  3. [§6.3, Figure 3] The CNN experiment reports that v̂_t 'vanishes to zeros after a few epochs and is dominated by the ε in algorithm 1', and that consequently 'Adagrad converges much slower than others' while Adam shows only 'marginal improvement over SGD with momentum'. This is an interesting and honest observation, but it slightly undercuts the central claim that adaptive second-moment scaling is the source of Adam's advantage. It would strengthen the paper to (a) report what fraction of coordinates have √v̂_t < ε at the cited epochs, and (b) show an ablation in which ε is varied, so readers can tell whether Adam in this regime is effectively SGD-with-momentum + a small constant preconditioner or whether the second moment still contributes.
  4. [§5, Related work / RMSProp comparison] The claim that lack of bias-correction in RMSProp 'leads to very large stepsizes and often divergence' for β₂ near 1 is supported by the VAE experiment in §6.4, but the comparison fixes architecture and dataset. Since this is one of the paper's main differentiators from RMSProp, a second setting (e.g., the MLP+dropout or CNN tasks already in the paper) showing the same effect would make the case substantially more robust.
minor comments (8)
  1. [Algorithm 1] The placement of ε inside the square root (√v̂_t + ε) versus inside (√(v̂_t + ε)) matters in practice and differs across implementations. Please state explicitly which convention is used and whether the analysis is affected.
  2. [§2.1] The two cases for the step bound, |Δ_t| ≤ α·(1−β₁)/√(1−β₂) versus |Δ_t| ≤ α, would be clearer with a one-line derivation rather than asserted. Currently the reader has to reconstruct the algebra.
  3. [§3, Eq. (4)] The term ζ is introduced and immediately argued to be small for stationary or slowly-varying gradients, but is not formally bounded. A short remark giving an explicit bound in terms of the variation of E[g_t²] would tighten the derivation.
  4. [§4] The decay schedule β_{1,t} = β₁·λ^{t−1} with λ very close to 1 is required for the proof but is not used in any of the experiments (which appear to use constant β₁=0.9). Please clarify whether the empirical performance corresponds to a regime covered by the theorem.
  5. [§7.1, Eq. (12)] It would be helpful to note that u_t = max(β₂·u_{t−1}, |g_t|) corresponds to a max over an exponentially-weighted history and therefore does not require bias correction, as briefly stated; an explicit derivation showing E[u_t] in the stationary case would parallel §3.
  6. [Lemma 10.3 proof] The inductive step uses the inequality √(a − b) ≤ √a − b/(2√a) which requires a ≥ b ≥ 0; this is fine but worth stating, since a = ∥g_{1:T,i}∥² and b = g_{T,i}² satisfy it by construction.
  7. [§6] The phrase 'searched over a dense grid' for hyperparameters of the baselines is not specific. Listing the grids (at minimum for α and momentum) in an appendix would improve reproducibility.
  8. [Typos] Several minor typos: 'theoratical' (§6.1, twice), 'BoW feature Logistic Regression' axis label, 'Initalization' (§7.2). 'β₁' appears where 'β₂' is meant in the sentence following Eq. (4) ('the exponential decay rate β₁ can be chosen…' — context is the second moment).
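Minor comment 1 is easy to make concrete: when v̂_t is near zero, the two ε conventions produce steps that differ by orders of magnitude. A hypothetical near-converged coordinate:

```python
import math

alpha, eps = 1e-3, 1e-8
m_hat, v_hat = 1e-8, 1e-16   # tiny bias-corrected moments near an optimum

step_outside = alpha * m_hat / (math.sqrt(v_hat) + eps)  # Algorithm 1: sqrt(v)+eps
step_inside  = alpha * m_hat / math.sqrt(v_hat + eps)    # alternative: sqrt(v+eps)

ratio = step_outside / step_inside   # thousands, not a rounding difference
```

This is why implementations that silently differ on the placement can behave very differently in the ε-dominated regime the referee flags in major comment 3.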

Simulated Author's Rebuttal

4 responses · 1 unresolved

We thank the referee for a careful and constructive report. The most substantive point — the unstated monotonicity assumption underlying the telescoping step in the regret proof — is correct, and we will revise Theorem 4.1 and its proof to state the assumption explicitly rather than leaving it implicit. We also agree to flag the β₁²/√β₂ < 1 hypothesis prominently in §4, to add quantitative support to the §6.3 discussion of v̂_t vanishing on CNNs (including an ε ablation), and to broaden the bias-correction comparison in §6.4 beyond the VAE setting. None of these revisions affect the algorithm itself, the bias-correction derivation in §3, the SNR discussion in §2.1, the AdaMax derivation in §7.1, or the empirical conclusions; they sharpen the theoretical statement and strengthen the empirical case. A point-by-point response follows.

read point-by-point responses
  1. Referee: The telescoping step in the regret proof (top of p. 14) replaces ∑_{t=2}^T (θ_{t,i}−θ*_i)² (√v̂_{t,i}/α_t − √v̂_{t−1,i}/α_{t−1}) by (D²/(2α(1−β₁))) ∑_i √(T v̂_{T,i}). This is only valid if √v̂_{t,i}/α_t is non-decreasing in t per coordinate, which need not hold for a bias-corrected EMA of g_t² when a small gradient follows a large one.

    Authors: We agree that this step requires an additional assumption that we did not state explicitly. The telescoping is valid when √(t·v̂_{t,i}) is non-decreasing in t for each coordinate, which is not guaranteed by a bias-corrected EMA of g_t² in general. We will revise §4 and §10.1 in two ways: (i) we will explicitly add the assumption that √(t·v̂_{t,i})/α is non-decreasing in t for all i (equivalently, that t·v̂_{t,i} is non-decreasing), and flag that this is what makes the telescoping well-defined; and (ii) we will note the alternative route in which the increment is replaced by its absolute value, which yields a weaker but unconditional bound. We thank the referee for catching this — the assumption is implicit in our derivation but should be made part of the theorem statement, and we will add a sentence describing the regime in which it is reasonable (sufficiently slowly-varying second-moment estimates) and acknowledging that pathological sequences can violate it. We do not claim a fix for the general non-monotone case in this revision. revision: yes

  2. Referee: The hypothesis β₁²/√β₂ < 1 should be flagged in the main text alongside Theorem 4.1, since users tuning β₁ upward can violate it (defaults satisfy it: 0.9²/√0.999 ≈ 0.811).

    Authors: We agree. We will add a short remark in §4 immediately after the theorem statement noting (i) the role of this assumption — it is used in Lemma 10.4 to bound an arithmetic-geometric series via γ = β₁²/√β₂ < 1 — (ii) that the recommended defaults β₁=0.9, β₂=0.999 give γ ≈ 0.811 and so satisfy it comfortably, and (iii) that practitioners increasing β₁ (e.g. β₁ ≥ 0.95 with default β₂) should check the inequality. We will also include a one-line worked example so the regime of validity is unambiguous. revision: yes

  3. Referee: In the CNN experiment (§6.3), the authors note v̂_t vanishes to near-zero so the update is dominated by ε, which somewhat undercuts the claim that adaptive second-moment scaling drives Adam's advantage. Please report what fraction of coordinates have √v̂_t < ε at the cited epochs, and add an ablation varying ε.

    Authors: This is a fair point and we agree the §6.3 discussion would benefit from quantitative support. In the revision we will add (a) a measurement, taken from the same CIFAR-10 run, of the fraction of coordinates with √v̂_t below ε (and below 10ε, 100ε) as a function of epoch, and (b) an ablation varying ε ∈ {10⁻⁴,10⁻⁶,10⁻⁸,10⁻¹⁰} to expose how much of Adam's behavior in this regime is attributable to the second-moment term versus an effectively constant preconditioner combined with the first-moment term. We will not retract the broader claim — on the logistic, MLP, and VAE experiments the second moment plainly contributes — but we will explicitly state that on CNNs of this size much of Adam's benefit over plain SGD with momentum comes from per-layer scale adaptation early in training and from the first-moment term, and that the improvement margin over well-tuned SGD+momentum is correspondingly modest. This nuance is consistent with what is already written in §6.3 but will be made quantitative. revision: yes

  4. Referee: The claim that absent bias-correction RMSProp diverges for β₂ near 1 is supported only by the VAE experiment (§6.4); a second setting would substantially strengthen the differentiator from RMSProp.

    Authors: We accept this. The bias-correction-vs-no-correction comparison is a central claim and one experiment is thinner than it should be. For the revision we will add a sweep over β₂ ∈ {0.99, 0.999, 0.9999} and α ∈ [10⁻⁵,10⁻¹], with and without the bias-correction terms, on the MLP+dropout MNIST setting from §6.2 (and, if space permits, on the CNN setting from §6.3). We expect — based on the analysis in §3, where the (1−β₂^t) factor is largest precisely when β₂ is near 1 — to reproduce the same instability pattern observed in §6.4. The resulting figure will be added as a panel to Figure 4 or as a new figure in §6.4. revision: yes
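The instability the authors expect is visible even at the very first update: without the (1−β₂^t) divisor, the zero-initialized v_t biases the denominator toward zero, inflating the first step by roughly (1−β₁)/√(1−β₂) relative to α (≈3.2× at β₂=0.999, 10× at β₂=0.9999). A one-step sketch:

```python
import math

alpha, beta1, beta2, g = 1e-3, 0.9, 0.999, 1.0   # first gradient g_1 = 1

m1 = (1 - beta1) * g
v1 = (1 - beta2) * g**2

# corrected (Adam): m_hat/sqrt(v_hat) = 1 here, so |step| = alpha
step_adam = alpha * (m1 / (1 - beta1)) / math.sqrt(v1 / (1 - beta2))

# uncorrected (RMSProp-with-momentum style): zero-init bias inflates the step
step_uncorrected = alpha * m1 / math.sqrt(v1)
```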

standing simulated objections not resolved
  • We do not have a proof of the O(√T) regret bound that dispenses with the monotonicity assumption on √(t·v̂_{t,i})/α. The revised theorem will therefore be conditional on this assumption; an unconditional bound for general bounded convex sequences with bias-corrected v̂_t is left to future work.

Circularity Check

0 steps flagged

No meaningful circularity: Adam's algorithm, bias correction, and empirical claims stand on independent content; the proof gap flagged by the reader is a correctness/soundness issue, not a circular derivation.

full rationale

Walking the derivation chain: (1) §2 Algorithm: defines Adam by EMAs of g and g². No claim is being "derived" from itself — the update rule is a definition. (2) §3 Bias correction: derives E[v_t] = E[g_t²]·(1−β_2^t) + ζ from the EMA recursion (Eq. 1–4). The (1−β_2^t) divisor is then read off this expectation. This is a straightforward algebraic identity, not a circular fit; nothing is fitted to data and then re-predicted. (3) §2.1 SNR / effective stepsize bounds: |Δ_t| ≤ α-style bounds follow from the algebra of m̂_t/√v̂_t. Independent content. (4) §4 / §10 Convergence: Theorem 4.1 derives an O(√T) regret bound from stated assumptions (bounded gradients, bounded iterate distance, β_1²/√β_2 < 1). The reader's concern is that the telescoping step in the proof of Theorem 10.5 implicitly assumes √(t·v̂_{t,i})/α is monotone non-decreasing — a soundness gap later exploited by Reddi et al. (2018). That is a *correctness* problem, not a circularity problem: the bound is not "the input renamed as the output" — it is an attempted proof from external assumptions that turns out to have an unjustified inequality. No quantity is fitted to the regret and then claimed as a prediction of the regret; no self-citation is load-bearing (the proof cites Zinkevich 2003's framework, not the authors' own prior work). (5) §5 Related work / §6 Experiments: comparisons to AdaGrad/RMSProp/SGD use independently implemented baselines on standard datasets (MNIST, IMDB, CIFAR-10). No fitted-input-as-prediction pattern. (6) §7 AdaMax: derived as the p→∞ limit of an L_p generalization (Eq. 6–12). Algebraic limit, not circular. There is essentially no self-citation load: the references are to Duchi, Tieleman & Hinton, Zeiler, Sohl-Dickstein, Zinkevich, etc. The Kingma & Welling (2013) self-cite is only used to specify the VAE architecture used as a *test problem* in §6.4 — it is not load-bearing for any theoretical claim. 
Conclusion: the paper's core claims are self-contained against external benchmarks and standard online-convex machinery. The Theorem 4.1 issue is a real bug in the proof (correctly diagnosed by the reader), but it is a missing-step / unjustified-monotonicity flaw, not a circular derivation. Score: 1 (one minor self-citation, not load-bearing).

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The algorithm itself introduces no postulated entities. Its hyperparameters (α, β₁, β₂, ε) are user-settable knobs, not free parameters fitted to make a derivation work, though the recommended defaults were chosen empirically. The convergence theorem leans on standard online-convex-optimization assumptions plus one assumption that turned out to be false in general.

pith-pipeline@v0.9.0 · 9516 in / 5575 out tokens · 82700 ms · 2026-05-09T01:36:22.993249+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ENSEMBITS: an alphabet of protein conformational ensembles

    cs.LG 2026-05 unverdicted novelty 8.0

    Ensembits creates a discrete vocabulary for protein conformational ensembles that outperforms static tokenizers on dynamics prediction tasks and enables ensemble token prediction from single structures via distillation.

  2. ENSEMBITS: an alphabet of protein conformational ensembles

    cs.LG 2026-05 unverdicted novelty 8.0

    Ensembits is the first tokenizer of protein conformational ensembles that outperforms static tokenizers on RMSF prediction and matches them on function and mutation tasks while using less pretraining data.

  3. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  4. Online Learning-to-Defer with Varying Experts

    stat.ML 2026-05 unverdicted novelty 8.0

    Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.

  5. Spherical Boltzmann machines: a solvable theory of learning and generation in energy-based models

    cs.LG 2026-05 unverdicted novelty 8.0

    In the high-dimensional limit the spherical Boltzmann machine admits exact equations for training dynamics, Bayesian evidence, and cascades of phase transitions tied to mode alignment with data, which connect to gener...

  6. Convergent Stochastic Training of Attention and Understanding LoRA

    cs.LG 2026-05 unverdicted novelty 8.0

    Attention and LoRA regression losses induce Poincaré inequalities under mild regularization, so SGD-mimicking SDEs converge to minimizers with no assumptions on data or model size.

  7. SLayerGen: a Crystal Generative Model for all Space and Layer Groups

    cond-mat.mtrl-sci 2026-05 unverdicted novelty 8.0

    SLayerGen generates crystals invariant to any space or layer group via autoregressive lattice and Wyckoff sampling plus equivariant diffusion, achieving gains over bulk models on diperiodic materials after correcting ...

  8. 3DSS: 3D Surface Splatting for Inverse Rendering

    cs.GR 2026-05 unverdicted novelty 8.0

    3DSS is the first differentiable surface splatting renderer that recovers shape, spatially-varying BRDF materials, and HDR illumination from multi-view images via a coverage-based compositing model derived from recons...

  9. Random test functions, $H^{-1}$ norm equivalence, and stochastic variational physics-informed neural networks

    math.NA 2026-05 unverdicted novelty 8.0

    H^{-1} norm equivalence to expected squared evaluations on domain-dependent random test functions enables SV-PINNs that recover accurate solutions to challenging second-order elliptic PDEs faster than standard PINNs.

  10. A Parameter-Free First-Order Algorithm for Non-Convex Optimization with $\tilde{\mkern1mu O}(\epsilon^{-5/3})$ Global Rate

    math.OC 2026-05 conditional novelty 8.0

    PF-AGD is the first parameter-free deterministic accelerated first-order method with Õ(ε^{-5/3} log(1/ε)) complexity for smooth non-convex optimization.

  11. Characterizing the Expressivity of Local Attention in Transformers

    cs.CL 2026-05 unverdicted novelty 8.0

    Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressive...

  12. STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack

    cs.CR 2026-05 unverdicted novelty 8.0

    STARE uses step-wise RL to attack multimodal models, achieving 68% higher attack success rate while revealing that adversarial optimization concentrates conceptual toxicity early and detail toxicity late in the genera...

  13. Qvine: Vine Structured Quantum Circuits for Loading High Dimensional Distributions

    quant-ph 2026-04 unverdicted novelty 8.0

    Qvine uses vine copula-inspired quantum circuit structures to achieve linear or quadratic depth scaling for loading high-dimensional distributions with high approximation quality.

  14. Neural Spectral Bias and Conformal Correlators I: Introduction and Applications

    hep-th 2026-04 unverdicted novelty 8.0

    Neural networks optimized solely on crossing symmetry reconstruct CFT correlators from minimal input data to few-percent accuracy across generalized free fields, minimal models, Ising, N=4 SYM, and AdS diagrams.

  15. MMGait: Towards Multi-Modal Gait Recognition

    cs.CV 2026-04 conditional novelty 8.0

    MMGait provides a new multi-sensor gait dataset and OmniGait baseline to support single-modal, cross-modal, and unified multi-modal person identification from walking patterns.

  16. Proton Structure from Neural Simulation-Based Inference at the LHC

    hep-ph 2026-04 unverdicted novelty 8.0

    Neural simulation-based inference on unbinned top-quark pair data at 13 TeV yields improved gluon PDF precision over traditional binned analyses while incorporating experimental and theoretical uncertainties.

  17. Adam-HNAG: A Convergent Reformulation of Adam with Accelerated Rate

    math.OC 2026-04 unverdicted novelty 8.0

    Adam-HNAG is a splitting-based reformulation of Adam that yields the first convergence proof for Adam-type methods, including accelerated rates, in convex smooth optimization.

  18. CMCC-ReID: Cross-Modality Clothing-Change Person Re-Identification

    cs.CV 2026-04 unverdicted novelty 8.0

    The paper introduces the CMCC-ReID task, constructs the SYSU-CMCC benchmark dataset, and proposes the PIA network with disentangling and prototype modules that outperforms prior methods on combined modality and clothi...

  19. Traces of Helium Detected in Type Ic Supernova 2014L

    astro-ph.HE 2026-03 accept novelty 8.0

    Quantitative Bayesian inference using a deep-learning emulator detects 0.018-0.020 M_sun of helium in the Type Ic supernova 2014L.

  20. Learning to (Learn at Test Time): RNNs with Expressive Hidden States

    cs.LG 2024-07 conditional novelty 8.0

    TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.

  21. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    cs.AI 2023-06 conditional novelty 8.0

    LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.

  22. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    cs.LG 2022-09 unverdicted novelty 8.0

    Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.

  23. Offline Reinforcement Learning with Implicit Q-Learning

    cs.LG 2021-10 unverdicted novelty 8.0

    IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.

  24. PathVQA: 30000+ Questions for Medical Visual Question Answering

    cs.CL 2020-03 accept novelty 8.0

    PathVQA is the first public dataset of over 32,000 questions on nearly 5,000 pathology images for medical visual question answering.

  25. Passage Re-ranking with BERT

    cs.IR 2019-01 unverdicted novelty 8.0

    Fine-tuning BERT for query-passage relevance classification achieves state-of-the-art results on TREC-CAR and MS MARCO, with a 27% relative gain in MRR@10 over prior methods.

  26. Density estimation using Real NVP

    cs.LG 2016-05 accept novelty 8.0

    Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.

  27. Adaptive Computation Time for Recurrent Neural Networks

    cs.NE 2016-03 accept novelty 8.0

    ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.

  28. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

    cs.LG 2015-11 accept novelty 8.0

    DCGANs with architectural constraints learn a hierarchy of representations from object parts to scenes in both generator and discriminator across image datasets.

  29. NICE: Non-linear Independent Components Estimation

    cs.LG 2014-10 accept novelty 8.0

    NICE learns a composition of invertible neural-network layers that transform data into independent latent variables, enabling exact log-likelihood training and sampling for density estimation.

  30. Generating HDR Video from SDR Video

    cs.CV 2026-05 unverdicted novelty 7.0

    A multi-exposure video model predicts bracketed linear SDR sequences from single nonlinear SDR input, which a merging model combines into HDR video preserving shadow and highlight detail.

  31. Scaling Laws from Sequential Feature Recovery: A Solvable Hierarchical Model

    stat.ML 2026-05 accept novelty 7.0

    A solvable hierarchical model with power-law feature strengths yields explicit power-law scaling of prediction error through sequential recovery of latent directions by a layer-wise spectral algorithm.

  32. LiWi: Layering in the Wild

    cs.CV 2026-05 unverdicted novelty 7.0

    LiWi uses an agent-driven data synthesis pipeline to build the LiWi-100k dataset and a model with shadow-guided and degradation-restoration objectives that achieves SoTA performance on RGB L1 and Alpha IoU for natural...

  33. Lang2MLIP: End-to-End Language-to-Machine Learning Interatomic Potential Development with Autonomous Agentic Workflows

    cs.LG 2026-05 unverdicted novelty 7.0

    Lang2MLIP is an LLM multi-agent framework that automates end-to-end development of machine learning interatomic potentials from natural language input for heterogeneous materials systems.

  34. Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm

    cs.LG 2026-05 conditional novelty 7.0

    A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.

  35. Multiple mechanisms of rhythm switching in recurrent neural networks with adaptive time constants

    q-bio.NC 2026-05 unverdicted novelty 7.0

    Leaky integrator RNNs with adaptive time constants switch between four frequency bands using multiple mechanisms including subpopulation turnover, baseline shifts, and phase reorganization, with high frequencies domin...

  36. Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

    cs.LG 2026-05 unverdicted novelty 7.0

    DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

  37. Convergence of difference inclusions via a diameter criterion

    math.OC 2026-05 unverdicted novelty 7.0

    A diameter criterion tied to a potential function certifies convergence of difference inclusions, enabling discrete proofs for first-order optimization methods with diminishing steps.

  38. A Neural-Network Framework to Learn History-Dependent Constitutive Laws and Identifiability of Internal Variables

    cond-mat.mtrl-sci 2026-05 unverdicted novelty 7.0

    A causal energetic neural network framework learns thermodynamically consistent history-dependent constitutive laws, proving internal variables are unique up to linear transformation and achieving 2% error on polycrys...

  39. Stochastic global optimization of continuous functions via random walks on Grassmannians

    math.OC 2026-05 unverdicted novelty 7.0

    A stochastic global optimizer samples random k-dimensional subspaces, solves the restricted problem on each, and moves to the improved point, with rate controlled by a gap parameter on the distribution of restricted minima.

  40. Determining star formation histories and age-metallicity relations with convolutional neural networks

    astro-ph.GA 2026-05 unverdicted novelty 7.0

    A CNN with attention and shared latent space recovers SFHs and metallicities from spectro-photometric data with ~0.12 dex age and ~0.03 dex metallicity dispersion while running thousands of times faster than full spec...

  41. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 7.0

    R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.

  42. QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling

    cs.LG 2026-05 unverdicted novelty 7.0

    QLAM extends state-space models with quantum superposition in the hidden state for linear-time long-sequence modeling and reports consistent gains over RNN and transformer baselines on sequential image tasks.

  43. Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo

    cond-mat.str-el 2026-05 conditional novelty 7.0

    PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.

  44. Learning to Optimize Radiotherapy Plans via Fluence Maps Diffusion Model Generation and LSTM-based Optimization

    cs.CV 2026-05 unverdicted novelty 7.0

    A distilled diffusion model generates clinically feasible fluence maps for VMAT and an LSTM-based optimizer refines them to meet dose objectives, improving efficiency and deliverability on prostate cancer data.

  45. A Majorization-Minimization with Monte Carlo Approach for Hyperparameter Estimation

    math.NA 2026-05 unverdicted novelty 7.0

    M³C replaces the hard hyperparameter optimization with a sequence of simpler problems using a majorant for the log-determinant approximated via Monte Carlo, with proven high-probability convergence to a critical point...

  46. Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment...

  47. VLTI/PIONIER imaging of post-AGB binaries. An INSPIRING hunt for inner rim substructures in circumbinary discs

    astro-ph.SR 2026-05 unverdicted novelty 7.0

    High-resolution interferometric imaging of eight post-AGB circumbinary discs reveals diverse inner-rim substructures including azimuthal brightness enhancements and arc-like features not explained by inclination alone.

  48. Beyond Oversquashing: Understanding Signal Propagation in GNNs Via Observables

    cs.LG 2026-05 unverdicted novelty 7.0

    Quantum-inspired observables reveal poor signal routing in standard spectral GNNs and motivate Schrödinger GNNs with superior propagation capacity.

  49. Spatial Competition for Low-Complexity Learned Image Compression

    eess.IV 2026-05 unverdicted novelty 7.0

    Spatial competition among specialized neural codecs with a transmitted mode map achieves up to 14.5% rate savings over a single codec while matching HEVC performance at single-codec decoding complexity.

  50. Backdoor Channels Hidden in Latent Space: Cryptographic Undetectability in Modern Neural Networks

    cs.CR 2026-05 unverdicted novelty 7.0

    Backdoors can be realized as statistically natural latent directions in modern neural networks, achieving high attack success with negligible clean accuracy loss and resisting existing defenses.

  51. STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition

    cs.CV 2026-05 conditional novelty 7.0

    STAR improves 1-shot action recognition by up to 8.1% on SSv2-Full through semantic-temporal alignment and Mamba-based prototype refinement.

  52. On Hallucinations in Inverse Problems: Fundamental Limits and Provable Assessment Methods

    stat.ML 2026-05 unverdicted novelty 7.0

    Hallucinations in inverse problem reconstructions are fundamental to ill-posedness, with necessary and sufficient conditions plus computable bounds depending only on the forward model.

  53. Identifying the nonlinear string dynamics with port-Hamiltonian neural networks

    cs.LG 2026-05 unverdicted novelty 7.0

    Port-Hamiltonian neural networks extended to PDEs recover the Hamiltonian and dissipation of nonlinear string dynamics from data and outperform non-physics-informed baselines.

  54. Scaling Laws for Mixture Pretraining Under Data Constraints

    cs.LG 2026-05 conditional novelty 7.0

    Repetition-aware scaling laws show scarce target data in pretraining mixtures can be repeated 15-20 times optimally, with the best count depending on data size, compute, and model scale.

  55. Spectral Energy Centroid: a Metric for Improving Performance and Analyzing Spectral Bias in Implicit Neural Representations

    cs.LG 2026-05 unverdicted novelty 7.0

    Spectral Energy Centroid is a new metric that quantifies signal frequency and INR spectral bias, supporting better hyperparameter selection and cross-architecture analysis.

  56. Newton methods beyond Hessian Lipschitz continuity: A nonlinear preconditioning approach

    math.OC 2026-05 unverdicted novelty 7.0

    Nonlinear preconditioning extends Newton methods to objectives lacking Hessian Lipschitz continuity by analyzing a transformed mapping under a relaxed smoothness condition, with superlinear convergence and O(ε^{-3/2})...

  57. Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction

    cs.CV 2026-05 unverdicted novelty 7.0

    AmbiSuR adds intrinsic photometric disambiguation and a self-indication module to Gaussian Splatting to resolve ambiguities and improve surface reconstruction accuracy.

  58. SEMIR: Semantic Minor-Induced Representation Learning on Graphs for Visual Segmentation

    cs.CV 2026-05 unverdicted novelty 7.0

    SEMIR replaces dense voxel computation with a learned topology-preserving graph minor that supports exact decoding and GNN-based inference for small-structure segmentation in large medical images.

  59. Neural-Schwarz Tiling for Geometry-Universal PDE Solving at Scale

    cs.LG 2026-05 unverdicted novelty 7.0

    Local neural operators on 3x3x3 patches, composed via Schwarz iteration, solve large-scale nonlinear elasticity on arbitrary geometries without domain-specific retraining.

  60. Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning

    cs.LG 2026-05 unverdicted novelty 7.0

    SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.

Reference graph

Works this paper leans on

27 extracted references · 5 canonical work pages · cited by 741 Pith papers

  1. [1]

    Natural gradient works efficiently in learning

Amari, Shun-Ichi. Natural gradient works efficiently in learning. Neural Computation, 10(2): 251–276, 1998

  2. [2]

    Recent advances in deep learning for speech research at microsoft

    Deng, Li, Li, Jinyu, Huang, Jui-Ting, Yao, Kaisheng, Yu, Dong, Seide, Frank, Seltzer, Michael, Zweig, Geoff, He, Xiaodong, Williams, Jason, et al. Recent advances in deep learning for speech research at microsoft. ICASSP 2013, 2013

  3. [3]

    Adaptive subgradient methods for online learning and stochastic optimization

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12: 2121–2159, 2011

  4. [4]

Generating sequences with recurrent neural networks

    Graves, Alex. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013

  5. [5]

    Speech recognition with deep recurrent neural networks

Graves, Alex, Mohamed, Abdel-rahman, and Hinton, Geoffrey. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 6645–6649. IEEE, 2013

  6. [6]

Reducing the dimensionality of data with neural networks

Hinton, G.E. and Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science, 313(5786): 504–507, 2006

  7. [7]

    Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups

Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6): 82–97, 2012a

  8. [8]

Improving neural networks by preventing co-adaptation of feature detectors

Hinton, Geoffrey E, Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012b

  9. [9]

    Auto-Encoding Variational Bayes

Kingma, Diederik P and Welling, Max. Auto-Encoding Variational Bayes. In the 2nd International Conference on Learning Representations (ICLR), 2013

  10. [10]

    Imagenet classification with deep convolutional neural networks

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012

  11. [11]

    Learning word vectors for sentiment analysis

Maas, Andrew L, Daly, Raymond E, Pham, Peter T, Huang, Dan, Ng, Andrew Y, and Potts, Christopher. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pp. 142–150. Association for Computational Linguistics, 2011

  12. [12]

    Non-asymptotic analysis of stochastic approximation algorithms for machine learning

Moulines, Eric and Bach, Francis R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pp. 451–459, 2011

  13. [13]

    Revisiting Natural Gradient for Deep Networks

    Pascanu, Razvan and Bengio, Yoshua. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013

  14. [14]

    Acceleration of stochastic approximation by averaging

Polyak, Boris T and Juditsky, Anatoli B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4): 838–855, 1992

  15. [15]

    A fast natural newton method

Roux, Nicolas L and Fitzgibbon, Andrew W. A fast natural newton method. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 623–630, 2010

  16. [16]

    Efficient estimations from a slowly convergent robbins-monro process

    Ruppert, David. Efficient estimations from a slowly convergent robbins-monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988

  17. [17]

No more pesky learning rates

    Schaul, Tom, Zhang, Sixin, and LeCun, Yann. No more pesky learning rates. arXiv preprint arXiv:1206.1106, 2012

  18. [18]

    Fast large-scale optimization by unifying stochastic gradient and quasi-newton methods

Sohl-Dickstein, Jascha, Poole, Ben, and Ganguli, Surya. Fast large-scale optimization by unifying stochastic gradient and quasi-newton methods. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 604–612, 2014

  19. [19]

    On the importance of initialization and momentum in deep learning

Sutskever, Ilya, Martens, James, Dahl, George, and Hinton, Geoffrey. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1139–1147, 2013

  20. [20]

Lecture 6.5 - RMSProp, COURSERA: Neural Networks for Machine Learning

Tieleman, T. and Hinton, G. Lecture 6.5 - RMSProp, COURSERA: Neural Networks for Machine Learning. Technical report, 2012

  21. [21]

    Fast dropout training

Wang, Sida and Manning, Christopher. Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 118–126, 2013

  22. [22]

Adadelta: An adaptive learning rate method

    Zeiler, Matthew D. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012

  23. [23]

    Online convex programming and generalized infinitesimal gradient ascent

Zinkevich, Martin. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-03), 2003
