9th International Conference on Learning Representations
6 Pith papers cite this work.
citing papers explorer
- Navigating Potholes with Geometry-Aware Sharpness Minimization
  LLQR+SAM pairs a slow learned geometry preconditioner with fast SAM perturbations to amplify escape from locally sharp 'potholes' while stabilizing flat basins, producing consistent gains over SAM and LLQR alone.
- MiVE: Multiscale Vision-language features for reference-guided video Editing
  MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer, achieving top human preference in reference-guided video editing.
- Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
  HSTU-based generative recommenders with 1.5 trillion parameters scale as a power law with compute up to GPT-3 scale, outperform baselines by up to 65.8% NDCG, run 5-15x faster than FlashAttention2 on long sequences, and improve online A/B metrics by 12.4%.
- HORST: Composing Optimizer Geometries for Sparse Transformer Training
  HORST uses non-commutative operator composition and a hyperbolic mirror map to combine stability from adaptive optimizers with L1 sparsity bias, outperforming AdamW across sparsity levels on vision and language tasks.
- ERPPO: Entropy Regularization-based Proximal Policy Optimization
  ERPPO adds a DSA-based ambiguity estimator to MAPPO and switches between L1 and L2 entropy regularization to improve exploration and stability in non-stationary multi-dimensional observations.
- Skipping the Zeros in Diffusion Models for Sparse Data Generation
  SED modifies diffusion models to generate only non-zero values in sparse data, preserving sparsity patterns, cutting computation, and matching or beating standard DM performance on benchmarks.