Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.
mega hub Mixed citations
Decoupled Weight Decay Regularization
Mixed citation behavior. Most common role is method (58%).
abstract
L$_2$ regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is \emph{not} the case for adaptive gradient algorithms, such as Adam. While common implementations of these algorithms employ L$_2$ regularization (often calling it "weight decay" in what may be misleading due to the inequivalence we expose), we propose a simple modification to recover the original formulation of weight decay regularization by \emph{decoupling} the weight decay from the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and PyTorch; the complete source code for our experiments is available at https://github.com/loshchil/AdamW-and-SGDW
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract L$_2$ regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is \emph{not} the case for adaptive gradient algorithms, such as Adam. While common implementations of these algorithms employ L$_2$ regularization (often calling it "weight decay" in what may be misleading due to the inequivalence we expose), we propose a simple modification to recover the original formulation of weight decay regularization by \emph{decoupling} the weight decay from the optimization steps taken w.r.t. the
authors
mega hub controls
Recognition alignment
counterfactual ablation
co-cited works
representative citing papers
AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.
IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.
Momentum-based async SGD achieves optimal convergence rates for data-dependent delays without biasing updates toward simpler samples.
CADFS supplies a large real-world CAD dataset and FeatureScript representation that, after VLM fine-tuning, produces more accurate and feature-rich designs than prior generative CAD systems.
Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant performs competitively or better.
CLAD is the first deep learning framework for log anomaly detection that operates directly on compressed byte streams using a dilated convolutional encoder, hybrid Transformer-mLSTM, and two-stage training, achieving 0.9909 average F1-score across five datasets.
EQ-VMamba adds rotation-equivariant cross-scan and group Mamba blocks to enforce end-to-end rotation equivariance, yielding better rotation robustness, competitive accuracy, and roughly 50% fewer parameters than non-equivariant baselines across classification, segmentation, and super-resolution.
Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, shape drawings, and mechanical engineering drawings.
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
VMamba introduces a state-space vision backbone using 2D selective scanning across four routes to achieve linear complexity and strong performance on image tasks.
Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.
An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.
Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.
RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
GaussianFusion presents a 3D Gaussian-based framework that unifies multi-modal features in continuous space for 3D object detection and semantic occupancy, reporting gains over BEVFusion and GaussFormer on nuScenes.
A transformer-based diffusion model learns the joint distribution of convergence maps and cosmology from log-normal weak lensing simulations and generates calibrated posterior samples matching MCMC results.
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
citing papers explorer
-
OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling
OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training loss than Muon+Moonlight and AdamW.
-
Dual-stream Spatio-Temporal GCN-Transformer Network for 3D Human Pose Estimation
MixTGFormer reports state-of-the-art 3D pose estimation errors of 37.6 mm on Human3.6M and 15.7 mm on MPI-INF-3DHP by using parallel GCN-Transformer streams with SE layers for local-global feature fusion.
-
SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning
SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.