Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter · 2017 · cs.LG · arXiv 1711.05101

Mixed citation behavior. Most common role is method (58%).

1096 Pith papers citing it

Method 58% of classified citations

abstract

L$_2$ regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is \emph{not} the case for adaptive gradient algorithms, such as Adam. While common implementations of these algorithms employ L$_2$ regularization (often calling it "weight decay" in what may be misleading due to the inequivalence we expose), we propose a simple modification to recover the original formulation of weight decay regularization by \emph{decoupling} the weight decay from the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and PyTorch; the complete source code for our experiments is available at https://github.com/loshchil/AdamW-and-SGDW
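The distinction drawn in the abstract can be sketched on a single scalar weight. This is a minimal illustration, not the authors' code: the function names (`sgd_l2_step`, `sgd_decoupled_step`, `adam_step`) and all numeric values are hypothetical, and scaling the decoupled Adam decay by the learning rate is an assumption of this sketch rather than something stated in the abstract.

```python
import math


def sgd_l2_step(w, grad, lr, l2):
    """SGD step with L2 regularization: the penalty gradient l2 * w joins the loss gradient."""
    return w - lr * (grad + l2 * w)


def sgd_decoupled_step(w, grad, lr, wd):
    """SGD step with decoupled weight decay: shrink w by (1 - wd), then take the gradient step."""
    return (1.0 - wd) * w - lr * grad


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              l2=0.0, wd=0.0):
    """One Adam step on a scalar weight.

    l2 > 0: L2 regularization, so the penalty passes through the adaptive scaling.
    wd > 0: decoupled decay, applied outside the adaptive scaling.
    """
    g = grad + l2 * w                      # L2 penalty enters the moment estimates
    m = beta1 * m + (1.0 - beta1) * g      # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g  # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)         # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    # Assumption of this sketch: the decoupled decay is scaled by lr.
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * wd * w
    return w_new, m, v


w0, g0 = 2.0, 0.5  # hypothetical weight and loss gradient

# SGD: an L2 coefficient of wd / lr reproduces decoupled decay wd.
sgd_a = sgd_l2_step(w0, g0, lr=0.1, l2=0.01 / 0.1)
sgd_b = sgd_decoupled_step(w0, g0, lr=0.1, wd=0.01)

# Adam: the same coefficient used as an L2 penalty and as a decoupled decay.
adam_l2, _, _ = adam_step(w0, g0, m=0.0, v=0.0, t=1, l2=0.1)
adam_dec, _, _ = adam_step(w0, g0, m=0.0, v=0.0, t=1, wd=0.1)

print("SGD  L2 vs decoupled:", sgd_a, sgd_b)
print("Adam L2 vs decoupled:", adam_l2, adam_dec)
```

In this toy setting the two SGD variants land on the same weight once the L2 coefficient is rescaled by the learning rate, while the two Adam variants do not. On the first Adam step the normalized update has magnitude close to `lr` whatever the penalty, so the L2 term is largely absorbed by the adaptive scaling, whereas the decoupled term still shrinks the weight in proportion to its size.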

citation-role summary

method: 102 · background: 64 · dataset: 5 · baseline: 3 · other: 3

citation-polarity summary

use method: 102 · background: 57 · unclear: 9 · use dataset: 5 · baseline: 3 · support: 1

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects

cs.CV · 2026-05-27 · conditional · novelty 8.0

Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

cs.CV · 2026-05-13 · unverdicted · novelty 8.0

AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.

Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

cs.GR · 2026-05-13 · unverdicted · novelty 8.0

Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

cs.CV · 2026-05-12 · unverdicted · novelty 8.0

TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

Dissecting Jet-Tagger Through Mechanistic Interpretability

hep-ph · 2026-05-11 · accept · novelty 8.0

A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.

LLM Translation of Compiler Intermediate Representation

cs.PL · 2026-05-07 · unverdicted · novelty 8.0

IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.

Bringing Order to Asynchronous SGD: Towards Optimality under Data-Dependent Delays with Momentum

cs.LG · 2026-05-03 · unverdicted · novelty 8.0 · 2 refs

Momentum-based async SGD achieves optimal convergence rates for data-dependent delays without biasing updates toward simpler samples.

CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models

cs.CV · 2026-05-03 · unverdicted · novelty 8.0

CADFS supplies a large real-world CAD dataset and FeatureScript representation that, after VLM fine-tuning, produces more accurate and feature-rich designs than prior generative CAD systems.

Stability and Generalization in Looped Transformers

cs.LG · 2026-04-16 · unverdicted · novelty 8.0

Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant performs competitively or better.

CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations

cs.LG · 2026-04-14 · unverdicted · novelty 8.0

CLAD is the first deep learning framework for log anomaly detection that operates directly on compressed byte streams using a dilated convolutional encoder, hybrid Transformer-mLSTM, and two-stage training, achieving 0.9909 average F1-score across five datasets.

Rotation Equivariant Mamba for Vision Tasks

cs.CV · 2026-03-10 · unverdicted · novelty 8.0

EQ-VMamba adds rotation-equivariant cross-scan and group Mamba blocks to enforce end-to-end rotation equivariance, yielding better rotation robustness, competitive accuracy, and roughly 50% fewer parameters than non-equivariant baselines across classification, segmentation, and super-resolution.

A document is worth a structured record: Principled inductive bias design for document recognition

cs.CV · 2025-07-11 · unverdicted · novelty 8.0

Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, shape drawings, and mechanical engineering drawings.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

VMamba: Visual State Space Model

cs.CV · 2024-01-18 · conditional · novelty 8.0

VMamba introduces a state-space vision backbone using 2D selective scanning across four routes to achieve linear complexity and strong performance on image tasks.

Progress measures for grokking via mechanistic interpretability

cs.LG · 2023-01-12 · accept · novelty 8.0

Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.

Discovering Latent Knowledge in Language Models Without Supervision

cs.CL · 2022-12-07 · conditional · novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

cs.LG · 2022-09-07 · unverdicted · novelty 8.0

Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.

Decision Transformer: Reinforcement Learning via Sequence Modeling

cs.LG · 2021-06-02 · accept · novelty 8.0

Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.

RoFormer: Enhanced Transformer with Rotary Position Embedding

cs.CL · 2021-04-20 · accept · novelty 8.0

RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

GaussianFusion: Unified 3D Gaussian Representation for Multi-Modal Fusion Perception

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

GaussianFusion presents a 3D Gaussian-based framework that unifies multi-modal features in continuous space for 3D object detection and semantic occupancy, reporting gains over BEVFusion and GaussFormer on nuScenes.

Joint inference of weak lensing convergence map and cosmology with diffusion models

astro-ph.CO · 2026-06-30 · unverdicted · novelty 7.0

A transformer-based diffusion model learns the joint distribution of convergence maps and cosmology from log-normal weak lensing simulations and generates calibrated posterior samples matching MCMC results.

citing papers explorer

Showing 15 of 15 citing papers after filters.

DeVAR: Low-Dose CT Denoising via Visual Autoregressive Modeling

eess.IV · 2026-06-26 · unverdicted · none · ref 20 · internal anchor

DeVAR is the first application of visual autoregressive modeling to low-dose CT denoising, using next-scale token prediction, a residual refiner, and hybrid discrete-continuous decoding to outperform prior methods on two public datasets.

DIPA: Distilled Preconditioned Algorithms for Solving Imaging Inverse Problems

eess.IV · 2026-05-14 · unverdicted · none · ref 50 · internal anchor

DIPA learns preconditioning operators via distillation from a teacher with a better sensing matrix to improve reconstruction quality for the student's physically constrained matrix in imaging inverse problems.

Echo-POSED: Geometric Self-Distillation for Echocardiography Guidance

eess.IV · 2026-05-30 · unverdicted · none · ref 21 · internal anchor

Echo-POSED learns an SO(3)×SO(3) pose representation via self-supervised equivariance to probe motion and invariance to cardiac phase from 2D slices of 3D echocardiography volumes, reporting 8.2° mean angular error in intra-patient guidance simulations.

A Clinically Validated Foundation Model for Comprehensive Lung Pathology Interpretation

eess.IV · 2026-05-25 · unverdicted · none · ref 113 · internal anchor

PulmoFoundation achieves 92.3% average AUC on 32 lung pathology tasks in prospective validation and raises pathologist accuracy from 83.8% to 91.7% in a crossover RCT.

MonoUNet: A Robust Tiny Neural Network for Automated Knee Cartilage Segmentation on Point-of-Care Ultrasound Devices

eess.IV · 2026-04-09 · unverdicted · none · ref 36 · 2 links · internal anchor

MonoUNet is a tiny segmentation network that achieves 92-95% Dice scores on multi-device knee cartilage ultrasound while using 10-700x fewer parameters than prior lightweight models by injecting trainable local phase features.

MC-GenRef: Annotation-free mammography microcalcification segmentation with generative posterior refinement

eess.IV · 2026-04-06 · unverdicted · none · ref 22 · internal anchor

MC-GenRef performs annotation-free microcalcification segmentation via synthetic data from a lightweight image formation model plus test-time generative posterior refinement with a rectified-flow generator, yielding top Dice on INbreast and gains on an external cohort.

Task-Guided Prompting for Unified Remote Sensing Image Restoration

eess.IV · 2026-04-03 · unverdicted · none · ref 54 · internal anchor

TGPNet unifies denoising, cloud removal, shadow removal, deblurring, and SAR despeckling into one model via task-guided prompting and reports state-of-the-art results on a new multi-modal benchmark.

Deep Distillation Gradient Preconditioning for Inverse Problems

eess.IV · 2025-08-06 · unverdicted · none · ref 25 · internal anchor

Knowledge distillation trains a nonlinear gradient preconditioner that improves convergence and reconstruction quality for plug-and-play FISTA on ill-conditioned inverse imaging problems.

Interlaced R2D2 DNN Series for Scalable Non-Cartesian MRI with Sensitivity Self-calibration

eess.IV · 2025-03-12 · unverdicted · none · ref 40 · internal anchor

iR2D2 extends the R2D2 DNN series paradigm with an interlaced dual-series architecture and error-controlled updates to jointly reconstruct MR images and self-calibrate sensitivity maps from undersampled radial k-space data.

A Self-Supervised Learning Framework for Video Encoding Complexity Clustering

eess.IV · 2026-06-28 · unverdicted · none · ref 19 · internal anchor

CECL pretrains video encoders via compression responses for downstream clustering by encoding complexity, claiming gains over SOTA encoders and bitrate/quality savings vs fixed ladders.

Unsupervised Deep Learning for Limited-Angle STEM-EDX Tomography -- Application to 3D Chemical Analysis of Phase-Change Memory Devices

eess.IV · 2026-06-09 · unverdicted · none · ref 43 · internal anchor

Unsupervised multi-channel DIP-TV reconstructs near-isotropic 3D elemental maps from limited-angle EDX tomography data using only EDX signals, applied to GST memory devices in virgin and SET states.

Why Invariance is Not Enough for Biomedical Domain Generalization and How to Fix It

eess.IV · 2026-04-02 · unverdicted · none · ref 45 · internal anchor

MaskGen improves domain generalization for biomedical image segmentation by using source intensities plus domain-stable foundation model representations with minimal added complexity.

Cardiovascular disease classification using radiomics and geometric features from cardiac CT

eess.IV · 2025-06-27 · conditional · none · ref 12 · internal anchor

A pipeline using segmentation, atlas registration, radiomics, and geometric features achieves 87.5% CVD classification accuracy on ASOCA, outperforming direct raw-image classification at 67.5%.

Clinical utility of foundation models in musculoskeletal MRI for biomarker fidelity and predictive outcomes

eess.IV · 2025-01-23 · unverdicted · none · ref 47 · internal anchor

Fine-tuned foundation models produce reliable MSK MRI biomarkers that support workload-reducing triage and calibrated 48-month prediction of knee replacement and incident OA.

Federated Breast Cancer Detection Enhanced by Synthetic Ultrasound Image Augmentation

eess.IV · 2025-06-29 · conditional · none · ref 38 · internal anchor

Balanced synthetic image augmentation via GANs and diffusion models raises average AUC from 0.9206 to 0.9362 for FedAvg and 0.9429 to 0.9574 for FedProx in federated breast ultrasound classification.