DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
Decoupled Weight Decay Regularization
Mixed citation behavior. Most common role is method (58%).
abstract
L$_2$ regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is not the case for adaptive gradient algorithms, such as Adam. While common implementations of these algorithms employ L$_2$ regularization (often calling it "weight decay" in what may be misleading due to the inequivalence we expose), we propose a simple modification to recover the original formulation of weight decay regularization by decoupling the weight decay from the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and PyTorch; the complete source code for our experiments is available at https://github.com/loshchil/AdamW-and-SGDW
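The distinction drawn in the abstract can be illustrated with a minimal sketch on a single scalar weight. This is not the authors' code: the function names, the `decay` parameter (treated here as the fraction of the weight removed per step), and the numeric values are assumptions chosen for illustration. It checks that, for plain SGD, an L$_2$ penalty matches weight decay once the decay is rescaled by the learning rate, and that the same two choices give different updates under Adam.

```python
import math

def sgd_l2_step(w, grad, lr, l2):
    # L2 regularization: the penalty gradient l2 * w is added to the loss gradient.
    return w - lr * (grad + l2 * w)

def sgd_decay_step(w, grad, lr, decay):
    # Weight decay: shrink w by the factor (1 - decay), then apply the gradient step.
    return (1.0 - decay) * w - lr * grad

def adam_step(w, grad, state, lr, l2=0.0, decay=0.0,
              beta1=0.9, beta2=0.999, eps=1e-8):
    # With l2 > 0 the penalty passes through Adam's moment estimates.
    # With decay > 0 the shrinkage is applied outside them (decoupled).
    g = grad + l2 * w
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps) - decay * w

def new_state():
    return {"t": 0, "m": 0.0, "v": 0.0}

# Illustrative values only (assumptions, not taken from the paper).
w0, grad, lr, l2 = 2.0, 0.5, 0.1, 0.01

# SGD: the two forms agree when decay = lr * l2.
sgd_l2 = sgd_l2_step(w0, grad, lr, l2)
sgd_wd = sgd_decay_step(w0, grad, lr, decay=lr * l2)

# Adam: the same pairing no longer agrees.
adam_l2 = adam_step(w0, grad, new_state(), lr, l2=l2)
adam_wd = adam_step(w0, grad, new_state(), lr, decay=lr * l2)

print(f"SGD  L2={sgd_l2:.4f}  decay={sgd_wd:.4f}")
print(f"Adam L2={adam_l2:.4f}  decoupled={adam_wd:.4f}")
```

In this single-step example the Adam update with an L$_2$ penalty is almost unaffected by the penalty, because the penalty gradient is divided by the same second-moment estimate as the loss gradient, whereas the decoupled variant shrinks the weight regardless of that normalization.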
representative citing papers
Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.
AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.
IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.
Momentum-based async SGD achieves optimal convergence rates for data-dependent delays without biasing updates toward simpler samples.
CADFS supplies a large real-world CAD dataset and FeatureScript representation that, after VLM fine-tuning, produces more accurate and feature-rich designs than prior generative CAD systems.
Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant performs competitively or better.
CLAD is the first deep learning framework for log anomaly detection that operates directly on compressed byte streams using a dilated convolutional encoder, hybrid Transformer-mLSTM, and two-stage training, achieving 0.9909 average F1-score across five datasets.
EQ-VMamba adds rotation-equivariant cross-scan and group Mamba blocks to enforce end-to-end rotation equivariance, yielding better rotation robustness, competitive accuracy, and roughly 50% fewer parameters than non-equivariant baselines across classification, segmentation, and super-resolution.
Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, shape drawings, and mechanical engineering drawings.
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
VMamba introduces a state-space vision backbone using 2D selective scanning across four routes to achieve linear complexity and strong performance on image tasks.
Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.
An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.
Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.
RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
GaussianFusion presents a 3D Gaussian-based framework that unifies multi-modal features in continuous space for 3D object detection and semantic occupancy, reporting gains over BEVFusion and GaussFormer on nuScenes.
A transformer-based diffusion model learns the joint distribution of convergence maps and cosmology from log-normal weak lensing simulations and generates calibrated posterior samples matching MCMC results.
citing papers explorer
- DeVAR: Low-Dose CT Denoising via Visual Autoregressive Modeling
DeVAR is the first application of visual autoregressive modeling to low-dose CT denoising, using next-scale token prediction, a residual refiner, and hybrid discrete-continuous decoding to outperform prior methods on two public datasets.
- DIPA: Distilled Preconditioned Algorithms for Solving Imaging Inverse Problems
DIPA learns preconditioning operators via distillation from a teacher with a better sensing matrix to improve reconstruction quality for the student's physically constrained matrix in imaging inverse problems.
- Echo-POSED: Geometric Self-Distillation for Echocardiography Guidance
Echo-POSED learns an SO(3)×SO(3) pose representation via self-supervised equivariance to probe motion and invariance to cardiac phase from 2D slices of 3D echocardiography volumes, reporting 8.2° mean angular error in intra-patient guidance simulations.
- A Clinically Validated Foundation Model for Comprehensive Lung Pathology Interpretation
PulmoFoundation achieves 92.3% average AUC on 32 lung pathology tasks in prospective validation and raises pathologist accuracy from 83.8% to 91.7% in a crossover RCT.
- MonoUNet: A Robust Tiny Neural Network for Automated Knee Cartilage Segmentation on Point-of-Care Ultrasound Devices
MonoUNet is a tiny segmentation network that achieves 92-95% Dice scores on multi-device knee cartilage ultrasound while using 10-700x fewer parameters than prior lightweight models by injecting trainable local phase features.
- MC-GenRef: Annotation-free mammography microcalcification segmentation with generative posterior refinement
MC-GenRef performs annotation-free microcalcification segmentation via synthetic data from a lightweight image formation model plus test-time generative posterior refinement with a rectified-flow generator, yielding top Dice on INbreast and gains on an external cohort.
- Task-Guided Prompting for Unified Remote Sensing Image Restoration
TGPNet unifies denoising, cloud removal, shadow removal, deblurring, and SAR despeckling into one model via task-guided prompting and reports state-of-the-art results on a new multi-modal benchmark.
- Deep Distillation Gradient Preconditioning for Inverse Problems
Knowledge distillation trains a nonlinear gradient preconditioner that improves convergence and reconstruction quality for plug-and-play FISTA on ill-conditioned inverse imaging problems.
- Interlaced R2D2 DNN Series for Scalable Non-Cartesian MRI with Sensitivity Self-calibration
iR2D2 extends the R2D2 DNN series paradigm with an interlaced dual-series architecture and error-controlled updates to jointly reconstruct MR images and self-calibrate sensitivity maps from undersampled radial k-space data.
- A Self-Supervised Learning Framework for Video Encoding Complexity Clustering
CECL pretrains video encoders via compression responses for downstream clustering by encoding complexity, claiming gains over SOTA encoders and bitrate/quality savings vs fixed ladders.
- Unsupervised Deep Learning for Limited-Angle STEM-EDX Tomography -- Application to 3D Chemical Analysis of Phase-Change Memory Devices
Unsupervised multi-channel DIP-TV reconstructs near-isotropic 3D elemental maps from limited-angle EDX tomography data using only EDX signals, applied to GST memory devices in virgin and SET states.
- Why Invariance is Not Enough for Biomedical Domain Generalization and How to Fix It
MaskGen improves domain generalization for biomedical image segmentation by using source intensities plus domain-stable foundation model representations with minimal added complexity.
- Cardiovascular disease classification using radiomics and geometric features from cardiac CT
A pipeline using segmentation, atlas registration, radiomics, and geometric features achieves 87.5% CVD classification accuracy on ASOCA, outperforming direct raw-image classification at 67.5%.
- Clinical utility of foundation models in musculoskeletal MRI for biomarker fidelity and predictive outcomes
Fine-tuned foundation models produce reliable MSK MRI biomarkers that support workload-reducing triage and calibrated 48-month prediction of knee replacement and incident OA.
- Federated Breast Cancer Detection Enhanced by Synthetic Ultrasound Image Augmentation
Balanced synthetic image augmentation via GANs and diffusion models raises average AUC from 0.9206 to 0.9362 for FedAvg and 0.9429 to 0.9574 for FedProx in federated breast ultrasound classification.