
Super hub · Mixed citations

Decoupled Weight Decay Regularization

Mixed citation behavior. Most common role is method (57%).

475 Pith papers citing it
Method: 57% of classified citations
abstract

L$_2$ regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is \emph{not} the case for adaptive gradient algorithms, such as Adam. While common implementations of these algorithms employ L$_2$ regularization (often calling it "weight decay" in what may be misleading due to the inequivalence we expose), we propose a simple modification to recover the original formulation of weight decay regularization by \emph{decoupling} the weight decay from the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and PyTorch; the complete source code for our experiments is available at https://github.com/loshchil/AdamW-and-SGDW
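As a rough illustration of the decoupling described in the abstract, the following is a minimal NumPy sketch contrasting Adam with L$_2$ regularization against an AdamW-style decoupled update. It is not the authors' reference implementation: the hyperparameter names and the placement of the learning-rate factor on the decay term follow the common PyTorch-style convention rather than the paper's schedule-multiplier formulation.

```python
import numpy as np

def adam_l2_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=1e-2):
    """Adam with L2 regularization: the decay term is folded into the gradient,
    so it is also rescaled by the adaptive denominator sqrt(v_hat) + eps."""
    grad = grad + weight_decay * theta  # coupled: decay enters the moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """Decoupled weight decay (AdamW-style): the decay is applied directly to
    the weights, outside the adaptive step."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```

The key difference is that in the decoupled form the decay applied to a weight never passes through the adaptive denominator, so weights with large accumulated squared gradients are decayed just as strongly as the rest; in the paper's own formulation the decay term is scaled only by the schedule multiplier, which is what fully decouples its optimal setting from the learning rate.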

hub tools

citation-role summary

method 4 · background 3

citation-polarity summary

claims ledger


authors

co-cited works

representative citing papers

Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

cs.GR · 2026-05-13 · unverdicted · novelty 8.0

Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.

Online Learning-to-Defer with Varying Experts

stat.ML · 2026-05-12 · unverdicted · novelty 8.0

Presents the first online learning-to-defer algorithm for multiclass classification with varying experts, with regret bounds of O((n + n_e) T^{2/3}) in general and O((n + n_e) √T) under a low-noise condition.

Dissecting Jet-Tagger Through Mechanistic Interpretability

hep-ph · 2026-05-11 · accept · novelty 8.0

A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.

Stability and Generalization in Looped Transformers

cs.LG · 2026-04-16 · unverdicted · novelty 8.0

Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant performs competitively or better.

CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations

cs.LG · 2026-04-14 · unverdicted · novelty 8.0

CLAD is the first deep learning framework for log anomaly detection that operates directly on compressed byte streams using a dilated convolutional encoder, hybrid Transformer-mLSTM, and two-stage training, achieving 0.9909 average F1-score across five datasets.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs such as LLaMA3 8B across a range of tasks and surpasses GPT-4o on a reversal poem completion task.

RoFormer: Enhanced Transformer with Rotary Position Embedding

cs.CL · 2021-04-20 · accept · novelty 8.0

RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.
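As an illustrative aside for this entry, here is a minimal NumPy sketch of the rotary idea that summary describes: pairs of query/key dimensions are rotated by position-dependent angles, so attention scores depend only on relative offsets. This is a sketch under common conventions (half-split dimension pairing, base 10000), not necessarily RoFormer's exact formulation, which pairs adjacent dimensions.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Rotate pairs of feature dimensions of x by position-dependent angles.
    x: array of shape (seq_len, dim), dim even (a sequence of queries or keys)."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-2.0 * np.arange(half) / dim)          # per-pair frequency theta_i
    angles = np.arange(seq_len)[:, None] * freqs[None, :]   # angle m * theta_i, shape (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Apply a 2-D rotation to each (x1_i, x2_i) pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Because position m is rotated by m * theta_i, the dot product between a rotated
# query at position m and a rotated key at position n depends only on m - n.
q, k = np.random.randn(8, 16), np.random.randn(8, 16)
scores = rotary_embed(q) @ rotary_embed(k).T  # attention logits with relative-position structure
```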

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

MeMo: Memory as a Model

cs.CL · 2026-05-14 · unverdicted · novelty 7.0

MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to retrieval noise.

ASH: Agents that Self-Hone via Embodied Learning

cs.AI · 2026-05-14 · unverdicted · novelty 7.0

ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.

citing papers explorer
