Decoupled weight decay regularization

Ilya Loshchilov, Frank Hutter · 2019

10 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

cs.CL · 2026-05-10 · conditional · novelty 7.0

Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.

LayerNorm Induces Recency Bias in Transformer Decoders

cs.CL · 2025-09-25 · unverdicted · novelty 7.0

Stacked causal self-attention combined with LayerNorm induces recency bias in Transformer decoders, reversing the earlier-token bias seen in attention alone.

Learning to Forget: Continual Learning with Adaptive Weight Decay

cs.LG · 2026-04-29 · unverdicted · novelty 6.0

FADE adapts per-parameter weight decay rates online via approximate meta-gradient descent to improve controlled forgetting over fixed decay in online tracking and streaming classification.

Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models

cs.CV · 2025-11-01 · unverdicted · novelty 6.0

A feed-forward video latent transformer that predicts time-varying 3D Gaussian primitives from one image to produce controllable 4D scenes with appearance, geometry, and motion.

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

cs.CL · 2024-10-23 · conditional · novelty 6.0

Adapting autoregressive models via continual pre-training yields diffusion language models from 127M to 7B parameters that outperform prior diffusion models and compete with their autoregressive counterparts on language, reasoning, and commonsense benchmarks.

Simplified State Space Layers for Sequence Modeling

cs.LG · 2022-08-09 · accept · novelty 6.0

S5 uses a single MIMO state space model with S4-derived initialization to match S4 efficiency and reach 87.4% average accuracy on the Long Range Arena benchmark.

EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

cs.CV · 2026-05-16 · unverdicted · novelty 5.0

EVA01 introduces a Mixture-of-Transformers model that natively adds 3D mesh understanding, generation, and multi-turn editing to MLLMs by decoupling understanding and generation experts with shared global self-attention.

Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models

cs.LG · 2025-07-01 · unverdicted · novelty 5.0

JAM aligns frozen vision and language models via joint autoencoders and multimodal Spread Loss, reliably inducing cross-modal alignment across layer depths, objectives, and model scales.

Enhancing Trust in Large Language Models via Uncertainty-Calibrated Fine-Tuning

cs.CL · 2024-12-03 · unverdicted · novelty 5.0

Uncertainty-aware fine-tuning with a decision-theory-based loss produces better-calibrated uncertainty estimates than standard training on free-form QA tasks.

Revisiting Long-term Time Series Forecasting: An Investigation on Linear Mapping

cs.LG · 2023-05-18 · unverdicted · novelty 5.0

Affine mapping dominates LTSF benchmarks by learning similar input-to-output transition matrices; it captures periodic signals well but struggles with non-periodic signals or periods that vary across channels. Reversible normalization converts trends into periodic-like patterns.

citing papers explorer

Showing 10 of 10 citing papers.

Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models · cs.CL · 2026-05-10 · conditional · none · ref 60
LayerNorm Induces Recency Bias in Transformer Decoders · cs.CL · 2025-09-25 · unverdicted · none · ref 10
Learning to Forget: Continual Learning with Adaptive Weight Decay · cs.LG · 2026-04-29 · unverdicted · none · ref 27
Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models · cs.CV · 2025-11-01 · unverdicted · none · ref 51
Scaling Diffusion Language Models via Adaptation from Autoregressive Models · cs.CL · 2024-10-23 · conditional · none · ref 156
Simplified State Space Layers for Sequence Modeling · cs.LG · 2022-08-09 · accept · none · ref 126
EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers · cs.CV · 2026-05-16 · unverdicted · none · ref 36
Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models · cs.LG · 2025-07-01 · unverdicted · none · ref 39
Enhancing Trust in Large Language Models via Uncertainty-Calibrated Fine-Tuning · cs.CL · 2024-12-03 · unverdicted · none · ref 41
Revisiting Long-term Time Series Forecasting: An Investigation on Linear Mapping · cs.LG · 2023-05-18 · unverdicted · none · ref 17
