Decoupled weight decay regularization

10 Pith papers cite this work.

representative citing papers

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

cs.CL · 2024-10-23 · conditional · novelty 6.0

Adapting autoregressive models via continual pre-training yields diffusion language models from 127M to 7B parameters that outperform prior diffusion models and compete with their autoregressive counterparts on language, reasoning, and commonsense benchmarks.

Simplified State Space Layers for Sequence Modeling

cs.LG · 2022-08-09 · accept · novelty 6.0

S5 uses a single MIMO state space model with S4-derived initialization to match S4 efficiency and reach 87.4% average accuracy on the Long Range Arena benchmark.
