Stacked causal self-attention combined with LayerNorm induces recency bias in Transformer decoders, reversing the earlier-token bias seen in attention alone.
Roformer: Enhanced transformer with rotary position embedding
6 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
A Markov category framework for language models provides an information-theoretic rationale for speculative decoding and shows that a quadratic surrogate to negative log-likelihood induces generalized CCA alignment in linear-softmax heads after normalization.
Small cortical patches multiplex phonetic, syllabic, and lexical representations during speech production via dynamic temporal coding.
Rolling Forcing generates multi-minute videos in real time by jointly denoising frames at increasing noise levels, anchoring attention to early frames, and using windowed distillation to limit error accumulation.
Attention sinks emerge in language models from softmax-induced token dependence on attention scores and do not appear when using sigmoid attention without normalization in models up to 1B parameters.
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
citing papers explorer
-
LayerNorm Induces Recency Bias in Transformer Decoders
Stacked causal self-attention combined with LayerNorm induces recency bias in Transformer decoders, reversing the earlier-token bias seen in attention alone.
-
A Markov Categorical Framework for Language Modeling
A Markov category framework for language models provides an information-theoretic rationale for speculative decoding and shows that a quadratic surrogate to negative log-likelihood induces generalized CCA alignment in linear-softmax heads after normalization.
-
Temporal structure of the language hierarchy within small cortical patches
Small cortical patches multiplex phonetic, syllabic, and lexical representations during speech production via dynamic temporal coding.
-
Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
Rolling Forcing generates multi-minute videos in real time by jointly denoising frames at increasing noise levels, anchoring attention to early frames, and using windowed distillation to limit error accumulation.
-
When Attention Sink Emerges in Language Models: An Empirical View
Attention sinks emerge in language models from softmax-induced token dependence on attention scores and do not appear when using sigmoid attention without normalization in models up to 1B parameters.
-
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.