Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
5 Pith papers cite this work.
citation roles: method (2)
citing papers explorer
- Learning to (Learn at Test Time): RNNs with Expressive Hidden States
  TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.
- Continuous Latent Diffusion Language Model
  Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing latent prior modeling as an alternative to token-level autoregressive language model
- PrefixMemory-Tuning: Modernizing Prefix-Tuning by Decoupling the Prefix from Attention
  PrefixMemory-Tuning decouples the prefix from attention to overcome performance limits of traditional prefix-tuning and reaches competitive results with modern PEFT methods on LLM adaptation benchmarks.
- Learning Spatial-Preserving Hierarchical Representations for Digital Pathology
  SPAN is a hierarchical attention framework that constructs multi-scale pyramid representations from single-scale patch inputs for WSI classification and segmentation while preserving spatial relationships.
- H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
  H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.
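
The eviction idea in the H2O summary can be illustrated with a rough sketch. This is not the paper's implementation: the function name `select_kv_to_keep`, the parameters `num_recent` and `num_heavy`, and the use of a plain list of accumulated attention scores as the "heavy-hitter" signal are all assumptions made for illustration. The sketch keeps the most recent cache positions plus the highest-scoring older positions, and treats everything else as evictable.

```python
def select_kv_to_keep(acc_scores, num_recent, num_heavy):
    """Return sorted cache positions to retain (hypothetical helper).

    acc_scores: one accumulated attention score per cached token (assumed signal).
    num_recent: how many of the newest positions are always kept.
    num_heavy:  how many high-scoring older positions are kept in addition.
    """
    n = len(acc_scores)
    recent_start = max(0, n - num_recent)
    # The newest positions are kept regardless of their score.
    recent = list(range(recent_start, n))
    # Among the older positions, keep those with the largest accumulated score.
    older = range(recent_start)
    heavy = sorted(older, key=lambda i: acc_scores[i], reverse=True)[:num_heavy]
    return sorted(heavy + recent)


keep = select_kv_to_keep([0.9, 0.1, 0.7, 0.2, 0.3, 0.4], num_recent=2, num_heavy=2)
print(keep)  # [0, 2, 4, 5]
```

Positions 1 and 3 are evicted here because they are neither recent nor among the two highest-scoring older tokens; the cache size stays bounded by `num_recent + num_heavy`.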