Pointer sentinel mixture models
4 Pith papers cite this work.
Citing papers by year: 2026 (4)
citing papers explorer
- Dynamic Chunking for Diffusion Language Models
  DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.
- Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks
  Jordan-RoPE realizes a distance-modulated phase basis via non-semisimple Jordan blocks, generating features such as d e^{iωd} for relative positional encoding.
- Efficient Continual Learning in Language Models via Thalamically Routed Cortical Columns
  TRC² is a brain-inspired decoder-only architecture that localizes fast plasticity and uses thalamic and hippocampal pathways to substantially reduce cumulative forgetting in sequential language model training on streams like C4, WikiText-103, and GSM8K.
- Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs