Sinks are equivalent to hard attention switches that zero out outputs and are cheaper than diagonal patterns when self-communication is allowed, closing the gap between oversmoothing prevention needs and what sinks provide.
Pointer sentinel mixture models
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 5representative citing papers
OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.
SARQC augments standard PTQ calibration with a saliency-aware regularizer to keep quantized weights closer to original floating-point values, yielding improved perplexity and zero-shot accuracy on dense and MoE LLMs.
Different calibration objectives produce distinct layer pruning patterns in LLMs, while search algorithms converge to similar solutions under a fixed objective.
citing papers explorer
-
Sink vs. diagonal patterns as mechanisms for attention switch and oversmoothing prevention
Sinks are equivalent to hard attention switches that zero out outputs and are cheaper than diagonal patterns when self-communication is allowed, closing the gap between oversmoothing prevention needs and what sinks provide.
-
OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention
OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.
-
Saliency-Aware Regularized Quantization Calibration for Large Language Models
SARQC augments standard PTQ calibration with a saliency-aware regularizer to keep quantized weights closer to original floating-point values, yielding improved perplexity and zero-shot accuracy on dense and MoE LLMs.
-
Rethinking Layer Redundancy in Large Language Models: Calibration Objectives and Search for Depth Pruning
Different calibration objectives produce distinct layer pruning patterns in LLMs, while search algorithms converge to similar solutions under a fixed objective.
- Backdooring Masked Diffusion Language Models