Pointer sentinel mixture models

Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher · 2017

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · dataset: 1

citation-polarity summary

support: 1 · unclear: 1 · use dataset: 1

representative citing papers

Sink vs. diagonal patterns as mechanisms for attention switch and oversmoothing prevention

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Sinks are equivalent to hard attention switches that zero out outputs and are cheaper than diagonal patterns when self-communication is allowed, closing the gap between oversmoothing prevention needs and what sinks provide.

OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.

Saliency-Aware Regularized Quantization Calibration for Large Language Models

cs.AI · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

SARQC augments standard PTQ calibration with a saliency-aware regularizer to keep quantized weights closer to original floating-point values, yielding improved perplexity and zero-shot accuracy on dense and MoE LLMs.

Rethinking Layer Redundancy in Large Language Models: Calibration Objectives and Search for Depth Pruning

cs.LG · 2026-04-27 · unverdicted · novelty 6.0 · 2 refs

Different calibration objectives produce distinct layer pruning patterns in LLMs, while search algorithms converge to similar solutions under a fixed objective.

Backdooring Masked Diffusion Language Models

cs.LG · 2026-05-19

citing papers explorer

Showing 5 of 5 citing papers.

Sink vs. diagonal patterns as mechanisms for attention switch and oversmoothing prevention
cs.LG · 2026-05-08 · unverdicted · none · ref 23
OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention
cs.LG · 2026-05-13 · unverdicted · none · ref 34
Saliency-Aware Regularized Quantization Calibration for Large Language Models
cs.AI · 2026-05-07 · unverdicted · none · ref 46 · 2 links
Rethinking Layer Redundancy in Large Language Models: Calibration Objectives and Search for Depth Pruning
cs.LG · 2026-04-27 · unverdicted · none · ref 11 · 2 links
Backdooring Masked Diffusion Language Models
cs.LG · 2026-05-19 · unreviewed · ref 17
