Pointer Sentinel Mixture Models
Pith reviewed 2026-05-11 04:07 UTC · model grok-4.3
The pith
A pointer sentinel mixture model lets neural language models copy a word from recent context or generate one from a softmax classifier.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The pointer sentinel mixture architecture enables a neural sequence model to either reproduce a word from the recent context using a pointer or produce a word from a standard softmax classifier. The pointer sentinel-LSTM variant sets a new state of the art on the Penn Treebank dataset with a perplexity of 70.9 while requiring substantially fewer parameters than a conventional softmax-based LSTM.
What carries the argument
The pointer sentinel mixture model, which uses a sentinel mechanism to choose between pointing to a word in the recent context and generating from the softmax classifier.
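To make the mechanism concrete, the following is a minimal sketch of how such a mixture could combine the two components, assuming the sentinel gate is the sentinel's share of a joint softmax over the context window; the function and variable names are illustrative, not the authors' code.

```python
# Minimal sketch of a pointer sentinel mixture step (illustrative, not the paper's code).
import numpy as np

def pointer_sentinel_mix(ptr_scores, sentinel_score, vocab_logits, window_ids, vocab_size):
    """Mix pointer and softmax components into one next-word distribution.

    ptr_scores     : (L,) unnormalized scores for each position in the context window
    sentinel_score : scalar score for falling back to the softmax classifier
    vocab_logits   : (V,) logits from the standard softmax classifier
    window_ids     : (L,) vocabulary ids of the words in the context window
    """
    # Joint softmax over window positions plus the sentinel; the sentinel's
    # probability mass g is the gate handed to the vocabulary softmax.
    joint = np.concatenate([ptr_scores, [sentinel_score]])
    joint = np.exp(joint - joint.max())
    joint /= joint.sum()
    ptr_probs, g = joint[:-1], joint[-1]

    p_vocab = np.exp(vocab_logits - vocab_logits.max())
    p_vocab /= p_vocab.sum()

    # Scatter pointer mass onto the vocabulary ids it points at, then mix.
    p_ptr = np.zeros(vocab_size)
    np.add.at(p_ptr, window_ids, ptr_probs)
    return g * p_vocab + p_ptr  # total mass: g + (1 - g) = 1

# Toy call with illustrative sizes.
rng = np.random.default_rng(0)
p = pointer_sentinel_mix(rng.normal(size=100), 0.0, rng.normal(size=10_000),
                         rng.integers(0, 10_000, size=100), vocab_size=10_000)
assert abs(p.sum() - 1.0) < 1e-9
```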
Load-bearing premise
The pointer mechanism can reliably select the correct word from recent context, without introducing pointing errors frequent enough to degrade overall perplexity.
What would settle it
Evidence that the model frequently points to the wrong recent word on a held-out test set, or that it fails to improve perplexity on a corpus with many rare words that appear unambiguously in the recent context.
Original abstract
Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Treebank (70.9 perplexity) while using far fewer parameters than a standard softmax LSTM. In order to evaluate how well language models can exploit longer contexts and deal with more realistic vocabularies and larger corpora we also introduce the freely available WikiText corpus.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the pointer sentinel mixture architecture for neural sequence models. This model can either copy a word from recent context via a pointer mechanism or generate one from a standard softmax classifier. The pointer sentinel-LSTM variant is reported to achieve state-of-the-art language modeling performance on the Penn Treebank benchmark (70.9 perplexity) while using substantially fewer parameters than a comparable softmax LSTM. The paper also introduces the WikiText corpus to support evaluation on longer contexts and larger vocabularies.
Significance. If the performance claims hold, the work demonstrates that a lightweight mixture of copying and generation can improve perplexity and parameter efficiency on standard language modeling benchmarks. The introduction of the freely available WikiText corpus provides a useful resource for the community to study realistic vocabularies and longer-range dependencies. The approach directly targets the difficulty softmax-based models have with rare words when context is unambiguous.
major comments (2)
- [§4 (Experimental Results), Table 1] The headline claim of 70.9 perplexity on PTB with fewer parameters than a standard softmax LSTM is load-bearing for the paper's contribution, yet the manuscript provides no ablation or error analysis quantifying pointer accuracy (e.g., fraction of tokens correctly copied by the pointer versus fallback to softmax). Without this, it is impossible to confirm that pointing errors are not being compensated by the softmax component, which would undermine the claimed efficiency advantage. (A sketch of such a diagnostic follows these comments.)
- [§3.2 (Pointer Sentinel Mixture)] The sentinel gate is presented as deciding between pointer and softmax, but the training objective and inference procedure for the mixture are not shown to guarantee that the pointer component surfaces the correct token at a rate sufficient to explain the reported perplexity reduction. A concrete test (e.g., oracle pointer accuracy on the test set) is needed to secure the central claim.
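As one way to carry out the ablation requested in the first comment, here is a hedged sketch (illustrative names and gate threshold, not the authors' evaluation code) of a per-token report of how often the mixture leans on the pointer and how often that pointer choice matches the target:

```python
# Illustrative diagnostic: pointer usage rate and pointer accuracy per token.
import numpy as np

def pointer_usage_report(gate, ptr_argmax, targets, threshold=0.5):
    """gate: (N,) sentinel mass given to the softmax; ptr_argmax, targets: (N,) token ids."""
    gate = np.asarray(gate, dtype=float)
    uses_pointer = gate < threshold                     # most mass went to the pointer
    correct_copy = np.asarray(ptr_argmax) == np.asarray(targets)
    return {
        "pointer_selected": float(uses_pointer.mean()),
        "pointer_accuracy_when_selected": float(correct_copy[uses_pointer].mean())
        if uses_pointer.any() else float("nan"),
    }

# Toy example; real use would log gates and pointer argmaxes over the PTB test stream.
print(pointer_usage_report(gate=[0.9, 0.2, 0.1], ptr_argmax=[5, 7, 7], targets=[5, 7, 2]))
```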
minor comments (2)
- [§4.3] The WikiText corpus is introduced but its construction details (tokenization, preprocessing, train/valid/test splits) are only sketched; a short appendix or subsection with exact statistics and download instructions would improve reproducibility.
- [§3.2] Notation for the sentinel vector and attention scores in Eq. (3)–(5) is introduced without an explicit statement of how the final output distribution is normalized when the pointer and softmax are mixed; a hedged sketch of one consistent normalization follows these comments.
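For reference, one normalization consistent with the abstract's description (a hedged reconstruction, not a verbatim restatement of the paper's Eq. (3)–(5)): if the sentinel gate $g$ is the sentinel's share of a joint softmax over the window positions and the sentinel, then

```latex
p(y \mid x) \;=\; g\, p_{\mathrm{vocab}}(y \mid x) \;+\; \sum_{i \in I(y,\,x)} a_i,
\qquad\text{with}\qquad g + \sum_{i=1}^{L} a_i = 1,
```

where $I(y,x)$ indexes the window positions holding word $y$ and $a_i$ are the pointer attention weights. Summing over the vocabulary gives $g + (1-g) = 1$, so under this reading no extra renormalization step is required.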
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. The comments highlight important aspects of validating the pointer mechanism's contribution, and we address each point below with plans for revision.
Point-by-point responses
-
Referee: [§4 (Experimental Results), Table 1] The headline claim of 70.9 perplexity on PTB with fewer parameters than a standard softmax LSTM is load-bearing for the paper's contribution, yet the manuscript provides no ablation or error analysis quantifying pointer accuracy (e.g., fraction of tokens correctly copied by the pointer versus fallback to softmax). Without this, it is impossible to confirm that pointing errors are not being compensated by the softmax component, which would undermine the claimed efficiency advantage.
Authors: We agree that quantifying the pointer's accuracy and usage frequency would strengthen the claims. The original experiments emphasize end-to-end perplexity and parameter count, but we will add an ablation in the revised manuscript reporting the fraction of tokens for which the pointer is selected at inference time, along with the pointer's per-token accuracy on the PTB test set. This will clarify the extent to which the mixture relies on copying versus softmax generation. revision: partial
-
Referee: [§3.2 (Pointer Sentinel Mixture)] The sentinel gate is presented as deciding between pointer and softmax, but the training objective and inference procedure for the mixture are not shown to guarantee that the pointer component surfaces the correct token at a rate sufficient to explain the reported perplexity reduction. A concrete test (e.g., oracle pointer accuracy on the test set) is needed to secure the central claim.
Authors: The sentinel is trained jointly via the mixture loss, which directly optimizes the decision between components. To provide the requested concrete validation, we will include an oracle analysis in the revision: we will report the accuracy of an ideal pointer that always copies the correct token when it appears in the context window, and compare it to the learned sentinel's selection rate. This will bound the contribution of the pointer mechanism to the observed perplexity improvement. revision: partial
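A hedged sketch of the oracle diagnostic proposed above, assuming a fixed context window of `window` tokens; the code is illustrative and not taken from the paper:

```python
# Oracle pointer coverage: upper bound on what a perfect pointer could copy.
def oracle_pointer_coverage(token_ids, window=100):
    """Fraction of tokens whose target occurs among the previous `window` tokens."""
    hits = 0
    for t in range(1, len(token_ids)):
        lo = max(0, t - window)
        if token_ids[t] in token_ids[lo:t]:
            hits += 1
    return hits / max(1, len(token_ids) - 1)

# Toy id sequence; real use would run over the PTB test stream.
print(oracle_pointer_coverage([3, 7, 3, 9, 7, 7, 2], window=3))  # -> 0.5
```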
Circularity Check
No significant circularity; results are empirical measurements on external benchmarks
Full rationale
The paper proposes a pointer-sentinel mixture architecture for language modeling, defines it via explicit equations for the pointer attention, sentinel gate, and mixture, then trains the model and reports perplexity on the fixed Penn Treebank benchmark. No step in the architecture or evaluation reduces by construction to its own inputs: the 70.9 perplexity figure is measured on held-out data rather than being a fitted parameter renamed as a prediction, and no uniqueness theorem, self-citation chain, or ansatz is invoked to force the result. The chain from model definition to reported result is checked against external evaluation rather than its own outputs, consistent with standard empirical modeling papers.
Forward citations
Cited by 60 Pith papers
-
HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts
HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-wei...
-
Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters
Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.
-
HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.
-
Learning the Signature of Memorization in Autoregressive Language Models
A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.
-
Editing Models with Task Arithmetic
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
-
The Expressivity Boundary of Probabilistic Circuits: A Comparison with Large Language Models
Probabilistic circuits have an output bottleneck with convex probability combinations and a context bottleneck limited to fixed vtree-aligned partitions, making them less expressive than transformers for language data...
-
BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization
BCJR-QAT makes trellis quantization differentiable via BCJR soft decoding at finite temperature, allowing QAT to improve 2-bit LLM perplexity over PTQ with a fused GPU kernel and a drift-budget escape condition.
-
Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases
ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.
-
Guidance Is Not a Hyperparameter: Learning Dynamic Control in Diffusion Language Models
Adaptive guidance trajectories learned via PPO outperform fixed-scale CFG on controllability-quality balance in three controlled NLP generation tasks with discrete diffusion models.
-
Why Geometric Continuity Emerges in Deep Neural Networks: Residual Connections and Rotational Symmetry Breaking
Residual connections align cross-layer gradients while symmetry-breaking activations prevent rotational drift, causing principal singular vectors of adjacent layers to align.
-
SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
SplitZip is a new GPU-friendly lossless compressor for KV cache tensors that exploits exponent redundancy to achieve over 600 GB/s compression throughput and up to 1.32x faster transfers in disaggregated LLM serving.
-
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...
-
BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs
BWLA is the first post-training quantization method for LLMs that achieves 1-bit weights paired with low-bit activations such as 6 bits, using OKT to reshape weights and suppress activation tails plus PSP for low-rank...
-
DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures
DEFault++ delivers automated hierarchical fault detection, categorization into 12 transformer-specific types, and root-cause diagnosis among 45 mechanisms on a new benchmark of 3,739 mutated instances, with AUROC >0.9...
-
The Safety-Aware Denoiser for Text Diffusion Models
SAD modifies the denoising process in text diffusion models to enforce safety constraints at inference time, reducing unsafe generations while preserving quality and diversity.
-
Supernodes and Halos: Loss-Critical Hubs in LLM Feed-Forward Layers
In LLM feed-forward networks, the top 1% of channels per layer carry a median 58.7% of loss sensitivity, forming supernodes whose protection enables effective 50% sparsity pruning with much lower perplexity than baselines.
-
Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales
High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perp...
-
Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors
NodePFN pre-trains on synthetic graphs with controllable homophily and causal feature-label models to achieve 71.27 average accuracy on 23 node classification benchmarks without graph-specific training.
-
Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality
Routing topology in sparse Mixture-of-Experts models does not determine asymptotic language modeling perplexity; multiple variants including cosine-similarity routing achieve statistically equivalent performance.
-
Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment
Lesioning a shared core in multilingual LLMs drops whole-brain fMRI encoding correlation by 60.32%, while language-specific lesions selectively weaken predictions only for the matched native language.
-
From Characterization to Microarchitecture: Designing an Elegant and Reliable BFP-Based NPU
A BFP NPU microarchitecture using row/column blocking and per-path protections achieves near-DMR reliability at 3.55% geometric mean performance overhead and under 2% hardware cost.
-
SHIELD: A Segmented Hierarchical Memory Architecture for Energy-Efficient LLM Inference on Edge NPUs
SHIELD reduces eDRAM refresh energy by 35% for LLM inference on edge NPUs by isolating sign/exponent from mantissa bits, disabling refresh on transient QO mantissas, and relaxing it on persistent KV mantissas while ke...
-
Gradient Boosting within a Single Attention Layer
Gradient-boosted attention applies a corrective second attention pass within a single layer, mapping to Friedman's gradient boosting and improving perplexity by 5.6-6.0% on WikiText-103 and OpenWebText subsets over st...
-
A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
-
Chronos: Learning the Language of Time Series
Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.
-
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
GPTQ quantizes 175B-parameter GPT models to 3-4 bits per weight in one shot using approximate second-order information, achieving negligible accuracy degradation and 3-4x inference speedups.
-
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
-
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.
-
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
-
Semantic Smoothing for Language Models via Distribution Estimation and Embeddings
Semantic smoothing formulates next-word distribution estimation under KL loss with embedding-based KL-proximity side information, yielding an interpolation estimator with worst-case risk O(min{Δ, d/n}) that empiricall...
-
Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.
-
Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model
CAKI generates class-specific prompts from few-shot samples of the same class, stores them in a knowledge bank, and uses query-key matching to inject relevant class knowledge into test instance predictions for improve...
-
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
OSAQ uses the low-rank structure of the Hessian to construct a closed-form additive weight transformation that suppresses outliers without changing task loss, enabling better low-bit LLM quantization.
-
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.
-
FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
FAAST analytically compiles labeled examples into fast weights via a single forward pass, matching backprop adaptation performance with over 90% less time and up to 95% less memory than memory-based methods.
-
DiCLIP: Diffusion Model Enhances CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation
DiCLIP uses diffusion-based visual correlation enhancement and text semantic augmentation to improve CLIP-generated class activation maps for weakly supervised semantic segmentation, outperforming prior methods on PAS...
-
HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
HeadQ reduces 84-94% of excess perplexity in 2-bit key quantization by adding low-rank logit corrections in a calibration-learned query basis, with further gains from an A^2-weighted value policy.
-
Context-Aware Wireless Token Communication via Joint Token Masking and Detection
A joint token masking and detection scheme with masked language models improves token reconstruction over noisy wireless channels by up to 1.77x on Europarl and 1.63x on WikiText-103 compared to conventional methods.
-
SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
SplitZip delivers a GPU-friendly lossless KV-cache compressor using an offline top-16 exponent codebook plus escape stream, achieving 613 GB/s compression and 2182 GB/s decompression throughput with up to 1.32x end-to...
-
CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs
CoQuant selects optimal high-precision subspaces for mixed-precision LLM quantization via a closed-form weighted PCA that balances weight and activation covariances derived from expected output error.
-
FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression
FASQ delivers calibration-free LLM compression with continuous size trade-offs via product quantization and custom CUDA kernels that accelerate decode beyond FP16 speeds on consumer hardware.
-
Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale
Evidence for cross-modal representational convergence weakens substantially at scale and in realistic many-to-many settings, indicating models learn rich but distinct representations.
-
CLion: Efficient Cautious Lion Optimizer with Enhanced Generalization
CLion achieves O(1/N) generalization error and O(√d / T^{1/4}) convergence for nonconvex stochastic optimization, improving on Lion's O(1/(N τ^T)) bound.
-
Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
-
Quantization Dominates Rank Reduction for KV-Cache Compression
Quantization of the KV cache beats rank reduction for matched storage budgets by 4-364 PPL, because dimension removal can flip attention token selection under softmax while bounded quantization noise usually preserves...
-
Position-Agnostic Pre-Projection for Transformer Attention: Nonlinear Feature Construction and Content Skip Before Q/K/V
A position-agnostic nonlinear pre-projection MLP plus content skip connection in transformer attention improves LAMBADA accuracy by 40.6% and reduces perplexity by 39% on 160M-scale models.
-
LLM-PRISM: Characterizing Silent Data Corruption from Permanent GPU Faults in LLM Training
LLMs resist low-frequency permanent GPU faults but certain datapaths and precision formats trigger catastrophic training divergence even at moderate fault rates.
-
A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need
Frozen random backbones with low-rank LoRA adapters recover 96-100% of fully trained performance on diverse architectures while training only 0.5-40% of parameters.
-
Rethinking Residual Errors in Compensation-based LLM Quantization
Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.
-
SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models
SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.
-
Linformer: Self-Attention with Linear Complexity
Linformer approximates self-attention with a low-rank projection to achieve O(n) time and space complexity while matching Transformer accuracy on standard NLP tasks.
-
Emergent Semantic Role Understanding in Language Models
Semantic role understanding partially emerges during language model pre-training, with linear probes on frozen representations achieving substantial performance that improves with scale but does not match fine-tuned m...
-
mHC-SSM: Manifold-Constrained Hyper-Connections for State Space Language Models with Stream-Specialized Adapters
Manifold-constrained multi-stream mixing plus per-stream adapters improves SSM language model validation loss from 6.3507 to 6.1353 and perplexity from 572.91 to 461.88 on WikiText-2.
-
Adaptive Memory Decay for Log-Linear Attention
Making memory decay input-dependent via a lightweight MLP improves log-linear attention performance on associative recall, selective copying, and language modeling, especially for long sequences.
-
TIDE: Every Layer Knows the Token Beneath the Context
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
-
FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings ver...
-
Forge-UGC: FX optimization and register-graph engine for universal graph compiler
Forge-UGC delivers a hardware-agnostic four-phase compiler for transformers that reduces compilation time by 6.9-9.2x, inference latency by 18-36%, and energy use by 30-41% on NPU hardware compared with existing frameworks.
-
MUXQ: Mixed-to-Uniform Precision MatriX Quantization via Low-Rank Outlier Decomposition
MUXQ uses low-rank outlier decomposition to redistribute activation outliers, allowing mixed-to-uniform INT8 quantization of LLMs with lower perplexity than naive methods on GPT-2 models.
-
Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering
Entity-based chunk filtering reduces RAG vector index size by 25-36% with retrieval quality near baseline levels.
-
Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder
A reduced attention-only decoder shows diminishing returns in dataset scaling, reaching 90% of full accuracy with only 30% of the data.
Reference graph
Works this paper leans on
-
[1]
Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks
Adi, Yossi, Kermany, Einat, Belinkov, Yonatan, Lavi, Ofer, and Goldberg, Yoav. Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks. arXiv preprint arXiv:1608.04207,
-
[2]
A Neural Knowledge Language Model
Ahn, Sungjin, Choi, Heeyoul, Pärnamaa, Tanel, and Bengio, Yoshua. A Neural Knowledge Language Model. CoRR, abs/1608.00318,
-
[3]
One billion word benchmark for measuring progress in statistical language modeling
Chelba, Ciprian, Mikolov, Tomas, Schuster, Mike, Ge, Qi, Brants, Thorsten, Koehn, Phillipp, and Robinson, Tony. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. arXiv preprint arXiv:1312.3005,
-
[4]
Long Short-Term Memory-Networks for Machine Reading
Cheng, Jianpeng, Dong, Li, and Lapata, Mirella. Long Short-Term Memory-Networks for Machine Reading. CoRR, abs/1601.06733,
-
[5]
A Theoretically Grounded Application of Dropout in Recurrent Neural Networks
Gal, Yarin. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. arXiv preprint arXiv:1512.05287,
- [6]
-
[7]
Pointing the Unknown Words
Gülçehre, Çağlar, Ahn, Sungjin, Nallapati, Ramesh, Zhou, Bowen, and Bengio, Yoshua. Pointing the Unknown Words. arXiv preprint arXiv:1603.08148,
- [8]
-
[9]
Character-aware neural language models
Kim, Yoon, Jernite, Yacine, Sontag, David, and Rush, Alexander M. Character-aware neural language models. CoRR, abs/1508.06615,
-
[10]
Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations
Krueger, David, Maharaj, Tegan, Kramár, János, Pezeshki, Mohammad, Ballas, Nicolas, Ke, Nan Rosemary, Goyal, Anirudh, Bengio, Yoshua, Larochelle, Hugo, Courville, Aaron, et al. Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. arXiv preprint arXiv:1606.01305,
-
[11]
Latent Predictor Networks for Code Generation
Ling, Wang, Grefenstette, Edward, Hermann, Karl Moritz, Kociský, Tomáš, Senior, Andrew, Wang, Fumin, and Blunsom, Phil. Latent Predictor Networks for Code Generation. CoRR, abs/1603.06744,
-
[12]
How to Construct Deep Recurrent Neural Networks, April 2014
Pascanu, Razvan, Gülçehre, Çağlar, Cho, Kyunghyun, and Bengio, Yoshua. How to Construct Deep Recurrent Neural Networks. CoRR, abs/1312.6026, 2013a. Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. In ICML, 2013b. Rosenfeld, Roni. A Maximum Entropy Approach to Adaptive Statistical Languag...
-
[13]
Recurrent neural network regularization
Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329,
-
[14]
Recurrent Highway Networks
Zilly, Julian Georg, Srivastava, Rupesh Kumar, Koutník, Jan, and Schmidhuber, Jürgen. Recurrent Highway Networks. arXiv preprint arXiv:1607.03474,