super hub Mixed citations

Pointer Sentinel Mixture Models

Mixed citation behavior. Most common role is background (56%).

164 Pith papers citing it
Background 56% of classified citations
abstract

Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Treebank (70.9 perplexity) while using far fewer parameters than a standard softmax LSTM. In order to evaluate how well language models can exploit longer contexts and deal with more realistic vocabularies and larger corpora we also introduce the freely available WikiText corpus.
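The mixture described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the gate is read from an extra "sentinel" slot appended to the pointer attention scores, and that every context word has an index in the softmax vocabulary. All function and parameter names here are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pointer_sentinel_mixture(vocab_logits, pointer_scores, sentinel_score, context_ids):
    """Mix a vocabulary softmax with a pointer over recent context words.

    Assumption: the sentinel competes with the context positions in one
    softmax, and its share of the attention mass is used as the gate on
    the vocabulary distribution.
    """
    p_vocab = softmax(vocab_logits)

    # Joint softmax over context positions plus the sentinel slot.
    attn = softmax(list(pointer_scores) + [sentinel_score])
    gate = attn[-1]  # mass handed to the standard softmax classifier

    # Start from the gated vocabulary distribution.
    mixed = [gate * p for p in p_vocab]

    # Add pointer mass to every word that appears in the context;
    # repeated words accumulate mass from each of their positions.
    for word_id, a in zip(context_ids, attn[:-1]):
        mixed[word_id] += a

    return mixed, gate


# Toy usage: a 4-word vocabulary with uniform softmax logits, and a
# two-position context in which word 3 appears twice.
mixed, gate = pointer_sentinel_mixture(
    vocab_logits=[0.0, 0.0, 0.0, 0.0],
    pointer_scores=[2.0, 0.0],
    sentinel_score=0.0,
    context_ids=[3, 3],
)
```

Because the sentinel and the context positions share one softmax, the mixed distribution sums to one without a separate normalization step, and a word seen in the recent context can receive high probability even when the vocabulary softmax ranks it low.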

citation-role summary

background 9 · dataset 5 · method 1 · other 1

representative citing papers

Editing Models with Task Arithmetic

cs.LG · 2022-12-08 · accept · novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

TallyTrain: Communication-Efficient Federated Distillation

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

TallyTrain is a hard-label distillation protocol for federated learning that uses argmax transmission and optional sparse merges to match soft-label performance at up to 1000x lower communication cost.

Tapered Language Models

cs.LG · 2026-06-22 · unverdicted · novelty 7.0

Tapered Language Models monotonically decrease MLP width across depth with a cosine schedule, yielding better perplexity and downstream performance than uniform-width baselines across multiple architectures and scales at no extra cost.

Decomposing how prompting steers behavior

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

A geometric decomposition framework shows that affine transformations best recover prompt-induced task geometry and behavior in language and vision models across multiple datasets.

citing papers explorer

Showing 4 of 4 citing papers after filters.

  • BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization cs.LG · 2026-05-11 · unverdicted · none · ref 19 · internal anchor

    BCJR-QAT makes trellis quantization differentiable via BCJR soft decoding at finite temperature, allowing QAT to improve 2-bit LLM perplexity over PTQ with a fused GPU kernel and a drift-budget escape condition.

  • RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs cs.LG · 2026-05-01 · unverdicted · none · ref 36 · internal anchor

    RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.

  • Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators cs.LG · 2026-05-07 · unverdicted · none · ref 25 · internal anchor

    Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.

  • Superposition Yields Robust Neural Scaling cs.LG · 2025-05-15 · conditional · none · ref 43 · internal anchor

    Strong superposition causes neural loss to scale as the inverse of model dimension due to geometric feature overlaps, explaining scaling laws for broad frequency distributions.