hub

N., Kaiser, ., and Polosukhin, I

Vaswani, A · 2017

15 Pith papers cite this work. Polarity classification is still indexing.

15 Pith papers citing it

browse 15 citing papers

hub tools

JSON dossier citing papers JSON

representative citing papers

Randomness is sometimes necessary for coordination

cs.AI · 2026-05-07 · conditional · novelty 7.0

Structured per-agent randomness via ranked masking in attention allows symmetric agents to break ties and coordinate, achieving perfect success on symmetric tasks where deterministic policies fail and enabling zero-shot transfer across team sizes.

VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

cs.CV · 2026-05-02 · unverdicted · novelty 7.0

VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.

Super-Linear: A Lightweight Pretrained Mixture of Linear Experts for Time Series Forecasting

cs.LG · 2025-09-18 · unverdicted · novelty 7.0

Super-Linear introduces a pretrained MoE architecture using frequency-specialized linear experts and spectral gating for efficient general time series forecasting.

Understanding Catastrophic Forgetting In LoRA via Mean-Field Attention Dynamics

cs.LG · 2024-02-23 · unverdicted · novelty 7.0

A mean-field dynamical analysis of LoRA in transformers identifies phase transitions in catastrophic forgetting driven by perturbation norm and transformer depth.

Fast Inference from Transformers via Speculative Decoding

cs.LG · 2022-11-30 · accept · novelty 7.0

Speculative decoding accelerates exact sampling from large autoregressive models by 2-3x on T5-XXL by running smaller approximation models in parallel to propose token sequences that the large model then verifies in batches while preserving the original output distribution.

Training-Inference Consistent Segmented Execution for Long-Context LLMs

cs.CL · 2026-05-12 · conditional · novelty 6.0

A training-inference consistent segmented execution framework for long-context LLMs matches full-context performance with substantially lower peak memory at very long lengths.

HELIX: Hybrid Encoding with Learnable Identity and Cross-dimensional Synthesis for Time Series Imputation

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

HELIX uses learnable feature identities and hybrid temporal-feature attention to achieve state-of-the-art time series imputation across multiple datasets and settings.

Stochastic Sparse Attention for Memory-Bound Inference

cs.LG · 2026-05-03 · accept · novelty 6.0

SANTA sparsifies post-softmax value aggregation via stratified sampling of S << n_k indices to produce an unbiased estimator, delivering 1.5x decode attention speedup on RTX 6000 Ada at 32k contexts while matching baseline accuracy.

Task-Aware Virtual Training: Enhancing Generalization in Meta-Reinforcement Learning for Out-of-Distribution Tasks

cs.LG · 2025-02-05 · unverdicted · novelty 6.0

TAVT improves OOD task generalization in meta-RL by preserving task characteristics in virtual tasks via metric learning and using state regularization.

Gated Linear Attention Transformers with Hardware-Efficient Training

cs.LG · 2023-12-11 · unverdicted · novelty 6.0

Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.

Absorber LLM: Harnessing Causal Synchronization for Test-Time Training

cs.LG · 2026-04-22 · unverdicted · novelty 5.0

Absorber LLM introduces causal synchronization to absorb context into parameters for memory-efficient long-context LLM inference while preserving causal effects.

LACE: Lattice Attention for Cross-thread Exploration

cs.AI · 2026-04-16 · unverdicted · novelty 5.0 · 3 refs

LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.

The Serial Scaling Hypothesis

cs.LG · 2025-07-16 · unverdicted · novelty 5.0

The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.

Position: The Turing-Completeness of Autoregressive Transformers Relies Heavily on Context Management

cs.AI · 2026-05-19

citing papers explorer

Showing 15 of 15 citing papers.

Randomness is sometimes necessary for coordination cs.AI · 2026-05-07 · conditional · none · ref 94
Structured per-agent randomness via ranked masking in attention allows symmetric agents to break ties and coordinate, achieving perfect success on symmetric tasks where deterministic policies fail and enabling zero-shot transfer across team sizes.
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation cs.CV · 2026-05-02 · unverdicted · none · ref 287
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences cs.LG · 2026-04-22 · unverdicted · none · ref 53
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
Super-Linear: A Lightweight Pretrained Mixture of Linear Experts for Time Series Forecasting cs.LG · 2025-09-18 · unverdicted · none · ref 47
Super-Linear introduces a pretrained MoE architecture using frequency-specialized linear experts and spectral gating for efficient general time series forecasting.
Understanding Catastrophic Forgetting In LoRA via Mean-Field Attention Dynamics cs.LG · 2024-02-23 · unverdicted · none · ref 33
A mean-field dynamical analysis of LoRA in transformers identifies phase transitions in catastrophic forgetting driven by perturbation norm and transformer depth.
Fast Inference from Transformers via Speculative Decoding cs.LG · 2022-11-30 · accept · none · ref 66
Speculative decoding accelerates exact sampling from large autoregressive models by 2-3x on T5-XXL by running smaller approximation models in parallel to propose token sequences that the large model then verifies in batches while preserving the original output distribution.
Training-Inference Consistent Segmented Execution for Long-Context LLMs cs.CL · 2026-05-12 · conditional · none · ref 58
A training-inference consistent segmented execution framework for long-context LLMs matches full-context performance with substantially lower peak memory at very long lengths.
HELIX: Hybrid Encoding with Learnable Identity and Cross-dimensional Synthesis for Time Series Imputation cs.LG · 2026-05-04 · unverdicted · none · ref 23
HELIX uses learnable feature identities and hybrid temporal-feature attention to achieve state-of-the-art time series imputation across multiple datasets and settings.
Stochastic Sparse Attention for Memory-Bound Inference cs.LG · 2026-05-03 · accept · none · ref 37
SANTA sparsifies post-softmax value aggregation via stratified sampling of S << n_k indices to produce an unbiased estimator, delivering 1.5x decode attention speedup on RTX 6000 Ada at 32k contexts while matching baseline accuracy.
Task-Aware Virtual Training: Enhancing Generalization in Meta-Reinforcement Learning for Out-of-Distribution Tasks cs.LG · 2025-02-05 · unverdicted · none · ref 52
TAVT improves OOD task generalization in meta-RL by preserving task characteristics in virtual tasks via metric learning and using state regularization.
Gated Linear Attention Transformers with Hardware-Efficient Training cs.LG · 2023-12-11 · unverdicted · none · ref 95
Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
Absorber LLM: Harnessing Causal Synchronization for Test-Time Training cs.LG · 2026-04-22 · unverdicted · none · ref 85
Absorber LLM introduces causal synchronization to absorb context into parameters for memory-efficient long-context LLM inference while preserving causal effects.
LACE: Lattice Attention for Cross-thread Exploration cs.AI · 2026-04-16 · unverdicted · none · ref 36 · 3 links
LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.
The Serial Scaling Hypothesis cs.LG · 2025-07-16 · unverdicted · none · ref 119
The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.
Position: The Turing-Completeness of Autoregressive Transformers Relies Heavily on Context Management cs.AI · 2026-05-19 · unreviewed · ref 50

N., Kaiser, ., and Polosukhin, I

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer