N., Kaiser, Ł., and Polosukhin, I.

Vaswani, A · 2017

15 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Randomness is sometimes necessary for coordination

cs.AI · 2026-05-07 · conditional · novelty 7.0

Structured per-agent randomness via ranked masking in attention allows symmetric agents to break ties and coordinate, achieving perfect success on symmetric tasks where deterministic policies fail and enabling zero-shot transfer across team sizes.

VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

cs.CV · 2026-05-02 · unverdicted · novelty 7.0

VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.

Super-Linear: A Lightweight Pretrained Mixture of Linear Experts for Time Series Forecasting

cs.LG · 2025-09-18 · unverdicted · novelty 7.0

Super-Linear introduces a pretrained MoE architecture using frequency-specialized linear experts and spectral gating for efficient general time series forecasting.

Understanding Catastrophic Forgetting In LoRA via Mean-Field Attention Dynamics

cs.LG · 2024-02-23 · unverdicted · novelty 7.0

A mean-field dynamical analysis of LoRA in transformers identifies phase transitions in catastrophic forgetting driven by perturbation norm and transformer depth.

Fast Inference from Transformers via Speculative Decoding

cs.LG · 2022-11-30 · accept · novelty 7.0

Speculative decoding speeds up exact sampling from large autoregressive models by 2-3x on T5-XXL: a smaller approximation model proposes several tokens, and the large model verifies them in parallel, preserving the original output distribution.

Training-Inference Consistent Segmented Execution for Long-Context LLMs

cs.CL · 2026-05-12 · conditional · novelty 6.0

A training-inference consistent segmented execution framework for long-context LLMs matches full-context performance with substantially lower peak memory at very long lengths.

HELIX: Hybrid Encoding with Learnable Identity and Cross-dimensional Synthesis for Time Series Imputation

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

HELIX uses learnable feature identities and hybrid temporal-feature attention to achieve state-of-the-art time series imputation across multiple datasets and settings.

Task-Aware Virtual Training: Enhancing Generalization in Meta-Reinforcement Learning for Out-of-Distribution Tasks

cs.LG · 2025-02-05 · unverdicted · novelty 6.0

TAVT improves OOD task generalization in meta-RL by preserving task characteristics in virtual tasks via metric learning and using state regularization.

Gated Linear Attention Transformers with Hardware-Efficient Training

cs.LG · 2023-12-11 · unverdicted · novelty 6.0

Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.

Absorber LLM: Harnessing Causal Synchronization for Test-Time Training

cs.LG · 2026-04-22 · unverdicted · novelty 5.0

Absorber LLM introduces causal synchronization to absorb context into parameters for memory-efficient long-context LLM inference while preserving causal effects.

LACE: Lattice Attention for Cross-thread Exploration

cs.AI · 2026-04-16 · unverdicted · novelty 5.0 · 3 refs

LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.

The Serial Scaling Hypothesis

cs.LG · 2025-07-16 · unverdicted · novelty 5.0

The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.

Position: The Turing-Completeness of Autoregressive Transformers Relies Heavily on Context Management

cs.AI · 2026-05-19

Stochastic Sparse Attention for Memory-Bound Inference

cs.LG · 2026-05-03