Structured per-agent randomness via ranked masking in attention allows symmetric agents to break ties and coordinate, achieving perfect success on symmetric tasks where deterministic policies fail and enabling zero-shot transfer across team sizes.
N., Kaiser, Ł., and Polosukhin, I.
15 Pith papers cite this work.
representative citing papers
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
Super-Linear introduces a pretrained MoE architecture using frequency-specialized linear experts and spectral gating for efficient general time series forecasting.
A mean-field dynamical analysis of LoRA in transformers identifies phase transitions in catastrophic forgetting driven by perturbation norm and transformer depth.
Speculative decoding accelerates exact sampling from large autoregressive models by 2-3x on T5-XXL by running smaller approximation models in parallel to propose token sequences that the large model then verifies in batches while preserving the original output distribution.
A training-inference consistent segmented execution framework for long-context LLMs matches full-context performance with substantially lower peak memory at very long lengths.
HELIX uses learnable feature identities and hybrid temporal-feature attention to achieve state-of-the-art time series imputation across multiple datasets and settings.
SANTA replaces full value-cache multiply-accumulates with stochastic gather-and-add sampling from the attention distribution to reduce memory bandwidth while preserving an unbiased estimator.
TAVT improves OOD task generalization in meta-RL by preserving task characteristics in virtual tasks via metric learning and using state regularization.
Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
Existing proofs of autoregressive Transformer Turing-completeness apply to scaling families of models rather than fixed systems with context management, so they do not establish Turing-completeness for real-world LLMs.
Absorber LLM introduces causal synchronization to absorb context into parameters for memory-efficient long-context LLM inference while preserving causal effects.
LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.
The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.