hub

International conference on machine learning , pages=

Transformers are rnns: Fast autoregressive transformers with linear attention , author= · 2020

15 Pith papers cite this work. Polarity classification is still indexing.

15 Pith papers citing it

browse 15 citing papers

hub tools

JSON dossier citing papers JSON

representative citing papers

Scaling Limits of Long-Context Transformers

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

For uniform keys on the d-dimensional sphere, softmax attention becomes selective at inverse temperature scaling β_n* ≍ n^{2/(d-1)}, with explicit limiting laws for attention weights and outputs in each regime.

Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

Derives a blockwise resolvent-style attention operator that exploits structured sparsity for subquadratic O(n^{4/3}d) entity tracking while matching dense accuracy.

Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

NodePFN pre-trains on synthetic graphs with controllable homophily and causal feature-label models to achieve 71.27 average accuracy on 23 node classification benchmarks without graph-specific training.

ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

ElasticDiT introduces an elastic DiT architecture with adjustable spatial compression and block depth plus Shift Sparse Block Attention and a distilled VAE to enable a single model to cover multiple fidelity-latency points for high-resolution image generation on mobile devices.

Search Your Block Floating Point Scales!

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

On the global convergence of gradient descent for wide shallow models with bounded nonlinearities

math.OC · 2026-05-11 · unverdicted · novelty 6.0

Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.

Key-Value Means: Transformers with Expandable Block-Recurrent Compressed Memory

cs.LG · 2026-05-11 · unverdicted · novelty 6.0 · 3 refs

KVM is a new block-recurrent compressed KV attention that turns transformers into O(N) chunked RNNs or growable sublinear-memory models while remaining implementable with standard operations.

Continuity Laws for Sequential Models

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

S4 models exhibit stable time-continuity unlike sensitive S6 models, with task continuity predicting performance and enabling temporal subsampling for better efficiency.

ZAYA1-8B Technical Report

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

cs.LG · 2026-04-23 · unverdicted · novelty 6.0

Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard transformers with fewer layers.

Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

Gist Sparse Attention uses learnable gist compression tokens as both summaries and routing signals, then selectively unfolds relevant raw chunks for fine-grained attention, outperforming compression and sparse-attention baselines on LongBench and RAG tasks at 8x-32x compression.

MoBA: Mixture of Block Attention for Long-Context LLMs

cs.LG · 2025-02-18 · unverdicted · novelty 6.0

MoBA routes attention over blocks via MoE-style gating to enable dynamic, bias-light long-context attention that matches full attention performance at lower cost.

MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs

cs.LG · 2026-04-23 · unverdicted · novelty 5.0

LayerBoost selectively replaces or removes attention in non-critical transformer layers to cut inference latency up to 68% while recovering quality via brief distillation.

Absorber LLM: Harnessing Causal Synchronization for Test-Time Training

cs.LG · 2026-04-22 · unverdicted · novelty 5.0

Absorber LLM introduces causal synchronization to absorb context into parameters for memory-efficient long-context LLM inference while preserving causal effects.

citing papers explorer

Showing 15 of 15 citing papers.

Scaling Limits of Long-Context Transformers cs.LG · 2026-05-08 · unverdicted · none · ref 53
For uniform keys on the d-dimensional sphere, softmax attention becomes selective at inverse temperature scaling β_n* ≍ n^{2/(d-1)}, with explicit limiting laws for attention weights and outputs in each regime.
Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity cs.LG · 2026-05-21 · unverdicted · none · ref 12
Derives a blockwise resolvent-style attention operator that exploits structured sparsity for subquadratic O(n^{4/3}d) entity tracking while matching dense accuracy.
Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors cs.LG · 2026-04-21 · unverdicted · none · ref 283
NodePFN pre-trains on synthetic graphs with controllable homophily and causal feature-label models to achieve 71.27 average accuracy on 23 node classification benchmarks without graph-specific training.
ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices cs.CV · 2026-05-15 · unverdicted · none · ref 21
ElasticDiT introduces an elastic DiT architecture with adjustable spatial compression and block depth plus Shift Sparse Block Attention and a distilled VAE to enable a single model to cover multiple fidelity-latency points for high-resolution image generation on mobile devices.
Search Your Block Floating Point Scales! cs.LG · 2026-05-12 · unverdicted · none · ref 91
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
On the global convergence of gradient descent for wide shallow models with bounded nonlinearities math.OC · 2026-05-11 · unverdicted · none · ref 49
Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.
Key-Value Means: Transformers with Expandable Block-Recurrent Compressed Memory cs.LG · 2026-05-11 · unverdicted · none · ref 4 · 3 links
KVM is a new block-recurrent compressed KV attention that turns transformers into O(N) chunked RNNs or growable sublinear-memory models while remaining implementable with standard operations.
Continuity Laws for Sequential Models cs.LG · 2026-05-08 · unverdicted · none · ref 48
S4 models exhibit stable time-continuity unlike sensitive S6 models, with task continuity predicting performance and enabling temporal subsampling for better efficiency.
ZAYA1-8B Technical Report cs.AI · 2026-05-06 · unverdicted · none · ref 19
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
The Recurrent Transformer: Greater Effective Depth and Efficient Decoding cs.LG · 2026-04-23 · unverdicted · none · ref 26
Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard transformers with fewer layers.
Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention cs.LG · 2026-04-22 · unverdicted · none · ref 19
Gist Sparse Attention uses learnable gist compression tokens as both summaries and routing signals, then selectively unfolds relevant raw chunks for fine-grained attention, outperforming compression and sparse-attention baselines on LongBench and RAG tasks at 8x-32x compression.
MoBA: Mixture of Block Attention for Long-Context LLMs cs.LG · 2025-02-18 · unverdicted · none · ref 34
MoBA routes attention over blocks via MoE-style gating to enable dynamic, bias-light long-context attention that matches full attention performance at lower cost.
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention cs.LG · 2026-05-07 · unverdicted · none · ref 46
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs cs.LG · 2026-04-23 · unverdicted · none · ref 8
LayerBoost selectively replaces or removes attention in non-critical transformer layers to cut inference latency up to 68% while recovering quality via brief distillation.
Absorber LLM: Harnessing Causal Synchronization for Test-Time Training cs.LG · 2026-04-22 · unverdicted · none · ref 18
Absorber LLM introduces causal synchronization to absorb context into parameters for memory-efficient long-context LLM inference while preserving causal effects.

International conference on machine learning , pages=

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer