pith. machine review for the scientific record.

arxiv: 2307.08691 · v1 · submitted 2023-07-17 · 💻 cs.LG

Recognition: 1 theorem link

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Pith reviewed 2026-05-11 02:34 UTC · model grok-4.3

classification 💻 cs.LG
keywords attention mechanism · GPU optimization · transformers · FlashAttention · work partitioning · parallelism · language modeling · matrix multiplication

The pith

FlashAttention-2 speeds up transformer attention by about 2 times through better GPU thread and warp partitioning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that attention remains the main slowdown when scaling transformers to longer sequences because its quadratic cost in sequence length dominates runtime and memory. It shows that the original FlashAttention still wastes GPU capacity through low occupancy and extra shared-memory traffic caused by how work is split among thread blocks and warps. By changing the partitioning in three concrete ways, the new version cuts unnecessary operations and raises hardware utilization without any approximation or loss of correctness. A sympathetic reader cares because faster exact attention makes training and inference on longer contexts practical on existing hardware, directly affecting language modeling, image understanding, and generation tasks.
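To make the quadratic cost concrete, the short sketch below computes how much memory a fully materialized score matrix would need for one attention head. The function name, the 2-byte element size (FP16/BF16), and the sequence lengths are illustrative assumptions, not figures from the paper.

```python
def score_matrix_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to materialize one head's seq_len x seq_len score matrix."""
    return seq_len * seq_len * bytes_per_element


# Doubling the sequence length quadruples the footprint.
for n in (1024, 2048, 4096, 8192, 16384):
    mib = score_matrix_bytes(n) / 2**20
    print(f"N={n:6d}: {mib:8.1f} MiB per head")
```

Multiplying by the number of heads and the batch size shows why storing this matrix quickly becomes impractical at long sequence lengths.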

Core claim

FlashAttention-2 reduces non-matrix-multiplication FLOPs, parallelizes the attention computation for even a single head across multiple thread blocks to raise occupancy, and redistributes work inside each block across warps to cut shared-memory reads and writes. These changes produce roughly 2 times speedup over FlashAttention, lifting performance from 25-40 percent to 50-73 percent of the A100's theoretical peak FLOPs per second and delivering end-to-end training throughput up to 225 TFLOPs per second per GPU with 72 percent model FLOPs utilization on GPT-style models.
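One well-known way to trim non-matmul work in a blocked softmax is to keep the output accumulator unnormalized while streaming over key blocks and to divide by the running sum only once at the end. The sketch below illustrates that idea for a single query row in plain Python. Whether it matches the paper's exact tweak is an assumption, since this page does not spell the change out; all function names and the toy data are hypothetical.

```python
import math


def attention_row_reference(q, keys, values):
    """Plain softmax attention for one query row."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    dim = len(values[0])
    return [sum(wj * v[c] for wj, v in zip(w, values)) / total for c in range(dim)]


def attention_row_online(q, keys, values, block=2):
    """Stream over key blocks; the division by the running sum is deferred."""
    dim = len(values[0])
    m = -math.inf        # running maximum of the scores seen so far
    l = 0.0              # running sum of exp(score - m)
    acc = [0.0] * dim    # unnormalized output accumulator
    for start in range(0, len(keys), block):
        kb = keys[start:start + block]
        vb = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in kb]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)  # equals 0.0 on the first block
        p = [math.exp(s - m_new) for s in scores]
        l = l * scale + sum(p)
        acc = [a * scale + sum(pj * v[c] for pj, v in zip(p, vb))
               for c, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]  # single normalization at the end


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(attention_row_reference(q, keys, values))
print(attention_row_online(q, keys, values))
```

The two functions agree up to floating-point rounding; the streaming version performs one division per output element instead of one per block.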

What carries the argument

Repartitioning scheme that parallelizes single-head attention across thread blocks and distributes sub-tasks between warps to reduce shared-memory traffic and non-matmul operations.
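A minimal sketch of why a single head can be split across independent workers: each query row's softmax depends only on that row's scores, so blocks of query rows can be processed without communicating. The partition along the query dimension is an assumption made for illustration, and the Python thread pool merely stands in for GPU thread blocks; all names are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def attention_rows(q_block, keys, values):
    """Exact softmax attention for a block of query rows."""
    dim = len(values[0])
    out = []
    for q in q_block:
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, values)) / z
                    for c in range(dim)])
    return out


def blockwise_attention(queries, keys, values, rows_per_block=2):
    """Split one head's queries into row blocks and process them independently."""
    blocks = [queries[i:i + rows_per_block]
              for i in range(0, len(queries), rows_per_block)]
    with ThreadPoolExecutor() as pool:
        # map() preserves block order, so rows come back in their original order.
        results = list(pool.map(lambda blk: attention_rows(blk, keys, values), blocks))
    return [row for block_out in results for row in block_out]


queries = [[0.1, 0.2], [1.0, -1.0], [0.0, 0.5], [2.0, 1.0], [-0.3, 0.7]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(blockwise_attention(queries, keys, values) == attention_rows(queries, keys, values))
```

With only one head and a small batch, this kind of split is what gives the scheduler enough independent work units to keep more of the GPU busy.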

If this is right

  • Attention layers approach the efficiency of optimized matrix multiplications on the same hardware.
  • End-to-end training of GPT-style models reaches up to 225 TFLOPs per second per A100 GPU.
  • Longer sequence lengths become feasible without quadratic memory growth and without approximating attention.
  • GPU occupancy increases and unnecessary memory traffic drops for the attention kernel.
  • The same attention implementation can be used for both training and inference at higher throughput.
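The utilization figures can be sanity-checked with simple arithmetic. The sketch below assumes the A100's dense FP16/BF16 tensor-core peak of 312 TFLOPs/s, which comes from NVIDIA's published specification rather than from this page; the function name is hypothetical.

```python
A100_PEAK_TFLOPS = 312.0  # assumption: dense FP16/BF16 tensor-core peak


def utilization(achieved_tflops: float, peak_tflops: float = A100_PEAK_TFLOPS) -> float:
    """Fraction of the theoretical peak that a measured throughput represents."""
    return achieved_tflops / peak_tflops


# End-to-end training throughput quoted above.
print(f"{utilization(225.0):.1%}")

# Kernel-level range of 50-73% of peak, converted back to TFLOPs/s.
print(0.50 * A100_PEAK_TFLOPS, 0.73 * A100_PEAK_TFLOPS)
```

Under that assumed peak, 225 TFLOPs/s works out to roughly 72 percent, consistent with the reported model FLOPs utilization.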

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same partitioning ideas could apply to other memory-bound kernels that mix matrix multiplies with reductions.
  • Further gains might appear when the method is combined with sequence parallelism or different GPU architectures.
  • Longer-context applications in audio, video, and code generation become more accessible on current hardware.
  • The gap between attention and GEMM efficiency narrows, suggesting attention need not remain the dominant bottleneck.

Load-bearing premise

The assumption that low occupancy and extra shared-memory traffic are the main remaining bottlenecks and that the three partitioning changes will deliver the measured speedups on target GPUs without hidden numerical or correctness costs.

What would settle it

Running the same attention benchmarks and end-to-end GPT training on A100 hardware and observing less than 1.5 times speedup over FlashAttention or model FLOPs utilization below 60 percent.

read the original abstract

Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, and video generation. The attention layer is the main bottleneck in scaling to longer sequences, as its runtime and memory increase quadratically in the sequence length. FlashAttention exploits the asymmetric GPU memory hierarchy to bring significant memory saving (linear instead of quadratic) and runtime speedup (2-4$\times$ compared to optimized baselines), with no approximation. However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40\% of the theoretical maximum FLOPs/s. We observe that the inefficiency is due to suboptimal work partitioning between different thread blocks and warps on the GPU, causing either low-occupancy or unnecessary shared memory reads/writes. We propose FlashAttention-2, with better work partitioning to address these issues. In particular, we (1) tweak the algorithm to reduce the number of non-matmul FLOPs (2) parallelize the attention computation, even for a single head, across different thread blocks to increase occupancy, and (3) within each thread block, distribute the work between warps to reduce communication through shared memory. These yield around 2$\times$ speedup compared to FlashAttention, reaching 50-73\% of the theoretical maximum FLOPs/s on A100 and getting close to the efficiency of GEMM operations. We empirically validate that when used end-to-end to train GPT-style models, FlashAttention-2 reaches training speed of up to 225 TFLOPs/s per A100 GPU (72\% model FLOPs utilization).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces FlashAttention-2, which improves upon FlashAttention by reducing non-matmul FLOPs, parallelizing attention computation (including single-head cases) across thread blocks to raise occupancy, and redistributing warp-level work to cut shared-memory traffic. These changes are claimed to deliver ~2× kernel speedup over FlashAttention while remaining mathematically equivalent (no approximation), reaching 50-73% of A100 theoretical peak FLOPs/s and up to 225 TFLOPs/s (72% MFU) in end-to-end GPT-style training.

Significance. If the performance numbers and equivalence hold, the result would meaningfully advance practical long-context training by bringing attention kernels closer to GEMM efficiency on current hardware. Credit is due for the direct wall-clock and FLOPs measurements on A100, the end-to-end training runs, and the parameter-free algorithmic modifications that avoid fitted constants or self-referential definitions.

major comments (1)
  1. [§3.3] §3.3 (parallel block-level softmax): the manuscript describes combining partial online-softmax statistics (max and sum) across thread blocks when a single head is split, which necessarily changes the order of floating-point reductions relative to the original per-block schedule. No side-by-side tensor-equality tests, max-abs-diff bounds, or end-to-end loss/gradient-norm comparisons between FlashAttention and FlashAttention-2 outputs are reported; this verification is load-bearing for the central “no approximation” claim.
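The referee's concern rests on the fact that floating-point addition is not associative, so two mathematically identical reductions can differ in their last bits. The sketch below shows this with three float64 addends and a hypothetical max_abs_diff helper of the kind such a comparison would use; it illustrates the numerical effect only and is not a test of either kernel.

```python
def max_abs_diff(xs, ys):
    """Largest element-wise absolute difference between two equal-length sequences."""
    return max(abs(x - y) for x, y in zip(xs, ys))


# The same three addends, associated in two different orders (float64).
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)

bitwise_equal = left_first == right_first          # False: the results differ
diff = max_abs_diff([left_first], [right_first])   # tiny, on the order of machine epsilon
print(bitwise_equal, diff)
```

A tolerance-based comparison (for example a bound on the maximum absolute difference) is therefore the appropriate check, rather than bitwise tensor equality.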
minor comments (2)
  1. [Figure 3] Figure 3 and §4.1: the occupancy and shared-memory traffic diagrams would benefit from explicit annotation of the warp-to-block mapping and the exact reduction tree used for cross-block statistics.
  2. [Table 2] Table 2: the reported efficiency figures are given as ranges (50-73% of peak FLOPs/s); adding per-configuration raw TFLOPs/s numbers alongside the percentages would improve reproducibility.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive recommendation of minor revision and for highlighting the importance of explicit numerical verification. We address the major comment point-by-point below.

Point-by-point responses
  1. Referee: [§3.3] §3.3 (parallel block-level softmax): the manuscript describes combining partial online-softmax statistics (max and sum) across thread blocks when a single head is split, which necessarily changes the order of floating-point reductions relative to the original per-block schedule. No side-by-side tensor-equality tests, max-abs-diff bounds, or end-to-end loss/gradient-norm comparisons between FlashAttention and FlashAttention-2 outputs are reported; this verification is load-bearing for the central “no approximation” claim.

    Authors: We agree that the manuscript does not report explicit side-by-side numerical checks, and this is a fair observation. The block-parallel softmax uses the online-softmax merge rule (max and rescaled sum) that is mathematically exact in real arithmetic, as established in the original FlashAttention work; the only difference is the order of floating-point reductions. In practice the resulting discrepancy is on the order of machine epsilon scaled by the magnitude of the values. We will add the requested verification in the revision: (1) direct tensor comparisons for sequence lengths 512–4096 and head dimensions 64–128, reporting max-abs-diff < 1e-5 in FP32; (2) end-to-end GPT training runs confirming that loss curves and gradient norms match within FP tolerance. These results will appear in §3.3 or a new appendix. The “no approximation” claim remains unchanged because the algorithm performs the identical mathematical operations, only reordered.

    Revision: yes
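The merge rule mentioned in the response can be sketched in a few lines. The example below computes partial softmax statistics (maximum, sum of exponentials, and unnormalized weighted value) on two halves of a score vector, merges them, and compares the result with a single-pass computation. Scalar values and the function names are simplifying assumptions made for illustration.

```python
import math


def partial_stats(scores, values):
    """Return (max, sum of exp, unnormalized weighted value) for one key block."""
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    return m, sum(p), sum(pj * v for pj, v in zip(p, values))


def merge(a, b):
    """Combine two partial results by rescaling both to the shared maximum."""
    m_a, l_a, o_a = a
    m_b, l_b, o_b = b
    m = max(m_a, m_b)
    s_a, s_b = math.exp(m_a - m), math.exp(m_b - m)
    return m, l_a * s_a + l_b * s_b, o_a * s_a + o_b * s_b


scores = [0.3, -1.2, 2.5, 0.0, 4.1, -0.7]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

m, l, o = merge(partial_stats(scores[:3], values[:3]),
                partial_stats(scores[3:], values[3:]))
merged = o / l

m_ref, l_ref, o_ref = partial_stats(scores, values)
reference = o_ref / l_ref
print(abs(merged - reference))
```

The two results agree in exact arithmetic; in floating point they can differ only by rounding, which is the discrepancy the referee asks the authors to quantify.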

Circularity Check

0 steps flagged

Minor self-citation to FlashAttention baseline; speedup claims rest on independent hardware measurements

Full rationale

The paper's chain proceeds from observed GPU occupancy and shared-memory traffic issues in the prior FlashAttention algorithm, through three explicit partitioning modifications (non-matmul reduction, block-level head parallelism, warp-level distribution), to direct empirical timing and TFLOPs/s measurements on A100 hardware. These measurements are external benchmarks, not outputs of any fitted model or self-referential definition. The only self-citation is to the original FlashAttention work for the baseline description; it is not load-bearing for the new claims, which are independently specified and validated. No equations reduce by construction to their inputs, no parameters are fitted then renamed as predictions, and no uniqueness theorem or ansatz is smuggled via self-citation. The skeptic note on reduction order affects numerical verification but does not create a circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard domain assumptions about GPU memory hierarchy and thread scheduling rather than any new fitted constants or invented entities.

axioms (1)
  • domain assumption: the GPU memory hierarchy is asymmetric, with fast shared memory per block and slower global memory, and thread occupancy and shared-memory traffic are the dominant remaining bottlenecks after FlashAttention.
    Invoked to justify the three partitioning changes; stated in the abstract as the observed cause of inefficiency.

pith-pipeline@v0.9.0 · 5613 in / 1341 out tokens · 46595 ms · 2026-05-11T02:34:21.618354+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

    cs.CV 2026-05 accept novelty 8.0

    DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.

  2. Efficient Training on Multiple Consumer GPUs with RoundPipe

    cs.DC 2026-04 conditional novelty 8.0

    RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...

  3. RULER: What's the Real Context Size of Your Long-Context Language Models?

    cs.CL 2024-04 accept novelty 8.0

    RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

  4. TurboVGGT: Fast Visual Geometry Reconstruction with Adaptive Alternating Attention

    cs.CV 2026-05 unverdicted novelty 7.0

    TurboVGGT uses adaptive sparse global attention with varying sparsity levels across frames and layers plus frame attention to enable faster multi-view 3D reconstruction while keeping competitive quality versus prior s...

  5. Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

    stat.ML 2026-05 unverdicted novelty 7.0

    MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 f...

  6. Very Efficient Listwise Multimodal Reranking for Long Documents

    cs.IR 2026-05 unverdicted novelty 7.0

    ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.

  7. CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    CUDAHercules benchmark demonstrates that leading LLMs generate functional CUDA code but fail to recover expert-level optimization strategies needed for peak performance on Ampere, Hopper, and Blackwell GPUs.

  8. CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

    cs.LG 2026-05 unverdicted novelty 7.0

    CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.

  9. ProteinJEPA: Latent prediction complements protein language models

    cs.LG 2026-05 unverdicted novelty 7.0

    Masked-position MLM plus JEPA latent prediction outperforms MLM-only pretraining on 10-11 of 16 downstream tasks for 35M-150M protein models while JEPA alone fails.

  10. LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification

    cs.CL 2026-05 unverdicted novelty 7.0

    LaTER reduces LLM token usage 16-33% on reasoning benchmarks by exploring in latent space then switching to explicit CoT verification, with gains like 70% to 73.3% on AIME 2025 in the training-free version.

  11. Long Context Pre-Training with Lighthouse Attention

    cs.CL 2026-05 conditional novelty 7.0

    Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...

  12. VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

    cs.CV 2026-05 unverdicted novelty 7.0

    VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

  13. Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge

    cs.DC 2026-05 unverdicted novelty 7.0

    Tempus delivers 607 GOPS at 10.677 W using fixed 16 AIE cores on Versal AI Edge, with 211.2x better platform-aware utility than spatial SOTA ARIES and zero URAM/DSP utilization.

  14. CellxPert: Inference-Time MCMC Steering of a Multi-Omics Single-Cell Foundation Model for In-Silico Perturbation

    q-bio.GN 2026-04 unverdicted novelty 7.0

    CellxPert uses inference-time MCMC steering on a multi-omics single-cell foundation model to predict genome-wide transcriptomic responses to gene perturbations and outperforms baselines on cell-type annotation, pertur...

  15. ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space

    cs.LG 2026-04 unverdicted novelty 7.0

    ABC enables any-subset autoregressive generation of continuous stochastic processes via non-Markovian diffusion bridges that track physical time and allow path-dependent conditioning.

  16. Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens

    cs.CL 2026-04 unverdicted novelty 7.0

    Entropy-guided supertokens from BPE on reasoning traces compress LLM outputs by 8.1% on average across models and math benchmarks with no accuracy loss while exposing strategy differences between correct and incorrect traces.

  17. Breaking the Autoregressive Chain: Hyper-Parallel Decoding for Efficient LLM-Based Attribute Value Extraction

    cs.CL 2026-04 unverdicted novelty 7.0

    Hyper-Parallel Decoding enables parallel generation of independent sequences in LLMs via position ID manipulation, delivering up to 13.8X speedup for attribute value extraction.

  18. QFlash: Bridging Quantization and Memory Efficiency in Vision Transformer Attention

    cs.LG 2026-04 unverdicted novelty 7.0

    QFlash implements end-to-end integer FlashAttention with integer-only softmax, delivering up to 8.69x speedup and 18.8% energy savings on ViT models while preserving accuracy under per-tensor quantization.

  19. A satellite foundation model for improved wealth monitoring

    cs.CY 2026-04 unverdicted novelty 7.0

    Tempov is a self-supervised satellite foundation model that predicts wealth levels and decadal changes at high resolution across Africa from Landsat imagery, outperforming baselines even with limited labels and genera...

  20. DocPrune:Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning

    cs.CV 2026-04 unverdicted novelty 7.0

    DocPrune is a training-free token pruning method that removes background and irrelevant tokens from document images using question and comprehension signals, yielding 3x encoder and 3.3x decoder throughput gains plus ...

  21. Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

    cs.LG 2026-04 unverdicted novelty 7.0

    Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.

  22. ScenarioControl: Vision-Language Controllable Vectorized Latent Scenario Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    ScenarioControl introduces the first vision-language controllable generator for realistic vectorized 3D driving scenarios with temporal consistency across actor views.

  23. TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens

    cs.CV 2026-04 unverdicted novelty 7.0

    TokenGS uses learnable Gaussian tokens in an encoder-decoder architecture to regress 3D means directly, achieving SOTA feed-forward reconstruction on static and dynamic scenes with better robustness.

  24. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

    cs.LG 2026-04 unverdicted novelty 7.0

    The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

  25. Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation

    cs.CV 2026-04 unverdicted novelty 7.0

    LMFT enables state-of-the-art performance in video unsupervised domain adaptation by focusing on motion-rich tokens and reducing computational overhead.

  26. User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    SMTPO uses multi-task SFT to improve simulator feedback quality and RL with fine-grained rewards to optimize multi-turn preference reasoning in LLM-based conversational recommendation.

  27. Fast Cross-Operator Optimization of Attention Dataflow

    cs.AR 2026-04 unverdicted novelty 7.0

    MMEE encodes dataflow decisions in matrix form for fast exhaustive search, delivering 40-69% lower latency and energy use than prior methods while running 64-343x faster.

  28. GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving

    cs.DC 2026-03 unverdicted novelty 7.0

    GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.

  29. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  30. Chronos: Learning the Language of Time Series

    cs.LG 2024-03 conditional novelty 7.0

    Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.

  31. Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

    cs.CV 2026-05 unverdicted novelty 6.0

    Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

  32. Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers

    cs.LG 2026-05 unverdicted novelty 6.0

    Stateful sessions with incremental KV cache and flash queries allow O(|q|) latency in streaming transformer inference, delivering up to 5.9x speedup over conventional engines while preserving full attention.

  33. SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    SceneGraphVLM generates dynamic scene graphs from video using compact VLMs, TOON serialization, and hallucination-aware RL to improve precision and achieve one-second latency.

  34. TurboGR: An Accelerated Training System for Large-Scale Generative Recommendation

    cs.DC 2026-05 unverdicted novelty 6.0

    TurboGR trains up to 0.2B-parameter generative recommendation models on Ascend NPUs at 54.71% MFU with 0.97 near-linear scalability via jagged acceleration, hierarchical parallelism, and negative sampling optimizations.

  35. Search Your Block Floating Point Scales!

    cs.LG 2026-05 unverdicted novelty 6.0

    ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

  36. Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing

    cs.CL 2026-05 conditional novelty 6.0

    EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.

  37. Remember to Forget: Gated Adaptive Positional Encoding

    cs.LG 2026-05 unverdicted novelty 6.0

    GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.

  38. Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation

    cs.RO 2026-05 unverdicted novelty 6.0

    SAGE trains agents in physics-grounded semantic abstractions via RL with asymmetric clipping, achieving 53.21% LLM-Match Success on A-EQA (+9.7% over baseline) and encouraging physical robot transfer.

  39. Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs

    cs.CL 2026-05 unverdicted novelty 6.0

    TPAW uses teams of current and historical model checkpoints that collaborate and compete, plus adaptive weightings for responses and players, to improve self-supervised LLM alignment and outperform baselines.

  40. KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

    cs.AR 2026-05 unverdicted novelty 6.0

    KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.

  41. Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

    cs.LG 2026-05 unverdicted novelty 6.0

    A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.

  42. Edit-Based Refinement for Parallel Masked Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    ME-DLM augments parallel masked diffusion models with edit-distance-supervised refinements to raise quality on coding and math benchmarks while using far fewer diffusion steps.

  43. MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production

    cs.DC 2026-05 unverdicted novelty 6.0

    MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.

  44. ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing

    cs.CL 2026-05 conditional novelty 6.0

    ReST-KV formulates KV eviction as layer-wise output reconstruction optimization with spatial-temporal smoothing, outperforming baselines by 2.58% on LongBench and 15.2% on RULER while cutting decoding latency by 10.61...

  45. RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    RDKV derives per-token and per-channel weights from attention distortion, then uses reverse water-filling to assign bit-widths from full precision to zero after prefilling, recovering 97.81% accuracy with 2.48% cache ...

  46. Reformulating KV Cache Eviction Problem for Long-Context LLM Inference

    cs.CL 2026-05 unverdicted novelty 6.0

    LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.

  47. WiCER: Wiki-memory Compile, Evaluate, Refine Iterative Knowledge Compilation for LLM Wiki Systems

    cs.CL 2026-05 conditional novelty 6.0

    WiCER iteratively diagnoses and repairs fact loss during wiki compilation for LLMs, recovering 80% of quality lost in blind distillation across 17 domains while cutting catastrophic failures by 55%.

  48. Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators

    cs.LG 2026-05 unverdicted novelty 6.0

    Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.

  49. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...

  50. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.

  51. The Impossibility Triangle of Long-Context Modeling

    cs.CL 2026-05 unverdicted novelty 6.0

    No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

  52. WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization

    cs.CV 2026-05 unverdicted novelty 6.0

    WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.

  53. GateMOT: Q-Gated Attention for Dense Object Tracking

    cs.CV 2026-04 unverdicted novelty 6.0

    GateMOT proposes Q-Gated Attention to enable linear-complexity, spatially aware attention for state-of-the-art dense object tracking on benchmarks like BEE24.

  54. SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

    cs.CV 2026-04 conditional novelty 6.0

    SIEVES improves selective prediction coverage by up to 3x on OOD VQA benchmarks by training a selector to score the quality of visual evidence produced by reasoner models, generalizing across benchmarks and proprietar...

  55. SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

    cs.CV 2026-04 unverdicted novelty 6.0

    SIEVES improves selective prediction coverage up to 3x on OOD VQA benchmarks by training a selector on visual localization quality, generalizing across datasets and proprietary reasoners without specific adaptation.

  56. Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

    cs.SE 2026-04 unverdicted novelty 6.0

    Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.

  57. ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers

    cs.LG 2026-04 unverdicted novelty 6.0

    ELSA casts online softmax attention as a prefix scan over monoid (m,S,W) to deliver exact FP32 semantics, O(n) memory, O(log n) depth, and Tensor-Core independence as a drop-in kernel.

  58. Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

    cs.LG 2026-04 unverdicted novelty 6.0

    CuTile delivers high performance on select AI workloads and GPUs but varies significantly by architecture and is less portable than Triton across tested platforms.

  59. HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

    cs.LG 2026-04 unverdicted novelty 6.0

    HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

  60. LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    LayerBoost applies layer-specific attention changes guided by sensitivity analysis plus brief distillation to cut LLM inference latency up to 68% while keeping competitive quality.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 96 Pith papers · 6 internal anchors

  1. [1]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023

  2. [2]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

  3. [3]

    Scatterbrain: Unifying sparse and low-rank attention

    Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. Scatterbrain: Unifying sparse and low-rank attention. In Advances in Neural Information Processing Systems (NeurIPS), 2021

  4. [4]

    Rethinking attention with performers

    Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In International Conference on Learning Representations (ICLR), 2020

  5. [5]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022

  6. [6]

    Dissecting the Ampere GPU architecture via microbenchmarking

    Zhe Jia and Peter Van Sandt. Dissecting the Ampere GPU architecture via microbenchmarking. GPU Technology Conference, 2021

  7. [7]

    Dissecting the NVIDIA Volta GPU architecture via microbenchmarking

    Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P Scarpazza. Dissecting the NVIDIA Volta GPU architecture via microbenchmarking. arXiv preprint arXiv:1804.06826, 2018

  8. [8]

    Transformers are RNNs: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020

  9. [9]

    Reformer: The efficient transformer

    Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In The International Conference on Machine Learning (ICML), 2020

  10. [10]

    xformers: A modular and hackable transformer modelling library

    Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, and Daniel Haziza. xformers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers, 2022

  11. [11]

    Online normalizer calculation for softmax

    Maxim Milakov and Natalia Gimelshein. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867, 2018

  12. [12]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. ArXiv, abs/2303.08774, 2023

  13. [13]

    Self-attention does not need O(n²) memory

    Markus N Rabe and Charles Staats. Self-attention does not need O(n²) memory. arXiv preprint arXiv:2112.05682, 2021

  14. [14]

    Efficient content-based sparse attention with routing transformers

    Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021

  15. [15]

    Fast Transformer Decoding: One Write-Head is All You Need

    Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019

  16. [16]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

  17. [17]

    Triton: an intermediate language and compiler for tiled neural network computations

    Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019

  18. [18]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017

  19. [19]

    Linformer: Self-Attention with Linear Complexity

    Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020

  20. [20]

    Big bird: Transformers for longer sequences

    Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33, 2020