arxiv: 2602.14209 · v2 · pith:KOUS5TKRnew · submitted 2026-02-15 · 💻 cs.LG · cs.CL

MAGE: All-[MASK] Block Already Knows Where to Look in Block Diffusion LLM

keywords: block, attention, diffusion, llms, mage, mask, sparse, subset

Block diffusion LLMs are an emerging paradigm for parallel language generation, but their KV caching makes memory access the dominant bottleneck in long-context inference. Sparse attention, which attends only to a small KV subset per query, can reduce this latency with minimal accuracy loss. In block diffusion, however, the B tokens of each block must share a single KV subset, and we show this per-block constraint degrades existing sparse KV estimators by up to 25% in recall. We address this challenge by exploiting a property that emerges from the block-diffusion training objective: it aligns the block-average query across denoising steps, so the All-[MASK] block at the first step already reveals the per-block KV subset for the entire trajectory. We exploit this in MAGE ([MASK]-Guided Sparse Attention), a training-free method that runs one exact attention pass at the first step and reuses its top-k index sets for all remaining steps within the block. Across three block-diffusion families on LongBench, MAGE matches Exact Attention at k=512 with near-lossless accuracy, achieves up to 6.82x end-to-end speedup at 128K context, and runs up to 3.35x and 2.28x faster than Quest and SparseD, designed for AR LLMs and fully bidirectional diffusion LLMs, respectively.
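
The reuse pattern the abstract describes (one exact attention pass on the All-[MASK] block, then the same top-k KV indices for every later denoising step) can be sketched in NumPy as below. This is a minimal single-head illustration, not the authors' implementation. The function names, the tensor shapes, and the selection rule (top-k over attention probabilities averaged across the B queries of the block) are assumptions made for this example; the abstract does not specify how the shared index set is scored.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def exact_attention(q, k, v):
    """Dense attention. q: (B, d); k, v: (N, d). Returns (B, d) output and (B, N) probabilities."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    probs = softmax(scores, axis=-1)
    return probs @ v, probs

def select_block_topk(probs, top_k):
    """Pick ONE KV index set shared by all B queries of the block.

    Assumed rule (hypothetical): rank KV positions by attention mass
    averaged over the block's queries, then keep the top_k positions.
    """
    block_mass = probs.mean(axis=0)              # (N,)
    top_k = min(top_k, block_mass.shape[0])
    idx = np.argpartition(block_mass, -top_k)[-top_k:]
    return np.sort(idx)                          # sorted for contiguous-ish gathers

def sparse_attention(q, k, v, idx):
    """Attention restricted to the selected KV subset."""
    out, _ = exact_attention(q, k[idx], v[idx])
    return out

def denoise_block(step_queries, k, v, top_k):
    """step_queries: list of (B, d) arrays, one per denoising step.

    The first entry plays the role of the All-[MASK] block: it gets one
    exact pass, and its index set is reused for all remaining steps.
    """
    out0, probs = exact_attention(step_queries[0], k, v)
    idx = select_block_topk(probs, top_k)
    outputs = [out0]
    for q in step_queries[1:]:
        outputs.append(sparse_attention(q, k, v, idx))
    return outputs, idx
```

With `top_k` equal to the number of cached KV entries the sparse steps reduce to exact attention, which gives a simple sanity check. A real kernel would keep one index set per layer and head and gather KV entries on the GPU rather than indexing NumPy arrays.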

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HERALD: High-Throughput Block Diffusion LLM Serving via CPU-GPU Cooperative KV Cache Retrieval

    cs.LG · 2026-06 · unverdicted · novelty 6.0

    HERALD enables near-lossless accuracy at 5-10% KV budget for block dLLMs by amortizing top-k selection across denoising steps and overlapping CPU-GPU retrieval, yielding up to 2.47x higher throughput than GPU-only inference.