StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation
Pith reviewed 2026-06-26 18:00 UTC · model grok-4.3
The pith
StreamKL reduces the memory cost of attention KL divergence from quadratic in sequence length to constant by streaming a fused online computation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
StreamKL derives a novel online formulation for the coupled two-distribution KL reduction, enabling a single one-pass forward kernel that streams query-key tiles through on-chip SRAM. For the backward pass, StreamKL recomputes attention probabilities tile-by-tile, avoiding storage of quadratic intermediates. Experiments show this fused GPU primitive reduces the extra HBM footprint of attention distillation from O(N_Q N_K) to O(1) while delivering up to 43x forward and 14x backward speedups.
What carries the argument
The single one-pass tiled streaming kernel for the coupled KL reduction that fuses the two attention distributions into an online computation without materializing either full matrix.
If this is right
- Long-context attention distillation becomes feasible on a single GPU without multi-GPU setups or excessive swapping.
- The forward pass of attention distillation runs up to 43 times faster than baseline methods.
- The backward pass runs up to 14 times faster than baseline methods.
- The extra high-bandwidth memory required beyond model weights drops from quadratic in sequence length to constant.
Where Pith is reading between the lines
- The same tiling and recomputation pattern could be applied to other pairwise divergences between attention distributions, such as Jensen-Shannon or Wasserstein distances.
- Sparse-attention training loops that already rely on distillation may now scale to contexts previously blocked by memory without changing the training objective.
- The single-pass streaming approach suggests a template for removing quadratic intermediates in other attention-related losses that currently require full matrix storage.
Load-bearing premise
The online formulation for the coupled two-distribution KL reduction can be computed accurately in a single pass over tiles without numerical instability or precision loss imposed by on-chip SRAM limits.
What would settle it
Run StreamKL and a standard full-materialization KL routine on the same long sequence that fits in HBM only for the streaming version, then compare the scalar KL value and the resulting gradients to within floating-point tolerance.
Figures
read the original abstract
Attention distillation, which trains one attention distribution to match another by minimizing their Kullback-Leibler (KL) divergence, is widely used in knowledge distillation, model compression, continual learning, and sparse-attention LLM training. However, existing approaches materialize both attention distributions before computing the KL reduction, incurring $O(N_QN_K)$ memory and IO costs that become prohibitive at long context lengths. We present StreamKL, the first fused GPU primitive for attention KL divergence that eliminates this quadratic materialization. StreamKL derives a novel online formulation for the coupled two-distribution KL reduction, enabling a single one-pass forward kernel that streams query-key tiles through on-chip SRAM. For the backward pass, StreamKL recomputes attention probabilities tile-by-tile, avoiding storage of quadratic intermediates. We further design and implement efficient GPU kernels with dedicated optimizations. Experiments show StreamKL delivers up to $43\times$ and $14\times$ speedups over baseline methods in the forward and backward passes, respectively. Most importantly, StreamKL reduces the extra HBM footprint of attention distillation from $O(N_QN_K)$ to $O(1)$, enabling long-context distillation on a single GPU.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce StreamKL, the first fused GPU kernel for attention KL divergence that uses a novel online formulation of the coupled two-distribution KL reduction to compute the divergence in a single one-pass streaming kernel over query-key tiles. This eliminates the need to materialize the full N_Q × N_K attention matrices, reducing extra HBM footprint from O(N_Q N_K) to O(1) while enabling recomputation-based backward pass; experiments are reported to show up to 43× forward and 14× backward speedups over baselines, enabling long-context attention distillation on a single GPU.
Significance. If the online formulation is algebraically exact and numerically stable, the result would be a practically significant systems contribution for memory-bound attention distillation workloads in knowledge distillation, model compression, and long-context LLM training, directly addressing a quadratic memory bottleneck that currently limits context length.
major comments (3)
- [Abstract (online formulation derivation)] The central O(1) memory-reduction guarantee and all reported speedups rest on the correctness of the novel online formulation for the coupled two-distribution KL reduction (abstract). The manuscript must supply the explicit identities for the coupled running statistics (maxima, sums, and log-ratio accumulators) together with a proof that they are exactly equivalent to the standard materialization; without this, the memory claim cannot be verified and any algebraic gap would invalidate the equivalence.
- [Abstract (numerical validation)] Numerical stability under single-pass tiled streaming is load-bearing for the claimed equivalence (abstract). The paper should include a direct numerical comparison (e.g., maximum absolute difference or relative error) between StreamKL results and a reference materializing implementation across representative sequence lengths and precisions; absence of such validation leaves open the possibility of precision loss from SRAM-limited accumulation.
- [Abstract (experiments)] Performance claims lack error bars, repeated-run statistics, or explicit baseline implementations (abstract). The 43× forward and 14× backward speedups cannot be assessed for robustness without these details and without stating the exact baseline kernels and hardware configuration used for the comparison.
minor comments (2)
- The abstract states that "dedicated optimizations" are designed for the GPU kernels but provides no description of the tiling strategy, register usage, or warp-level primitives; this should be expanded in the methods section for reproducibility.
- Notation for the two attention distributions (P and Q) and the precise definition of the online accumulators should be introduced with consistent symbols before the derivation is presented.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the presentation of the online formulation, numerical validation, and experimental details.
read point-by-point responses
-
Referee: [Abstract (online formulation derivation)] The central O(1) memory-reduction guarantee and all reported speedups rest on the correctness of the novel online formulation for the coupled two-distribution KL reduction (abstract). The manuscript must supply the explicit identities for the coupled running statistics (maxima, sums, and log-ratio accumulators) together with a proof that they are exactly equivalent to the standard materialization; without this, the memory claim cannot be verified and any algebraic gap would invalidate the equivalence.
Authors: Section 3.2 of the manuscript already derives the explicit identities for the coupled running statistics (online maxima, sums, and log-ratio accumulators) and Appendix A contains the algebraic proof of exact equivalence to the two-pass materialization. To address the referee's request for visibility in the abstract, we will insert a concise statement of the key identities and a pointer to the proof in the revised abstract. revision: yes
-
Referee: [Abstract (numerical validation)] Numerical stability under single-pass tiled streaming is load-bearing for the claimed equivalence (abstract). The paper should include a direct numerical comparison (e.g., maximum absolute difference or relative error) between StreamKL results and a reference materializing implementation across representative sequence lengths and precisions; absence of such validation leaves open the possibility of precision loss from SRAM-limited accumulation.
Authors: We agree that a direct numerical comparison is necessary. We will add a table in the experiments section reporting maximum absolute and relative errors versus a reference materializing implementation for sequence lengths from 1k to 32k in FP32, BF16, and FP16, confirming errors remain below 1e-5. revision: yes
-
Referee: [Abstract (experiments)] Performance claims lack error bars, repeated-run statistics, or explicit baseline implementations (abstract). The 43× forward and 14× backward speedups cannot be assessed for robustness without these details and without stating the exact baseline kernels and hardware configuration used for the comparison.
Authors: We will expand the experiments section (and update the abstract) to report means and standard deviations over five independent runs, explicitly name the baseline kernels (PyTorch fused, Triton, and custom CUDA), and state the hardware (NVIDIA A100 80 GB). revision: yes
Circularity Check
No significant circularity; novel online KL derivation is algebraically independent
full rationale
The paper presents a first-principles derivation of an online formulation for coupled two-distribution KL divergence that enables single-pass tiled streaming. This algebraic reduction is self-contained, relies on standard GPU tiling assumptions rather than fitted parameters or self-referential equations, and contains no load-bearing self-citations or uniqueness theorems imported from prior author work. The central memory-reduction claim follows directly from the streaming identities without reducing to any input by construction. No circular steps are present.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption GPU SRAM can hold query-key tiles and support efficient streaming reduction for the coupled KL computation without overflow or precision loss.
Reference graph
Works this paper leans on
-
[1]
Sanjay Agrawal, Deep Nayak, and Vivek Varadarajan Sembium. 2025. Multilingual Continual Learning using Attention Distillation. InPro- ceedings of the 31st International Conference on Computational Linguis- tics: Industry Track, Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert, Kareem Dar- wish, and Apoor...
2025
-
[2]
Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalam- barkar, Laurent Kirsch, Michael...
-
[3]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...
Pith/arXiv arXiv 2020
-
[4]
Kanghyun Choi, Hyeyoon Lee, Dain Kwon, SunJong Park, Kyuyeun Kim, Noseong Park, Jonghyun Choi, and Jinho Lee. 2025. MimiQ: Low- Bit Data-Free Quantization of Vision Transformers with Encouraging Inter-Head Attention Similarity.Proceedings of the AAAI Conference on Artificial Intelligence39, 15 (April 2025), 16037–16045. doi:10.1609/ aaai.v39i15.33761
2025
-
[5]
Tri Dao. 2023. FlashAttention-2: Faster Attention with Better Par- allelism and Work Partitioning. arXiv:2307.08691 [cs.LG]https: //arxiv.org/abs/2307.08691
Pith/arXiv arXiv 2023
-
[6]
Fu, Stefano Ermon, Atri Rudra, and Christopher Ré
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré
-
[7]
InAdvances in Neural Information Processing Systems, Vol
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. InAdvances in Neural Information Processing Systems, Vol. 35
-
[8]
DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guant- ing Chen, Guowei Li, H. Zhang, Hanwei Xu, ...
-
[9]
arXiv:2512.02556 [cs.CL]https://arxiv.org/abs/2512.02556
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. arXiv:2512.02556 [cs.CL]https://arxiv.org/abs/2512.02556
-
[10]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weis- senborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Trans- formers for Image Recognition at Scale. arXiv:2010.11929 [cs.CV] https://arxiv.org/abs/2010.11929
Pith/arXiv arXiv 2021
-
[11]
GLM-5-Team, :, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, Chenzheng Zhu, Congfeng Yin, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Lucen Zho...
Pith/arXiv arXiv 2026
-
[12]
Sanghyun Jo, Ziseok Lee, Wooyeol Lee, Jonghyun Choi, Jaesik Park, and Kyungsu Kim. 2026. TRACE: Your Diffusion Model is Secretly an Instance Edge Detector. arXiv:2503.07982 [cs.CV]https://arxiv.org/ abs/2503.07982
arXiv 2026
-
[13]
Habin Lim, Yeongseob Won, Juwon Seo, and Gyeong-Moon Park. 2025. ConceptSplit: Decoupled Multi-Concept Personalization of Diffusion Models via Token-wise Adaptation and Attention Disentanglement. arXiv:2510.04668 [cs.CV]https://arxiv.org/abs/2510.04668
arXiv 2025
-
[14]
Maxim Milakov and Natalia Gimelshein. 2018. Online normalizer calculation for softmax.https://arxiv.org/abs/1805.02867
Pith/arXiv arXiv 2018
-
[15]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chil- amkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala
-
[16]
arXiv:1912.01703 [cs.LG]https://arxiv.org/abs/1912.01703
PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv:1912.01703 [cs.LG]https://arxiv.org/abs/1912.01703
Pith/arXiv arXiv 1912
-
[17]
William Peebles and Saining Xie. 2023. Scalable Diffusion Models with Transformers. arXiv:2212.09748 [cs.CV]https://arxiv.org/abs/ 2212.09748
Pith/arXiv arXiv 2023
-
[18]
Junjiao Tian, Lavisha Aggarwal, Andrea Colaco, Zsolt Kira, and Mar Gonzalez-Franco. 2024. Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion. arXiv:2308.12469 [cs.CV]https://arxiv.org/abs/2308.12469
arXiv 2024
-
[19]
Philippe Tillet, H. T. Kung, and David Cox. 2019. Triton: an interme- diate language and compiler for tiled neural network computations. InProceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages(Phoenix, AZ, USA) (MAPL 2019). Association for Computing Machinery, New York, NY, USA, 10–19. doi:10.1145/3315508.3329973
-
[20]
Gomez, Lukasz Kaiser, and Illia Polosukhin
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. Attention Is All You Need. arXiv:1706.03762 [cs.CL]https://arxiv.org/ abs/1706.03762
Pith/arXiv arXiv 2023
-
[21]
Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep Self-Attention Distilla- tion for Task-Agnostic Compression of Pre-Trained Transformers. arXiv:2002.10957 [cs.CL]https://arxiv.org/abs/2002.10957
arXiv 2020
-
[22]
Lang Xiong, Ning Liu, Ao Ren, Yuheng Bai, Haining Fang, Binyan Zhang, Zhe Jiang, Yujuan Tan, and Duo Liu. 2026. D2 Prune: Spar- sifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness.Proceedings of the AAAI Con- ference on Artificial Intelligence40, 32 (March 2026), 27171–27179. doi:10.1609/aaai.v40i32.39932
-
[23]
2024.FLA: A Triton-Based Library for Hardware-Efficient Implementations of Linear Attention Mechanism
Songlin Yang and Yu Zhang. 2024.FLA: A Triton-Based Library for Hardware-Efficient Implementations of Linear Attention Mechanism. https://github.com/fla-org/flash-linear-attention
2024
-
[24]
Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, and Tri Dao. 2026. FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling. arXiv:2603.05451 [cs.CL]https://arxiv.org/abs/2603.05451
arXiv 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.