StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation

Chengwei Li; Danning Ke; Guangda Liu; Jieru Zhao; Jing Lin; Wenchao Ding; Wenhao Chen; Yiquan Wang; Yiwu Yao

arxiv: 2606.20005 · v1 · pith:67MATKXVnew · submitted 2026-06-18 · 💻 cs.LG · cs.AI

StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation

Guangda Liu , Yiquan Wang , Chengwei Li , Wenhao Chen , Jing Lin , Yiwu Yao , Danning Ke , Wenchao Ding

show 1 more author

Jieru Zhao

This is my paper

Pith reviewed 2026-06-26 18:00 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords attention distillationKL divergenceGPU kernelmemory efficiencylong contextknowledge distillationattention mechanismsfused kernel

0 comments

The pith

StreamKL reduces the memory cost of attention KL divergence from quadratic in sequence length to constant by streaming a fused online computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard attention distillation materializes both source and target attention distributions before their KL divergence is computed, which costs O(N_Q N_K) high-bandwidth memory and becomes impossible for long contexts. StreamKL replaces that materialization with a new online formulation of the coupled two-distribution KL that lets a single forward kernel stream query-key tiles through on-chip SRAM. The backward pass recomputes the needed probabilities tile-by-tile instead of storing them. The result is an O(1) extra memory footprint together with reported speedups of 43x forward and 14x backward. If the formulation is numerically stable, long-context attention distillation becomes practical on a single GPU.

Core claim

StreamKL derives a novel online formulation for the coupled two-distribution KL reduction, enabling a single one-pass forward kernel that streams query-key tiles through on-chip SRAM. For the backward pass, StreamKL recomputes attention probabilities tile-by-tile, avoiding storage of quadratic intermediates. Experiments show this fused GPU primitive reduces the extra HBM footprint of attention distillation from O(N_Q N_K) to O(1) while delivering up to 43x forward and 14x backward speedups.

What carries the argument

The single one-pass tiled streaming kernel for the coupled KL reduction that fuses the two attention distributions into an online computation without materializing either full matrix.

If this is right

Long-context attention distillation becomes feasible on a single GPU without multi-GPU setups or excessive swapping.
The forward pass of attention distillation runs up to 43 times faster than baseline methods.
The backward pass runs up to 14 times faster than baseline methods.
The extra high-bandwidth memory required beyond model weights drops from quadratic in sequence length to constant.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same tiling and recomputation pattern could be applied to other pairwise divergences between attention distributions, such as Jensen-Shannon or Wasserstein distances.
Sparse-attention training loops that already rely on distillation may now scale to contexts previously blocked by memory without changing the training objective.
The single-pass streaming approach suggests a template for removing quadratic intermediates in other attention-related losses that currently require full matrix storage.

Load-bearing premise

The online formulation for the coupled two-distribution KL reduction can be computed accurately in a single pass over tiles without numerical instability or precision loss imposed by on-chip SRAM limits.

What would settle it

Run StreamKL and a standard full-materialization KL routine on the same long sequence that fits in HBM only for the streaming version, then compare the scalar KL value and the resulting gradients to within floating-point tolerance.

Figures

Figures reproduced from arXiv: 2606.20005 by Chengwei Li, Danning Ke, Guangda Liu, Jieru Zhao, Jing Lin, Wenchao Ding, Wenhao Chen, Yiquan Wang, Yiwu Yao.

**Figure 1.** Figure 1: Overview of StreamKL. (a) Vanilla attention distillation materializes full 𝑃1, 𝑃2 in HBM, costing 𝑂(𝑁𝑄 𝑁𝐾) memory and IO. (b) StreamKL fuses the computation into a one-pass tiled kernel that computes KL online in SRAM without materializing 𝑃1 or 𝑃2, reducing extra HBM to 𝑂(1). through a weighted logit-difference term, and how to correctly rescale the accumulated logit-difference as the running maxima of … view at source ↗

**Figure 2.** Figure 2: Latency-memory trade-off of chunked attention KL divergence, with 𝑁𝑄 = 𝑁𝐾 = 128K and batch size 32. FlashAttention. Modern GPUs feature a memory hierarchy comprising a small but fast on-chip SRAM and a large but slow off-chip HBM. Standard attention computes the 𝑁𝑄 ×𝑁𝐾 attention matrix 𝑆 = 𝑄𝐾𝑇 , writes 𝑃 = softmax(𝑆) to HBM, and then reads it back to compute the output 𝑂 = 𝑃𝑉 . This requires 𝑂(𝑁𝑄 𝑁𝐾) HBM … view at source ↗

**Figure 3.** Figure 3: Normalized latency and HBM footprint of attention KL divergence, vanilla attention, and FlashAttention (H200, batch 32). Hatched bars denote OOM with extrapolation. of HBM, which is 3.6 times the capacity of a single H200 (141 GB). In sparse-attention LLM training, context lengths can reach 128K, 256K, or even 1M [7, 9], further amplifying the footprint. Even at shorter contexts where both distributions f… view at source ↗

**Figure 4.** Figure 4: Forward kernel design. (a) Default kernel on a (bsz,𝑇𝑄 ) grid. (b) Split-K variant: a third grid dimension 𝑊 partitions the 𝐾 dimension; each block writes its partial statistics to HBM and a lightweight reduce merges them. Causal masking. When causal masking is applied to the attention distributions, StreamKL exploits the triangular attention structure to avoid unnecessary computation. Costly element-wise … view at source ↗

**Figure 5.** Figure 5: Backward kernel design. (a) Separate strategy: two kernels for 𝑑𝑄 and 𝑑𝐾 on (bsz,𝑇𝑄 ) and (bsz,𝑇𝐾) grids respectively; each owns its output tile but reads inputs and recomputes logit tiles twice. (b) Fused strategy: a single (bsz,𝑇𝐾) grid where each thread block owns one 𝐾 tile, accumulates 𝑑𝐾 in registers, and writes 𝑑𝑄 via atomic_add. Numerically stable log-ratio. Computing 𝑟 𝑖 = log 𝑃 𝑖 1 − log 𝑃 𝑖 2 … view at source ↗

**Figure 6.** Figure 6: Peak HBM footprint of the non-causal forward pass across context lengths (𝑁𝑄 = 𝑁𝐾, batch 16, log-scale 𝑥-axis). Hatched bars denote OOM cases with extrapolation. a predefined search space; the auto-tuner profiles each candidate for a given problem shape (𝑁𝑄, 𝑁𝐾, 𝑑1, 𝑑2) and caches the fastest configuration for later use. 6 Evaluation 6.1 Setup Hardware. We evaluate StreamKL on NVIDIA GPUs spanning two arc… view at source ↗

**Figure 7.** Figure 7: Forward latency on H200 and A100. Each 𝑥-axis label is a (batch size, 𝑁𝑄 ) configuration with 𝑁𝑄 = 𝑁𝐾. Hatched bars denote OOM cases with extrapolation; numbers are in ms unless suffixed. 6.2 Forward Pass We compare StreamKL against the baselines on forward pass peak HBM footprint and latency, sweeping context lengths from 4K to 512K (𝑁𝑄 = 𝑁𝐾) and batch sizes in {16, 32} under both causal and non-causal ma… view at source ↗

**Figure 8.** Figure 8: Peak HBM footprint of the non-causal backward pass under Setting 1 across context lengths (𝑁𝑄 = 𝑁𝐾, batch 16). Hatched bars denote OOM cases with extrapolation. stems from StreamKL’s ability to skip masked computation at the kernel level, whereas the baselines still materialize the full 𝑁𝑄 × 𝑁𝐾 attention matrix and apply the mask post hoc, paying even more compute and IO cost than the non-causal case. The … view at source ↗

**Figure 9.** Figure 9: Backward (Setting 1) latency on H200 and A100. Each 𝑥-axis label is a (batch size, 𝑁𝑄 ) configuration with 𝑁𝑄 = 𝑁𝐾. Hatched bars denote OOM cases with extrapolation; numbers are in ms unless suffixed. PyTorch and roughly 500× less than torch.compile/FLA, with the gap doubling for every doubling of 𝑁𝑄 . Latency [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗

**Figure 10.** Figure 10: Peak HBM footprint of the non-causal backward pass under Setting 2 across context lengths (𝑁𝑄 = 𝑁𝐾, batch 16). Hatched bars denote OOM cases with extrapolation. only addition is the per-row KL scalar 𝐿 used by the kernel. However, torch.compile consumes roughly 2× more HBM than in Setting 1 because the autograd tape now has to save additional 𝑂(𝑁𝑄 𝑁𝐾) intermediates (𝑃1 and log 𝑃1), and the log-ratio 𝑟 to … view at source ↗

**Figure 11.** Figure 11: Backward (Setting 2) latency on H200 and A100. Each 𝑥-axis label is a (batch size, 𝑁𝑄 ) configuration with 𝑁𝑄 = 𝑁𝐾. Hatched bars denote OOM cases with extrapolation; numbers are in ms unless suffixed. (1, 64K) (1, 128K) (1, 256K) (1, 512K) (16, 64K) (16, 128K) (16, 256K) (16, 512K) (32, 64K) (32, 128K) (32, 256K) (32, 512K) (64, 64K) (64, 128K) (64, 256K) (64, 512K) 0 2 4 6 8 Normalized Latency Default Sp… view at source ↗

**Figure 13.** Figure 13: Separate vs. fused backward kernel (non-causal, Setting 1, batch size 16). Each 𝑥-axis label is a (𝑁𝑄 ,𝑁𝐾) configuration. Numbers are absolute latencies in ms. 6.4.2 Separate/Fused Kernels for Backward [PITH_FULL_IMAGE:figures/full_fig_p011_13.png] view at source ↗

read the original abstract

Attention distillation, which trains one attention distribution to match another by minimizing their Kullback-Leibler (KL) divergence, is widely used in knowledge distillation, model compression, continual learning, and sparse-attention LLM training. However, existing approaches materialize both attention distributions before computing the KL reduction, incurring $O(N_QN_K)$ memory and IO costs that become prohibitive at long context lengths. We present StreamKL, the first fused GPU primitive for attention KL divergence that eliminates this quadratic materialization. StreamKL derives a novel online formulation for the coupled two-distribution KL reduction, enabling a single one-pass forward kernel that streams query-key tiles through on-chip SRAM. For the backward pass, StreamKL recomputes attention probabilities tile-by-tile, avoiding storage of quadratic intermediates. We further design and implement efficient GPU kernels with dedicated optimizations. Experiments show StreamKL delivers up to $43\times$ and $14\times$ speedups over baseline methods in the forward and backward passes, respectively. Most importantly, StreamKL reduces the extra HBM footprint of attention distillation from $O(N_QN_K)$ to $O(1)$, enabling long-context distillation on a single GPU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

StreamKL gives a practical fused kernel that drops the quadratic memory hit in attention distillation to constant extra space, but the online KL derivation needs a close look for numerical equivalence.

read the letter

The core claim is that they have a single-pass tiled kernel for KL between two attention maps that never writes out the full N_Q by N_K matrices. That removes the main memory blocker for long-context distillation on one GPU.

What is new is the online reduction for the coupled two-distribution KL. They keep running maxes, sums, and log-ratio accumulators that update from each query-key tile in SRAM, then handle the backward by recomputing the probabilities tile by tile. The reported 43x forward and 14x backward speedups come from avoiding the materialization and the associated HBM traffic. If the algebra holds, this is a direct systems win for anyone doing attention-based distillation or sparse training.

The soft spot is verification. The abstract states the speedups and the O(1) footprint but gives no error bars, no direct numerical comparison to the materializing baseline, and no discussion of floating-point drift in the single-pass accumulators. The stress-test concern about whether the coupled statistics stay exact under SRAM precision limits is reasonable; any gap there would mean the kernel computes something different from standard KL, which would undermine the memory-reduction guarantee. Without the full derivation and a small exact-vs-approx check in the paper, it is hard to judge how tight the equivalence is.

This is a systems paper aimed at people who already run attention distillation and hit the memory wall at longer contexts. A reader who needs the kernel or wants to build on the tiling trick will get value; someone looking for new theory on KL will not. The work is coherent on its own terms and shows clear engineering effort, so it deserves a serious referee even if the numerical validation needs strengthening.

Referee Report

3 major / 2 minor

Summary. The paper claims to introduce StreamKL, the first fused GPU kernel for attention KL divergence that uses a novel online formulation of the coupled two-distribution KL reduction to compute the divergence in a single one-pass streaming kernel over query-key tiles. This eliminates the need to materialize the full N_Q × N_K attention matrices, reducing extra HBM footprint from O(N_Q N_K) to O(1) while enabling recomputation-based backward pass; experiments are reported to show up to 43× forward and 14× backward speedups over baselines, enabling long-context attention distillation on a single GPU.

Significance. If the online formulation is algebraically exact and numerically stable, the result would be a practically significant systems contribution for memory-bound attention distillation workloads in knowledge distillation, model compression, and long-context LLM training, directly addressing a quadratic memory bottleneck that currently limits context length.

major comments (3)

[Abstract (online formulation derivation)] The central O(1) memory-reduction guarantee and all reported speedups rest on the correctness of the novel online formulation for the coupled two-distribution KL reduction (abstract). The manuscript must supply the explicit identities for the coupled running statistics (maxima, sums, and log-ratio accumulators) together with a proof that they are exactly equivalent to the standard materialization; without this, the memory claim cannot be verified and any algebraic gap would invalidate the equivalence.
[Abstract (numerical validation)] Numerical stability under single-pass tiled streaming is load-bearing for the claimed equivalence (abstract). The paper should include a direct numerical comparison (e.g., maximum absolute difference or relative error) between StreamKL results and a reference materializing implementation across representative sequence lengths and precisions; absence of such validation leaves open the possibility of precision loss from SRAM-limited accumulation.
[Abstract (experiments)] Performance claims lack error bars, repeated-run statistics, or explicit baseline implementations (abstract). The 43× forward and 14× backward speedups cannot be assessed for robustness without these details and without stating the exact baseline kernels and hardware configuration used for the comparison.

minor comments (2)

The abstract states that "dedicated optimizations" are designed for the GPU kernels but provides no description of the tiling strategy, register usage, or warp-level primitives; this should be expanded in the methods section for reproducibility.
Notation for the two attention distributions (P and Q) and the precise definition of the online accumulators should be introduced with consistent symbols before the derivation is presented.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the presentation of the online formulation, numerical validation, and experimental details.

read point-by-point responses

Referee: [Abstract (online formulation derivation)] The central O(1) memory-reduction guarantee and all reported speedups rest on the correctness of the novel online formulation for the coupled two-distribution KL reduction (abstract). The manuscript must supply the explicit identities for the coupled running statistics (maxima, sums, and log-ratio accumulators) together with a proof that they are exactly equivalent to the standard materialization; without this, the memory claim cannot be verified and any algebraic gap would invalidate the equivalence.

Authors: Section 3.2 of the manuscript already derives the explicit identities for the coupled running statistics (online maxima, sums, and log-ratio accumulators) and Appendix A contains the algebraic proof of exact equivalence to the two-pass materialization. To address the referee's request for visibility in the abstract, we will insert a concise statement of the key identities and a pointer to the proof in the revised abstract. revision: yes
Referee: [Abstract (numerical validation)] Numerical stability under single-pass tiled streaming is load-bearing for the claimed equivalence (abstract). The paper should include a direct numerical comparison (e.g., maximum absolute difference or relative error) between StreamKL results and a reference materializing implementation across representative sequence lengths and precisions; absence of such validation leaves open the possibility of precision loss from SRAM-limited accumulation.

Authors: We agree that a direct numerical comparison is necessary. We will add a table in the experiments section reporting maximum absolute and relative errors versus a reference materializing implementation for sequence lengths from 1k to 32k in FP32, BF16, and FP16, confirming errors remain below 1e-5. revision: yes
Referee: [Abstract (experiments)] Performance claims lack error bars, repeated-run statistics, or explicit baseline implementations (abstract). The 43× forward and 14× backward speedups cannot be assessed for robustness without these details and without stating the exact baseline kernels and hardware configuration used for the comparison.

Authors: We will expand the experiments section (and update the abstract) to report means and standard deviations over five independent runs, explicitly name the baseline kernels (PyTorch fused, Triton, and custom CUDA), and state the hardware (NVIDIA A100 80 GB). revision: yes

Circularity Check

0 steps flagged

No significant circularity; novel online KL derivation is algebraically independent

full rationale

The paper presents a first-principles derivation of an online formulation for coupled two-distribution KL divergence that enables single-pass tiled streaming. This algebraic reduction is self-contained, relies on standard GPU tiling assumptions rather than fitted parameters or self-referential equations, and contains no load-bearing self-citations or uniqueness theorems imported from prior author work. The central memory-reduction claim follows directly from the streaming identities without reducing to any input by construction. No circular steps are present.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No free parameters or invented entities are introduced. The work relies on standard GPU memory-hierarchy assumptions and the correctness of the derived online KL formulation.

axioms (1)

domain assumption GPU SRAM can hold query-key tiles and support efficient streaming reduction for the coupled KL computation without overflow or precision loss.
Invoked in the description of the one-pass forward kernel and tile-by-tile backward recomputation.

pith-pipeline@v0.9.1-grok · 5766 in / 1139 out tokens · 27990 ms · 2026-06-26T18:00:12.277958+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 3 canonical work pages

[1]

Sanjay Agrawal, Deep Nayak, and Vivek Varadarajan Sembium. 2025. Multilingual Continual Learning using Attention Distillation. InPro- ceedings of the 31st International Conference on Computational Linguis- tics: Industry Track, Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert, Kareem Dar- wish, and Apoor...

2025
[2]

Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalam- barkar, Laurent Kirsch, Michael...

work page doi:10.1145/3620665.3640366 2024
[3]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

Pith/arXiv arXiv 2020
[4]

Kanghyun Choi, Hyeyoon Lee, Dain Kwon, SunJong Park, Kyuyeun Kim, Noseong Park, Jonghyun Choi, and Jinho Lee. 2025. MimiQ: Low- Bit Data-Free Quantization of Vision Transformers with Encouraging Inter-Head Attention Similarity.Proceedings of the AAAI Conference on Artificial Intelligence39, 15 (April 2025), 16037–16045. doi:10.1609/ aaai.v39i15.33761

2025
[5]

Tri Dao. 2023. FlashAttention-2: Faster Attention with Better Par- allelism and Work Partitioning. arXiv:2307.08691 [cs.LG]https: //arxiv.org/abs/2307.08691

Pith/arXiv arXiv 2023
[6]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré
[7]

InAdvances in Neural Information Processing Systems, Vol

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. InAdvances in Neural Information Processing Systems, Vol. 35
[8]

DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guant- ing Chen, Guowei Li, H. Zhang, Hanwei Xu, ...
[9]

arXiv:2512.02556 [cs.CL]https://arxiv.org/abs/2512.02556

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. arXiv:2512.02556 [cs.CL]https://arxiv.org/abs/2512.02556

Pith/arXiv arXiv
[10]

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weis- senborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Trans- formers for Image Recognition at Scale. arXiv:2010.11929 [cs.CV] https://arxiv.org/abs/2010.11929

Pith/arXiv arXiv 2021
[11]

GLM-5-Team, :, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, Chenzheng Zhu, Congfeng Yin, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Lucen Zho...

Pith/arXiv arXiv 2026
[12]

Sanghyun Jo, Ziseok Lee, Wooyeol Lee, Jonghyun Choi, Jaesik Park, and Kyungsu Kim. 2026. TRACE: Your Diffusion Model is Secretly an Instance Edge Detector. arXiv:2503.07982 [cs.CV]https://arxiv.org/ abs/2503.07982

arXiv 2026
[13]

Habin Lim, Yeongseob Won, Juwon Seo, and Gyeong-Moon Park. 2025. ConceptSplit: Decoupled Multi-Concept Personalization of Diffusion Models via Token-wise Adaptation and Attention Disentanglement. arXiv:2510.04668 [cs.CV]https://arxiv.org/abs/2510.04668

arXiv 2025
[14]

Maxim Milakov and Natalia Gimelshein. 2018. Online normalizer calculation for softmax.https://arxiv.org/abs/1805.02867

Pith/arXiv arXiv 2018
[15]

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chil- amkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala
[16]

arXiv:1912.01703 [cs.LG]https://arxiv.org/abs/1912.01703

PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv:1912.01703 [cs.LG]https://arxiv.org/abs/1912.01703

Pith/arXiv arXiv 1912
[17]

William Peebles and Saining Xie. 2023. Scalable Diffusion Models with Transformers. arXiv:2212.09748 [cs.CV]https://arxiv.org/abs/ 2212.09748

Pith/arXiv arXiv 2023
[18]

Junjiao Tian, Lavisha Aggarwal, Andrea Colaco, Zsolt Kira, and Mar Gonzalez-Franco. 2024. Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion. arXiv:2308.12469 [cs.CV]https://arxiv.org/abs/2308.12469

arXiv 2024
[19]

Philippe Tillet, H. T. Kung, and David Cox. 2019. Triton: an interme- diate language and compiler for tiled neural network computations. InProceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages(Phoenix, AZ, USA) (MAPL 2019). Association for Computing Machinery, New York, NY, USA, 10–19. doi:10.1145/3315508.3329973

work page doi:10.1145/3315508.3329973 2019
[20]

Gomez, Lukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. Attention Is All You Need. arXiv:1706.03762 [cs.CL]https://arxiv.org/ abs/1706.03762

Pith/arXiv arXiv 2023
[21]

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep Self-Attention Distilla- tion for Task-Agnostic Compression of Pre-Trained Transformers. arXiv:2002.10957 [cs.CL]https://arxiv.org/abs/2002.10957

arXiv 2020
[22]

Lang Xiong, Ning Liu, Ao Ren, Yuheng Bai, Haining Fang, Binyan Zhang, Zhe Jiang, Yujuan Tan, and Duo Liu. 2026. D2 Prune: Spar- sifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness.Proceedings of the AAAI Con- ference on Artificial Intelligence40, 32 (March 2026), 27171–27179. doi:10.1609/aaai.v40i32.39932

work page doi:10.1609/aaai.v40i32.39932 2026
[23]

2024.FLA: A Triton-Based Library for Hardware-Efficient Implementations of Linear Attention Mechanism

Songlin Yang and Yu Zhang. 2024.FLA: A Triton-Based Library for Hardware-Efficient Implementations of Linear Attention Mechanism. https://github.com/fla-org/flash-linear-attention

2024
[24]

Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, and Tri Dao. 2026. FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling. arXiv:2603.05451 [cs.CL]https://arxiv.org/abs/2603.05451

arXiv 2026

[1] [1]

Sanjay Agrawal, Deep Nayak, and Vivek Varadarajan Sembium. 2025. Multilingual Continual Learning using Attention Distillation. InPro- ceedings of the 31st International Conference on Computational Linguis- tics: Industry Track, Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert, Kareem Dar- wish, and Apoor...

2025

[2] [2]

Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalam- barkar, Laurent Kirsch, Michael...

work page doi:10.1145/3620665.3640366 2024

[3] [3]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

Pith/arXiv arXiv 2020

[4] [4]

Kanghyun Choi, Hyeyoon Lee, Dain Kwon, SunJong Park, Kyuyeun Kim, Noseong Park, Jonghyun Choi, and Jinho Lee. 2025. MimiQ: Low- Bit Data-Free Quantization of Vision Transformers with Encouraging Inter-Head Attention Similarity.Proceedings of the AAAI Conference on Artificial Intelligence39, 15 (April 2025), 16037–16045. doi:10.1609/ aaai.v39i15.33761

2025

[5] [5]

Tri Dao. 2023. FlashAttention-2: Faster Attention with Better Par- allelism and Work Partitioning. arXiv:2307.08691 [cs.LG]https: //arxiv.org/abs/2307.08691

Pith/arXiv arXiv 2023

[6] [6]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

[7] [7]

InAdvances in Neural Information Processing Systems, Vol

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. InAdvances in Neural Information Processing Systems, Vol. 35

[8] [8]

DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guant- ing Chen, Guowei Li, H. Zhang, Hanwei Xu, ...

[9] [9]

arXiv:2512.02556 [cs.CL]https://arxiv.org/abs/2512.02556

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. arXiv:2512.02556 [cs.CL]https://arxiv.org/abs/2512.02556

Pith/arXiv arXiv

[10] [10]

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weis- senborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Trans- formers for Image Recognition at Scale. arXiv:2010.11929 [cs.CV] https://arxiv.org/abs/2010.11929

Pith/arXiv arXiv 2021

[11] [11]

GLM-5-Team, :, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, Chenzheng Zhu, Congfeng Yin, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Lucen Zho...

Pith/arXiv arXiv 2026

[12] [12]

Sanghyun Jo, Ziseok Lee, Wooyeol Lee, Jonghyun Choi, Jaesik Park, and Kyungsu Kim. 2026. TRACE: Your Diffusion Model is Secretly an Instance Edge Detector. arXiv:2503.07982 [cs.CV]https://arxiv.org/ abs/2503.07982

arXiv 2026

[13] [13]

Habin Lim, Yeongseob Won, Juwon Seo, and Gyeong-Moon Park. 2025. ConceptSplit: Decoupled Multi-Concept Personalization of Diffusion Models via Token-wise Adaptation and Attention Disentanglement. arXiv:2510.04668 [cs.CV]https://arxiv.org/abs/2510.04668

arXiv 2025

[14] [14]

Maxim Milakov and Natalia Gimelshein. 2018. Online normalizer calculation for softmax.https://arxiv.org/abs/1805.02867

Pith/arXiv arXiv 2018

[15] [15]

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chil- amkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala

[16] [16]

arXiv:1912.01703 [cs.LG]https://arxiv.org/abs/1912.01703

PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv:1912.01703 [cs.LG]https://arxiv.org/abs/1912.01703

Pith/arXiv arXiv 1912

[17] [17]

William Peebles and Saining Xie. 2023. Scalable Diffusion Models with Transformers. arXiv:2212.09748 [cs.CV]https://arxiv.org/abs/ 2212.09748

Pith/arXiv arXiv 2023

[18] [18]

Junjiao Tian, Lavisha Aggarwal, Andrea Colaco, Zsolt Kira, and Mar Gonzalez-Franco. 2024. Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion. arXiv:2308.12469 [cs.CV]https://arxiv.org/abs/2308.12469

arXiv 2024

[19] [19]

Philippe Tillet, H. T. Kung, and David Cox. 2019. Triton: an interme- diate language and compiler for tiled neural network computations. InProceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages(Phoenix, AZ, USA) (MAPL 2019). Association for Computing Machinery, New York, NY, USA, 10–19. doi:10.1145/3315508.3329973

work page doi:10.1145/3315508.3329973 2019

[20] [20]

Gomez, Lukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. Attention Is All You Need. arXiv:1706.03762 [cs.CL]https://arxiv.org/ abs/1706.03762

Pith/arXiv arXiv 2023

[21] [21]

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep Self-Attention Distilla- tion for Task-Agnostic Compression of Pre-Trained Transformers. arXiv:2002.10957 [cs.CL]https://arxiv.org/abs/2002.10957

arXiv 2020

[22] [22]

Lang Xiong, Ning Liu, Ao Ren, Yuheng Bai, Haining Fang, Binyan Zhang, Zhe Jiang, Yujuan Tan, and Duo Liu. 2026. D2 Prune: Spar- sifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness.Proceedings of the AAAI Con- ference on Artificial Intelligence40, 32 (March 2026), 27171–27179. doi:10.1609/aaai.v40i32.39932

work page doi:10.1609/aaai.v40i32.39932 2026

[23] [23]

2024.FLA: A Triton-Based Library for Hardware-Efficient Implementations of Linear Attention Mechanism

Songlin Yang and Yu Zhang. 2024.FLA: A Triton-Based Library for Hardware-Efficient Implementations of Linear Attention Mechanism. https://github.com/fla-org/flash-linear-attention

2024

[24] [24]

Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, and Tri Dao. 2026. FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling. arXiv:2603.05451 [cs.CL]https://arxiv.org/abs/2603.05451

arXiv 2026