DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts
Pith reviewed 2026-06-30 20:56 UTC · model grok-4.3
The pith
DualKV eliminates shared-prompt replication in FlashAttention for RL training by processing the prompt once across rollouts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DualKV is the first FlashAttention kernel variant that eliminates shared-prompt replication during RL training via fused CUDA forward and backward kernels iterating over two disjoint KV regions—shared context and per-sequence response—in a single kernel launch, together with data-pipeline repacking that reduces total tokens from N(P+R) to P+NR, while remaining mathematically equivalent to standard attention and introducing no approximation.
What carries the argument
Fused CUDA kernels iterating over two disjoint KV regions (shared prompt and per-sequence responses) in one launch, paired with token repacking to extend the reduction beyond attention.
If this is right
- 1.63-2.09x policy-update speedup on Qwen3-8B with N=32 and 8K context, plus 2x larger micro-batches and MFU rising from 36% to 76%
- 2.47x speedup and 77% MFU for DAPO under the same conditions
- 3.82x policy-update and 3.38x end-to-end step speedup at 30B MoE scale on 16 H100 GPUs
- The token reduction factor rho = N(P+R)/(P+NR) applies to the entire model, not just attention
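As a rough illustration of the last point, the token counts and the reduction factor can be computed directly from the formula. The helper name `token_counts` is hypothetical, and the values P = 8192 and R = 1024 are assumptions chosen for illustration: the source reports N = 32 and an 8K context but does not state R.

```python
def token_counts(n, p, r):
    """Return (replicated tokens, repacked tokens, reduction factor rho)."""
    replicated = n * (p + r)  # standard layout: prompt copied into every sequence
    repacked = p + n * r      # shared-prompt layout: prompt stored once
    return replicated, repacked, replicated / repacked


# Assumed example values: N from the source, P and R chosen for illustration.
replicated, repacked, rho = token_counts(n=32, p=8192, r=1024)
print(replicated, repacked, round(rho, 2))
```

With these assumed values the layout shrinks from 294912 to 40960 tokens, a factor of 7.2. For fixed P and R the factor approaches (P+R)/R as N grows, which is 9 here, so the benefit saturates once the responses dominate the token count.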
Where Pith is reading between the lines
- The same shared-prefix invariance could be exploited in non-RL settings that reuse long common prefixes across many generations.
- Extending the dual-region kernel design to other attention backends or to inference-time batching might yield further gains.
- Measuring wall-clock savings when N grows beyond 32 or context exceeds 8K would test whether the reported scaling holds at larger sizes.
Load-bearing premise
Prompt representations remain identical across all sequences at every layer due to causal masking in decoder-only models.
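The premise can be checked on a toy model. The sketch below is not the paper's code: it assumes a single head and a single layer with Q = K = V = x (no projections), and the names `causal_attention` and `rand_rows` are hypothetical. It appends two different responses to the same prompt and compares the prompt rows of the output.

```python
import math
import random


def causal_attention(x):
    """Single-head causal self-attention with Q = K = V = x (toy setup)."""
    d = len(x[0])
    out = []
    for t, q in enumerate(x):
        # Position t may only attend to positions 0..t.
        scores = [sum(a * b for a, b in zip(q, x[j])) / math.sqrt(d)
                  for j in range(t + 1)]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w * x[j][k] for j, w in enumerate(weights)) / z
                    for k in range(d)])
    return out


def rand_rows(n, d):
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]


random.seed(0)
P, R, D = 4, 3, 8
prompt = rand_rows(P, D)
out_a = causal_attention(prompt + rand_rows(R, D))  # prompt + response A
out_b = causal_attention(prompt + rand_rows(R, D))  # prompt + response B

# Prompt rows never see response tokens, so they are computed from
# identical inputs in both sequences.
prompt_rows_match = out_a[:P] == out_b[:P]
print(prompt_rows_match)
```

Because norms, projections, and MLPs act on each token independently, identical prompt rows going into a layer give identical prompt rows coming out, so the same argument applies layer by layer.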
What would settle it
Run both DualKV and standard FlashAttention on the same inputs and check that all outputs, hidden states, and gradients match exactly.
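A scaled-down version of that check can be sketched in plain Python for the forward pass. This is an illustrative stand-in, not the DualKV kernel: `softmax_attend` plays the role of standard attention over a replicated prompt, and `dual_region_attend` is a hypothetical online-softmax loop over a shared prompt region followed by a per-sequence response region. All names and sizes are assumptions.

```python
import math
import random


def softmax_attend(q, keys, values):
    """Reference: softmax attention of one query over one contiguous KV list."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[c] for wi, v in zip(w, values)) / z
            for c in range(len(q))]


def dual_region_attend(q, regions):
    """Online-softmax pass over several disjoint (keys, values) regions."""
    scale = 1.0 / math.sqrt(len(q))
    running_max, denom = -math.inf, 0.0
    acc = [0.0] * len(q)
    for keys, values in regions:
        for k, v in zip(keys, values):
            s = scale * sum(a * b for a, b in zip(q, k))
            new_max = max(running_max, s)
            correction = math.exp(running_max - new_max)  # rescale earlier sums
            p = math.exp(s - new_max)
            denom = denom * correction + p
            acc = [a * correction + p * x for a, x in zip(acc, v)]
            running_max = new_max
    return [a / denom for a in acc]


def rand_rows(n, d):
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]


random.seed(0)
N, P, R, D = 3, 5, 4, 8
prompt_k, prompt_v = rand_rows(P, D), rand_rows(P, D)  # stored once for all N
max_err = 0.0
for _ in range(N):
    resp_q, resp_k, resp_v = rand_rows(R, D), rand_rows(R, D), rand_rows(R, D)
    for t in range(R):
        # Baseline: prompt KV replicated into this sequence's own KV list.
        ref = softmax_attend(resp_q[t], prompt_k + resp_k[:t + 1],
                             prompt_v + resp_v[:t + 1])
        # Dual-region: shared prompt KV, then this sequence's causal response KV.
        out = dual_region_attend(resp_q[t], [(prompt_k, prompt_v),
                                             (resp_k[:t + 1], resp_v[:t + 1])])
        max_err = max(max_err, max(abs(a - b) for a, b in zip(ref, out)))
print(max_err < 1e-12)
```

The comparison uses a tolerance rather than bitwise equality because the two paths sum the same terms in a different order. A real test against GPU kernels in reduced precision would likely need a looser tolerance and should also compare gradients.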
Original abstract
Modern RL post-training methods such as GRPO and DAPO train on $N$ response sequences of $R$ tokens sampled from a shared prompt of $P$ tokens, but standard FlashAttention replicates all $P$ prompt tokens $N$ times across both forward and backward passes -- duplicating compute and memory on identical hidden states. In large-rollout, long-context RL training ($N \geq 16$, $P \geq 8\text{K}$), this redundancy dominates the policy update cost. We observe that in decoder-only models, causal masking makes prompt representations invariant across sequences at every layer, so all per-token operations (norms, projections, MLP) and attention can process the prompt once -- a property not yet exploited at the kernel level for training. We propose DualKV, the first FlashAttention kernel variant that eliminates shared-prompt replication during RL training, via (1) fused CUDA forward and backward kernels that iterate over two disjoint KV regions -- shared context and per-sequence response -- in a single kernel launch, and (2) a data-pipeline redesign in veRL that repacks $N(P+R)$ tokens into $P+NR$ tokens per micro-batch, extending the token reduction from attention to the entire model by a factor $\rho = N(P+R)/(P+NR)$. DualKV is mathematically equivalent to standard attention and introduces no approximation. On Qwen3-8B GRPO training with $8\times$H100 GPUs ($N=32$, 8K-context), DualKV achieves $1.63$--$2.09\times$ policy-update speedup, enables $2\times$ larger micro-batches, and raises MFU from $36\%$ to $76\%$. Similar gains hold for DAPO ($2.47\times$ speedup, $77\%$ MFU). At 30B MoE scale on $16\times$H100, DualKV achieves $3.82\times$ policy-update and $3.38\times$ end-to-end step speedup over FlashAttention (which requires 4-way Ulysses sequence parallelism to avoid OOM).
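The repacking step in (2) can be pictured with a small sketch. This is not veRL's API: `repack_shared_prompt` and `replicate_prompt` are hypothetical helpers, and the offset and position-id conventions are assumptions made for illustration. The one requirement taken as given is that each response token keeps the position it would have had after its own prompt copy, since equivalence with standard attention depends on that.

```python
def repack_shared_prompt(prompt, responses):
    """Pack one prompt copy followed by every response.

    Returns the packed tokens, segment offsets, and per-token position ids.
    """
    packed = list(prompt)
    positions = list(range(len(prompt)))
    offsets = [0, len(packed)]          # offsets[0:2] bounds the shared prompt
    for resp in responses:
        packed.extend(resp)
        # Each response continues from position P, as if it followed the prompt.
        positions.extend(len(prompt) + t for t in range(len(resp)))
        offsets.append(len(packed))     # end of this response segment
    return packed, offsets, positions


def replicate_prompt(prompt, responses):
    """Baseline layout: every sequence carries its own prompt copy."""
    return [list(prompt) + list(resp) for resp in responses]


prompt = [101, 102, 103, 104]               # P = 4 toy token ids
responses = [[1, 2], [3, 4], [5, 6]]        # N = 3 responses, R = 2 each
packed, offsets, positions = repack_shared_prompt(prompt, responses)
baseline = replicate_prompt(prompt, responses)
print(len(packed), sum(len(s) for s in baseline), offsets)
```

For this toy input the packed layout holds P + NR = 10 tokens against N(P+R) = 18 in the baseline, and the offsets list marks where the shared region ends and where each response segment ends.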
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DualKV, a FlashAttention kernel variant for RL post-training (e.g., GRPO, DAPO) that processes a shared prompt of P tokens only once across N response sequences of R tokens each. It exploits causal masking invariance in decoder-only models to use fused CUDA forward/backward kernels iterating over two disjoint KV regions (shared prompt + per-response) in a single launch, plus a veRL data repacking that reduces total tokens from N(P+R) to P+NR. The method claims exact mathematical equivalence to standard attention (no approximations) and reports 1.63-2.09x policy-update speedups on Qwen3-8B (N=32, 8K context), 2.47x for DAPO, MFU gains to 76-77%, and 3.82x at 30B MoE scale.
Significance. If the equivalence holds and the kernel implementation is verified, DualKV would address a practical redundancy in large-rollout RL training, enabling larger micro-batches and higher utilization without changing the training dynamics. The reported MFU improvements and scaling to MoE models indicate potential impact on efficient post-training pipelines.
major comments (2)
- [Abstract] The central claim of mathematical equivalence to standard attention rests on the fused backward kernel correctly accumulating gradients for the shared prompt tokens from all N sequences. No description, pseudocode, or verification is provided for the accumulation mechanism (e.g., atomics, reductions, or separate passes), leaving open the possibility of omission, double-counting, or scaling errors relative to independent per-sequence gradient computation.
- [Abstract] The enabling observation that prompt representations remain invariant across sequences (allowing single processing of norms, projections, MLP, and attention) is stated without a formal argument or reference to the causal mask structure that would guarantee identical hidden states and gradients at every layer.
Simulated Author's Rebuttal
We thank the referee for their careful review and constructive comments on the clarity of our equivalence claims. We address each major comment below.
- Referee: [Abstract] The central claim of mathematical equivalence to standard attention rests on the fused backward kernel correctly accumulating gradients for the shared prompt tokens from all N sequences. No description, pseudocode, or verification is provided for the accumulation mechanism (e.g., atomics, reductions, or separate passes), leaving open the possibility of omission, double-counting, or scaling errors relative to independent per-sequence gradient computation.
  Authors: We agree the abstract provides insufficient detail on the backward accumulation. The manuscript body (Section 3.2) specifies that the fused CUDA backward kernel uses atomicAdd operations to accumulate each of the N sequences' independent gradient contributions to the shared prompt tokens exactly once. This matches standard per-sequence computation with no double-counting or scaling. We will revise the abstract to briefly describe this mechanism and add pseudocode to an appendix.
  revision: yes
- Referee: [Abstract] The enabling observation that prompt representations remain invariant across sequences (allowing single processing of norms, projections, MLP, and attention) is stated without a formal argument or reference to the causal mask structure that would guarantee identical hidden states and gradients at every layer.
  Authors: We acknowledge the request for a formal argument. The invariance follows from the causal mask: a prompt token at position t < P attends only to prompt tokens at positions up to and including t, which are identical across sequences, so hidden states and gradients are identical layer-wise. We will add a formal argument with explicit reference to the causal mask structure in Section 2 and a proof sketch in the appendix.
  revision: yes
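The accumulation property described in the first response can be sanity-checked on a toy problem. The sketch below is not the paper's backward kernel and uses no atomics. It assumes a single head, one query per sequence, and a loss equal to the sum of attention outputs, and it uses finite differences in place of an analytic backward pass. The names `attend`, `loss`, and `grad_wrt_keys` are hypothetical. It compares the gradient with respect to one shared prompt key tensor against the plain sum of the N gradients obtained when every sequence owns its own copy.

```python
import math
import random


def attend(q, keys, values):
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[c] for wi, v in zip(w, values)) / z for c in range(d)]


def loss(key_copies, queries, values):
    """Sum of attention outputs; sequence i reads key_copies[i]."""
    return sum(sum(attend(q, k, values)) for q, k in zip(queries, key_copies))


def grad_wrt_keys(f, keys, eps=1e-5):
    """Central finite-difference gradient of f() for every entry of keys."""
    grad = [[0.0] * len(row) for row in keys]
    for i, row in enumerate(keys):
        for j in range(len(row)):
            orig = row[j]
            row[j] = orig + eps
            up = f()
            row[j] = orig - eps
            down = f()
            row[j] = orig
            grad[i][j] = (up - down) / (2 * eps)
    return grad


def rand_rows(n, d):
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]


random.seed(0)
N, P, D = 3, 4, 4
queries, values, shared_k = rand_rows(N, D), rand_rows(P, D), rand_rows(P, D)

# Shared layout: one key tensor feeds all N sequences.
shared_grad = grad_wrt_keys(lambda: loss([shared_k] * N, queries, values),
                            shared_k)

# Replicated layout: each sequence owns a copy; add up the N per-copy gradients.
copies = [[row[:] for row in shared_k] for _ in range(N)]
summed = [[0.0] * D for _ in range(P)]
for copy in copies:
    g = grad_wrt_keys(lambda: loss(copies, queries, values), copy)
    summed = [[s + x for s, x in zip(sr, gr)] for sr, gr in zip(summed, g)]

max_diff = max(abs(a - b)
               for ra, rb in zip(shared_grad, summed) for a, b in zip(ra, rb))
print(max_diff < 1e-6)
```

The shared-tensor gradient matches the unscaled sum of the per-sequence gradients, with no 1/N factor, which is the behaviour a correct accumulation has to reproduce. On a GPU, floating-point atomic additions are applied in no fixed order, so a comparison against per-sequence gradients there would normally use a tolerance rather than bitwise equality.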
Circularity Check
No circularity: implementation and benchmarks are independent of inputs
The paper describes a fused CUDA kernel redesign for attention in RL rollouts, asserting equivalence to standard attention via disjoint KV region iteration and reporting measured wall-clock speedups on specific hardware. No equations, fitted parameters, or predictions are presented that reduce by construction to the paper's own inputs or prior self-citations. The causal-masking invariance is stated as an observed property of decoder-only models rather than a self-defined assumption, and the central speedup numbers are direct empirical comparisons, not derived quantities. This is a self-contained engineering contribution with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: In decoder-only models, causal masking makes prompt representations invariant across sequences at every layer.
Forward citations
Cited by 1 Pith paper
- Schedule-Level Shared-Prefix Reuse for LLM RL Training: Schedule-level shared-prefix reuse decouples prefix and suffix passes in GRPO training to compute shared prefixes once, delivering up to 4.395x speedup and 59.1% HBM reduction while preserving numerical equivalence.