Recognition: unknown
LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning
Pith reviewed 2026-05-10 11:13 UTC · model grok-4.3
The pith
LongAct improves long-context RL performance by selectively updating only the weights tied to high-magnitude activations in query and key vectors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
High-magnitude activations appear in query and key vectors during long-context processing; because these activations are pivotal for effective optimization and long-context reasoning is sparse, selectively updating only the associated weights produces stronger long-context reasoning than uniform updates.
What carries the argument
Saliency-guided sparse updates that modify only the weights connected to high-magnitude activations in query and key vectors.
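As a rough sketch of what such a saliency-guided sparse update could look like, the snippet below scores each query/key output channel by its mean absolute activation and applies the gradient step only to the weight rows feeding the top channels. The per-channel mean-|activation| criterion, the keep fraction, and all function names are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def saliency_mask(acts, keep_frac=0.1):
    # Per-channel saliency = mean |activation| over tokens; keep the
    # top `keep_frac` fraction of output channels.
    saliency = np.abs(acts).mean(axis=0)            # shape (d_out,)
    k = max(1, int(keep_frac * saliency.size))
    mask = np.zeros(saliency.size, dtype=bool)
    mask[np.argsort(saliency)[-k:]] = True
    return mask

def sparse_update(W, grad_W, mask, lr=1e-3):
    # Gradient step applied only to the rows of W (one row per output
    # channel) flagged as salient; all other weights stay frozen.
    W_new = W.copy()
    W_new[mask] -= lr * grad_W[mask]
    return W_new
```

Here `W` follows the `(d_out, d_in)` convention so that row `j` produces activation channel `j`; in an actual RL loop the mask would presumably be recomputed from rollout activations and applied to the Q/K projection gradients.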
If this is right
- Yields an approximate 8% improvement on LongBench v2.
- Enhances generalization on the RULER benchmark.
- Consistently improves performance when plugged into GRPO or DAPO.
- Ablation results indicate that focusing on salient activations is essential for unlocking long-context gains.
- Replaces uniform weight updates with sparse, activation-magnitude-guided updates.
Where Pith is reading between the lines
- The same magnitude-based selection could be tested in supervised fine-tuning or continued pretraining to check whether the benefit is RL-specific.
- Dynamic identification of high-magnitude activations during inference might allow similar sparsity at test time without retraining.
- If the sparse structure holds across model scales, the method could reduce memory and compute costs for long-context training runs.
- The approach invites checking whether other internal statistics, such as gradient magnitudes, produce comparable sparse-update rules.
Load-bearing premise
High-magnitude activations in query and key vectors are the main drivers of successful optimization during long-context reinforcement learning.
What would settle it
An experiment that instead updates only the weights tied to low-magnitude activations, a same-sparsity random selection, or all weights uniformly, and measures whether the LongBench v2 and RULER gains of approximately 8% persist.
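Such a control could be set up by building several masks at identical sparsity and training one run per mask. The sketch below constructs the three candidate masks; the mean-|activation| saliency score and all names are assumptions for illustration, not the paper's specification.

```python
import numpy as np

def control_masks(acts, keep_frac=0.1, seed=0):
    # Three masks at identical sparsity: top-|activation| channels
    # (the saliency-guided choice), bottom-|activation| channels, and
    # a random selection -- the comparison the control experiment needs.
    sal = np.abs(acts).mean(axis=0)
    k = max(1, int(keep_frac * sal.size))
    order = np.argsort(sal)
    rng = np.random.default_rng(seed)
    picks = {"high": order[-k:],
             "low": order[:k],
             "random": rng.choice(sal.size, size=k, replace=False)}
    masks = {}
    for name, idx in picks.items():
        m = np.zeros(sal.size, dtype=bool)
        m[idx] = True
        masks[name] = m
    return masks
```

If only the "high" mask reproduces the reported gains while "low" and "random" do not, the saliency hypothesis is supported; if all three match, the benefit is generic sparsity.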
Original abstract
Reinforcement Learning (RL) has emerged as a critical driver for enhancing the reasoning capabilities of Large Language Models (LLMs). While recent advancements have focused on reward engineering or data synthesis, few studies exploit the model's intrinsic representation characteristics to guide the training process. In this paper, we first observe the presence of high-magnitude activations within the query and key vectors when processing long contexts. Drawing inspiration from model quantization -- which establishes the criticality of such high-magnitude activations -- and the insight that long-context reasoning inherently exhibits a sparse structure, we hypothesize that these weights serve as the pivotal drivers for effective model optimization. Based on this insight, we propose LongAct, a strategy that shifts from uniform to saliency-guided sparse updates. By selectively updating only the weights associated with these significant activations, LongAct achieves an approximate 8% improvement on LongBench v2 and enhances generalization on the RULER benchmark. Furthermore, our method exhibits remarkable universality, consistently boosting performance across diverse RL algorithms such as GRPO and DAPO. Extensive ablation studies suggest that focusing on these salient features is key to unlocking long-context potential.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper observes high-magnitude activations in query and key vectors during long-context processing in LLMs, hypothesizes that these are pivotal drivers for RL optimization due to inherent sparsity in long-context reasoning, and proposes LongAct to perform saliency-guided sparse weight updates instead of uniform updates. It claims this yields an approximate 8% improvement on LongBench v2, better generalization on RULER, and consistent gains across RL algorithms including GRPO and DAPO, with supporting ablation studies.
Significance. If the results hold after proper controls, the work could be significant for efficient LLM post-training by leveraging intrinsic activation patterns rather than external data or reward engineering. The universality across algorithms and focus on sparse structure in long-context RL could influence future optimization strategies, provided the saliency hypothesis is causally validated.
major comments (2)
- [Ablation studies / Experimental results] The ablation studies referenced in the abstract do not include a same-sparsity random-selection baseline for weight updates. Without this control, the reported ~8% gains on LongBench v2 cannot be attributed specifically to high-magnitude activation saliency rather than generic effects of sparsity-induced regularization or reduced update capacity, which directly undermines the central hypothesis that these activations are the pivotal drivers.
- [Introduction / Method motivation] The hypothesis that long-context reasoning inherently exhibits a sparse structure (and that high-magnitude Q/K activations are therefore critical) is stated as an insight but lacks quantitative characterization, such as activation histograms, sparsity metrics, or comparisons to short-context cases, making the motivation for saliency-guided updates insufficiently grounded.
minor comments (1)
- [Abstract] The abstract would benefit from explicit mention of the number of runs, statistical significance tests, and exact baselines used for the 8% LongBench v2 claim.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which highlights important ways to strengthen the empirical support for our claims. We agree that additional controls and quantitative analysis are needed to better substantiate the central hypothesis. We address each major comment below and will incorporate the suggested revisions in the next version of the manuscript.
Point-by-point responses
-
Referee: The ablation studies referenced in the abstract do not include a same-sparsity random-selection baseline for weight updates. Without this control, the reported ~8% gains on LongBench v2 cannot be attributed specifically to high-magnitude activation saliency rather than generic effects of sparsity-induced regularization or reduced update capacity, which directly undermines the central hypothesis that these activations are the pivotal drivers.
Authors: We agree that a same-sparsity random-selection baseline is necessary to isolate the contribution of saliency guidance from generic sparsity effects. In the revised manuscript, we will add this control experiment, applying random weight selection at the identical sparsity ratio used by LongAct and comparing the resulting performance on LongBench v2. This will provide direct evidence that the observed gains are attributable to targeting high-magnitude Q/K activations rather than sparsity-induced regularization alone. Revision: yes.
-
Referee: The hypothesis that long-context reasoning inherently exhibits a sparse structure (and that high-magnitude Q/K activations are therefore critical) is stated as an insight but lacks quantitative characterization, such as activation histograms, sparsity metrics, or comparisons to short-context cases, making the motivation for saliency-guided updates insufficiently grounded.
Authors: We acknowledge that the motivation would be strengthened by more rigorous quantitative support. We will expand the introduction and relevant method sections to include activation magnitude histograms, explicit sparsity metrics (e.g., fraction of activations exceeding magnitude thresholds), and side-by-side comparisons of long-context versus short-context activation patterns. These additions will better ground the claim that long-context processing exhibits an inherent sparse structure centered on high-magnitude Q/K activations. Revision: yes.
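The promised sparsity metric could be as simple as the fraction of activations exceeding a few magnitude thresholds, compared between long- and short-context inputs. A minimal sketch, with the std-relative thresholds and the function name chosen here for illustration only:

```python
import numpy as np

def activation_sparsity_stats(acts, thresholds=(2.0, 4.0, 8.0)):
    # Fraction of entries whose magnitude exceeds each multiple of the
    # std of |activations| -- a crude heavy-tail summary that could be
    # reported side by side for long- vs short-context batches.
    a = np.abs(np.asarray(acts)).ravel()
    scale = a.std() + 1e-12          # guard against all-zero tensors
    return {t: float((a > t * scale).mean()) for t in thresholds}
```

A strongly heavy-tailed (sparse) activation distribution would show a much slower decay of these fractions across thresholds than a Gaussian baseline.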
Circularity Check
No circularity; empirical hypothesis tested via selective updates
Full rationale
The paper's chain begins with an empirical observation of high-magnitude activations in Q/K vectors for long contexts, draws inspiration from external quantization work and the general sparsity of long-context reasoning, then proposes saliency-guided sparse updates as a training strategy. Reported gains on LongBench v2 and RULER are measured outcomes, not quantities defined or fitted to equal the inputs by construction. No equations, self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the provided text. The method is self-contained against external benchmarks and ablations; absence of a random-sparsity control is an experimental-design issue, not a circularity in the derivation.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Long-context reasoning inherently exhibits a sparse structure.
- ad hoc to paper High-magnitude activations in query and key vectors are the pivotal drivers for effective model optimization.
Reference graph
Works this paper leans on
- [1] Arash Ahmadian, Saurabh Dash, Hongyu Chen, Bharat Venkitesh, Zhen Stephen Gou, Phil Blunsom, Ahmet Üstün, and Sara Hooker. 2023. Intriguing properties of quantization at scale. Advances in Neural Information Processing Systems, 36:34278--34294.
- [2] Anonymous. 2025. LongRLVR: Overcoming the long-context bottleneck in reinforcement learning with verifiable rewards. OpenReview, under review as a conference paper at ICLR 2026. https://openreview.net/forum?id=omVhYvyTPJ
- [3] Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, et al. 2025. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3639--3664.
- [4]
- [5] Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. 2025. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617.
- [6]
- [7] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35:30318--30332.
- [8]
- [9]
- [10] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
- [11]
- [12] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. 2024. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769.
- [13] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654.
- [14]
- [15]
- [16] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. Proceedings of Machine Learning and Systems, 6:87--100.
- [17] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. arXiv preprint arXiv:2402.02750.
- [18] Samyak Mukherjee, Zhongyu Wu, and Mohit Bansal. 2025. Reinforcement learning finetunes small subnetworks in large language models. Advances in Neural Information Processing Systems, 38.
- [19]
- [20] Bowen Ping, Jiali Zeng, Fandong Meng, Shuo Wang, Jie Zhou, and Shanghang Zhang. 2025. LongDPO: Unlock better long-form generation abilities for LLMs via critique-augmented stepwise information. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7613--7632.
- [21] Qwen Team. 2025. Qwen3-Next: Towards ultimate training & inference efficiency. https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd. Accessed 2025-10.
- [22]
- [23] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- [24] Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. 2025a. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534.
- [25] Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. 2025b. Kimi Linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692.
- [26]
- [27]
- [28]
- [29] Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, and Maosong Sun. 2024. InfLLM: Training-free long-context extrapolation for LLMs with an efficient context memory. Advances in Neural Information Processing Systems, 37:119638--119661.
- [30]
- [31]
- [32] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. 2025b. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.
- [33] Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. 2025. GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models. arXiv preprint arXiv:2508.06471.
- [34] Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Hao, Xu Han, Zhen Thai, Shuo Wang, Zhiyuan Liu, et al. 2024. ∞Bench: Extending long context evaluation beyond 100K tokens. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15262--15277.
- [35]
- [36] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. 2025. Group sequence policy optimization. arXiv preprint arXiv:2507.18071.
- [37]
- [38]