Recognition: unknown
LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning
Pith reviewed 2026-05-10 11:13 UTC · model grok-4.3
The pith
LongAct improves long-context RL performance by selectively updating only the weights tied to high-magnitude activations in query and key vectors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
High-magnitude activations appear in query and key vectors during long-context processing; because these activations are pivotal for effective optimization and long-context reasoning is sparse, selectively updating only the associated weights produces stronger long-context reasoning than uniform updates.
What carries the argument
Saliency-guided sparse updates that modify only the weights connected to high-magnitude activations in query and key vectors.
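As a rough sketch of what such a saliency-guided sparse update could look like, the snippet below scores each query/key output channel by its mean absolute activation and applies the gradient step only to the weight rows feeding the top channels. The per-channel mean-|activation| criterion, the keep fraction, and all function names are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def saliency_mask(acts, keep_frac=0.1):
    # Per-channel saliency = mean |activation| over tokens; keep the
    # top `keep_frac` fraction of output channels.
    saliency = np.abs(acts).mean(axis=0)            # shape (d_out,)
    k = max(1, int(keep_frac * saliency.size))
    mask = np.zeros(saliency.size, dtype=bool)
    mask[np.argsort(saliency)[-k:]] = True
    return mask

def sparse_update(W, grad_W, mask, lr=1e-3):
    # Gradient step applied only to the rows of W (one row per output
    # channel) flagged as salient; all other weights stay frozen.
    W_new = W.copy()
    W_new[mask] -= lr * grad_W[mask]
    return W_new
```

Here `W` follows the `(d_out, d_in)` convention so that row `j` produces activation channel `j`; in an actual RL loop the mask would presumably be recomputed from rollout activations and applied to the Q/K projection gradients.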
If this is right
- Yields an approximate 8% improvement on LongBench v2.
- Enhances generalization on the RULER benchmark.
- Consistently improves performance when plugged into GRPO or DAPO.
- Ablation results indicate that focusing on salient activations is essential for unlocking long-context gains.
- Replaces uniform weight updates with sparse, activation-magnitude-guided updates.
Where Pith is reading between the lines
- The same magnitude-based selection could be tested in supervised fine-tuning or continued pretraining to check whether the benefit is RL-specific.
- Dynamic identification of high-magnitude activations during inference might allow similar sparsity at test time without retraining.
- If the sparse structure holds across model scales, the method could reduce memory and compute costs for long-context training runs.
- The approach invites checking whether other internal statistics, such as gradient magnitudes, produce comparable sparse-update rules.
Load-bearing premise
High-magnitude activations in query and key vectors are the main drivers of successful optimization during long-context reinforcement learning.
What would settle it
An experiment that instead updates only the weights tied to low-magnitude activations, a same-sparsity random selection, or all weights uniformly, and measures whether the LongBench v2 and RULER gains of approximately 8% persist.
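Such a control could be set up by building several masks at identical sparsity and training one run per mask. The sketch below constructs the three candidate masks; the mean-|activation| saliency score and all names are assumptions for illustration, not the paper's specification.

```python
import numpy as np

def control_masks(acts, keep_frac=0.1, seed=0):
    # Three masks at identical sparsity: top-|activation| channels
    # (the saliency-guided choice), bottom-|activation| channels, and
    # a random selection -- the comparison the control experiment needs.
    sal = np.abs(acts).mean(axis=0)
    k = max(1, int(keep_frac * sal.size))
    order = np.argsort(sal)
    rng = np.random.default_rng(seed)
    picks = {"high": order[-k:],
             "low": order[:k],
             "random": rng.choice(sal.size, size=k, replace=False)}
    masks = {}
    for name, idx in picks.items():
        m = np.zeros(sal.size, dtype=bool)
        m[idx] = True
        masks[name] = m
    return masks
```

If only the "high" mask reproduces the reported gains while "low" and "random" do not, the saliency hypothesis is supported; if all three match, the benefit is generic sparsity.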
Original abstract
Reinforcement Learning (RL) has emerged as a critical driver for enhancing the reasoning capabilities of Large Language Models (LLMs). While recent advancements have focused on reward engineering or data synthesis, few studies exploit the model's intrinsic representation characteristics to guide the training process. In this paper, we first observe the presence of high-magnitude activations within the query and key vectors when processing long contexts. Drawing inspiration from model quantization -- which establishes the criticality of such high-magnitude activations -- and the insight that long-context reasoning inherently exhibits a sparse structure, we hypothesize that these weights serve as the pivotal drivers for effective model optimization. Based on this insight, we propose LongAct, a strategy that shifts from uniform to saliency-guided sparse updates. By selectively updating only the weights associated with these significant activations, LongAct achieves an approximate 8% improvement on LongBench v2 and enhances generalization on the RULER benchmark. Furthermore, our method exhibits remarkable universality, consistently boosting performance across diverse RL algorithms such as GRPO and DAPO. Extensive ablation studies suggest that focusing on these salient features is key to unlocking long-context potential.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper observes high-magnitude activations in query and key vectors during long-context processing in LLMs, hypothesizes that these are pivotal drivers for RL optimization due to inherent sparsity in long-context reasoning, and proposes LongAct to perform saliency-guided sparse weight updates instead of uniform updates. It claims this yields an approximate 8% improvement on LongBench v2, better generalization on RULER, and consistent gains across RL algorithms including GRPO and DAPO, with supporting ablation studies.
Significance. If the results hold after proper controls, the work could be significant for efficient LLM post-training by leveraging intrinsic activation patterns rather than external data or reward engineering. The universality across algorithms and focus on sparse structure in long-context RL could influence future optimization strategies, provided the saliency hypothesis is causally validated.
major comments (2)
- [Ablation studies / Experimental results] The ablation studies referenced in the abstract do not include a same-sparsity random-selection baseline for weight updates. Without this control, the reported ~8% gains on LongBench v2 cannot be attributed specifically to high-magnitude activation saliency rather than generic effects of sparsity-induced regularization or reduced update capacity, which directly undermines the central hypothesis that these activations are the pivotal drivers.
- [Introduction / Method motivation] The hypothesis that long-context reasoning inherently exhibits a sparse structure (and that high-magnitude Q/K activations are therefore critical) is stated as an insight but lacks quantitative characterization, such as activation histograms, sparsity metrics, or comparisons to short-context cases, making the motivation for saliency-guided updates insufficiently grounded.
minor comments (1)
- [Abstract] The abstract would benefit from explicit mention of the number of runs, statistical significance tests, and exact baselines used for the 8% LongBench v2 claim.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which highlights important ways to strengthen the empirical support for our claims. We agree that additional controls and quantitative analysis are needed to better substantiate the central hypothesis. We address each major comment below and will incorporate the suggested revisions in the next version of the manuscript.
Point-by-point responses
-
Referee: The ablation studies referenced in the abstract do not include a same-sparsity random-selection baseline for weight updates. Without this control, the reported ~8% gains on LongBench v2 cannot be attributed specifically to high-magnitude activation saliency rather than generic effects of sparsity-induced regularization or reduced update capacity, which directly undermines the central hypothesis that these activations are the pivotal drivers.
Authors: We agree that a same-sparsity random-selection baseline is necessary to isolate the contribution of saliency guidance from generic sparsity effects. In the revised manuscript, we will add this control experiment, applying random weight selection at the identical sparsity ratio used by LongAct and comparing the resulting performance on LongBench v2. This will provide direct evidence that the observed gains are attributable to targeting high-magnitude Q/K activations rather than sparsity-induced regularization alone. Revision: yes.
-
Referee: The hypothesis that long-context reasoning inherently exhibits a sparse structure (and that high-magnitude Q/K activations are therefore critical) is stated as an insight but lacks quantitative characterization, such as activation histograms, sparsity metrics, or comparisons to short-context cases, making the motivation for saliency-guided updates insufficiently grounded.
Authors: We acknowledge that the motivation would be strengthened by more rigorous quantitative support. We will expand the introduction and relevant method sections to include activation magnitude histograms, explicit sparsity metrics (e.g., fraction of activations exceeding magnitude thresholds), and side-by-side comparisons of long-context versus short-context activation patterns. These additions will better ground the claim that long-context processing exhibits an inherent sparse structure centered on high-magnitude Q/K activations. Revision: yes.
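The promised sparsity metric could be as simple as the fraction of activations exceeding a few magnitude thresholds, compared between long- and short-context inputs. A minimal sketch, with the std-relative thresholds and the function name chosen here for illustration only:

```python
import numpy as np

def activation_sparsity_stats(acts, thresholds=(2.0, 4.0, 8.0)):
    # Fraction of entries whose magnitude exceeds each multiple of the
    # std of |activations| -- a crude heavy-tail summary that could be
    # reported side by side for long- vs short-context batches.
    a = np.abs(np.asarray(acts)).ravel()
    scale = a.std() + 1e-12          # guard against all-zero tensors
    return {t: float((a > t * scale).mean()) for t in thresholds}
```

A strongly heavy-tailed (sparse) activation distribution would show a much slower decay of these fractions across thresholds than a Gaussian baseline.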
Circularity Check
No circularity; empirical hypothesis tested via selective updates
Full rationale
The paper's chain begins with an empirical observation of high-magnitude activations in Q/K vectors for long contexts, draws inspiration from external quantization work and the general sparsity of long-context reasoning, then proposes saliency-guided sparse updates as a training strategy. Reported gains on LongBench v2 and RULER are measured outcomes, not quantities defined or fitted to equal the inputs by construction. No equations, self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the provided text. The method is self-contained against external benchmarks and ablations; absence of a random-sparsity control is an experimental-design issue, not a circularity in the derivation.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Long-context reasoning inherently exhibits a sparse structure.
- ad hoc to paper High-magnitude activations in query and key vectors are the pivotal drivers for effective model optimization.
Reference graph
Works this paper leans on
- [1] Arash Ahmadian, Saurabh Dash, Hongyu Chen, Bharat Venkitesh, Zhen Stephen Gou, Phil Blunsom, Ahmet Üstün, and Sara Hooker. 2023. Intriguing properties of quantization at scale. Advances in Neural Information Processing Systems, 36:34278--34294.
- [2] Anonymous. 2025. LongRLVR: Overcoming the long-context bottleneck in reinforcement learning with verifiable rewards. OpenReview, under review as a conference paper at ICLR 2026. https://openreview.net/forum?id=omVhYvyTPJ
- [3] Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, et al. 2025. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3639--3664.
- [4]
- [5] Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. 2025. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617.
- [6]
- [7] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35:30318--30332.
- [8]
- [9]
- [10] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
- [11]
- [12] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. 2024. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769.
- [13] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654.
- [14]
- [15]
- [16] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. Proceedings of Machine Learning and Systems, 6:87--100.
- [17] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. arXiv preprint arXiv:2402.02750.
- [18] Samyak Mukherjee, Zhongyu Wu, and Mohit Bansal. 2025. Reinforcement learning finetunes small subnetworks in large language models. Advances in Neural Information Processing Systems, 38.
- [19]
- [20] Bowen Ping, Jiali Zeng, Fandong Meng, Shuo Wang, Jie Zhou, and Shanghang Zhang. 2025. LongDPO: Unlock better long-form generation abilities for LLMs via critique-augmented stepwise information. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7613--7632.
- [21] Qwen Team. 2025. Qwen3-Next: Towards ultimate training & inference efficiency. https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd. Accessed 2025-10.
- [22]
- [23] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- [24] Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. 2025a. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534.
- [25] Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. 2025b. Kimi Linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692.
- [26]
- [27]
- [28]
- [29] Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, and Maosong Sun. 2024. InfLLM: Training-free long-context extrapolation for LLMs with an efficient context memory. Advances in Neural Information Processing Systems, 37:119638--119661.
- [30]
- [31]
- [32] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. 2025b. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.
- [33] Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. 2025. GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models. arXiv preprint arXiv:2508.06471.
- [34] Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Hao, Xu Han, Zhen Thai, Shuo Wang, Zhiyuan Liu, et al. 2024. ∞Bench: Extending long context evaluation beyond 100K tokens. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15262--15277.
- [35]
- [36] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. 2025. Group sequence policy optimization. arXiv preprint arXiv:2507.18071.
- [37]
- [38]