GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment
Pith reviewed 2026-05-20 05:49 UTC · model grok-4.3
The pith
A taxonomy-guided dataset of 23K samples plus TMN-Reweight lets a 30B model match much larger ones on long-context tasks
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GoLongRL shows that an openly released dataset of 23K RLVR samples covering nine long-context task types outperforms the closed-source QwenLong-L1.5 dataset under identical vanilla GRPO training, while a Qwen3-30B-A3B model trained on it reaches long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507. The proposed TMN-Reweight method, which combines task-level mean normalization for reward-scale alignment with difficulty-adaptive weighting, further raises average performance over vanilla GRPO and keeps general capabilities intact or improved.
What carries the argument
TMN-Reweight, which applies task-level mean normalization to align cross-task reward scales and difficulty-adaptive weighting to stabilize advantage estimates when rewards are heterogeneous across tasks.
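The summary above names two ingredients but gives no formulas, so the following is only a minimal sketch of one way they could fit together, not the paper's method. It assumes rewards lie in [0, 1], that "task-level mean normalization" means dividing group-centered rewards by a running mean reward for the task, and that "difficulty-adaptive weighting" means scaling a group's advantages by how hard the prompt currently is. The function names, the `floor` parameter, and both formulas are hypothetical.

```python
from statistics import mean


def task_mean_normalized_advantages(group_rewards, task_mean_reward, eps=1e-6):
    """Center rewards within one rollout group, then rescale by the task's mean reward.

    Assumption (not from the source): the rescaling is a plain division by the
    task-level mean, so a task with small typical rewards is scaled up.
    """
    group_mean = mean(group_rewards)
    return [(r - group_mean) / (task_mean_reward + eps) for r in group_rewards]


def difficulty_weight(group_rewards, floor=0.1):
    """Hypothetical difficulty weight: larger when the group's mean reward is lower.

    Assumes rewards are in [0, 1]; `floor` keeps easy prompts from being zeroed out.
    """
    return max(floor, 1.0 - mean(group_rewards))


def reweighted_advantages(group_rewards, task_mean_reward):
    """Combine both steps: task-mean-normalized advantages times the difficulty weight."""
    weight = difficulty_weight(group_rewards)
    normalized = task_mean_normalized_advantages(group_rewards, task_mean_reward)
    return [weight * a for a in normalized]


if __name__ == "__main__":
    # Two toy groups with very different reward scales (values are made up).
    exact_match_group = [0.0, 1.0, 1.0, 0.0]      # binary metric, task mean 0.5
    overlap_metric_group = [0.10, 0.30, 0.20, 0.20]  # graded metric, task mean 0.2
    print(reweighted_advantages(exact_match_group, 0.5))
    print(reweighted_advantages(overlap_metric_group, 0.2))
```

Under these assumptions, the graded-metric group is no longer dwarfed by the binary one purely because its raw rewards are smaller, which is the kind of cross-task scale alignment the summary describes. Whether the paper normalizes by division, uses a running or batch mean, or weights difficulty this way cannot be determined from this page.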
If this is right
- The open 23K-sample dataset alone produces stronger long-context results than the closed-source QwenLong-L1.5 dataset when both are used with the same vanilla GRPO setup.
- A 30B-scale model trained on the dataset reaches long-context performance levels previously associated only with models several times larger.
- TMN-Reweight delivers measurable gains on top of standard GRPO when rewards differ across tasks.
- General capabilities stay the same or improve while long-context performance advances.
Where Pith is reading between the lines
- The same taxonomy-driven approach to dataset design could be adapted to build RLVR collections for other targeted capabilities such as multi-step reasoning.
- Full public release of the construction pipeline and training code makes it straightforward for others to test the method on different base models or longer context lengths.
- If reward diversity is the main driver, systematically adding further task categories beyond the current nine could produce additional capability gains without changing the optimization method.
Load-bearing premise
The argument assumes that guiding data construction with a taxonomy of long-context capabilities, and increasing the number of task types paired with natural metrics, substantially improves long-context capability through greater reward diversity.
What would settle it
Training the same base model on a dataset restricted to only three task types and finding no drop in long-context benchmark scores compared with the nine-task version would undermine the claimed benefit of broader coverage.
Original abstract
We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement. (2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across reported evaluations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GoLongRL, a fully open-source post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). It contributes (1) an openly released 23K-sample dataset spanning 9 task types guided by a taxonomy of long-context capabilities, with curated and synthetic samples, and (2) the TMN-Reweight method combining task-level mean normalization and difficulty-adaptive weighting for heterogeneous multitask GRPO optimization. The central empirical claims are that this dataset alone outperforms the closed-source QwenLong-L1.5 dataset under identical vanilla GRPO, and that a Qwen3-30B-A3B model trained on it achieves long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, with gains attributed to broader coverage and reward diversity.
Significance. If the results hold, the open release of the full dataset, construction pipeline, and training code is a clear strength that supports reproducibility and community progress in long-context RL. The TMN-Reweight technique provides a concrete, practical approach to handling reward heterogeneity in multitask settings, and the performance claims, if substantiated, indicate that capability-oriented data design can yield competitive long-context gains without model scaling.
major comments (1)
- [Abstract] The claim that 'broader coverage and greater reward diversity substantially benefit long-context capability improvement' and that the dataset outperforms QwenLong-L1.5 'alone' under vanilla GRPO is load-bearing for the central contribution, yet no ablation is described that holds total sample count fixed while varying the number of task types (or reward formulations) to isolate diversity from curation quality or difficulty distribution.
minor comments (1)
- [Abstract] The comparability claim to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507 would be strengthened by explicit mention of the exact long-context benchmarks, metrics, and whether error bars or statistical tests were used.
Simulated Author's Rebuttal
We are grateful to the referee for their thoughtful review and for recognizing the value of our open-source dataset and the TMN-Reweight method. We address the major comment in detail below and outline the revisions we plan to make.
Point-by-point responses
Referee: The claim that 'broader coverage and greater reward diversity substantially benefit long-context capability improvement' and that the dataset outperforms QwenLong-L1.5 'alone' under vanilla GRPO is load-bearing for the central contribution, yet no ablation is described that holds total sample count fixed while varying the number of task types (or reward formulations) to isolate diversity from curation quality or difficulty distribution.
Authors: We agree with the referee that an ablation holding the total sample count fixed while varying the number of task types would provide stronger evidence for the benefits of broader coverage and reward diversity. Our current results demonstrate that the full GoLongRL dataset outperforms QwenLong-L1.5 under identical vanilla GRPO training, but we acknowledge that factors such as curation quality and difficulty distribution may contribute to the observed gains. In the revised manuscript, we will include a new ablation study. Specifically, we will construct subsets of our dataset with varying numbers of task types (e.g., 3, 6, and 9 tasks) while maintaining a fixed total sample count of approximately 23K by proportionally increasing samples from the included tasks. We will report the long-context performance under the same GRPO setup to isolate the impact of task diversity. This will be added to the experiments section, and the abstract will be updated to reflect the additional evidence. We believe this will substantiate our claims more robustly.
Revision: yes
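The proposed ablation can be sketched as a fixed-budget sampler. The rebuttal does not say how samples are "proportionally increased", so this sketch assumes each included task keeps its share of the included pool and that a task is upsampled with replacement when its quota exceeds its unique samples. The function names and the toy data are hypothetical.

```python
import random


def proportional_quotas(pool_sizes, total):
    """Split `total` across tasks in proportion to each task's pool size."""
    pool_total = sum(pool_sizes.values())
    quotas = {task: total * n // pool_total for task, n in pool_sizes.items()}
    # Integer division can leave a small remainder; give it to the largest pools.
    remainder = total - sum(quotas.values())
    for task in sorted(pool_sizes, key=pool_sizes.get, reverse=True)[:remainder]:
        quotas[task] += 1
    return quotas


def fixed_budget_subset(samples_by_task, task_types, total, seed=0):
    """Build a subset restricted to `task_types` with exactly `total` samples."""
    rng = random.Random(seed)
    pools = {task: samples_by_task[task] for task in task_types}
    quotas = proportional_quotas({t: len(p) for t, p in pools.items()}, total)
    subset = []
    for task, pool in pools.items():
        quota = quotas[task]
        if quota <= len(pool):
            picked = rng.sample(pool, quota)
        else:
            # Assumption: upsample with replacement once unique samples run out.
            picked = list(pool) + rng.choices(pool, k=quota - len(pool))
        subset.extend((task, sample) for sample in picked)
    return subset


# Toy stand-in for the dataset: 9 task types with 100 samples each.
toy = {f"task_{i}": [f"task_{i}_sample_{j}" for j in range(100)] for i in range(9)}
subsets = {k: fixed_budget_subset(toy, list(toy)[:k], total=900) for k in (3, 6, 9)}
```

One caveat with this design is that the 3-task subset repeats samples while the 9-task subset does not, so duplication becomes a confound of its own. A revised ablation would need to state how it handles that.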
Circularity Check
No significant circularity in derivation or claims
The paper's core contributions consist of an openly released 23K-sample dataset spanning 9 taxonomy-guided task types with natural metrics, plus the TMN-Reweight method that applies task-level mean normalization and difficulty-adaptive weighting. Performance assertions rest on direct empirical comparisons against external closed-source datasets (QwenLong-L1.5) and larger models (DeepSeek-R1, Qwen3-235B) under identical vanilla GRPO training, rather than any internal parameter fit, self-referential definition, or self-citation chain. No equations are presented that reduce a claimed prediction or uniqueness result to the inputs by construction, and the interpretation that diversity drives gains is offered as a post-hoc suggestion from the observed outperformance, not as a load-bearing derivation that collapses into its own assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The GRPO algorithm provides valid advantage estimation when applied to the heterogeneous reward setting.
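For reference, the group-relative advantage commonly associated with GRPO is shown below. This is the standard outcome-reward form and is offered as background. The page does not show which exact variant the paper uses. Here $G$ is the number of sampled responses for a prompt and $r_i$ is the reward of response $i$.

```latex
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)},
\qquad i = 1, \dots, G
```

The axiom above amounts to assuming that this within-group estimate remains a reliable learning signal when the reward functions producing $r_i$ differ from task to task.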
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric... TMN-Reweight... task-level mean normalization... difficulty-adaptive weighting"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.