arxiv: 2605.19577 · v1 · pith:OUZC76F3new · submitted 2026-05-19 · 💻 cs.CL

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

Pith reviewed 2026-05-20 05:49 UTC · model grok-4.3

classification 💻 cs.CL
keywords long-context reinforcement learning · RLVR · multitask alignment · dataset construction · TMN-Reweight · GRPO · capability taxonomy · post-training

The pith

A taxonomy-guided dataset of 23K samples plus TMN-Reweight lets a 30B model match much larger ones on long-context tasks

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that long-context reinforcement learning improves when data construction follows a broad taxonomy of practical capabilities instead of narrow retrieval patterns, and when multitask optimization handles differing reward scales explicitly. It releases an open 23K-sample dataset spanning nine task types with their natural metrics, built from both curated and synthetic sources, and pairs this with a new reweighting technique. A sympathetic reader would care because the results show a 30B-parameter open model reaching long-context performance comparable to much larger models (DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507), suggesting that thoughtful coverage and alignment can substitute for raw scale. If the central claim holds, long-context training becomes more accessible and efficient without proprietary data or ever-larger models.

Core claim

GoLongRL shows that an openly released dataset of 23K RLVR samples covering nine long-context task types outperforms the closed-source QwenLong-L1.5 dataset under identical vanilla GRPO training, while a Qwen3-30B-A3B model trained on it reaches long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507. The proposed TMN-Reweight method, which combines task-level mean normalization for reward-scale alignment with difficulty-adaptive weighting, further raises average performance over vanilla GRPO and keeps general capabilities intact or improved.

What carries the argument

TMN-Reweight, which applies task-level mean normalization to align cross-task reward scales and difficulty-adaptive weighting to stabilize advantage estimates when rewards are heterogeneous across tasks.
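The two ingredients named above can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything concrete here is an assumption made for illustration: the function name `tmn_reweight_advantages`, normalizing by dividing each reward by its task's batch mean, and a difficulty weight of `1 + alpha * (1 - group mean reward)` for rewards in [0, 1]. It should not be read as the authors' implementation.

```python
from collections import defaultdict
from statistics import mean


def tmn_reweight_advantages(groups, alpha=1.0, eps=1e-6):
    """Illustrative (assumed) TMN-Reweight-style advantages.

    groups: list of dicts, one per prompt, each with
        'task'    -> task-type name (hypothetical labels)
        'rewards' -> list of rollout rewards in [0, 1]
    Returns one list of advantages per prompt group.
    """
    # Step 1 (assumed form of task-level mean normalization):
    # compute the mean reward of every task over the whole batch.
    per_task = defaultdict(list)
    for g in groups:
        per_task[g["task"]].extend(g["rewards"])
    task_mean = {t: mean(rs) for t, rs in per_task.items()}

    advantages = []
    for g in groups:
        # Divide by the task mean so each task's scaled rewards average ~1,
        # which puts metrics such as accuracy and ROUGE on a shared scale.
        scale = max(task_mean[g["task"]], eps)
        scaled = [r / scale for r in g["rewards"]]
        centre = mean(scaled)

        # Step 2 (assumed form of difficulty-adaptive weighting):
        # prompts with a low raw mean reward are treated as harder
        # and receive a larger weight.
        difficulty = 1.0 - min(max(mean(g["rewards"]), 0.0), 1.0)
        weight = 1.0 + alpha * difficulty

        # Group-relative advantage on the scaled rewards, then reweighted.
        advantages.append([weight * (s - centre) for s in scaled])
    return advantages
```

Calling the function on a batch that mixes a binary-accuracy task with a low-scoring summarization task gives advantages of similar magnitude for both, which is the scale-alignment effect the method is described as targeting.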

If this is right

  • The open 23K-sample dataset alone produces stronger long-context results than the closed-source QwenLong-L1.5 dataset when both are used with the same vanilla GRPO setup.
  • A 30B-scale model trained on the dataset reaches long-context performance levels previously associated only with models several times larger.
  • TMN-Reweight delivers measurable gains on top of standard GRPO when rewards differ across tasks.
  • General capabilities stay the same or improve while long-context performance advances.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same taxonomy-driven approach to dataset design could be adapted to build RLVR collections for other targeted capabilities such as multi-step reasoning.
  • Full public release of the construction pipeline and training code makes it straightforward for others to test the method on different base models or longer context lengths.
  • If reward diversity is the main driver, systematically adding further task categories beyond the current nine could produce additional capability gains without changing the optimization method.

Load-bearing premise

That guiding data construction by a taxonomy of long-context capabilities and increasing the number of task types with natural metrics will substantially improve long-context capability gains through greater reward diversity.

What would settle it

Training the same base model on a dataset restricted to only three task types and finding no drop in long-context benchmark scores compared with the nine-task version would undermine the claimed benefit of broader coverage.

Original abstract

We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement. (2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across reported evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces GoLongRL, a fully open-source post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). It contributes (1) an openly released 23K-sample dataset spanning 9 task types guided by a taxonomy of long-context capabilities, with curated and synthetic samples, and (2) the TMN-Reweight method combining task-level mean normalization and difficulty-adaptive weighting for heterogeneous multitask GRPO optimization. The central empirical claims are that this dataset alone outperforms the closed-source QwenLong-L1.5 dataset under identical vanilla GRPO, and that a Qwen3-30B-A3B model trained on it achieves long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, with gains attributed to broader coverage and reward diversity.

Significance. If the results hold, the open release of the full dataset, construction pipeline, and training code is a clear strength that supports reproducibility and community progress in long-context RL. The TMN-Reweight technique provides a concrete, practical approach to handling reward heterogeneity in multitask settings, and the performance claims, if substantiated, indicate that capability-oriented data design can yield competitive long-context gains without model scaling.

major comments (1)
  1. [Abstract] The claim that 'broader coverage and greater reward diversity substantially benefit long-context capability improvement' and that the dataset outperforms QwenLong-L1.5 'alone' under vanilla GRPO is load-bearing for the central contribution, yet no ablation is described that holds total sample count fixed while varying the number of task types (or reward formulations) to isolate diversity from curation quality or difficulty distribution.
minor comments (1)
  1. [Abstract] The comparability claim to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507 would be strengthened by explicit mention of the exact long-context benchmarks, metrics, and whether error bars or statistical tests were used.

Simulated Author's Rebuttal

1 response · 0 unresolved

We are grateful to the referee for their thoughtful review and for recognizing the value of our open-source dataset and the TMN-Reweight method. We address the major comment in detail below and outline the revisions we plan to make.

Point-by-point responses
  1. Referee: The claim that 'broader coverage and greater reward diversity substantially benefit long-context capability improvement' and that the dataset outperforms QwenLong-L1.5 'alone' under vanilla GRPO is load-bearing for the central contribution, yet no ablation is described that holds total sample count fixed while varying the number of task types (or reward formulations) to isolate diversity from curation quality or difficulty distribution.

    Authors: We agree with the referee that an ablation holding the total sample count fixed while varying the number of task types would provide stronger evidence for the benefits of broader coverage and reward diversity. Our current results demonstrate that the full GoLongRL dataset outperforms QwenLong-L1.5 under identical vanilla GRPO training, but we acknowledge that factors such as curation quality and difficulty distribution may contribute to the observed gains. In the revised manuscript, we will include a new ablation study. Specifically, we will construct subsets of our dataset with varying numbers of task types (e.g., 3, 6, and 9 tasks) while maintaining a fixed total sample count of approximately 23K by proportionally increasing samples from the included tasks. We will report the long-context performance under the same GRPO setup to isolate the impact of task diversity. This will be added to the experiments section, and the abstract will be updated to reflect the additional evidence. We believe this will substantiate our claims more robustly.

    revision: yes
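The fixed-budget subset construction described in this response can be sketched as a quota calculation. This is a hedged illustration only: the helper `allocate_quotas`, the task names, and the per-task sizes are hypothetical, and the rebuttal does not say how rounding is handled or how tasks are upsampled once a quota exceeds the samples available.

```python
def allocate_quotas(task_sizes, included, total):
    """Split `total` samples across the `included` tasks in proportion
    to their original sizes (hypothetical helper, not from the paper).

    task_sizes: dict mapping task name -> original sample count
    included:   list of task names kept in this ablation arm
    total:      fixed sample budget shared by every arm
    """
    base = sum(task_sizes[t] for t in included)
    exact = {t: total * task_sizes[t] / base for t in included}
    quotas = {t: int(exact[t]) for t in included}

    # Largest-remainder rounding so the quotas sum exactly to `total`.
    leftover = total - sum(quotas.values())
    by_remainder = sorted(
        included, key=lambda t: exact[t] - quotas[t], reverse=True
    )
    for t in by_remainder[:leftover]:
        quotas[t] += 1
    return quotas


# Made-up sizes for four task types; one arm keeps only three of them.
sizes = {"multi_hop_qa": 1000, "summarization": 2000,
         "doc_math": 3000, "dialogue_memory": 4000}
arm = allocate_quotas(
    sizes, ["multi_hop_qa", "summarization", "doc_math"], total=10000
)
```

A quota larger than a task's original size implies sampling with replacement or generating additional examples, which is itself a design choice that could affect the ablation.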

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

Full rationale

The paper's core contributions consist of an openly released 23K-sample dataset spanning 9 taxonomy-guided task types with natural metrics, plus the TMN-Reweight method that applies task-level mean normalization and difficulty-adaptive weighting. Performance assertions rest on direct empirical comparisons against external closed-source datasets (QwenLong-L1.5) and larger models (DeepSeek-R1, Qwen3-235B) under identical vanilla GRPO training, rather than any internal parameter fit, self-referential definition, or self-citation chain. No equations are presented that reduce a claimed prediction or uniqueness result to the inputs by construction, and the interpretation that diversity drives gains is offered as a post-hoc suggestion from the observed outperformance, not as a load-bearing derivation that collapses into its own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard RL assumptions and data synthesis practices; no new physical entities are introduced. TMN-Reweight is a procedural technique rather than a new postulated object.

axioms (1)
  • domain assumption The GRPO algorithm provides valid advantage estimation when applied to the heterogeneous reward setting
    Paper uses vanilla GRPO as the baseline setup for comparisons.
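For context on this assumption, the group-relative advantage of standard GRPO, as introduced in the DeepSeekMath paper, is reproduced below for a prompt with G sampled responses and scalar rewards r_1, ..., r_G. The exact variant used in GoLongRL is not stated in the abstract, so this is the textbook form rather than a quotation from the paper.

```latex
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}\!\left(\{r_j\}_{j=1}^{G}\right)}
                     {\operatorname{std}\!\left(\{r_j\}_{j=1}^{G}\right)},
\qquad i = 1, \dots, G
```

Because the mean and standard deviation are taken within each prompt's group, the estimate involves no explicit cross-task term, and the axiom amounts to assuming this per-group estimate stays valid when different tasks are scored with different metrics.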


