GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

Han Li; Jian Liang; Junmin Chen; Minxuan Lv; Ruiming Tang; Ruotong Pan; Tanlong Du; Tiehua Mei; Zhennan Wu; Zhenpeng Su

arxiv: 2605.19577 · v1 · pith:OUZC76F3new · submitted 2026-05-19 · 💻 cs.CL

Minxuan Lv, Tiehua Mei, Tanlong Du, Junmin Chen, Zhenpeng Su, Ziyang Chen, Ziqi Wang, Zhennan Wu, Ruotong Pan, Jian Liang, Ruiming Tang, Han Li

Pith reviewed 2026-05-20 05:49 UTC · model grok-4.3

classification 💻 cs.CL

keywords: long-context reinforcement learning · RLVR · multitask alignment · dataset construction · TMN-Reweight · GRPO · capability taxonomy · post-training

The pith

A taxonomy-guided dataset of 23K samples plus TMN-Reweight lets a 30B model match much larger ones on long-context tasks

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that long-context reinforcement learning improves when data construction follows a broad taxonomy of practical capabilities instead of narrow retrieval patterns, and when multitask optimization handles differing reward scales explicitly. It releases an open 23K-sample dataset spanning nine task types with their natural metrics, built from both curated and synthetic sources, and pairs this with a new reweighting technique. A sympathetic reader would care because the results show a modest open model reaching performance levels of much larger closed models under standard training, suggesting that thoughtful coverage and alignment can substitute for raw scale. If the central claim holds, long-context training becomes more accessible and efficient without proprietary data or ever-larger models.

Core claim

GoLongRL shows that an openly released dataset of 23K RLVR samples covering nine long-context task types outperforms the closed-source QwenLong-L1.5 dataset under identical vanilla GRPO training, while a Qwen3-30B-A3B model trained on it reaches long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507. The proposed TMN-Reweight method, which combines task-level mean normalization for reward-scale alignment with difficulty-adaptive weighting, further raises average performance over vanilla GRPO and keeps general capabilities intact or improved.

What carries the argument

TMN-Reweight applies task-level mean normalization to align reward scales across tasks, and difficulty-adaptive weighting to stabilize advantage estimates when rewards are heterogeneous across tasks.
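The two ingredients named above can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so every specific here is an assumption made for illustration: rewards are assumed to lie in [0, 1], normalization divides by the task's batch-mean reward, and the difficulty weight is a hypothetical `1 + alpha * (1 - group mean)`. The function name `tmn_reweight_advantages` and the parameters `alpha` and `eps` are not from the paper.

```python
from collections import defaultdict
from statistics import mean


def tmn_reweight_advantages(groups, alpha=1.0, eps=1e-6):
    """Illustrative only; not the paper's exact formulation.

    groups: list of (task_name, rewards), one entry per prompt, where
    rewards holds the raw rewards (assumed in [0, 1]) of that prompt's rollouts.
    Returns one list of advantages per prompt, in input order.
    """
    # Task-level mean: average raw reward over all rollouts of a task in the batch.
    per_task = defaultdict(list)
    for task, rewards in groups:
        per_task[task].extend(rewards)
    task_mean = {task: mean(rs) for task, rs in per_task.items()}

    advantages = []
    for task, rewards in groups:
        # Assumed normalization: divide by the task mean so tasks with small
        # typical rewards are brought onto a scale comparable to the others.
        scaled = [r / (task_mean[task] + eps) for r in rewards]
        baseline = mean(scaled)  # group-relative baseline for this prompt
        # Hypothetical difficulty weight: prompts with a low mean reward count more.
        difficulty = 1.0 - min(1.0, max(0.0, mean(rewards)))
        weight = 1.0 + alpha * difficulty
        advantages.append([weight * (s - baseline) for s in scaled])
    return advantages


batch = [("qa", [1.0, 0.0]), ("summ", [0.2, 0.1])]
adv = tmn_reweight_advantages(batch)
```

In this toy batch the summarization rewards differ by only 0.1, yet after division by the task mean their advantages are of the same order as those of the binary QA task. That is the kind of cross-task scale alignment the method's description points to.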

If this is right

The open 23K-sample dataset alone produces stronger long-context results than the closed-source QwenLong-L1.5 dataset when both are used with the same vanilla GRPO setup.
A 30B-scale model trained on the dataset reaches long-context performance levels previously associated only with models several times larger.
TMN-Reweight delivers measurable gains on top of standard GRPO when rewards differ across tasks.
General capabilities stay the same or improve while long-context performance advances.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same taxonomy-driven approach to dataset design could be adapted to build RLVR collections for other targeted capabilities such as multi-step reasoning.
Full public release of the construction pipeline and training code makes it straightforward for others to test the method on different base models or longer context lengths.
If reward diversity is the main driver, systematically adding further task categories beyond the current nine could produce additional capability gains without changing the optimization method.

Load-bearing premise

That guiding data construction by a taxonomy of long-context capabilities and increasing the number of task types with natural metrics will substantially improve long-context capability gains through greater reward diversity.

What would settle it

Training the same base model on a dataset restricted to only three task types and finding no drop in long-context benchmark scores compared with the nine-task version would undermine the claimed benefit of broader coverage.

Original abstract

We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement. (2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across reported evaluations.
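To make "each paired with its natural evaluation metric" concrete, here is a small Python sketch of a per-task reward registry. The abstract does not list the nine task types or their metrics, so the task names `multi_hop_qa` and `free_form_qa` and the choice of exact match and token-level F1 are hypothetical stand-ins, not the paper's configuration.

```python
from collections import Counter


def exact_match(pred, gold):
    """Binary reward: 1.0 if the normalized strings are identical, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())


def token_f1(pred, gold):
    """Continuous reward in [0, 1]: harmonic mean of token precision and recall."""
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical task names; the real dataset defines its own nine task types.
REWARD_FNS = {"multi_hop_qa": exact_match, "free_form_qa": token_f1}


def reward(task, pred, gold):
    """Dispatch to the metric registered for this task type."""
    return REWARD_FNS[task](pred, gold)
```

A binary metric and a continuous one produce rewards with different distributions even though both lie in [0, 1]. This is one way reward heterogeneity can arise in a multitask batch.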

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Open 23K long-context RL dataset and TMN-Reweight deliver usable gains and full code release, but the diversity explanation lacks the needed ablations.

The main points are that this paper ships a fully open 23K-sample dataset for long-context RLVR covering 9 task types and adds TMN-Reweight to manage heterogeneous rewards. Both pieces are released with the construction pipeline and training code, which is the practical part worth noting first. Under vanilla GRPO their data beats the closed QwenLong-L1.5 set, and the resulting 30B model reaches performance levels close to much larger models on long-context tasks while keeping general capabilities intact. The reweighting step uses task-level mean normalization plus difficulty-adaptive weighting, a straightforward fix for scale differences across tasks. That combination gives a concrete, reproducible starting point for others working on similar post-training. The open resources stand out as the clearest contribution here.

The soft spot is the causal claim about task diversity. The paper ties the gains to broader coverage and reward variety, yet it does not include an ablation that holds total sample count fixed while changing only the number of task types. Without that control it remains possible that curation quality, difficulty distribution, or other unstated choices explain the edge instead. The abstract also omits error bars, statistical tests, and exact task definitions, so the strength of the reported improvements is hard to judge from the summary alone.

This work is aimed at labs doing long-context post-training or RL with verifiable rewards who want ready-to-use open data rather than closed benchmarks. Readers focused on multitask alignment or capability-oriented data construction will extract the most value from the released assets. The concrete dataset and code make it worth sending to a serious referee, even though the diversity attribution would need tighter experiments in revision. I would recommend peer review.

Referee Report

1 major / 1 minor

Summary. The paper introduces GoLongRL, a fully open-source post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). It contributes (1) an openly released 23K-sample dataset spanning 9 task types guided by a taxonomy of long-context capabilities, with curated and synthetic samples, and (2) the TMN-Reweight method combining task-level mean normalization and difficulty-adaptive weighting for heterogeneous multitask GRPO optimization. The central empirical claims are that this dataset alone outperforms the closed-source QwenLong-L1.5 dataset under identical vanilla GRPO, and that a Qwen3-30B-A3B model trained on it achieves long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, with gains attributed to broader coverage and reward diversity.

Significance. If the results hold, the open release of the full dataset, construction pipeline, and training code is a clear strength that supports reproducibility and community progress in long-context RL. The TMN-Reweight technique provides a concrete, practical approach to handling reward heterogeneity in multitask settings, and the performance claims, if substantiated, indicate that capability-oriented data design can yield competitive long-context gains without model scaling.

major comments (1)

[Abstract] The claim that 'broader coverage and greater reward diversity substantially benefit long-context capability improvement' and that the dataset outperforms QwenLong-L1.5 'alone' under vanilla GRPO is load-bearing for the central contribution, yet no ablation is described that holds total sample count fixed while varying the number of task types (or reward formulations) to isolate diversity from curation quality or difficulty distribution.

minor comments (1)

[Abstract] The comparability claim to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507 would be strengthened by explicit mention of the exact long-context benchmarks, metrics, and whether error bars or statistical tests were used.
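As an illustration of the kind of error bar this comment asks for, the following Python sketch computes a paired bootstrap confidence interval for the mean score difference between two models evaluated on the same benchmark items. It is a generic technique, not something the paper is known to use, and the function name and the score lists are invented for the example.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, level=0.95, seed=0):
    """Return (mean difference, lower bound, upper bound) for scores_a - scores_b.

    scores_a[i] and scores_b[i] must be the two models' scores on the same item.
    """
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    resampled_means = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    lower = resampled_means[int((1 - level) / 2 * n_resamples)]
    upper = resampled_means[int((1 + level) / 2 * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Invented per-item scores, for illustration only.
mean_diff, lower, upper = paired_bootstrap_ci(
    [0.6, 0.7, 0.8, 0.9], [0.5, 0.6, 0.7, 0.8], n_resamples=200
)
```

If the resulting interval excludes zero, the difference is unlikely to be an artifact of which items happened to be in the benchmark. An interval that straddles zero would weaken a "comparable to" or "outperforms" claim.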

Simulated Author's Rebuttal

1 response · 0 unresolved

We are grateful to the referee for their thoughtful review and for recognizing the value of our open-source dataset and the TMN-Reweight method. We address the major comment in detail below and outline the revisions we plan to make.

Referee: The claim that 'broader coverage and greater reward diversity substantially benefit long-context capability improvement' and that the dataset outperforms QwenLong-L1.5 'alone' under vanilla GRPO is load-bearing for the central contribution, yet no ablation is described that holds total sample count fixed while varying the number of task types (or reward formulations) to isolate diversity from curation quality or difficulty distribution.

Authors: We agree with the referee that an ablation holding the total sample count fixed while varying the number of task types would provide stronger evidence for the benefits of broader coverage and reward diversity. Our current results demonstrate that the full GoLongRL dataset outperforms QwenLong-L1.5 under identical vanilla GRPO training, but we acknowledge that factors such as curation quality and difficulty distribution may contribute to the observed gains. In the revised manuscript, we will include a new ablation study. Specifically, we will construct subsets of our dataset with varying numbers of task types (e.g., 3, 6, and 9 tasks) while maintaining a fixed total sample count of approximately 23K by proportionally increasing samples from the included tasks. We will report the long-context performance under the same GRPO setup to isolate the impact of task diversity. This will be added to the experiments section, and the abstract will be updated to reflect the additional evidence. We believe this will substantiate our claims more robustly.
revision: yes
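The fixed-budget ablation proposed in the rebuttal could be set up along the following lines. This is a sketch under stated assumptions: the function name `fixed_budget_subset`, the pool layout, and the decision to upsample with replacement when a task's quota exceeds its pool are all hypothetical, and the toy pools stand in for the real task data.

```python
import random


def fixed_budget_subset(pools, keep_tasks, total, seed=0):
    """Draw `total` samples from only `keep_tasks`, in proportion to pool sizes.

    pools: dict mapping task name to a list of samples.
    """
    rng = random.Random(seed)
    sizes = {task: len(pools[task]) for task in keep_tasks}
    pool_total = sum(sizes.values())
    # Proportional quotas; the rounding remainder goes to the largest task.
    quotas = {task: total * sizes[task] // pool_total for task in keep_tasks}
    largest = max(keep_tasks, key=lambda task: sizes[task])
    quotas[largest] += total - sum(quotas.values())

    subset = []
    for task in keep_tasks:
        if quotas[task] <= sizes[task]:
            subset.extend(rng.sample(pools[task], quotas[task]))
        else:
            # Assumption: when the pool is too small, upsample with replacement.
            subset.extend(rng.choices(pools[task], k=quotas[task]))
    return subset


# Toy pools standing in for three task types of different sizes.
pools = {
    "a": [("a", i) for i in range(10)],
    "b": [("b", i) for i in range(30)],
    "c": [("c", i) for i in range(60)],
}
subset = fixed_budget_subset(pools, ["a", "b"], total=100)
```

One caveat the sketch makes visible: holding the total fixed while dropping task types forces repeated samples from the remaining tasks once their pools run out. A drop in performance could then reflect duplication as well as reduced diversity, and the ablation would need to account for that.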

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

The paper's core contributions consist of an openly released 23K-sample dataset spanning 9 taxonomy-guided task types with natural metrics, plus the TMN-Reweight method that applies task-level mean normalization and difficulty-adaptive weighting. Performance assertions rest on direct empirical comparisons against external closed-source datasets (QwenLong-L1.5) and larger models (DeepSeek-R1, Qwen3-235B) under identical vanilla GRPO training, rather than any internal parameter fit, self-referential definition, or self-citation chain. No equations are presented that reduce a claimed prediction or uniqueness result to the inputs by construction, and the interpretation that diversity drives gains is offered as a post-hoc suggestion from the observed outperformance, not as a load-bearing derivation that collapses into its own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard RL assumptions and data synthesis practices; no new physical entities are introduced. TMN-Reweight is a procedural technique rather than a new postulated object.

axioms (1)

domain assumption: The GRPO algorithm provides valid advantage estimation when applied to the heterogeneous reward setting.
Paper uses vanilla GRPO as the baseline setup for comparisons.
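For readers unfamiliar with the baseline, the group-relative advantage at the core of GRPO can be sketched in a few lines of Python. Each rollout's reward is standardized against the other rollouts sampled for the same prompt. The use of the population standard deviation and the `eps` constant are implementation assumptions; the clipped policy-gradient objective and any KL term are omitted.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Standardize one prompt's rollout rewards against their own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


binary_group = grpo_advantages([1.0, 0.0, 1.0, 0.0])  # e.g. an exact-match task
narrow_group = grpo_advantages([0.52, 0.50, 0.52, 0.50])  # e.g. a continuous metric
tied_group = grpo_advantages([0.3, 0.3])  # no signal when all rewards tie
```

Because of the division by the group standard deviation, the second group receives advantages almost as large as the first even though its rewards differ by only 0.02. Whether that amplification is benign is the kind of question the axiom above takes for granted.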

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric... TMN-Reweight... task-level mean normalization... difficulty-adaptive weighting

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

2026 , eprint=

Your Group-Relative Advantage Is Biased , author=. 2026 , eprint=

work page 2026

[2] [2]

Proceedings of the 2018 conference on empirical methods in natural language processing , year=

HotpotQA: A dataset for diverse, explainable multi-hop question answering , author=. Proceedings of the 2018 conference on empirical methods in natural language processing , year=

work page 2018

[3] [3]

Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps

Ho, Xanh and Duong Nguyen, Anh-Khoa and Sugawara, Saku and Aizawa, Akiko. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. Proceedings of the 28th International Conference on Computational Linguistics. 2020

work page 2020

[4] [4]

Transactions of the Association for Computational Linguistics , volume=

MuSiQue: Multihop Questions via Single-hop Question Composition , author=. Transactions of the Association for Computational Linguistics , volume=

work page

[5] [5]

Transactions of the Association for Computational Linguistics , volume=

The narrativeqa reading comprehension challenge , author=. Transactions of the Association for Computational Linguistics , volume=

work page

[6] [6]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page

[7] [7]

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , year=

A dataset of information-seeking questions and answers anchored in research papers , author=. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , year=

work page 2021

[8] [8]

M i L e Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models

Su, Zhenpeng and Wu, Xing and Bai, Xue and Lin, Zijia and Chen, Hui and Ding, Guiguang and Zhou, Wei and Hu, Songlin. M i L e Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models. Findings of the Association for Computational Linguistics: NAACL 2024. 2024

work page 2024

[9] [9]

Focal Loss for Dense Object Detection , year=

Lin, Tsung-Yi and Goyal, Priya and Girshick, Ross and He, Kaiming and Dollár, Piotr , journal=. Focal Loss for Dense Object Detection , year=

work page

[10] [10]

Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics , month =

Lahiri, Shibamouli , title =. Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics , month =. 2014 , publisher =

work page 2014

[11] [11]

2003 , howpublished =

work page 2003

[12] [12]

2024 , eprint=

Evaluating the Performance of Large Language Models on GAOKAO Benchmark , author=. 2024 , eprint=

work page 2024

[13] [13]

2025 , eprint=

Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities , author=. 2025 , eprint=

work page 2025

[14] [14]

Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

Multi-LexSum: Real-world Summaries of Civil Rights Lawsuits at Multiple Granularities , author=. Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

work page

[15] Qiu, Zexuan and Li, Jingjing and Huang, Shijue and Jiao, Xiaoqi and Zhong, Wanjun and King, Irwin. CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024

[16] arXiv-CC0-v0.5. 2024

[17] Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers. 2025

[18] Welbl, Johannes and Stenetorp, Pontus and Riedel, Sebastian. Constructing Datasets for Multi-hop Reading Comprehension Across Documents. Transactions of the Association for Computational Linguistics. 2018

[19] CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction. 2018

[20] Wang, Xixi and Costa, Miguel and Kovaceva, Jordanka and Wang, Shuai and Pereira, Francisco C. Plugging Schema Graph into Multi-Table QA: A Human-Guided Framework for Reducing LLM Reliance. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025

[21] Chen, Zhiyu and Chen, Wenhu and Smiley, Charese and Shah, Sameena and Borova, Iana and Langdon, Dylan and Moussa, Reema and Beane, Matt and Huang, Ting-Hao and Routledge, Bryan and Wang, William Yang. FinQA: A Dataset of Numerical Reasoning over Financial Data. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021

[22] Mohammad Tavakoli and Alireza Salemi and Carrie Ye and Mohamed Abdalla and Hamed Zamani and J Ross Mitchell. Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in

[23] Nature. 2025. doi:10.1038/s41586-025-09422-z

[24] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. 2025

[25] DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. 2025

[26] Proximal Policy Optimization Algorithms. 2017

[27] F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare. 2026

[28] QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management. 2025

[29] Siyuan Wang and Gaokai Zhang and Li Lyna Zhang and Ning Shang and Fan Yang and Dongyao Chen and Mao Yang. Loong

[30] Guanzheng Chen and Michael Qizhe Shieh and Lidong Bing. Long

[31] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024

[32] Understanding R1-Zero-Like Training: A Critical Perspective. 2025

[33] Qiying Yu and Zheng Zhang and Ruofei Zhu and Yufeng Yuan and Xiaochen Zuo and others.

[34] Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization. 2026

[35] ASPO: Asymmetric Importance Sampling Policy Optimization. 2025

[36] LongBench Pro: A More Realistic and Comprehensive Bilingual Long-Context Evaluation Benchmark. 2026

[37] Revisiting Long-context Modeling from Context Denoising Perspective. The Fourteenth International Conference on Learning Representations.

[38] What is Wrong with Perplexity for Long-context Language Modeling? The Thirteenth International Conference on Learning Representations.

[39] Helm, Falko and Daheim, Nico and Gurevych, Iryna. Token Weighting for Long-Range Language Modeling. Findings of the Association for Computational Linguistics: NAACL 2025. 2025

[40] Qwen3 Technical Report. 2025

[41] Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025

[42] LongBench: A bilingual, multitask benchmark for long context understanding. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[43] Michelangelo: Long context evaluations beyond haystacks via latent structure queries. arXiv preprint arXiv:2409.12640.

[44] DocMath-eval: Evaluating math reasoning capabilities of LLMs in understanding long and specialized documents. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[45] CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning. arXiv preprint arXiv:2601.14952.

[46] Every step evolves: Scaling reinforcement learning for trillion-scale thinking model. arXiv preprint arXiv:2510.18855.

[47] Wang, Yubo and Ma, Xueguang and Zhang, Ge and Ni, Yuansheng and Chandra, Abhranil and Guo, Shiguang and Ren, Weiming and Arulraj, Aaran and He, Xuan and Jiang, Ziyan and others.

[48] American Invitational Mathematics Examination (AIME).

[49] Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R.

[50] The Berkeley Function Calling Leaderboard (BFCL): From tool use to agentic evaluation of large language models. Forty-second International Conference on Machine Learning.

[51] LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. arXiv preprint arXiv:2410.10813.

[52] Achiam, Josh and Adler, Steven and Agarwal, Sandhini and Ahmad, Lama and Akkaya, Ilge and Aleman, Florencia Leoni and Almeida, Diogo and Altenschmidt, Janko and Altman, Sam and Anadkat, Shyamal and others.

[53] Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics.

[54] Kwai Summary Attention Technical Report. arXiv preprint arXiv:2604.24432.

[55] CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning. 2026

[56] Zhenpeng Su and Leiyu Pan and Minxuan Lv and Tiehua Mei and Zijia Lin and Yuntao Li and Wenping Hu and Ruiming Tang and Kun Gai and Guorui Zhou. Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning. CoRR. 2025. doi:10.48550/ARXIV.2512.05591