arxiv: 2605.06111 · v1 · submitted 2026-05-07 · 💻 cs.SE · cs.AI

Recognition: unknown

Schedule-and-Calibrate: Utility-Guided Multi-Task Reinforcement Learning for Code LLMs

Yujia Chen , Yang Ye , Xiao Chu , Yuchi Ma , Cuiyun Gao

Authors on Pith no claims yet

Pith reviewed 2026-05-08 09:05 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords multi-task reinforcement learningcode LLMsutility-driven schedulingpolicy optimizationpost-trainingcoding taskstask synergydata allocation

0 comments

The pith

By using task utility to schedule data and calibrate optimization, a single reinforcement learning model for code LLMs can outperform both task-specific specialists and prior multi-task methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that multi-task reinforcement learning for coding tasks becomes more effective when training is coordinated around a task utility signal that reflects both individual learning potential and benefits from other tasks. This matters because separate specialists for each coding task scale poorly in cost, while uniform multi-task approaches waste effort on low-value data and rigid constraints. The proposed method computes utility to drive two modules that allocate training resources hierarchically and adjust optimization per task dynamically. If the approach holds, one model could deliver higher performance across code generation, repair, and related tasks than either dedicated experts or existing joint-training baselines. Experiments on standard LLMs confirm the gains through direct comparisons on representative benchmarks.

Core claim

ASTOR demonstrates that centering multi-task RL on task utility enables a hierarchical utility-routed data scheduling module to allocate training budget and prioritize informative prompts, while an adaptive utility-calibrated policy optimization module dynamically scales per-task KL regularization to match each task's current state, allowing one shared model to advance simultaneously on all coding tasks and exceed both the best task-specific specialist and prior multi-task baselines.

What carries the argument

Task utility, a signal that quantifies each task's learning potential and cross-task synergy and is used to route data scheduling decisions and calibrate per-task optimization constraints.

If this is right

A single model trained this way can serve multiple coding tasks at higher performance than any one specialist.
Training budget is concentrated on prompts with highest current utility, raising data efficiency.
Per-task regularization adjusts automatically to avoid over- or under-constraining updates.
Cross-task synergies are actively used rather than treated as incidental side effects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The utility-driven coordination pattern could transfer to multi-task RL post-training for non-coding domains such as mathematical problem solving.
If utility can be estimated with low overhead, the same modules might support adding new tasks incrementally without restarting training.
Lower reliance on task-specific fine-tuning would reduce the total compute needed to deploy capable code assistants across varied use cases.

Load-bearing premise

That a computable task utility signal can reliably capture per-task learning potential and cross-task synergy without introducing training instability or unintended policy biases.

What would settle it

Running the full ASTOR training procedure on the two LLMs and four coding tasks and finding that the resulting unified model does not exceed the average performance of the strongest task-specific specialist or baseline on the same benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.06111 by Cuiyun Gao, Xiao Chu, Yang Ye, Yuchi Ma, Yujia Chen.

**Figure 1.** Figure 1: Performance of task-specific RL models across four coding tasks. STL models perform view at source ↗

**Figure 2.** Figure 2: Overview of ASTOR. (I) hierarchically schedules training data at task and prompt levels view at source ↗

**Figure 3.** Figure 3: Training dynamics of ASTOR on Qwen2.5-Coder-7B. Impact of τ . As shown in view at source ↗

read the original abstract

Reinforcement learning (RL) with verifiable rewards has proven effective at post-training LLMs for coding, yet deploying separate task-specific specialists incurs costs that scale with the number of tasks, motivating a unified multi-task RL (MTRL) approach. However, existing MTRL methods treat all coding tasks uniformly, relying on fixed data curricula under a shared optimization strategy, ultimately limiting the effectiveness of multi-task training. To address these limitations, we propose ASTOR, a multi-tASk code reinforcement learning framework via uTility-driven coORdination. Centered on task utility, a signal capturing each task learning potential and cross-task synergy, ASTOR comprises two coupled modules: 1) Hierarchical Utility-Routed Data Scheduling module hierarchically allocates training budget and prioritizes informative prompts, steering training toward the most valuable data and 2) Adaptive Utility-Calibrated Policy Optimization module dynamically scales per-task KL regularization, matching update constraints to each tasks current training state. Experiments on two widely-used LLMs across four representative coding tasks demonstrate that ASTOR consistently improves a single model across all tasks, outperforming the best task-specific specialist by 9.0%-9.5% and surpassing the strongest MTRL baseline by 7.5%-12.8%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ASTOR couples a task utility signal to hierarchical data scheduling and per-task KL calibration for multi-task code RL, but the signal itself lacks any visible definition or validation.

read the letter

The paper's main point is that ASTOR uses one computable task utility to run both a hierarchical scheduler that allocates training data and an adaptive module that scales KL penalties per task. This is meant to avoid the uniform treatment that limits existing multi-task RL for code LLMs and to beat separate task-specific models without extra cost at inference time. The two modules are presented as coupled through the same utility, which is described as capturing learning potential and cross-task synergy. That framing directly targets the scaling problem when you have multiple coding tasks like generation, repair, and summarization. The reported results on two LLMs and four tasks show the single model beating the best specialist by 9-9.5% and the strongest baseline by 7.5-12.8%. If those numbers are reproducible, the coordination pattern could be picked up by groups doing post-training on code models. The approach is new in how it ties scheduling and regularization to one utility rather than using fixed curricula or shared optimizers. The experiments are positioned as consistent across models and tasks, which is a practical strength for this subfield. The central weakness is the utility signal. No formula, features, or correlation check with reward curves or gradients appears in the abstract, so it is impossible to judge whether it actually drives the gains or whether the improvements trace to total data volume or tuning. The stress-test concern about misspecification leading to bad allocation or instability therefore stands on the current evidence. Without ablations, training curves, or statistical tests, the empirical claim remains thin. This work is aimed at researchers who train code LLMs and want one model to handle several tasks efficiently. A reader already working on multi-task RL or RL for code would get the most from the architecture description, even if they treat the numbers as preliminary. It deserves peer review because the problem is concrete and the proposed coordination is testable, though the paper will need to supply the missing utility details and experiment controls to be convincing.

Referee Report

2 major / 2 minor

Summary. The paper proposes ASTOR, a multi-task RL framework for post-training code LLMs. It centers on a task utility signal that purportedly encodes per-task learning potential and cross-task synergy, implemented via two modules: hierarchical utility-routed data scheduling (to allocate training budget and prioritize prompts) and adaptive utility-calibrated policy optimization (to dynamically scale per-task KL regularization). Experiments on two LLMs across four coding tasks are claimed to show that a single ASTOR-trained model outperforms the best task-specific specialist by 9.0%-9.5% and the strongest MTRL baseline by 7.5%-12.8%.

Significance. If the empirical results hold after proper validation, the work could meaningfully advance efficient multi-task post-training of code LLMs by replacing uniform curricula and fixed optimization with utility-driven coordination, thereby reducing the cost of maintaining separate task specialists. The approach directly targets a practical scaling issue in RL for coding agents.

major comments (2)

[Abstract] Abstract: The central empirical claim (outperformance by 9.0%-9.5% over specialists and 7.5%-12.8% over MTRL baselines) is stated without any reference to tables, figures, baselines, number of runs, statistical tests, training curves, or ablation studies. This leaves the primary result unsupported by visible evidence and prevents assessment of whether gains are attributable to the proposed modules rather than total data volume or hyperparameter choices.
[Abstract] Abstract (and implied methods): Both core modules depend on the task utility signal, yet no explicit formula, input features, computation procedure, or validation (e.g., correlation with per-task reward curves or gradient norms) is supplied. Without this, it is impossible to determine whether the signal reliably captures learning potential and synergy or whether the reported uniform improvements could arise from misspecification leading to misallocated budgets or mismatched KL constraints.

minor comments (2)

[Abstract] Abstract: The forced capitalization used to form the ASTOR acronym (multi-tASk ... uTility-driven coORdination) is nonstandard and reduces readability.
[Abstract] Abstract: The manuscript title 'Schedule-and-Calibrate: Utility-Guided Multi-Task Reinforcement Learning for Code LLMs' does not align with the ASTOR acronym and framework name introduced in the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the presentation of our results and methods.

read point-by-point responses

Referee: [Abstract] Abstract: The central empirical claim (outperformance by 9.0%-9.5% over specialists and 7.5%-12.8% over MTRL baselines) is stated without any reference to tables, figures, baselines, number of runs, statistical tests, training curves, or ablation studies. This leaves the primary result unsupported by visible evidence and prevents assessment of whether gains are attributable to the proposed modules rather than total data volume or hyperparameter choices.

Authors: We agree that the abstract would be strengthened by explicit pointers to the supporting evidence, even if space is limited. In the revised manuscript we will add concise references within the abstract to Table 1 (main results), Figure 2 (training curves), Section 4.1 (baselines), and Section 5 (ablations and statistical analysis). The experimental protocol already uses five independent runs with reported standard deviations and paired significance tests; all methods share the same total data volume and compute budget, so gains cannot be attributed to extra data. We will also ensure the main text explicitly states these controls. revision: yes
Referee: [Abstract] Abstract (and implied methods): Both core modules depend on the task utility signal, yet no explicit formula, input features, computation procedure, or validation (e.g., correlation with per-task reward curves or gradient norms) is supplied. Without this, it is impossible to determine whether the signal reliably captures learning potential and synergy or whether the reported uniform improvements could arise from misspecification leading to misallocated budgets or mismatched KL constraints.

Authors: We acknowledge that the current abstract and methods section do not provide a sufficiently explicit description of the task utility signal. In the revised manuscript we will add the full mathematical definition of the utility signal, the precise input features used, the step-by-step computation procedure, and additional validation experiments (including correlations with per-task reward curves and gradient norms) to Section 3. We will also insert a brief high-level description of the signal into the abstract so readers can immediately understand its role in the two modules. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external benchmarks

full rationale

The manuscript contains no equations, derivations, or parameter-fitting procedures that could reduce to self-definition or fitted-input predictions. Task utility is introduced conceptually as a signal for scheduling and calibration, but the paper reports no formula, no self-referential computation of the signal from its own outputs, and no uniqueness theorem or ansatz imported via self-citation. All load-bearing claims are experimental comparisons on held-out coding benchmarks against task-specific specialists and MTRL baselines; these outcomes are falsifiable outside the paper and do not rely on any internal reduction to the inputs. The framework is therefore self-contained against external validation rather than circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the ledger is therefore empty.

pith-pipeline@v0.9.0 · 5536 in / 1172 out tokens · 66512 ms · 2026-05-08T09:05:17.871630+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

37 extracted references · 12 canonical work pages · 5 internal anchors

[1]

Metagpt: Meta programming for A multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for A multi-agent collaborative framework. InThe Twelfth International Conference on Learning Representations, ICL...

2024
[2]

OpenReview.net, 2024

2024
[3]

The current challenges of software engineering in the era of large language models.ACM Trans

Cuiyun Gao, Xing Hu, Shan Gao, Xin Xia, and Zhi Jin. The current challenges of software engineering in the era of large language models.ACM Trans. Softw. Eng. Methodol., 34(5):127:1–127:30, 2025

2025
[4]

Software development life cycle perspective: A survey of benchmarks for code large language models and agents

Kaixin Wang, Tianlin Li, Xiaoyu Zhang, Chong Wang, Weisong Sun, Yang Liu, and Bin Shi. Software development life cycle perspective: A survey of benchmarks for code large language models and agents. CoRR, abs/2505.05283, 2025

work page arXiv 2025
[5]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

2025
[6]

CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment

Xue Jiang, Yihong Dong, Mengyang Liu, Hongyi Deng, Tian Wang, Yongding Tao, Rongyu Cao, Binhua Li, Zhi Jin, Wenpin Jiao, Fei Huang, Yongbin Li, and Ge Li. Coderl+: Improving code generation via reinforcement with execution semantics alignment.CoRR, abs/2510.18471, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, and ...

work page internal anchor Pith review arXiv 2025
[8]

RLEF: grounding code llms in execution feedback with reinforcement learning

Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, and Gabriel Synnaeve. RLEF: grounding code llms in execution feedback with reinforcement learning. InForty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025. OpenReview.net, 2025

2025
[9]

Learning to generate unit test via adversarial reinforcement learning.CoRR, abs/2508.21107, 2025

Dongjun Lee, Changho Hwang, and Kimin Lee. Learning to generate unit test via adversarial reinforcement learning.CoRR, abs/2508.21107, 2025

work page arXiv 2025
[10]

Multitask learning.Mach

Rich Caruana. Multitask learning.Mach. Learn., 28(1):41–75, 1997

1997
[11]

Sharing experience in multitask reinforcement learning

Tung-Long Vuong, Do Van Nguyen, Tai-Long Nguyen, Cong-Minh Bui, Hai-Dang Kieu, Viet-Cuong Ta, Quoc-Long Tran, and Thanh Ha Le. Sharing experience in multitask reinforcement learning. In Sarit Kraus, editor,Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pages 3642–36...

2019
[12]

Multi-task reinforcement learning with soft modularization

Ruihan Yang, Huazhe Xu, Yi Wu, and Xiaolong Wang. Multi-task reinforcement learning with soft modularization. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors,Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December...

2020
[13]

Hard tasks first: Multi-task reinforce- ment learning through task scheduling

Myungsik Cho, Jongeui Park, Suyoung Lee, and Youngchul Sung. Hard tasks first: Multi-task reinforce- ment learning through task scheduling. InForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024

2024
[14]

Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu

Yee Whye Teh, Victor Bapst, Wojciech M. Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V . N. Vishwanathan, and Roman Garnett, editors,Advances in Neural Information Processing Sy...

2017
[15]

Imbalanced gradients in RL post-training of multi-task llms.CoRR, abs/2510.19178, 2025

Runzhe Wu, Ankur Samanta, Ayush Jain, Scott Fujimoto, Jeongyeol Kwon, Ben Kretzu, Youliang Yu, Kaveh Hassani, Boris Vidolov, and Yonathan Efroni. Imbalanced gradients in RL post-training of multi-task llms.CoRR, abs/2510.19178, 2025

work page arXiv 2025
[16]

Omni-thinker: Scaling cross-domain generalization in llms via multi-task rl with hybrid rewards

Derek Li, Jiaming Zhou, Amirreza Kazemi, Qianyi Sun, Abbas Ghaddar, Liheng Ma, Yu Luo, Dong Li, Jianye HAO, and Yingxue Zhang. Omni-thinker: Scaling cross-domain generalization in llms via multi-task rl with hybrid rewards. In2nd AI for Math Workshop@ ICML 2025, 2025. 10

2025
[17]

Qwen2.5-Coder Technical Report

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5-coder technical report.CoRR, abs/2409.12186, 2024

work page internal anchor Pith review arXiv 2024
[18]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jian Yang, Jiaxi Yang, Jingren Zhou, and et al Junyang Lin. Qwen3 technical report.CoRR, abs/2505...

work page internal anchor Pith review arXiv 2025
[19]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. CoRR, abs/2402.03300, 2024

work page internal anchor Pith review arXiv 2024
[20]

Gradient surgery for multi-task learning

Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors,Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, ...

2020
[21]

Just pick a sign: Optimizing deep multitask models with gradient sign dropout

Zhao Chen, Jiquan Ngiam, Yanping Huang, Thang Luong, Henrik Kretzschmar, Yuning Chai, and Dragomir Anguelov. Just pick a sign: Optimizing deep multitask models with gradient sign dropout. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors,Advances in Neural Information Processing Systems 33: Annual Co...

2020
[22]

Conflict-averse gradient descent for multi-task learning

Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors,Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurI...

2021
[23]

The exponentially weighted moving average.Journal of quality technology, 18(4):203–210, 1986

J Stuart Hunter. The exponentially weighted moving average.Journal of quality technology, 18(4):203–210, 1986

1986
[24]

Leetcodedataset: A temporal dataset for robust evaluation and efficient training of code llms.arXiv preprint arXiv:2504.14655, 2025

Yunhui Xia, Wei Shen, Yan Wang, Jason Klein Liu, Huifeng Sun, Siyue Wu, Jian Hu, and Xiaolong Xu. Leetcodedataset: A temporal dataset for robust evaluation and efficient training of code llms.CoRR, abs/2504.14655, 2025

work page arXiv 2025
[25]

Livecodebench: Holistic and contamination free evaluation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar- Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

2025
[26]

Yujia Li, David H. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Mas- son d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, ...

work page arXiv 2022
[27]

Automated commit message generation with large language models: An empirical study and beyond.IEEE Trans

Pengyu Xue, Linhao Wu, Zhongxing Yu, Zhi Jin, Zhen Yang, Xinyi Li, Zhenyu Yang, and Yue Tan. Automated commit message generation with large language models: An empirical study and beyond.IEEE Trans. Software Eng., 50(12):3208–3224, 2024

2024
[28]

Cruxeval: A benchmark for code reasoning, understanding and execution

Alex Gu, Baptiste Rozière, Hugh James Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida Wang. Cruxeval: A benchmark for code reasoning, understanding and execution. InForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024

2024
[29]

Aider llm leaderboard, 2025

Paul Gauthier. Aider llm leaderboard, 2025. URLhttps://aider.chat/docs/leaderboards/

2025
[30]

René Just, Darioush Jalali, and Michael D. Ernst. Defects4j: a database of existing faults to enable con- trolled testing studies for java programs. In Corina S. Pasareanu and Darko Marinov, editors,International Symposium on Software Testing and Analysis, ISSTA ’14, San Jose, CA, USA - July 21 - 26, 2014, pages 437–440. ACM, 2014

2014
[31]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pages 311–318. ACL, 2002. 11

2002
[32]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004

2004
[33]

METEOR: an automatic metric for MT evaluation with improved correlation with human judgments

Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare R. V oss, editors, Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005, Ann Arbor, Michigan, USA...

2005
[34]

Unlocking efficient long-to-short llm reasoning with model merging.arXiv preprint arXiv:2503.20641,

Han Wu, Yuxuan Yao, Shuqi Liu, Zehua Liu, Xiaojin Fu, Xiongwei Han, Xing Li, Hui-Ling Zhen, Tao Zhong, and Mingxuan Yuan. Unlocking efficient long-to-short LLM reasoning with model merging.CoRR, abs/2503.20641, 2025

work page arXiv 2025
[35]

hi") ==

Laingjun Feng, Chenyi Pan, Xinjie Guo, Fei Mei, Benzhe Ning, Jianxiang Zhang, Xinyang Liu, Beirong Zhou, Zeng Shu, Chang Liu, Guang Yang, Zhenyu Han, Jiangben Wang, and Bo Wang. Mindspeed RL: distributed dataflow for scalable and efficient RL training on ascend NPU cluster.CoRR, abs/2507.19017, 2025. 12 Appendix A Methodology Details A.1 Problem Formulati...

work page arXiv 2025
[36]

Hello" to

Switch the greeting text from "Hello" to "Hey". show_greeting.py “‘ import sys def greeting(name): print(f"Hey {name}") if __name__ == ’__main__’: greeting(sys.argv[1]) “‘ <|im_end|> <|im_start|> user I switched to a new code base. Please don’t consider the above files or try to edit them any longer. <|im_end|> <|im_start|> assistant Ok. <|im_end|> <|im_s...
[37]

30 Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...