
arxiv: 2606.25178 · v2 · pith:3O4EQJ5Jnew · submitted 2026-06-23 · 💻 cs.AI

Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

Pith reviewed 2026-06-30 09:53 UTC · model grok-4.3

classification 💻 cs.AI
keywords reinforcement learning · curriculum learning · multi-domain reasoning · transferability · RLVR · GRPO · bandit curriculum

The pith

A curriculum that prioritizes domains by how much their gradients align with others improves macro accuracy in multi-domain RLVR over learnability-only baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Transfer-Aware Curriculum (TAC) for RLVR training on a six-domain reasoning suite spanning mathematics, programming, and science. TAC is a bandit-style method that combines local learnability signals from per-domain advantages with a transferability estimate derived from projected gradients computed during each GRPO step. The transfer term measures gradient-geometry alignment, so the curriculum favors domains whose updates benefit the full suite rather than only the current domain. On Qwen3-1.7B and Llama3.2-3B, TAC records the highest macro-averaged accuracy, outperforming proportional random sampling, a hand-designed schedule, and a learnability-only bandit, with a margin of up to 2.8 points (10% relative) over the learnability-only bandit. Removing the transfer term sharply reduces performance, and TAC stays stable under imbalanced domain mixtures where learnability-only methods over-commit to dominant domains.

Core claim

Transfer-Aware Curriculum (TAC) is a bandit-style online curriculum that prioritizes domains whose updates broadly benefit the rest of the training suite. It repurposes signals already produced by RL training: per-domain advantages capture local learnability, and projected gradients, taken from the GRPO step being computed, estimate cross-domain transferability via gradient-geometry alignment, at negligible cost. Across a six-domain reasoning suite, TAC achieves the best macro-averaged accuracy on both Qwen3-1.7B and Llama3.2-3B, outperforming proportional random sampling, a hand-designed schedule, and a learnability-only bandit, and improving over the last of these by up to 2.8 points (10% relative).
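To make the selection idea concrete, here is a minimal sketch of a bandit-style sampler that adds a transfer term to a learnability term. The page does not give the paper's actual scoring rule, so everything here is an assumption for illustration: learnability is taken as the mean absolute advantage, the two terms are combined additively with a hypothetical weight `lam`, and sampling probabilities come from a softmax with a uniform exploration floor. The domain names and numbers are toy values, not results from the paper.

```python
import math
import random


def learnability(advantages):
    """Mean absolute advantage (assumed proxy); zero when all rollouts tie."""
    return sum(abs(a) for a in advantages) / len(advantages) if advantages else 0.0


def domain_probs(learn, transfer, lam=1.0, temperature=0.5, explore=0.1):
    """Softmax over additive scores, mixed with a uniform exploration floor.

    lam, temperature and explore are hypothetical knobs, not values from the paper.
    """
    names = sorted(learn)
    scores = [learn[n] + lam * transfer[n] for n in names]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    uniform = 1.0 / len(names)
    return {n: (1.0 - explore) * e / total + explore * uniform
            for n, e in zip(names, exps)}


# Toy per-domain advantages and transfer estimates (made up for illustration).
advantages = {
    "math": [0.8, -0.8, 0.5, -0.5],
    "code": [0.1, -0.1, 0.0, 0.0],
    "science": [0.4, -0.4, 0.4, -0.4],
}
transfer = {"math": 0.1, "code": 0.6, "science": -0.2}

learn = {d: learnability(a) for d, a in advantages.items()}
probs = domain_probs(learn, transfer)                 # learnability + transfer
learn_only = domain_probs(learn, transfer, lam=0.0)   # learnability-only variant

# Draw the next training domain from the transfer-aware distribution.
next_domain = random.Random(0).choices(list(probs), weights=list(probs.values()), k=1)[0]
```

In this toy setup the "code" domain has low learnability but a high transfer estimate, so its sampling probability rises once the transfer term is switched on, which is the qualitative behavior the summary attributes to TAC.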

What carries the argument

Transfer-Aware Curriculum (TAC), a bandit-style scheduler that augments local learnability advantages with cross-domain transferability estimated from projected gradient alignment during GRPO updates.
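The following sketch shows one way a projected-gradient alignment score could be computed. It is not the paper's implementation: the random sign projection, the cosine similarity, and the "average alignment with all other domains" score are assumptions chosen for illustration, and the helper names (`make_projection`, `project`, `transfer_scores`) are hypothetical. The gradients are synthetic vectors, not outputs of a GRPO step.

```python
import math
import random


def make_projection(dim, k, seed=0):
    """Random sign matrix (k x dim), scaled by 1/sqrt(k)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[scale * rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(k)]


def project(matrix, grad):
    """Map a full gradient vector to its k-dimensional sketch."""
    return [sum(w * g for w, g in zip(row, grad)) for row in matrix]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)


def transfer_scores(sketches):
    """Per domain: mean cosine alignment of its sketch with every other domain."""
    scores = {}
    for name, vec in sketches.items():
        others = [cosine(vec, o) for other, o in sketches.items() if other != name]
        scores[name] = sum(others) / len(others) if others else 0.0
    return scores


# Synthetic gradients: three domains share a direction, one opposes it.
rng = random.Random(1)
dim, k = 200, 32
base = [rng.gauss(0.0, 1.0) for _ in range(dim)]


def noisy(sign):
    return [sign * b + 0.3 * rng.gauss(0.0, 1.0) for b in base]


grads = {"math": noisy(1.0), "code": noisy(1.0), "science": noisy(1.0), "other": noisy(-1.0)}

P = make_projection(dim, k)
sketches = {name: project(P, g) for name, g in grads.items()}
scores = transfer_scores(sketches)
```

Working on the 32-dimensional sketches instead of the full gradients is what would keep the cost small; the domain whose gradient opposes the rest receives a negative score, while the aligned domains receive positive ones.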

If this is right

  • TAC records the highest macro-averaged accuracy across the six-domain suite on both models tested, Qwen3-1.7B and Llama3.2-3B.
  • Performance drops sharply when the transferability term is removed from the selection rule.
  • TAC maintains stable behavior on deliberately imbalanced domain mixtures where learnability-only methods over-commit to dominant domains.
  • The added computation for projected gradients incurs less than 1 percent wall-clock overhead.
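Because the headline metric is macro-averaged accuracy, a small worked example may help show why it differs from pooled (micro) accuracy when domains are imbalanced. The counts below are invented for illustration and are not from the paper; the function names are hypothetical.

```python
def macro_accuracy(results):
    """Average of per-domain accuracies; every domain counts equally."""
    per_domain = [correct / total for correct, total in results.values()]
    return sum(per_domain) / len(per_domain)


def micro_accuracy(results):
    """Pooled accuracy over all examples; large domains dominate."""
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    return correct / total


# (correct, total) per domain; made-up counts with one dominant domain.
results = {"math": (450, 500), "code": (30, 100), "science": (20, 100)}

macro = macro_accuracy(results)  # (0.9 + 0.3 + 0.2) / 3
micro = micro_accuracy(results)  # 500 / 700
```

Here the pooled score looks strong because the large domain is easy, while the macro average exposes the weak domains. A curriculum that over-commits to a dominant domain can therefore look fine on pooled accuracy and still lose on the macro average.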

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Gradient-alignment signals could be tested as a lightweight transfer proxy in other multi-task RL settings that lack explicit cross-domain evaluation.
  • If the current projection method underestimates certain forms of transfer, richer alignment statistics such as cosine similarity after layer-wise normalization could be substituted.
  • The same selection logic might be applied at the level of individual problems rather than whole domains once per-problem gradient projections become feasible.
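As an illustration of the second extension above, the sketch below contrasts a cosine similarity taken over the whole flattened gradient with one computed per layer and then averaged, which is one reading of "cosine similarity after layer-wise normalization". This is an editorial sketch, not something the paper is known to do; the layer names and gradient values are toy assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)


def flat_cosine(grad_a, grad_b):
    """Cosine over all layers concatenated; large-norm layers dominate."""
    layers = sorted(grad_a)
    a = [x for name in layers for x in grad_a[name]]
    b = [x for name in layers for x in grad_b[name]]
    return cosine(a, b)


def layerwise_cosine(grad_a, grad_b):
    """Mean of per-layer cosines; each layer contributes equally."""
    layers = sorted(grad_a)
    return sum(cosine(grad_a[n], grad_b[n]) for n in layers) / len(layers)


# Toy gradients: one large-norm layer conflicts, two small layers agree.
grad_math = {"embed": [10.0, 0.0], "attn": [0.2, 0.0], "mlp": [0.1, 0.1]}
grad_code = {"embed": [-10.0, 0.0], "attn": [0.2, 0.0], "mlp": [0.1, 0.1]}

flat = flat_cosine(grad_math, grad_code)            # close to -1
layerwise = layerwise_cosine(grad_math, grad_code)  # (-1 + 1 + 1) / 3
```

The flattened score reports near-total conflict because the single large layer swamps everything else, whereas the layer-wise score records that two of three layers agree. Which of the two tracks real transfer better is an empirical question this page does not answer.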

Load-bearing premise

Projected gradients computed on one domain during a GRPO step reliably predict whether that update will improve performance on the other domains.

What would settle it

A controlled run in which domains chosen by high projected-gradient alignment produce no measurable accuracy gain on the remaining domains after the same number of updates.

Original abstract

Reinforcement learning with verifiable rewards (RLVR) has been extended from single-domain training to multi-domain reasoning suites spanning mathematics, programming, and science. However, the training curriculum (how often each domain is sampled) is typically fixed or hand-tuned, even though reasoning skills transfer unevenly across domains. Existing learnability-based curricula adapt to where the policy is currently improving, but are blind to whether a gradient step on the selected domain benefits the remaining domains. In this paper, we propose Transfer-Aware Curriculum (TAC), a bandit-style online curriculum that prioritizes domains whose updates broadly benefit the rest of the training suite. TAC repurposes signals already produced by RL training: per-domain advantages capture local learnability, and projected gradients, taken from the GRPO step being computed, estimate cross-domain transferability via gradient-geometry alignment, at negligible cost (<1% wall-clock overhead). Across a six-domain reasoning suite, TAC achieves the best macro-averaged accuracy on both Qwen3-1.7B and Llama3.2-3B, outperforming proportional random sampling, a hand-designed schedule, and a learnability-only bandit, and improving over the last of these by up to 2.8 points (10% relative). Ablations show performance degrades sharply when the transferability term is removed, and TAC remains robust on imbalanced training mixtures where learnability-only curricula over-commit to dominant domains. Our findings establish cross-domain transferability as a key signal for curriculum design in multi-domain RLVR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes Transfer-Aware Curriculum (TAC), a bandit-style online curriculum for multi-domain RLVR. TAC combines per-domain advantages (capturing local learnability) with projected gradients from GRPO steps (estimating cross-domain transferability via gradient-geometry alignment at <1% overhead). On a six-domain reasoning suite, TAC reports the highest macro-averaged accuracy on both Qwen3-1.7B and Llama3.2-3B, outperforming proportional random sampling, a hand-designed schedule, and a learnability-only bandit (by up to 2.8 points / 10% relative), while remaining robust on imbalanced mixtures.

Significance. If the gradient-geometry alignment reliably predicts positive cross-domain transfer, TAC offers an efficient, low-overhead method to incorporate transfer signals into multi-domain curricula, addressing a limitation of purely learnability-based approaches. The reported gains across two model scales and the ablation results provide empirical support for the combined signal's utility in reasoning domains.

major comments (1)
  1. [Abstract] The central claim that projected gradients yield a reliable proxy for cross-domain transferability (via gradient-geometry alignment) lacks direct supporting evidence such as per-pair correlations, regressions, or transfer-delta statistics linking alignment scores to observed improvements on other domains; the ablation showing degradation when the transfer term is removed does not isolate whether the benefit arises from accurate transfer prediction versus ancillary effects such as increased sampling diversity or regularization.
minor comments (1)
  1. [Abstract] No information is provided on statistical significance testing for the accuracy improvements, precise definitions of the six domains, or rules for data exclusion or mixture construction.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of TAC's empirical results. We address the concern about direct evidence for the gradient-alignment proxy below.

Point-by-point responses
  1. Referee: [Abstract] The central claim that projected gradients yield a reliable proxy for cross-domain transferability (via gradient-geometry alignment) lacks direct supporting evidence such as per-pair correlations, regressions, or transfer-delta statistics linking alignment scores to observed improvements on other domains; the ablation showing degradation when the transfer term is removed does not isolate whether the benefit arises from accurate transfer prediction versus ancillary effects such as increased sampling diversity or regularization.

    Authors: We agree that the manuscript's support for the proxy is primarily indirect, via end-to-end gains and the transfer-term ablation. The ablation isolates the contribution of the transfer signal but does not rule out ancillary effects. In revision we will add a dedicated analysis subsection reporting (i) per-domain-pair alignment scores, (ii) Pearson and Spearman correlations with measured transfer deltas (accuracy lift on the target domain after source-domain updates), and (iii) simple linear regressions of transfer delta on alignment. If the correlations are only moderate, we will discuss this limitation explicitly. This addition directly addresses the request for transfer-delta statistics without altering the method or the claimed results.

    revision: yes
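The analysis promised in the rebuttal could look roughly like the sketch below: Pearson and Spearman correlations between alignment scores and measured transfer deltas, plus a least-squares line. The data points are hypothetical placeholders, not measurements from the paper, and the helper names are illustrative; the statistics themselves are the standard definitions, written out with the standard library only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """1-based ranks, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def linear_fit(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


# Hypothetical (source, target) domain pairs: alignment score vs. accuracy lift in points.
alignment = [-0.4, -0.1, 0.0, 0.2, 0.5, 0.7]
transfer_delta = [-1.0, -0.2, 0.1, 0.3, 1.2, 1.1]

r = pearson(alignment, transfer_delta)
rho = spearman(alignment, transfer_delta)
slope, intercept = linear_fit(alignment, transfer_delta)
```

A strong positive correlation on real per-pair data would support the load-bearing premise directly; a weak one would suggest the ablation gains come from something other than accurate transfer prediction.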

Circularity Check

0 steps flagged

No circularity: empirical method with online signals and independent test metrics

Full rationale

The paper defines TAC as an online bandit curriculum that reuses per-step RL signals (advantages for learnability, projected gradients for transfer via alignment) already computed during GRPO. The headline results are macro-averaged test accuracies on held-out reasoning benchmarks, compared against baselines including a learnability-only ablation. No equation or claim reduces the reported gains to a post-hoc fit, self-citation chain, or input by construction; the transfer term is an explicit additive component whose removal is ablated. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; the method rests on standard RL assumptions plus the unstated premise that GRPO gradient projections are a faithful proxy for cross-domain benefit. No explicit free parameters or invented entities are named.

axioms (1)
  • domain assumption: Projected gradients from a single GRPO step on one domain estimate whether that update benefits other domains.
    Central to the TAC selection rule; location: abstract description of TAC

pith-pipeline@v0.9.1-grok · 5822 in / 1305 out tokens · 30280 ms · 2026-06-30T09:53:11.797601+00:00 · methodology

