Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR
Pith reviewed 2026-06-30 09:53 UTC · model grok-4.3
The pith
A curriculum that prioritizes domains by how much their gradients align with others improves macro accuracy in multi-domain RLVR over learnability-only baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Transfer-Aware Curriculum (TAC) is a bandit-style online curriculum that prioritizes domains whose updates broadly benefit the rest of the training suite. It repurposes signals already produced by RL training: per-domain advantages capture local learnability, and projected gradients, taken from the GRPO step being computed, estimate cross-domain transferability via gradient-geometry alignment, at negligible cost. Across a six-domain reasoning suite, TAC achieves the best macro-averaged accuracy on both Qwen3-1.7B and Llama3.2-3B, outperforming proportional random sampling, a hand-designed schedule, and a learnability-only bandit, and improving over the last of these by up to 2.8 points (10% relative).
What carries the argument
Transfer-Aware Curriculum (TAC), a bandit-style scheduler that augments local learnability advantages with cross-domain transferability estimated from projected gradient alignment during GRPO updates.
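The combination described above can be illustrated with a minimal sketch. This is not the paper's selection rule, which this page does not reproduce. It assumes a hypothetical additive score (mean absolute advantage plus a weighted mean cosine alignment with the other domains' projected gradients) turned into sampling probabilities by a softmax. The names `sampling_probs`, `lam`, and `temperature`, and all numbers, are illustrative assumptions.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def learnability(advantages):
    """Hypothetical learnability proxy: mean absolute advantage per domain."""
    return {d: sum(abs(a) for a in adv) / len(adv) for d, adv in advantages.items()}


def transferability(proj_grads):
    """Hypothetical transfer proxy: mean cosine alignment with the other domains."""
    scores = {}
    for d, g in proj_grads.items():
        others = [cosine(g, h) for e, h in proj_grads.items() if e != d]
        scores[d] = sum(others) / len(others) if others else 0.0
    return scores


def sampling_probs(advantages, proj_grads, lam=1.0, temperature=0.5):
    """Softmax over (learnability + lam * transferability); an assumed form."""
    learn = learnability(advantages)
    trans = transferability(proj_grads)
    scores = {d: learn[d] + lam * trans[d] for d in learn}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {d: math.exp((s - top) / temperature) for d, s in scores.items()}
    total = sum(exps.values())
    return {d: e / total for d, e in exps.items()}


# Made-up inputs: equal learnability, but "logic" conflicts with the other two.
advantages = {
    "math": [0.5, -0.5, 0.4],
    "code": [0.5, -0.5, 0.4],
    "logic": [0.5, -0.5, 0.4],
}
proj_grads = {"math": [1.0, 0.2], "code": [0.9, 0.3], "logic": [-1.0, 0.1]}

probs = sampling_probs(advantages, proj_grads)
next_domain = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
```

With equal learnability across domains, the transfer term alone lowers the sampling probability of the domain whose projected gradient opposes the others. A learnability-only rule cannot make that distinction.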
If this is right
- TAC records the highest macro-averaged accuracy across the six-domain suite on two different model sizes.
- Performance drops sharply when the transferability term is removed from the selection rule.
- TAC maintains stable behavior on deliberately imbalanced domain mixtures where learnability-only methods over-commit to dominant domains.
- The added computation for projected gradients incurs less than 1 percent wall-clock overhead.
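Because the headline metric is macro-averaged, a small worked example may help show why it is sensitive to neglected minority domains. The counts below are invented for illustration and are not results from the paper.

```python
def macro_accuracy(per_domain):
    """Unweighted mean of per-domain accuracies; every domain counts equally."""
    accs = [correct / total for correct, total in per_domain.values()]
    return sum(accs) / len(accs)


def micro_accuracy(per_domain):
    """Pooled accuracy over all examples; large domains dominate."""
    correct = sum(c for c, _ in per_domain.values())
    total = sum(n for _, n in per_domain.values())
    return correct / total


# Hypothetical (correct, total) counts on an imbalanced evaluation mixture.
results = {"math": (900, 1000), "code": (40, 100), "logic": (10, 50)}

macro = macro_accuracy(results)  # (0.9 + 0.4 + 0.2) / 3
micro = micro_accuracy(results)  # 950 / 1150
```

In this toy case the pooled score looks strong because the largest domain dominates, while the macro average exposes weak performance on the two smaller domains.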
Where Pith is reading between the lines
- Gradient-alignment signals could be tested as a lightweight transfer proxy in other multi-task RL settings that lack explicit cross-domain evaluation.
- If the current projection method underestimates certain forms of transfer, richer alignment statistics such as cosine similarity after layer-wise normalization could be substituted.
- The same selection logic might be applied at the level of individual problems rather than whole domains once per-problem gradient projections become feasible.
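The layer-wise normalization idea in the second point can be sketched as follows. This is an illustrative assumption about what such a statistic might look like, not something the paper is stated to implement. Gradients are represented as hypothetical dictionaries mapping layer names to flat lists.

```python
import math


def norm(v):
    """Euclidean norm of a flat vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    nu, nv = norm(u), norm(v)
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def flatten(grad):
    """Concatenate per-layer gradients in a fixed (sorted) layer order."""
    return [x for name in sorted(grad) for x in grad[name]]


def layerwise_normalize(grad):
    """Rescale each layer's gradient to unit norm so no single layer dominates."""
    out = {}
    for name, g in grad.items():
        n = norm(g)
        out[name] = [x / n for x in g] if n > 0 else list(g)
    return out


# Toy gradients: the domains agree on a large-magnitude layer, conflict on a small one.
g_a = {"layer0": [100.0, 0.0], "layer1": [0.0, 1.0]}
g_b = {"layer0": [100.0, 0.0], "layer1": [0.0, -1.0]}

raw = cosine(flatten(g_a), flatten(g_b))
normed = cosine(flatten(layerwise_normalize(g_a)), flatten(layerwise_normalize(g_b)))
```

In this toy case the raw cosine is close to 1 because the large layer swamps the conflict, whereas the normalized version weighs agreement and conflict equally and comes out at 0.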
Load-bearing premise
Projected gradients computed on one domain during a GRPO step reliably predict whether that update will improve performance on the other domains.
What would settle it
A controlled run in which domains chosen by high projected-gradient alignment produce no measurable accuracy gain on the remaining domains after the same number of updates.
Original abstract
Reinforcement learning with verifiable rewards (RLVR) has been extended from single-domain training to multi-domain reasoning suites spanning mathematics, programming, and science. However, the training curriculum (how often each domain is sampled) is typically fixed or hand-tuned, even though reasoning skills transfer unevenly across domains. Existing learnability-based curricula adapt to where the policy is currently improving, but are blind to whether a gradient step on the selected domain benefits the remaining domains. In this paper, we propose Transfer-Aware Curriculum (TAC), a bandit-style online curriculum that prioritizes domains whose updates broadly benefit the rest of the training suite. TAC repurposes signals already produced by RL training: per-domain advantages capture local learnability, and projected gradients, taken from the GRPO step being computed, estimate cross-domain transferability via gradient-geometry alignment, at negligible cost (<1% wall-clock overhead). Across a six-domain reasoning suite, TAC achieves the best macro-averaged accuracy on both Qwen3-1.7B and Llama3.2-3B, outperforming proportional random sampling, a hand-designed schedule, and a learnability-only bandit, and improving over the last of these by up to 2.8 points (10% relative). Ablations show performance degrades sharply when the transferability term is removed, and TAC remains robust on imbalanced training mixtures where learnability-only curricula over-commit to dominant domains. Our findings establish cross-domain transferability as a key signal for curriculum design in multi-domain RLVR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Transfer-Aware Curriculum (TAC), a bandit-style online curriculum for multi-domain RLVR. TAC combines per-domain advantages (capturing local learnability) with projected gradients from GRPO steps (estimating cross-domain transferability via gradient-geometry alignment at <1% overhead). On a six-domain reasoning suite, TAC reports the highest macro-averaged accuracy on both Qwen3-1.7B and Llama3.2-3B, outperforming proportional random sampling, a hand-designed schedule, and a learnability-only bandit (by up to 2.8 points / 10% relative), while remaining robust on imbalanced mixtures.
Significance. If the gradient-geometry alignment reliably predicts positive cross-domain transfer, TAC offers an efficient, low-overhead method to incorporate transfer signals into multi-domain curricula, addressing a limitation of purely learnability-based approaches. The reported gains across two model scales and the ablation results provide empirical support for the combined signal's utility in reasoning domains.
major comments (1)
- [Abstract] The central claim that projected gradients yield a reliable proxy for cross-domain transferability (via gradient-geometry alignment) lacks direct supporting evidence such as per-pair correlations, regressions, or transfer-delta statistics linking alignment scores to observed improvements on other domains. The ablation showing degradation when the transfer term is removed does not isolate whether the benefit arises from accurate transfer prediction or from ancillary effects such as increased sampling diversity or regularization.
minor comments (1)
- [Abstract] No information is provided on statistical significance testing for the accuracy improvements, precise definitions of the six domains, or rules for data exclusion or mixture construction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of TAC's empirical results. We address the concern about direct evidence for the gradient-alignment proxy below.
Point-by-point responses
- Referee: [Abstract] Abstract: The central claim that projected gradients yield a reliable proxy for cross-domain transferability (via gradient-geometry alignment) lacks direct supporting evidence such as per-pair correlations, regressions, or transfer-delta statistics linking alignment scores to observed improvements on other domains; the ablation showing degradation when the transfer term is removed does not isolate whether the benefit arises from accurate transfer prediction versus ancillary effects such as increased sampling diversity or regularization.
Authors: We agree that the manuscript's support for the proxy is primarily indirect via end-to-end gains and the transfer-term ablation. The ablation isolates the contribution of the transfer signal but does not rule out ancillary effects. In revision we will add a dedicated analysis subsection reporting (i) per-domain-pair alignment scores, (ii) Pearson/Spearman correlations with measured transfer deltas (accuracy lift on target domain after source-domain updates), and (iii) simple linear regressions of transfer delta on alignment. If correlations are moderate we will discuss this limitation explicitly. This addition directly addresses the request for transfer-delta statistics without altering the method or claimed results.
Revision: yes
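The analysis promised in the rebuttal could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' analysis. The `alignment` and `transfer_delta` values are invented, each pair stands for one hypothetical (source domain, target domain) combination, and the helper names are placeholders.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def linear_fit(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return slope, my - slope * mx


# Invented data: alignment score vs. accuracy lift (points) on the target domain.
alignment = [-0.4, -0.1, 0.0, 0.2, 0.5, 0.7]
transfer_delta = [-1.0, -0.2, 0.1, 0.3, 1.2, 1.5]

r = pearson(alignment, transfer_delta)
rho = spearman(alignment, transfer_delta)
slope, intercept = linear_fit(alignment, transfer_delta)
```

A strong positive correlation on real measurements would support the proxy directly. A weak one would suggest that the end-to-end gains come from the ancillary effects the referee mentions.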
Circularity Check
No circularity: empirical method with online signals and independent test metrics
full rationale
The paper defines TAC as an online bandit curriculum that reuses per-step RL signals (advantages for learnability, projected gradients for transfer via alignment) already computed during GRPO. The headline results are macro-averaged test accuracies on held-out reasoning benchmarks, compared against baselines including a learnability-only ablation. No equation or claim reduces the reported gains to a post-hoc fit, self-citation chain, or input by construction; the transfer term is an explicit additive component whose removal is ablated. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] Projected gradients from a single GRPO step on one domain estimate whether that update benefits other domains.