Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

Bernhard Schölkopf; Jiarui Liu; Lechen Zhang; Yinghui He; Yongjin Yang; Zhijing Jin

arxiv: 2606.25178 · v2 · pith:3O4EQJ5Jnew · submitted 2026-06-23 · 💻 cs.AI

Yongjin Yang, Jiarui Liu, Yinghui He, Lechen Zhang, Bernhard Schölkopf, Zhijing Jin

Pith reviewed 2026-06-30 09:53 UTC · model grok-4.3

classification 💻 cs.AI

keywords: reinforcement learning, curriculum learning, multi-domain reasoning, transferability, RLVR, GRPO, bandit curriculum

The pith

A curriculum that prioritizes domains by how much their gradients align with others improves macro accuracy in multi-domain RLVR over learnability-only baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Transfer-Aware Curriculum (TAC) for RLVR training across six domains that include mathematics, programming, and science. TAC is a bandit-style method that combines a local learnability signal, taken from per-domain advantages, with a transferability estimate derived from projected gradients of the GRPO step being computed. The transfer term measures gradient-geometry alignment, so the scheduler favors domains whose updates benefit the full suite rather than only the current domain. On Qwen3-1.7B and Llama3.2-3B, TAC records the highest macro-averaged accuracy, outperforming proportional random sampling, a hand-designed schedule, and a learnability-only bandit, with a margin of up to 2.8 points (10% relative) over the last of these. Removing the transfer term sharply reduces performance, and TAC stays robust under imbalanced domain mixtures, where learnability-only curricula over-commit to dominant domains.

Core claim

Transfer-Aware Curriculum (TAC) is a bandit-style online curriculum that prioritizes domains whose updates broadly benefit the rest of the training suite. It repurposes signals already produced by RL training: per-domain advantages capture local learnability, and projected gradients, taken from the GRPO step being computed, estimate cross-domain transferability via gradient-geometry alignment, at negligible cost. Across a six-domain reasoning suite, TAC achieves the best macro-averaged accuracy on both Qwen3-1.7B and Llama3.2-3B, outperforming proportional random sampling, a hand-designed schedule, and a learnability-only bandit, and improving over the last of these by up to 2.8 points (10% relative).

What carries the argument

Transfer-Aware Curriculum (TAC), a bandit-style scheduler that augments local learnability advantages with cross-domain transferability estimated from projected gradient alignment during GRPO updates.
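To make the two signals concrete, here is a minimal sketch of how a scheduler of this kind could be wired. It is an illustration under stated assumptions, not the paper's selection rule, which this summary does not spell out: the additive score, the softmax sampling, and the names `transfer_scores`, `sampling_probs`, `lam`, and `temperature` are all hypothetical, and the gradient vectors are toy values.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def transfer_scores(proj_grads):
    """Mean cosine alignment of each domain's projected gradient with all other domains'."""
    names = list(proj_grads)
    scores = {}
    for d in names:
        others = [cosine(proj_grads[d], proj_grads[o]) for o in names if o != d]
        scores[d] = sum(others) / len(others) if others else 0.0
    return scores


def sampling_probs(learnability, transfer, lam=1.0, temperature=0.5):
    """Softmax over the assumed additive score: learnability + lam * transfer."""
    raw = {d: (learnability[d] + lam * transfer[d]) / temperature for d in learnability}
    top = max(raw.values())  # subtract the max for numerical stability
    exp = {d: math.exp(s - top) for d, s in raw.items()}
    total = sum(exp.values())
    return {d: e / total for d, e in exp.items()}


# Toy inputs (hypothetical): equal learnability, so only the transfer term differs.
learnability = {"math": 0.30, "code": 0.30, "science": 0.30}
proj_grads = {
    "math": [1.0, 0.2, 0.0],
    "code": [0.8, 0.4, 0.1],
    "science": [-0.5, 1.0, 0.0],
}

transfer = transfer_scores(proj_grads)
probs = sampling_probs(learnability, transfer)

# Draw the next training domain from the resulting distribution.
rng = random.Random(0)
next_domain = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
```

With learnability held equal, the domain whose gradient points away from the others ("science" in this toy setup) receives the lowest sampling probability. Setting `lam=0.0` recovers a learnability-only rule, which is the kind of ablation the paper reports.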

If this is right

TAC records the highest macro-averaged accuracy across the six-domain suite on two different models, Qwen3-1.7B and Llama3.2-3B.
Performance drops sharply when the transferability term is removed from the selection rule.
TAC maintains stable behavior on deliberately imbalanced domain mixtures where learnability-only methods over-commit to dominant domains.
The added computation for projected gradients incurs less than 1 percent wall-clock overhead.
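One reason a projected-gradient signal can be cheap is that alignment between high-dimensional vectors is approximately preserved after projection to far fewer dimensions. The sketch below illustrates this with a shared random sign matrix. The projection scheme, the dimensions, and the function name `random_sign_projection` are assumptions for illustration; this summary does not say which projection the paper uses.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def random_sign_projection(vectors, k, seed=0):
    """Project equal-length vectors to k dims using one shared +/-1 matrix scaled by 1/sqrt(k)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    scale = 1.0 / math.sqrt(k)
    projected = [[] for _ in vectors]
    for _ in range(k):
        # One random row of the projection matrix, reused for every input vector.
        row = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        for out, vec in zip(projected, vectors):
            out.append(scale * sum(r * x for r, x in zip(row, vec)))
    return projected


# Hypothetical "gradients": g_code is g_math plus independent noise, so they partially align.
rng = random.Random(1)
dim, k = 1000, 400
g_math = [rng.gauss(0, 1) for _ in range(dim)]
g_code = [a + rng.gauss(0, 1) for a in g_math]

p_math, p_code = random_sign_projection([g_math, g_code], k)
cos_full = cosine(g_math, g_code)   # alignment in the original space
cos_proj = cosine(p_math, p_code)   # alignment estimated from the projections
```

The two cosines should be close, and the gap generally shrinks as `k` grows. In a real training run the vectors have billions of entries, so the practical cost depends on how the projection is fused into the backward pass, which this toy example does not model.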

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Gradient-alignment signals could be tested as a lightweight transfer proxy in other multi-task RL settings that lack explicit cross-domain evaluation.
If the current projection method underestimates certain forms of transfer, richer alignment statistics such as cosine similarity after layer-wise normalization could be substituted.
The same selection logic might be applied at the level of individual problems rather than whole domains once per-problem gradient projections become feasible.
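The second extension above can be illustrated with a small sketch of cosine similarity computed after layer-wise normalization. The layer names, the toy gradient values, and the helper names `flatten` and `layerwise_normalize` are hypothetical; the point is only to show how a single large-norm layer can dominate a raw cosine and hide disagreement elsewhere.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def flatten(grads):
    """Concatenate per-layer gradient vectors in a fixed (sorted) layer order."""
    return [x for name in sorted(grads) for x in grads[name]]


def layerwise_normalize(grads):
    """Rescale each layer's gradient to unit L2 norm; all-zero layers are left unchanged."""
    out = {}
    for name, vec in grads.items():
        norm = math.sqrt(sum(x * x for x in vec))
        out[name] = [x / norm for x in vec] if norm > 0 else list(vec)
    return out


# Two domains agree on a large-norm layer and disagree on a small-norm one.
g_a = {"embed": [10.0, 0.0], "mlp": [0.1, 0.0]}
g_b = {"embed": [10.0, 0.0], "mlp": [-0.1, 0.0]}

raw = cosine(flatten(g_a), flatten(g_b))
normalized = cosine(flatten(layerwise_normalize(g_a)), flatten(layerwise_normalize(g_b)))
```

Here `raw` is close to 1 because the `embed` layer dominates, while `normalized` is close to 0 because each layer now contributes equally and the two layers cancel. Whether this weighting tracks transfer better than the raw alignment is an empirical question.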

Load-bearing premise

Projected gradients computed on one domain during a GRPO step reliably predict whether that update will improve performance on the other domains.

What would settle it

A controlled run in which domains chosen by high projected-gradient alignment produce no measurable accuracy gain on the remaining domains after the same number of updates.

Original abstract

Reinforcement learning with verifiable rewards (RLVR) has been extended from single-domain training to multi-domain reasoning suites spanning mathematics, programming, and science. However, the training curriculum (how often each domain is sampled) is typically fixed or hand-tuned, even though reasoning skills transfer unevenly across domains. Existing learnability-based curricula adapt to where the policy is currently improving, but are blind to whether a gradient step on the selected domain benefits the remaining domains. In this paper, we propose Transfer-Aware Curriculum (TAC), a bandit-style online curriculum that prioritizes domains whose updates broadly benefit the rest of the training suite. TAC repurposes signals already produced by RL training: per-domain advantages capture local learnability, and projected gradients, taken from the GRPO step being computed, estimate cross-domain transferability via gradient-geometry alignment, at negligible cost (<1% wall-clock overhead). Across a six-domain reasoning suite, TAC achieves the best macro-averaged accuracy on both Qwen3-1.7B and Llama3.2-3B, outperforming proportional random sampling, a hand-designed schedule, and a learnability-only bandit, and improving over the last of these by up to 2.8 points (10% relative). Ablations show performance degrades sharply when the transferability term is removed, and TAC remains robust on imbalanced training mixtures where learnability-only curricula over-commit to dominant domains. Our findings establish cross-domain transferability as a key signal for curriculum design in multi-domain RLVR.
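Two quantities in the abstract are worth unpacking: macro-averaged accuracy and the "2.8 points (10% relative)" figure. The sketch below uses made-up per-domain counts (not results from the paper) to show why macro averaging matters when domains are imbalanced, and it works out the baseline level that the reported absolute and relative gains jointly imply, assuming the relative gain is computed as gain divided by baseline.

```python
def macro_accuracy(per_domain):
    """Unweighted mean of per-domain accuracies; per_domain maps name -> (correct, total)."""
    accs = [correct / total for correct, total in per_domain.values()]
    return sum(accs) / len(accs)


def micro_accuracy(per_domain):
    """Pooled accuracy: total correct answers over total examples."""
    correct = sum(c for c, _ in per_domain.values())
    total = sum(t for _, t in per_domain.values())
    return correct / total


# Hypothetical, deliberately imbalanced evaluation counts.
results = {"math": (900, 1000), "code": (20, 100), "science": (10, 100)}

macro = macro_accuracy(results)  # every domain counts equally
micro = micro_accuracy(results)  # dominated by the largest domain

# If 2.8 points is a 10% relative gain, the baseline sits near 2.8 / 0.10 points.
implied_baseline = 2.8 / 0.10
```

With these counts the macro average is 0.40 while the pooled accuracy is 0.775, so a method that over-commits to the dominant domain can look strong under pooling yet weak under macro averaging. The implied baseline of roughly 28 points is only approximate, since both reported figures are rounded.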

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TAC gets a real but modest edge from the gradient-alignment term in the bandit, yet the paper never directly checks whether that alignment actually predicts cross-domain gains.

The main thing to know is that TAC improves macro-averaged accuracy by up to 2.8 points over a learnability-only bandit on two small reasoning models, and the ablations show the transfer term matters. The method stays cheap by pulling projected gradients straight out of the GRPO step already being run.

What the paper does cleanly is reuse existing training signals instead of adding new forward passes. Per-domain advantages handle local learnability while the alignment term tries to capture whether an update on one domain will help the others. The robustness result on imbalanced mixtures is the most practical part: the learnability-only baseline over-commits to the dominant domain while TAC spreads effort better. That comparison is worth having.

The soft spot is exactly the one the stress-test note flags. The abstract presents gradient-geometry alignment as a low-cost proxy for cross-domain transfer, but it never shows a correlation, regression, or per-pair delta that links higher alignment scores to actual positive transfer on the other domains. Without that diagnostic, the performance lift could come from increased sampling diversity or a mild regularizing effect rather than genuine transfer awareness. The claim that the term estimates whether an update benefits the rest of the suite therefore rests on an untested assumption.

This is for people already running multi-domain RLVR on reasoning tasks and looking for curriculum tweaks that do not require extra compute. A reader who cares about bandit curricula or learnability signals will find the head-to-head numbers and the imbalance experiment useful. The empirical pattern is solid enough to justify referee time even though the mechanistic story needs more support.

I would send it for peer review.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes Transfer-Aware Curriculum (TAC), a bandit-style online curriculum for multi-domain RLVR. TAC combines per-domain advantages (capturing local learnability) with projected gradients from GRPO steps (estimating cross-domain transferability via gradient-geometry alignment at <1% overhead). On a six-domain reasoning suite, TAC reports the highest macro-averaged accuracy on both Qwen3-1.7B and Llama3.2-3B, outperforming proportional random sampling, a hand-designed schedule, and a learnability-only bandit (by up to 2.8 points / 10% relative), while remaining robust on imbalanced mixtures.

Significance. If the gradient-geometry alignment reliably predicts positive cross-domain transfer, TAC offers an efficient, low-overhead method to incorporate transfer signals into multi-domain curricula, addressing a limitation of purely learnability-based approaches. The reported gains across two model scales and the ablation results provide empirical support for the combined signal's utility in reasoning domains.

major comments (1)

[Abstract] The central claim that projected gradients yield a reliable proxy for cross-domain transferability (via gradient-geometry alignment) lacks direct supporting evidence such as per-pair correlations, regressions, or transfer-delta statistics linking alignment scores to observed improvements on other domains. The ablation showing degradation when the transfer term is removed does not isolate whether the benefit arises from accurate transfer prediction or from ancillary effects such as increased sampling diversity or regularization.

minor comments (1)

[Abstract] No information is provided on statistical significance testing for the accuracy improvements, precise definitions of the six domains, or rules for data exclusion or mixture construction.
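As one example of the significance testing the minor comment asks for, a paired bootstrap over per-problem outcomes gives a confidence interval for an accuracy difference between two systems evaluated on the same problems. This is a generic sketch, not a procedure taken from the paper; the function name `paired_bootstrap_ci`, the resample count, and the 0/1 outcome lists are hypothetical.

```python
import random


def paired_bootstrap_ci(base_correct, new_correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-problem accuracy difference (new - base)."""
    if len(base_correct) != len(new_correct):
        raise ValueError("both systems must be scored on the same problems")
    diffs = [n - b for b, n in zip(base_correct, new_correct)]
    rng = random.Random(seed)
    size = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample problems with replacement, keeping each problem's paired outcomes together.
        sample = [diffs[rng.randrange(size)] for _ in range(size)]
        means.append(sum(sample) / size)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / size, low, high


# Hypothetical 0/1 correctness on the same 200 problems.
base = [1 if i % 4 == 0 else 0 for i in range(200)]
new = [1 if (i % 4 == 0 or i % 10 == 1) else 0 for i in range(200)]

observed, ci_low, ci_high = paired_bootstrap_ci(base, new)
```

An interval that excludes zero suggests the gain is unlikely to be resampling noise on this problem set. For RL training runs, variation across random seeds is a separate source of uncertainty that this per-problem bootstrap does not capture.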

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of TAC's empirical results. We address the concern about direct evidence for the gradient-alignment proxy below.

Referee: [Abstract] The central claim that projected gradients yield a reliable proxy for cross-domain transferability (via gradient-geometry alignment) lacks direct supporting evidence such as per-pair correlations, regressions, or transfer-delta statistics linking alignment scores to observed improvements on other domains. The ablation showing degradation when the transfer term is removed does not isolate whether the benefit arises from accurate transfer prediction or from ancillary effects such as increased sampling diversity or regularization.

Authors: We agree that the manuscript's support for the proxy is primarily indirect via end-to-end gains and the transfer-term ablation. The ablation isolates the contribution of the transfer signal but does not rule out ancillary effects. In revision we will add a dedicated analysis subsection reporting (i) per-domain-pair alignment scores, (ii) Pearson/Spearman correlations with measured transfer deltas (accuracy lift on target domain after source-domain updates), and (iii) simple linear regressions of transfer delta on alignment. If correlations are moderate, we will discuss this limitation explicitly. This addition directly addresses the request for transfer-delta statistics without altering the method or claimed results.

revision: yes
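The analysis promised in the rebuttal could look roughly like the sketch below, which computes Pearson and Spearman correlations and a least-squares line relating alignment scores to transfer deltas. The six data points are invented for illustration and are not measurements from the paper; the helper names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def ols(x, y):
    """Slope and intercept of the least-squares line y = slope * x + intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


# Hypothetical per-domain-pair data: alignment score vs. accuracy change on the target domain.
alignment = [-0.3, -0.1, 0.0, 0.2, 0.5, 0.8]
transfer_delta = [-1.0, -0.2, 0.1, 0.3, 1.5, 1.9]

r = pearson(alignment, transfer_delta)
rho = spearman(alignment, transfer_delta)
slope, intercept = ols(alignment, transfer_delta)
```

A strong positive `r` and `rho` on real data would support reading the alignment term as a transfer proxy. Weak or unstable correlations would leave the alternative explanations raised by the referee, sampling diversity or regularization, on the table.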

Circularity Check

0 steps flagged

No circularity: empirical method with online signals and independent test metrics

The paper defines TAC as an online bandit curriculum that reuses per-step RL signals (advantages for learnability, projected gradients for transfer via alignment) already computed during GRPO. The headline results are macro-averaged test accuracies on held-out reasoning benchmarks, compared against baselines including a learnability-only ablation. No equation or claim reduces the reported gains to a post-hoc fit, self-citation chain, or input by construction; the transfer term is an explicit additive component whose removal is ablated. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; the method rests on standard RL assumptions plus the unstated premise that GRPO gradient projections are a faithful proxy for cross-domain benefit. No explicit free parameters or invented entities are named.

axioms (1)

Domain assumption: Projected gradients from a single GRPO step on one domain estimate whether that update benefits other domains.
Central to the TAC selection rule. Location: abstract description of TAC.

pith-pipeline@v0.9.1-grok · 5822 in / 1305 out tokens · 30280 ms · 2026-06-30T09:53:11.797601+00:00 · methodology

Reference graph

Works this paper leans on

64 extracted references · 28 canonical work pages · 13 internal anchors

[1]

Database-friendly random projections

Dimitris Achlioptas. Database-friendly random projections. InProceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 2001

2001
[2]

AIME Problems and Solutions, 2025

Art of Problem Solving. AIME Problems and Solutions, 2025. URL https://artofproblemsolv ing.com/wiki/index.php/AIME_Problems_and_Solutions. Accessed: 2025-05-15

2025
[3]

Using confidence bounds for exploitation-exploration trade-offs.Journal of Machine Learning Research, 3(Nov):397–422, 2002

Peter Auer. Using confidence bounds for exploitation-exploration trade-offs.Journal of Machine Learning Research, 3(Nov):397–422, 2002

2002
[4]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

Multitask learning.Machine learning, 28(1):41–75, 1997

Rich Caruana. Multitask learning.Machine learning, 28(1):41–75, 1997

1997
[6]

Finding frequent items in data streams

Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. InEATCS International Colloquium on Automata, Languages and Programming, pages 693–703. Springer, 2002

2002
[7]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

Self-evolving curriculum for llm reasoning.arXiv preprint arXiv:2505.14970, 2025a

Xiaoyin Chen, Jiarui Lu, Minsu Kim, Dinghuai Zhang, Jian Tang, Alexandre Piché, Nicolas Gontier, Yoshua Bengio, and Ehsan Kamalloo. Self-evolving curriculum for llm reasoning.arXiv preprint arXiv:2505.14970, 2025

work page arXiv 2025
[9]

Finqa: A dataset of numerical reasoning over financial data

Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan R Routledge, et al. Finqa: A dataset of numerical reasoning over financial data. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

2021
[10]

Hitab: A hierarchical table dataset for question answering and natural language generation

Zhoujun Cheng, Haoyu Dong, Zhiruo Wang, Ran Jia, Jiaqi Guo, Yan Gao, Shi Han, Jian-Guang Lou, and Dongmei Zhang. Hitab: A hierarchical table dataset for question answering and natural language generation. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022

2022
[11]

Revisiting reinforcement learning for llm reasoning from a cross-domain perspective

Zhoujun Cheng, Shibo Hao, Tianyang Liu, Fan Zhou, Yutao Xie, Feng Yao, Yuexin Bian, Nilabjo Dey, Yonghao Zhuang, Yuheng Zha, et al. Revisiting reinforcement learning for llm reasoning from a cross-domain perspective. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

2025
[12]

Arc prize 2024: Technical report

Francois Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers. Arc prize 2024: Technical report.arXiv preprint arXiv:2412.04604, 2024

work page arXiv 2024
[13]

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-agi-2: A new challenge for frontier ai reasoning systems.arXiv preprint arXiv:2505.11831, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Xeron Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixin Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi LI, Yunwen Li, dehua ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Zhenzhu Yang, Zekun ...

2025
[15]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. Cruxeval: A benchmark for code reasoning, understanding and execution.arXiv preprint arXiv:2401.03065, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Skywork open reasoner series

Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, and Yahui Zhou. Skywork open reasoner series. https://capricious-hydrogen-41c.not ion.site/Skywork-Open-Reaonser-Series-1d0bc9ae823a80459b46c149e4f51680 , 2025. Notion Blog

2025
[19]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[20]

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, and Xiang Yue. Does math reasoning improve general llm capabilities? understanding transferability of llm reasoning.arXiv preprint arXiv:2507.00432, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Livecodebench: Holistic and contamination free eval- uation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free eval- uation of large language models for code. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[22]

Vcrl: Variance-based curriculum reinforcement learning for large language models.arXiv preprint arXiv:2509.19803, 2025

Guochao Jiang, Wenfeng Feng, Guofeng Quan, Chuzhan Hao, Yuewei Zhang, Guohua Liu, and Hao Wang. Vcrl: Variance-based curriculum reinforcement learning for large language models.arXiv preprint arXiv:2509.19803, 2025

work page arXiv 2025
[23]

Extensions of lipschitz mappings into a hilbert space

William B Johnson, Joram Lindenstrauss, et al. Extensions of lipschitz mappings into a hilbert space. Contemporary Mathematics, 26(189-206):1, 1984

1984
[24]

Understanding black-box predictions via influence functions

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, 2017

2017
[25]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles, 2023

2023
[26]

Omni-thinker: Scaling multi-task rl in llms with hybrid reward and task scheduling.arXiv preprint arXiv:2507.14783, 2025

Derek Li, Jiaming Zhou, Leo Maxime Brunswic, Abbas Ghaddar, Qianyi Sun, Liheng Ma, Yu Luo, Dong Li, Mark Coates, Jianye Hao, et al. Omni-thinker: Scaling multi-task rl in llms with hybrid reward and task scheduling.arXiv preprint arXiv:2507.14783, 2025

work page arXiv 2025
[27]

Codeio: Condensing reasoning patterns via code input-output prediction

Junlong Li, Daya Guo, Dejian Yang, Runxin Xu, Yu Wu, and Junxian He. Codeio: Condensing reasoning patterns via code input-output prediction. InInternational Conference on Machine Learning, 2025. 14 Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

2025
[28]

Verified taco problems

Kaixin Li. Verified taco problems. https://huggingface.co/datasets/likaixin/TACO-ver ified, 2024. URLhttps://huggingface.co/datasets/likaixin/TACO-verified

2024
[29]

Combining induction and transduction for abstract reasoning.arXiv preprint arXiv:2411.02272, 2024

Wen-Ding Li, Keya Hu, Carter Larsen, Yuqing Wu, Simon Alford, Caleb Woo, Spencer M Dunn, Hao Tang, Michelangelo Naim, Dat Nguyen, et al. Combining induction and transduction for abstract reasoning.arXiv preprint arXiv:2411.02272, 2024

work page arXiv 2024
[30]

Can one domain help others? a data-centric study on multi-domain reasoning via reinforcement learning.arXiv preprint arXiv:2507.17512, 2025

Yu Li, Zhuoshi Pan, Honglin Lin, Mengyuan Sun, Conghui He, and Lijun Wu. Can one domain help others? a data-centric study on multi-domain reasoning via reinforcement learning.arXiv preprint arXiv:2507.17512, 2025

work page arXiv 2025
[31]

Boosting multi-domain reasoning of LLMs via curvature-guided policy optimization

Xize Liang, Lin Yang, Jie Wang, Rui Liu, Yang Lu, Jinliang Zeng, Hanzhu Chen, Dong Li, and Jianye HAO. Boosting multi-domain reasoning of LLMs via curvature-guided policy optimization. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[32]

Zebralogic: On the scaling limits of llms for logical reasoning.arXiv preprint arXiv:2502.01100, 2025

Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. Zebralogic: On the scaling limits of llms for logical reasoning.arXiv preprint arXiv:2502.01100, 2025

work page arXiv 2025
[33]

Conflict-averse gradient descent for multi-task learning.Advances in Neural Information Processing Systems, 2021

Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning.Advances in Neural Information Processing Systems, 2021

2021
[34]

Deepcoder: A fully open-source 14b coder at o3-mini level

Michael Luo, Sijun Tan, Roy Huang, Xiaoxiang Shi, Rachel Xin, Colin Cai, Erran Li, Raluca Ada Popa, Ion Stoica, Ameen Patel, Alpay Ariyak, Qingyang Wu, Maurice Weber, and Ce Zhang. Deepcoder: A fully open-source 14b coder at o3-mini level. https://pretty-radio-b75.notion.site/DeepC oder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb...

2025
[35]

Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica

Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y . Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassin g-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e ...

2025
[36]

General-reasoner: Advancing LLM reasoning across all domains

Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun MA, and Wenhu Chen. General-reasoner: Advancing LLM reasoning across all domains. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[37]

American mathematics competitions - amc.https://maa.org/, 2023

MAA. American mathematics competitions - amc.https://maa.org/, 2023

2023
[38]

Teacher–student curriculum learning

Tambet Matiisen, Avital Oliver, Taco Cohen, and John Schulman. Teacher–student curriculum learning. IEEE transactions on neural networks and learning systems, 31(9):3732–3740, 2019

2019
[39]

Synthetic-1: Two million collaboratively generated reasoning traces from deepseek-r1

Justus Mattern, Sami Jaghouar, Manveer Basra, Jannik Straube, Matthew Di Ferrante, Felix Gabriel, Jack Min Ong, Vincent Weisser, and Johannes Hagemann. Synthetic-1: Two million collaboratively generated reasoning traces from deepseek-r1. https://www.primeintellect.ai/blog/synthet ic-1-release, 2025

2025
[40]

Reasoning curriculum: Bootstrapping broad llm reasoning from math.arXiv preprint arXiv:2510.26143, 2025

Bo Pang, Deqian Kong, Silvio Savarese, Caiming Xiong, and Yingbo Zhou. Reasoning curriculum: Bootstrapping broad llm reasoning from math.arXiv preprint arXiv:2510.26143, 2025

work page arXiv 2025
[41]

Kakade, and Surbhi Goel

Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Sham M. Kakade, and Surbhi Goel. In good GRACEs: Principled teacher selection for knowledge distillation. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[42]

Curriculum reinforcement learning from easy to hard tasks improves LLM reasoning

Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, and Shuiwang Ji. Curriculum reinforcement learning from easy to hard tasks improves LLM reasoning. InThe Fourteenth International Conference on Learning Representations, 2026. 15 Transferability for General Reasoning: An Aut...

2026
[43]

Trak: attributing model behavior at scale

Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, and Aleksander M ˛ adry. Trak: attributing model behavior at scale. InProceedings of the 40th International Conference on Machine Learning, 2023

2023
[44]

Estimating training data influence by tracing gradient descent.Advances in Neural Information Processing Systems, 2020

Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. Estimating training data influence by tracing gradient descent.Advances in Neural Information Processing Systems, 2020

2020
[45]

Multi-task grpo: Reliable llm rea- soning across tasks.arXiv preprint arXiv:2602.05547, 2026

Shyam Sundhar Ramesh, Xiaotong Ji, Matthieu Zimmer, Sangwoong Yoon, Zhiyong Wang, Haitham Bou Ammar, Aurelien Lucchi, and Ilija Bogunovic. Multi-task grpo: Reliable llm rea- soning across tasks.arXiv preprint arXiv:2602.05547, 2026

work page arXiv 2026
[46]

Gpqa: A graduate-level google-proof q&a benchmark

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024

2024
[47]

An Overview of Multi-Task Learning in Deep Neural Networks

Sebastian Ruder. An overview of multi-task learning in deep neural networks.arXiv preprint arXiv:1706.05098, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[48]

Language models are greedy reasoners: A systematic formal analysis of chain-of-thought.arXiv preprint arXiv:2210.01240, 2022

Abulhair Saparov and He He. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought.arXiv preprint arXiv:2210.01240, 2022

work page arXiv 2022
[49]

Transformers struggle to learn to search.arXiv preprint arXiv:2412.04703, 2024

Abulhair Saparov, Srushti Pawar, Shreyas Pimpalgaonkar, Nitish Joshi, Richard Yuanzhe Pang, Vishakh Padmakumar, Seyed Mehran Kazemi, Najoung Kim, and He He. Transformers struggle to learn to search.arXiv preprint arXiv:2412.04703, 2024

work page arXiv 2024
[50]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[51]

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. InProceedings of the Twentieth European Conference on Computer Systems, 2025

2025
[52]

Proximal curriculum for reinforcement learning agents.arXiv preprint arXiv:2304.12877, 2023

Georgios Tzannetos, Bárbara Gomes Ribeiro, Parameswaran Kamalaruban, and Adish Singla. Proximal curriculum for reinforcement learning agents.arXiv preprint arXiv:2304.12877, 2023

work page arXiv 2023
[53]

Learning a multi-domain curriculum for neural machine translation

Wei Wang, Ye Tian, Jiquan Ngiam, Yinfei Yang, Isaac Caswell, and Zarana Parekh. Learning a multi-domain curriculum for neural machine translation. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020

2020
[54]

A survey on curriculum learning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):4555–4576, 2021

Xin Wang, Yudong Chen, and Wenwu Zhu. A survey on curriculum learning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):4555–4576, 2021

2021
[55]

Dump: Automated distribution- level curriculum learning for rl-based llm post-training.arXiv preprint arXiv:2504.09710, 2025

Zhenting Wang, Guofeng Cui, Yu-Jhe Li, Kun Wan, and Wentian Zhao. Dump: Automated distribution- level curriculum learning for rl-based llm post-training.arXiv preprint arXiv:2504.09710, 2025

work page arXiv 2025
[56]

Less: Selecting influential data for targeted instruction tuning.arXiv preprint arXiv:2402.04333, 2024

Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. Less: Selecting influential data for targeted instruction tuning.arXiv preprint arXiv:2402.04333, 2024

work page arXiv 2024
[57]

Leetcodedataset: A temporal dataset for robust evaluation and efficient training of code llms.arXiv preprint arXiv:2504.14655, 2025

Yunhui Xia, Wei Shen, Yan Wang, Jason Klein Liu, Huifeng Sun, Siyue Wu, Jian Hu, and Xiaolong Xu. Leetcodedataset: A temporal dataset for robust evaluation and efficient training of code llms.arXiv preprint arXiv:2504.14655, 2025

work page arXiv 2025
[58]

Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy S Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. DoReMi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems, 2023

[59]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[60]

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

[61]

Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 2020

[62]

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023

[63]

Yilun Zhao, Yunxiang Li, Chenying Li, and Rui Zhang. MultiHiertt: Numerical reasoning over multi hierarchical tabular and textual data. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022

[64]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 2023
