The Generalization Spectrum: A Chromatographic Approach to Evaluating Learning Algorithms
Pith reviewed 2026-06-26 05:27 UTC · model grok-4.3
The pith
Learning from one example transfers differently across algorithms, with RL extending reach more efficiently than supervised fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By constructing for each training example a controlled suite of test variants arranged by increasing transfer distance—from exact recall to implementation transfer across languages, context transfer under narrative re-framing, category-matched in-domain problems, and an unpaired baseline—the Generalization Spectrum profiles how far an algorithm's learning extends. This reveals that RL converts memorization into near-transfer more efficiently than SFT-family baselines, ICL exhibits strong but correspondence-dependent transfer, and within-family variants show local gains need not expand the generalization radius, with RFT preserving a stronger far-transfer tail than reference SFT.
What carries the argument
The Generalization Spectrum, which tracks performance across test variants ordered by transfer distance for each training example.
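As a rough illustration of what such a profile looks like in practice, the sketch below averages pass/fail outcomes per transfer level. It is a minimal sketch, not the paper's implementation: the level labels, the `(example_id, level, passed)` record shape, and the function name `spectrum_profile` are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean

# Transfer levels in order of increasing claimed distance.
# These labels are illustrative, not the paper's identifiers.
LEVELS = [
    "exact_recall",
    "implementation_transfer",
    "context_transfer",
    "category_matched",
    "unpaired_baseline",
]


def spectrum_profile(results):
    """Average pass rate per transfer level.

    results: iterable of (example_id, level, passed) tuples (assumed shape).
    Returns a dict mapping each observed level to its mean pass rate,
    with keys in LEVELS order.
    """
    by_level = defaultdict(list)
    for _example_id, level, passed in results:
        by_level[level].append(1.0 if passed else 0.0)
    # Skip levels with no observations rather than reporting a fake zero.
    return {level: mean(by_level[level]) for level in LEVELS if by_level[level]}
```

Comparing two algorithms then amounts to comparing two such dictionaries level by level, rather than comparing one aggregate score.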
If this is right
- RL is more efficient than SFT at turning exact memorization into near-transfer performance.
- ICL transfer is strong but depends on specific correspondences between examples.
- Abstractions and hints primarily improve local transfer rather than expanding the overall generalization radius.
- RFT maintains a stronger far-transfer tail compared to standard SFT.
- Self-distillation or hint-assisted RL can reduce far transfer even when local transfer improves.
Where Pith is reading between the lines
- Applying the spectrum to other domains like natural language inference or mathematical reasoning could reveal similar differences in generalization behavior across methods.
- The framework suggests that algorithm design should target expanding the transfer radius rather than just optimizing local performance.
- Future evaluations might routinely include such distance-based suites to avoid overestimating generalization from aggregate scores alone.
Load-bearing premise
The constructed test variants form a valid monotonic ordering of transfer distance without uncontrolled factors such as varying difficulty or data contamination.
What would settle it
The framework's validity would be challenged if re-running the experiments, either with variants reordered by a different distance metric or on a new set of problems, produced performance that does not decrease monotonically with the claimed transfer distance.
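A monotonicity test of this kind can be sketched in a few lines. The function below is a hypothetical helper, not something taken from the paper: it assumes a profile given as a mapping from level name to pass rate, and it flags adjacent pairs where the farther level scores higher than the nearer one.

```python
def monotonic_violations(profile, levels, tolerance=0.0):
    """Find adjacent level pairs that break the expected ordering.

    profile: dict mapping level name -> pass rate (assumed shape).
    levels: level names ordered from nearest to farthest transfer.
    tolerance: slack allowed before a rise counts as a violation.
    Returns a list of (near, far) pairs where the far level scores
    higher than the near level by more than the tolerance.
    """
    violations = []
    for near, far in zip(levels, levels[1:]):
        if profile[far] > profile[near] + tolerance:
            violations.append((near, far))
    return violations
```

An empty result is consistent with the claimed ordering but does not prove it, since difficulty differences between levels could produce the same pattern.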
Original abstract
Traditional evaluations measure a learning algorithm's final performance on an i.i.d. test set, reducing learning to a single aggregate score. This approach obscures a fundamental question: to what extent does learning from a specific example generalize to others? Such per-sample generalization, akin to learning by analogy in human cognition, captures how far the knowledge extracted from one example can transfer, yet remains invisible to standard benchmarks. We introduce the Generalization Spectrum, an evaluation framework designed to expose this hidden dimension. For each training example, we construct a controlled suite of test variants arranged by increasing transfer distance, from exact recall to implementation transfer across languages, context transfer under complete narrative re-framing, category-matched in-domain problems, and an unpaired baseline. By tracking performance across these distances, we reveal not just whether an algorithm learns, but how far that learning extends. We instantiate this framework on competitive programming, using a selection-and-synthesis pipeline seeded with recent problems to mitigate contamination. We first compare three canonical learning paradigms under matched memorization. RL converts memorization into near-transfer more efficiently than SFT-family baselines, while ICL exhibits strong but correspondence-dependent transfer. We then use the Spectrum to diagnose within-family variants. The resulting profiles show that local gains need not expand the generalization radius: abstractions and hints mainly lift local transfer, RFT preserves a stronger far-transfer tail than reference SFT, and self-distillation or hint-assisted RL can reduce far transfer even when local transfer or optimization improves.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Generalization Spectrum framework for evaluating per-sample generalization of learning algorithms by constructing, for each training example, a controlled suite of test variants ordered by increasing transfer distance (exact recall, implementation transfer, narrative re-framing, category-matched in-domain, unpaired baseline). Applied to competitive programming via a selection-and-synthesis pipeline on recent problems, it compares RL, SFT-family methods, and ICL under matched memorization, claiming RL converts memorization into near-transfer more efficiently than SFT baselines while ICL shows strong but correspondence-dependent transfer; it further diagnoses within-family variants, finding that local gains need not expand generalization radius.
Significance. If the variant suites are shown to form a controlled monotonic axis, the framework offers a finer-grained diagnostic than i.i.d. benchmarks, exposing differences in generalization radius across paradigms and variants that could guide algorithm design toward better transfer.
major comments (2)
- [Variant Construction Pipeline] § on variant construction (selection-and-synthesis pipeline): the five-level spectrum is presented as a monotonic transfer-distance axis, yet no quantitative check is supplied that difficulty is held constant within suites, that the ordering is monotonic under model-independent measures, or that synthesis artifacts do not introduce surface cues correlated with claimed distance; this assumption is load-bearing for the headline RL-efficiency claim.
- [Empirical Comparisons] Results on RL vs. SFT comparisons: the reported spectrum profiles are described qualitatively without accompanying quantitative metrics, error bars, or statistical tests on the efficiency differences, so the central claim that RL converts memorization into near-transfer more efficiently cannot be verified from the supplied evidence.
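One model-independent measure of the kind the first major comment asks for is lexical overlap between an original problem statement and each variant. The sketch below uses Jaccard similarity over lowercase word tokens; this metric is a choice made for illustration, and the helper names `token_set` and `jaccard_overlap` are hypothetical rather than anything the paper reports.

```python
import re


def token_set(text):
    """Lowercase alphanumeric word tokens of a problem statement."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_overlap(original, variant):
    """Jaccard similarity between the token sets of two statements.

    Returns 1.0 for identical vocabularies and 0.0 for disjoint ones.
    Two empty statements are treated as identical.
    """
    a, b = token_set(original), token_set(variant)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

If overlap with the original fell steadily across the claimed levels, that would support the ordering on surface form alone. It would say nothing about difficulty, which needs a separate proxy.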
minor comments (2)
- [Spectrum Definition] Clarify whether the unpaired baseline is strictly out-of-distribution or merely unpaired within the seeded problem set.
- [Variant Diagnosis] The abstract states findings on RFT preserving a stronger far-transfer tail; supply the corresponding spectrum plot or table row to allow direct inspection.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the Generalization Spectrum framework. The comments highlight important aspects of validation and evidence presentation that we will address through revisions to strengthen the manuscript.
- Referee: [Variant Construction Pipeline] § on variant construction (selection-and-synthesis pipeline): the five-level spectrum is presented as a monotonic transfer-distance axis, yet no quantitative check is supplied that difficulty is held constant within suites, that the ordering is monotonic under model-independent measures, or that synthesis artifacts do not introduce surface cues correlated with claimed distance; this assumption is load-bearing for the headline RL-efficiency claim.
  Authors: We agree that the current presentation relies on the construction of the levels (exact recall, implementation transfer, narrative re-framing, category-matched, unpaired) to establish increasing transfer distance while holding the underlying problem fixed, without supplying explicit quantitative validation of monotonicity or difficulty constancy. The selection-and-synthesis pipeline is designed to control for these factors by seeding from recent problems and applying targeted transformations, but we did not include model-independent checks such as lexical or structural complexity metrics across variants. In revision we will add such an analysis in the variant construction section, including verification that ordering holds under independent difficulty proxies and discussion of how the pipeline reduces surface-cue artifacts.
  Revision: yes
- Referee: [Empirical Comparisons] Results on RL vs. SFT comparisons: the reported spectrum profiles are described qualitatively without accompanying quantitative metrics, error bars, or statistical tests on the efficiency differences, so the central claim that RL converts memorization into near-transfer more efficiently cannot be verified from the supplied evidence.
  Authors: The manuscript presents the spectrum profiles as the primary evidence for the RL-efficiency claim and within-family diagnoses, relying on visual comparison across the transfer distances. We acknowledge that this is qualitative and lacks aggregate quantitative metrics, error bars, or statistical tests. In revision we will introduce summary statistics (e.g., mean transfer efficiency per paradigm) computed over the problem suites, include error bars reflecting variation across examples, and add appropriate statistical comparisons to support the efficiency differences.
  Revision: yes
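The summary statistics and error bars promised in this response could take roughly the following form. This is a sketch under stated assumptions: the page does not give the paper's definition of transfer efficiency, so the ratio below (near-transfer pass rate divided by exact-recall pass rate) is a hypothetical stand-in, and the percentile bootstrap is one common choice of interval, not necessarily the one the authors will use.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean.

    values: per-example scores at one transfer level (assumed shape).
    Returns (lower, upper) bounds at confidence level 1 - alpha.
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def transfer_efficiency(near_rate, recall_rate):
    """Hypothetical efficiency: near-transfer rate per unit of recall rate.

    Returns NaN when nothing was memorized, since the ratio is undefined.
    """
    if recall_rate == 0:
        return float("nan")
    return near_rate / recall_rate
```

Resampling over examples captures variation across problems only. Variation across training seeds would need its own runs.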
Circularity Check
No significant circularity; framework and comparisons are independently constructed
full rationale
The paper defines the Generalization Spectrum as a new evaluation framework that constructs controlled test variant suites ordered by transfer distance and then measures empirical performance of RL, SFT, and ICL on those suites. No equations, fitted parameters, or self-citations are shown that would make any reported 'prediction' or comparison reduce to the inputs by construction. The selection-and-synthesis pipeline is described as an external process seeded on recent problems, and the resulting profiles are presented as direct observations rather than derived from prior self-referential results. The central claims therefore rest on the empirical measurements themselves rather than on any self-definitional or load-bearing self-citation step.