arxiv: 2606.25450 · v2 · pith:KH5B6OFKnew · submitted 2026-06-24 · 💻 cs.LG · cs.CL

The Generalization Spectrum: A Chromatographic Approach to Evaluating Learning Algorithms

Pith reviewed 2026-06-26 05:27 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords generalization spectrum · transfer distance · reinforcement learning · supervised fine-tuning · in-context learning · competitive programming · per-sample generalization · evaluation framework

The pith

Learning from one example transfers differently across algorithms, with RL extending reach more efficiently than supervised fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Generalization Spectrum, which measures how far knowledge from a single training example generalizes by testing on variants ordered by increasing transfer distance. This goes beyond standard i.i.d. benchmarks to reveal the radius of generalization for different learning methods. In experiments on competitive programming problems, reinforcement learning converts memorization into near-transfer more efficiently than supervised fine-tuning baselines, while in-context learning shows strong but correspondence-dependent transfer. The spectrum also shows that local performance gains do not necessarily widen the generalization radius, as seen in comparisons of reference SFT, RFT, and other within-family variants.

Core claim

For each training example, the Generalization Spectrum constructs a controlled suite of test variants arranged by increasing transfer distance: exact recall, implementation transfer across languages, context transfer under narrative re-framing, category-matched in-domain problems, and an unpaired baseline. Tracking performance across these levels profiles how far an algorithm's learning extends. The profiles indicate that RL converts memorization into near-transfer more efficiently than SFT-family baselines, that ICL exhibits strong but correspondence-dependent transfer, and that within-family local gains need not expand the generalization radius, with RFT preserving a stronger far-transfer tail than reference SFT.

What carries the argument

The Generalization Spectrum, which tracks performance across test variants ordered by transfer distance for each training example.
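
The bookkeeping behind such a profile can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the level identifiers, the `(example_id, level, passed)` record layout, and the toy data are all assumptions made here, and only the ordering of the five levels is taken from the abstract.

```python
from collections import defaultdict

# Transfer levels in the order listed in the abstract, nearest first.
# The identifier spellings are hypothetical.
LEVELS = [
    "exact_recall",
    "implementation_transfer",
    "context_transfer",
    "category_matched",
    "unpaired_baseline",
]


def spectrum_profile(results):
    """Return the average pass rate per transfer level, nearest level first.

    `results` is an iterable of (example_id, level, passed) tuples.
    This record layout is an assumption, not the paper's data format.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for _example_id, level, ok in results:
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        total[level] += 1
        passed[level] += int(ok)
    # Levels without observations are reported as None rather than 0.
    return [
        (level, passed[level] / total[level] if total[level] else None)
        for level in LEVELS
    ]


# Toy data for illustration only; these are not results from the paper.
demo = [
    ("p1", "exact_recall", True),
    ("p1", "implementation_transfer", True),
    ("p1", "context_transfer", False),
    ("p2", "exact_recall", True),
    ("p2", "implementation_transfer", False),
    ("p2", "context_transfer", False),
]
profile = spectrum_profile(demo)
for level, rate in profile:
    print(level, rate)
```

Keeping the profile as an ordered list, rather than a single averaged score, is what lets two methods with the same exact-recall rate be compared on how quickly their pass rate falls off with distance.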

If this is right

  • RL is more efficient than SFT at turning exact memorization into near-transfer performance.
  • ICL transfer is strong but depends on specific correspondences between examples.
  • Abstractions and hints primarily improve local transfer rather than expanding the overall generalization radius.
  • RFT preserves a stronger far-transfer tail than reference SFT.
  • Self-distillation or hint-assisted RL can reduce far transfer even when local transfer improves.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying the spectrum to other domains like natural language inference or mathematical reasoning could reveal similar differences in generalization behavior across methods.
  • The framework suggests that algorithm design should target expanding the transfer radius rather than just optimizing local performance.
  • Future evaluations might routinely include such distance-based suites to avoid overestimating generalization from aggregate scores alone.

Load-bearing premise

The constructed test variants form a valid monotonic ordering of transfer distance without uncontrolled factors such as varying difficulty or data contamination.

What would settle it

The framework's validity would be challenged if re-running the experiments, either with variants reordered by a different distance metric or on a new set of problems, produced performance that does not decrease monotonically with the claimed transfer distance.
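
One way to make this test concrete is to scan a profile for adjacent levels where the supposedly farther level scores higher. The sketch below is an assumption-laden illustration: the function name, the `tolerance` parameter, and the toy profile are hypothetical, and a real check would also need to account for sampling noise.

```python
def monotonicity_violations(profile, tolerance=0.0):
    """Return adjacent level pairs where the farther level scores higher.

    `profile` is a list of (level, score) pairs ordered from nearest to
    farthest claimed transfer distance. `tolerance` is a hypothetical
    slack so that small fluctuations are not flagged.
    """
    violations = []
    for (near, near_score), (far, far_score) in zip(profile, profile[1:]):
        if far_score > near_score + tolerance:
            violations.append((near, far, far_score - near_score))
    return violations


# Toy profile for illustration only; these are not results from the paper.
toy_profile = [
    ("exact_recall", 1.0),
    ("implementation_transfer", 0.5),
    ("context_transfer", 0.75),
    ("category_matched", 0.25),
    ("unpaired_baseline", 0.25),
]
violations = monotonicity_violations(toy_profile)
print(violations)
```

An empty result does not by itself validate the ordering, since difficulty differences between levels could produce the same monotone shape; a non-empty result on fresh problems is the kind of evidence the paragraph above describes.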

Original abstract

Traditional evaluations measure a learning algorithm's final performance on an i.i.d. test set, reducing learning to a single aggregate score. This approach obscures a fundamental question: to what extent does learning from a specific example generalize to others? Such per-sample generalization, akin to learning by analogy in human cognition, captures how far the knowledge extracted from one example can transfer, yet remains invisible to standard benchmarks. We introduce the Generalization Spectrum, an evaluation framework designed to expose this hidden dimension. For each training example, we construct a controlled suite of test variants arranged by increasing transfer distance, from exact recall to implementation transfer across languages, context transfer under complete narrative re-framing, category-matched in-domain problems, and an unpaired baseline. By tracking performance across these distances, we reveal not just whether an algorithm learns, but how far that learning extends. We instantiate this framework on competitive programming, using a selection-and-synthesis pipeline seeded with recent problems to mitigate contamination. We first compare three canonical learning paradigms under matched memorization. RL converts memorization into near-transfer more efficiently than SFT-family baselines, while ICL exhibits strong but correspondence-dependent transfer. We then use the Spectrum to diagnose within-family variants. The resulting profiles show that local gains need not expand the generalization radius: abstractions and hints mainly lift local transfer, RFT preserves a stronger far-transfer tail than reference SFT, and self-distillation or hint-assisted RL can reduce far transfer even when local transfer or optimization improves.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Generalization Spectrum framework for evaluating per-sample generalization of learning algorithms by constructing, for each training example, a controlled suite of test variants ordered by increasing transfer distance (exact recall, implementation transfer, narrative re-framing, category-matched in-domain, unpaired baseline). Applied to competitive programming via a selection-and-synthesis pipeline on recent problems, it compares RL, SFT-family methods, and ICL under matched memorization, claiming RL converts memorization into near-transfer more efficiently than SFT baselines while ICL shows strong but correspondence-dependent transfer; it further diagnoses within-family variants, finding that local gains need not expand generalization radius.

Significance. If the variant suites are shown to form a controlled monotonic axis, the framework offers a finer-grained diagnostic than i.i.d. benchmarks, exposing differences in generalization radius across paradigms and variants that could guide algorithm design toward better transfer.

major comments (2)
  1. [Variant Construction Pipeline] § on variant construction (selection-and-synthesis pipeline): the five-level spectrum is presented as a monotonic transfer-distance axis, yet no quantitative check is supplied that difficulty is held constant within suites, that the ordering is monotonic under model-independent measures, or that synthesis artifacts do not introduce surface cues correlated with claimed distance; this assumption is load-bearing for the headline RL-efficiency claim.
  2. [Empirical Comparisons] Results on RL vs. SFT comparisons: the reported spectrum profiles are described qualitatively without accompanying quantitative metrics, error bars, or statistical tests on the efficiency differences, so the central claim that RL converts memorization into near-transfer more efficiently cannot be verified from the supplied evidence.
minor comments (2)
  1. [Spectrum Definition] Clarify whether the unpaired baseline is strictly out-of-distribution or merely unpaired within the seeded problem set.
  2. [Variant Diagnosis] The abstract states findings on RFT preserving a stronger far-transfer tail; supply the corresponding spectrum plot or table row to allow direct inspection.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the Generalization Spectrum framework. The comments highlight important aspects of validation and evidence presentation that we will address through revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Variant Construction Pipeline] § on variant construction (selection-and-synthesis pipeline): the five-level spectrum is presented as a monotonic transfer-distance axis, yet no quantitative check is supplied that difficulty is held constant within suites, that the ordering is monotonic under model-independent measures, or that synthesis artifacts do not introduce surface cues correlated with claimed distance; this assumption is load-bearing for the headline RL-efficiency claim.

    Authors: We agree that the current presentation relies on the construction of the levels (exact recall, implementation transfer, narrative re-framing, category-matched, unpaired) to establish increasing transfer distance while holding the underlying problem fixed, without supplying explicit quantitative validation of monotonicity or difficulty constancy. The selection-and-synthesis pipeline is designed to control for these factors by seeding from recent problems and applying targeted transformations, but we did not include model-independent checks such as lexical or structural complexity metrics across variants. In revision we will add such an analysis in the variant construction section, including verification that ordering holds under independent difficulty proxies and discussion of how the pipeline reduces surface-cue artifacts. revision: yes

  2. Referee: [Empirical Comparisons] Results on RL vs. SFT comparisons: the reported spectrum profiles are described qualitatively without accompanying quantitative metrics, error bars, or statistical tests on the efficiency differences, so the central claim that RL converts memorization into near-transfer more efficiently cannot be verified from the supplied evidence.

    Authors: The manuscript presents the spectrum profiles as the primary evidence for the RL-efficiency claim and within-family diagnoses, relying on visual comparison across the transfer distances. We acknowledge that this is qualitative and lacks aggregate quantitative metrics, error bars, or statistical tests. In revision we will introduce summary statistics (e.g., mean transfer efficiency per paradigm) computed over the problem suites, include error bars reflecting variation across examples, and add appropriate statistical comparisons to support the efficiency differences. revision: yes
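
The error bars promised in the second response could be obtained in several ways. The sketch below uses a percentile bootstrap over per-example scores, which is a technique chosen here for illustration and not one the authors name. The function name, its parameters, and the toy scores are hypothetical, and "transfer efficiency" is left undefined because the material above does not specify it.

```python
import random


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`.

    `values` holds one score per training example, for instance the pass
    rate at a single transfer level. Returns (mean, lower, upper).
    """
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(values)
    # Resample examples with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(values) / n, lower, upper


# Toy per-example scores for illustration only; not results from the paper.
toy_scores = [1, 1, 0, 1, 0, 0, 1, 1]
mean, lower, upper = bootstrap_mean_ci(toy_scores)
print(mean, lower, upper)
```

Resampling whole training examples, rather than individual test variants, matches the rebuttal's wording that the error bars should reflect variation across examples.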

Circularity Check

0 steps flagged

No significant circularity; framework and comparisons are independently constructed

Full rationale

The paper defines the Generalization Spectrum as a new evaluation framework that constructs controlled test variant suites ordered by transfer distance and then measures empirical performance of RL, SFT, and ICL on those suites. No equations, fitted parameters, or self-citations are shown that would make any reported 'prediction' or comparison reduce to the inputs by construction. The selection-and-synthesis pipeline is described as an external process seeded on recent problems, and the resulting profiles are presented as direct observations rather than derived from prior self-referential results. The central claims therefore rest on the empirical measurements themselves rather than on any self-definitional or load-bearing self-citation step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.1-grok · 5818 in / 907 out tokens · 18843 ms · 2026-06-26T05:27:44.343280+00:00 · methodology

Reference graph

Works this paper leans on

68 extracted references · 4 canonical work pages

  1. [1]

    and Chu, Eric and Behbahani, Feryal and Faust, Aleksandra and Larochelle, Hugo , booktitle =

    Rishabh Agarwal, Avi Singh, Lei Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, John D. Co-Reyes, Eric Chu, Feryal Behbahani, Aleksandra Faust, and Hugo Larochelle. Many-shot in-context learning. InAdvances in Neural Information Processing Systems, volume 37, pages 76930–76966. Curran Associates, Inc.,...

  2. [2]

    Opencodereasoning-II: A simple test time scaling approach via self-critique.arXiv preprint arXiv:2507.09075, 2025

    Wasi Uddin Ahmad, Somshubra Majumdar, Aleksander Ficek, Sean Narenthiran, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning-II: A simple test time scaling approach via self-critique.arXiv preprint arXiv:2507.09075, 2025

  3. [3]

    Opencodereasoning: Advancing data distillation for competitive coding

    Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning: Advancing data distillation for competitive coding. arXiv preprint arXiv:2504.01943, 2025

  4. [4]

    Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien

    Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. InInternational Conference on Machine Learning, pages 233–242, 2017

  5. [5]

    Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

  6. [6]

    Barnett and Stephen J

    Susan M. Barnett and Stephen J. Ceci. When and where do we apply what we learn? A taxonomy for far transfer. Psychological Bulletin, 128(4):612–637, 2002

  7. [7]

    Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, TomHenighan, RewonChild, AdityaRamesh, DanielM.Ziegler, JeffreyWu, ClemensWinter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Sc...

  8. [8]

    MultiPL-E: A scalable and polyglot approach to benchmarking neural code generation.IEEE Transactions on Software Engineering, 49(7):3675–3691, 2023

    Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. MultiPL-E: A scalable and polyglot approach to benchmarking neural code generation.IEEE Transactions on Software Engineering, 49(7):3675–3691, 2023

  9. [9]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mo Bavarian, C...

  10. [10]

    Le, Sergey Levine, and Yi Ma

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161, 2025

  11. [11]

    A survey on in-context learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. A survey on in-context learning. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1107–1128, 2024

  12. [12]

    Klear-codetest: Scalable test case generation for code reinforcement learning.arXiv preprint arXiv:2508.05710, 2025

    Jia Fu, Xinyu Yang, Hongzhi Zhang, Yahui Liu, Jingyuan Zhang, Qi Wang, Fuzheng Zhang, and Guorui Zhou. Klear-codetest: Scalable test case generation for code reinforcement learning.arXiv preprint arXiv:2508.05710, 2025

  13. [13]

    Testing mined specifications

    Mark Gabel and Zhendong Su. Testing mined specifications. InProceedings of the ACM SIGSOFT International Symposium on the Foundationsof Software Engineering (FSE), pages 4:1–4:11, 2012. 14

  14. [14]

    Smith, Vudtiwat Ngampruetikorn, and David J

    Chase Goddard, Lindsay M. Smith, Vudtiwat Ngampruetikorn, and David J. Schwab. When can in-context learning generalize out of task distribution? InProceedings of the 42nd International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=YKyza9lrv4

  15. [15]

    Hardtests: Synthesizing high-quality test cases for LLM coding.arXiv preprint arXiv:2505.24098, 2025

    Zhongmou He, Yee Man Choi, Kexun Zhang, Jiabao Ji, Junting Zhou, Dejia Xu, Ivan Bercovich, Aidan Zhang, and Lei Li. Hardtests: Synthesizing high-quality test cases for LLM coding.arXiv preprint arXiv:2505.24098, 2025

  16. [16]

    Measuring coding challenge competence with APPS

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS. In NeurIPS Datasets and Benchmarks, 2021. URLhttps://datasets-benchmarks-proceedings.neurips.cc/ paper/2021/hash/c24cd76e1ce41366a4bbe8a49b02a028-Abs...

  17. [17]

    Does math reasoning improve general LLM capabilities? understanding transferability of LLM reasoning

    Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, and Xiang Yue. Does math reasoning improve general LLM capabilities? understanding transferability of LLM reasoning. arXiv preprint arXiv:2507.00432, 2025

  18. [18]

    Boosting mllm reasoning with text-debiased hint-grpo

    Qihan Huang, Weilong Dai, Jinlong Liu, Wanggui He, Hao Jiang, Mingli Song, Jingyuan Chen, Chang Yao, and Jie Song. Boosting mllm reasoning with text-debiased hint-grpo. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4848–4857, 2025

  19. [19]

    Thinkbench: Dynamic out-of-distribution evaluation for robust llm reasoning

    Shulin Huang, Linyi Yang, Yan Song, Shuang Chen, Leyang Cui, Ziyu Wan, Qingcheng Zeng, Ying Wen, Kun Shao, Weinan Zhang, Jun Wang, and Yue Zhang. Thinkbench: Dynamic out-of-distribution evaluation for robust llm reasoning. arXiv preprint arXiv:2502.16268, 2025

  20. [20]

    Reinforcement learning via self-distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026

  21. [21]

    Livecodebench: Holistic and contamination free evaluation of large language models for code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

  22. [22]

    Xiaoqiang Kang, Shengen Wu, Zimu Wang, Yilin Liu, Xiaobo Jin, Kaizhu Huang, Wei Wang, Yutao Yue, Xiaowei Huang, and Qiufeng Wang. Can GRPO boost complex multimodal table understanding? InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12631–12644, Suzhou, China, 2025. Association for Computational Linguistics....

  23. [23]

    Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprintarXiv:2001.08361, 2020

  24. [24]

    Understanding the effects of RLHF on LLM generalisation and diversity

    Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. Understanding the effects of RLHF on LLM generalisation and diversity. InInternational Conference on Learning Representations, 2024

  25. [25]

    Questa: Expanding reasoning capacity in LLMs via question augmentation

    Jiazheng Li, Hongzhou Lin, Hong Lu, Kaiyue Wen, Zaiwen Yang, Jiaxuan Gao, Yi Wu, and Jingzhao Zhang. Questa: Expanding reasoning capacity in LLMs via question augmentation. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=3MifB0f7qR

  26. [26]

    doi: 10.1126/science.abq1158

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...

  27. [27]

    Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning

    Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. InAdvancesin Neural Information Processing Systems, volume 35, pages 1950–1965, 2022

  28. [28]

    rstar- coder: Scaling competitive code reasoning with a large-scale verified dataset.arXiv preprint arXiv:2505.21297, 2025

    Yifei Liu, Li Lyna Zhang, Yi Zhu, Bingcheng Dong, Xudong Zhou, Ning Shang, Fan Yang, and Mao Yang. rstar- coder: Scaling competitive code reasoning with a large-scale verified dataset.arXiv preprint arXiv:2505.21297, 2025. 15

  29. [29]

    Ghpo: Adaptive guidance for stable and efficient llm reinforcement learning.arXiv preprint arXiv:2507.10628, 2025

    Ziru Liu, Cheng Gong, Xinyu Fu, Yaofang Liu, Ran Chen, Shoubo Hu, Suiyun Zhang, Rui Liu, Qingfu Zhang, and Dandan Tu. Ghpo: Adaptive guidance for stable and efficient llm reinforcement learning.arXiv preprint arXiv:2507.10628, 2025

  30. [30]

    Reft: Reasoning with reinforced fine-tuning

    Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume1: Long Papers), pages 7601–7614, 2024

  31. [31]

    Dynamic scaling of unit tests for code reward modeling

    Zeyao Ma, Xiaokang Zhang, Jing Zhang, Jifan Yu, Sijia Luo, and Jie Tang. Dynamic scaling of unit tests for code reward modeling. arXiv preprint arXiv:2501.01054, 2025

  32. [32]

    Few-shot fine-tuning vs

    Marius Mosbach, Tiago Pimentel, Shauli Ravfogel, Dietrich Klakow, and Yanai Elazar. Few-shot fine-tuning vs. in-context learning: A fair comparison and evaluation. InFindings of the Association for Computational Linguistics: ACL 2023, pages 12284–12314, 2023

  33. [33]

    Deep double descent: Where bigger models and more data hurt

    Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. InInternational Conference on Learning Representations, 2020

  34. [34]

    Christiano, Jan Leike, and Ryan Lowe

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedba...

  35. [35]

    HumanEval-XL: A multilingual code generation benchmark for cross-lingual natural language generalization

    Qiwei Peng, Yekun Chai, and Xuhong Li. HumanEval-XL: A multilingual code generation benchmark for cross-lingual natural language generalization. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 8383–8394, 2024

  36. [36]

    Grokking: Generalization beyond overfitting on small algorithmic datasets.arXiv preprint arXiv:2201.02177, 2022

    Alethea Power, Yuri Burda, Harrison Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets.arXiv preprint arXiv:2201.02177, 2022

  37. [37]

    Codeelo: Benchmarking competition-level code generation of llms with human-comparable elo ratings.arXiv preprint arXiv:2501.01257, 2025

    Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, and Junyang Lin. Codeelo: Benchmarking competition-level code generation of llms with human-comparable elo ratings.arXiv preprint arXiv:2501.01257, 2025

  38. [38]

    Sentence-bert: Sentenceembeddingsusingsiamesebert-networks

    NilsReimersandIrynaGurevych. Sentence-bert: Sentenceembeddingsusingsiamesebert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, 2019

  39. [39]

    Proximal policy optimization algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  40. [40]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. URLhttps://arxiv.org/abs/2402.03300

  41. [41]

    Self-distillation enables continual learning

    Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026

  42. [43]

    URLhttps://arxiv.org/abs/2509.21016

  43. [44]

    Grapharena: Evaluating and exploring large language models on graph computation

    Jianheng Tang, Qifan Zhang, Yuhan Li, Nuo Chen, and Jia Li. Grapharena: Evaluating and exploring large language models on graph computation. InInternational Conference on Learning Representations (ICLR), 2025

  44. [45]

    Can in-context learning really generalize to out-of-distribution tasks? arXiv preprint arXiv:2410.09695, 2024

    Qixun Wang, Yifei Wang, Yisen Wang, and Xianghua Ying. Can in-context learning really generalize to out-of-distribution tasks? arXiv preprint arXiv:2410.09695, 2024

  45. [46]

    HINT: Helping ineffective rollouts navigate towards effectiveness, 2026

    Xinyi Wang, Jinyi Han, Zishang Jiang, Tingyun Li, Jiaqing Liang, Sihang Jiang, Zhaoqian Dai, Shuguang Ma, Fei Yu, and Yanghua Xiao. HINT: Helping ineffective rollouts navigate towards effectiveness, 2026. URL https://openreview.net/forum?id=Fw6PBELcFs. 16

  46. [47]

    Codecontests+: High-quality test case generation for competitive programming.arXiv preprint arXiv:2506.05817, 2025

    Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, and Kai Shen. Codecontests+: High-quality test case generation for competitive programming.arXiv preprint arXiv:2506.05817, 2025

  47. [48]

    Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models.Transactions on Machine Learning Research, 2022

  48. [49]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvancesin Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc., 2022. URLhttps://proceedings. neurips.cc/paper_files/paper/2022/f...

  49. [50]

    Kodcode: A diverse, challenging, and verifiable synthetic dataset for coding

    Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. Kodcode: A diverse, challenging, and verifiable synthetic dataset for coding. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025

  50. [51]

    Ood-bench: Quantifying and understanding two dimensions of out-of-distribution generalization

    Nanyang Ye, Kaican Li, Haoyue Bai, Runpeng Yu, Lanqing Hong, Fengwei Zhou, Zhenguo Li, and Jun Zhu. Ood-bench: Quantifying and understanding two dimensions of out-of-distribution generalization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7947–7958, 2022

  51. [52]

    Deeper insights without updates: The power of in-context learning over fine-tuning

    Qingyu Yin, Xuzheng He, Chak Tou Leong, Fan Wang, Yanzhao Yan, Xiaoyu Shen, and Qiang Zhang. Deeper insights without updates: The power of in-context learning over fine-tuning. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 4138–4151, 2024. doi: 10.18653/v1/2024.findings-emnlp.239. URLhttps://aclanthology.org/2024.findings-...

  52. [53]

    A survey on evaluation of out-of-distribution generalization

    Han Yu, Jiashuo Liu, Xingxuan Zhang, Jiayun Wu, and Peng Cui. A survey on evaluation of out-of-distribution generalization. arXiv preprint arXiv:2403.01874, 2024

  53. [54]

    DAPO: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  54. [55]

    Scaling relationship on learning mathematical reasoning with large language models.arXiv preprint arXiv:2308.01825, 2023

    Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models.arXiv preprint arXiv:2308.01825, 2023

  55. [56]

    STaR: Bootstrapping reasoning with reasoning

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems, volume 35, pages 15476–15488. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/ 639a9a172c044fbb64175b5fad42e9a5-Paper-Conference.pdf

  56. [57]

    ACECODER: Acing coder RL via automated test-case synthesis

    Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, and Wenhu Chen. ACECODER: Acing coder RL via automated test-case synthesis. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025

  57. [58]

    Stephint: Multi-level stepwise hints enhance reinforcement learning to reason. arXiv preprint arXiv:2507.02841, 2025

    Kaiyi Zhang, Ang Lv, Jinpeng Li, Yongbo Wang, Feng Wang, Haoyuan Hu, and Rui Yan. Stephint: Multi-level stepwise hints enhance reinforcement learning to reason. arXiv preprint arXiv:2507.02841, 2025

  58. [59]

    BREAD: Branched rollouts from expert anchors bridge SFT & RL for reasoning

    Xuechen Zhang, Zijian Huang, Yingcong Li, Chenshun Ni, Jiasi Chen, and Samet Oymak. BREAD: Branched rollouts from expert anchors bridge SFT & RL for reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=NUDaln2vCe

  59. [60]

    Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026

  60. [61]

    Zihan Zheng, Zerui Cheng, Zeyu Shen, Shang Zhou, Kaiyuan Liu, Hansen He, Dongruixuan Li, Stanley Wei, Hangyi Hao, Jianzhu Yao, Peiyao Sheng, Zixuan Wang, Wenhao Chai, Aleksandra Korolova, Peter Henderson, Sanjeev Arora, Pramod Viswanath, Jingbo Shang, and Saining Xie. Livecodebench pro: How do olympiad medalists judge LLMs in competitive programming? In Ad...

  61. [62]

    Auto-formulating dynamic programming problems with large language models. arXiv preprint arXiv:2507.11737, 2025

    Chenyu Zhou, Jingyuan Yang, Linwei Xin, Yitian Chen, Ziyan He, and Dongdong Ge. Auto-formulating dynamic programming problems with large language models. arXiv preprint arXiv:2507.11737, 2025.

    Appendix A Dataset Construction Details

    This section provides the full details of our dataset construction pipeline, including selection criteria, synthesis proce...

  62. [63]

    Solution verification. GPT-5.2 attempts the original problem (pass@3). Only problems it solves proceed; this ensures the LLM understands the problem well enough to generate valid variants.

  63. [64]

    Problem generation. Given the original statement, I/O format, and examples, we prompt GPT-5.2 to generate a new problem description. The prompt requires: (a) an entirely different story and narrative, (b) an identical I/O format, and (c) that the original solution logic still applies.

  64. [65]

    Consistency review. Gemini-3-Pro independently checks whether the new problem truly shares the same solution and test cases. Failed cases are filtered out or regenerated. First-pass acceptance is approximately 85%; after regeneration, retention is 100%.

  65. [66]

    DP – Classic DP

    Solution re-derivation. As final validation, Gemini-3-Pro solves the new problem from scratch (no access to the original) and we verify that its solution passes the original test cases. This provides further evidence of semantic equivalence. For quality control, we automatically verify that the original solution still passes all test cases on the new probl...
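The four steps above (solution verification, problem generation, consistency review, solution re-derivation) can be read as a filter-and-regenerate loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `solve`, `generate`, `review`, `rederive`, and `passes_tests` are hypothetical callables standing in for the model and judge calls, `max_regenerations` is an assumed cap that the text does not specify, and treating a failed re-derivation as a reason to regenerate is likewise an assumption.

```python
from typing import Callable, Optional

# Hypothetical stand-ins for the LLM / judge calls described in the text.
Solver = Callable[[str], str]          # problem statement -> candidate solution
Generator = Callable[[str], str]       # original statement -> re-framed statement
Reviewer = Callable[[str, str], bool]  # (original, variant) -> same solution and tests?
Judge = Callable[[str], bool]          # candidate solution -> passes the original tests?


def solves_within(statement: str, solve: Solver, passes_tests: Judge,
                  attempts: int = 3) -> bool:
    """True if any of `attempts` sampled solutions passes the tests (pass@3 in the text)."""
    return any(passes_tests(solve(statement)) for _ in range(attempts))


def make_variant(original: str, generate: Generator, solve: Solver,
                 rederive: Solver, review: Reviewer, passes_tests: Judge,
                 max_regenerations: int = 3) -> Optional[str]:
    """Return an accepted re-framed problem, or None if the original is filtered out."""
    # Step 1: solution verification. Unsolved originals do not proceed.
    if not solves_within(original, solve, passes_tests):
        return None
    for _ in range(max_regenerations):  # assumed cap, not from the source
        # Step 2: problem generation (new narrative, same I/O format and logic).
        variant = generate(original)
        # Step 3: consistency review by an independent model; regenerate on failure.
        if not review(original, variant):
            continue
        # Step 4: solution re-derivation from the variant alone, checked
        # against the original test cases.
        if passes_tests(rederive(variant)):
            return variant
    return None
```

Passing the model calls in as plain callables keeps the control flow testable with stubs, independent of any particular model client.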

  66. [67]

    Constraint omission: the algorithmic direction is close, but a required constraint, boundary case, or output condition is missing

  67. [68]

    Structure confusion: the broad family is similar, but the state space, transition, counted object, graph relation, or interval semantics is mis-specified

  68. [69]

    ```output
    {example_output}
    ```
    ## Note
    {note}

    For ICL, we construct a multi-turn conversation with the demonstration as a completed interaction:

    ICL Format
    [ {

    Other error: the high-level route is plausible, but the code fails through implementation, indexing, I/O, variable, or complexity errors. When the annotator judged an evaluator-failed trace to be apparently correct or could not assign a category, that trace is excluded from the failure-category counts. Table 6 reports counts for the aligned D0/D2 diagnosti...