pith. machine review for the scientific record.

arxiv: 2605.14445 · v1 · submitted 2026-05-14 · 💻 cs.LG

Recognition: no theorem link

FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 01:58 UTC · model grok-4.3

classification 💻 cs.LG
keywords open-ended coding · problem synthesis · LLM fine-tuning · competitive programming · data generation · idea divergence · agent benchmarks

The pith

An automated system evolves closed-ended competitive programming tasks into open-ended coding problems and uses the resulting data to train stronger LLM coders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

FrontierSmith takes existing competitive programming problems and modifies their goals, restricts outputs, and generalizes inputs to create open-ended variants that admit many valid solutions. It filters candidates with a quantitative idea divergence metric that favors problems prompting genuinely different solution strategies across solvers, then auto-generates test cases and verifiers for the survivors. When models are trained on these problems, Qwen3.5-9B gains 8.82 points on FrontierCS and 306 Elo on ALE-bench while the 27B variant gains 12.12 and 309 Elo respectively. The synthesized problems also prompt agents to use more turns and tokens, matching the behavior seen with human-curated open-ended tasks. The method therefore turns abundant closed-ended seeds into scalable training data for the open-ended coding regime where current models remain weak.
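The loop the pith describes, evolve seed problems, filter by idea divergence, keep the survivors, can be sketched in a few lines. The operator names, the `divergence` callback, and the threshold below are illustrative stand-ins, not the paper's implementation.

```python
import random

# Hypothetical labels for the three transformations the paper describes:
# altering goals, restricting outputs, and generalizing inputs.
OPERATORS = ["alter_goal", "restrict_output", "generalize_input"]

def evolve(problem: str, rng: random.Random) -> str:
    """Apply one randomly chosen transformation (placeholder: tag the text)."""
    op = rng.choice(OPERATORS)
    return f"{problem} [{op}]"

def synthesize(seeds, divergence, threshold=0.5, rounds=2, seed=0):
    """Evolve closed-ended seeds for a few rounds, then keep only variants
    whose idea-divergence score clears the threshold (all parts stubbed)."""
    rng = random.Random(seed)
    pool = list(seeds)
    for _ in range(rounds):
        pool = [evolve(p, rng) for p in pool]
    return [p for p in pool if divergence(p) >= threshold]
```

With a permissive scorer every evolved variant survives; with a strict one the pool empties, which is the intended filtering behavior.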

Core claim

FrontierSmith iteratively evolves closed-ended competitive programming problems into open-ended coding challenges by altering goals, restricting outputs, and generalizing inputs. A quantitative idea divergence metric then retains only those problems that elicit genuinely diverse solution approaches from different solvers, after which agents produce test cases and verifiers. Training on the resulting dataset produces the stated gains on FrontierCS and ALE-bench and increases the number of turns and tokens agents use, comparable to human-curated open-ended problems.

What carries the argument

The quantitative idea divergence metric that selects problems eliciting genuinely diverse solution approaches from different solvers.
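The paper does not publish a formula for this metric, so any rendering is speculative. One plausible instantiation is the mean pairwise dissimilarity of the solutions that different solvers produce, sketched here with token-set Jaccard dissimilarity as a placeholder distance.

```python
from itertools import combinations

def jaccard_dissimilarity(a: str, b: str) -> float:
    """1 minus Jaccard similarity over whitespace-token sets."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)

def idea_divergence(solutions: list[str]) -> float:
    """Mean pairwise dissimilarity across the solutions of several solvers.
    Higher values suggest the problem admits genuinely different approaches."""
    pairs = list(combinations(solutions, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_dissimilarity(a, b) for a, b in pairs) / len(pairs)
```

Identical solutions score 0.0 and fully disjoint ones score 1.0; a real system would presumably compare solution ideas (embeddings, ASTs) rather than surface tokens.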

If this is right

  • Models trained on the synthesized problems show measurable gains on two separate open-ended coding benchmarks.
  • The generated problems cause agents to take more turns and use more tokens during solution, matching patterns observed with human-curated open-ended tasks.
  • Closed-ended competitive programming problems can serve as practical seeds for generating long-horizon coding training data at scale.
  • The pipeline produces verifiable test cases and verifiers without manual curation for the retained problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same evolution-plus-divergence approach could be tested on non-coding domains such as open-ended mathematical reasoning or scientific hypothesis generation.
  • If the divergence metric correlates with downstream gains, future systems might optimize the metric itself rather than hand-designing problem transformations.
  • The method implies that explicit diversity enforcement in data generation can substitute for the scarcity of naturally occurring open-ended problems.

Load-bearing premise

The quantitative idea divergence metric reliably selects problems that elicit genuinely diverse solution approaches from different solvers, and the automatically generated test cases and verifiers are sufficiently robust to support training.
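The robustness of auto-generated verifiers is checkable in principle: run each verifier against outputs known to be correct and known to be wrong. A minimal audit sketch, where `verifier` is any generated accept/reject predicate (a hypothetical interface, not the paper's):

```python
def audit_verifier(verifier, correct_outputs, wrong_outputs):
    """Sanity-check a generated verifier: it should accept every known-good
    output and reject every known-bad one. Returns (accept_rate, reject_rate)."""
    accepted = sum(verifier(o) for o in correct_outputs)
    rejected = sum(not verifier(o) for o in wrong_outputs)
    return (accepted / len(correct_outputs), rejected / len(wrong_outputs))
```

An always-accepting verifier scores a reject rate of 0.0 on this audit, surfacing the overly permissive failure mode that would otherwise leak noise into training.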

What would settle it

An independent retraining of the same base models on the synthesized data that measured zero or negative change, rather than the reported +8.82 score and +306 Elo gains on FrontierCS and ALE-bench, would falsify the central claim.

Original abstract

Many real-world coding challenges are open-ended and admit no known optimal solution. Yet, recent progress in LLM coding has focused on well-defined tasks such as feature implementation, bug fixing, and competitive programming. Open-ended coding remains a weak spot for LLMs, largely because open-ended training problems are scarce and expensive to construct. Our goal is to synthesize open-ended coding problems at scale to train stronger LLM coders. We introduce FrontierSmith, an automated system for iteratively evolving open-ended problems from existing closed-ended coding tasks. Starting from competitive programming problems, FrontierSmith generates candidate open-ended variants by changing the problems' goals, restricting outputs, and generalizing inputs. It then uses a quantitative idea divergence metric to select problems that elicit genuinely diverse approaches from different solvers. Agents then generate test cases and verifiers for the surviving candidates. On two open-ended coding benchmarks, training on our synthesized data yields substantial gains over the base models: Qwen3.5-9B improves by +8.82 score on FrontierCS and +306.36 (Elo-rating-based performance) on ALE-bench; Qwen3.5-27B improves by +12.12 and +309.12, respectively. The synthesized problems also make agents take more turns and use more tokens, similar to human-curated ones, suggesting that closed-ended seeds can be a practical starting point for long-horizon coding data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces FrontierSmith, a system that iteratively evolves open-ended coding problems from closed-ended competitive programming seeds by altering goals, restricting outputs, and generalizing inputs. It applies a quantitative idea divergence metric to filter for problems that elicit diverse solution approaches across solvers, then uses agents to generate test cases and verifiers. Training on the resulting data is reported to yield gains of +8.82/+12.12 on FrontierCS and +306.36/+309.12 on ALE-bench for Qwen3.5-9B and 27B models, respectively, along with increased interaction turns and token usage akin to human-curated open-ended problems.

Significance. If the divergence-based selection and verifier generation produce reliably clean, diverse open-ended training signals, the approach offers a scalable route to address the scarcity of long-horizon coding data, potentially improving LLM performance on real-world open-ended tasks. The evaluation on external benchmarks with raw score deltas (rather than in-sample fitting) is a methodological strength that supports the claim of genuine generalization.

major comments (3)
  1. [§3.2] §3.2 (Divergence Metric): The quantitative idea divergence metric is described as selecting problems that 'elicit genuinely diverse approaches from different solvers,' yet no formula, implementation details, or validation (e.g., human agreement rates or measured inter-solver variance) is supplied. This omission is load-bearing for the central claim that gains arise from open-ended diversity rather than confounding factors such as problem length or data volume.
  2. [§4.3] §4.3 (Verifier Generation): No error rates, robustness checks, or failure-mode analysis are reported for the agent-generated test cases and verifiers. Without these, it is impossible to rule out that training gains partly reflect noisy or incorrect supervision, weakening attribution of the +8.82/+12.12 and Elo improvements to the synthesized open-ended problems.
  3. [§5.1] §5.1 (Ablation Experiments): The reported results do not isolate the divergence filter's contribution from the effects of simply increasing training data volume or altering input distributions; an ablation removing the filter (or comparing against unfiltered evolved problems) is needed to confirm that the selection step is responsible for the observed behavioral changes in turns and tokens.
minor comments (2)
  1. [Abstract and §5.2] The abstract and §5.2 state that agents 'take more turns and use more tokens' but do not report the precise measurement protocol or statistical significance of these differences relative to the closed-ended baselines.
  2. [§3.1] Notation for the evolution operators (goal change, output restriction, input generalization) is introduced informally; adding a compact pseudocode listing or equation set in §3.1 would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving clarity and rigor, and we address each major point below with planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Divergence Metric): The quantitative idea divergence metric is described as selecting problems that 'elicit genuinely diverse approaches from different solvers,' yet no formula, implementation details, or validation (e.g., human agreement rates or measured inter-solver variance) is supplied. This omission is load-bearing for the central claim that gains arise from open-ended diversity rather than confounding factors such as problem length or data volume.

    Authors: We will revise §3.2 to include the exact mathematical formula for the idea divergence metric, pseudocode for its computation across multiple solvers, and validation results. Specifically, we will report measured inter-solver variance on the selected problems and human agreement rates from two annotators evaluating a random sample of 50 problems for whether they elicit genuinely diverse solution approaches. These additions will directly support the claim that the metric isolates open-ended diversity. revision: yes

  2. Referee: [§4.3] §4.3 (Verifier Generation): No error rates, robustness checks, or failure-mode analysis are reported for the agent-generated test cases and verifiers. Without these, it is impossible to rule out that training gains partly reflect noisy or incorrect supervision, weakening attribution of the +8.82/+12.12 and Elo improvements to the synthesized open-ended problems.

    Authors: We acknowledge the need for explicit quality metrics. In the revised version we will add a new analysis subsection under §4.3 reporting error rates from manual inspection of 100 randomly sampled verifiers and test cases, plus a discussion of observed failure modes such as overly permissive tests or missed edge cases. While these checks cannot fully eliminate the possibility of residual noise, the consistent improvements on external benchmarks (FrontierCS and ALE-bench) provide evidence that the supervision remains sufficiently reliable for the reported gains. revision: partial

  3. Referee: [§5.1] §5.1 (Ablation Experiments): The reported results do not isolate the divergence filter's contribution from the effects of simply increasing training data volume or altering input distributions; an ablation removing the filter (or comparing against unfiltered evolved problems) is needed to confirm that the selection step is responsible for the observed behavioral changes in turns and tokens.

    Authors: We agree that isolating the filter's contribution is essential. We will add an ablation experiment to §5.1 that trains identical models on (a) the full FrontierSmith dataset and (b) the unfiltered set of all evolved problems (same volume and input distribution). We will report the resulting differences in interaction turns, token usage, and benchmark scores to demonstrate that the divergence-based selection, rather than data volume alone, drives the more human-like behavioral patterns. revision: yes
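The comparison the authors propose reduces to per-run metric deltas between the filtered and unfiltered training conditions. A minimal stdlib sketch of the statistics, with the run-level numbers purely hypothetical:

```python
import random

def bootstrap_mean_diff(filtered, unfiltered, n_boot=2000, seed=0):
    """95% bootstrap CI for the difference in mean per-run metric
    (e.g. agent turns) between the filtered and unfiltered conditions."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        f = [rng.choice(filtered) for _ in filtered]
        u = [rng.choice(unfiltered) for _ in unfiltered]
        diffs.append(sum(f) / len(f) - sum(u) / len(u))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]
```

A CI that excludes zero would support attributing the behavioral shift to the divergence filter rather than to data volume alone.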

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper's central claims consist of empirical performance deltas on two external benchmarks (FrontierCS and ALE-bench) after training on synthesized data. These benchmarks are independent of the synthesis pipeline, and no equations or selection metrics are shown to reduce the reported gains to quantities defined inside the paper itself. The quantitative idea divergence metric and verifier generation are described as procedural steps whose outputs are evaluated downstream rather than fitted or self-defined to produce the headline numbers. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear in the provided abstract or reader's summary. The derivation therefore remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that closed-ended problems can be mutated into open-ended ones while preserving the ability to automatically verify solutions. No explicit free parameters or invented physical entities are stated in the abstract.

axioms (1)
  • domain assumption: Closed-ended competitive programming problems contain sufficient structure to be mutated into open-ended variants that still admit automatic verification.
    Invoked when the system generates candidate variants and then produces test cases and verifiers for them.

pith-pipeline@v0.9.0 · 5611 in / 1282 out tokens · 36666 ms · 2026-05-15T01:58:18.730852+00:00 · methodology

