Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis
Pith reviewed 2026-06-30 20:55 UTC · model grok-4.3
The pith
Self-improving reasoning RL succeeds when models synthesize their own environments that maintain stable solve-verify asymmetry.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that stable self-improvement arises from an environment-construction loop in which each artifact is a reusable executable object that samples instances, computes references, and scores responses, provided the environments exhibit stable solve-verify asymmetry. This asymmetry takes two forms: tasks algorithmically hard to reason through but trivial as code, or tasks intrinsically hard to solve but easy to verify. Both keep the reward informative because the policy cannot close the gap by gaming the verifier. EvoEnv implements the loop by synthesizing Python environments from ten seeds and admitting them only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks.
What carries the argument
Stable solve-verify asymmetry: the durable gap in which the model can write an oracle once that it cannot reliably execute in natural language on fresh instances, keeping the reward signal informative as the solver improves.
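To make "write an oracle once" concrete, here is a minimal sketch of an environment object with the three roles named in the core claim: sampling instances, computing references, and scoring responses. The task (longest strictly increasing subsequence), the class name `LISEnvironment`, and all parameters are illustrative assumptions, not taken from the paper.

```python
import random
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    prompt: str
    numbers: tuple


class LISEnvironment:
    """Hypothetical environment: length of the longest strictly increasing subsequence."""

    def __init__(self, length: int = 12, max_value: int = 50):
        self.length = length
        self.max_value = max_value

    def sample(self, rng: random.Random) -> Instance:
        # Fresh instances come from a seeded generator, so the supply is unbounded.
        numbers = tuple(rng.randint(1, self.max_value) for _ in range(self.length))
        prompt = (
            "Give the length of the longest strictly increasing "
            f"subsequence of {list(numbers)}."
        )
        return Instance(prompt, numbers)

    def reference(self, instance: Instance) -> int:
        # Oracle written once: tails[k] holds the smallest possible last element
        # of a strictly increasing subsequence of length k + 1.
        tails = []
        for x in instance.numbers:
            i = bisect_left(tails, x)
            if i == len(tails):
                tails.append(x)
            else:
                tails[i] = x
        return len(tails)

    def score(self, instance: Instance, response: str) -> float:
        # Binary reward: 1.0 only if the response parses to the reference value.
        try:
            answer = int(response.strip())
        except ValueError:
            return 0.0
        return 1.0 if answer == self.reference(instance) else 0.0
```

The oracle is a few lines of code, while a solver answering in natural language has to track the same bookkeeping step by step on every new list.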
If this is right
- On already strong models, fixed public-data RLVR and fixed hand-crafted environment RLVR reduce average performance while environment synthesis raises it.
- Self-improvement depends on environments whose difficulty stays structurally beyond the model's reach, not on producing more synthetic data.
- A single policy can serve as both generator and solver when environments are admitted only after validation and calibration.
- Two complementary environment types sustain the asymmetry: algorithmically hard to reason through but trivial as code, and intrinsically hard to solve but easy to verify.
- The reward signal stays useful only while the generator continues to produce environments that the current solver cannot reliably handle in natural language.
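The "hard to solve but easy to verify" type in the list above can be illustrated with planted subset-sum, which the abstract names as an example. The sketch below is an assumption-laden illustration: the class name, the instance layout, and the comma-separated response format are hypothetical. The scorer does not compare against a single reference answer; it checks whether the proposed indices form a valid witness.

```python
import random


class PlantedSubsetSumEnv:
    """Hypothetical environment: find a nonempty subset of the numbers that sums to a target."""

    def __init__(self, size: int = 16, max_value: int = 10_000):
        self.size = size
        self.max_value = max_value

    def sample(self, rng: random.Random) -> dict:
        numbers = [rng.randint(1, self.max_value) for _ in range(self.size)]
        # Plant a solution by choosing a subset first, then deriving the target from it.
        k = rng.randint(2, self.size - 1)
        planted = rng.sample(range(self.size), k)
        target = sum(numbers[i] for i in planted)
        return {"numbers": numbers, "target": target, "planted": sorted(planted)}

    def score(self, instance: dict, response: str) -> float:
        # Assumed response format: comma-separated, zero-based indices such as "0, 3, 7".
        try:
            indices = [int(tok) for tok in response.split(",")]
        except ValueError:
            return 0.0
        n = len(instance["numbers"])
        if not indices or len(set(indices)) != len(indices):
            return 0.0  # empty answer or repeated index
        if any(i < 0 or i >= n for i in indices):
            return 0.0  # index out of range
        # Verification is one summation, whatever search produced the indices.
        total = sum(instance["numbers"][i] for i in indices)
        return 1.0 if total == instance["target"] else 0.0
```

Any subset that hits the target is accepted, not only the planted one, so the reward tracks the task itself and not one memorized answer.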
Where Pith is reading between the lines
- The same asymmetry principle could be tested in non-reasoning domains such as code generation if analogous oracle-solver gaps can be engineered.
- Running the loop for more iterations would show whether the generator itself improves at creating harder environments or whether asymmetry eventually saturates.
- Applying the method to larger base models might produce larger absolute gains if the generator scales in its ability to maintain the gap.
- The approach suggests that future self-improvement systems should prioritize verifiable environment construction over pure data synthesis.
Load-bearing premise
The environments the generator produces will keep showing stable solve-verify asymmetry on new instances even after the solver policy strengthens, so the reward remains informative instead of being gamed or saturated.
What would settle it
After multiple training rounds with EvoEnv, measure whether the solver policy now solves the generated environments reliably in natural language without needing the code oracle; if it does, the asymmetry has collapsed and further improvement should cease.
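The check described above could be sketched as a solve-rate measurement on freshly sampled instances. Everything here is an assumption for illustration: the environment interface (`sample`, `score`), the `solver` callable, and the 0.95 cutoff standing in for "reliably", which the source does not quantify.

```python
import random
from typing import Callable


def solve_rate(env, solver: Callable, n: int = 200, seed: int = 0) -> float:
    """Average score of `solver` on `n` fresh instances drawn from `env`.

    `env` is assumed to expose sample(rng) and score(instance, response);
    `solver` maps an instance to a response string without calling the oracle.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        instance = env.sample(rng)
        total += env.score(instance, solver(instance))
    return total / n


def asymmetry_collapsed(rate: float, threshold: float = 0.95) -> bool:
    # Hypothetical cutoff: treat the environment as no longer informative
    # once the solver succeeds on nearly every fresh instance.
    return rate >= threshold
```

Using a seed that was never used during training keeps the measured instances fresh, which is the condition the settling test depends on.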
Original abstract
We pursue a vision for self-improving language models in which the model does not merely generate problems or traces to imitate, but constructs the environments that train it. In zero-data reasoning RL, this reframes self-improvement from a data-generation loop into an environment-construction loop, where each artifact is a reusable executable object that samples instances, computes references, and scores responses. Whether this vision sustains improvement hinges on a single property: the environments must exhibit stable solve-verify asymmetry, that is, the model must be able to write an oracle once that it cannot reliably execute in natural language on fresh instances. This asymmetry takes two complementary forms. Some tasks are algorithmically hard to reason through but trivial as code: a dynamic program or graph traversal, compiled once, yields unboundedly many calibrated instances. Others are intrinsically hard to solve but easy to verify, like planted subset-sum or constraint satisfaction. Both create a durable gap between proposing and solving that the policy cannot close by gaming the verifier, and it is this gap that keeps reward informative as the learner improves. We instantiate this view in EvoEnv, a single-policy generator-solver method that synthesizes Python environments from ten seeds and admits them only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks. The strongest evidence comes from the already-strong regime: on Qwen3-4B-Thinking, fixed public-data RLVR and fixed hand-crafted environment RLVR reduce the average, while EvoEnv improves it from 72.4 to 74.8, a relative gain of 3.3%. Stable self-improvement, we suggest, depends not on producing more synthetic data, but on models learning to construct worlds whose difficulty stays structurally beyond their own reach.
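For reference, the 3.3% relative gain follows directly from the two averages quoted in the abstract; the absolute gain is 2.4 points:

```latex
\[
\frac{74.8 - 72.4}{72.4} \;=\; \frac{2.4}{72.4} \;\approx\; 0.033 \;=\; 3.3\%
\]
```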
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes EvoEnv, a single-policy generator-solver framework in which an LLM synthesizes reusable Python environments from ten seeds; environments are admitted only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks. The central claim is that these environments maintain stable solve-verify asymmetry (algorithmically hard to reason through but trivial as code, or intrinsically hard to solve but easy to verify), enabling durable self-improving reasoning RL. On Qwen3-4B-Thinking this yields an average performance increase from 72.4 to 74.8 (3.3% relative), while fixed public-data RLVR and fixed hand-crafted environment RLVR both decrease performance.
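The four admission checks named in the summary can be pictured as a sequence of gates. The sketch below is a rough illustration under stated assumptions, not the paper's procedure: the function names, the solve-rate band `(0.1, 0.8)`, the similarity cutoff `0.7`, and the token-overlap novelty measure are all hypothetical stand-ins.

```python
from typing import Iterable


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two task descriptions (stand-in novelty measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def admit(
    description: str,
    passes_validation: bool,
    passes_self_review: bool,
    solve_rate: float,
    admitted_descriptions: Iterable[str],
    rate_band: tuple = (0.1, 0.8),   # assumed band, not from the paper
    max_similarity: float = 0.7,     # assumed cutoff, not from the paper
) -> bool:
    # 1. Staged validation: the environment code runs and is self-consistent.
    if not passes_validation:
        return False
    # 2. Semantic self-review: the prompt, oracle, and scorer describe the same task.
    if not passes_self_review:
        return False
    # 3. Solver-relative calibration: neither trivially easy nor hopeless
    #    for the current solver.
    low, high = rate_band
    if not (low <= solve_rate <= high):
        return False
    # 4. Novelty: reject near-duplicates of environments already admitted.
    return all(jaccard(description, d) <= max_similarity for d in admitted_descriptions)
```

Because the third gate depends on the current solver, the admitted set shifts as training proceeds, which is the source of the selection-bias concern raised in the major comments.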
Significance. If the durability of the solve-verify gap is demonstrated, the work offers a principled alternative to data-imitation loops for self-improvement, shifting focus to construction of executable environments whose difficulty remains structurally beyond the solver's reach. The distinction between algorithmic and intrinsic asymmetry forms is a clear conceptual contribution.
major comments (2)
- [Abstract] Abstract: the reported gain (72.4 to 74.8) is presented without any mention of number of runs, standard deviation, statistical significance, or controls for selection bias introduced by the solver-relative difficulty calibration step; this makes it impossible to judge whether the 3.3% improvement is robust or partly an artifact of the admission filter.
- [Abstract] Abstract (and implied results): no post-training measurement is described that checks whether the final solver still exhibits a stable solve-verify gap on newly sampled environments; without this, the claim that the asymmetry 'stays structurally beyond their own reach' remains unverified and the self-evolution loop could saturate after the first iteration.
minor comments (1)
- [Abstract] The abstract refers to 'ten seeds' and 'staged validation' but does not list the concrete benchmarks or task families used to compute the reported average; adding this would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for greater statistical transparency in the abstract and for a direct verification of persistent solve-verify asymmetry. Both comments identify genuine gaps in the current presentation. We address each below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract] Abstract: the reported gain (72.4 to 74.8) is presented without any mention of number of runs, standard deviation, statistical significance, or controls for selection bias introduced by the solver-relative difficulty calibration step; this makes it impossible to judge whether the 3.3% improvement is robust or partly an artifact of the admission filter.
Authors: We agree that the abstract lacks the requested statistical details and that this omission limits assessment of robustness. The solver-relative calibration is an integral part of the admission process, and while the fixed-baseline comparisons provide some control, explicit reporting is needed. In the revised manuscript we will update the abstract to state the number of independent runs, report standard deviation, note statistical significance where applicable, and briefly describe how selection bias is mitigated by the overall experimental design. revision: yes
- Referee: [Abstract] Abstract (and implied results): no post-training measurement is described that checks whether the final solver still exhibits a stable solve-verify gap on newly sampled environments; without this, the claim that the asymmetry 'stays structurally beyond their own reach' remains unverified and the self-evolution loop could saturate after the first iteration.
Authors: The referee correctly notes that the manuscript does not include an explicit post-training evaluation of the solve-verify gap on newly generated environments. The observed performance lift provides supporting evidence, yet a direct measurement would more rigorously substantiate durability. We will add this analysis in the revision by evaluating the trained solver on a fresh set of environments sampled after the final iteration and reporting the resulting solve-verify gap. revision: yes
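The statistical reporting promised in the first response (number of runs, standard deviation, a significance check) could look roughly like the sketch below. It uses only Python's standard library; the function names are hypothetical, and it deliberately stops at the Welch t statistic without computing degrees of freedom or a p-value.

```python
import math
import statistics


def summarize_runs(scores):
    """Run count, mean, sample standard deviation, and standard error for one method."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
    return {"n": n, "mean": mean, "std": std, "sem": std / math.sqrt(n)}


def welch_t(a, b):
    """Welch's t statistic for two independent groups of runs with unequal variances.

    Each group needs at least two runs, and at least one group must have
    nonzero variance; otherwise the denominator is zero.
    """
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))
```

Feeding per-seed benchmark averages for EvoEnv and for each fixed baseline into these helpers would give the run count and spread that the referee asks for.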
Circularity Check
No significant circularity detected
Full rationale
The paper's central claim is an empirical result: EvoEnv training yields a 72.4 to 74.8 gain on Qwen3-4B-Thinking while fixed baselines degrade performance. The derivation chain consists of an environment-generation procedure (ten seeds, staged validation, semantic self-review, solver-relative difficulty calibration, novelty checks) followed by RL training and external evaluation. No equations, fitted parameters renamed as predictions, or self-citations appear in the supplied text. The calibration step selects training environments relative to the current solver, but the reported metric is performance on (presumably held-out) benchmarks, not a quantity forced by the admission filter itself. The stability of solve-verify asymmetry is presented as a necessary assumption rather than a result derived from the inputs by construction. This is therefore a standard empirical claim whose validity can be checked against external benchmarks without circular reduction.