From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning
Pith reviewed 2026-06-27 00:57 UTC · model grok-4.3
The pith
A trained RL policy outperforms larger models by proposing its own training environment changes from failure summaries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The LLM-as-Environment-Engineer framework conditions the current policy on structured summaries of failure trajectories, policy behavior, and environment statistics to generate the next-stage training environment configuration. With Qwen3-4B as the backbone, this produces the strongest aggregate performance across the paper's benchmarks, exceeding both fixed-environment training and larger proprietary LLMs such as GPT and Gemini. The RL checkpoint itself serves as a better environment engineer than the original base model, and successful updates depend on incorporating failure evidence while preserving configurations that already work.
What carries the argument
The LLM-as-Environment-Engineer framework: it takes structured summaries of failure trajectories, policy behavior, and environment statistics as input and outputs modifications to the environment configuration for the next RL stage.
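The input-to-output flow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `StageSummary` fields, the prompt wording, the JSON reply format, and the `engineer` callable (a stand-in for a call to the current policy model) are all assumptions introduced here.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable


@dataclass
class StageSummary:
    """Structured context for one training stage (all field names are hypothetical)."""
    failure_cases: list[str]
    policy_behavior: dict[str, float]
    env_stats: dict[str, float]
    current_config: dict[str, int]


def build_prompt(summary: StageSummary) -> str:
    """Serialize the summary into a prompt that asks for configuration changes."""
    context = json.dumps(asdict(summary), indent=2, sort_keys=True)
    return (
        "Summary of the last training stage:\n"
        f"{context}\n"
        "Reply with a JSON object listing only the configuration keys to change."
    )


def propose_config(summary: StageSummary, engineer: Callable[[str], str]) -> dict[str, int]:
    """Ask the engineer model for changes and merge them into the current config."""
    changes = json.loads(engineer(build_prompt(summary)))
    # Reject keys the environment generator does not know about.
    unknown = set(changes) - set(summary.current_config)
    if unknown:
        raise ValueError(f"unknown configuration keys: {sorted(unknown)}")
    # Keys that the engineer did not mention keep their current values.
    return {**summary.current_config, **changes}
```

Passing the model as a plain callable keeps the sketch independent of any particular inference API; in practice `engineer` would wrap whatever serving stack hosts the current checkpoint.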
If this is right
- RL pipelines for LLMs can proceed through automated environment redesign without repeated human intervention between stages.
- Iterative self-proposed environment updates yield higher final policy performance than static configurations.
- The trained checkpoint's improved engineering ability suggests policy learning and environment diagnosis reinforce each other.
- Failure trajectory summaries are more effective context than other inputs for generating useful configuration changes.
- Preserving working configurations while altering others based on failures leads to stable improvement across stages.
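The last point, keeping what works and changing only what fails, can be made concrete with a small guard applied to a proposed update. This is a hedged sketch only: the paper reports preservation as a property of successful updates, not as an external filter, and the per-dimension success rates, the `keep_threshold` value, and the function name are assumptions made for illustration.

```python
def apply_guarded_update(
    current: dict[str, int],
    proposed: dict[str, int],
    success_by_key: dict[str, float],
    keep_threshold: float = 0.8,
) -> dict[str, int]:
    """Apply proposed changes only to dimensions that are not already working.

    success_by_key is a hypothetical per-dimension success rate, for example
    measured on episodes that stress that configuration dimension.
    """
    updated = dict(current)
    for key, value in proposed.items():
        if key not in current:
            continue  # ignore dimensions the generator does not expose
        if success_by_key.get(key, 0.0) >= keep_threshold:
            continue  # this setting already works, so preserve it
        updated[key] = value
    return updated
```

A dimension with no recorded success rate is treated as failing here, so it stays open to change; a stricter pipeline might instead freeze unmeasured dimensions.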
Where Pith is reading between the lines
- The method could reduce reliance on domain experts for tuning multi-stage RL training loops if the summarization process generalizes.
- Repeated cycles of policy improvement and environment redesign might produce compounding gains beyond single-stage application.
- Structured failure evidence might transfer to other controllable testbeds if similar multi-dimensional generators are available.
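The repeated cycle of policy improvement and environment redesign mentioned above can be written as a short loop. This is a sketch under stated assumptions: `train`, `summarize`, and `engineer` are hypothetical callables supplied by the caller, and the paper's actual stage schedule and stopping rule are not given in this review.

```python
from typing import Any, Callable

Config = dict[str, Any]


def run_stages(
    policy: Any,
    config: Config,
    num_stages: int,
    train: Callable[[Any, Config], Any],
    summarize: Callable[[Any, Config], dict],
    engineer: Callable[[Any, dict], Config],
) -> tuple[Any, list[Config]]:
    """Alternate RL training with environment redesign by the current policy."""
    history = [dict(config)]
    for _ in range(num_stages):
        policy = train(policy, config)        # RL on the current environment
        summary = summarize(policy, config)   # failures, behavior, env statistics
        config = engineer(policy, summary)    # the trained checkpoint proposes the next config
        history.append(dict(config))
    return policy, history
```

Note that `engineer` receives the freshly trained policy rather than the base model, which mirrors the reported finding that the RL checkpoint is the better environment engineer.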
Load-bearing premise
Conditioning the current policy on summaries of its failures and environment statistics will let it propose changes that reliably raise subsequent performance.
What would settle it
Apply the framework on MAPF-FrozenLake: if the generated environment configurations produce no gain, or a drop, in policy performance relative to fixed-environment baselines, the premise fails.
Original abstract
Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment-Engineer framework in which the current policy model analyzes failure trajectories together with contextual information and proposes modifications to the next-stage training environment configuration. We also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. On this testbed, we condition the environment engineer on structured summaries of policy behavior, failure cases, and environment statistics, from which it produces the configuration for the next training stage. With Qwen3-4B as the backbone, our framework achieves the strongest aggregate performance on our benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines. We further analyze which forms of context are most effective, finding that successful environment updates rely on failure evidence and preserve configurations that already work. Interestingly, the current RL checkpoint serves as a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its remaining weaknesses.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the LLM-as-Environment-Engineer framework, in which the current policy LLM analyzes structured summaries of failure trajectories, policy behavior, and environment statistics to propose modifications to the next-stage RL training environment. A new controllable testbed called MAPF-FrozenLake is introduced whose generator exposes multi-dimensional configuration parameters. Experiments condition Qwen3-4B on these summaries to generate environment updates; the resulting framework reports the strongest aggregate performance, outperforming larger proprietary LLMs and fixed-environment baselines. Additional analyses examine which context elements are most effective and observe that the trained RL checkpoint outperforms the base model as an environment engineer.
Significance. If the empirical results hold, the work offers a practical automation of a currently manual step in multi-stage RL for LLMs and supplies a controllable testbed that enables systematic study of environment redesign. The finding that policy learning improves the model's ability to diagnose its own weaknesses is a concrete observation that could support iterative self-improvement pipelines. The explicit analysis of context forms (failure evidence vs. preserved working configurations) is a useful empirical contribution.
minor comments (3)
- §3 (Testbed description): The precise parameterization of the MAPF-FrozenLake generator (dimensions, ranges, and how each affects the underlying multi-agent path-finding task) is not stated explicitly enough to allow independent reproduction or extension.
- §4 (Experiments): While aggregate performance is claimed, the manuscript should include per-benchmark tables with means, standard deviations, and the exact number of runs to support the statistical comparison against GPT/Gemini and fixed baselines.
- Abstract / §1 (Notation): MAPF is used without expansion on first appearance; a parenthetical definition would improve readability for readers outside the multi-agent planning community.
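The per-benchmark reporting requested in the second comment could be produced with a few lines of standard-library code. This is an illustrative sketch: the function name and the input layout (a mapping from benchmark name to per-run scores) are assumptions, and no scores from the paper are reproduced here.

```python
from statistics import mean, stdev


def summarize_runs(scores_by_benchmark: dict[str, list[float]]) -> dict[str, dict]:
    """Report run count, mean, and sample standard deviation per benchmark."""
    rows = {}
    for name, scores in scores_by_benchmark.items():
        rows[name] = {
            "n": len(scores),
            "mean": mean(scores),
            # The sample standard deviation needs at least two runs.
            "std": stdev(scores) if len(scores) > 1 else None,
        }
    return rows
```

Returning `None` for single-run benchmarks makes a missing spread visible in the table instead of reporting a misleading zero.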
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The recognition of the LLM-as-Environment-Engineer framework, the MAPF-FrozenLake testbed, and the empirical findings on context effectiveness and policy-improved environment engineering is appreciated. No major comments were provided in the report.
Circularity Check
No significant circularity: empirical framework proposal without self-referential derivation
Full rationale
The paper proposes an LLM-as-Environment-Engineer framework and evaluates it empirically on the MAPF-FrozenLake testbed. No equations, fitted parameters, or derivation chain are present. The central claim is an empirical observation that the RL checkpoint outperforms the base model as an environment engineer, presented as a finding rather than a premise. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. The work is self-contained against external benchmarks with no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The current policy model can analyze failure trajectories and environment statistics to propose effective configuration changes.
invented entities (1)
- MAPF-FrozenLake testbed: no independent evidence