From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

Chao Chen; Chengzu Li; Yinhong Liu; Zhijiang Guo; Zhiwei Li

arxiv: 2606.17682 · v1 · pith:7LEAY4NGnew · submitted 2026-06-16 · 💻 cs.CL

Pith reviewed 2026-06-27 00:57 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM · Reinforcement Learning · Environment Design · Self-Improvement · Multi-Agent Reasoning · Training Automation · FrozenLake

The pith

A trained RL policy outperforms larger models by proposing its own training environment changes from failure summaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an LLM-as-Environment-Engineer framework that lets the current policy model review structured summaries of its failures, behavior, and environment statistics, then output modifications for the next training stage. This replaces manual heuristic redesign of environments between RL stages. On the introduced MAPF-FrozenLake testbed, which exposes multi-dimensional configurations, the approach with Qwen3-4B yields the best aggregate results, beating fixed-environment baselines and larger proprietary models. The trained checkpoint proves more effective at this engineering task than the base model, indicating that policy learning sharpens diagnosis of remaining weaknesses. Analysis shows that useful updates draw on failure evidence while keeping successful configurations intact.

Core claim

The LLM-as-Environment-Engineer framework conditions the current policy on structured summaries of failure trajectories, policy behavior, and environment statistics to generate the next-stage training environment configuration. With Qwen3-4B as backbone this produces the strongest aggregate performance across benchmarks, exceeding both fixed-environment training and larger models such as GPT and Gemini. The RL checkpoint itself serves as a better environment engineer than the original base model, and successful updates depend on incorporating failure evidence while preserving configurations that already work.

What carries the argument

LLM-as-Environment-Engineer framework, which takes structured summaries of failures and stats as input to output environment configuration modifications for the next RL stage.
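
The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `StageSummary`, `build_prompt`, `propose_config`, and configuration keys such as `grid_size` and `num_agents` are hypothetical names, and the policy model is represented by any callable that maps a prompt string to a JSON string.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StageSummary:
    """Structured context for the environment engineer (hypothetical schema)."""
    failure_cases: List[str]
    policy_behavior: Dict[str, float]
    env_stats: Dict[str, float]
    current_config: Dict[str, float] = field(default_factory=dict)


def build_prompt(summary: StageSummary) -> str:
    """Serialize the summary into a prompt that asks for a JSON config update."""
    return (
        "You are the environment engineer for the next RL stage.\n"
        f"Current config: {json.dumps(summary.current_config, sort_keys=True)}\n"
        f"Failure cases: {json.dumps(summary.failure_cases)}\n"
        f"Policy behavior: {json.dumps(summary.policy_behavior, sort_keys=True)}\n"
        f"Environment stats: {json.dumps(summary.env_stats, sort_keys=True)}\n"
        "Reply with a JSON object containing only the keys you want to change."
    )


def propose_config(
    summary: StageSummary, model: Callable[[str], str]
) -> Dict[str, float]:
    """Ask the model for changes and overlay them on the current config."""
    reply = model(build_prompt(summary))
    try:
        changes = json.loads(reply)
    except json.JSONDecodeError:
        changes = {}  # unparseable reply: keep the current config unchanged
    if not isinstance(changes, dict):
        changes = {}
    # Ignore keys that the (assumed) generator does not expose.
    known = {k: v for k, v in changes.items() if k in summary.current_config}
    return {**summary.current_config, **known}
```

With a stub model that replies `{"num_agents": 3}` and a current config of `{"grid_size": 4, "num_agents": 2}`, `propose_config` returns the current config with `num_agents` set to 3. Falling back to the unchanged config on a malformed reply is a design choice made for this sketch; the paper's handling of invalid proposals is not described in the text above.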

If this is right

RL pipelines for LLMs can proceed through automated environment redesign without repeated human intervention between stages.
Iterative self-proposed environment updates yield higher final policy performance than static configurations.
The trained checkpoint's improved engineering ability suggests policy learning and environment diagnosis reinforce each other.
Failure trajectory summaries are more effective context than other inputs for generating useful configuration changes.
Preserving working configurations while altering others based on failures leads to stable improvement across stages.
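
The last point, changing only what failure evidence implicates while leaving working settings alone, could be enforced as an explicit guard around the model's proposal. The sketch below is one possible way to do that and is an assumption of this write-up: the paper reports the behavior as an observed property of successful updates, not as a hard rule, and `guarded_update`, `success_by_dim`, and `keep_threshold` are hypothetical names.

```python
from typing import Dict


def guarded_update(
    current: Dict[str, float],
    proposed: Dict[str, float],
    success_by_dim: Dict[str, float],
    keep_threshold: float = 0.8,
) -> Dict[str, float]:
    """Apply proposed changes only to dimensions without strong evidence of success."""
    updated = dict(current)
    for key, value in proposed.items():
        if key not in current:
            continue  # the generator does not expose this dimension
        if success_by_dim.get(key, 0.0) >= keep_threshold:
            continue  # this setting already works: preserve it
        updated[key] = value
    return updated
```

A dimension with no recorded success rate is treated as open to change, which favors exploration; a more conservative variant would require explicit failure evidence before altering anything.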

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could reduce reliance on domain experts for tuning multi-stage RL training loops if the summarization process generalizes.
Repeated cycles of policy improvement and environment redesign might produce compounding gains beyond single-stage application.
Structured failure evidence might transfer to other controllable testbeds if similar multi-dimensional generators are available.

Load-bearing premise

Conditioning the current policy on summaries of its failures and environment statistics will let it propose changes that reliably raise subsequent performance.

What would settle it

Apply the framework on MAPF-FrozenLake and observe that the generated environment configurations produce no gain or a drop in policy performance relative to fixed baselines.
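
As a rough illustration of that check, the comparison can be phrased as a paired difference between scores obtained with engineered configurations and scores obtained with a fixed environment. The function names and the zero tolerance are assumptions made for this sketch; a real evaluation would also need multiple seeds and a significance test.

```python
from statistics import mean
from typing import Sequence


def mean_gain(engineered: Sequence[float], fixed: Sequence[float]) -> float:
    """Mean paired difference between engineered and fixed-environment scores."""
    if len(engineered) != len(fixed) or not engineered:
        raise ValueError("need equally sized, non-empty score lists")
    return mean(e - f for e, f in zip(engineered, fixed))


def claim_is_challenged(
    engineered: Sequence[float], fixed: Sequence[float], tolerance: float = 0.0
) -> bool:
    """True when engineered configurations show no gain, or a drop, over the baseline."""
    return mean_gain(engineered, fixed) <= tolerance
```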

Original abstract

Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment-Engineer framework in which the current policy model analyzes failure trajectories together with contextual information and proposes modifications to the next-stage training environment configuration. We also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. On this testbed, we condition the environment engineer on structured summaries of policy behavior, failure cases, and environment statistics, from which it produces the configuration for the next training stage. With Qwen3-4B as the backbone, our framework achieves the strongest aggregate performance on our benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines. We further analyze which forms of context are most effective, finding that successful environment updates rely on failure evidence and preserve configurations that already work. Interestingly, the current RL checkpoint serves as a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its remaining weaknesses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper puts forward an LLM-as-Environment-Engineer loop that lets the policy model propose its own next training environment from failure summaries, plus a controllable MAPF-FrozenLake testbed, with the main empirical note that the RL checkpoint outperforms the base model at this engineering task.

The main takeaway is a framework where the current policy model reads structured failure trajectories and environment stats, then outputs a revised configuration for the next training stage. They also release MAPF-FrozenLake, a generator that exposes multiple tunable dimensions for multi-agent pathfinding on FrozenLake, built specifically to test this kind of automated redesign.

What the work does cleanly is close the loop on a practical pain point in LLM RL pipelines: instead of manual heuristic tweaks between stages, the model itself suggests changes. The observation that the trained checkpoint is a stronger engineer than the original base model is the most interesting result, and it aligns with the idea that policy learning sharpens the model's ability to spot its own weaknesses. Conditioning on failure evidence rather than just success cases also appears to matter.

The soft spots are mostly around evidence. The abstract states that Qwen3-4B under this setup beats larger proprietary models and fixed-environment baselines on aggregate, but it supplies no numbers, variance, or ablation details, so the size and reliability of the gains cannot be judged from the given text. The testbed is new, which is fine, but that also means there are no external benchmarks yet to show that it captures the right kinds of difficulty. The central assumption, that summaries of failures plus statistics are enough for the model to propose reliably useful changes, holds up in their setup but could be brittle outside the controllable FrozenLake domain.

This is for groups already running iterative RL on LLMs and looking for ways to reduce manual environment work. The framework and testbed are concrete enough that a serious referee could evaluate the experimental protocol and the strength of the self-improvement claim once the full results are in. I would send it to review.

Referee Report

0 major / 3 minor

Summary. The paper proposes the LLM-as-Environment-Engineer framework, in which the current policy LLM analyzes structured summaries of failure trajectories, policy behavior, and environment statistics to propose modifications to the next-stage RL training environment. A new controllable testbed called MAPF-FrozenLake is introduced whose generator exposes multi-dimensional configuration parameters. Experiments condition Qwen3-4B on these summaries to generate environment updates; the resulting framework reports the strongest aggregate performance, outperforming larger proprietary LLMs and fixed-environment baselines. Additional analyses examine which context elements are most effective and observe that the trained RL checkpoint outperforms the base model as an environment engineer.

Significance. If the empirical results hold, the work offers a practical automation of a currently manual step in multi-stage RL for LLMs and supplies a controllable testbed that enables systematic study of environment redesign. The finding that policy learning improves the model's ability to diagnose its own weaknesses is a concrete observation that could support iterative self-improvement pipelines. The explicit analysis of context forms (failure evidence vs. preserved working configurations) is a useful empirical contribution.

minor comments (3)

[§3] Testbed description: the precise parameterization of the MAPF-FrozenLake generator (dimensions, ranges, and how each affects the underlying multi-agent path-finding task) is not stated explicitly enough to allow independent reproduction or extension.
[§4] Experiments: while aggregate performance is claimed, the manuscript should include per-benchmark tables with means, standard deviations, and the exact number of runs to support the statistical comparison against GPT/Gemini and fixed baselines.
[Abstract / §1] Notation: MAPF is used without expansion on first appearance; a parenthetical definition would improve readability for readers outside the multi-agent planning community.
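
The per-benchmark reporting requested in the §4 comment amounts to a small aggregation step. The following sketch shows one way to produce it, assuming raw per-run scores are available as a mapping from benchmark name to a list of floats; the function name and input layout are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarize_runs(
    scores: Dict[str, List[float]]
) -> Dict[str, Tuple[float, float, int]]:
    """Return (mean, sample standard deviation, number of runs) per benchmark."""
    table = {}
    for benchmark, runs in scores.items():
        # The sample standard deviation is undefined for a single run; report 0.0.
        spread = stdev(runs) if len(runs) > 1 else 0.0
        table[benchmark] = (mean(runs), spread, len(runs))
    return table
```

An empty run list raises `statistics.StatisticsError`, which is deliberate here: a benchmark with no runs should not appear in a results table at all.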

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The recognition of the LLM-as-Environment-Engineer framework, the MAPF-FrozenLake testbed, and the empirical findings on context effectiveness and policy-improved environment engineering is appreciated. No major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity: empirical framework proposal without self-referential derivation

The paper proposes an LLM-as-Environment-Engineer framework and evaluates it empirically on the MAPF-FrozenLake testbed. No equations, fitted parameters, or derivation chain are present. The central claim is an empirical observation that the RL checkpoint outperforms the base model as an environment engineer, presented as a finding rather than a premise. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The claims are evaluated empirically on the paper's own benchmarks, and no output reduces to an input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review; no explicit free parameters, mathematical axioms, or invented physical entities are stated. The central proposal rests on an implicit domain assumption about LLM diagnostic capability.

axioms (1)

Domain assumption: The current policy model can analyze failure trajectories and environment statistics to propose effective configuration changes.
This premise is required for the LLM-as-Environment-Engineer loop to function as described.

invented entities (1)

MAPF-FrozenLake testbed · no independent evidence
Purpose: controllable multi-dimensional environment generator for benchmarking environment redesign.
Newly introduced testbed whose generator exposes configurations; no independent evidence of prior existence provided.


work page doi:10.48550/arxiv.2510.08720 2025

[9] [9]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),

Huaye Zeng and Dongfu Jiang and Haozhe Wang and Ping Nie and Xiaotong Chen and Wenhu Chen , editor =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),. 2025 , url =

2025

[10] [10]

Measuring Coding Challenge Competence With

Dan Hendrycks and Steven Basart and Saurav Kadavath and Mantas Mazeika and Akul Arora and Ethan Guo and Collin Burns and Samir Puranik and Horace He and Dawn Song and Jacob Steinhardt , editor =. Measuring Coding Challenge Competence With. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Ben...

2021

[11] [11]

NeurIPS , year=

Measuring Coding Challenge Competence With APPS , author=. NeurIPS , year=

[12] [12]

Evaluating Large Language Models Trained on Code , journal =

Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan and Henrique Pond. Evaluating Large Language Models Trained on Code , journal =. 2021 , url =. 2107.03374 , timestamp =

Pith/arXiv arXiv 2021

[13] [13]

Reddy , title =

Parshin Shojaee and Aneesh Jain and Sindhu Tipirneni and Chandan K. Reddy , title =. Trans. Mach. Learn. Res. , volume =. 2023 , url =

2023

[14] [14]

Forty-second International Conference on Machine Learning,

Jonas Gehring and Kunhao Zheng and Jade Copet and Vegard Mella and Taco Cohen and Gabriel Synnaeve , title =. Forty-second International Conference on Machine Learning,. 2025 , url =

2025

[15] [15]

CoRR , volume =

Shihan Dou and Yan Liu and Haoxiang Jia and Limao Xiong and Enyu Zhou and Wei Shen and Junjie Shan and Caishuang Huang and Xiao Wang and Xiaoran Fan and Zhiheng Xi and Yuhao Zhou and Tao Ji and Rui Zheng and Qi Zhang and Xuanjing Huang and Tao Gui , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2402.01391 , eprinttype =. 2402.01391 , timestamp =

work page doi:10.48550/arxiv.2402.01391 2024

[16] [16]

CoRR , volume =

Huimu Yu and Xing Wu and Weidong Yin and Debing Zhang and Songlin Hu , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2410.02229 , eprinttype =. 2410.02229 , timestamp =

work page doi:10.48550/arxiv.2410.02229 2024

[17] [17]

Joar Skalse and Nikolaus H. R. Howe and Dmitrii Krasheninnikov and David Krueger , editor =. Defining and Characterizing Reward Gaming , booktitle =. 2022 , url =

2022

[18] [18]

CoRR , volume =

Jiayi Fu and Xuandong Zhao and Chengyuan Yao and Heng Wang and Qi Han and Yanghua Xiao , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2502.18770 , eprinttype =. 2502.18770 , timestamp =

work page doi:10.48550/arxiv.2502.18770 2025

[19] [19]

Christiano and John Schulman and Dan Man

Dario Amodei and Chris Olah and Jacob Steinhardt and Paul F. Christiano and John Schulman and Dan Man. Concrete Problems in. CoRR , volume =. 2016 , url =. 1606.06565 , timestamp =

Pith/arXiv arXiv 2016

[20] [20]

Reinforcement Learning with a Corrupted Reward Channel , booktitle =

Tom Everitt and Victoria Krakovna and Laurent Orseau and Shane Legg , editor =. Reinforcement Learning with a Corrupted Reward Channel , booktitle =. 2017 , url =. doi:10.24963/IJCAI.2017/656 , timestamp =

work page doi:10.24963/ijcai.2017/656 2017

[21] [21]

the method of paired comparisons , author=

Rank analysis of incomplete block designs: I. the method of paired comparisons , author=. Biometrika , volume=. 1952 , publisher=

1952

[22] [22]

Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization , journal =

Mingzhe Du and Luu Tuan Tuan and Yue Liu and Yuhao Qing and Dong Huang and Xinyi He and Qian Liu and Zejun Ma and See. Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization , journal =. 2025 , url =. doi:10.48550/ARXIV.2505.23387 , eprinttype =. 2505.23387 , timestamp =

work page doi:10.48550/arxiv.2505.23387 2025

[23] [23]

Hugging Face repository , howpublished =

CodeForces , author=. Hugging Face repository , howpublished =. 2025 , publisher =

2025

[24] [24]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu and Zheng Zhang and Ruofei Zhu and Yufeng Yuan and Xiaochen Zuo and Yu Yue and Tiantian Fan and Gaohong Liu and Lingjun Liu and Xin Liu and Haibin Lin and Zhiqi Lin and Bole Ma and Guangming Sheng and Yuxuan Tong and Chi Zhang and Mofan Zhang and Wang Zhang and Hang Zhu and Jinhua Zhu and Jiaze Chen and Jiangjie Chen and Chengyi Wang and Hongli ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.14476 2025

[25] [25]

2025 , isbn =

Guangming Sheng and Chi Zhang and Zilingfeng Ye and Xibin Wu and Wang Zhang and Ru Zhang and Yanghua Peng and Haibin Lin and Chuan Wu , title =. Proceedings of the Twentieth European Conference on Computer Systems, EuroSys 2025, Rotterdam, The Netherlands, 30 March 2025 - 3 April 2025 , pages =. 2025 , url =. doi:10.1145/3689031.3696075 , timestamp =

work page doi:10.1145/3689031.3696075 2025

[26] [26]

LiveBench:

Colin White and Samuel Dooley and Manley Roberts and Arka Pal and Benjamin Feuer and Siddhartha Jain and Ravid Shwartz. LiveBench:. The Thirteenth International Conference on Learning Representations,. 2025 , url =

2025

[27] [27]

Nye and Maarten Bosma and Henryk Michalewski and David Dohan and Ellen Jiang and Carrie J

Jacob Austin and Augustus Odena and Maxwell I. Nye and Maarten Bosma and Henryk Michalewski and David Dohan and Ellen Jiang and Carrie J. Cai and Michael Terry and Quoc V. Le and Charles Sutton , title =. CoRR , volume =. 2021 , url =. 2108.07732 , timestamp =

Pith/arXiv arXiv 2021

[28] [28]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code , booktitle =

Naman Jain and King Han and Alex Gu and Wen. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code , booktitle =. 2025 , url =

2025

[29] [29]

Science , volume=

Competition-level code generation with alphacode , author=. Science , volume=. 2022 , publisher=

2022

[30] [30]

CoRR , volume =

Yinjie Wang and Ling Yang and Ye Tian and Ke Shen and Mengdi Wang , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2506.03136 , eprinttype =. 2506.03136 , timestamp =

work page doi:10.48550/arxiv.2506.03136 2025

[31] [31]

2022 , eprint=

Emergent Abilities of Large Language Models , author=. 2022 , eprint=

2022

[32] [32]

2023 , eprint=

AceCoder: Utilizing Existing Code to Enhance Code Generation , author=. 2023 , eprint=

2023

[33] [33]

Evaluating In-Context Learning of Libraries for Code Generation , booktitle =

Arkil Patel and Siva Reddy and Dzmitry Bahdanau and Pradeep Dasigi , editor =. Evaluating In-Context Learning of Libraries for Code Generation , booktitle =. 2024 , url =. doi:10.18653/V1/2024.NAACL-LONG.161 , timestamp =

work page doi:10.18653/v1/2024.naacl-long.161 2024

[34] [34]

2026 , eprint=

X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests , author=. 2026 , eprint=

2026

[35] [35]

CoRR , volume =

Codefuse and Wenting Cai and Yuchen Cao and Chaoyu Chen and Chen Chen and Siba Chen and Qing Cui and Peng Di and Junpeng Fang and Zi Gong and Ting Guo and Zhengyu He and Yang Huang and Cong Li and Jianguo Li and Zheng Li and Shijie Lian and Bingchang Liu and Songshan Luo and Shuo Mao and Min Shen and Jian Wu and Jiaolong Yang and Wenjie Yang and Tong Ye a...

work page doi:10.48550/arxiv.2503.17793 2025

[36] [36]

CoRR , volume =

Yifei Liu and Li Lyna Zhang and Yi Zhu and Bingcheng Dong and Xudong Zhou and Ning Shang and Fan Yang and Mao Yang , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2505.21297 , eprinttype =. 2505.21297 , timestamp =

work page doi:10.48550/arxiv.2505.21297 2025

[37] [37]

CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment

Xue Jiang and Yihong Dong and Mengyang Liu and Hongyi Deng and Tian Wang and Yongding Tao and Rongyu Cao and Binhua Li and Zhi Jin and Wenpin Jiao and Fei Huang and Yongbin Li and Ge Li , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2510.18471 , eprinttype =. 2510.18471 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2510.18471 2025

[38] [38]

CoRR , volume =

Rongao Li and Jie Fu and Bo. CoRR , volume =. 2023 , url =. doi:10.48550/ARXIV.2312.14852 , eprinttype =. 2312.14852 , timestamp =

work page doi:10.48550/arxiv.2312.14852 2023

[39] [39]

2025 , url=

SYNTHETIC-1: Two Million Collaboratively Generated Reasoning Traces from Deepseek-R1 , author=. 2025 , url=

2025

[40] [40]

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Chris Yuhao Liu and Liang Zeng and Yuzhen Xiao and Jujie He and Jiacai Liu and Chaojie Wang and Rui Yan and Wei Shen and Fuxiang Zhang and Jiacheng Xu and Yang Liu and Yahui Zhou , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2507.01352 , eprinttype =. 2507.01352 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.01352 2025

[41] [41]

Phi-4 Technical Report

Marah I Abdin and Jyoti Aneja and Harkirat S. Behl and S. Phi-4 Technical Report , journal =. 2024 , url =. doi:10.48550/ARXIV.2412.08905 , eprinttype =. 2412.08905 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.08905 2024

[42] [42]

Gemma 2: Improving Open Language Models at a Practical Size

Morgane Rivi. Gemma 2: Improving Open Language Models at a Practical Size , journal =. 2024 , url =. doi:10.48550/ARXIV.2408.00118 , eprinttype =. 2408.00118 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2408.00118 2024

[43] [43]

KodCode:

Zhangchen Xu and Yang Liu and Yueqin Yin and Mingyuan Zhou and Radha Poovendran , editor =. KodCode:. Findings of the Association for Computational Linguistics,. 2025 , url =. doi:10.18653/V1/2025.FINDINGS-ACL.365 , timestamp =

work page doi:10.18653/v1/2025.findings-acl.365 2025

[44] [44]

LIMO: Less is More for Reasoning

Yixin Ye and Zhen Huang and Yang Xiao and Ethan Chern and Shijie Xia and Pengfei Liu , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2502.03387 , eprinttype =. 2502.03387 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502.03387 2025

[45] [45]

Forty-first International Conference on Machine Learning,

Zhengyang Tang and Xingxing Zhang and Benyou Wang and Furu Wei , title =. Forty-first International Conference on Machine Learning,. 2024 , url =

2024

[46] [46]

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

Dong Huang and Qingwen Bu and Jie M. Zhang and Michael Luck and Heming Cui , title =. CoRR , volume =. 2023 , url =. doi:10.48550/ARXIV.2312.13010 , eprinttype =. 2312.13010 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.13010 2023

[47] [47]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025

[48] [48]

5-Coder Technical Report , author=

Qwen2. 5-Coder Technical Report , author=. arXiv preprint arXiv:2409.12186 , year=

Pith/arXiv arXiv

[49] [49]

Jimenez and Alexander Wettig and Kilian Lieret and Shunyu Yao and Karthik Narasimhan and Ofir Press , editor =

John Yang and Carlos E. Jimenez and Alexander Wettig and Kilian Lieret and Shunyu Yao and Karthik Narasimhan and Ofir Press , editor =. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering , booktitle =. 2024 , url =

2024

[50] [50]

CodePRM: Execution Feedback-enhanced Process Reward Model for Code Generation , booktitle =

Qingyao Li and Xinyi Dai and Xiangyang Li and Weinan Zhang and Yasheng Wang and Ruiming Tang and Yong Yu , editor =. CodePRM: Execution Feedback-enhanced Process Reward Model for Code Generation , booktitle =. 2025 , url =

2025

[51] [51]

Yue Wang and Hung Le and Akhilesh Deepak Gotmare and Nghi D. Q. Bui and Junnan Li and Steven C. H. Hoi , title =. CoRR , volume =. 2023 , url =. doi:10.48550/ARXIV.2305.07922 , eprinttype =. 2305.07922 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.07922 2023

[52] [52]

doi:10.18653/V1/2021.EMNLP-MAIN.685

Yue Wang and Weishi Wang and Shafiq R. Joty and Steven C. H. Hoi , editor =. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation , booktitle =. 2021 , url =. doi:10.18653/V1/2021.EMNLP-MAIN.685 , timestamp =

work page doi:10.18653/v1/2021.emnlp-main.685 2021

[53] [53]

2024 , eprint=

StarCoder 2 and The Stack v2: The Next Generation , author=. 2024 , eprint=

2024

[54] [54]

2024 , eprint=

Code Llama: Open Foundation Models for Code , author=. 2024 , eprint=

2024

[55] [55]

2024 , eprint=

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence , author=. 2024 , eprint=

2024

[56] [56]

2023 , eprint=

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks , author=. 2023 , eprint=

2023

[57] [57]

2020 , eprint=

CodeSearchNet Challenge: Evaluating the State of Semantic Code Search , author=. 2020 , eprint=

2020

[58] [58]

2025 , eprint=

A Large-scale Class-level Benchmark Dataset for Code Generation with LLMs , author=. 2025 , eprint=

2025

[59] [59]

2025 , eprint=

OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs , author=. 2025 , eprint=

2025

[60] [60]

2025 , eprint=

UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance , author=. 2025 , eprint=

2025

[61] [61]

2022 , eprint=

Training language models to follow instructions with human feedback , author=. 2022 , eprint=

2022

[62] [62]

2017 , eprint=

Proximal Policy Optimization Algorithms , author=. 2017 , eprint=

2017

[63] [63]

2025 , eprint=

Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback , author=. 2025 , eprint=

2025

[64] [64]

2025 , eprint=

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution , author=. 2025 , eprint=

2025

[65] [65]

arXiv preprint arXiv:2502.14382 , year=

S*: Test time scaling for code generation , author=. arXiv preprint arXiv:2502.14382 , year=

arXiv

[66] [66]

CodeT: Code Generation with Generated Tests , author=

[67] [67]

Jackson Petty and Sjoerd van Steenkiste and Tal Linzen , title =. Trans. Mach. Learn. Res. , volume =. 2025 , url =

2025

[68] [68]

arXiv preprint arXiv:2309.16298 , year=

At which training stage does code data help llms reasoning? , author=. arXiv preprint arXiv:2309.16298 , year=

arXiv

[69] [69]

arXiv preprint arXiv:2507.17512 , year=

Can one domain help others? a data-centric study on multi-domain reasoning via reinforcement learning , author=. arXiv preprint arXiv:2507.17512 , year=

arXiv

[70] [70]

The Thirteenth International Conference on Learning Representations,

Yantao Liu and Zijun Yao and Rui Min and Yixin Cao and Lei Hou and Juanzi Li , title =. The Thirteenth International Conference on Learning Representations,. 2025 , url =

2025

[71] [71]

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Maggie Huan and Yuetai Li and Tuney Zheng and Xiaoyu Xu and Seungone Kim and Minxin Du and Radha Poovendran and Graham Neubig and Xiang Yue , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2507.00432 , eprinttype =. 2507.00432 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.00432 2025

[72] [73]

2025 IEEE/CVF International Conference on Computer Vision (ICCV) , year=

VSP: Diagnosing the Dual Challenges of Perception and Reasoning in Spatial Planning Tasks for MLLMS , author=. 2025 IEEE/CVF International Conference on Computer Vision (ICCV) , year=

2025

[73] [74]

ArXiv , year=

WebArena: A Realistic Web Environment for Building Autonomous Agents , author=. ArXiv , year=

[74] [75]

International Conference on Machine Learning , year=

CLUTR: Curriculum Learning via Unsupervised Task Representation Learning , author=. International Conference on Machine Learning , year=

[75] [76]

ArXiv , year=

SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data , author=. ArXiv , year=

[76] [77]

CoRR , volume =

Kimi Team , title =. CoRR , volume =. 2026 , url =

2026

[77] [78]

ArXiv , year=

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. ArXiv , year=

[78] [79]

CoRR , year =

Qwen Team , title =. CoRR , year =

[79] [81]

Proceedings of the 26th annual international conference on machine learning , pages=

Curriculum learning , author=. Proceedings of the 26th annual international conference on machine learning , pages=

[80] [82]

international conference on machine learning , pages=

Automated curriculum learning for neural networks , author=. international conference on machine learning , pages=. 2017 , organization=

2017