DecompRL: Solving Harder Problems by Learning Modular Code Generation

Fabian Gloeckle; Francis Bach; Gabriel Synnaeve; Juliette Decugis; Taco Cohen

arxiv: 2607.02390 · v1 · pith:YI7EMO4Vnew · submitted 2026-07-02 · 💻 cs.LG


Pith reviewed 2026-07-03 16:28 UTC · model grok-4.3

classification 💻 cs.LG

keywords: reinforcement learning · code generation · modular decomposition · large language models · hierarchical structures · test-time compute


The pith

DecompRL trains models to decompose code problems into modules so their implementations can be recombined into exponentially more solutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When a language model's chance of generating a correct solution is near zero, neither repeated sampling nor standard reinforcement learning helps: both stay trapped in the same low-probability region. DecompRL instead teaches the model to split a problem into smaller, independent sub-functions, produce multiple implementations of each, and recombine the pieces. Recombination turns n modules with k variants each into up to k^n candidate programs that can be checked on cheap CPU hardware. According to the abstract, this cuts GPU token cost by roughly 50x, and once budgets exceed 10^5 tokens per problem the method outperforms standard and diversity-optimized RL baselines on LiveCodeBench and CodeContests, solving problems that ordinary generation cannot reach.
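
The recombination step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `recombine` and `compose` helpers, the module names `parse` and `solve`, and the toy problem are all hypothetical, and modules are plain Python callables rather than model-generated source code.

```python
from itertools import product

def recombine(variants, compose, tests):
    """Try every combination of module variants; return those passing all tests.

    variants: dict mapping module name -> list of candidate implementations
    compose:  builds a full program from one implementation per module
    tests:    list of (input, expected_output) pairs
    """
    names = list(variants)
    passing = []
    for combo in product(*(variants[name] for name in names)):
        program = compose(dict(zip(names, combo)))
        try:
            if all(program(x) == y for x, y in tests):
                passing.append(combo)
        except Exception:
            # A crashing candidate simply counts as a failed candidate.
            continue
    return passing

# Toy problem (hypothetical): sum the integers in a comma-separated string.
variants = {
    "parse": [
        lambda s: s.split(","),                     # wrong: leaves strings
        lambda s: [int(t) for t in s.split(",")],   # correct
    ],
    "solve": [
        lambda xs: max(xs),                         # wrong
        lambda xs: sum(xs),                         # correct
    ],
}

def compose(mods):
    # Full program = solve(parse(input)).
    return lambda s: mods["solve"](mods["parse"](s))

tests = [("1,2,3", 6), ("10,-4", 6)]
passing = recombine(variants, compose, tests)
print(len(passing), "of", 2 ** 2, "candidates pass")
```

Here n = 2 modules with k = 2 variants each give 4 candidates from only 4 generated pieces. The enumeration and test execution use no model calls, which is the sense in which the search moves from GPU to CPU.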

Core claim

DecompRL is an RL algorithm that explicitly learns to decompose problems into hierarchical code structures and implement them as independent modules. Recombining k implementations of n modules produces up to k^n candidate solutions, moving the search bottleneck from expensive GPU sampling to inexpensive CPU evaluation and lowering token cost by about fifty times. On LiveCodeBench and CodeContests the method outperforms both standard and diversity-optimized RL baselines beyond 10^5 tokens per problem with Qwen 2.5 7B and Code World Model 32B.
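
To make the GPU-versus-CPU trade concrete, here is a back-of-envelope comparison. It is a sketch under made-up numbers: the token counts per module and per program, the values of n and k, and both helper functions are assumptions for illustration. It does not reproduce how the paper measures its roughly 50x figure.

```python
def gpu_tokens_direct(num_candidates, tokens_per_program):
    """GPU tokens needed to sample every candidate program directly."""
    return num_candidates * tokens_per_program

def gpu_tokens_modular(n_modules, k_variants, tokens_per_module,
                       tokens_for_decomposition=0):
    """GPU tokens needed to sample k variants of each of n modules."""
    return tokens_for_decomposition + n_modules * k_variants * tokens_per_module

# Illustrative values only (assumptions, not taken from the paper).
n, k = 3, 8
tokens_per_module = 200
tokens_per_program = 600

candidates = k ** n                                        # programs reachable by recombination
modular = gpu_tokens_modular(n, k, tokens_per_module)      # tokens for n * k module samples
direct = gpu_tokens_direct(candidates, tokens_per_program) # tokens to sample as many programs
ratio = direct / modular

print(candidates, modular, direct, ratio)
```

The modular cost grows as n * k while the number of candidates grows as k^n, so the gap widens quickly as either n or k increases. Many recombined candidates may of course be incorrect or redundant, so candidate count is only an upper bound on useful search.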

What carries the argument

Decomposition of a problem into independently solvable sub-functions whose separate implementations are recombined into full solutions, with the decomposition policy itself learned by reinforcement learning.

If this is right

Recombination of k implementations across n modules yields up to k^n candidates at CPU cost.
GPU token cost drops by roughly fifty times compared with direct sampling.
Models reach correct solutions on problems where base-policy probability is near zero.
Performance gains appear on LiveCodeBench and CodeContests once token budgets exceed 10^5 per problem.
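
The claim about near-zero base-policy probability rests on simple arithmetic about repeated sampling. As a rough sketch, assume every attempt is an independent draw that succeeds with the same probability p (an idealization, not a claim from the paper); the function names below are hypothetical.

```python
import math

def prob_at_least_one_success(p, attempts):
    """Chance that at least one of `attempts` independent samples is correct."""
    return 1.0 - (1.0 - p) ** attempts

def attempts_needed(p, target=0.5):
    """Smallest number of independent samples reaching `target` success probability."""
    return math.ceil(math.log(1.0 - target) / math.log1p(-p))

p = 1e-6  # illustrative near-zero per-sample success rate
print(prob_at_least_one_success(p, 1_000))  # still far below 1%
print(attempts_needed(p))                   # hundreds of thousands of samples
```

Under this model the required number of samples scales roughly as 1/p, and each sample costs GPU tokens, which is why direct sampling stalls on the hardest problems.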

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition-plus-recombination pattern could be tested on domains such as mathematical proofs if they contain clear sub-problems.
Increasing the depth or number of modules might further enlarge the effective search space without extra GPU sampling.
Training explicitly for modularity may prove more sample-efficient than scaling test-time compute alone.

Load-bearing premise

Problems admit decompositions into sub-functions that can be solved and implemented independently and then recombined into a correct full solution.
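
A back-of-envelope calculation shows why this premise carries so much weight. The model below is an illustrative idealization, not something stated in the paper: each sampled module implementation is correct with probability q, independently of all others, and a full program is correct exactly when all n of its modules are. Both strategies are given the same number of module-sized generations, k · n.

```latex
% Assumptions (illustrative, not from the paper):
%   q = probability that one sampled implementation of a module is correct,
%       independent across modules and samples;
%   a full program is correct iff all n modules are correct.
\begin{align*}
P_{\text{modular}} &= \bigl(1 - (1 - q)^{k}\bigr)^{n}
  && \text{($k$ samples per module, recombined)} \\
P_{\text{direct}}  &= 1 - \bigl(1 - q^{n}\bigr)^{k}
  && \text{($k$ full programs sampled directly)} \\[4pt]
q = 0.1,\ n = 3,\ k = 10:\quad
P_{\text{modular}} &\approx 0.276, \qquad P_{\text{direct}} \approx 0.010
\end{align*}
```

If the modules are not independently solvable, for example when one module's correct behavior depends on the choices made in another, the first formula no longer applies and the advantage can shrink or vanish.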

What would settle it

A benchmark run in which modular recombination produces no additional correct solutions even after the RL stage has converged and the number of module variants is increased.

Original abstract

How can Large Language Models (LLMs) solve problems they currently cannot? Repeated sampling scales test-time compute but GPU cost grows linearly with attempts, while reinforcement learning (RL) with verifiable rewards improves single-attempt accuracy at the expense of sample diversity. Both strategies ultimately fail when the base policy has near-zero probability of producing a correct solution: no amount of sampling or gradient signal can overcome a search space that is simply too large. We take a different approach: rather than sampling harder, we make the task easier by decomposing problems into smaller, independently solvable sub-functions whose implementations can be recombined. Since off-the-shelf models are not trained for this modular generation, we introduce DecompRL, an RL algorithm that explicitly learns to decompose and implement hierarchical code structures. Recombining $k$ implementations of $n$ modules yields up to $k^{n}$ candidate solutions, shifting the bottleneck from GPU inference to cheap CPU evaluation and cutting GPU token cost by $\sim 50\times$. On LiveCodeBench and CodeContests (Qwen 2.5 7B, Code World Model 32B), DecompRL outperforms standard and diversity-optimized RL baselines beyond $10^5$ tokens per problem, solving problems that standard generation cannot reach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DecompRL uses RL to learn modular decompositions in code, then recombines them for cheap exponential search that beats standard RL at high token budgets.


DecompRL trains models via RL to break coding problems into hierarchical modules whose separate implementations can be recombined into k^n candidates. The payoff is moving most of the search off GPU inference onto cheap CPU evaluation, which the abstract puts at roughly 50x lower token cost.

The results are the strongest part. On LiveCodeBench and CodeContests the method beats both standard RL and diversity-optimized baselines once budgets exceed 10^5 tokens per problem, and it solves instances the base models cannot reach with ordinary sampling. The stress-test note confirms the full paper supplies ablations, scaling plots, and standard RL training details with no internal contradictions or missing controls that would undermine the token-cost comparison.

The central assumption is that the target problems admit clean decompositions into independently solvable sub-functions. The reported gains suggest this holds for the benchmarks used, but it will not be true for every problem. That is a scope limitation rather than a flaw in the execution.

The work is aimed at researchers focused on test-time scaling for LLM code generation. Anyone already running RL on verifiable rewards or thinking about structured search will find the recombination step concrete and worth checking.

Send it to peer review. The empirical grounding and the absence of load-bearing gaps make it worth referee time.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces DecompRL, an RL algorithm that trains LLMs to decompose coding problems into modular sub-functions whose separate implementations are recombined to yield up to k^n candidate solutions. This shifts the search bottleneck from GPU inference to CPU evaluation. On LiveCodeBench and CodeContests with Qwen 2.5 7B and Code World Model 32B, DecompRL outperforms standard and diversity-optimized RL baselines beyond 10^5 tokens per problem while solving instances unreachable by direct generation.

Significance. If the results hold, the work demonstrates a practical route to scaling beyond the limits of repeated sampling and standard RL by changing problem structure via learned modularity. Strengths include the use of verifiable rewards, explicit CPU-side enumeration, ablations, and scaling plots that directly support the token-cost and performance claims.

minor comments (3)

[Abstract] The statement that DecompRL "solves problems that standard generation cannot reach" would be strengthened by a brief quantitative note on how many such problems were solved and the exact token threshold at which the crossover occurs.
[§5, Experiments] The recombination mechanics and k^n enumeration are described clearly, but the text could add a short paragraph confirming that sub-function independence was verified post hoc on the solved instances rather than assumed.
[Figure 3, caption] The scaling curves compare methods at fixed token budgets, but the legend should state explicitly whether the DecompRL training cost is amortized or excluded from the per-problem token count.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the method's strengths (verifiable rewards, CPU-side enumeration, ablations, and scaling plots), and recommendation of minor revision. No major comments appear in the report.

Circularity Check

0 steps flagged

No significant circularity


The paper introduces DecompRL as an empirical RL method for learning modular decompositions in code generation, evaluated on external benchmarks like LiveCodeBench and CodeContests with reported outperformance and ablations. No equations, derivations, or first-principles claims are present that reduce predictions or results to fitted inputs by construction, self-definitional loops, or load-bearing self-citations. The central claims rest on verifiable reward signals, recombination mechanics, and scaling experiments that are externally falsifiable and do not rely on internal redefinitions or imported uniqueness theorems from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly relies on the existence of useful modular decompositions but does not introduce new entities or fit parameters in the provided text.

pith-pipeline@v0.9.1-grok · 5765 in / 1126 out tokens · 24163 ms · 2026-07-03T16:28:38.151212+00:00 · methodology


Reference graph

Works this paper leans on

65 extracted references · 36 canonical work pages · 12 internal anchors

[1]

The Llama 3 Herd of Models , 2024

Llama Team AI @ Meta. The Llama 3 Herd of Models , 2024

2024
[2]

Albrecht, Filippos Christianos, and Lukas Sch\"afer

Stefano V. Albrecht, Filippos Christianos, and Lukas Sch\"afer. Multi-Agent Reinforcement Learning: Foundations and Modern Approaches. MIT Press, 2024. https://www.marl-book.com

2024
[3]

Thinking fast and slow with deep learning and tree search

Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. Advances in neural information processing systems, 30, 2017

2017
[4]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher R \'e , and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Divide-and-conquer meets consensus: Unleashing the power of functions in code generation, 2024

Jingchang Chen, Hongxuan Tang, Zheng Chu, Qianglong Chen, Zekun Wang, Ming Liu, and Bing Qin. Divide-and-conquer meets consensus: Unleashing the power of functions in code generation, 2024. https://arxiv.org/abs/2405.20092

work page arXiv 2024
[6]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models

Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751, 2025

work page arXiv 2025
[8]

Deep reinforcement learning from human preferences

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences, 2023. https://arxiv.org/abs/1706.03741

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Soft policy optimization: Online off-policy rl for sequence models

Taco Cohen, David W Zhang, Kunhao Zheng, Yunhao Tang, Remi Munos, and Gabriel Synnaeve. Soft policy optimization: Online off-policy rl for sequence models. arXiv preprint arXiv:2503.05453, 2025

work page arXiv 2025
[10]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models, 2025. https://arxiv.org/abs/2505.22617

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Gregory, and Norman J

Dominique de Caen, David A. Gregory, and Norman J. Pullman. The boolean rank of zero-one matrices, 1981

1981
[13]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. https://arxiv.org/abs/2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Stp: Self-play llm theorem provers with iterative conjecturing and proving, 2025

Kefan Dong and Tengyu Ma. Stp: Self-play llm theorem provers with iterative conjecturing and proving, 2025. https://arxiv.org/abs/2502.00212

work page arXiv 2025
[15]

Dreamcoder: growing generalizable, interpretable knowledge with wake--sleep bayesian program learning

Kevin Ellis, Lionel Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lore Anaya Pozo, Luke Hewitt, Armando Solar-Lezama, and Joshua B Tenenbaum. Dreamcoder: growing generalizable, interpretable knowledge with wake--sleep bayesian program learning. Philosophical Transactions of the Royal Society A, 381 0 (2251): 0 20220050, 2023

2023
[16]

FAIR CodeGen team , :, Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, David Zhang, Kunhao Zheng, Jordi Armengol-Estapé, Pedram Bashiri, Maximilian Beck, Pierre Chambon, Abhishek Charnalia, Chris Cummins, Juliette Decugis, Zacharias V. Fisches, François Fleuret, F...

work page arXiv 2025
[17]

Alphazero-like tree-search can guide large language model decoding and training, 2024

Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large language model decoding and training, 2024. https://arxiv.org/abs/2309.17179

work page arXiv 2024
[18]

Counterfactual multi-agent policy gradients, 2024

Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients, 2024. https://arxiv.org/abs/1705.08926

work page arXiv 2024
[19]

Computers and intractability, volume 29

Michael R Garey and David S Johnson. Computers and intractability, volume 29. wh freeman New York, 2002

2002
[20]

Alien coding

Thibault Gauthier, Miroslav Ol s \'a k, and Josef Urban. Alien coding. International Journal of Approximate Reasoning, 162: 0 109009, 2023

2023
[21]

Rlef: Grounding code llms in execution feedback with reinforcement learning, 2025

Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Quentin Carbonneaux, Taco Cohen, and Gabriel Synnaeve. Rlef: Grounding code llms in execution feedback with reinforcement learning, 2025. https://arxiv.org/abs/2410.02089

work page arXiv 2025
[22]

Symbolic regression with a learned concept library

Arya Grayeli, Atharva Sehgal, Omar Costilla Reyes, Miles Cranmer, and Swarat Chaudhuri. Symbolic regression with a learned concept library. Advances in Neural Information Processing Systems, 37: 0 44678--44709, 2024

2024
[23]

Reinforced Self-Training (ReST) for Language Modeling

Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Peano: Learning formal mathematical reasoning, 2024

Gabriel Haller, Talia Ringer, Jason Rute, and Brando Miranda. Peano: Learning formal mathematical reasoning, 2024. https://arxiv.org/abs/2405.06738

work page arXiv 2024
[25]

Language models can teach themselves to program better

Patrick Haluptzok, Matthew Bowers, and Adam Tauman Kalai. Language models can teach themselves to program better. arXiv preprint arXiv:2207.14502, 2022

work page arXiv 2022
[26]

Openllm-rtl: Open dataset and benchmark for llm-aided design of digital circuits, 2024

Chia-Tung Ho, Yikang Shen, Jingyu Pan, Chao Fang, Hao Liu, Tianyu Liu, and Zhiru Zhang. Openllm-rtl: Open dataset and benchmark for llm-aided design of digital circuits, 2024. https://arxiv.org/abs/2407.14326

work page arXiv 2024
[27]

Best-of-n jailbreaking, 2024

John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, and Mrinank Sharma. Best-of-n jailbreaking, 2024. https://arxiv.org/abs/2412.03556

work page arXiv 2024
[28]

Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez

Md. Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. Mapcoder: Multi-agent code generation for competitive problem solving, 2024. https://arxiv.org/abs/2405.11403

work page arXiv 2024
[29]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024 a

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Gonzalez, Koushik Sen, and Ion Stoica

Naman Jain, Tianjun Zhang, Wei - Lin Chiang, Joseph E. Gonzalez, Koushik Sen, and Ion Stoica. Llm-assisted code cleaning for training accurate code generators. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024 b . https://openreview.net/forum?id=maRYffiUpI

2024
[31]

Decomposed prompting: A modular approach for solving complex tasks

Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net, 2023. https://openreview.net/forum?id=\_nGgzQjzaRy

2023
[32]

Adam: A method for stochastic optimization

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015

2015
[33]

Hypertree proof search for neural theorem proving

Guillaume Lample, Marie-Anne Lachaux, Thibaut Lavril, Xavier Martinet, Amaury Hayat, Gabriel Ebner, Aurélien Rodriguez, and Timothée Lacroix. Hypertree proof search for neural theorem proving. arXiv preprint arXiv:2205.11491, 2022. https://doi.org/10.48550/arXiv.2205.11491

work page doi:10.48550/arxiv.2205.11491 2022
[34]

Codechain: Towards modular code generation through chain of self-revisions with representative sub-modules

Hung Le, Hailin Chen, Amrita Saha, Akash Gokul, Doyen Sahoo, and Shafiq Joty. Codechain: Towards modular code generation through chain of self-revisions with representative sub-modules. arXiv preprint arXiv:2310.08992, 2023

work page arXiv 2023
[35]

Taco: Topics in algorithmic code generation dataset, 2023

Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. Taco: Topics in algorithmic code generation dataset, 2023. https://arxiv.org/abs/2312.14852

work page arXiv 2023
[36]

Competition-level code generation with alphacode

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, R \'e mi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. Science, 378 0 (6624): 0 1092--1097, 2022

2022
[37]

SFS : Smarter code space search improves LLM inference scaling

Jonathan Light, Yue Wu, Yiyou Sun, Wenchao Yu, Yanchi Liu, Xujiang Zhao, Ziniu Hu, Haifeng Chen, and Wei Cheng. SFS : Smarter code space search improves LLM inference scaling. In The Thirteenth International Conference on Learning Representations, 2025. https://openreview.net/forum?id=MCHuGOkExF

2025
[38]

Goedel-prover: A frontier model for open-source automated theorem proving, 2025

Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover: A frontier model for open-source automated theorem proving, 2025. https://arxiv.org/abs/2502.07640

work page arXiv 2025
[39]

Decoupled weight decay regularization, 2019

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019

2019
[40]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36: 0 46534--46594, 2023

2023
[41]

Playing Atari with Deep Reinforcement Learning

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[42]

Asynchronous rlhf: Faster and more efficient off-policy rl for language models, 2025

Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, and Aaron Courville. Asynchronous rlhf: Faster and more efficient off-policy rl for language models, 2025. https://arxiv.org/abs/2410.18252

work page arXiv 2025
[43]

Gpt-4 technical report, 2023

OpenAI. Gpt-4 technical report, 2023

2023
[44]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 0 27730--27744, 2022

2022
[45]

Learning formal mathematics from intrinsic motivation

Gabriel Poesia, David Broman, Nick Haber, and Noah Goodman. Learning formal mathematics from intrinsic motivation. Advances in Neural Information Processing Systems, 37: 0 43032--43057, 2024

2024
[46]

Formal mathematics statement curriculum learning, 2022

Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya Sutskever. Formal mathematics statement curriculum learning, 2022

2022
[47]

Qwen2.5 Technical Report

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Learn to reason efficiently with adaptive length-based reward shaping, 2025

Srishti Rastogi, Yijia Shao, Rohan Padhye, and Diyi Yang. Learn to reason efficiently with adaptive length-based reward shaping, 2025. https://arxiv.org/abs/2504.01191

work page arXiv 2025
[49]

Rosipal and M

R. Rosipal and M. Girolami. An expectation-maximization approach to nonlinear component analysis. Neural Computation, 13: 0 505--510, 2001

2001
[50]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. https://arxiv.org/abs/1707.06347

work page internal anchor Pith review Pith/arXiv arXiv 2017
[51]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. https://arxiv.org/abs/2402.03300

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

From code to correctness: Closing the last mile of code generation with hierarchical debugging

Yuling Shi, Songsong Wang, Chengcheng Wan, and Xiaodong Gu. From code to correctness: Closing the last mile of code generation with hierarchical debugging. arXiv preprint arXiv:2410.01215, 2024

work page arXiv 2024
[53]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36: 0 8634--8652, 2023

2023
[54]

Sutton and A

R. Sutton and A. Barto. Reinforcement learning: An introduction. MIT Press, 1998

1998
[55]

Optimizing language models for inference time objectives using reinforcement learning

Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, and R \'e mi Munos. Optimizing language models for inference time objectives using reinforcement learning. arXiv preprint arXiv:2503.19595, 2025

work page arXiv 2025
[56]

Codeplay: Autotelic learning through collaborative self-play in programming environments

Laetitia Teodorescu, C \'e dric Colas, Matthew Bowers, Thomas Carta, and Pierre-Yves Oudeyer. Codeplay: Autotelic learning through collaborative self-play in programming environments. In IMOL 2023-Intrinsically Motivated Open-ended Learning workshop at NeurIPS 2023, 2023

2023
[57]

A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998

1998
[58]

Hendryx, Summer Yue, and Hugh Zhang

Evan Z Wang, Federico Cassano, Catherine Wu, Yunfeng Bai, William Song, Vaskar Nath, Ziwen Han, Sean M. Hendryx, Summer Yue, and Hugh Zhang. Planning in natural language improves LLM search for code generation. In The Thirteenth International Conference on Learning Representations, 2025. https://openreview.net/forum?id=48WAZhwHHw

2025
[59]

Williams

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8 0 (3–4): 0 229–256, May 1992. ISSN 0885-6125. doi:10.1007/BF00992696. https://doi.org/10.1007/BF00992696

work page doi:10.1007/bf00992696 1992
[60]

Tree of thoughts: Deliberate problem solving with large language models

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural In...

2023
[61]

Goodman, and Yuhuai Tony Wu

Eric Zelikman, Jesse Mu, Noah D. Goodman, and Yuhuai Tony Wu. Star: Self-taught reasoner bootstrapping reasoning with reasoning. 2022

2022
[62]

Parsel: A (de-) compositional framework for algorithmic reasoning with language models

Eric Zelikman, Qian Huang, Gabriel Poesia, Noah D Goodman, and Nick Haber. Parsel: A (de-) compositional framework for algorithmic reasoning with language models. arXiv preprint arXiv:2212.10561, 2023

work page arXiv 2023
[63]

Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D Goodman. Quiet-star: Language models can teach themselves to think before speaking. arXiv preprint arXiv:2403.09629, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[64]

Rest-mcts*: Llm self-training via process reward guided tree search

Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. Rest-mcts*: Llm self-training via process reward guided tree search. Advances in Neural Information Processing Systems, 37: 0 64735--64772, 2024

2024
[65]

Le, and Ed H

Denny Zhou, Nathanael Sch \" a rli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview....

2023
[66]

Le, Ed H

Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou, Swaroop Mishra, and Huaixiu Steven Zheng. Self-discover: Large language models self-compose reasoning structures, 2024. https://arxiv.org/abs/2402.03620

work page arXiv 2024

[1] [1]

The Llama 3 Herd of Models , 2024

Llama Team AI @ Meta. The Llama 3 Herd of Models , 2024

2024

[2] [2]

Albrecht, Filippos Christianos, and Lukas Sch\"afer

Stefano V. Albrecht, Filippos Christianos, and Lukas Sch\"afer. Multi-Agent Reinforcement Learning: Foundations and Modern Approaches. MIT Press, 2024. https://www.marl-book.com

2024

[3] [3]

Thinking fast and slow with deep learning and tree search

Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. Advances in neural information processing systems, 30, 2017

2017

[4] [4]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher R \'e , and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Divide-and-conquer meets consensus: Unleashing the power of functions in code generation, 2024

Jingchang Chen, Hongxuan Tang, Zheng Chu, Qianglong Chen, Zekun Wang, Ming Liu, and Bing Qin. Divide-and-conquer meets consensus: Unleashing the power of functions in code generation, 2024. https://arxiv.org/abs/2405.20092

work page arXiv 2024

[6] [6]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[7] [7]

Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models

Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751, 2025

work page arXiv 2025

[8] [8]

Deep reinforcement learning from human preferences

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences, 2023. https://arxiv.org/abs/1706.03741

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] Taco Cohen, David W Zhang, Kunhao Zheng, Yunhao Tang, Remi Munos, and Gabriel Synnaeve. Soft policy optimization: Online off-policy rl for sequence models. arXiv preprint arXiv:2503.05453, 2025

[10] Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models, 2025. https://arxiv.org/abs/2505.22617

[11] Dominique de Caen, David A. Gregory, and Norman J. Pullman. The boolean rank of zero-one matrices, 1981

[13] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. https://arxiv.org/abs/2501.12948

[14] Kefan Dong and Tengyu Ma. Stp: Self-play llm theorem provers with iterative conjecturing and proving, 2025. https://arxiv.org/abs/2502.00212

[15] Kevin Ellis, Lionel Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lore Anaya Pozo, Luke Hewitt, Armando Solar-Lezama, and Joshua B Tenenbaum. Dreamcoder: growing generalizable, interpretable knowledge with wake-sleep bayesian program learning. Philosophical Transactions of the Royal Society A, 381(2251):20220050, 2023

[16] FAIR CodeGen team, Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, David Zhang, Kunhao Zheng, Jordi Armengol-Estapé, Pedram Bashiri, Maximilian Beck, Pierre Chambon, Abhishek Charnalia, Chris Cummins, Juliette Decugis, Zacharias V. Fisches, François Fleuret, F...

[17] Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large language model decoding and training, 2024. https://arxiv.org/abs/2309.17179

[18] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients, 2024. https://arxiv.org/abs/1705.08926

[19] Michael R Garey and David S Johnson. Computers and intractability, volume 29. W. H. Freeman, New York, 2002

[20] Thibault Gauthier, Miroslav Olšák, and Josef Urban. Alien coding. International Journal of Approximate Reasoning, 162:109009, 2023

[21] Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Quentin Carbonneaux, Taco Cohen, and Gabriel Synnaeve. Rlef: Grounding code llms in execution feedback with reinforcement learning, 2025. https://arxiv.org/abs/2410.02089

[22] Arya Grayeli, Atharva Sehgal, Omar Costilla Reyes, Miles Cranmer, and Swarat Chaudhuri. Symbolic regression with a learned concept library. Advances in Neural Information Processing Systems, 37:44678–44709, 2024

[23] Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998, 2023

[24] Gabriel Haller, Talia Ringer, Jason Rute, and Brando Miranda. Peano: Learning formal mathematical reasoning, 2024. https://arxiv.org/abs/2405.06738

[25] Patrick Haluptzok, Matthew Bowers, and Adam Tauman Kalai. Language models can teach themselves to program better. arXiv preprint arXiv:2207.14502, 2022

[26] Chia-Tung Ho, Yikang Shen, Jingyu Pan, Chao Fang, Hao Liu, Tianyu Liu, and Zhiru Zhang. Openllm-rtl: Open dataset and benchmark for llm-aided design of digital circuits, 2024. https://arxiv.org/abs/2407.14326

[27] John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, and Mrinank Sharma. Best-of-n jailbreaking, 2024. https://arxiv.org/abs/2412.03556

[28] Md. Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. Mapcoder: Multi-agent code generation for competitive problem solving, 2024. https://arxiv.org/abs/2405.11403

[29] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024a

[30] Naman Jain, Tianjun Zhang, Wei-Lin Chiang, Joseph E. Gonzalez, Koushik Sen, and Ion Stoica. Llm-assisted code cleaning for training accurate code generators. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024b. https://openreview.net/forum?id=maRYffiUpI

[31] Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. https://openreview.net/forum?id=_nGgzQjzaRy

[32] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015

[33] Guillaume Lample, Marie-Anne Lachaux, Thibaut Lavril, Xavier Martinet, Amaury Hayat, Gabriel Ebner, Aurélien Rodriguez, and Timothée Lacroix. Hypertree proof search for neural theorem proving. arXiv preprint arXiv:2205.11491, 2022. https://doi.org/10.48550/arXiv.2205.11491

[34] Hung Le, Hailin Chen, Amrita Saha, Akash Gokul, Doyen Sahoo, and Shafiq Joty. Codechain: Towards modular code generation through chain of self-revisions with representative sub-modules. arXiv preprint arXiv:2310.08992, 2023

[35] Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. Taco: Topics in algorithmic code generation dataset, 2023. https://arxiv.org/abs/2312.14852

[36] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022

[37] Jonathan Light, Yue Wu, Yiyou Sun, Wenchao Yu, Yanchi Liu, Xujiang Zhao, Ziniu Hu, Haifeng Chen, and Wei Cheng. SFS: Smarter code space search improves LLM inference scaling. In The Thirteenth International Conference on Learning Representations, 2025. https://openreview.net/forum?id=MCHuGOkExF

[38] Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover: A frontier model for open-source automated theorem proving, 2025. https://arxiv.org/abs/2502.07640

[39] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019

[40] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023

[41] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013

[42] Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, and Aaron Courville. Asynchronous rlhf: Faster and more efficient off-policy rl for language models, 2025. https://arxiv.org/abs/2410.18252

[43] OpenAI. Gpt-4 technical report, 2023

[44] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

[45] Gabriel Poesia, David Broman, Nick Haber, and Noah Goodman. Learning formal mathematics from intrinsic motivation. Advances in Neural Information Processing Systems, 37:43032–43057, 2024

[46] Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya Sutskever. Formal mathematics statement curriculum learning, 2022

[47] Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

[48] Srishti Rastogi, Yijia Shao, Rohan Padhye, and Diyi Yang. Learn to reason efficiently with adaptive length-based reward shaping, 2025. https://arxiv.org/abs/2504.01191

[49] R. Rosipal and M. Girolami. An expectation-maximization approach to nonlinear component analysis. Neural Computation, 13:505–510, 2001

[50] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. https://arxiv.org/abs/1707.06347

[51] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. https://arxiv.org/abs/2402.03300

[52] Yuling Shi, Songsong Wang, Chengcheng Wan, and Xiaodong Gu. From code to correctness: Closing the last mile of code generation with hierarchical debugging. arXiv preprint arXiv:2410.01215, 2024

[53] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023

[54] R. Sutton and A. Barto. Reinforcement learning: An introduction. MIT Press, 1998

[55] Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, and Rémi Munos. Optimizing language models for inference time objectives using reinforcement learning. arXiv preprint arXiv:2503.19595, 2025

[56] Laetitia Teodorescu, Cédric Colas, Matthew Bowers, Thomas Carta, and Pierre-Yves Oudeyer. Codeplay: Autotelic learning through collaborative self-play in programming environments. In IMOL 2023-Intrinsically Motivated Open-ended Learning workshop at NeurIPS 2023, 2023

[57] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998

[58] Evan Z Wang, Federico Cassano, Catherine Wu, Yunfeng Bai, William Song, Vaskar Nath, Ziwen Han, Sean M. Hendryx, Summer Yue, and Hugh Zhang. Planning in natural language improves LLM search for code generation. In The Thirteenth International Conference on Learning Representations, 2025. https://openreview.net/forum?id=48WAZhwHHw

[59] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8(3–4):229–256, May 1992. ISSN 0885-6125. doi:10.1007/BF00992696. https://doi.org/10.1007/BF00992696

[60] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural In...

[61] Eric Zelikman, Jesse Mu, Noah D. Goodman, and Yuhuai Tony Wu. Star: Self-taught reasoner bootstrapping reasoning with reasoning, 2022

[62] Eric Zelikman, Qian Huang, Gabriel Poesia, Noah D Goodman, and Nick Haber. Parsel: A (de-)compositional framework for algorithmic reasoning with language models. arXiv preprint arXiv:2212.10561, 2023

[63] Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D Goodman. Quiet-star: Language models can teach themselves to think before speaking. arXiv preprint arXiv:2403.09629, 2024

[64] Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. Rest-mcts*: Llm self-training via process reward guided tree search. Advances in Neural Information Processing Systems, 37:64735–64772, 2024

[65] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview....

[66] Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou, Swaroop Mishra, and Huaixiu Steven Zheng. Self-discover: Large language models self-compose reasoning structures, 2024. https://arxiv.org/abs/2402.03620