DecompRL: Solving Harder Problems by Learning Modular Code Generation
Pith reviewed 2026-07-03 16:28 UTC · model grok-4.3
The pith
DecompRL trains models to decompose code problems into modules whose implementations can be recombined into exponentially many candidate solutions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DecompRL is an RL algorithm that explicitly learns to decompose problems into hierarchical code structures and to implement each part as an independent module. Recombining k implementations of each of n modules produces up to k^n candidate solutions, which moves the search bottleneck from expensive GPU sampling to inexpensive CPU evaluation and cuts GPU token cost roughly 50-fold. On LiveCodeBench and CodeContests, with Qwen 2.5 7B and Code World Model 32B, the method outperforms both standard and diversity-optimized RL baselines once the budget exceeds 10^5 tokens per problem.
What carries the argument
Decomposition of a problem into independently solvable sub-functions whose separate implementations are recombined into full solutions, with the decomposition policy itself learned by reinforcement learning.
If this is right
- Recombining k implementations of each of n modules yields up to k^n candidates, which are evaluated on CPU rather than sampled on GPU.
- GPU token cost drops by roughly a factor of fifty compared with direct sampling.
- Models reach correct solutions on problems where base-policy probability is near zero.
- Performance gains appear on LiveCodeBench and CodeContests once token budgets exceed 10^5 per problem.
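The recombination step in the first bullet can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy task, the function names, and the use of a linear chain of modules instead of the hierarchical structures the paper describes. It is not the authors' implementation; it only shows how k = 2 implementations of each of n = 3 modules expand into 2^3 = 8 candidates that are filtered by tests on CPU.

```python
from itertools import product

# Hypothetical toy task: sum of squares of the even numbers in a list.
# Each module has k = 2 sampled implementations; the "_b" variants are
# deliberately buggy to stand in for imperfect model samples.

def keep_even_a(xs): return [x for x in xs if x % 2 == 0]
def keep_even_b(xs): return [x for x in xs if x % 2 == 1]  # wrong parity

def square_a(xs): return [x * x for x in xs]
def square_b(xs): return [x * 2 for x in xs]  # doubles instead of squaring

def total_a(xs): return sum(xs)
def total_b(xs): return max(xs, default=0)  # takes the max instead of the sum

MODULES = [
    [keep_even_a, keep_even_b],
    [square_a, square_b],
    [total_a, total_b],
]

# (input, expected output) pairs playing the role of verifiable tests.
TESTS = [([1, 2, 3, 4], 20), ([], 0), ([5, 7], 0), ([6], 36)]

def compose(choice):
    """Chain one implementation per module into a full candidate solution."""
    def solution(xs):
        for module in choice:
            xs = module(xs)
        return xs
    return solution

def passes(solution, tests):
    """Run a candidate against every test; any exception counts as failure."""
    try:
        return all(solution(list(inp)) == out for inp, out in tests)
    except Exception:
        return False

def recombine(modules, tests):
    """Enumerate the Cartesian product of implementations and keep the passing ones."""
    candidates = list(product(*modules))
    passing = [c for c in candidates if passes(compose(c), tests)]
    return len(candidates), passing

n_candidates, passing = recombine(MODULES, TESTS)
print(n_candidates, [tuple(f.__name__ for f in c) for c in passing])
```

Only six module implementations are generated here, yet eight full candidates are tested, and the gap widens quickly as k and n grow. The enumeration itself needs no further model calls, which is the sense in which the search moves to CPU.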
Where Pith is reading between the lines
- The same decomposition-plus-recombination pattern could be tested on domains such as mathematical proofs if they contain clear sub-problems.
- Increasing the depth or number of modules might further enlarge the effective search space without extra GPU sampling.
- Training explicitly for modularity may prove more sample-efficient than scaling test-time compute alone.
Load-bearing premise
Problems admit decompositions into sub-functions that can be solved and implemented independently and then recombined into a correct full solution.
What would settle it
A benchmark run in which modular recombination produces no additional correct solutions even after the RL stage has converged and the number of module variants is increased.
Original abstract
How can Large Language Models (LLMs) solve problems they currently cannot? Repeated sampling scales test-time compute but GPU cost grows linearly with attempts, while reinforcement learning (RL) with verifiable rewards improves single-attempt accuracy at the expense of sample diversity. Both strategies ultimately fail when the base policy has near-zero probability of producing a correct solution: no amount of sampling or gradient signal can overcome a search space that is simply too large. We take a different approach: rather than sampling harder, we make the task easier by decomposing problems into smaller, independently solvable sub-functions whose implementations can be recombined. Since off-the-shelf models are not trained for this modular generation, we introduce DecompRL, an RL algorithm that explicitly learns to decompose and implement hierarchical code structures. Recombining k implementations of n modules yields up to k^n candidate solutions, shifting the bottleneck from GPU inference to cheap CPU evaluation and cutting GPU token cost by about 50×. On LiveCodeBench and CodeContests (Qwen 2.5 7B, Code World Model 32B), DecompRL outperforms standard and diversity-optimized RL baselines beyond 10^5 tokens per problem, solving problems that standard generation cannot reach.
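The arithmetic behind the abstract's two scaling arguments can be made concrete with a short sketch. It assumes that whole-solution attempts are independent with a fixed per-attempt success probability p, which is a simplification, and the helper names and numeric values are hypothetical and not taken from the paper. Note also that the k^n recombined candidates share module implementations, so they are not independent samples, and k^n is only an upper bound.

```python
import math

def p_at_least_one(p: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds.

    Computes 1 - (1 - p)**attempts via log1p/expm1 so the result stays
    accurate when p is tiny.
    """
    return -math.expm1(attempts * math.log1p(-p))

def recombination_counts(k: int, n: int) -> tuple[int, int]:
    """Return (module generations sampled, upper bound on recombined candidates)."""
    return k * n, k ** n

# Illustrative values only: a near-zero base success rate and a large sampling budget.
print(p_at_least_one(1e-9, 10_000))

# Illustrative values only: 10 implementations for each of 4 modules.
print(recombination_counts(10, 4))
```

With p = 1e-9, ten thousand attempts still leave the chance of any success at about 1e-5, which is the regime where repeated sampling stalls. By contrast, 40 sampled module implementations give up to 10,000 candidates to test.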
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DecompRL, an RL algorithm that trains LLMs to decompose coding problems into modular sub-functions whose separate implementations are recombined to yield up to k^n candidate solutions. This shifts the search bottleneck from GPU inference to CPU evaluation. On LiveCodeBench and CodeContests with Qwen 2.5 7B and Code World Model 32B, DecompRL outperforms standard and diversity-optimized RL baselines beyond 10^5 tokens per problem while solving instances unreachable by direct generation.
Significance. If the results hold, the work demonstrates a practical route to scaling beyond the limits of repeated sampling and standard RL by changing problem structure via learned modularity. Strengths include the use of verifiable rewards, explicit CPU-side enumeration, ablations, and scaling plots that directly support the token-cost and performance claims.
Minor comments (3)
- [Abstract] The statement that DecompRL 'solves problems that standard generation cannot reach' would be strengthened by a brief quantitative note on how many such problems were solved and the exact token threshold at which the crossover occurs.
- [§5, Experiments] The recombination mechanics and k^n enumeration are described clearly, but the text could add a short paragraph confirming that sub-function independence was verified post-hoc on the solved instances rather than assumed.
- [Figure 3, caption] The scaling curves compare methods at fixed token budgets, but the legend should explicitly note whether the DecompRL training cost is amortized or excluded from the per-problem token count.
Simulated Author's Rebuttal
We thank the referee for the positive summary, recognition of the method's strengths (verifiable rewards, CPU-side enumeration, ablations, and scaling plots), and recommendation of minor revision. No major comments appear in the report.
Circularity Check
No significant circularity
The paper introduces DecompRL as an empirical RL method for learning modular decompositions in code generation, evaluated on external benchmarks like LiveCodeBench and CodeContests with reported outperformance and ablations. No equations, derivations, or first-principles claims are present that reduce predictions or results to fitted inputs by construction, self-definitional loops, or load-bearing self-citations. The central claims rest on verifiable reward signals, recombination mechanics, and scaling experiments that are externally falsifiable and do not rely on internal redefinitions or imported uniqueness theorems from the authors' prior work.