pith. machine review for the scientific record.

arxiv: 2604.10449 · v1 · submitted 2026-04-12 · 💻 cs.SE

Recognition: unknown

AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 16:31 UTC · model grok-4.3

classification 💻 cs.SE
keywords: code generation · adversarial search · Monte Carlo Tree Search · pseudo-correctness · test case generation · large language models · minimax game · robust code synthesis

The pith

AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that search-based code generation with LLMs overfits to fixed public test cases, producing solutions that pass visible checks but fail on hidden ones. It introduces an adversarial setup in which a solver agent proposes code while an attacker agent searches for corner-case tests that reveal logical weaknesses in the current pool of candidates. The discovered tests act as a growing filter that penalizes brittle solutions and pushes the process toward more general code. A reader would care because this directly targets the gap between benchmark success and real-world reliability in automated programming.

Core claim

AdverMCTS formulates generation as a minimax-style game between a Solver agent, which synthesizes code candidates, and an Attacker agent, which evolves to generate targeted corner test cases that exploit logical divergences in the current code pool; these discovered tests form a dynamic, progressively hostile filter that penalizes fragile reasoning and reduces pseudo-correctness.

What carries the argument

Adversarial Monte Carlo Tree Search that couples a Solver agent producing code with an Attacker agent generating exploitative tests in a minimax loop.
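The loop described above can be made concrete with a minimal, hypothetical sketch. This is not the paper's implementation: the task (implement `abs(x)`), the candidate moves, the probe inputs, and the admission rule are all invented for illustration, and the real Solver and Attacker are LLM agents searched with MCTS rather than random samplers.

```python
import random

def adversarial_loop(rounds=50, seed=0):
    """Toy solver-vs-attacker loop for the task f(x) = abs(x).

    Hypothetical stand-in for the AdverMCTS interaction: random samplers
    replace the paper's LLM agents and tree search.
    """
    rng = random.Random(seed)
    solver_moves = [
        lambda x: abs(x),   # robust candidate
        lambda x: x,        # pseudo-correct: passes non-negative tests only
    ]
    attacker_moves = [0, 1, 5, -1, -7]  # probe inputs, some corner cases
    tests = [(3, 3)]        # sparse public tests: f(3) == 3
    pool = []
    for _ in range(rounds):
        # Solver move: a candidate enters the pool if it passes all tests so far.
        cand = rng.choice(solver_moves)
        if all(cand(x) == y for x, y in tests):
            pool.append(cand)
        # Attacker move: a probe becomes a test only if it "diverges", i.e.
        # at least one pooled candidate gets it wrong; the pool is then filtered.
        x = rng.choice(attacker_moves)
        expected = abs(x)   # oracle for this toy task
        if any(c(x) != expected for c in pool):
            tests.append((x, expected))
            pool = [c for c in pool if c(x) == expected]
    return pool, tests
```

The invariant that matters is that every surviving candidate passes every admitted test, so the accumulated test set plays the role of the "progressively hostile filter" in the core claim.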

If this is right

  • Solutions exhibit lower false-positive rates when verified against hidden test suites.
  • Generated code must handle logical scenarios absent from the initial public constraints.
  • The search process yields candidates that generalize beyond the visible test distribution.
  • Performance gains hold across multiple code-generation benchmarks compared with non-adversarial search methods.
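The first bullet can be pinned down as a metric. The sketch below is an illustrative definition only, assuming tests are (input, expected) pairs and candidates are callables; the paper's exact false-positive definition may differ.

```python
def pseudo_correct_rate(candidates, public_tests, hidden_tests):
    """Fraction of public-test-passing candidates that fail a hidden test.

    Illustrative false-positive (pseudo-correctness) rate: tests are
    (input, expected) pairs, candidates are callables.
    """
    passing = [c for c in candidates
               if all(c(x) == y for x, y in public_tests)]
    if not passing:
        return 0.0
    false_pos = [c for c in passing
                 if any(c(x) != y for x, y in hidden_tests)]
    return len(false_pos) / len(passing)

# Toy check: an identity function masquerades as abs() on positive inputs.
rate = pseudo_correct_rate(
    [lambda x: abs(x), lambda x: x],
    public_tests=[(2, 2)],
    hidden_tests=[(-3, 3)],
)
# rate == 0.5: both candidates pass the public test, one fails the hidden one.
```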

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same solver-attacker loop could be applied to other generation tasks such as test-case creation or proof synthesis where edge-case discovery matters.
  • Repeated adversarial filtering might surface systematic reasoning weaknesses that could then be used to improve base model training.
  • Efficiency hinges on the attacker converging quickly; slower convergence would limit practical deployment on large code pools.

Load-bearing premise

The attacker agent can consistently discover non-trivial logical divergences in the code pool without excessive compute or converging to ineffective tests.

What would settle it

An evaluation on standard code benchmarks in which the attacker produces no failing tests beyond the public set and pass rates on hidden tests remain unchanged from static baselines; that outcome would show the adversarial loop adds nothing over plain search.

Figures

Figures reproduced from arXiv: 2604.10449 by Bo An, Qingyao Li, Weinan Zhang, Weiwen Liu, Yong Yu.

Figure 1
Figure 1. Conceptual Comparison. (A) Standard Search relies on sparse public tests, creating a “leaky” filter prone to pseudo-correctness. (B) ADVERMCTS employs an active Attacker to co-evolve a progressively stricter environment, exposing hidden bugs and enforcing robust correctness.
Figure 2
Figure 2. Overview of ADVERMCTS. A minimax interaction where the Solver (blue) generates code and the Attacker (red) synthesizes adversarial tests. The Global Hub turns valid attacks into constraints. Feedback is dual: divergence rewards (+R) for the Attacker and penalties (-V) for the Solver enforce robust generalization.
Figure 3
Figure 3. Scalability with Model Capabilities. We compare methods across backbones sorted by intrinsic capability. Despite parameter discrepancies, ADVERMCTS consistently amplifies performance, maintaining a significant lead across all model scales.
Figure 4
Figure 4. Ablation study of ADVERMCTS on APPS and TACO. We report the final Pass@1 (%) under the same inference budget while removing one component at a time.
Figure 5
Figure 5. Cost-Performance Pareto Frontiers (upper left is better). Comparison of Pass@1 accuracy against average token consumption per problem. ADVERMCTS consistently achieves a superior Pareto frontier. The labels denote the rollout budget.
Figure 7
Figure 7. Attacker Discriminative Analysis. We evaluate filtering efficacy on Pseudo-Correct (pass public, fail hidden) versus True Correct codes. The green hatched area marks effective bug identification (True Positives), while the red area indicates wrongful penalization (False Positives).
Figure 8
Figure 8. Empirical Validation of Pseudo-Correctness. Comparison of MCTS-Thought performance under varying verification environments (Original, Half-Hidden, Oracle). The significant performance gap between the standard setting (5 Tests) and the Oracle setting confirms the prevalence of pseudo-correctness: robust solutions are successfully generated but are filtered out due to the sparsity of public tests.
Figure 9
Figure 9. Attacker rollout scaling. Varying the Attacker’s number of rollouts shows an inverted-U trend on both APPS and TACO: moderate budgets improve Pass@1 across difficulty splits, while larger budgets can saturate or degrade performance.
read the original abstract

Recent advancements in Large Language Models (LLMs) have successfully employed search-based strategies to enhance code generation. However, existing methods typically rely on static, sparse public test cases for verification, leading to pseudo-correctness -- where solutions overfit the visible public tests but fail to generalize to hidden test cases. We argue that optimizing against a fixed, weak environment inherently limits robustness. To address this, we propose AdverMCTS, a novel adversarial Monte Carlo Tree Search framework that combats pseudo-correctness by coupling code search with active vulnerability discovery. AdverMCTS formulates generation as a minimax-style game between a Solver agent, which synthesizes code candidates, and an Attacker agent, which evolves to generate targeted corner test cases that exploit logical divergences in the current code pool. These discovered tests form a dynamic, progressively hostile filter that penalizes fragile reasoning. Extensive experiments demonstrate that AdverMCTS significantly outperforms state-of-the-art baselines, effectively reducing false positive rates and forcing the model to generalize beyond the initial constraints. The resources of this work are available at https://anonymous.4open.science/r/AdverMCTS_open-A255.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes AdverMCTS, a novel adversarial Monte Carlo Tree Search framework for LLM-based code generation. It formulates the task as a minimax-style game between a Solver agent that synthesizes code candidates and an Attacker agent that evolves targeted corner-case tests to exploit logical divergences in the current code pool. These dynamically discovered tests serve as a progressively hostile filter to penalize fragile solutions that overfit to static public tests (pseudo-correctness) while failing to generalize to hidden cases. The paper claims that extensive experiments demonstrate significant outperformance over state-of-the-art baselines, with reduced false-positive rates.

Significance. If the central results hold, the work could advance robustness in code generation by moving beyond static verification to active adversarial testing, addressing a recognized limitation in current search-based LLM methods. The open release of resources supports reproducibility and follow-on work.

major comments (2)
  1. [§3] §3 (Method), Attacker agent description: No details are provided on the attacker's state representation, mutation operators, reward signal, or stopping criteria. This is load-bearing for the headline claim, because reduced false positives and improved generalization require the attacker to reliably surface non-trivial logical divergences rather than noise or duplicates; without these components it is impossible to determine whether the adversarial loop adds value beyond extra search budget.
  2. [§4] §4 (Experiments): The abstract asserts outperformance and reduced false positives, yet the provided text contains no quantitative results, baseline implementations, statistical tests, ablation studies, or convergence analysis for the attacker. The central claim that AdverMCTS forces generalization therefore rests on unreviewed experimental evidence.
minor comments (1)
  1. The anonymous resource link should be replaced with a permanent repository identifier before publication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments highlight important areas where the presentation of AdverMCTS can be strengthened. We address each major comment below and describe the revisions we will incorporate.

read point-by-point responses
  1. Referee: [§3] §3 (Method), Attacker agent description: No details are provided on the attacker's state representation, mutation operators, reward signal, or stopping criteria. This is load-bearing for the headline claim, because reduced false positives and improved generalization require the attacker to reliably surface non-trivial logical divergences rather than noise or duplicates; without these components it is impossible to determine whether the adversarial loop adds value beyond extra search budget.

    Authors: We agree that §3 currently provides only a high-level description of the Attacker and that the requested implementation details are necessary to substantiate the minimax formulation. In the revised manuscript we will expand §3 with: (1) state representation as the tuple (current code pool, accumulated test cases, divergence history); (2) mutation operators consisting of boundary-value injection, logical negation of conditions, and input-distribution perturbation; (3) reward signal defined as the count of code candidates that pass public tests yet fail the newly generated test; and (4) stopping criteria based on either a fixed iteration budget or plateau in new logical divergences. We will also add pseudocode for the attacker loop and a brief complexity analysis. These additions will clarify that the adversarial component contributes targeted test cases beyond uniform extra search budget. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract asserts outperformance and reduced false positives, yet the provided text contains no quantitative results, baseline implementations, statistical tests, ablation studies, or convergence analysis for the attacker. The central claim that AdverMCTS forces generalization therefore rests on unreviewed experimental evidence.

    Authors: We acknowledge that the experimental section in the submitted version was insufficiently detailed and that the quantitative evidence must be fully visible to support the claims. The full manuscript contains §4 with results on HumanEval, MBPP, and APPS, but to address the concern we will revise §4 to include: explicit tables reporting pass@1, false-positive rates, and generalization gaps versus baselines (CodeT, AlphaCode, etc.); re-implementation details for all baselines; statistical significance tests with p-values; ablation studies isolating the attacker’s contribution; and convergence curves for both solver and attacker MCTS. The linked repository already contains the complete experimental code and data; we will add a pointer to the specific result files in the revised text. revision: yes
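The Attacker reward stated in response 1 (the count of candidates that pass all public tests yet fail the newly generated test) is simple enough to pin down in code. A hypothetical sketch, with the function name and data shapes invented here rather than taken from the paper:

```python
def divergence_reward(new_test, code_pool, public_tests):
    """Attacker reward as described in the rebuttal: the number of candidates
    that pass every public test yet fail the newly proposed test.

    Illustrative signature: tests are (input, expected) pairs,
    candidates are callables.
    """
    x, expected = new_test
    survivors = [c for c in code_pool
                 if all(c(px) == py for px, py in public_tests)]
    return sum(1 for c in survivors if c(x) != expected)
```

Under this reward a test that no surviving candidate fails earns zero, so the Attacker is pushed toward inputs that split the code pool rather than toward duplicates of the public set.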

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces AdverMCTS as a novel adversarial Monte Carlo Tree Search framework formulated as a minimax game between Solver and Attacker agents, with performance gains demonstrated through empirical experiments on code generation tasks. No equations, derivations, or first-principles results are presented that reduce the claimed outperformance or reduced false-positive rates to fitted parameters, self-definitions, or self-citation chains. The method is self-contained as an empirical construction evaluated against external baselines, with no load-bearing steps that collapse to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that adversarial test evolution improves generalization and on standard MCTS exploration mechanics; no new physical entities or free parameters are explicitly introduced in the abstract.

axioms (1)
  • domain assumption An attacker that generates targeted corner cases will expose logical divergences that static tests miss.
    Core premise of the minimax game in the abstract.

pith-pipeline@v0.9.0 · 5516 in / 1102 out tokens · 48291 ms · 2026-05-10T16:31:50.283624+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 33 canonical work pages · 16 internal anchors


  2. [2]

    Program Synthesis with Large Language Models

    Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  3. [3]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Brown, B., Juravsky, J., Ehrlich, R., Clark, R., Le, Q. V., Ré, C., and Mirhoseini, A. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024

  4. [4]

    A Survey of Monte Carlo Tree Search Methods

    Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012

  5. [5]

    Ppl-mcts: Constrained textual generation through discriminator-guided mcts decoding

    Chaffin, A., Claveau, V., and Kijak, E. Ppl-mcts: Constrained textual generation through discriminator-guided mcts decoding. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2953–2967, 2022

  6. [6]

    CodeT: Code Generation with Generated Tests

    Chen, B., Zhang, F., Nguyen, A., Zan, D., Lin, Z., Lou, J.-G., and Chen, W. Codet: Code generation with generated tests. arXiv preprint arXiv:2207.10397, 2022

  7. [7]

    Code Search Is All You Need? Improving Code Suggestions with Code Search

    Chen, J., Hu, X., Li, Z., Gao, C., Xia, X., and Lo, D. Code search is all you need? Improving code suggestions with code search. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1–13, 2024a

  8. [8]

    Evaluating Large Language Models Trained on Code

    Chen, M. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  9. [9]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024b

  10. [10]

    When is tree search useful for LLM planning? it depends on the discriminator

    Chen, Z., White, M., Mooney, R., Payani, A., Su, Y., and Sun, H. When is tree search useful for LLM planning? It depends on the discriminator. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13659–13678, 2024c

  11. [11]

    Codescore: Evaluating code generation by learning code execution

    Dong, Y., Ding, J., Jiang, X., Li, G., Li, Z., and Jin, Z. Codescore: Evaluating code generation by learning code execution. ACM Transactions on Software Engineering and Methodology, 34(3):1–22, 2025a

  12. [12]

    A Survey on Code Generation with LLM-Based Agents

    Dong, Y., Jiang, X., Qian, J., Wang, T., Zhang, K., Jin, Z., and Li, G. A survey on code generation with LLM-based agents. arXiv preprint arXiv:2508.00083, 2025b

  13. [13]

    Search-based LLMs for code optimization

    Gao, S., Gao, C., Gu, W., and Lyu, M. Search-based LLMs for code optimization. arXiv preprint arXiv:2408.12159, 2024

  14. [14]

    Rrgcode: Deep hierarchical search-based code generation

    Gou, Q., Dong, Y., Wu, Y., and Ke, Q. Rrgcode: Deep hierarchical search-based code generation. Journal of Systems and Software, 211:111982, 2024

  15. [15]

    A value based parallel update mcts method for multi-agent cooperative decision making of connected and automated vehicles

    Han, Y., Zhang, L., Meng, D., Zhang, Z., Hu, X., and Weng, S. A value based parallel update mcts method for multi-agent cooperative decision making of connected and automated vehicles. arXiv preprint arXiv:2409.13783, 2024

  16. [16]

    Reasoning with language model is planning with world model

    Hao, S., Gu, Y., Ma, H., Hong, J., Wang, Z., Wang, D., and Hu, Z. Reasoning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 8154–8173, 2023

  17. [17]

    Measuring Coding Challenge Competence With APPS

    Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., et al. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938, 2021

  18. [18]

    Qwen2.5-Coder Technical Report

    Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024

  19. [19]

    A Survey on Large Language Models for Code Generation

    Jiang, J., Wang, F., Shen, J., Kim, S., and Kim, S. A survey on large language models for code generation. arXiv preprint arXiv:2406.00515, 2024a

  20. [20]

    Self-planning code generation with large language models

    Jiang, X., Dong, Y., Wang, L., Fang, Z., Shang, Q., Li, G., Jin, Z., and Jiao, W. Self-planning code generation with large language models. ACM Transactions on Software Engineering and Methodology, 33(7):1–30, 2024b

  21. [21]

    On the bias of BFS (breadth first search)

    Kurant, M., Markopoulou, A., and Thiran, P. On the bias of BFS (breadth first search). In 2010 22nd International Teletraffic Congress (ITC 22), pp. 1–8. IEEE, 2010

  22. [22]

    vLLM: An Efficient Inference Engine for Large Language Models

    Kwon, W. vLLM: An Efficient Inference Engine for Large Language Models. PhD thesis, University of California, Berkeley, 2025

  23. [23]

    HumanEval on Latest GPT Models–2024

    Li, D. and Murr, L. Humaneval on latest GPT models--2024. arXiv preprint arXiv:2402.14852, 2024

  24. [24]

    S*: Test Time Scaling for Code Generation

    Li, D., Cao, S., Cao, C., Li, X., Tan, S., Keutzer, K., Xing, J., Gonzalez, J. E., and Stoica, I. S*: Test time scaling for code generation. arXiv preprint arXiv:2502.14382, 2025a

  25. [25]

    Codetree: Agent-guided tree search for code generation with large language models

    Li, J., Le, H., Zhou, Y., Xiong, C., Savarese, S., and Sahoo, D. Codetree: Agent-guided tree search for code generation with large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 3711–3726, 2025b

  26. [26]

    Codeprm: Execution feedback-enhanced process reward model for code generation

    Li, Q., Dai, X., Li, X., Zhang, W., Wang, Y., Tang, R., and Yu, Y. Codeprm: Execution feedback-enhanced process reward model for code generation. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 8169–8182, 2025c

  27. [27]

    Atgen: Adversarial reinforcement learning for test case generation

    Li, Q., Dai, X., Liu, W., Li, X., Wang, Y., Tang, R., Yu, Y., and Zhang, W. Atgen: Adversarial reinforcement learning for test case generation. arXiv preprint arXiv:2510.14635, 2025d

  28. [28]

    Rethinkmcts: Refining erroneous thoughts in monte carlo tree search for code generation

    Li, Q., Xia, W., Dai, X., Du, K., Liu, W., Wang, Y., Tang, R., Yu, Y., and Zhang, W. Rethinkmcts: Refining erroneous thoughts in Monte Carlo tree search for code generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 8103–8121, 2025e

  29. [29]

    StarCoder: may the source be with you!

    Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023a

  30. [30]

    Taco: Topics in Algorithmic Code Generation Dataset

    Li, R., Fu, J., Zhang, B.-W., Huang, T., Sun, Z., Lyu, C., Liu, G., Jin, Z., and Li, G. Taco: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852, 2023b

  31. [31]

    Competition-level code generation with alphacode

    Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022

  32. [32]

    From System 1 to System 2: A Survey of Reasoning Large Language Models

    Li, Z.-Z., Zhang, D., Zhang, M.-L., Zhang, J., Liu, Z., Yao, Y., Xu, H., Zheng, J., Wang, P.-J., Chen, X., et al. From system 1 to system 2: A survey of reasoning large language models. arXiv preprint arXiv:2502.17419, 2025f

  33. [33]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025

  34. [34]

    Visualizing the Pareto Frontier

    Lotov, A. V. and Miettinen, K. Visualizing the Pareto frontier. In Multiobjective Optimization: Interactive and Evolutionary Approaches, pp. 213–243. Springer, 2008

  35. [35]

    StarCoder 2 and The Stack v2: The Next Generation

    Lozhkov, A., Li, R., Allal, L. B., Cassano, F., Lamy-Poirier, J., Tazi, N., Tang, A., Pykhtar, D., Liu, J., Wei, Y., et al. Starcoder 2 and the stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024

  36. [36]

    Let's revise step-by-step: A unified local search framework for code generation with LLMs

    Lyu, Z., Huang, J., Deng, Y., Hoi, S., and An, B. Let's revise step-by-step: A unified local search framework for code generation with LLMs. arXiv preprint arXiv:2508.07434, 2025

  37. [37]

    Codeforces as an educational platform for learning programming in digitalization

    Mirzayanov, M., Pavlova, O., Mavrin, P., Melnikov, R., Plotnikov, A., Parfenov, V., and Stankevich, A. Codeforces as an educational platform for learning programming in digitalization. Olympiads in Informatics, 14:133–142, 2020

  38. [38]

    s1: Simple Test-Time Scaling

    Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. B. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 20286–20332, 2025

  39. [39]

    Lever: Learning to Verify Language-to-Code Generation with Execution

    Ni, A., Iyer, S., Radev, D., Stoyanov, V., Yih, W.-t., Wang, S., and Lin, X. V. Lever: Learning to verify language-to-code generation with execution. In International Conference on Machine Learning, pp. 26106–26128. PMLR, 2023

  40. [40]

    A Comparative Review of AI Techniques for Automated Code Generation in Software Development

    Odeh, A., Odeh, N., and Mohammed, A. S. A comparative review of AI techniques for automated code generation in software development: Advancements, challenges, and future directions. TEM Journal, 13(1):726, 2024

  41. [41]

    Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review

    Paul, D. G., Zhu, H., and Bayley, I. Benchmarks and metrics for evaluations of code generation: A critical review. In 2024 IEEE International Conference on Artificial Intelligence Testing (AITest), pp. 87–94. IEEE, 2024

  42. [42]

    TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation

    Princis, H., Sharma, A., and David, C. Treecoder: Systematic exploration and optimisation of decoding and constraints for LLM code generation. arXiv preprint arXiv:2511.22277, 2025

  43. [43]

    A decision-making framework using mcts as a hierarchical task network and deep learning connector

    Shao, T., Zhang, K., Cheng, K., and Zhang, H. A decision-making framework using MCTS as a hierarchical task network and deep learning connector. Science Progress, 108(4):00368504251386308, 2025

  44. [44]

    Reflexion: Language agents with verbal reinforcement learning

    Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and Yao, S. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023

  45. [45]

    Mastering the Game of Go with Deep Neural Networks and Tree Search

    Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016

  46. [46]

    Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

    Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017

  47. [47]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

  48. [48]

    Teaching code LLMs to use autocompletion tools in repository-level code generation

    Wang, C., Zhang, J., Feng, Y., Li, T., Sun, W., Liu, Y., and Peng, X. Teaching code LLMs to use autocompletion tools in repository-level code generation. ACM Transactions on Software Engineering and Methodology, 34(7):1–27, 2025

  49. [49]

    Planning in natural language improves llm search for code generation

    Wang, E., Cassano, F., Wu, C., Bai, Y., Song, W., Nath, V., Han, Z., Hendryx, S., Yue, S., and Zhang, H. Planning in natural language improves LLM search for code generation. arXiv preprint arXiv:2409.03733, 2024

  50. [50]

    Reward-centered rest-mcts: A robust decision-making framework for robotic manipulation in high uncertainty environments

    Wang, X. Reward-centered rest-mcts: A robust decision-making framework for robotic manipulation in high uncertainty environments. arXiv preprint arXiv:2503.05226, 2025

  51. [51]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

  52. [52]

    Tailoring Diagnostic Modeling to Individual Learners: Personalized Distractor Generation via MCTS-Guided Reasoning Reconstruction

    Wu, T., Chen, J., Lin, W., Zhan, J., Li, M., Kuang, K., and Wu, F. Personalized distractor generation via mcts-guided reasoning reconstruction. arXiv preprint arXiv:2508.11184, 2025

  53. [53]

    Towards System 2 Reasoning in LLMs: Learning How to Think with Meta Chain-of-Thought

    Xiang, V., Snell, C., Gandhi, K., Albalak, A., Singh, A., Blagden, C., Phung, D., Rafailov, R., Lile, N., Mahan, D., et al. Towards system 2 reasoning in LLMs : Learning how to think with meta chain-of-thought. arXiv preprint arXiv:2501.04682, 2025

  54. [54]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a

  55. [55]

    An empirical study of retrieval-augmented code generation: Challenges and opportunities

    Yang, Z., Chen, S., Gao, C., Li, Z., Hu, X., Liu, K., and Xia, X. An empirical study of retrieval-augmented code generation: Challenges and opportunities. ACM Transactions on Software Engineering and Methodology, 2025b

  56. [56]

    Tree of thoughts: Deliberate problem solving with large language models

    Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36:11809–11822, 2023

  57. [57]

    Revisiting the Test-Time Scaling of o1-like Models: Do They Truly Possess Test-Time Scaling Capabilities?

    Zeng, Z., Cheng, Q., Yin, Z., Zhou, Y., and Qiu, X. Revisiting the test-time scaling of o1-like models: Do they truly possess test-time scaling capabilities? arXiv preprint arXiv:2502.12215, 2025

  58. [58]

    A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?

    Zhang, Q., Lyu, F., Sun, Z., Wang, L., Zhang, W., Hua, W., Wu, H., Guo, Z., Wang, Y., Muennighoff, N., et al. A survey on test-time scaling in large language models: What, how, where, and how well? arXiv preprint arXiv:2503.24235, 2025

  59. [59]

    Planning with Large Language Models for Code Generation

    Zhang, S., Chen, Z., Shen, Y., Ding, M., Tenenbaum, J. B., and Gan, C. Planning with large language models for code generation. arXiv preprint arXiv:2303.05510, 2023

  60. [60]

    Ldb: A Large Language Model Debugger via Verifying Runtime Execution Step-by-Step

    Zhong, L., Wang, Z., and Shang, J. Ldb: A large language model debugger via verifying runtime execution step-by-step. arXiv preprint arXiv:2402.16906, 2024

  61. [61]

    Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

    Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., and Wang, Y.-X. Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406, 2023