pith. machine review for the scientific record.

arxiv: 2604.10449 · v1 · submitted 2026-04-12 · 💻 cs.SE

Recognition: unknown

AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 16:31 UTC · model grok-4.3

classification 💻 cs.SE
keywords: code generation · adversarial search · Monte Carlo Tree Search · pseudo-correctness · test case generation · large language models · minimax game · robust code synthesis

The pith

AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that search-based code generation with LLMs overfits to fixed public test cases, producing solutions that pass visible checks but fail on hidden ones. It introduces an adversarial setup in which a solver agent proposes code while an attacker agent searches for corner-case tests that reveal logical weaknesses in the current pool of candidates. The discovered tests act as a growing filter that penalizes brittle solutions and pushes the process toward more general code. A reader would care because this directly targets the gap between benchmark success and real-world reliability in automated programming.

Core claim

AdverMCTS formulates generation as a minimax-style game between a Solver agent, which synthesizes code candidates, and an Attacker agent, which evolves to generate targeted corner test cases that exploit logical divergences in the current code pool; these discovered tests form a dynamic, progressively hostile filter that penalizes fragile reasoning and reduces pseudo-correctness.

What carries the argument

Adversarial Monte Carlo Tree Search that couples a Solver agent producing code with an Attacker agent generating exploitative tests in a minimax loop.
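The loop described above can be made concrete with a minimal, hypothetical sketch. This is not the paper's implementation: the task (implement `abs(x)`), the candidate moves, the probe inputs, and the admission rule are all invented for illustration, and the real Solver and Attacker are LLM agents searched with MCTS rather than random samplers.

```python
import random

def adversarial_loop(rounds=50, seed=0):
    """Toy solver-vs-attacker loop for the task f(x) = abs(x).

    Hypothetical stand-in for the AdverMCTS interaction: random samplers
    replace the paper's LLM agents and tree search.
    """
    rng = random.Random(seed)
    solver_moves = [
        lambda x: abs(x),   # robust candidate
        lambda x: x,        # pseudo-correct: passes non-negative tests only
    ]
    attacker_moves = [0, 1, 5, -1, -7]  # probe inputs, some corner cases
    tests = [(3, 3)]        # sparse public tests: f(3) == 3
    pool = []
    for _ in range(rounds):
        # Solver move: a candidate enters the pool if it passes all tests so far.
        cand = rng.choice(solver_moves)
        if all(cand(x) == y for x, y in tests):
            pool.append(cand)
        # Attacker move: a probe becomes a test only if it "diverges", i.e.
        # at least one pooled candidate gets it wrong; the pool is then filtered.
        x = rng.choice(attacker_moves)
        expected = abs(x)   # oracle for this toy task
        if any(c(x) != expected for c in pool):
            tests.append((x, expected))
            pool = [c for c in pool if c(x) == expected]
    return pool, tests
```

The invariant that matters is that every surviving candidate passes every admitted test, so the accumulated test set plays the role of the "progressively hostile filter" in the core claim.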

If this is right

  • Solutions exhibit lower false-positive rates when verified against hidden test suites.
  • Generated code must handle logical scenarios absent from the initial public constraints.
  • The search process yields candidates that generalize beyond the visible test distribution.
  • Performance gains hold across multiple code-generation benchmarks compared with non-adversarial search methods.
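The first bullet can be pinned down as a metric. The sketch below is an illustrative definition only, assuming tests are (input, expected) pairs and candidates are callables; the paper's exact false-positive definition may differ.

```python
def pseudo_correct_rate(candidates, public_tests, hidden_tests):
    """Fraction of public-test-passing candidates that fail a hidden test.

    Illustrative false-positive (pseudo-correctness) rate: tests are
    (input, expected) pairs, candidates are callables.
    """
    passing = [c for c in candidates
               if all(c(x) == y for x, y in public_tests)]
    if not passing:
        return 0.0
    false_pos = [c for c in passing
                 if any(c(x) != y for x, y in hidden_tests)]
    return len(false_pos) / len(passing)

# Toy check: an identity function masquerades as abs() on positive inputs.
rate = pseudo_correct_rate(
    [lambda x: abs(x), lambda x: x],
    public_tests=[(2, 2)],
    hidden_tests=[(-3, 3)],
)
# rate == 0.5: both candidates pass the public test, one fails the hidden one.
```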

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same solver-attacker loop could be applied to other generation tasks such as test-case creation or proof synthesis where edge-case discovery matters.
  • Repeated adversarial filtering might surface systematic reasoning weaknesses that could then be used to improve base model training.
  • Efficiency hinges on the attacker converging quickly; slower convergence would limit practical deployment on large code pools.

Load-bearing premise

The attacker agent can consistently discover non-trivial logical divergences in the code pool without excessive compute or converging to ineffective tests.

What would settle it

An evaluation on standard code benchmarks in which the attacker produces no failing tests beyond the public set and pass rates on hidden tests remain unchanged from static baselines; that outcome would show the adversarial loop adds nothing over plain search.

Figures

Figures reproduced from arXiv: 2604.10449 by Bo An, Qingyao Li, Weinan Zhang, Weiwen Liu, Yong Yu.

Figure 1
Figure 1. Conceptual Comparison. (A) Standard Search relies on sparse public tests, creating a “leaky” filter prone to pseudo-correctness. (B) ADVERMCTS employs an active Attacker to co-evolve a progressively stricter environment, exposing hidden bugs and enforcing robust correctness.
Figure 2
Figure 2. Overview of ADVERMCTS. A minimax interaction where the Solver (blue) generates code and the Attacker (red) synthesizes adversarial tests. The Global Hub turns valid attacks into constraints. Feedback is dual: divergence rewards (+R) for the Attacker and penalties (-V) for the Solver enforce robust generalization.
Figure 3
Figure 3. Scalability with Model Capabilities. We compare methods across backbones sorted by intrinsic capability. Despite parameter discrepancies, ADVERMCTS consistently amplifies performance, maintaining a significant lead across all model scales.
Figure 4
Figure 4. Ablation study of ADVERMCTS on APPS and TACO. We report the final Pass@1 (%) under the same inference budget while removing one component at a time.
Figure 5
Figure 5. Cost-Performance Pareto Frontiers (upper left is better). Comparison of Pass@1 accuracy against average token consumption per problem. ADVERMCTS consistently achieves a superior Pareto frontier. The labels denote the rollout budget.
Figure 7
Figure 7. Attacker Discriminative Analysis. We evaluate filtering efficacy on Pseudo-Correct (pass public, fail hidden) versus True Correct codes. The green hatched area marks effective bug identification (True Positives), while the red area indicates wrongful penalization (False Positives).
Figure 8
Figure 8. Empirical Validation of Pseudo-Correctness. Comparison of MCTS-Thought performance under varying verification environments (Original, Half-Hidden, Oracle). The significant performance gap between the standard setting (5 Tests) and the Oracle setting confirms the prevalence of pseudo-correctness: robust solutions are successfully generated but are filtered out due to the sparsity of public tests.
Figure 9
Figure 9. Attacker rollout scaling. Varying the Attacker’s number of rollouts shows an inverted-U trend on both APPS and TACO: moderate budgets improve Pass@1 across difficulty splits, while larger budgets can saturate or degrade performance.
read the original abstract

Recent advancements in Large Language Models (LLMs) have successfully employed search-based strategies to enhance code generation. However, existing methods typically rely on static, sparse public test cases for verification, leading to pseudo-correctness -- where solutions overfit the visible public tests but fail to generalize to hidden test cases. We argue that optimizing against a fixed, weak environment inherently limits robustness. To address this, we propose AdverMCTS, a novel adversarial Monte Carlo Tree Search framework that combats pseudo-correctness by coupling code search with active vulnerability discovery. AdverMCTS formulates generation as a minimax-style game between a Solver agent, which synthesizes code candidates, and an Attacker agent, which evolves to generate targeted corner test cases that exploit logical divergences in the current code pool. These discovered tests form a dynamic, progressively hostile filter that penalizes fragile reasoning. Extensive experiments demonstrate that AdverMCTS significantly outperforms state-of-the-art baselines, effectively reducing false positive rates and forcing the model to generalize beyond the initial constraints. The resources of this work are available at https://anonymous.4open.science/r/AdverMCTS_open-A255.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes AdverMCTS, a novel adversarial Monte Carlo Tree Search framework for LLM-based code generation. It formulates the task as a minimax-style game between a Solver agent that synthesizes code candidates and an Attacker agent that evolves targeted corner-case tests to exploit logical divergences in the current code pool. These dynamically discovered tests serve as a progressively hostile filter to penalize fragile solutions that overfit to static public tests (pseudo-correctness) while failing to generalize to hidden cases. The paper claims that extensive experiments demonstrate significant outperformance over state-of-the-art baselines, with reduced false-positive rates.

Significance. If the central results hold, the work could advance robustness in code generation by moving beyond static verification to active adversarial testing, addressing a recognized limitation in current search-based LLM methods. The open release of resources supports reproducibility and follow-on work.

major comments (2)
  1. [§3] §3 (Method), Attacker agent description: No details are provided on the attacker's state representation, mutation operators, reward signal, or stopping criteria. This is load-bearing for the headline claim, because reduced false positives and improved generalization require the attacker to reliably surface non-trivial logical divergences rather than noise or duplicates; without these components it is impossible to determine whether the adversarial loop adds value beyond extra search budget.
  2. [§4] §4 (Experiments): The abstract asserts outperformance and reduced false positives, yet the provided text contains no quantitative results, baseline implementations, statistical tests, ablation studies, or convergence analysis for the attacker. The central claim that AdverMCTS forces generalization therefore rests on unreviewed experimental evidence.
minor comments (1)
  1. The anonymous resource link should be replaced with a permanent repository identifier before publication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments highlight important areas where the presentation of AdverMCTS can be strengthened. We address each major comment below and describe the revisions we will incorporate.

read point-by-point responses
  1. Referee: [§3] §3 (Method), Attacker agent description: No details are provided on the attacker's state representation, mutation operators, reward signal, or stopping criteria. This is load-bearing for the headline claim, because reduced false positives and improved generalization require the attacker to reliably surface non-trivial logical divergences rather than noise or duplicates; without these components it is impossible to determine whether the adversarial loop adds value beyond extra search budget.

    Authors: We agree that §3 currently provides only a high-level description of the Attacker and that the requested implementation details are necessary to substantiate the minimax formulation. In the revised manuscript we will expand §3 with: (1) state representation as the tuple (current code pool, accumulated test cases, divergence history); (2) mutation operators consisting of boundary-value injection, logical negation of conditions, and input-distribution perturbation; (3) reward signal defined as the count of code candidates that pass public tests yet fail the newly generated test; and (4) stopping criteria based on either a fixed iteration budget or plateau in new logical divergences. We will also add pseudocode for the attacker loop and a brief complexity analysis. These additions will clarify that the adversarial component contributes targeted test cases beyond uniform extra search budget. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract asserts outperformance and reduced false positives, yet the provided text contains no quantitative results, baseline implementations, statistical tests, ablation studies, or convergence analysis for the attacker. The central claim that AdverMCTS forces generalization therefore rests on unreviewed experimental evidence.

    Authors: We acknowledge that the experimental section in the submitted version was insufficiently detailed and that the quantitative evidence must be fully visible to support the claims. The full manuscript contains §4 with results on HumanEval, MBPP, and APPS, but to address the concern we will revise §4 to include: explicit tables reporting pass@1, false-positive rates, and generalization gaps versus baselines (CodeT, AlphaCode, etc.); re-implementation details for all baselines; statistical significance tests with p-values; ablation studies isolating the attacker’s contribution; and convergence curves for both solver and attacker MCTS. The linked repository already contains the complete experimental code and data; we will add a pointer to the specific result files in the revised text. revision: yes
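The Attacker reward stated in response 1 (the count of candidates that pass all public tests yet fail the newly generated test) is simple enough to pin down in code. A hypothetical sketch, with the function name and data shapes invented here rather than taken from the paper:

```python
def divergence_reward(new_test, code_pool, public_tests):
    """Attacker reward as described in the rebuttal: the number of candidates
    that pass every public test yet fail the newly proposed test.

    Illustrative signature: tests are (input, expected) pairs,
    candidates are callables.
    """
    x, expected = new_test
    survivors = [c for c in code_pool
                 if all(c(px) == py for px, py in public_tests)]
    return sum(1 for c in survivors if c(x) != expected)
```

Under this reward a test that no surviving candidate fails earns zero, so the Attacker is pushed toward inputs that split the code pool rather than toward duplicates of the public set.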

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces AdverMCTS as a novel adversarial Monte Carlo Tree Search framework formulated as a minimax game between Solver and Attacker agents, with performance gains demonstrated through empirical experiments on code generation tasks. No equations, derivations, or first-principles results are presented that reduce the claimed outperformance or reduced false-positive rates to fitted parameters, self-definitions, or self-citation chains. The method is self-contained as an empirical construction evaluated against external baselines, with no load-bearing steps that collapse to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that adversarial test evolution improves generalization and on standard MCTS exploration mechanics; no new physical entities or free parameters are explicitly introduced in the abstract.

axioms (1)
  • domain assumption An attacker that generates targeted corner cases will expose logical divergences that static tests miss.
    Core premise of the minimax game in the abstract.

pith-pipeline@v0.9.0 · 5516 in / 1102 out tokens · 48291 ms · 2026-05-10T16:31:50.283624+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 33 canonical work pages · 16 internal anchors


  2. [2]

    Program Synthesis with Large Language Models

    Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  3. [3]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Brown, B., Juravsky, J., Ehrlich, R., Clark, R., Le, Q. V., Ré, C., and Mirhoseini, A. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024

  4. [4]

    A Survey of Monte Carlo Tree Search Methods

    Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012

  5. [5]

    Ppl-mcts: Constrained textual generation through discriminator-guided mcts decoding

    Chaffin, A., Claveau, V., and Kijak, E. Ppl-mcts: Constrained textual generation through discriminator-guided mcts decoding. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2953–2967, 2022

  6. [6]

    CodeT: Code Generation with Generated Tests

    Chen, B., Zhang, F., Nguyen, A., Zan, D., Lin, Z., Lou, J.-G., and Chen, W. Codet: Code generation with generated tests. arXiv preprint arXiv:2207.10397, 2022

  7. [7]

    Code Search Is All You Need? Improving Code Suggestions with Code Search

    Chen, J., Hu, X., Li, Z., Gao, C., Xia, X., and Lo, D. Code search is all you need? Improving code suggestions with code search. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1–13, 2024a

  8. [8]

    Evaluating Large Language Models Trained on Code

    Chen, M. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  9. [9]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024b

  10. [10]

    When is tree search useful for LLM planning? it depends on the discriminator

    Chen, Z., White, M., Mooney, R., Payani, A., Su, Y., and Sun, H. When is tree search useful for LLM planning? It depends on the discriminator. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13659–13678, 2024c

  11. [11]

    Codescore: Evaluating code generation by learning code execution

    Dong, Y., Ding, J., Jiang, X., Li, G., Li, Z., and Jin, Z. Codescore: Evaluating code generation by learning code execution. ACM Transactions on Software Engineering and Methodology, 34(3):1–22, 2025a

  12. [12]

    A Survey on Code Generation with LLM-Based Agents

    Dong, Y., Jiang, X., Qian, J., Wang, T., Zhang, K., Jin, Z., and Li, G. A survey on code generation with LLM-based agents. arXiv preprint arXiv:2508.00083, 2025b

  13. [13]

    Search-based LLMs for code optimization

    Gao, S., Gao, C., Gu, W., and Lyu, M. Search-based LLMs for code optimization. arXiv preprint arXiv:2408.12159, 2024

  14. [14]

    Rrgcode: Deep hierarchical search-based code generation

    Gou, Q., Dong, Y., Wu, Y., and Ke, Q. Rrgcode: Deep hierarchical search-based code generation. Journal of Systems and Software, 211:111982, 2024

  15. [15]

    A value based parallel update mcts method for multi-agent cooperative decision making of connected and automated vehicles

    Han, Y., Zhang, L., Meng, D., Zhang, Z., Hu, X., and Weng, S. A value based parallel update mcts method for multi-agent cooperative decision making of connected and automated vehicles. arXiv preprint arXiv:2409.13783, 2024

  16. [16]

    Reasoning with language model is planning with world model

    Hao, S., Gu, Y., Ma, H., Hong, J., Wang, Z., Wang, D., and Hu, Z. Reasoning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 8154–8173, 2023

  17. [17]

    Measuring Coding Challenge Competence With APPS

    Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., et al. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938, 2021

  18. [18]

    Qwen2.5-Coder Technical Report

    Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024

  19. [19]

    A Survey on Large Language Models for Code Generation

    Jiang, J., Wang, F., Shen, J., Kim, S., and Kim, S. A survey on large language models for code generation. arXiv preprint arXiv:2406.00515, 2024a

  20. [20]

    Self-planning code generation with large language models

    Jiang, X., Dong, Y., Wang, L., Fang, Z., Shang, Q., Li, G., Jin, Z., and Jiao, W. Self-planning code generation with large language models. ACM Transactions on Software Engineering and Methodology, 33(7):1–30, 2024b

  21. [21]

    On the bias of BFS (breadth first search)

    Kurant, M., Markopoulou, A., and Thiran, P. On the bias of BFS (breadth first search). In 2010 22nd International Teletraffic Congress (ITC 22), pp. 1–8. IEEE, 2010

  22. [22]

    vLLM: An Efficient Inference Engine for Large Language Models

    Kwon, W. vLLM: An Efficient Inference Engine for Large Language Models. PhD thesis, University of California, Berkeley, 2025

  23. [23]

    HumanEval on Latest GPT Models–2024

    Li, D. and Murr, L. Humaneval on latest GPT models--2024. arXiv preprint arXiv:2402.14852, 2024

  24. [24]

    S*: Test Time Scaling for Code Generation

    Li, D., Cao, S., Cao, C., Li, X., Tan, S., Keutzer, K., Xing, J., Gonzalez, J. E., and Stoica, I. S*: Test time scaling for code generation. arXiv preprint arXiv:2502.14382, 2025a

  25. [25]

    Codetree: Agent-guided tree search for code generation with large language models

    Li, J., Le, H., Zhou, Y., Xiong, C., Savarese, S., and Sahoo, D. Codetree: Agent-guided tree search for code generation with large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 3711–3726, 2025b

  26. [26]

    Codeprm: Execution feedback-enhanced process reward model for code generation

    Li, Q., Dai, X., Li, X., Zhang, W., Wang, Y., Tang, R., and Yu, Y. Codeprm: Execution feedback-enhanced process reward model for code generation. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 8169–8182, 2025c

  27. [27]

    Atgen: Adversarial reinforcement learning for test case generation

    Li, Q., Dai, X., Liu, W., Li, X., Wang, Y., Tang, R., Yu, Y., and Zhang, W. Atgen: Adversarial reinforcement learning for test case generation. arXiv preprint arXiv:2510.14635, 2025d

  28. [28]

    Rethinkmcts: Refining erroneous thoughts in monte carlo tree search for code generation

    Li, Q., Xia, W., Dai, X., Du, K., Liu, W., Wang, Y., Tang, R., Yu, Y., and Zhang, W. Rethinkmcts: Refining erroneous thoughts in Monte Carlo tree search for code generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 8103–8121, 2025e

  29. [29]

    StarCoder: may the source be with you!

    Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023a

  30. [30]

    Taco: Topics in Algorithmic Code Generation Dataset

    Li, R., Fu, J., Zhang, B.-W., Huang, T., Sun, Z., Lyu, C., Liu, G., Jin, Z., and Li, G. Taco: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852, 2023b

  31. [31]

    Competition-level code generation with alphacode

    Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022

  32. [32]

    From System 1 to System 2: A Survey of Reasoning Large Language Models

    Li, Z.-Z., Zhang, D., Zhang, M.-L., Zhang, J., Liu, Z., Yao, Y., Xu, H., Zheng, J., Wang, P.-J., Chen, X., et al. From system 1 to system 2: A survey of reasoning large language models. arXiv preprint arXiv:2502.17419, 2025f

  33. [33]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025

  34. [34]

    Visualizing the Pareto Frontier

    Lotov, A. V. and Miettinen, K. Visualizing the Pareto frontier. In Multiobjective Optimization: Interactive and Evolutionary Approaches, pp. 213–243. Springer, 2008

  35. [35]

    StarCoder 2 and The Stack v2: The Next Generation

    Lozhkov, A., Li, R., Allal, L. B., Cassano, F., Lamy-Poirier, J., Tazi, N., Tang, A., Pykhtar, D., Liu, J., Wei, Y., et al. Starcoder 2 and the stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024

  36. [36]

    Let's revise step-by-step: A unified local search framework for code generation with LLMs

    Lyu, Z., Huang, J., Deng, Y., Hoi, S., and An, B. Let's revise step-by-step: A unified local search framework for code generation with LLMs. arXiv preprint arXiv:2508.07434, 2025

  37. [37]

    Codeforces as an educational platform for learning programming in digitalization

    Mirzayanov, M., Pavlova, O., Mavrin, P., Melnikov, R., Plotnikov, A., Parfenov, V., and Stankevich, A. Codeforces as an educational platform for learning programming in digitalization. Olympiads in Informatics, 14:133–142, 2020

  38. [38]

    s1: Simple Test-Time Scaling

    Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. B. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 20286–20332, 2025

  39. [39]

    Lever: Learning to Verify Language-to-Code Generation with Execution

    Ni, A., Iyer, S., Radev, D., Stoyanov, V., Yih, W.-t., Wang, S., and Lin, X. V. Lever: Learning to verify language-to-code generation with execution. In International Conference on Machine Learning, pp. 26106–26128. PMLR, 2023

  40. [40]

    A Comparative Review of AI Techniques for Automated Code Generation in Software Development

    Odeh, A., Odeh, N., and Mohammed, A. S. A comparative review of AI techniques for automated code generation in software development: Advancements, challenges, and future directions. TEM Journal, 13(1):726, 2024

  41. [41]

    Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review

    Paul, D. G., Zhu, H., and Bayley, I. Benchmarks and metrics for evaluations of code generation: A critical review. In 2024 IEEE International Conference on Artificial Intelligence Testing (AITest), pp. 87–94. IEEE, 2024

  42. [42]

    TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation

    Princis, H., Sharma, A., and David, C. Treecoder: Systematic exploration and optimisation of decoding and constraints for LLM code generation. arXiv preprint arXiv:2511.22277, 2025

  43. [43]

    A decision-making framework using mcts as a hierarchical task network and deep learning connector

    Shao, T., Zhang, K., Cheng, K., and Zhang, H. A decision-making framework using MCTS as a hierarchical task network and deep learning connector. Science Progress, 108(4):00368504251386308, 2025

  44. [44]

    Reflexion: Language agents with verbal reinforcement learning

    Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and Yao, S. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023

  45. [45]

    Mastering the Game of Go with Deep Neural Networks and Tree Search

    Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016

  46. [46]

    Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

    Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017

  47. [47]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

  48. [48]

    Teaching code LLMs to use autocompletion tools in repository-level code generation

    Wang, C., Zhang, J., Feng, Y., Li, T., Sun, W., Liu, Y., and Peng, X. Teaching code LLMs to use autocompletion tools in repository-level code generation. ACM Transactions on Software Engineering and Methodology, 34(7):1–27, 2025

  49. [49]

    Planning in natural language improves llm search for code generation

    Wang, E., Cassano, F., Wu, C., Bai, Y., Song, W., Nath, V., Han, Z., Hendryx, S., Yue, S., and Zhang, H. Planning in natural language improves LLM search for code generation. arXiv preprint arXiv:2409.03733, 2024

  50. [50]

    Reward-centered rest-mcts: A robust decision-making framework for robotic manipulation in high uncertainty environments

    Wang, X. Reward-centered rest-mcts: A robust decision-making framework for robotic manipulation in high uncertainty environments. arXiv preprint arXiv:2503.05226, 2025

  51. [51]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

  52. [52]

    Tailoring Diagnostic Modeling to Individual Learners: Personalized Distractor Generation via MCTS-Guided Reasoning Reconstruction

    Wu, T., Chen, J., Lin, W., Zhan, J., Li, M., Kuang, K., and Wu, F. Personalized distractor generation via mcts-guided reasoning reconstruction. arXiv preprint arXiv:2508.11184, 2025

  53. [53]

    Towards System 2 Reasoning in LLMs: Learning How to Think with Meta Chain-of-Thought

    Xiang, V., Snell, C., Gandhi, K., Albalak, A., Singh, A., Blagden, C., Phung, D., Rafailov, R., Lile, N., Mahan, D., et al. Towards system 2 reasoning in LLMs : Learning how to think with meta chain-of-thought. arXiv preprint arXiv:2501.04682, 2025

  54. [54]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a

  55. [55]

    An empirical study of retrieval-augmented code generation: Challenges and opportunities

    Yang, Z., Chen, S., Gao, C., Li, Z., Hu, X., Liu, K., and Xia, X. An empirical study of retrieval-augmented code generation: Challenges and opportunities. ACM Transactions on Software Engineering and Methodology, 2025b

  56. [56]

    Tree of thoughts: Deliberate problem solving with large language models

    Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36:11809–11822, 2023

  57. [57]

    Revisiting the Test-Time Scaling of o1-like Models: Do They Truly Possess Test-Time Scaling Capabilities?

    Zeng, Z., Cheng, Q., Yin, Z., Zhou, Y., and Qiu, X. Revisiting the test-time scaling of o1-like models: Do they truly possess test-time scaling capabilities? arXiv preprint arXiv:2502.12215, 2025

  58. [58]

    A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?

    Zhang, Q., Lyu, F., Sun, Z., Wang, L., Zhang, W., Hua, W., Wu, H., Guo, Z., Wang, Y., Muennighoff, N., et al. A survey on test-time scaling in large language models: What, how, where, and how well? arXiv preprint arXiv:2503.24235, 2025

  59. [59]

    Planning with Large Language Models for Code Generation

    Zhang, S., Chen, Z., Shen, Y., Ding, M., Tenenbaum, J. B., and Gan, C. Planning with large language models for code generation. arXiv preprint arXiv:2303.05510, 2023

  60. [60]

    Ldb: A Large Language Model Debugger via Verifying Runtime Execution Step-by-Step

    Zhong, L., Wang, Z., and Shang, J. Ldb: A large language model debugger via verifying runtime execution step-by-step. arXiv preprint arXiv:2402.16906, 2024

  61. [61]

    Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

    Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., and Wang, Y.-X. Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406, 2023