AdaDec: A Uncertainty-Guided Lookahead Decoding Framework for LLM-Based Code Generation

Chong Wang; Kaifeng He; Mingwei Liu; Xin Peng; Yanlin Wang; Zibin Zheng; Zike Li

arxiv: 2506.08980 · v5 · submitted 2025-06-10 · 💻 cs.SE

Kaifeng He, Mingwei Liu, Chong Wang, Zike Li, Yanlin Wang, Xin Peng, Zibin Zheng

Pith reviewed 2026-05-19 10:14 UTC · model grok-4.3

keywords LLM code generation · adaptive decoding · uncertainty-guided decoding · lookahead decoding · Pass@1 accuracy · HumanEval · MBPP · token uncertainty

The pith

AdaDec triggers short lookaheads only at high-uncertainty code tokens to rerank candidates and raises Pass@1 accuracy by up to 20.9 points over greedy decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that many LLM code generation failures occur at specific steps where the model assigns high uncertainty yet still includes the correct token lower in its ranking. AdaDec learns a model-specific threshold to detect these steps, pauses generation, runs a brief lookahead to compare candidate continuations, and selects the better-ranked path. This selective intervention improves accuracy on HumanEval+, MBPP+, and DevEval while using far less compute than full beam search or other adaptive methods. A sympathetic reader would care because it turns an observed pattern in token uncertainty into a practical fix that makes LLM coding assistants more reliable without sacrificing speed.

Core claim

Token ranking mistakes at high-uncertainty decision points cause many generation errors in code, because the correct token is often present in the distribution but not chosen first; AdaDec counters this by learning an uncertainty threshold that triggers a pause-then-rerank step using short lookahead, selecting the continuation that better matches the intended program logic and thereby lifting Pass@1 scores substantially above both greedy decoding and prior adaptive baselines.
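
The pause-then-rerank idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper's implementation: the toy lookup-table model, the token names, the fixed threshold of 0.5, the top-k of 3, the lookahead length of 2, and the scoring rule (candidate log-probability plus the log-probability of a greedy rollout).

```python
import math
from typing import Callable, Dict, List, Tuple

Dist = Dict[str, float]               # token -> probability
Model = Callable[[List[str]], Dist]   # prefix -> next-token distribution


def entropy(dist: Dist) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def greedy_rollout(model: Model, prefix: List[str], steps: int) -> float:
    """Extend the prefix greedily and return the summed log-probability."""
    seq, total = list(prefix), 0.0
    for _ in range(steps):
        dist = model(seq)
        tok = max(dist, key=dist.get)
        total += math.log(dist[tok])
        seq.append(tok)
        if tok == "<eos>":
            break
    return total


def adaptive_decode(model: Model, prompt: List[str], threshold: float,
                    top_k: int = 3, lookahead: int = 2,
                    max_len: int = 20) -> Tuple[List[str], int]:
    """Greedy decoding that pauses and reranks when entropy exceeds threshold."""
    seq, pauses = list(prompt), 0
    while len(seq) - len(prompt) < max_len:
        dist = model(seq)
        if entropy(dist) > threshold:
            pauses += 1
            candidates = sorted(dist, key=dist.get, reverse=True)[:top_k]

            def score(tok: str) -> float:
                # Hypothetical scoring rule: own log-prob plus rollout log-prob.
                own = math.log(dist[tok])
                if tok == "<eos>":
                    return own
                return own + greedy_rollout(model, seq + [tok], lookahead)

            tok = max(candidates, key=score)
        else:
            tok = max(dist, key=dist.get)   # ordinary greedy step
        seq.append(tok)
        if tok == "<eos>":
            break
    return seq[len(prompt):], pauses


# Hypothetical toy model: the top-ranked token "x" leads to a weak continuation.
TABLE: Dict[Tuple[str, ...], Dist] = {
    (): {"if": 0.9, "for": 0.1},
    ("if",): {"x": 0.55, "isinstance": 0.45},
    ("if", "x"): {"a": 0.4, "b": 0.3, "c": 0.3},
    ("if", "isinstance"): {"(": 0.99, "<eos>": 0.01},
}


def toy_model(prefix: List[str]) -> Dist:
    return TABLE.get(tuple(prefix), {"<eos>": 1.0})


if __name__ == "__main__":
    print(adaptive_decode(toy_model, [], threshold=0.5))
    print(adaptive_decode(toy_model, [], threshold=float("inf")))  # pure greedy
```

In this toy setting the only pause happens after `if`, where the two candidates are nearly tied. The rollout favours `isinstance` because its continuation is far more confident, so the second-ranked token wins. With an infinite threshold the function reduces to plain greedy decoding.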

What carries the argument

token-level pause-then-rerank mechanism driven by learned model-specific uncertainty thresholds

If this is right

Selective pausing preserves most of the speed of greedy decoding while correcting logic errors that uniform strategies miss.
The same threshold-learning approach can be applied to other code-generation models without retraining the underlying LLM.
Outperformance over beam search indicates that targeted reranking at uncertain points is more efficient than exhaustive search.
Consistent gains across HumanEval+, MBPP+, and DevEval suggest the method generalizes to both simple and realistic programming tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend naturally to other structured generation tasks such as math proofs or API call sequences where uncertainty also clusters at critical choice points.
If uncertainty thresholds prove stable across model sizes, the method could become a lightweight post-training adapter for any LLM used in code.
Developers might combine AdaDec with test-time verification to further reduce the chance that lookahead selects a locally plausible but globally incorrect path.

Load-bearing premise

Model uncertainty reliably marks the exact steps where the correct token sits in the distribution but not at the top, and a short lookahead can correct the ranking without creating new errors downstream.

What would settle it

On a benchmark where the correct token at high-uncertainty steps is frequently absent from the top-k candidates considered during lookahead, AdaDec would show no accuracy gain over greedy decoding, or would degrade relative to it.
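
One way to probe this condition is to measure how often the correct token is still reachable at the steps where greedy decoding errs under high uncertainty. The sketch below assumes per-step records of the form (entropy, rank of the ground-truth token), with rank 1 meaning top-ranked. The record format, the function name `recoverable_fraction`, and the example numbers are hypothetical.

```python
from typing import List, Tuple


def recoverable_fraction(steps: List[Tuple[float, int]],
                         threshold: float, top_k: int) -> float:
    """Among high-entropy steps where greedy picks the wrong token,
    return the share whose ground-truth token is still within top_k."""
    errors = [rank for ent, rank in steps if ent > threshold and rank > 1]
    if not errors:
        return float("nan")   # no high-entropy greedy errors to analyse
    return sum(rank <= top_k for rank in errors) / len(errors)


if __name__ == "__main__":
    # Hypothetical records: (entropy, rank of the ground-truth token).
    records = [(0.1, 1), (1.2, 2), (1.5, 7), (0.9, 3), (0.2, 4)]
    print(recoverable_fraction(records, threshold=0.5, top_k=3))
```

A value near zero on a given benchmark would correspond to the failure condition described above, since reranking the top-k candidates cannot recover a token that is not among them.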

Figures

Figures reproduced from arXiv: 2506.08980 by Chong Wang, Kaifeng He, Mingwei Liu, Xin Peng, Yanlin Wang, Zibin Zheng, Zike Li.

**Figure 3.** Change in the average rank of ground-truth tokens above and below …

**Figure 4.** Approach Overview of ADADEC. Finding 2: The observed correlation between entropy and the rank of the ground-truth token suggests that entropy can be used as an indicator to adaptively pause the decoding process and rerank uncertain tokens. However, our entropy percentile analysis shows that it is difficult to define a universal, fixed entropy threshold across all models that effectively balances pause frequ…

**Figure 5.** A case study from HumanEval. At a certain point, the model DS-1.3B has produced the initial structure shown in the “Current Seq” portion. At this stage, it must decide how to proceed. A standard greedy decoding checks islower() and isupper() directly on all keys, implicitly assuming that the keys are strings. This assumption is unsafe: if any key is a non-string (e.g., an integer), a runtime error will occu…

Abstract

Code generation with large language models (LLMs) is highly sensitive to token selection during decoding, particularly at uncertain decision points that influence program logic. While standard strategies such as greedy decoding treat all tokens uniformly, they overlook code-specific uncertainty patterns, leading to suboptimal performance. This paper presents an empirical study revealing that many generation errors stem from token ranking mistakes at high-uncertainty steps, where the correct token is present but not top-ranked. Motivated by these findings, we propose AdaDec, a lookahead-based uncertainty-guided adaptive decoding framework that integrates a token-level pause-then-rerank mechanism driven by token uncertainty. AdaDec learns model-specific uncertainty thresholds and applies a lookahead-based reranking strategy when uncertainty is high. Experiments on HumanEval+, MBPP+, and DevEval benchmarks show that AdaDec improves Pass@1 accuracy by up to 20.9% in absolute terms over greedy decoding. More importantly, it consistently outperforms both competitive baselines like Beam Search and state-of-the-art adaptive decoding methods such as AdapT, while maintaining high efficiency through selective, uncertainty-triggered pausing. Our results highlight the promise of uncertainty-aware adaptive decoding for improving both the reliability and efficiency of LLM-based code generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AdaDec gets up to 20.9% absolute Pass@1 gains on code benchmarks by triggering short lookahead reranks only at high-uncertainty tokens, but the method details and stability checks are thin.

The main takeaway is that selectively pausing for lookahead reranking at uncertain tokens lifts Pass@1 by as much as 20.9% over greedy decoding on HumanEval+, MBPP+, and DevEval while beating both beam search and AdapT at lower average cost.

The paper first shows through error analysis that many code generation failures occur when the correct token sits in the distribution but is not ranked first at high-uncertainty steps. They then learn a model-specific threshold and apply a short lookahead rerank only when that threshold is crossed. This selective trigger is the practical hook, since it avoids the full cost of beam search on every token. The empirical results line up with the mechanism they describe, and the efficiency claim holds because the intervention stays infrequent.

What is new is the focused application to code generation with per-model thresholds rather than a generic adaptive rule. They cite the relevant baselines cleanly and the gains appear consistent across the three benchmarks. The work is straightforward and the selective design makes sense for real code assistants.

The soft spots are mostly in the missing specifics. The abstract does not name the exact uncertainty measure or show how the threshold is learned without leaking test information. There are also no error bars or run-to-run variance reported, which makes it harder to judge whether the 20.9% figure is stable. The core assumption that uncertainty reliably flags fixable ranking errors seems to work in their experiments, but it would be useful to see whether the rerank ever introduces new downstream errors on longer programs.

This paper is for engineers and researchers who build or tune LLM code tools and want accuracy improvements without paying beam-search overhead on every step. A reader working on adaptive decoding would pick up the empirical pattern and the selective idea. I would send it for peer review. The benchmark numbers are large enough and the approach is concrete enough that referees should check the implementation details and controls.

Referee Report

2 major / 2 minor

Summary. The paper proposes AdaDec, an uncertainty-guided lookahead decoding framework for LLM-based code generation. It presents an empirical study showing that many generation errors arise from token ranking mistakes at high-uncertainty steps where the correct token is present but not top-ranked. AdaDec learns model-specific uncertainty thresholds and applies a selective pause-then-rerank mechanism with short lookahead at high-uncertainty tokens. Experiments on HumanEval+, MBPP+, and DevEval report up to 20.9% absolute Pass@1 improvement over greedy decoding, with consistent outperformance of Beam Search and AdapT while preserving efficiency through selective intervention.

Significance. If the empirical results and mechanism hold under closer scrutiny, the work is significant for LLM-based code generation. It offers a targeted, efficiency-preserving alternative to uniform decoding strategies by focusing interventions on uncertain decision points that affect program logic. The benchmark gains and selective application provide a plausible path toward more reliable code synthesis without the full cost of beam search or similar methods.

major comments (2)

[§4] §4 (Experiments): The central claim of up to 20.9% absolute Pass@1 improvement lacks reporting of the number of runs, standard deviations, or confidence intervals for the gains on HumanEval+, MBPP+, and DevEval. Without these, it is impossible to determine whether the reported outperformance over greedy decoding, Beam Search, and AdapT is robust or could be explained by variance.
[§3] §3 (Method): The procedure for learning model-specific uncertainty thresholds is described at a high level but does not specify the exact uncertainty metric (e.g., entropy, negative log-probability of the top token), the validation data used for threshold selection, or the optimization criterion. This detail is load-bearing for the adaptive claim and for reproducibility of the selective lookahead trigger.
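
The kind of interval requested in the first major comment could be computed with a paired bootstrap over benchmark problems. This is a sketch under stated assumptions: the per-problem pass/fail vectors are hypothetical, the function name `paired_bootstrap_ci` is invented for illustration, and a percentile bootstrap is only one of several reasonable choices. Because greedy decoding is deterministic, resampling problems captures sampling variability of the benchmark rather than run-to-run variance.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(base: List[int], new: List[int],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for the Pass@1 difference (new minus base).

    base and new hold 0/1 pass indicators for the same problems, same order.
    Returns (observed difference, lower bound, upper bound).
    """
    if len(base) != len(new) or not base:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(base)
    diffs = [b - a for a, b in zip(base, new)]   # per-problem paired difference
    observed = sum(diffs) / n
    stats = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return observed, lower, upper


if __name__ == "__main__":
    # Hypothetical outcomes on 20 problems: 1 = all tests passed.
    greedy = [0] * 10 + [1] * 10
    adaptive = [1] * 5 + [0] * 5 + [1] * 10
    print(paired_bootstrap_ci(greedy, adaptive))
```

An interval that excludes zero would suggest the gain is not explained by which problems happen to be in the benchmark. It would not by itself address sensitivity to threshold choice or sampling temperature.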

minor comments (2)

[Abstract] Abstract: The maximum 20.9% gain is stated without indicating the specific benchmark on which it occurs; adding this would improve clarity for readers.
[§2] §2 (Related Work): The comparison to AdapT would benefit from a brief statement of how AdaDec's uncertainty-triggered lookahead differs mechanistically from AdapT's adaptation strategy.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results and methods.

Referee: [§4] §4 (Experiments): The central claim of up to 20.9% absolute Pass@1 improvement lacks reporting of the number of runs, standard deviations, or confidence intervals for the gains on HumanEval+, MBPP+, and DevEval. Without these, it is impossible to determine whether the reported outperformance over greedy decoding, Beam Search, and AdapT is robust or could be explained by variance.

Authors: We agree that the current manuscript does not report the number of runs or associated statistical measures such as standard deviations and confidence intervals. This omission limits the ability to fully assess robustness. In the revised version, we will update Section 4 to include results from multiple independent runs and report mean Pass@1 scores with standard deviations and confidence intervals for all benchmarks and baselines. These additions will allow readers to evaluate whether the observed gains are consistent or attributable to variance. revision: yes
Referee: [§3] §3 (Method): The procedure for learning model-specific uncertainty thresholds is described at a high level but does not specify the exact uncertainty metric (e.g., entropy, negative log-probability of the top token), the validation data used for threshold selection, or the optimization criterion. This detail is load-bearing for the adaptive claim and for reproducibility of the selective lookahead trigger.

Authors: We acknowledge that the description of threshold learning in Section 3 is high-level and omits key implementation details. In the revised manuscript, we will expand this section to specify the uncertainty metric, the validation data employed for threshold selection, and the optimization criterion used. These clarifications will improve reproducibility and better support the adaptive claims of AdaDec. revision: yes
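
To make the three missing specifics concrete, here is one possible shape such a procedure could take. It is not the paper's procedure, which this page does not describe. The sketch assumes entropy as the uncertainty metric, a validation set of (entropy, top-1 token is wrong) records drawn from data disjoint from the evaluation benchmarks, and F1 as the optimization criterion. The function name `select_threshold` and the example records are hypothetical.

```python
from typing import List, Tuple


def select_threshold(records: List[Tuple[float, bool]]) -> float:
    """Pick the entropy cutoff t that maximizes F1 for the rule
    'pause when entropy >= t', where a positive is a step whose
    top-1 token is wrong."""
    best_t, best_f1 = float("inf"), -1.0
    for t in sorted({ent for ent, _ in records}):
        tp = sum(1 for ent, wrong in records if ent >= t and wrong)
        fp = sum(1 for ent, wrong in records if ent >= t and not wrong)
        fn = sum(1 for ent, wrong in records if ent < t and wrong)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:          # strict '>' keeps the lowest cutoff on ties
            best_t, best_f1 = t, f1
    return best_t


if __name__ == "__main__":
    # Hypothetical validation records: (entropy, top-1 token is wrong).
    validation = [(0.1, False), (0.2, False), (0.4, False), (0.9, True),
                  (1.1, False), (1.3, True), (1.8, True)]
    print(select_threshold(validation))
```

Fitting the cutoff separately for each model is what would make the threshold model-specific. Swapping F1 for another criterion changes how pause frequency is traded against missed errors.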

Circularity Check

0 steps flagged

No significant circularity; empirical method with held-out evaluation

The paper presents an empirical study of token-level uncertainty in LLM code generation and introduces the AdaDec framework, which learns model-specific uncertainty thresholds from data and applies selective lookahead reranking. No equations, derivations, or self-citations are provided that reduce the claimed Pass@1 improvements to the fitted thresholds or inputs by construction. Performance is measured on separate held-out benchmarks (HumanEval+, MBPP+, DevEval), making the evaluation independent of the fitting process. The approach is benchmark-driven rather than a closed-form derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the assumption that model-provided token probabilities yield a usable uncertainty signal for code tokens and that short lookahead can resolve ranking errors without side effects. No new physical or mathematical entities are introduced.

free parameters (1)

model-specific uncertainty threshold
Learned per model to decide when to trigger lookahead; value not reported in abstract.

axioms (1)

domain assumption High uncertainty at a token step indicates the correct token is present but not top-ranked.
Stated in the empirical study section of the abstract.

pith-pipeline@v0.9.0 · 5760 in / 1345 out tokens · 32327 ms · 2026-05-19T10:14:58.372715+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

ADADEC … uses an entropy-guided pause-then-rerank mechanism based on learned, model-specific thresholds and a lookahead strategy

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
cs.SE 2026-05 accept novelty 6.0

A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
Knowledge-Graph-Driven Data Synthesis for Low-Resource Software Development: A HarmonyOS Case Study
cs.SE 2025-11 unverdicted novelty 6.0

APIKG4Syn synthesizes API-oriented training data via knowledge graphs and Monte Carlo search to fine-tune a 7B model that reaches 25% pass@1 on HarmonyOS code generation, beating untuned GPT-4o at 17.59%.
Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning
cs.CL 2026-04 unverdicted novelty 5.0

STOP is a new learnable internal path-pruning technique that improves efficiency and accuracy of parallel reasoning in LRMs under fixed compute budgets.
