arxiv: 2506.08980 · v5 · submitted 2025-06-10 · 💻 cs.SE

AdaDec: An Uncertainty-Guided Lookahead Decoding Framework for LLM-Based Code Generation

Pith reviewed 2026-05-19 10:14 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLM code generation · adaptive decoding · uncertainty-guided decoding · lookahead decoding · Pass@1 accuracy · HumanEval · MBPP · token uncertainty

The pith

AdaDec triggers short lookaheads only at high-uncertainty code tokens to rerank candidates and raises Pass@1 accuracy by up to 20.9 points over greedy decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that many LLM code generation failures occur at specific steps where the model assigns high uncertainty yet still includes the correct token lower in its ranking. AdaDec learns a model-specific threshold to detect these steps, pauses generation, runs a brief lookahead to compare candidate continuations, and selects the better-ranked path. This selective intervention improves accuracy on HumanEval+, MBPP+, and DevEval while using far less compute than full beam search or other adaptive methods. A sympathetic reader would care because it turns an observed pattern in token uncertainty into a practical fix that makes LLM coding assistants more reliable without sacrificing speed.

Core claim

Token ranking mistakes at high-uncertainty decision points cause many generation errors in code, because the correct token is often present in the distribution but not chosen first; AdaDec counters this by learning an uncertainty threshold that triggers a pause-then-rerank step using short lookahead, selecting the continuation that better matches the intended program logic and thereby lifting Pass@1 scores substantially above both greedy decoding and prior adaptive baselines.

What carries the argument

token-level pause-then-rerank mechanism driven by learned model-specific uncertainty thresholds
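
The pause-then-rerank idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Model` callable, the toy transition table, the greedy rollout, and the summed log-probability score are all hypothetical stand-ins, and the threshold is hand-picked rather than learned.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a "model" maps a token prefix to a next-token distribution.
Model = Callable[[List[str]], Dict[str, float]]
EOS = "<eos>"


def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


def lookahead_score(model: Model, prefix: List[str], cand: str, depth: int) -> float:
    """Log-prob of `cand` plus log-probs of a greedy rollout of up to `depth` tokens.

    Assumed scoring rule, chosen only for illustration.
    """
    score = math.log(model(prefix)[cand])
    seq = prefix + [cand]
    for _ in range(depth):
        if seq[-1] == EOS:
            break
        dist = model(seq)
        tok = max(dist, key=dist.get)
        score += math.log(dist[tok])
        seq.append(tok)
    return score


def decode(model: Model, prompt: List[str], threshold: float,
           top_k: int = 3, depth: int = 2, max_len: int = 16) -> Tuple[List[str], int]:
    """Greedy decoding that pauses and reranks the top-k tokens when entropy is high."""
    seq, pauses = list(prompt), 0
    while len(seq) < max_len:
        dist = model(seq)
        if entropy(dist) > threshold:
            # High uncertainty: pause and compare short continuations of each candidate.
            pauses += 1
            cands = sorted(dist, key=dist.get, reverse=True)[:top_k]
            tok = max(cands, key=lambda c: lookahead_score(model, seq, c, depth))
        else:
            # Low uncertainty: plain greedy step.
            tok = max(dist, key=dist.get)
        seq.append(tok)
        if tok == EOS:
            break
    return seq, pauses


# Toy model keyed on the last token; "A" is top-ranked but leads to an unsure continuation.
TABLE = {
    "def": {"A": 0.5, "B": 0.4, "C": 0.1},
    "A": {"x": 0.4, "y": 0.3, "z": 0.3},
    "B": {"ok": 0.99, EOS: 0.01},
    "C": {EOS: 1.0},
}


def toy_model(prefix: List[str]) -> Dict[str, float]:
    return TABLE.get(prefix[-1], {EOS: 1.0})


print(decode(toy_model, ["def"], threshold=float("inf")))  # (['def', 'A', 'x', '<eos>'], 0)
print(decode(toy_model, ["def"], threshold=0.9))           # (['def', 'B', 'ok', '<eos>'], 1)
```

In the toy run, greedy decoding commits to the top-ranked token, while the uncertainty-triggered variant pauses once, prefers the second-ranked token because its continuation is more confident, and then proceeds greedily.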

If this is right

  • Selective pausing preserves most of the speed of greedy decoding while correcting logic errors that uniform strategies miss.
  • The same threshold-learning approach can be applied to other code-generation models without retraining the underlying LLM.
  • Outperformance over beam search indicates that targeted reranking at uncertain points is more efficient than exhaustive search.
  • Consistent gains across HumanEval+, MBPP+, and DevEval suggest the method generalizes to both simple and realistic programming tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend naturally to other structured generation tasks such as math proofs or API call sequences where uncertainty also clusters at critical choice points.
  • If uncertainty thresholds prove stable across model sizes, the method could become a lightweight post-training adapter for any LLM used in code.
  • Developers might combine AdaDec with test-time verification to further reduce the chance that lookahead selects a locally plausible but globally incorrect path.

Load-bearing premise

Model uncertainty reliably marks the exact steps where the correct token sits in the distribution but not at the top, and a short lookahead can correct the ranking without creating new errors downstream.

What would settle it

On a benchmark where the correct tokens at high-uncertainty steps are frequently absent from the top-k candidates considered during lookahead, AdaDec would show no accuracy gain or would degrade relative to greedy decoding.
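
One way to probe this condition is to measure, over steps with known ground-truth tokens, how often a greedy mistake at a high-entropy step is still recoverable from the top-k candidates. The sketch below is a hypothetical diagnostic, not something the paper reports; the function names, the threshold, and the example distributions are illustrative placeholders.

```python
import math
from typing import Dict, List, Optional, Tuple


def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


def rank_of(dist: Dict[str, float], token: str) -> Optional[int]:
    """1-based rank of `token` by probability, or None if it is not in the distribution."""
    if token not in dist:
        return None
    ordered = sorted(dist, key=dist.get, reverse=True)
    return ordered.index(token) + 1


def recoverable_fraction(steps: List[Tuple[Dict[str, float], str]],
                         threshold: float, top_k: int) -> float:
    """Among high-entropy steps where the top-1 token is wrong, the share whose
    ground-truth token still appears within the top-k candidates."""
    wrong = recoverable = 0
    for dist, truth in steps:
        if entropy(dist) <= threshold:
            continue  # low-uncertainty step: no pause would be triggered
        rank = rank_of(dist, truth)
        if rank == 1:
            continue  # greedy is already correct here
        wrong += 1
        if rank is not None and rank <= top_k:
            recoverable += 1
    return recoverable / wrong if wrong else 0.0


# Placeholder (distribution, ground-truth token) pairs, not data from the paper.
steps = [
    ({"a": 0.40, "b": 0.35, "c": 0.25}, "b"),             # wrong top-1, truth at rank 2
    ({"a": 0.40, "b": 0.30, "c": 0.20, "d": 0.10}, "d"),  # wrong top-1, truth at rank 4
    ({"a": 0.97, "b": 0.03}, "b"),                        # low entropy, never paused
    ({"a": 0.55, "b": 0.45}, "a"),                        # high entropy, greedy correct
]
print(recoverable_fraction(steps, threshold=0.6, top_k=3))  # 0.5
```

A value near zero on a real benchmark would indicate that reranking has little to work with, which is the failure case described above.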

Figures

Figures reproduced from arXiv: 2506.08980 by Chong Wang, Kaifeng He, Mingwei Liu, Xin Peng, Yanlin Wang, Zibin Zheng, Zike Li.

Figure 1. Entropy comparison between drift points and non-drift decoding
Figure 3. Change in the average rank of ground-truth tokens above and below
Figure 4. Approach Overview of ADADEC. Finding 2: The observed correlation between entropy and the rank of the ground-truth token suggests that entropy can be used as an indicator to adaptively pause the decoding process and rerank uncertain tokens. However, our entropy percentile analysis shows that it is difficult to define a universal, fixed entropy threshold across all models that effectively balances pause frequ…
Figure 5. A case study from HumanEval. At a certain point, the model DS-1.3B has produced the initial structure shown in the “Current Seq” portion. At this stage, it must decide how to proceed. A standard greedy decoding checks islower() and isupper() directly on all keys, implicitly assuming that the keys are strings. This assumption is unsafe: if any key is a non-string (e.g., an integer), a runtime error will occu…
Original abstract

Code generation with large language models (LLMs) is highly sensitive to token selection during decoding, particularly at uncertain decision points that influence program logic. While standard strategies such as greedy decoding treat all tokens uniformly, they overlook code-specific uncertainty patterns, leading to suboptimal performance. This paper presents an empirical study revealing that many generation errors stem from token ranking mistakes at high-uncertainty steps, where the correct token is present but not top-ranked. Motivated by these findings, we propose AdaDec, a lookahead-based uncertainty-guided adaptive decoding framework that integrates a token-level pause-then-rerank mechanism driven by token uncertainty. AdaDec learns model-specific uncertainty thresholds and applies a lookahead-based reranking strategy when uncertainty is high. Experiments on HumanEval+, MBPP+, and DevEval benchmarks show that AdaDec improves Pass@1 accuracy by up to 20.9% in absolute terms over greedy decoding. More importantly, it consistently outperforms both competitive baselines like Beam Search and state-of-the-art adaptive decoding methods such as AdapT, while maintaining high efficiency through selective, uncertainty-triggered pausing. Our results highlight the promise of uncertainty-aware adaptive decoding for improving both the reliability and efficiency of LLM-based code generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes AdaDec, an uncertainty-guided lookahead decoding framework for LLM-based code generation. It presents an empirical study showing that many generation errors arise from token ranking mistakes at high-uncertainty steps where the correct token is present but not top-ranked. AdaDec learns model-specific uncertainty thresholds and applies a selective pause-then-rerank mechanism with short lookahead at high-uncertainty tokens. Experiments on HumanEval+, MBPP+, and DevEval report up to 20.9% absolute Pass@1 improvement over greedy decoding, with consistent outperformance of Beam Search and AdapT while preserving efficiency through selective intervention.

Significance. If the empirical results and mechanism hold under closer scrutiny, the work is significant for LLM-based code generation. It offers a targeted, efficiency-preserving alternative to uniform decoding strategies by focusing interventions on uncertain decision points that affect program logic. The benchmark gains and selective application provide a plausible path toward more reliable code synthesis without the full cost of beam search or similar methods.

major comments (2)
  1. §4 (Experiments): The central claim of up to 20.9% absolute Pass@1 improvement lacks reporting of the number of runs, standard deviations, or confidence intervals for the gains on HumanEval+, MBPP+, and DevEval. Without these, it is impossible to determine whether the reported outperformance over greedy decoding, Beam Search, and AdapT is robust or could be explained by variance.
  2. §3 (Method): The procedure for learning model-specific uncertainty thresholds is described at a high level but does not specify the exact uncertainty metric (e.g., entropy, negative log-probability of the top token), the validation data used for threshold selection, or the optimization criterion. This detail is load-bearing for the adaptive claim and for reproducibility of the selective lookahead trigger.
minor comments (2)
  1. Abstract: The maximum 20.9% gain is stated without indicating the specific benchmark on which it occurs; adding this would improve clarity for readers.
  2. §2 (Related Work): The comparison to AdapT would benefit from a brief statement of how AdaDec's uncertainty-triggered lookahead differs mechanistically from AdapT's adaptation strategy.
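
The statistics requested in the first major comment could be produced along the following lines. This is a minimal sketch that assumes Pass@1 is available as one score per independent run; the function name and the example scores are hypothetical placeholders, not results from the paper, and the interval uses a normal approximation (a t-interval would be more appropriate for very few runs).

```python
import math
import statistics
from typing import List, Tuple


def summarize_pass_at_1(scores: List[float], z: float = 1.96) -> Tuple[float, float, Tuple[float, float]]:
    """Return (mean, sample std, approximate 95% CI) for per-run Pass@1 scores."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    half_width = z * std / math.sqrt(len(scores))
    return mean, std, (mean - half_width, mean + half_width)


# Placeholder per-run scores for two decoding strategies (illustrative only).
runs = {
    "greedy": [0.50, 0.60, 0.70],
    "adaptive": [0.65, 0.70, 0.75],
}
for name, scores in runs.items():
    mean, std, (lo, hi) = summarize_pass_at_1(scores)
    print(f"{name}: mean={mean:.3f} std={std:.3f} ci=({lo:.3f}, {hi:.3f})")
```

Overlapping intervals between two methods would suggest that a reported gain may be within run-to-run variance.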

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results and methods.

  1. Referee: §4 (Experiments): The central claim of up to 20.9% absolute Pass@1 improvement lacks reporting of the number of runs, standard deviations, or confidence intervals for the gains on HumanEval+, MBPP+, and DevEval. Without these, it is impossible to determine whether the reported outperformance over greedy decoding, Beam Search, and AdapT is robust or could be explained by variance.

    Authors: We agree that the current manuscript does not report the number of runs or associated statistical measures such as standard deviations and confidence intervals. This omission limits the ability to fully assess robustness. In the revised version, we will update Section 4 to include results from multiple independent runs and report mean Pass@1 scores with standard deviations and confidence intervals for all benchmarks and baselines. These additions will allow readers to evaluate whether the observed gains are consistent or attributable to variance.

    revision: yes

  2. Referee: §3 (Method): The procedure for learning model-specific uncertainty thresholds is described at a high level but does not specify the exact uncertainty metric (e.g., entropy, negative log-probability of the top token), the validation data used for threshold selection, or the optimization criterion. This detail is load-bearing for the adaptive claim and for reproducibility of the selective lookahead trigger.

    Authors: We acknowledge that the description of threshold learning in Section 3 is high-level and omits key implementation details. In the revised manuscript, we will expand this section to specify the uncertainty metric, the validation data employed for threshold selection, and the optimization criterion used. These clarifications will improve reproducibility and better support the adaptive claims of AdaDec.

    revision: yes
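
To make concrete what a fully specified threshold-learning procedure might look like, here is one possible instantiation. Every choice in it is an assumption made for illustration, since the review notes these details are not specified: entropy as the uncertainty metric, validation records of the form (entropy, greedy token was wrong), and F1 as the selection criterion. The records are placeholders, not data from the paper.

```python
from typing import List, Tuple

# Hypothetical validation record: (entropy at a step, whether the greedy token was wrong).
Record = Tuple[float, bool]


def f1_at(records: List[Record], threshold: float) -> float:
    """F1 of the rule 'pause when entropy > threshold' for detecting greedy mistakes."""
    tp = sum(1 for h, wrong in records if h > threshold and wrong)
    fp = sum(1 for h, wrong in records if h > threshold and not wrong)
    fn = sum(1 for h, wrong in records if h <= threshold and wrong)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def learn_threshold(records: List[Record]) -> float:
    """Pick the observed entropy value that maximizes F1 on the validation records."""
    candidates = sorted({h for h, _ in records})
    return max(candidates, key=lambda t: f1_at(records, t))


# Placeholder validation records for one model.
records = [
    (0.1, False), (0.2, False), (0.3, False), (0.5, False),
    (0.8, True), (0.9, False), (1.0, True), (1.3, True),
]
print(learn_threshold(records))  # 0.5
```

Because the records come from one model's own token distributions, rerunning the sweep per model yields a model-specific threshold; a different criterion (for example, a cap on pause frequency) would change the selected value.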

Circularity Check

0 steps flagged

No significant circularity; empirical method with held-out evaluation

The paper presents an empirical study of token-level uncertainty in LLM code generation and introduces the AdaDec framework, which learns model-specific uncertainty thresholds from data and applies selective lookahead reranking. No equations, derivations, or self-citations are provided that reduce the claimed Pass@1 improvements to the fitted thresholds or inputs by construction. Performance is measured on separate held-out benchmarks (HumanEval+, MBPP+, DevEval), making the evaluation independent of the fitting process. The approach is benchmark-driven rather than a closed-form derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the assumption that model-provided token probabilities yield a usable uncertainty signal for code tokens and that short lookahead can resolve ranking errors without side effects. No new physical or mathematical entities are introduced.

free parameters (1)
  • model-specific uncertainty threshold
    Learned per model to decide when to trigger lookahead; value not reported in abstract.
axioms (1)
  • domain assumption: High uncertainty at a token step indicates the correct token is present but not top-ranked.
    Stated in the empirical study section of the abstract.

pith-pipeline@v0.9.0 · 5760 in / 1345 out tokens · 32327 ms · 2026-05-19T10:14:58.372715+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

    cs.SE · 2026-05 · accept · novelty 6.0

    A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.

  2. Knowledge-Graph-Driven Data Synthesis for Low-Resource Software Development: A HarmonyOS Case Study

    cs.SE · 2025-11 · unverdicted · novelty 6.0

    APIKG4Syn synthesizes API-oriented training data via knowledge graphs and Monte Carlo search to fine-tune a 7B model that reaches 25% pass@1 on HarmonyOS code generation, beating untuned GPT-4o at 17.59%.

  3. Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning

    cs.CL · 2026-04 · unverdicted · novelty 5.0

    STOP is a new learnable internal path-pruning technique that improves efficiency and accuracy of parallel reasoning in LRMs under fixed compute budgets.
