arXiv: 2605.20473 · v1 · pith:G5RGKBWBnew · submitted 2026-05-19 · 💻 cs.SE · cs.AI · cs.LG

Code Generation by Differential Test Time Scaling

Pith reviewed 2026-05-21 06:39 UTC · model grok-4.3

classification 💻 cs.SE cs.AI cs.LG
keywords code generation, test-time scaling, coverage-guided fuzzing, behavioral clustering, differential analysis, LLM inference efficiency, agentic coding

The pith

DiffCodeGen selects the best code candidate by clustering execution behaviors on fuzzing-generated inputs without any extra LLM calls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DiffCodeGen as a test-time scaling approach for code generation that creates diverse candidates through varied sampling and prompting, then uses coverage-guided fuzzing to produce inputs without any pre-existing tests or additional model inference. Candidates run on these inputs so their dynamic behaviors can be compared and grouped into clusters; the medoid of the largest cluster becomes the output. This method avoids the token and time costs of prior scaling techniques that depend on public tests or repeated LLM judgments, while remaining fully asynchronous and compatible with agentic workflows. A sympathetic reader cares because the approach promises higher-quality code from existing models at a small fraction of the usual inference overhead.

Core claim

DiffCodeGen generates diverse code candidates using various sampling and prompting strategies, applies coverage-guided fuzzing to synthesize inputs without requiring existing tests or large language models, executes all candidates on these inputs to capture dynamic behavior, clusters candidates by behavioral similarity, and selects the medoid of the largest cluster as the final output. Unlike prior methods, this selection uses no extra model calls and therefore incurs little to no additional token consumption; the process is fully asynchronous and naturally suited to agentic coding. Evaluations across four large language models show consistent gains over baselines and competitive or superior performance compared to state-of-the-art test-time scaling methods, using only a fraction of the time and tokens.
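To make the input-synthesis step concrete, here is a toy sketch of coverage-guided fuzzing in pure Python. It is not the paper's fuzzer: the integer mutator `mutate`, the target `classify`, the line-level coverage signal, and the round budget are all assumptions chosen for illustration. The only idea it shares with the description above is the feedback loop: a mutated input is kept only if it executes lines no earlier input reached.

```python
import random
import sys


def line_coverage(func, arg):
    """Return the set of (function name, line number) pairs executed by func(arg)."""
    covered = set()

    def tracer(frame, event, _arg):
        if event == "line":
            covered.add((frame.f_code.co_name, frame.f_lineno))
        return tracer  # keep tracing nested frames and subsequent lines

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(arg)
    except Exception:
        pass  # a crash is still behavior; keep the coverage gathered so far
    finally:
        sys.settrace(previous)
    return covered


def mutate(n, rng):
    """Hypothetical integer mutator: small arithmetic tweaks and a sign flip."""
    return rng.choice([n + 1, n - 1, -n, n * 2, n // 2])


def fuzz(func, seeds, rounds=200, rng_seed=0):
    """Grow a corpus by keeping only mutants that reach previously unseen lines."""
    rng = random.Random(rng_seed)
    corpus = list(seeds)
    seen = set()
    for seed in corpus:
        seen |= line_coverage(func, seed)
    for _ in range(rounds):
        child = mutate(rng.choice(corpus), rng)
        cov = line_coverage(func, child)
        if not cov <= seen:  # new coverage: the input is worth keeping
            seen |= cov
            corpus.append(child)
    return corpus


def classify(n):
    """Toy target with three branches for the fuzzer to discover."""
    if n < 0:
        return "negative"
    if n % 7 == 0:
        return "multiple of seven"
    return "other"


corpus = fuzz(classify, seeds=[1])
```

The resulting `corpus` contains one input per newly reached branch of `classify`. A production setup would more plausibly rely on an established fuzzer and a richer coverage signal than the line sets used here.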

What carries the argument

Coverage-guided differential analysis that synthesizes inputs via fuzzing, executes candidates to record behaviors, clusters by behavioral similarity, and selects the medoid of the largest cluster.
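The selection rule can be sketched in a few lines. The summary above does not state the distance metric or the clustering algorithm, so the choices here are assumptions: a behavior signature is the tuple of outputs (or exception type names) over the inputs, distance is the fraction of inputs on which two signatures disagree, and clusters are connected components under a distance `threshold`. All function names are hypothetical.

```python
def behavior_signature(func, inputs):
    """Run one candidate on every input; record its output or exception type."""
    sig = []
    for x in inputs:
        try:
            sig.append(("ok", repr(func(x))))
        except Exception as exc:
            sig.append(("err", type(exc).__name__))
    return tuple(sig)


def distance(a, b):
    """Fraction of inputs on which two signatures disagree (inputs must be non-empty)."""
    return sum(u != v for u, v in zip(a, b)) / len(a)


def select_candidate(candidates, inputs, threshold=0.0):
    """Return the index of the medoid of the largest behavioral cluster."""
    sigs = [behavior_signature(c, inputs) for c in candidates]
    n = len(sigs)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Link every pair whose behaviors are within the threshold (assumed rule).
    for i in range(n):
        for j in range(i + 1, n):
            if distance(sigs[i], sigs[j]) <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    largest = max(clusters.values(), key=len)
    # Medoid: the member with the smallest total distance to its cluster mates.
    return min(largest, key=lambda i: sum(distance(sigs[i], sigs[j]) for j in largest))


def double_mul(x):
    return x * 2


def double_add(x):
    return x + x


def square(x):
    return x ** 2  # agrees with doubling only on inputs 0 and 2


def double_shift(x):
    return x << 1


candidates = [double_mul, double_add, square, double_shift]
chosen = select_candidate(candidates, inputs=[0, 1, 2, 3])
```

Here the three doubling variants behave identically and form the largest cluster, so `chosen` indexes one of them and `square` is rejected. No model call is involved in the selection, which is the property the paper emphasizes.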

If this is right

  • Performance improves consistently across four different large language models without model-specific tuning.
  • Token and time costs remain a small fraction of those required by test-time scaling methods that use public tests or extra LLM inference for selection.
  • The method can be combined with reasoning models to produce further gains.
  • Because selection requires no additional model calls, the approach scales naturally to large numbers of candidates in asynchronous agentic coding setups.
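One possible reading of the asynchronous claim is that candidate executions are independent and can be dispatched concurrently. The sketch below illustrates that reading only; it is an assumption, not the paper's harness. `run_candidate` and `run_all` are hypothetical names, and the thread-based dispatch is a simplification (a real setup would likely isolate candidates in sandboxed processes with timeouts).

```python
import asyncio


async def run_candidate(func, inputs):
    """Execute one candidate on all inputs in a worker thread."""

    def work():
        out = []
        for x in inputs:
            try:
                out.append(repr(func(x)))
            except Exception as exc:
                out.append(f"<{type(exc).__name__}>")
        return tuple(out)

    return await asyncio.to_thread(work)


async def run_all(candidates, inputs):
    """Run every candidate concurrently; results keep the candidates' order."""
    return await asyncio.gather(*(run_candidate(c, inputs) for c in candidates))


if __name__ == "__main__":
    traces = asyncio.run(run_all([abs, lambda x: 1 // x], [0, -2]))
    print(traces)
```

Because no step waits on a model response, adding more candidates only adds more independent execution tasks.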

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reliance on automatically generated inputs could allow the technique to work in domains where public test suites are scarce or nonexistent.
  • Behavioral clusters might expose systematic error patterns shared across many generated solutions, offering a new diagnostic for code-generation failures.
  • The same differential-execution idea could be adapted to select among outputs in other generative tasks such as text or proof synthesis.

Load-bearing premise

The largest behavioral cluster identified from executions on the synthesized inputs reliably contains the correct or best code solution.

What would settle it

A test suite of held-out problems where the medoid of the largest cluster fails on the ground-truth tests while a candidate from a smaller cluster passes would show the selection rule does not reliably pick the best solution.
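The test described above can be phrased as a small check over per-problem results. This is a sketch under assumed data structures: `ProblemResult` and its fields are hypothetical, and the ground-truth pass/fail flags would come from held-out tests that the selection step never sees.

```python
from dataclasses import dataclass


@dataclass
class ProblemResult:
    clusters: list   # candidate indices grouped by behavior, e.g. [[0, 1, 2], [3]]
    selected: int    # index chosen by the rule (medoid of the largest cluster)
    passes: list     # ground-truth pass/fail flag per candidate


def is_counterexample(result):
    """True when the selected candidate fails but another cluster holds a passing one."""
    largest = max(result.clusters, key=len)
    if result.selected not in largest:
        raise ValueError("selected candidate is not in the largest cluster")
    passing_elsewhere = any(
        result.passes[i]
        for cluster in result.clusters
        if cluster is not largest
        for i in cluster
    )
    return (not result.passes[result.selected]) and passing_elsewhere


def counterexample_rate(results):
    """Fraction of problems on which the selection rule was beaten by a minority cluster."""
    return sum(is_counterexample(r) for r in results) / len(results)


results = [
    ProblemResult([[0, 1, 2], [3]], selected=0, passes=[False, False, False, True]),
    ProblemResult([[0, 1], [2]], selected=0, passes=[True, True, False]),
]
rate = counterexample_rate(results)
```

In this made-up data the first problem is a counterexample and the second is not. A rate that stays well above zero on real held-out problems would undercut the premise.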

Figures

Figures reproduced from arXiv: 2605.20473 by Ethan Wang, Hao Chen, Jicheng Wang, Xuanxin Ouyang, Yifeng He.

Figure 1: An overview of the DIFFCODEGEN approach. Here, “for free” refers to performing candidate selection without any additional LLM inference, incurring no extra token cost beyond the initial candidate generation.
Figure 2: Execution time comparison among different test-time scaling methods.
Figure 3: Token usage comparison among different test-time scaling methods. ‘Prompt’ uses input tokens, and ‘com…
Figure 4: Performance scaling with number of samples.
Original abstract

Test-time scaling has emerged as a promising approach for improving code generation by exploring large solution spaces at inference time. However, existing methods often rely on public test cases that are unavailable in practice, or require extensive LLM inference for candidate selection, leading to significant token consumption and time overhead. We present DiffCodeGen, a novel test-time scaling method for code generation based on coverage-guided differential analysis. DiffCodeGen generates diverse code candidates using various sampling and prompting strategies, then applies coverage-guided fuzzing to synthesize inputs without requiring any existing tests or large language models. By executing all candidates on these inputs, DiffCodeGen captures their dynamic behavior and clusters candidates based on behavioral similarity. DiffCodeGen selects the medoid of the largest cluster as the final output. Unlike prior test-time scaling methods that invoke additional LLM inference for candidate selection, DiffCodeGen performs selection without any extra model calls, incurring little to no additional token consumption. DiffCodeGen is fully asynchronous, naturally suited to the current trend of agentic coding, and is thus efficient and highly scalable. We evaluate DiffCodeGen across 4 large language models, demonstrating consistent improvements over baselines. Compared to state-of-the-art test-time scaling methods, DiffCodeGen achieves competitive or superior performance while using only a fraction of time and tokens. DiffCodeGen is model-agnostic and can be combined with reasoning models to further boost performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes DiffCodeGen, a test-time scaling method for code generation. It generates diverse code candidates via sampling and prompting strategies, synthesizes inputs using coverage-guided fuzzing without public tests or extra LLM calls, executes all candidates on these inputs to capture dynamic behavior, clusters candidates by behavioral similarity, and selects the medoid of the largest cluster as output. The paper claims consistent improvements over baselines across four LLMs, competitive or superior performance versus state-of-the-art test-time scaling methods while using only a fraction of the time and tokens, and emphasizes its model-agnostic, asynchronous design suitable for agentic coding.

Significance. If the central claims hold, this could represent a meaningful advance in efficient test-time scaling for code generation by eliminating reliance on public tests and additional model inference for selection. The coverage-guided fuzzing approach for differential behavior analysis is a notable technical choice that enables low-overhead selection. The model-agnostic property and potential combination with reasoning models are positive aspects. Reproducible evaluation across multiple LLMs would strengthen the contribution if detailed metrics confirm the efficiency gains.

major comments (2)
  1. [Abstract] Abstract: The claims of 'consistent improvements over baselines' and 'competitive or superior performance' are stated without any quantitative metrics, effect sizes, statistical significance tests, benchmark details, or evaluation protocol. This absence is load-bearing for assessing support of the central performance and efficiency claims.
  2. [Method] Method section (clustering and selection): The assumption that the largest behavioral cluster from coverage-guided fuzzing inputs reliably contains the correct or best solution is central to the no-extra-LLM-call efficiency argument. The manuscript should provide targeted analysis or counterexample experiments for cases where incorrect candidates share similar failure modes on the synthesized inputs, as this directly risks degrading accuracy while still claiming token/time savings.
minor comments (2)
  1. [Method] Provide explicit details on the coverage-guided fuzzing parameters, clustering distance metric, and candidate generation strategies to support reproducibility.
  2. [Evaluation] Ensure all baselines and comparison methods are clearly defined with references in the evaluation section.
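The risk raised in major comment 2 can be shown with a toy case. The example below is constructed for illustration and is not taken from the paper: five hypothetical candidates for "mean of a list", three of which share the same rounding mistake. Grouping by identical outputs (an assumed stand-in for the paper's clustering) makes the wrong behavior the majority.

```python
from collections import defaultdict

# Hypothetical candidates for "return the arithmetic mean of a non-empty list".
candidates = [
    lambda xs: sum(xs) / len(xs),                          # 0: correct
    lambda xs: sum(xs) // len(xs),                         # 1: floors the mean
    lambda xs: int(sum(xs) / len(xs)),                     # 2: truncates the mean
    lambda xs: (sum(xs) - sum(xs) % len(xs)) // len(xs),   # 3: floors the mean
    lambda xs: float(sum(xs)) / len(xs),                   # 4: correct
]
inputs = [[1, 2], [3, 4, 5]]

# Group candidates whose outputs agree on every input.
groups = defaultdict(list)
for idx, cand in enumerate(candidates):
    groups[tuple(cand(xs) for xs in inputs)].append(idx)

largest = max(groups.values(), key=len)
```

Here `largest` holds the three rounding candidates, so a largest-cluster rule would return an incorrect solution even though two correct candidates exist. The inputs do expose the behavioral difference; the failure comes from the majority being wrong, which is exactly the case the referee asks the authors to quantify.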

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below and indicate the revisions we will make.

  1. Referee: [Abstract] Abstract: The claims of 'consistent improvements over baselines' and 'competitive or superior performance' are stated without any quantitative metrics, effect sizes, statistical significance tests, benchmark details, or evaluation protocol. This absence is load-bearing for assessing support of the central performance and efficiency claims.

    Authors: We agree that the abstract would be strengthened by the inclusion of quantitative details. In the revised version, we will update the abstract to report key metrics from our evaluations, such as average pass rate improvements over baselines, token and runtime reductions relative to state-of-the-art test-time scaling methods, the specific benchmarks employed, and a brief note on the evaluation protocol. This will make the central claims more concrete and directly address the concern. revision: yes

  2. Referee: [Method] Method section (clustering and selection): The assumption that the largest behavioral cluster from coverage-guided fuzzing inputs reliably contains the correct or best solution is central to the no-extra-LLM-call efficiency argument. The manuscript should provide targeted analysis or counterexample experiments for cases where incorrect candidates share similar failure modes on the synthesized inputs, as this directly risks degrading accuracy while still claiming token/time savings.

    Authors: This is a fair and important point about the robustness of the clustering assumption. Our approach uses coverage-guided fuzzing to generate diverse inputs that aim to expose behavioral differences, and our multi-model experiments indicate that the largest cluster frequently aligns with correct solutions. To directly respond, we will add a targeted analysis subsection in the revised manuscript that examines cases of shared failure modes among incorrect candidates, reports observed frequencies, and discusses any impact on accuracy. We will include relevant examples and maintain an honest assessment of limitations. revision: yes

Circularity Check

0 steps flagged

No circularity: procedural pipeline with independent empirical evaluation

The paper describes DiffCodeGen as a sequence of steps—candidate generation via sampling/prompting, coverage-guided fuzzing for input synthesis, execution to capture behaviors, similarity-based clustering, and medoid selection of the largest cluster—without any equations, fitted parameters, or derivations that reduce the output to inputs by construction. Performance claims rest on external empirical results across four LLMs rather than self-referential definitions or self-citation chains. The core heuristic (largest cluster contains the best solution) is an explicit assumption open to falsification, not a tautology or renamed fit. This leaves the method self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on domain assumptions about the representativeness of fuzzing inputs and introduces unspecified parameters for candidate generation and clustering without independent evidence for their values.

free parameters (2)
  • candidate generation parameters
    Number and variety of sampling/prompting strategies used to produce diverse candidates.
  • fuzzing and clustering parameters
    Settings controlling input synthesis and behavioral similarity grouping.
axioms (1)
  • domain assumption: Synthesized fuzzing inputs suffice to expose behavioral differences that correlate with code correctness or quality.
    Invoked when clustering is used to identify the best candidate from execution traces.




Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · 6 internal anchors

  1. [1]

    2024.URL:https : / / github

    Inbal Shani and GitHub Staff.Survey reveals AI’s impact on the developer experience. 2024.URL:https : / / github . blog/news-insights/research/survey-reveals-ais-impact-on-the-developer-experience/

  2. [2]

    2024.URL:https: //github.blog/news-insights/research/survey-ai-wave-grows/

    Kyle Daigle and GitHub Staff.Survey: The AI wave continues to grow on software development teams. 2024.URL:https: //github.blog/news-insights/research/survey-ai-wave-grows/

  3. [3]

    Mark Chen et al.Evaluating Large Language Models Trained on Code. 2021. arXiv:2107.03374 [cs.LG].URL:https: //arxiv.org/abs/2107.03374. 16 Code Generation by Differential Test Time Scaling

  4. [4]

    Evaluating Large Language Models in Class-Level Code Generation

    Xueying Du et al. “Evaluating Large Language Models in Class-Level Code Generation”. In:Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. ICSE ’24. Lisbon, Portugal: Association for Computing Machinery, 2024.ISBN: 9798400702174.DOI:10 . 1145 / 3597503 . 3639219.URL:https : / / doi . org / 10 . 1145 / 3597503 . 3639219

  5. [5]

    DeepSeek-AI et al.DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. 2025. arXiv: 2501.12948 [cs.CL].URL:https://arxiv.org/abs/2501.12948

  6. [6]

    s1: Simple test-time scaling

    Niklas Muennighoff et al. “s1: Simple test-time scaling”. In:Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 20275–20321. ISBN: 979-8-89176-332-6.DOI:10.18653/v1/2025.emnlp-main.1025.URL:https://aclanthology.org/2025. emnlp-main.1025/

  7. [7]

    Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters for Reasoning

    Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. “Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters for Reasoning”. In:The Thirteenth International Conference on Learning Repre- sentations. 2025.URL:https://openreview.net/forum?id=4FWAwZtd2n

  8. [8]

    ACECODER: Acing Coder RL via Automated Test-Case Synthesis

    Huaye Zeng et al. “ACECODER: Acing Coder RL via Automated Test-Case Synthesis”. In:Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vienna, Austria: Association for Com- putational Linguistics, July 2025, pp. 12023–12040.ISBN: 979-8-89176-251-0.DOI:10.18653/v1/2025.acl-long.587. URL:https://a...

  9. [9]

    scrolling screenshot

    Dacheng Li et al. “S*: Test Time Scaling for Code Generation”. In:Findings of the Association for Computational Lin- guistics: EMNLP 2025. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 15964–15978.ISBN: 979-8-89176-335-7.DOI:10.18653/v1/2025.findings- emnlp.865.URL:https://aclanthology.org/2025. findings-emnlp.865/

  10. [10]

    Agent-RewardBench: Towards a unified benchmark for reward modeling across perception, planning, and safety in real- world multimodal agents,

    Xiancai Chen et al. “Revisit Self-Debugging with Self-Generated Tests for Code Generation”. In:Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vienna, Austria: Association for Computational Linguistics, July 2025, pp. 18003–18023.ISBN: 979-8-89176-251-0.DOI:10.18653/v1/2025.acl- long.881.URL...

  11. [11]

    Differential Testing for Software

    William M. McKeeman. “Differential Testing for Software”. In:Digit. Tech. J.10 (1998), pp. 100–107.URL:https : //api.semanticscholar.org/CorpusID:14018070

  12. [12]

    Differential testing: a new approach to change detection

    Robert B. Evans and Alberto Savoia. “Differential testing: a new approach to change detection”. In:The 6th Joint Meeting on European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engi- neering: Companion Papers. ESEC-FSE companion ’07. Dubrovnik, Croatia: Association for Computing Machinery, 2007, pp. 549–552...

  13. [13]

    Hunting for bugs in code coverage tools via randomized differential testing

    Yibiao Yang et al. “Hunting for bugs in code coverage tools via randomized differential testing”. In:Proceedings of the 41st International Conference on Software Engineering. ICSE ’19. Montreal, Quebec, Canada: IEEE Press, 2019, pp. 488–499. DOI:10.1109/ICSE.2019.00061.URL:https://doi.org/10.1109/ICSE.2019.00061

  14. [14]

    Emer, Mark A

    Shaohua Li and Zhendong Su. “Finding Unstable Code via Compiler-Driven Differential Testing”. In:Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. ASPLOS 2023. Vancouver, BC, Canada: Association for Computing Machinery, 2023, pp. 238–251.ISBN: 9781450399180.DOI:10.1145/...

  15. [15]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain et al. “LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code”. In: The Thirteenth International Conference on Learning Representations. 2025.URL:https://openreview.net/forum? id=chfJJYC3iL

  16. [16]

    Finding and understanding bugs in C compilers

    Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. “Finding and understanding bugs in C compilers”. In:SIGPLAN Not.46.6 (June 2011), pp. 283–294.ISSN: 0362-1340.DOI:10.1145/1993316.1993532.URL:https://doi.org/10. 1145/1993316.1993532

  17. [17]

    Compiler validation via equivalence modulo inputs

    Vu Le, Mehrdad Afshari, and Zhendong Su. “Compiler validation via equivalence modulo inputs”. In:SIGPLAN Not.49.6 (June 2014), pp. 216–226.ISSN: 0362-1340.DOI:10.1145/2666356.2594334.URL:https://doi.org/10.1145/ 2666356.2594334

  18. [18]

    Finding compiler bugs via live code mutation

    Chengnian Sun, Vu Le, and Zhendong Su. “Finding compiler bugs via live code mutation”. In:SIGPLAN Not.51.10 (Oct. 2016), pp. 849–863.ISSN: 0362-1340.DOI:10.1145/3022671.2984038.URL:https://doi.org/10.1145/3022671. 2984038

  19. [19]

    Testing Database Engines via Pivoted Query Synthesis

    Manuel Rigger and Zhendong Su. “Testing Database Engines via Pivoted Query Synthesis”. In:Proc. ACM Program. Lang. 4.OOPSLA (Nov. 2020).DOI:10.1145/3428279.URL:https://doi.org/10.1145/3428279

  20. [20]

    Evaluating Program Semantics Reasoning with Type Inference in System $F$

    Yifeng He, Luning Yang, Christopher Castro Gaw Gonzalo, and Hao Chen. “Evaluating Program Semantics Reasoning with Type Inference in System $F$”. In:The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track. 2025.URL:https://openreview.net/forum?id=IA9RmaP0aw

  21. [21]

    Coverage-based Greybox Fuzzing as Markov Chain

    Marcel B ¨ohme, Van-Thuan Pham, and Abhik Roychoudhury. “Coverage-based Greybox Fuzzing as Markov Chain”. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. CCS ’16. Vienna, Austria: Association for Computing Machinery, 2016, pp. 1032–1043.ISBN: 9781450341394.DOI:10.1145/2976749.2978428. URL:https://doi.org/10.1145/...

  22. [22]

    AFL++ : Combining Incremental Steps of Fuzzing Research

    Andrea Fioraldi, Dominik Maier, Heiko Eißfeldt, and Marc Heuse. “AFL++ : Combining Incremental Steps of Fuzzing Research”. In:14th USENIX Workshop on Offensive Technologies (WOOT 20). USENIX Association, Aug. 2020.URL: https://www.usenix.org/conference/woot20/presentation/fioraldi

  23. [23]

    The Fuzzing Book

    Andreas Zeller, Rahul Gopinath, Marcel B ¨ohme, Gordon Fraser, and Christian Holler. “The Fuzzing Book”. In: (Jan. 2019). DOI:10.60882/cispa.24614928.v1.URL:https://publications.cispa.de/articles/book/The_Fuzzing_ Book/24614928

  24. [24]

    Matryoshka: Fuzzing Deeply Nested Branches

    Peng Chen, Jianzhong Liu, and Hao Chen. “Matryoshka: Fuzzing Deeply Nested Branches”. In:Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security. CCS ’19. London, United Kingdom: Association for Computing Machinery, 2019, pp. 499–513.ISBN: 9781450367479.DOI:10.1145/3319535.3363225.URL:https: //doi.org/10.1145/3319535.3363225

  25. [25]

    Coverage-directed differential testing of JVM implementations

    Yuting Chen, Ting Su, Chengnian Sun, Zhendong Su, and Jianjun Zhao. “Coverage-directed differential testing of JVM implementations”. In:SIGPLAN Not.51.6 (June 2016), pp. 85–99.ISSN: 0362-1340.URL:https://doi.org/10.1145/ 2980983.2908095

  26. [26]

    NEZHA: Efficient Domain- Independent Differential Testing

    Theofilos Petsios, Adrian Tang, Salvatore Stolfo, Angelos D. Keromytis, and Suman Jana. “NEZHA: Efficient Domain- Independent Differential Testing”. In:Proceedings of the 2017 IEEE Symposium on Security and Privacy. SP ’17. San Jose, CA, USA: IEEE Press, 2017, pp. 615–632.ISBN: 9781509049318.DOI:10 . 1109 / SP . 2017 . 27.URL:https : //doi.org/10.1109/SP.2017.27

  27. [27]

    DifFuzz: differential fuzzing for side-channel analysis

    Shirin Nilizadeh, Yannic Noller, and Corina S. P ˘as˘areanu. “DifFuzz: differential fuzzing for side-channel analysis”. In: Proceedings of the 41st International Conference on Software Engineering. ICSE ’19. Montreal, Quebec, Canada: IEEE Press, 2019, pp. 176–187.ISBN: 9781728108698.DOI:10.1109/ICSE.2019.00122.URL:https://doi.org/10. 1109/ICSE.2019.00122

  28. [28]

    In: Proceedings of the 38th International Conference on Software Engineering

    Junjie Chen et al. “An empirical comparison of compiler testing techniques”. In:Proceedings of the 38th International Conference on Software Engineering. ICSE ’16. Austin, Texas: Association for Computing Machinery, 2016, pp. 180–190. ISBN: 9781450342056.DOI:10.1145/2884781.2884878.URL:https://doi.org/10.1145/2884781.2884878

  29. [29]

    CodeT: Code Generation with Generated Tests

    Bei Chen et al. “CodeT: Code Generation with Generated Tests”. In:The Eleventh International Conference on Learning Representations. 2023.URL:https://openreview.net/forum?id=ktrw68Cmu9c

  30. [30]

    In: Zong, C., Xia, F., Li, W., Navigli, R

    Yifeng He, Jicheng Wang, Yuyang Rong, and Hao Chen. “FuzzAug: Data Augmentation by Coverage-guided Fuzzing for Neural Test Generation”. In:Findings of the Association for Computational Linguistics: EMNLP 2025. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 15642–15655.ISBN: 979-8-89176-335-7.DOI:10.18653/v1/ 2025.findings-emnlp.8...

  31. [31]

    Learning to Write with Cooperative Discriminators

    Ari Holtzman et al. “Learning to Write with Cooperative Discriminators”. In:Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, July 2018, pp. 1638–1649.DOI:10.18653/v1/P18-1152.URL:https://aclanthology.org/P18-1152/

  32. [32]

    The Curious Case of Neural Text Degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. “The Curious Case of Neural Text Degeneration”. In: International Conference on Learning Representations. 2020.URL:https://openreview.net/forum?id=rygGQyrFvH

  33. [33]

    Hierarchical Neural Story Generation

    Angela Fan, Mike Lewis, and Yann Dauphin. “Hierarchical Neural Story Generation”. In:Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, July 2018, pp. 889–898.DOI:10.18653/v1/P18- 1082.URL:https://aclanthology. org/P18-1082/

  34. [34]

    Language models are unsupervised multitask learners

    Alec Radford et al. “Language models are unsupervised multitask learners”. In:OpenAI blog1.8 (2019), p. 9

  35. [35]

    Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation

    Hongxiang Zhang, Hao Chen, Muhao Chen, and Tianyi Zhang. “Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation”. In:Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 3028–3046.ISBN: 979-8-89176-332- 6.DOI:10....

  36. [36]

    A learning algorithm for Boltzmann machines

    David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. “A learning algorithm for Boltzmann machines”. In:Cog- nitive science9.1 (1985), pp. 147–169

  37. [37]

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean.Distilling the Knowledge in a Neural Network. 2015. arXiv:1503.02531 [stat.ML].URL:https://arxiv.org/abs/1503.02531

  38. [38]

    Controlling Linguistic Style Aspects in Neural Language Generation

    Jessica Ficler and Yoav Goldberg. “Controlling Linguistic Style Aspects in Neural Language Generation”. In:Proceedings of the Workshop on Stylistic Variation. Copenhagen, Denmark: Association for Computational Linguistics, Sept. 2017, pp. 94–104.DOI:10.18653/v1/W17-4912.URL:https://aclanthology.org/W17-4912/

  39. [39]

    In: Duh, K., Gomez, H., Bethard, S

    Matthew Renze. “The Effect of Sampling Temperature on Problem Solving in Large Language Models”. In:Findings of the Association for Computational Linguistics: EMNLP 2024. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 7346–7356.DOI:10.18653/v1/2024.findings- emnlp.432.URL:https://aclanthology.org/ 2024.findings-emnlp.432/

  40. [40]

    Demystifying LLM-Based Software Engineering Agents

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. “Demystifying LLM-Based Software Engineering Agents”. In:Proc. ACM Softw. Eng.2.FSE (June 2025).DOI:10.1145/3715754.URL:https://doi.org/10.1145/ 3715754

  41. [41]

    Lipton, Mu Li, and Alexander J

    Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola.Dive into Deep Learning.https://D2L.ai. Cambridge University Press, 2023. 18 Code Generation by Differential Test Time Scaling

  42. [42]

    Holistic Evaluation of Language Models

    Percy Liang et al. “Holistic Evaluation of Language Models”. In:Transactions on Machine Learning Research(2023). Featured Certification, Expert Certification, Outstanding Certification.ISSN: 2835-8856.URL:https : / / openreview . net/forum?id=iO4LZibEqW

  43. [43]

    Quantifying Language Models’Sensitivity to Spurious Fea- tures in Prompt Design or: How I learned to start worrying about prompt formatting

    Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. “Quantifying Language Models’Sensitivity to Spurious Fea- tures in Prompt Design or: How I learned to start worrying about prompt formatting”. In:International Conference on Representation Learning. V ol. 2024. 2024, pp. 25055–25083.URL:https://proceedings.iclr.cc/paper_files/ paper/2024/file/6c0e...

  44. [44]

    ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs

    Jingming Zhuo et al. “ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs”. In:Findings of the Association for Computational Linguistics: EMNLP 2024. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 1950–1976.DOI:10 . 18653 / v1 / 2024 . findings - emnlp . 108.URL:https : / / aclanthology . org / 2024 . findings...

  45. [45]

    What Did I Do Wrong? Quantifying LLMs’ Sensitivity and Consistency to Prompt Engineering

    “What Did I Do Wrong? Quantifying LLMs’ Sensitivity and Consistency to Prompt Engineering”. In:Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Albuquerque, New Mexico: Association for Computational Linguistics, Apr. 2025, pp. 15...

  46. [46]

    Prompting Techniques for Secure Code Generation: A Systematic Investigation

    Catherine Tony, Nicol ´as E. D´ıaz Ferreyra, Markus Mutas, Salem Dhif, and Riccardo Scandariato. “Prompting Techniques for Secure Code Generation: A Systematic Investigation”. In:ACM Trans. Softw. Eng. Methodol.34.8 (Oct. 2025).ISSN: 1049-331X.DOI:10.1145/3722108.URL:https://doi.org/10.1145/3722108

  47. [47]

    How beginning programmers and code LLMs ( mis)read each other,

    Sydney Nguyen et al. “How Beginning Programmers and Code LLMs (Mis)read Each Other”. In:Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. CHI ’24. Honolulu, HI, USA: Association for Computing Ma- chinery, 2024.ISBN: 9798400703300.DOI:10.1145/3613904.3642706.URL:https://doi.org/10.1145/3613904. 3642706

  48. [48]

    CREME: Robustness Enhancement of Code LLMs via Layer-Aware Model Editing

    Shuhan Liu et al. “CREME: Robustness Enhancement of Code LLMs via Layer-Aware Model Editing”. In:Proceedings of the IEEE/ACM 48th International Conference on Software Engineering. ICSE ’26. Rio de Janeiro, Brazil: Association for Computing Machinery, 2026.DOI:3744916.3773111.URL:https://arxiv.org/abs/2507.16407v3

  49. [49]

    Can Language Models Solve Olympiad Programming?

    Ben Shi, Michael Tang, Karthik R Narasimhan, and Shunyu Yao. “Can Language Models Solve Olympiad Programming?” In:First Conference on Language Modeling. 2024.URL:https://openreview.net/forum?id=kGa4fMtP9l

  50. [50]

    Beam Search Strategies for Neural Machine Translation

    Markus Freitag and Yaser Al-Onaizan. “Beam Search Strategies for Neural Machine Translation”. In:Proceedings of the First Workshop on Neural Machine Translation. Vancouver: Association for Computational Linguistics, Aug. 2017, pp. 56– 60.DOI:10.18653/v1/W17-3207.URL:https://aclanthology.org/W17-3207/

  51. [51]

    NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

    Kaustubh Dhole et al. “NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation”. In: Northern European Journal of Language Technology 9 (2023). DOI: 10.3384/nejlt.2000-1533.2023.4725. URL: https://aclanthology.org/2023.nejlt-1.5/

  52. [52]

    Prompt Perturbation Consistency Learning for Robust Language Models

    Yao Qiang et al. “Prompt Perturbation Consistency Learning for Robust Language Models”. In: Findings of the Association for Computational Linguistics: EACL 2024. St. Julian’s, Malta: Association for Computational Linguistics, Mar. 2024, pp. 1357–1370. URL: https://aclanthology.org/2024.findings-eacl.91/

  53. [53]

    Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?

    Zhanke Zhou et al. “Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?” In: The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024. URL: https://openreview.net/forum?id=FbuODM02ra

  54. [54]

    Sorting through the noise: Testing robustness of information processing in pre-trained language models

    Lalchand Pandia and Allyson Ettinger. “Sorting through the noise: Testing robustness of information processing in pre-trained language models”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 1583–1596. DOI: 10.18...

  55. [55]

    Enhancing Noise Robustness of Retrieval-Augmented Language Models with Adaptive Adversarial Training

    Feiteng Fang et al. “Enhancing Noise Robustness of Retrieval-Augmented Language Models with Adaptive Adversarial Training”. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 10028–10039. DOI: 10.18653/v1/2024.acl-lo...

  56. [56]

    Models in the Wild: On Corruption Robustness of Neural NLP Systems

    Barbara Rychalska, Dominika Basaj, Alicja Gosiewska, and Przemysław Biecek. “Models in the Wild: On Corruption Robustness of Neural NLP Systems”. In: Neural Information Processing. Ed. by Tom Gedeon, Kok Wai Wong, and Minho Lee. Cham: Springer International Publishing, 2019, pp. 235–247. ISBN: 978-3-030-36718-3

  57. [57]

    Understanding Programs by Exploiting (Fuzzing) Test Cases

    Jianyu Zhao, Yuyang Rong, Yiwen Guo, Yifeng He, and Hao Chen. “Understanding Programs by Exploiting (Fuzzing) Test Cases”. In: Findings of the Association for Computational Linguistics: ACL 2023. Toronto, Canada: Association for Computational Linguistics, July 2023, pp. 10667–10679. DOI: 10.18653/v1/2023.findings-acl.678. URL: https://aclanthology.org/2023.fi...

  58. [58]

    Continuous Fuzzing with libFuzzer and AddressSanitizer

    Kosta Serebryany. “Continuous Fuzzing with libFuzzer and AddressSanitizer”. In: 2016 IEEE Cybersecurity Development (SecDev). 2016, pp. 157–157. URL: https://doi.org/10.1109/SecDev.2016.043

  59. [59]

    Prompt Fuzzing for Fuzz Driver Generation

    Yunlong Lyu, Yuxuan Xie, Peng Chen, and Hao Chen. “Prompt Fuzzing for Fuzz Driver Generation”. In: Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. CCS ’24. Salt Lake City, UT, USA: Association for Computing Machinery, 2024, pp. 3793–3807. ISBN: 9798400706363. DOI: 10.1145/3658644.3670396. URL: https://doi.org/10.1145/3...

  60. [60]

    Competition-level code generation with AlphaCode

    Yujia Li et al. “Competition-level code generation with AlphaCode”. In: Science 378.6624 (2022), pp. 1092–1097. DOI: 10.1126/science.abq1158. eprint: https://www.science.org/doi/pdf/10.1126/science.abq1158. URL: https://www.science.org/doi/abs/10.1126/science.abq1158

  61. [61]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang et al. “Self-Consistency Improves Chain of Thought Reasoning in Language Models”. In: The Eleventh International Conference on Learning Representations. 2023. URL: https://openreview.net/forum?id=1PL1NIMMrw

  62. [62]

    Hierarchical Clustering: Objective Functions and Algorithms

    Vincent Cohen-Addad, Varun Kanade, Frederik Mallmann-Trenn, and Claire Mathieu. “Hierarchical Clustering: Objective Functions and Algorithms”. In: J. ACM 66.4 (June 2019). ISSN: 0004-5411. DOI: 10.1145/3321386. URL: https://doi.org/10.1145/3321386

  63. [63]

    Revisiting agglomerative clustering

    Eric K. Tokuda, Cesar H. Comin, and Luciano da F. Costa. “Revisiting agglomerative clustering”. In: Physica A: Statistical Mechanics and its Applications 585 (2022), p. 126433. ISSN: 0378-4371. DOI: https://doi.org/10.1016/j.physa.2021.126433. URL: https://www.sciencedirect.com/science/article/pii/S0378437121007068

  64. [64]

    Binyuan Hui et al. Qwen2.5-Coder Technical Report. 2024. arXiv: 2409.12186 [cs.CL]. URL: https://arxiv.org/abs/2409.12186

  65. [65]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Woosuk Kwon et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention”. In: Proceedings of the 29th Symposium on Operating Systems Principles. SOSP ’23. Koblenz, Germany: Association for Computing Machinery, 2023, pp. 611–626. ISBN: 9798400702297. DOI: 10.1145/3600006.3613165. URL: https://doi.org/10.1145/3600006.3613165

  66. [67]

    S*: Test Time Scaling for Code Generation (Source code)

    Dacheng Li et al. S*: Test Time Scaling for Code Generation (Source code). 2025. URL: https://github.com/NovaSky-AI/SkyThought/tree/0d190f11fd8e885bbe113aeccacba5ccde5b1102/skythought/test-time-scaling

  67. [68]

    Google. Gemini 2.5 Flash-Lite. URL: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash-lite

  68. [69]

    Silhouettes: a graphical aid to the interpretation and validation of cluster analysis

    Peter Rousseeuw. “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis”. In: J. Comput. Appl. Math. 20.1 (Nov. 1987), pp. 53–65. ISSN: 0377-0427. DOI: 10.1016/0377-0427(87)90125-7. URL: https://doi.org/10.1016/0377-0427(87)90125-7

  69. [70]

    Chris Yuhao Liu et al. Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy. 2026. arXiv: 2507.01352 [cs.CL]. URL: https://arxiv.org/abs/2507.01352

  70. [71]

    QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs

    Koen Claessen and John Hughes. “QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs”. In: Proceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming. ICFP ’00. New York, NY, USA: Association for Computing Machinery, 2000, pp. 268–279. ISBN: 1581132026. URL: https://doi.org/10.1145/351240.351266

  71. [72]

    Property-Based Testing in Practice

    Harrison Goldstein, Joseph W. Cutler, Daniel Dickstein, Benjamin C. Pierce, and Andrew Head. “Property-Based Testing in Practice”. In: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. ICSE ’24. Lisbon, Portugal: Association for Computing Machinery, 2024. ISBN: 9798400702174. URL: https://doi.org/10.1145/3597503.3639581

  72. [73]

    Oracle-Guided Program Selection from Large Language Models

    Zhiyu Fan, Haifeng Ruan, Sergey Mechtaev, and Abhik Roychoudhury. “Oracle-Guided Program Selection from Large Language Models”. In: Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. ISSTA 2024. Vienna, Austria: Association for Computing Machinery, 2024, pp. 628–640. ISBN: 9798400706127. DOI: 10.1145/3650212.3680308...

  73. [74]

    A Survey on Large Language Models for Code Generation

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. “A Survey on Large Language Models for Code Generation”. In: ACM Trans. Softw. Eng. Methodol. 35.2 (Jan. 2026). ISSN: 1049-331X. DOI: 10.1145/3747588. URL: https://doi.org/10.1145/3747588

  74. [75]

    An Yang et al. Qwen3 Technical Report. 2025. arXiv: 2505.09388 [cs.CL]. URL: https://arxiv.org/abs/2505.09388

  75. [76]

    DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

    Daya Guo et al. “DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning”. In: Nature 645.8081 (Sept. 2025), pp. 633–638. ISSN: 1476-4687. DOI: 10.1038/s41586-025-09422-z. URL: http://dx.doi.org/10.1038/s41586-025-09422-z

  76. [77]

    SWE-bench: Can Language Models Resolve Real-world Github Issues?

    Carlos E Jimenez et al. “SWE-bench: Can Language Models Resolve Real-world Github Issues?” In: The Twelfth International Conference on Learning Representations. 2024. URL: https://openreview.net/forum?id=VTF8yNQM66

  77. [78]

    Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study

    Qi Guo et al. “Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study”. In: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. ICSE ’24. Lisbon, Portugal: Association for Computing Machinery, 2024. ISBN: 9798400702174. DOI: 10.1145/3597503.3623306. URL: https://doi.org/10.1145/3597503.3623306

  78. [79]

    Automated Program Repair in the Era of Large Pre-trained Language Models

    Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. “Automated Program Repair in the Era of Large Pre-trained Language Models”. In: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 2023, pp. 1482–1494. DOI: 10.1109/ICSE48619.2023.00129.

  79. [80]

    UniTSyn: A Large-Scale Dataset Capable of Enhancing the Prowess of Large Language Models for Program Testing

    Yifeng He et al. “UniTSyn: A Large-Scale Dataset Capable of Enhancing the Prowess of Large Language Models for Program Testing”. In: Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. ISSTA 2024. Vienna, Austria: Association for Computing Machinery, 2024, pp. 1061–1072. ISBN: 9798400706127. DOI: 10.1145/3650212.3680...

  80. [81]

    The Program Testing Ability of Large Language Models for Code

    Weimin Xiong, Yiwen Guo, and Hao Chen. “The Program Testing Ability of Large Language Models for Code”. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track. Miami, Florida, US: Association for Computational Linguistics, Nov. 2024, pp. 23–34. DOI: 10.18653/v1/2024.emnlp-industry.3. URL: https://aclantho...
