Code Generation by Differential Test Time Scaling

Ethan Wang; Hao Chen; Jicheng Wang; Xuanxin Ouyang; Yifeng He

arXiv: 2605.20473 · v1 · pith:G5RGKBWBnew · submitted 2026-05-19 · 💻 cs.SE · cs.AI · cs.LG

Pith reviewed 2026-05-21 06:39 UTC · model grok-4.3

keywords: code generation, test-time scaling, coverage-guided fuzzing, behavioral clustering, differential analysis, LLM inference efficiency, agentic coding

The pith

DiffCodeGen selects the best code candidate by clustering execution behaviors on fuzzing-generated inputs without any extra LLM calls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DiffCodeGen as a test-time scaling approach for code generation that creates diverse candidates through varied sampling and prompting, then uses coverage-guided fuzzing to produce inputs without any pre-existing tests or additional model inference. Candidates run on these inputs so their dynamic behaviors can be compared and grouped into clusters; the medoid of the largest cluster becomes the output. This method avoids the token and time costs of prior scaling techniques that depend on public tests or repeated LLM judgments, while remaining fully asynchronous and compatible with agentic workflows. A sympathetic reader cares because the approach promises higher-quality code from existing models at a small fraction of the usual inference overhead.
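The step of running candidates on inputs and recording their behavior can be illustrated with a minimal Python sketch. Everything named here (`behavior_signature`, the three toy candidates, the hand-written inputs) is hypothetical and not taken from the paper; a real harness would also need sandboxing and timeouts, which this sketch omits.

```python
from typing import Any, Callable, Sequence

def behavior_signature(candidate: Callable[..., Any],
                       inputs: Sequence[tuple]) -> tuple:
    """Run one candidate on every input and record what was observed."""
    observed = []
    for args in inputs:
        try:
            observed.append(("ok", repr(candidate(*args))))
        except Exception as exc:
            # A crash is also observable behavior, so record its type.
            observed.append(("error", type(exc).__name__))
    return tuple(observed)

# Three hypothetical candidates for "return the maximum of a list".
def cand_a(xs):
    return max(xs)

def cand_b(xs):
    return sorted(xs)[-1]

def cand_c(xs):
    return xs[0]  # wrong: ignores everything after the first element

# Stand-ins for fuzzer-generated inputs (assumed, written by hand here).
inputs = [([1, 3, 2],), ([5],), (None,)]

signatures = {f.__name__: behavior_signature(f, inputs)
              for f in (cand_a, cand_b, cand_c)}
```

In this toy run `cand_a` and `cand_b` produce identical signatures, while `cand_c` diverges on the first input. That divergence is the kind of signal a behavioral comparison can group on.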

Core claim

DiffCodeGen generates diverse code candidates using various sampling and prompting strategies, applies coverage-guided fuzzing to synthesize inputs without requiring existing tests or large language models, executes all candidates on these inputs to capture dynamic behavior, clusters candidates by behavioral similarity, and selects the medoid of the largest cluster as the final output. Unlike prior methods, this selection uses no extra model calls and therefore incurs little to no additional token consumption; the process is fully asynchronous and naturally suited to agentic coding. Evaluations across four large language models show consistent gains over baselines and competitive or superior performance relative to state-of-the-art test-time scaling methods, using only a fraction of the time and tokens.

What carries the argument

Coverage-guided differential analysis that synthesizes inputs via fuzzing, executes candidates to record behaviors, clusters by behavioral similarity, and selects the medoid of the largest cluster.
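A minimal Python sketch of the clustering and medoid-selection step follows. The page does not state the paper's distance metric or clustering algorithm, so the choices here are assumptions made only for illustration: a disagreement-rate distance over output vectors, single-linkage grouping with a hypothetical `threshold`, and the medoid taken as the member with the smallest total distance to the rest of its cluster.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

def distance(a: Sequence, b: Sequence) -> float:
    """Fraction of inputs on which two behavior vectors disagree (assumed metric)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def cluster(behaviors: Dict[str, Sequence], threshold: float) -> List[List[str]]:
    """Single-linkage grouping: link candidates whose distance is <= threshold."""
    names = list(behaviors)
    parent = {n: n for n in names}

    def find(n: str) -> str:
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if distance(behaviors[a], behaviors[b]) <= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for n in names:
        groups[find(n)].append(n)
    return list(groups.values())

def select(behaviors: Dict[str, Sequence], threshold: float = 0.0) -> str:
    """Return the medoid of the largest behavioral cluster."""
    largest = max(cluster(behaviors, threshold), key=len)
    return min(largest,
               key=lambda n: sum(distance(behaviors[n], behaviors[m])
                                 for m in largest))

# Hypothetical outputs of five candidates on four synthesized inputs.
behaviors = {
    "c0": (1, 2, 3, 4),
    "c1": (1, 2, 3, 4),
    "c2": (1, 2, 3, 5),
    "c3": (0, 0, 0, 0),
    "c4": (9, 9, 9, 9),
}
clusters = cluster(behaviors, threshold=0.25)
chosen = select(behaviors, threshold=0.25)
```

Here `c0`, `c1`, and `c2` form the largest cluster, and the medoid is one of the two candidates that agree exactly. Note that nothing in this procedure checks correctness: it returns whatever behavior is most common.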

If this is right

Performance improves consistently across four different large language models without model-specific tuning.
Token and time costs remain a small fraction of those required by test-time scaling methods that use public tests or extra LLM inference for selection.
The method can be combined with reasoning models to produce further gains.
Because selection requires no additional model calls, the approach scales naturally to large numbers of candidates in asynchronous agentic coding setups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reliance on automatically generated inputs could allow the technique to work in domains where public test suites are scarce or nonexistent.
Behavioral clusters might expose systematic error patterns shared across many generated solutions, offering a new diagnostic for code-generation failures.
The same differential-execution idea could be adapted to select among outputs in other generative tasks such as text or proof synthesis.

Load-bearing premise

The largest behavioral cluster identified from executions on the synthesized inputs reliably contains the correct or best code solution.

What would settle it

A test suite of held-out problems where the medoid of the largest cluster fails on the ground-truth tests while a candidate from a smaller cluster passes would show the selection rule does not reliably pick the best solution.
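The check described above could be scripted roughly as follows. This is a sketch under assumed data shapes: `ProblemRecord` and its fields are hypothetical names, and the three records are made-up illustrations, not results from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ProblemRecord:
    clusters: List[List[str]]   # behavioral clusters of candidate ids
    selected: str               # medoid of the largest cluster
    passes: Dict[str, bool]     # ground-truth test result per candidate

def is_counterexample(rec: ProblemRecord) -> bool:
    """True when the selected medoid fails but another cluster holds a passing candidate."""
    if rec.passes[rec.selected]:
        return False
    largest = max(rec.clusters, key=len)
    return any(rec.passes[c]
               for members in rec.clusters if members is not largest
               for c in members)

# Made-up records for illustration only.
records = [
    # Majority shares a failure mode; the lone dissenter is correct.
    ProblemRecord([["a", "b", "c"], ["d"]], "a",
                  {"a": False, "b": False, "c": False, "d": True}),
    # Majority is correct.
    ProblemRecord([["a", "b"], ["c"]], "a",
                  {"a": True, "b": True, "c": False}),
    # Nobody passes, so selection could not have done better.
    ProblemRecord([["a", "b"], ["c"]], "a",
                  {"a": False, "b": False, "c": False}),
]
flags = [is_counterexample(r) for r in records]
rate = sum(flags) / len(records)
```

Separating the third case from the first matters: a problem no candidate solves says nothing about the selection rule, whereas the first case is exactly the failure the premise rules out.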

Figures

Figures reproduced from arXiv: 2605.20473 by Ethan Wang, Hao Chen, Jicheng Wang, Xuanxin Ouyang, Yifeng He.

**Figure 1.** An overview of the DIFFCODEGEN approach. Here, “for free” refers to performing candidate selection without any additional LLM inference, incurring no extra token cost beyond the initial candidate generation.

**Figure 2.** Execution time comparison among different test-time scaling methods.

**Figure 3.** Token usage comparison among different test-time scaling methods. ‘Prompt’ uses input tokens, and ‘com…

**Figure 4.** Performance scaling with number of samples.

Abstract

Test-time scaling has emerged as a promising approach for improving code generation by exploring large solution spaces at inference time. However, existing methods often rely on public test cases that are unavailable in practice, or require extensive LLM inference for candidate selection, leading to significant token consumption and time overhead. We present DiffCodeGen, a novel test-time scaling method for code generation based on coverage-guided differential analysis. DiffCodeGen generates diverse code candidates using various sampling and prompting strategies, then applies coverage-guided fuzzing to synthesize inputs without requiring any existing tests or large language models. By executing all candidates on these inputs, DiffCodeGen captures their dynamic behavior and clusters candidates based on behavioral similarity. DiffCodeGen selects the medoid of the largest cluster as the final output. Unlike prior test-time scaling methods that invoke additional LLM inference for candidate selection, DiffCodeGen performs selection without any extra model calls, incurring little to no additional token consumption. DiffCodeGen is fully asynchronous, naturally suited to the current trend of agentic coding, and is thus efficient and highly scalable. We evaluate DiffCodeGen across 4 large language models, demonstrating consistent improvements over baselines. Compared to state-of-the-art test-time scaling methods, DiffCodeGen achieves competitive or superior performance while using only a fraction of time and tokens. DiffCodeGen is model-agnostic and can be combined with reasoning models to further boost performance.
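To make the input-synthesis idea concrete, here is a toy coverage-guided fuzzing loop in Python. It is not the paper's fuzzer, which this page does not specify; it is a sketch that assumes list-of-integer inputs, uses `sys.settrace` for line coverage, and keeps a mutated input only when it reaches a line not seen before. The names `line_coverage`, `mutate`, `fuzz`, and `candidate` are all hypothetical.

```python
import random
import sys
from typing import Callable, List, Set, Tuple

def line_coverage(func: Callable, arg) -> Set[Tuple[str, int]]:
    """Run func(arg) and return the (filename, line) pairs it executed."""
    covered: Set[Tuple[str, int]] = set()

    def tracer(frame, event, _arg):
        if event == "line":
            covered.add((frame.f_code.co_filename, frame.f_lineno))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(arg)
    except Exception:
        pass  # a crashing input still contributes coverage
    finally:
        sys.settrace(previous)
    return covered

def mutate(xs: List[int], rng: random.Random) -> List[int]:
    """Apply one random edit: insert, delete, or overwrite an element."""
    ys = list(xs)
    choice = rng.randrange(3)
    if choice == 0 or not ys:
        ys.insert(rng.randrange(len(ys) + 1), rng.randint(-10, 10))
    elif choice == 1:
        ys.pop(rng.randrange(len(ys)))
    else:
        ys[rng.randrange(len(ys))] = rng.randint(-10, 10)
    return ys

def fuzz(targets: List[Callable], seed_input: List[int],
         rounds: int = 300, seed: int = 0) -> List[List[int]]:
    """Grow a corpus by keeping inputs that reach new lines in any target."""
    rng = random.Random(seed)
    corpus = [seed_input]
    seen: Set[Tuple[str, int]] = set()
    for target in targets:
        seen |= line_coverage(target, seed_input)
    for _ in range(rounds):
        child = mutate(rng.choice(corpus), rng)
        new: Set[Tuple[str, int]] = set()
        for target in targets:
            new |= line_coverage(target, child)
        if not new <= seen:  # at least one previously unseen line
            seen |= new
            corpus.append(child)
    return corpus

# A hypothetical candidate with nested branches.
def candidate(xs):
    if len(xs) > 1:
        if len(xs) > 2:
            if xs[0] < 0:
                return -1
        return xs[1]
    return sum(xs)

corpus = fuzz([candidate], [1])
```

The nested branches in `candidate` are there to show why the coverage feedback helps: each retained input becomes a stepping stone toward the next deeper branch, which purely random inputs would reach less directly. No LLM call appears anywhere in the loop, which is the property the abstract emphasizes.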

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DiffCodeGen pairs coverage-guided fuzzing with behavioral clustering to pick code candidates without public tests or extra LLM calls, but the largest-cluster rule looks vulnerable when wrong solutions fail alike.

The main point for you: this paper presents DiffCodeGen, a test-time scaling technique for LLM code generation that uses coverage-guided fuzzing to create inputs and then clusters candidate behaviors to select the output, with no public tests and no additional model calls. The new part is the differential analysis, that is, behavioral clustering on fuzzed inputs. The pipeline generates candidates with varied sampling and prompting, runs coverage-guided fuzzing to synthesize inputs, executes every candidate on those inputs to observe its behavior, groups similar candidates, and returns the medoid of the largest cluster. This keeps token use low and runs asynchronously, which fits agentic setups.

The evaluation on four models shows gains over baselines and results competitive with other scaling methods in less time and with fewer tokens. That efficiency focus is where the paper adds value for practical deployment.

One soft spot is the core selection rule. If wrong candidates end up in the same behavioral cluster because they fail similarly on the fuzzed inputs, the method could output a bad solution while still looking efficient. The abstract gives no numbers and no details on the evaluation protocol, which makes it hard to assess how often this happens or how strong the improvements really are. The stress-test concern about shared failure modes seems relevant here.

This work is for people building or studying LLM tools for software engineering who need methods that scale without high inference costs. A reader focused on test-time compute or on fuzzing applications in verification might pick up useful ideas from the pipeline. I would recommend sending it for peer review. The idea is distinct enough and addresses a real constraint, even if the results section will need close checking for clustering reliability.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes DiffCodeGen, a test-time scaling method for code generation. It generates diverse code candidates via sampling and prompting strategies, synthesizes inputs using coverage-guided fuzzing without public tests or extra LLM calls, executes all candidates on these inputs to capture dynamic behavior, clusters candidates by behavioral similarity, and selects the medoid of the largest cluster as output. The paper claims consistent improvements over baselines across four LLMs, competitive or superior performance versus state-of-the-art test-time scaling methods while using only a fraction of the time and tokens, and emphasizes its model-agnostic, asynchronous design suitable for agentic coding.

Significance. If the central claims hold, this could represent a meaningful advance in efficient test-time scaling for code generation by eliminating reliance on public tests and additional model inference for selection. The coverage-guided fuzzing approach for differential behavior analysis is a notable technical choice that enables low-overhead selection. The model-agnostic property and potential combination with reasoning models are positive aspects. Reproducible evaluation across multiple LLMs would strengthen the contribution if detailed metrics confirm the efficiency gains.

major comments (2)

[Abstract] Abstract: The claims of 'consistent improvements over baselines' and 'competitive or superior performance' are stated without any quantitative metrics, effect sizes, statistical significance tests, benchmark details, or evaluation protocol. This absence is load-bearing for assessing support of the central performance and efficiency claims.
[Method] Method section (clustering and selection): The assumption that the largest behavioral cluster from coverage-guided fuzzing inputs reliably contains the correct or best solution is central to the no-extra-LLM-call efficiency argument. The manuscript should provide targeted analysis or counterexample experiments for cases where incorrect candidates share similar failure modes on the synthesized inputs, as this directly risks degrading accuracy while still claiming token/time savings.

minor comments (2)

[Method] Provide explicit details on the coverage-guided fuzzing parameters, clustering distance metric, and candidate generation strategies to support reproducibility.
[Evaluation] Ensure all baselines and comparison methods are clearly defined with references in the evaluation section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below and indicate the revisions we will make.

Referee: [Abstract] Abstract: The claims of 'consistent improvements over baselines' and 'competitive or superior performance' are stated without any quantitative metrics, effect sizes, statistical significance tests, benchmark details, or evaluation protocol. This absence is load-bearing for assessing support of the central performance and efficiency claims.

Authors: We agree that the abstract would be strengthened by the inclusion of quantitative details. In the revised version, we will update the abstract to report key metrics from our evaluations, such as average pass rate improvements over baselines, token and runtime reductions relative to state-of-the-art test-time scaling methods, the specific benchmarks employed, and a brief note on the evaluation protocol. This will make the central claims more concrete and directly address the concern.
revision: yes

Referee: [Method] Method section (clustering and selection): The assumption that the largest behavioral cluster from coverage-guided fuzzing inputs reliably contains the correct or best solution is central to the no-extra-LLM-call efficiency argument. The manuscript should provide targeted analysis or counterexample experiments for cases where incorrect candidates share similar failure modes on the synthesized inputs, as this directly risks degrading accuracy while still claiming token/time savings.

Authors: This is a fair and important point about the robustness of the clustering assumption. Our approach uses coverage-guided fuzzing to generate diverse inputs that aim to expose behavioral differences, and our multi-model experiments indicate that the largest cluster frequently aligns with correct solutions. To directly respond, we will add a targeted analysis subsection in the revised manuscript that examines cases of shared failure modes among incorrect candidates, reports observed frequencies, and discusses any impact on accuracy. We will include relevant examples and maintain an honest assessment of limitations.
revision: yes

Circularity Check

0 steps flagged

No circularity: procedural pipeline with independent empirical evaluation

The paper describes DiffCodeGen as a sequence of steps—candidate generation via sampling/prompting, coverage-guided fuzzing for input synthesis, execution to capture behaviors, similarity-based clustering, and medoid selection of the largest cluster—without any equations, fitted parameters, or derivations that reduce the output to inputs by construction. Performance claims rest on external empirical results across four LLMs rather than self-referential definitions or self-citation chains. The core heuristic (largest cluster contains the best solution) is an explicit assumption open to falsification, not a tautology or renamed fit. This leaves the method self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on domain assumptions about the representativeness of fuzzing inputs and introduces unspecified parameters for candidate generation and clustering without independent evidence for their values.

free parameters (2)

candidate generation parameters
Number and variety of sampling/prompting strategies used to produce diverse candidates.
fuzzing and clustering parameters
Settings controlling input synthesis and behavioral similarity grouping.

axioms (1)

domain assumption: Synthesized fuzzing inputs suffice to expose behavioral differences that correlate with code correctness or quality.
Invoked when clustering is used to identify the best candidate from execution traces.

pith-pipeline@v0.9.0 · 5786 in / 1263 out tokens · 38648 ms · 2026-05-21T06:39:52.865617+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "DIFFCODEGEN clusters the generated code candidates based on their dynamic behavior and selects the candidate with the shortest relative distance to all other candidates in the largest cluster as the final output."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We propose DIFFCODEGEN, a novel test-time scaling method combining differential testing and dynamic software analysis."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


“What Did I Do Wrong? Quantifying LLMs’ Sensitivity and Consistency to Prompt Engineering”. In:Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Albuquerque, New Mexico: Association for Computational Linguistics, Apr. 2025, pp. 15...

work page 2025
[46]

Prompting Techniques for Secure Code Generation: A Systematic Investigation

Catherine Tony, Nicol ´as E. D´ıaz Ferreyra, Markus Mutas, Salem Dhif, and Riccardo Scandariato. “Prompting Techniques for Secure Code Generation: A Systematic Investigation”. In:ACM Trans. Softw. Eng. Methodol.34.8 (Oct. 2025).ISSN: 1049-331X.DOI:10.1145/3722108.URL:https://doi.org/10.1145/3722108

work page doi:10.1145/3722108.url:https://doi.org/10.1145/3722108 2025
[47]

How beginning programmers and code LLMs ( mis)read each other,

Sydney Nguyen et al. “How Beginning Programmers and Code LLMs (Mis)read Each Other”. In:Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. CHI ’24. Honolulu, HI, USA: Association for Computing Ma- chinery, 2024.ISBN: 9798400703300.DOI:10.1145/3613904.3642706.URL:https://doi.org/10.1145/3613904. 3642706

work page doi:10.1145/3613904.3642706.url:https://doi.org/10.1145/3613904 2024
[48]

CREME: Robustness Enhancement of Code LLMs via Layer-Aware Model Editing

Shuhan Liu et al. “CREME: Robustness Enhancement of Code LLMs via Layer-Aware Model Editing”. In:Proceedings of the IEEE/ACM 48th International Conference on Software Engineering. ICSE ’26. Rio de Janeiro, Brazil: Association for Computing Machinery, 2026.DOI:3744916.3773111.URL:https://arxiv.org/abs/2507.16407v3

work page arXiv 2026
[49]

Can Language Models Solve Olympiad Programming?

Ben Shi, Michael Tang, Karthik R Narasimhan, and Shunyu Yao. “Can Language Models Solve Olympiad Programming?” In:First Conference on Language Modeling. 2024.URL:https://openreview.net/forum?id=kGa4fMtP9l

work page 2024
[50]

Beam Search Strategies for Neural Machine Translation

Markus Freitag and Yaser Al-Onaizan. “Beam Search Strategies for Neural Machine Translation”. In:Proceedings of the First Workshop on Neural Machine Translation. Vancouver: Association for Computational Linguistics, Aug. 2017, pp. 56– 60.DOI:10.18653/v1/W17-3207.URL:https://aclanthology.org/W17-3207/

work page doi:10.18653/v1/w17-3207.url:https://aclanthology.org/w17-3207/ 2017
[51]

NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

Kaustubh Dhole et al. “NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation”. In:Northern European Journal of Language Technology9 (2023).DOI:10 . 3384 / nejlt . 2000 - 1533 . 2023 . 4725.URL:https : //aclanthology.org/2023.nejlt-1.5/

work page 2023
[52]

Prompt Perturbation Consistency Learning for Robust Language Models

Yao Qiang et al. “Prompt Perturbation Consistency Learning for Robust Language Models”. In:Findings of the Association for Computational Linguistics: EACL 2024. St. Julian’s, Malta: Association for Computational Linguistics, Mar. 2024, pp. 1357–1370.URL:https://aclanthology.org/2024.findings-eacl.91/

work page 2024
[53]

Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Ra- tionales?

Zhanke Zhou et al. “Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Ra- tionales?” In:The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024.URL:https : / / openreview.net/forum?id=FbuODM02ra

work page 2024
[54]

Sorting through the noise: Testing robustness of information processing in pre- trained language models

Lalchand Pandia and Allyson Ettinger. “Sorting through the noise: Testing robustness of information processing in pre- trained language models”. In:Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 1583–1596.DOI: 10.18...

work page doi:10.18653/v1/2021.emnlp-main.119.url:https://aclanthology.org/2021.emnlp-main.119/ 2021
[55]

Enhancing Noise Robustness of Retrieval-Augmented Language Models with Adaptive Adversarial Training

Feiteng Fang et al. “Enhancing Noise Robustness of Retrieval-Augmented Language Models with Adaptive Adversarial Training”. In:Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 10028–10039.DOI:10.18653/ v1/2024.acl-lo...

work page 2024
[56]

Models in the Wild: On Corruption Robustness of Neural NLP Systems

Barbara Rychalska, Dominika Basaj, Alicja Gosiewska, and Przemysław Biecek. “Models in the Wild: On Corruption Robustness of Neural NLP Systems”. In:Neural Information Processing. Ed. by Tom Gedeon, Kok Wai Wong, and Minho Lee. Cham: Springer International Publishing, 2019, pp. 235–247.ISBN: 978-3-030-36718-3

work page 2019
[57]

Understanding Programs by Exploiting (Fuzzing) Test Cases

Jianyu Zhao, Yuyang Rong, Yiwen Guo, Yifeng He, and Hao Chen. “Understanding Programs by Exploiting (Fuzzing) Test Cases”. In:Findings of the Association for Computational Linguistics: ACL 2023. Toronto, Canada: Association for Computational Linguistics, July 2023, pp. 10667–10679.DOI:10.18653/v1/2023.findings-acl.678.URL:https: //aclanthology.org/2023.fi...

work page doi:10.18653/v1/2023.findings-acl.678.url:https: 2023
[58]

Continuous Fuzzing with libFuzzer and AddressSanitizer

Kosta Serebryany. “Continuous Fuzzing with libFuzzer and AddressSanitizer”. In:2016 IEEE Cybersecurity Development (SecDev). 2016, pp. 157–157.URL:https://doi.org/10.1109/SecDev.2016.043

work page doi:10.1109/secdev.2016.043 2016
[59]

Prompt Fuzzing for Fuzz Driver Generation

Yunlong Lyu, Yuxuan Xie, Peng Chen, and Hao Chen. “Prompt Fuzzing for Fuzz Driver Generation”. In:Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. CCS ’24. Salt Lake City, UT, USA: Association for Computing Machinery, 2024, pp. 3793–3807.ISBN: 9798400706363.DOI:10.1145/3658644.3670396. URL:https://doi.org/10.1145/3...

work page doi:10.1145/3658644.3670396 2024
[60]

Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando De Freitas, Koray Kavukcuoglu, and Oriol Vinyals

Yujia Li et al. “Competition-level code generation with AlphaCode”. In:Science378.6624 (2022), pp. 1092–1097.DOI: 10.1126/science.abq1158. eprint:https://www.science.org/doi/pdf/10.1126/science.abq1158.URL: https://www.science.org/doi/abs/10.1126/science.abq1158

work page doi:10.1126/science.abq1158 2022
[61]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang et al. “Self-Consistency Improves Chain of Thought Reasoning in Language Models”. In:The Eleventh International Conference on Learning Representations. 2023.URL:https://openreview.net/forum?id=1PL1NIMMrw

work page 2023
[62]

Hierarchical Clustering: Objective Functions and Algorithms

Vincent Cohen-addad, Varun Kanade, Frederik Mallmann-trenn, and Claire Mathieu. “Hierarchical Clustering: Objective Functions and Algorithms”. In:J. ACM66.4 (June 2019).ISSN: 0004-5411.DOI:10 . 1145 / 3321386.URL:https : //doi.org/10.1145/3321386

work page doi:10.1145/3321386 2019
[63]

Revisiting agglomerative clustering

Eric K. Tokuda, Cesar H. Comin, and Luciano da F. Costa. “Revisiting agglomerative clustering”. In:Physica A: Statistical Mechanics and its Applications585 (2022), p. 126433.ISSN: 0378-4371.DOI:https://doi.org/10.1016/j.physa. 2021.126433.URL:https://www.sciencedirect.com/science/article/pii/S0378437121007068

work page doi:10.1016/j.physa 2022
[64]

Binyuan Hui et al.Qwen2.5Coder Technical Report. 2024. arXiv:2409.12186 [cs.CL].URL:https://arxiv.org/ abs/2409.12186

work page internal anchor Pith review Pith/arXiv arXiv 2024
[65]

Efficient memory management for large language model serving with pagedattention,

Woosuk Kwon et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention”. In:Pro- ceedings of the 29th Symposium on Operating Systems Principles. SOSP ’23. Koblenz, Germany: Association for Comput- ing Machinery, 2023, pp. 611–626.ISBN: 9798400702297.DOI:10.1145/3600006.3613165.URL:https://doi.org/ 10.1145/3600006.3613165

work page doi:10.1145/3600006.3613165.url:https://doi.org/ 2023
[67]

2025.URL:https://github.com/NovaSky- AI/SkyThought/tree/0d190f11fd8e885bbe113aeccacba5ccde5b1102/skythought/test-time-scaling

Dacheng Li et al.S*: Test Time Scaling for Code Generation (Source code). 2025.URL:https://github.com/NovaSky- AI/SkyThought/tree/0d190f11fd8e885bbe113aeccacba5ccde5b1102/skythought/test-time-scaling

work page 2025
[68]

Google.Gemini 2.5 Flash-Lite.URL:https : / / docs . cloud . google . com / vertex - ai / generative - ai / docs / models/gemini/2-5-flash-lite

work page
[69]

APACrefauthors \ 1987

Peter Rousseeuw. “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis”. In:J. Comput. Appl. Math.20.1 (Nov. 1987), pp. 53–65.ISSN: 0377-0427.DOI:10 . 1016 / 0377 - 0427(87 ) 90125 - 7.URL:https : //doi.org/10.1016/0377-0427(87)90125-7

work page doi:10.1016/0377-0427(87)90125-7 1987
[70]

Chris Yuhao Liu et al.Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy. 2026. arXiv:2507. 01352 [cs.CL].URL:https://arxiv.org/abs/2507.01352

work page internal anchor Pith review Pith/arXiv arXiv 2026
[71]

QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs

Koen Claessen and John Hughes. “QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs”. In:Pro- ceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming. ICFP ’00. New York, NY , USA: Association for Computing Machinery, 2000, pp. 268–279.ISBN: 1581132026.URL:https://doi.org/10.1145/ 351240.351266

work page arXiv 2000
[72]

Property-Based Testing in Practice

Harrison Goldstein, Joseph W. Cutler, Daniel Dickstein, Benjamin C. Pierce, and Andrew Head. “Property-Based Testing in Practice”. In:Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. ICSE ’24. Lisbon, Portugal: Association for Computing Machinery, 2024.ISBN: 9798400702174.URL:https : / / doi . org / 10 . 1145 / 3597503.3639581

work page arXiv 2024
[73]

Oracle-Guided Program Selection from Large Language Models

Zhiyu Fan, Haifeng Ruan, Sergey Mechtaev, and Abhik Roychoudhury. “Oracle-Guided Program Selection from Large Language Models”. In:Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. ISSTA 2024. Vienna, Austria: Association for Computing Machinery, 2024, pp. 628–640.ISBN: 9798400706127.DOI: 10.1145/3650212.3680308...

work page doi:10.1145/3650212.3680308.url:https://doi.org/10.1145/3650212.3680308 2024
[74]

A Survey on Large Language Models for Code Generation

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. “A Survey on Large Language Models for Code Generation”. In:ACM Trans. Softw. Eng. Methodol.35.2 (Jan. 2026).ISSN: 1049-331X.DOI:10.1145/3747588.URL: https://doi.org/10.1145/3747588

work page doi:10.1145/3747588.url: 2026
[75]

An Yang et al.Qwen3 Technical Report. 2025. arXiv:2505.09388 [cs.CL].URL:https://arxiv.org/abs/2505. 09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[76]

Gupta, Neereja Sundaresan, Thomas Alexander, Christopher J

Daya Guo et al. “DeepSeekR1 incentivizes reasoning in LLMs through reinforcement learning”. In:Nature645.8081 (Sept. 2025), pp. 633–638.ISSN: 1476-4687.DOI:10.1038/s41586- 025- 09422- z.URL:http://dx.doi.org/10.1038/ s41586-025-09422-z

work page doi:10.1038/s41586- 2025
[77]

SWE-bench: Can Language Models Resolve Real-world Github Issues?

Carlos E Jimenez et al. “SWE-bench: Can Language Models Resolve Real-world Github Issues?” In:The Twelfth Interna- tional Conference on Learning Representations. 2024.URL:https://openreview.net/forum?id=VTF8yNQM66

work page 2024
[78]

In: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering

Qi Guo et al. “Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study”. In:Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. ICSE ’24. Lisbon, Portugal: Association for Computing Machinery, 2024.ISBN: 9798400702174.DOI:10.1145/3597503.3623306.URL:https://doi.org/10. 1145/3597503.3623306

work page doi:10.1145/3597503.3623306.url:https://doi.org/10 2024
[79]

In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023

Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. “Automated Program Repair in the Era of Large Pre-trained Language Models”. In:2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 2023, pp. 1482– 1494.DOI:10.1109/ICSE48619.2023.00129. 20 Code Generation by Differential Test Time Scaling

work page doi:10.1109/icse48619.2023.00129 2023
[80]

UniTSyn: A Large-Scale Dataset Capable of Enhancing the Prowess of Large Language Models for Program Testing

Yifeng He et al. “UniTSyn: A Large-Scale Dataset Capable of Enhancing the Prowess of Large Language Models for Program Testing”. In:Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. ISSTA 2024. Vienna, Austria: Association for Computing Machinery, 2024, pp. 1061–1072.ISBN: 9798400706127.DOI: 10.1145/3650212.3680...

work page doi:10.1145/3650212.3680342.url:https://doi.org/10.1145/3650212.3680342 2024
[81]

LitSearch: A retrieval benchmark for scientific literature search

Weimin Xiong, Yiwen Guo, and Hao Chen. “The Program Testing Ability of Large Language Models for Code”. In:Pro- ceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track. Miami, Florida, US: Association for Computational Linguistics, Nov. 2024, pp. 23–34.DOI:10.18653/v1/2024.emnlp- industry.3. URL:https://aclantho...

work page doi:10.18653/v1/2024.emnlp- 2024

Showing first 80 references.

[1] [1]

2024.URL:https : / / github

Inbal Shani and GitHub Staff.Survey reveals AI’s impact on the developer experience. 2024.URL:https : / / github . blog/news-insights/research/survey-reveals-ais-impact-on-the-developer-experience/

work page 2024

[2] [2]

2024.URL:https: //github.blog/news-insights/research/survey-ai-wave-grows/

Kyle Daigle and GitHub Staff.Survey: The AI wave continues to grow on software development teams. 2024.URL:https: //github.blog/news-insights/research/survey-ai-wave-grows/

work page 2024

[3] [3]

Mark Chen et al.Evaluating Large Language Models Trained on Code. 2021. arXiv:2107.03374 [cs.LG].URL:https: //arxiv.org/abs/2107.03374. 16 Code Generation by Differential Test Time Scaling

work page internal anchor Pith review Pith/arXiv arXiv 2021

[4] [4]

Evaluating Large Language Models in Class-Level Code Generation

Xueying Du et al. “Evaluating Large Language Models in Class-Level Code Generation”. In:Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. ICSE ’24. Lisbon, Portugal: Association for Computing Machinery, 2024.ISBN: 9798400702174.DOI:10 . 1145 / 3597503 . 3639219.URL:https : / / doi . org / 10 . 1145 / 3597503 . 3639219

work page 2024

[5] [5]

DeepSeek-AI et al.DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. 2025. arXiv: 2501.12948 [cs.CL].URL:https://arxiv.org/abs/2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

s1: Simple test-time scaling

Niklas Muennighoff et al. “s1: Simple test-time scaling”. In:Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 20275–20321. ISBN: 979-8-89176-332-6.DOI:10.18653/v1/2025.emnlp-main.1025.URL:https://aclanthology.org/2025. emnlp-main.1025/

work page doi:10.18653/v1/2025.emnlp-main.1025.url:https://aclanthology.org/2025 2025

[7] [7]

Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters for Reasoning

Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. “Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters for Reasoning”. In:The Thirteenth International Conference on Learning Repre- sentations. 2025.URL:https://openreview.net/forum?id=4FWAwZtd2n

work page 2025

[8] [8]

ACECODER: Acing Coder RL via Automated Test-Case Synthesis

Huaye Zeng et al. “ACECODER: Acing Coder RL via Automated Test-Case Synthesis”. In:Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vienna, Austria: Association for Com- putational Linguistics, July 2025, pp. 12023–12040.ISBN: 979-8-89176-251-0.DOI:10.18653/v1/2025.acl-long.587. URL:https://a...

work page doi:10.18653/v1/2025.acl-long.587 2025

[9] [9]

scrolling screenshot

Dacheng Li et al. “S*: Test Time Scaling for Code Generation”. In:Findings of the Association for Computational Lin- guistics: EMNLP 2025. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 15964–15978.ISBN: 979-8-89176-335-7.DOI:10.18653/v1/2025.findings- emnlp.865.URL:https://aclanthology.org/2025. findings-emnlp.865/

work page doi:10.18653/v1/2025.findings- 2025

[10] [10]

Agent-RewardBench: Towards a unified benchmark for reward modeling across perception, planning, and safety in real- world multimodal agents,

Xiancai Chen et al. “Revisit Self-Debugging with Self-Generated Tests for Code Generation”. In:Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vienna, Austria: Association for Computational Linguistics, July 2025, pp. 18003–18023.ISBN: 979-8-89176-251-0.DOI:10.18653/v1/2025.acl- long.881.URL...

work page doi:10.18653/v1/2025.acl- 2025

[11] [11]

Differential Testing for Software

William M. McKeeman. “Differential Testing for Software”. In:Digit. Tech. J.10 (1998), pp. 100–107.URL:https : //api.semanticscholar.org/CorpusID:14018070

work page 1998

[12] [12]

Differential testing: a new approach to change detection

Robert B. Evans and Alberto Savoia. “Differential testing: a new approach to change detection”. In:The 6th Joint Meeting on European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engi- neering: Companion Papers. ESEC-FSE companion ’07. Dubrovnik, Croatia: Association for Computing Machinery, 2007, pp. 549–552...

work page doi:10.1145/1295014.1295038.url:https://doi.org/10.1145/1295014 2007

[13] [13]

Hunting for bugs in code coverage tools via randomized differential testing

Yibiao Yang et al. “Hunting for bugs in code coverage tools via randomized differential testing”. In:Proceedings of the 41st International Conference on Software Engineering. ICSE ’19. Montreal, Quebec, Canada: IEEE Press, 2019, pp. 488–499. DOI:10.1109/ICSE.2019.00061.URL:https://doi.org/10.1109/ICSE.2019.00061

work page doi:10.1109/icse.2019.00061.url:https://doi.org/10.1109/icse.2019.00061 2019

[14] [14]

Emer, Mark A

Shaohua Li and Zhendong Su. “Finding Unstable Code via Compiler-Driven Differential Testing”. In:Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. ASPLOS 2023. Vancouver, BC, Canada: Association for Computing Machinery, 2023, pp. 238–251.ISBN: 9781450399180.DOI:10.1145/...

work page doi:10.1145/3582016.3582053.url:https://doi.org/10.1145/3582016.3582053 2023

[15] [15]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain et al. “LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code”. In: The Thirteenth International Conference on Learning Representations. 2025.URL:https://openreview.net/forum? id=chfJJYC3iL

work page 2025

[16] [16]

Finding and understanding bugs in C compilers

Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. “Finding and understanding bugs in C compilers”. In:SIGPLAN Not.46.6 (June 2011), pp. 283–294.ISSN: 0362-1340.DOI:10.1145/1993316.1993532.URL:https://doi.org/10. 1145/1993316.1993532

work page doi:10.1145/1993316.1993532.url:https://doi.org/10 2011

[17] [17]

Compiler validation via equivalence modulo inputs

Vu Le, Mehrdad Afshari, and Zhendong Su. “Compiler validation via equivalence modulo inputs”. In:SIGPLAN Not.49.6 (June 2014), pp. 216–226.ISSN: 0362-1340.DOI:10.1145/2666356.2594334.URL:https://doi.org/10.1145/ 2666356.2594334

work page doi:10.1145/2666356.2594334.url:https://doi.org/10.1145/ 2014

[18] [18]

Finding compiler bugs via live code mutation

Chengnian Sun, Vu Le, and Zhendong Su. “Finding compiler bugs via live code mutation”. In:SIGPLAN Not.51.10 (Oct. 2016), pp. 849–863.ISSN: 0362-1340.DOI:10.1145/3022671.2984038.URL:https://doi.org/10.1145/3022671. 2984038

work page doi:10.1145/3022671.2984038.url:https://doi.org/10.1145/3022671 2016

[19] [19]

Testing Database Engines via Pivoted Query Synthesis

Manuel Rigger and Zhendong Su. “Testing Database Engines via Pivoted Query Synthesis”. In:Proc. ACM Program. Lang. 4.OOPSLA (Nov. 2020).DOI:10.1145/3428279.URL:https://doi.org/10.1145/3428279

work page doi:10.1145/3428279.url:https://doi.org/10.1145/3428279 2020

[20] [20]

Evaluating Program Semantics Reasoning with Type Inference in System $F$

Yifeng He, Luning Yang, Christopher Castro Gaw Gonzalo, and Hao Chen. “Evaluating Program Semantics Reasoning with Type Inference in System $F$”. In:The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track. 2025.URL:https://openreview.net/forum?id=IA9RmaP0aw

work page 2025

[21] [21]

Coverage-based Greybox Fuzzing as Markov Chain

Marcel B ¨ohme, Van-Thuan Pham, and Abhik Roychoudhury. “Coverage-based Greybox Fuzzing as Markov Chain”. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. CCS ’16. Vienna, Austria: Association for Computing Machinery, 2016, pp. 1032–1043.ISBN: 9781450341394.DOI:10.1145/2976749.2978428. URL:https://doi.org/10.1145/...

work page doi:10.1145/2976749.2978428 2016

[22] [22]

AFL++ : Combining Incremental Steps of Fuzzing Research

Andrea Fioraldi, Dominik Maier, Heiko Eißfeldt, and Marc Heuse. “AFL++ : Combining Incremental Steps of Fuzzing Research”. In:14th USENIX Workshop on Offensive Technologies (WOOT 20). USENIX Association, Aug. 2020.URL: https://www.usenix.org/conference/woot20/presentation/fioraldi

work page 2020

[23] [23]

The Fuzzing Book

Andreas Zeller, Rahul Gopinath, Marcel B ¨ohme, Gordon Fraser, and Christian Holler. “The Fuzzing Book”. In: (Jan. 2019). DOI:10.60882/cispa.24614928.v1.URL:https://publications.cispa.de/articles/book/The_Fuzzing_ Book/24614928

work page doi:10.60882/cispa.24614928.v1.url:https://publications.cispa.de/articles/book/the_fuzzing_ 2019

[24] [24]

Matryoshka: Fuzzing Deeply Nested Branches

Peng Chen, Jianzhong Liu, and Hao Chen. “Matryoshka: Fuzzing Deeply Nested Branches”. In:Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security. CCS ’19. London, United Kingdom: Association for Computing Machinery, 2019, pp. 499–513.ISBN: 9781450367479.DOI:10.1145/3319535.3363225.URL:https: //doi.org/10.1145/3319535.3363225

work page doi:10.1145/3319535.3363225.url:https: 2019

[25] [25]

Coverage-directed differential testing of JVM implementations

Yuting Chen, Ting Su, Chengnian Sun, Zhendong Su, and Jianjun Zhao. “Coverage-directed differential testing of JVM implementations”. In:SIGPLAN Not.51.6 (June 2016), pp. 85–99.ISSN: 0362-1340.URL:https://doi.org/10.1145/ 2980983.2908095

work page arXiv 2016

[26] [26]

NEZHA: Efficient Domain- Independent Differential Testing

Theofilos Petsios, Adrian Tang, Salvatore Stolfo, Angelos D. Keromytis, and Suman Jana. “NEZHA: Efficient Domain- Independent Differential Testing”. In:Proceedings of the 2017 IEEE Symposium on Security and Privacy. SP ’17. San Jose, CA, USA: IEEE Press, 2017, pp. 615–632.ISBN: 9781509049318.DOI:10 . 1109 / SP . 2017 . 27.URL:https : //doi.org/10.1109/SP.2017.27

work page doi:10.1109/sp.2017.27 2017

[27] [27]

DifFuzz: differential fuzzing for side-channel analysis

Shirin Nilizadeh, Yannic Noller, and Corina S. P ˘as˘areanu. “DifFuzz: differential fuzzing for side-channel analysis”. In: Proceedings of the 41st International Conference on Software Engineering. ICSE ’19. Montreal, Quebec, Canada: IEEE Press, 2019, pp. 176–187.ISBN: 9781728108698.DOI:10.1109/ICSE.2019.00122.URL:https://doi.org/10. 1109/ICSE.2019.00122

work page doi:10.1109/icse.2019.00122.url:https://doi.org/10 2019

[28] [28]

In: Proceedings of the 38th International Conference on Software Engineering

Junjie Chen et al. “An empirical comparison of compiler testing techniques”. In:Proceedings of the 38th International Conference on Software Engineering. ICSE ’16. Austin, Texas: Association for Computing Machinery, 2016, pp. 180–190. ISBN: 9781450342056.DOI:10.1145/2884781.2884878.URL:https://doi.org/10.1145/2884781.2884878

work page doi:10.1145/2884781.2884878.url:https://doi.org/10.1145/2884781.2884878 2016

[29] [29]

CodeT: Code Generation with Generated Tests

Bei Chen et al. “CodeT: Code Generation with Generated Tests”. In:The Eleventh International Conference on Learning Representations. 2023.URL:https://openreview.net/forum?id=ktrw68Cmu9c

work page 2023

[30] [30]

In: Zong, C., Xia, F., Li, W., Navigli, R

Yifeng He, Jicheng Wang, Yuyang Rong, and Hao Chen. “FuzzAug: Data Augmentation by Coverage-guided Fuzzing for Neural Test Generation”. In:Findings of the Association for Computational Linguistics: EMNLP 2025. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 15642–15655.ISBN: 979-8-89176-335-7.DOI:10.18653/v1/ 2025.findings-emnlp.8...

work page doi:10.18653/v1/ 2025

[31] [31]

Learning to Write with Cooperative Discriminators

Ari Holtzman et al. “Learning to Write with Cooperative Discriminators”. In:Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, July 2018, pp. 1638–1649.DOI:10.18653/v1/P18-1152.URL:https://aclanthology.org/P18-1152/

work page doi:10.18653/v1/p18-1152.url:https://aclanthology.org/p18-1152/ 2018

[32] [32]

The Curious Case of Neural Text Degeneration

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. “The Curious Case of Neural Text Degeneration”. In: International Conference on Learning Representations. 2020.URL:https://openreview.net/forum?id=rygGQyrFvH

work page 2020

[33] [33]

Hierarchical Neural Story Generation

Angela Fan, Mike Lewis, and Yann Dauphin. “Hierarchical Neural Story Generation”. In:Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, July 2018, pp. 889–898.DOI:10.18653/v1/P18- 1082.URL:https://aclanthology. org/P18-1082/

work page doi:10.18653/v1/p18- 2018

[34] [34]

Language models are unsupervised multitask learners

Alec Radford et al. “Language models are unsupervised multitask learners”. In:OpenAI blog1.8 (2019), p. 9

work page 2019

[35] [35]

Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation

Hongxiang Zhang, Hao Chen, Muhao Chen, and Tianyi Zhang. “Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation”. In:Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 3028–3046.ISBN: 979-8-89176-332- 6.DOI:10....

work page doi:10.18653/v1/2025.emnlp-main.150.url:https://aclanthology.org/2025.emnlp-main.150/ 2025

[36] [36]

A learning algorithm for Boltzmann machines

David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. “A learning algorithm for Boltzmann machines”. In:Cog- nitive science9.1 (1985), pp. 147–169

work page 1985

[37] [37]

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean.Distilling the Knowledge in a Neural Network. 2015. arXiv:1503.02531 [stat.ML].URL:https://arxiv.org/abs/1503.02531

work page internal anchor Pith review Pith/arXiv arXiv 2015

[38] [38]

Controlling Linguistic Style Aspects in Neural Language Generation

Jessica Ficler and Yoav Goldberg. “Controlling Linguistic Style Aspects in Neural Language Generation”. In:Proceedings of the Workshop on Stylistic Variation. Copenhagen, Denmark: Association for Computational Linguistics, Sept. 2017, pp. 94–104.DOI:10.18653/v1/W17-4912.URL:https://aclanthology.org/W17-4912/

work page doi:10.18653/v1/w17-4912.url:https://aclanthology.org/w17-4912/ 2017

[39] [39]

In: Duh, K., Gomez, H., Bethard, S

Matthew Renze. “The Effect of Sampling Temperature on Problem Solving in Large Language Models”. In:Findings of the Association for Computational Linguistics: EMNLP 2024. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 7346–7356.DOI:10.18653/v1/2024.findings- emnlp.432.URL:https://aclanthology.org/ 2024.findings-emnlp.432/

work page doi:10.18653/v1/2024.findings- 2024

[40] [40]

Demystifying LLM-Based Software Engineering Agents

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. “Demystifying LLM-Based Software Engineering Agents”. In:Proc. ACM Softw. Eng.2.FSE (June 2025).DOI:10.1145/3715754.URL:https://doi.org/10.1145/ 3715754

work page doi:10.1145/3715754.url:https://doi.org/10.1145/ 2025

[41] [41]

Lipton, Mu Li, and Alexander J

Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola.Dive into Deep Learning.https://D2L.ai. Cambridge University Press, 2023. 18 Code Generation by Differential Test Time Scaling

work page 2023

[42] [42]

Holistic Evaluation of Language Models

Percy Liang et al. “Holistic Evaluation of Language Models”. In: Transactions on Machine Learning Research (2023). Featured Certification, Expert Certification, Outstanding Certification. ISSN: 2835-8856. URL: https://openreview.net/forum?id=iO4LZibEqW

[43] [43]

Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting”. In: International Conference on Representation Learning. Vol. 2024. 2024, pp. 25055–25083. URL: https://proceedings.iclr.cc/paper_files/paper/2024/file/6c0e...

[44] [44]

ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs

Jingming Zhuo et al. “ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs”. In: Findings of the Association for Computational Linguistics: EMNLP 2024. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 1950–1976. DOI: 10.18653/v1/2024.findings-emnlp.108. URL: https://aclanthology.org/2024.findings...

[45] [45]

What Did I Do Wrong? Quantifying LLMs’ Sensitivity and Consistency to Prompt Engineering

“What Did I Do Wrong? Quantifying LLMs’ Sensitivity and Consistency to Prompt Engineering”. In: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Albuquerque, New Mexico: Association for Computational Linguistics, Apr. 2025, pp. 15...

[46] [46]

Prompting Techniques for Secure Code Generation: A Systematic Investigation

Catherine Tony, Nicolás E. Díaz Ferreyra, Markus Mutas, Salem Dhif, and Riccardo Scandariato. “Prompting Techniques for Secure Code Generation: A Systematic Investigation”. In: ACM Trans. Softw. Eng. Methodol. 34.8 (Oct. 2025). ISSN: 1049-331X. DOI: 10.1145/3722108. URL: https://doi.org/10.1145/3722108

[47] [47]

How Beginning Programmers and Code LLMs (Mis)read Each Other

Sydney Nguyen et al. “How Beginning Programmers and Code LLMs (Mis)read Each Other”. In: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. CHI ’24. Honolulu, HI, USA: Association for Computing Machinery, 2024. ISBN: 9798400703300. DOI: 10.1145/3613904.3642706. URL: https://doi.org/10.1145/3613904.3642706

[48] [48]

CREME: Robustness Enhancement of Code LLMs via Layer-Aware Model Editing

Shuhan Liu et al. “CREME: Robustness Enhancement of Code LLMs via Layer-Aware Model Editing”. In: Proceedings of the IEEE/ACM 48th International Conference on Software Engineering. ICSE ’26. Rio de Janeiro, Brazil: Association for Computing Machinery, 2026. DOI: 3744916.3773111. URL: https://arxiv.org/abs/2507.16407v3

[49] [49]

Can Language Models Solve Olympiad Programming?

Ben Shi, Michael Tang, Karthik R Narasimhan, and Shunyu Yao. “Can Language Models Solve Olympiad Programming?” In: First Conference on Language Modeling. 2024. URL: https://openreview.net/forum?id=kGa4fMtP9l

[50] [50]

Beam Search Strategies for Neural Machine Translation

Markus Freitag and Yaser Al-Onaizan. “Beam Search Strategies for Neural Machine Translation”. In: Proceedings of the First Workshop on Neural Machine Translation. Vancouver: Association for Computational Linguistics, Aug. 2017, pp. 56–60. DOI: 10.18653/v1/W17-3207. URL: https://aclanthology.org/W17-3207/

[51] [51]

NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

Kaustubh Dhole et al. “NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation”. In: Northern European Journal of Language Technology 9 (2023). DOI: 10.3384/nejlt.2000-1533.2023.4725. URL: https://aclanthology.org/2023.nejlt-1.5/

[52] [52]

Prompt Perturbation Consistency Learning for Robust Language Models

Yao Qiang et al. “Prompt Perturbation Consistency Learning for Robust Language Models”. In: Findings of the Association for Computational Linguistics: EACL 2024. St. Julian’s, Malta: Association for Computational Linguistics, Mar. 2024, pp. 1357–1370. URL: https://aclanthology.org/2024.findings-eacl.91/

[53] [53]

Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?

Zhanke Zhou et al. “Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?” In: The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024. URL: https://openreview.net/forum?id=FbuODM02ra

[54] [54]

Sorting through the noise: Testing robustness of information processing in pre-trained language models

Lalchand Pandia and Allyson Ettinger. “Sorting through the noise: Testing robustness of information processing in pre-trained language models”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 1583–1596. DOI: 10.18...

[55] [55]

Enhancing Noise Robustness of Retrieval-Augmented Language Models with Adaptive Adversarial Training

Feiteng Fang et al. “Enhancing Noise Robustness of Retrieval-Augmented Language Models with Adaptive Adversarial Training”. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 10028–10039. DOI: 10.18653/v1/2024.acl-lo...

[56] [56]

Models in the Wild: On Corruption Robustness of Neural NLP Systems

Barbara Rychalska, Dominika Basaj, Alicja Gosiewska, and Przemysław Biecek. “Models in the Wild: On Corruption Robustness of Neural NLP Systems”. In: Neural Information Processing. Ed. by Tom Gedeon, Kok Wai Wong, and Minho Lee. Cham: Springer International Publishing, 2019, pp. 235–247. ISBN: 978-3-030-36718-3

[57] [57]

Understanding Programs by Exploiting (Fuzzing) Test Cases

Jianyu Zhao, Yuyang Rong, Yiwen Guo, Yifeng He, and Hao Chen. “Understanding Programs by Exploiting (Fuzzing) Test Cases”. In: Findings of the Association for Computational Linguistics: ACL 2023. Toronto, Canada: Association for Computational Linguistics, July 2023, pp. 10667–10679. DOI: 10.18653/v1/2023.findings-acl.678. URL: https://aclanthology.org/2023.fi...

[58] [58]

Continuous Fuzzing with libFuzzer and AddressSanitizer

Kosta Serebryany. “Continuous Fuzzing with libFuzzer and AddressSanitizer”. In: 2016 IEEE Cybersecurity Development (SecDev). 2016, pp. 157–157. URL: https://doi.org/10.1109/SecDev.2016.043

[59] [59]

Prompt Fuzzing for Fuzz Driver Generation

Yunlong Lyu, Yuxuan Xie, Peng Chen, and Hao Chen. “Prompt Fuzzing for Fuzz Driver Generation”. In: Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. CCS ’24. Salt Lake City, UT, USA: Association for Computing Machinery, 2024, pp. 3793–3807. ISBN: 9798400706363. DOI: 10.1145/3658644.3670396. URL: https://doi.org/10.1145/3...

[60] [60]

Competition-level code generation with AlphaCode

Yujia Li et al. “Competition-level code generation with AlphaCode”. In: Science 378.6624 (2022), pp. 1092–1097. DOI: 10.1126/science.abq1158. eprint: https://www.science.org/doi/pdf/10.1126/science.abq1158. URL: https://www.science.org/doi/abs/10.1126/science.abq1158

[61] [61]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang et al. “Self-Consistency Improves Chain of Thought Reasoning in Language Models”. In: The Eleventh International Conference on Learning Representations. 2023. URL: https://openreview.net/forum?id=1PL1NIMMrw

[62] [62]

Hierarchical Clustering: Objective Functions and Algorithms

Vincent Cohen-Addad, Varun Kanade, Frederik Mallmann-Trenn, and Claire Mathieu. “Hierarchical Clustering: Objective Functions and Algorithms”. In: J. ACM 66.4 (June 2019). ISSN: 0004-5411. DOI: 10.1145/3321386. URL: https://doi.org/10.1145/3321386

[63] [63]

Revisiting agglomerative clustering

Eric K. Tokuda, Cesar H. Comin, and Luciano da F. Costa. “Revisiting agglomerative clustering”. In: Physica A: Statistical Mechanics and its Applications 585 (2022), p. 126433. ISSN: 0378-4371. DOI: https://doi.org/10.1016/j.physa.2021.126433. URL: https://www.sciencedirect.com/science/article/pii/S0378437121007068

[64] [64]

Binyuan Hui et al. Qwen2.5-Coder Technical Report. 2024. arXiv:2409.12186 [cs.CL]. URL: https://arxiv.org/abs/2409.12186

[65] [65]

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention”. In: Proceedings of the 29th Symposium on Operating Systems Principles. SOSP ’23. Koblenz, Germany: Association for Computing Machinery, 2023, pp. 611–626. ISBN: 9798400702297. DOI: 10.1145/3600006.3613165. URL: https://doi.org/10.1145/3600006.3613165

[66] [67]

S*: Test Time Scaling for Code Generation (Source code)

Dacheng Li et al. S*: Test Time Scaling for Code Generation (Source code). 2025. URL: https://github.com/NovaSky-AI/SkyThought/tree/0d190f11fd8e885bbe113aeccacba5ccde5b1102/skythought/test-time-scaling

[67] [68]

Google. Gemini 2.5 Flash-Lite. URL: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash-lite

[68] [69]

Silhouettes: a graphical aid to the interpretation and validation of cluster analysis

Peter Rousseeuw. “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis”. In: J. Comput. Appl. Math. 20.1 (Nov. 1987), pp. 53–65. ISSN: 0377-0427. DOI: 10.1016/0377-0427(87)90125-7. URL: https://doi.org/10.1016/0377-0427(87)90125-7

[69] [70]

Chris Yuhao Liu et al. Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy. 2026. arXiv:2507.01352 [cs.CL]. URL: https://arxiv.org/abs/2507.01352

[70] [71]

QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs

Koen Claessen and John Hughes. “QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs”. In: Proceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming. ICFP ’00. New York, NY, USA: Association for Computing Machinery, 2000, pp. 268–279. ISBN: 1581132026. URL: https://doi.org/10.1145/351240.351266

[71] [72]

Property-Based Testing in Practice

Harrison Goldstein, Joseph W. Cutler, Daniel Dickstein, Benjamin C. Pierce, and Andrew Head. “Property-Based Testing in Practice”. In: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. ICSE ’24. Lisbon, Portugal: Association for Computing Machinery, 2024. ISBN: 9798400702174. URL: https://doi.org/10.1145/3597503.3639581

[72] [73]

Oracle-Guided Program Selection from Large Language Models

Zhiyu Fan, Haifeng Ruan, Sergey Mechtaev, and Abhik Roychoudhury. “Oracle-Guided Program Selection from Large Language Models”. In: Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. ISSTA 2024. Vienna, Austria: Association for Computing Machinery, 2024, pp. 628–640. ISBN: 9798400706127. DOI: 10.1145/3650212.3680308...

[73] [74]

A Survey on Large Language Models for Code Generation

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. “A Survey on Large Language Models for Code Generation”. In: ACM Trans. Softw. Eng. Methodol. 35.2 (Jan. 2026). ISSN: 1049-331X. DOI: 10.1145/3747588. URL: https://doi.org/10.1145/3747588

[74] [75]

An Yang et al. Qwen3 Technical Report. 2025. arXiv:2505.09388 [cs.CL]. URL: https://arxiv.org/abs/2505.09388

[75] [76]

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

Daya Guo et al. “DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning”. In: Nature 645.8081 (Sept. 2025), pp. 633–638. ISSN: 1476-4687. DOI: 10.1038/s41586-025-09422-z. URL: http://dx.doi.org/10.1038/s41586-025-09422-z

[76] [77]

SWE-bench: Can Language Models Resolve Real-world Github Issues?

Carlos E Jimenez et al. “SWE-bench: Can Language Models Resolve Real-world Github Issues?” In: The Twelfth International Conference on Learning Representations. 2024. URL: https://openreview.net/forum?id=VTF8yNQM66

[77] [78]

Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study

Qi Guo et al. “Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study”. In: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. ICSE ’24. Lisbon, Portugal: Association for Computing Machinery, 2024. ISBN: 9798400702174. DOI: 10.1145/3597503.3623306. URL: https://doi.org/10.1145/3597503.3623306

[78] [79]

Automated Program Repair in the Era of Large Pre-trained Language Models

Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. “Automated Program Repair in the Era of Large Pre-trained Language Models”. In: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 2023, pp. 1482–1494. DOI: 10.1109/ICSE48619.2023.00129.

[79] [80]

UniTSyn: A Large-Scale Dataset Capable of Enhancing the Prowess of Large Language Models for Program Testing

Yifeng He et al. “UniTSyn: A Large-Scale Dataset Capable of Enhancing the Prowess of Large Language Models for Program Testing”. In: Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. ISSTA 2024. Vienna, Austria: Association for Computing Machinery, 2024, pp. 1061–1072. ISBN: 9798400706127. DOI: 10.1145/3650212.3680...

[80] [81]

The Program Testing Ability of Large Language Models for Code

Weimin Xiong, Yiwen Guo, and Hao Chen. “The Program Testing Ability of Large Language Models for Code”. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track. Miami, Florida, US: Association for Computational Linguistics, Nov. 2024, pp. 23–34. DOI: 10.18653/v1/2024.emnlp-industry.3. URL: https://aclantho...